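Junior Machine Learning Engineer Interview Questions: The Complete Guide

Introduction

Machine learning engineers build, deploy, and maintain ML systems in production. A junior ML engineer needs solid programming skills, an understanding of ML algorithms, hands-on experience with ML frameworks, and working knowledge of deployment practices.

This guide covers essential interview questions for junior machine learning engineers. It walks through Python programming, ML algorithms, model training and evaluation, deployment basics, and MLOps fundamentals to help you prepare for your first ML engineering role.

Python & Programming (5 questions)

1. How do you handle a large dataset that does not fit in memory?

Answer: Several techniques handle data larger than the available RAM.

Batch processing: process the data in chunks
Generators: produce data on demand
Dask/Ray: distributed computing frameworks
Database queries: load only the data you need
Memory-mapped files: access data on disk as if it were in memory
Data streaming: process data as it arrives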

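import pandas as pd
import numpy as np

# Bad: loads the entire dataset into memory
# df = pd.read_csv('large_file.csv')  # may crash with a MemoryError

# Good: read in chunks
chunk_size = 10000
for i, chunk in enumerate(pd.read_csv('large_file.csv', chunksize=chunk_size)):
    # Process each chunk
    processed = chunk[chunk['value'] > 0]
    # Save or aggregate the result; write the header only with the first chunk
    processed.to_csv('output.csv', mode='w' if i == 0 else 'a',
                     header=(i == 0), index=False)

# Using a generator
def data_generator(filename, batch_size=32):
    # process_line is a placeholder for your own line-parsing function
    while True:  # restart from the top of the file for every epoch
        batch = []
        with open(filename, 'r') as f:
            for line in f:
                batch.append(process_line(line))
                if len(batch) == batch_size:
                    yield np.array(batch)
                    batch = []
        if batch:  # yield the final partial batch instead of dropping it
            yield np.array(batch)

# Dask for distributed computing
import dask.dataframe as dd
ddf = dd.read_csv('large_file.csv')
result = ddf.groupby('category').mean().compute()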
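
The database-query technique from the list above can be sketched with the standard-library sqlite3 module. This is a minimal illustration, not a production pattern: the `events` table, its `category` and `value` columns, and the batch size are all hypothetical.

```python
import sqlite3

# In-memory database standing in for a real data store (hypothetical schema)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (category TEXT, value REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("a" if i % 2 == 0 else "b", float(i)) for i in range(1000)],
)

def iter_batches(connection, batch_size=100):
    """Yield lists of rows so only one batch is held in memory at a time."""
    cursor = connection.execute(
        # Filter and select columns in SQL so unneeded data never leaves the database
        "SELECT value FROM events WHERE category = ? AND value > ?",
        ("a", 500.0),
    )
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows

total = 0.0
row_count = 0
for batch in iter_batches(conn):
    total += sum(value for (value,) in batch)
    row_count += len(batch)

print(row_count, total / row_count)
conn.close()
```

Pushing the filter into the WHERE clause and reading with fetchmany keeps peak memory proportional to the batch size rather than to the table size.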

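Rarity: Very common | Difficulty: Medium

2. Explain Python decorators and give ML use cases.

Answer: A decorator wraps a function to modify or extend its behavior without changing the function's own code.

Use cases in ML:
- Timing function execution
- Logging predictions
- Caching results
- Validating inputs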

import time
import functools
import numpy as np  # used by log_predictions below

# Timing decorator
def timer(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(f"{func.__name__} took {end - start:.2f} seconds")
        return result
    return wrapper

# Logging decorator (np.bincount assumes non-negative integer class labels)
def log_predictions(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        predictions = func(*args, **kwargs)
        print(f"Made {len(predictions)} predictions")
        print(f"Prediction distribution: {np.bincount(predictions)}")
        return predictions
    return wrapper

# Usage
@timer
@log_predictions
def predict_batch(model, X):
    return model.predict(X)

# Caching decorator (memoization); arguments must be hashable
def cache_results(func):
    cache = {}
    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

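@cache_results
def expensive_feature_engineering(data_id):
    # Expensive computation (compute_features is a placeholder for your own logic)
    processed_features = compute_features(data_id)
    return processed_features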
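
Input validation is listed as a use case but not shown above. A minimal sketch, assuming features arrive as a list of rows; the names `validate_features` and `predict_batch` and the sign-based stand-in model are illustrative only:

```python
import functools

def validate_features(func):
    """Reject empty or ragged feature matrices before they reach the model."""
    @functools.wraps(func)
    def wrapper(X, *args, **kwargs):
        if len(X) == 0:
            raise ValueError("X must contain at least one sample")
        width = len(X[0])
        if any(len(row) != width for row in X):
            raise ValueError("all samples must have the same number of features")
        return func(X, *args, **kwargs)
    return wrapper

@validate_features
def predict_batch(X):
    # Stand-in for model.predict: label each row by the sign of its feature sum
    return [1 if sum(row) > 0 else 0 for row in X]

print(predict_batch([[1.0, 2.0], [-3.0, 0.5]]))
```

Keeping the checks in a decorator means every prediction entry point can share them without repeating the validation code.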

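Rarity: Common | Difficulty: Medium

3. What is the difference between `@staticmethod` and `@classmethod`?

Answer: Both define methods that can be called without creating an instance.

@staticmethod: receives neither the instance (self) nor the class (cls)
@classmethod: receives the class (cls) as its first argument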

class MLModel:
    model_type = "classifier"
    
    def __init__(self, name, model=None):
        self.name = name
        self.model = model  # underlying estimator; must be set before predict()
    
    # Regular method - requires an instance
    def predict(self, X):
        return self.model.predict(X)
    
    # Static method - utility function
    @staticmethod
    def preprocess_data(X):
        # No access to self or cls
        return (X - X.mean()) / X.std()
    
    # Class method - factory pattern
    @classmethod
    def create_default(cls):
        # Has access to cls
        return cls(name=f"default_{cls.model_type}")
    
    @classmethod
    def from_config(cls, config):
        return cls(name=config['name'])

# Usage
# Static method - no instance needed
processed = MLModel.preprocess_data(X_train)

# Class methods - create instances
model = MLModel.create_default()
model2 = MLModel.from_config({'name': 'my_model'})
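
The practical difference shows up with inheritance. In the sketch below (class names are hypothetical), the classmethod factory builds whichever subclass it is called on, while the staticmethod version has to hard-code a class:

```python
class BaseModel:
    model_type = "base"

    def __init__(self, name):
        self.name = name

    @classmethod
    def create_default(cls):
        # cls is whichever class the method was called on, so subclasses
        # get an instance of their own type without overriding the factory
        return cls(name=f"default_{cls.model_type}")

    @staticmethod
    def create_default_static():
        # No access to cls: the class must be hard-coded here
        return BaseModel(name="default_base")

class Regressor(BaseModel):
    model_type = "regressor"

a = Regressor.create_default()
b = Regressor.create_default_static()
print(type(a).__name__, a.name)
print(type(b).__name__, b.name)
```

This is why factory methods are usually written as class methods, and static methods are reserved for helpers that need no class state at all.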

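Rarity: Medium | Difficulty: Medium

4. How do you handle exceptions in ML pipelines?

Answer: Proper error handling keeps one failure from silently breaking the whole pipeline and makes debugging easier.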

import logging
from typing import Optional

# Logging setup
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class ModelTrainingError(Exception):
    """Custom exception for model training failures."""
    pass

def train_model(X, y, model_type='random_forest'):
    try:
        logger.info(f"Starting training with {model_type}")
        
        # Validate inputs
        if X.shape[0] != y.shape[0]:
            raise ValueError("X and y must have the same number of samples")
        
        if X.shape[0] < 100:
            raise ModelTrainingError("Not enough training data")
        
        # Train the model
        if model_type == 'random_forest':
            from sklearn.ensemble import RandomForestClassifier
            model = RandomForestClassifier()
        else:
            raise ValueError(f"Unknown model type: {model_type}")
        
        model.fit(X, y)
        logger.info("Training completed successfully")
        return model
        
    except ValueError as e:
        logger.error(f"Validation error: {e}")
        raise
    except ModelTrainingError as e:
        logger.error(f"Training error: {e}")
        # Fall back to a simpler model (train_fallback_model is a placeholder you must define)
        return train_fallback_model(X, y)
    except Exception as e:
        logger.error(f"Unexpected error: {e}")
        raise
    finally:
        logger.info("Training attempt finished")

# Context manager for resource management
class ModelLoader:
    def __init__(self, model_path):
        self.model_path = model_path
        self.model = None
    
    def __enter__(self):
        logger.info(f"Loading model from {self.model_path}")
        # load_model is a placeholder for your loader (for example joblib.load)
        self.model = load_model(self.model_path)
        return self.model
    
    def __exit__(self, exc_type, exc_val, exc_tb):
        logger.info("Cleaning up resources")
        self.model = None  # drop the reference so the model can be garbage-collected
        return False  # do not suppress exceptions

# Usage
with ModelLoader('model.pkl') as model:
    predictions = model.predict(X_test)
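
One related pattern that helps debugging is exception chaining: wrapping a low-level error in a pipeline-level exception while keeping the original cause. A minimal sketch, where `fit_model` is a hypothetical stand-in for a real training step:

```python
class ModelTrainingError(Exception):
    """Raised when training fails for any reason."""

def fit_model(values):
    # Stand-in for a training step that fails on empty input
    return sum(values) / len(values)

def train(values):
    try:
        return fit_model(values)
    except ZeroDivisionError as e:
        # "from e" stores the original exception in __cause__, so the
        # traceback shows both the pipeline-level and the low-level error
        raise ModelTrainingError("training failed: empty dataset") from e

try:
    train([])
except ModelTrainingError as err:
    print(f"{err} (caused by {type(err.__cause__).__name__})")
```

Callers can then catch a single pipeline-level exception type without losing the underlying error in the logs.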

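Rarity: Common | Difficulty: Medium

5. What are Python generators, and why are they useful in ML?

Answer: A generator produces values one at a time, on demand, so the full sequence never has to sit in memory.

Advantages:
- Memory efficient
- Lazy evaluation
- Can represent infinite sequences
ML use cases:
- Data loading
- Batch processing
- Data augmentation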

import numpy as np

# Generator for batch processing
def batch_generator(X, y, batch_size=32):
    n_samples = len(X)
    indices = np.arange(n_samples)
    np.random.shuffle(indices)
    
    for start_idx in range(0, n_samples, batch_size):
        end_idx = min(start_idx + batch_size, n_samples)
        batch_indices = indices[start_idx:end_idx]
        yield X[batch_indices], y[batch_indices]

# Usage in training
for epoch in range(10):
    for X_batch, y_batch in batch_generator(X_train, y_train):
        model.train_on_batch(X_batch, y_batch)

# Data augmentation generator
def augment_images(images, labels):
    for img, label in zip(images, labels):
        # Original
        yield img, label
        # Horizontal flip
        yield np.fliplr(img), label
        # 90-degree rotation (changes the shape of non-square images)
        yield np.rot90(img), label

# Infinite generator for training
def infinite_batch_generator(X, y, batch_size=32):
    while True:
        # Samples with replacement, so a batch may contain duplicates
        indices = np.random.choice(len(X), batch_size)
        yield X[indices], y[indices]

# Use together with steps_per_epoch
gen = infinite_batch_generator(X_train, y_train)
# model.fit(gen, steps_per_epoch=100, epochs=10)
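
Lazy evaluation and the memory advantage can be seen with the standard library alone. A small sketch; the `squares` generator and the sizes are illustrative:

```python
import itertools
import sys

def squares():
    """Infinite generator: each value is computed only when requested."""
    n = 0
    while True:
        yield n * n
        n += 1

# islice takes a finite slice of an infinite sequence
first_five = list(itertools.islice(squares(), 5))
print(first_five)

as_list = [n * n for n in range(100_000)]
as_gen = (n * n for n in range(100_000))
# getsizeof measures only the container itself, but the gap is still clear
print(sys.getsizeof(as_list) > sys.getsizeof(as_gen))
```

The list holds 100,000 references up front, while the generator object stays the same small size no matter how many values it will eventually produce.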

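Rarity: Common | Difficulty: Medium

ML Algorithms & Theory (5 questions)

6. Explain the difference between bagging and boosting.

Answer: Both are ensemble methods, but they combine models in different ways.

Bagging (Bootstrap Aggregating):
- Trains models independently, in parallel, on random bootstrap samples of the data
- Mainly reduces variance
- Example: Random Forest
Boosting:
- Trains models sequentially; each model corrects the errors of the previous ones
- Mainly reduces bias
- Examples: AdaBoost, Gradient Boosting, XGBoost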

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Generate data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Bagging - Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_scores = cross_val_score(rf, X, y, cv=5)
print(f"Random Forest CV: {rf_scores.mean():.3f} (+/- {rf_scores.std():.3f})")

# Boosting - Gradient Boosting
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_scores = cross_val_score(gb, X, y, cv=5)
print(f"Gradient Boosting CV: {gb_scores.mean():.3f} (+/- {gb_scores.std():.3f})")

# XGBoost (advanced boosting; requires the separate xgboost package)
import xgboost as xgb
xgb_model = xgb.XGBClassifier(n_estimators=100, random_state=42)
xgb_scores = cross_val_score(xgb_model, X, y, cv=5)
print(f"XGBoost CV: {xgb_scores.mean():.3f} (+/- {xgb_scores.std():.3f})")
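
Why averaging reduces variance can be shown with a toy simulation. This sketch treats each "model" as an independent noisy estimate of a known value, which is a simplifying assumption: real bagged trees are correlated, so the reduction in practice is smaller.

```python
import random
import statistics

random.seed(0)
TRUE_VALUE = 10.0

def noisy_model():
    """One high-variance predictor: the true value plus unit Gaussian noise."""
    return TRUE_VALUE + random.gauss(0, 1)

def ensemble(n_models=25):
    """Average n_models independent predictors, as bagging averages trees."""
    return statistics.fmean(noisy_model() for _ in range(n_models))

single_preds = [noisy_model() for _ in range(2000)]
ensemble_preds = [ensemble() for _ in range(2000)]

var_single = statistics.variance(single_preds)
var_ensemble = statistics.variance(ensemble_preds)
print(f"single: {var_single:.3f}, ensemble: {var_ensemble:.3f}")
```

With independent predictors the variance of the average falls roughly as 1/n, while the expected prediction (and therefore the bias) is unchanged.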

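Rarity: Very common | Difficulty: Medium

7. How do you handle imbalanced datasets?

Answer: With imbalanced data, a model tends to become biased toward the majority class.

Techniques:
- Resampling: SMOTE (oversampling), undersampling
- Class weights: penalize misclassifying the minority class more heavily
- Ensemble methods: Balanced Random Forest
- Evaluation: use F1, precision, and recall rather than accuracy
- Threshold adjustment: tune the decision threshold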

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

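# Create an imbalanced dataset
X, y = make_classification(
    n_samples=1000, n_features=20,
    weights=[0.95, 0.05],  # 95% class 0, 5% class 1
    random_state=42
)

print(f"Original distribution: {Counter(y)}")

# Stratify so the rare class appears in both splits; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# 1. Class weights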
model_weighted = RandomForestClassifier(class_weight='balanced', random_state=42)
model_weighted.fit(X_train, y_train)
print("\nWith class weights:")
print(classification_report(y_test, model_weighted.predict(X_test)))

# 2. SMOTE (oversample the minority class; apply to the training split only)
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
print(f"After SMOTE: {Counter(y_train_smote)}")

model_smote = RandomForestClassifier(random_state=42)
model_smote.fit(X_train_smote, y_train_smote)
print("\nWith SMOTE:")
print(classification_report(y_test, model_smote.predict(X_test)))

# 3. Threshold adjustment
y_proba = model_weighted.predict_proba(X_test)[:, 1]
threshold = 0.3  # a lower threshold favors the minority class (higher recall, lower precision)
y_pred_adjusted = (y_proba >= threshold).astype(int)
print("\nWith adjusted threshold:")
print(classification_report(y_test, y_pred_adjusted))
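
For reference, scikit-learn documents the `class_weight='balanced'` heuristic as n_samples / (n_classes * np.bincount(y)). A plain-Python sketch of that formula (the helper name is hypothetical):

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

y = [0] * 95 + [1] * 5
weights = balanced_class_weights(y)
print(weights)
```

With a 95/5 split, each minority sample counts 19 times as much as a majority sample, so both classes contribute equally to the total loss.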

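Rarity: Very common | Difficulty: Medium

8. What is cross-validation, and why is it important?

Answer: Cross-validation estimates model performance more reliably than a single train-test split, because the score is averaged over several different splits.

Types:
- K-Fold: splits the data into k folds
- Stratified K-Fold: preserves the class distribution in each fold
- Time Series Split: respects temporal order
Advantages:
- More robust performance estimate
- Uses all the data for both training and validation
- Helps detect overfitting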

from sklearn.model_selection import (
    cross_val_score, KFold, StratifiedKFold,
    TimeSeriesSplit, cross_validate
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Load data
data = load_iris()
X, y = data.data, data.target

model = RandomForestClassifier(random_state=42)

# Standard K-Fold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold)
print(f"K-Fold scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std():.3f})")

# Stratified K-Fold (preserves the class distribution)
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
stratified_scores = cross_val_score(model, X, y, cv=stratified_kfold)
print(f"\nStratified K-Fold: {stratified_scores.mean():.3f}")

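# Time Series Split (for temporal data). Iris is not time-ordered and is sorted by class,
# so this call only demonstrates the API; the resulting score is not meaningful.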
tscv = TimeSeriesSplit(n_splits=5)
ts_scores = cross_val_score(model, X, y, cv=tscv)
print(f"Time Series CV: {ts_scores.mean():.3f}")

# Multiple metrics
cv_results = cross_validate(
    model, X, y, cv=5,
    scoring=['accuracy', 'precision_macro', 'recall_macro', 'f1_macro'],
    return_train_score=True
)

print(f"\nAccuracy: {cv_results['test_accuracy'].mean():.3f}")
print(f"Precision: {cv_results['test_precision_macro'].mean():.3f}")
print(f"Recall: {cv_results['test_recall_macro'].mean():.3f}")
print(f"F1: {cv_results['test_f1_macro'].mean():.3f}")
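
What K-Fold does under the hood can be sketched in a few lines. This is a simplified, unshuffled version written for illustration (the function name is hypothetical), not scikit-learn's implementation:

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(f"train={train} test={test}")
```

Every sample lands in exactly one test fold, which is why the averaged score uses all of the data for validation.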

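Rarity: Very common | Difficulty: Easy

9. Explain precision, recall, and F1-score.

Answer: These are classification metrics for evaluating model performance:

Precision: of the predicted positives, how many are actually positive
- Formula: TP / (TP + FP)
- When to use: when false positives are costly
Recall: of the actual positives, how many the model found
- Formula: TP / (TP + FN)
- When to use: when false negatives are costly
F1-score: the harmonic mean of precision and recall
- Formula: 2 × (precision × recall) / (precision + recall)
- When to use: when you need a balance between precision and recall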

from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)
import numpy as np

# Example predictions
y_true = np.array([0, 1, 1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1, 1, 0])

# Compute the metrics
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Precision: {precision:.3f}")  # 0.800
print(f"Recall: {recall:.3f}")        # 0.800
print(f"F1-Score: {f1:.3f}")          # 0.800

# Confusion matrix (rows: actual class, columns: predicted class)
cm = confusion_matrix(y_true, y_pred)
print(f"\nConfusion Matrix:\n{cm}")
# [[4 1]
#  [1 4]]

# Classification report (all metrics)
print("\nClassification Report:")
print(classification_report(y_true, y_pred))

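# Precision-recall trade-off example
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

# Get predicted probabilities
# (assumes a fitted binary classifier `model` and held-out X_test, y_test)
y_proba = model.predict_proba(X_test)[:, 1]

# Compute precision and recall at a range of thresholds
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)

# Find the best threshold (maximize F1)
# precisions and recalls have one more entry than thresholds, so drop the last point
f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-10)
optimal_idx = np.argmax(f1_scores)
optimal_threshold = thresholds[optimal_idx]
print(f"Optimal threshold: {optimal_threshold:.3f}")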
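
The three formulas can also be checked by hand. This sketch counts TP, FP, and FN in plain Python for a small set of ten labels and predictions:

```python
y_true = [0, 1, 1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly predicted positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negatives predicted as positive
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positives the model missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(tp, fp, fn)
print(f"{precision:.3f} {recall:.3f} {f1:.3f}")
```

Here TP = 4, FP = 1, and FN = 1, so precision and recall are both 4/5 and F1 is 0.8 as well.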

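Rarity: Very common | Difficulty: Easy

10. What is regularization, and when do you use it?

Answer: Regularization prevents overfitting by adding a penalty on model complexity to the loss function.

Types:
- L1 (Lasso): adds the sum of the absolute values of the coefficients
- L2 (Ridge): adds the sum of the squared coefficients
- Elastic Net: combines L1 and L2
When to use:
- High variance (overfitting)
- Many features
- Multicollinearity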

from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Generate data with many features
X, y = make_regression(
    n_samples=100, n_features=50,
    n_informative=10, noise=10, random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# No regularization
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_score = r2_score(y_test, lr.predict(X_test))
print(f"Linear Regression R²: {lr_score:.3f}")

# L2 regularization (Ridge)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
ridge_score = r2_score(y_test, ridge.predict(X_test))
print(f"Ridge R²: {ridge_score:.3f}")

# L1 regularization (Lasso)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
lasso_score = r2_score(y_test, lasso.predict(X_test))
print(f"Lasso R²: {lasso_score:.3f}")
print(f"Lasso non-zero coefficients: {np.sum(lasso.coef_ != 0)}")

# Elastic Net (L1 + L2)
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train, y_train)
elastic_score = r2_score(y_test, elastic.predict(X_test))
print(f"Elastic Net R²: {elastic_score:.3f}")

# Hyperparameter tuning for alpha
from sklearn.model_selection import GridSearchCV

param_grid = {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(Ridge(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(f"\nBest alpha: {grid.best_params_['alpha']}")
print(f"Best CV score: {grid.best_score_:.3f}")
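
Why Lasso produces exact zeros while Ridge only shrinks can be seen in the one-coefficient case. The sketch below assumes a simplified objective in which b is the unregularized estimate of a single coefficient; the function names are hypothetical and the scaling differs from scikit-learn's objective.

```python
def ridge_shrink(b, alpha):
    """Minimizer of 0.5*(w - b)**2 + 0.5*alpha*w**2: scales b toward zero."""
    return b / (1.0 + alpha)

def lasso_shrink(b, alpha):
    """Minimizer of 0.5*(w - b)**2 + alpha*abs(w): soft-thresholding."""
    magnitude = max(abs(b) - alpha, 0.0)
    return magnitude if b >= 0 else -magnitude

alpha = 0.5
for b in [2.0, 0.3, -1.0]:
    print(b, ridge_shrink(b, alpha), lasso_shrink(b, alpha))
```

Any coefficient whose magnitude is below alpha is set exactly to zero by the L1 penalty, which is how Lasso performs feature selection; the L2 penalty shrinks every coefficient proportionally and never reaches zero.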

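Rarity: Very common | Difficulty: Medium

Model Training & Deployment (5 questions)

11. How do you save and load models in production?

Answer: Model serialization writes a trained model to disk so it can be deployed and reused without retraining.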

import joblib
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Train a model
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier()
model.fit(X, y)

# Method 1: Joblib (recommended for scikit-learn)
joblib.dump(model, 'model.joblib')
loaded_model = joblib.load('model.joblib')

# Method 2: Pickle (only unpickle files from sources you trust)
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

with open('model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

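# For TensorFlow/Keras
import tensorflow as tf

# Save the full model (HDF5 is a legacy format; recent Keras versions recommend 'model.keras')
# keras_model.save('model.h5')
# loaded_model = tf.keras.models.load_model('model.h5')

# Save weights only (Keras 3 expects the filename to end in '.weights.h5')
# keras_model.save_weights('model_weights.h5')
# new_model.load_weights('model_weights.h5')

# For PyTorch
import torch

# Save the model state dict (the generally recommended approach)
# torch.save(model.state_dict(), 'model.pth')
# model.load_state_dict(torch.load('model.pth'))

# Save the full model (recent PyTorch versions need weights_only=False in torch.load for this)
# torch.save(model, 'model_complete.pth')
# model = torch.load('model_complete.pth')

# Model versioning
import datetime
import os

model_version = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
os.makedirs('models', exist_ok=True)  # joblib.dump does not create missing directories
model_path = f'models/model_{model_version}.joblib'
joblib.dump(model, model_path)
print(f"Model saved to {model_path}")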
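
Versioned files are easier to trust when each one carries metadata. A minimal sketch using only the standard library, assuming a JSON sidecar file next to the pickled model; the function names, file layout, and metadata fields are illustrative choices, not a standard:

```python
import hashlib
import json
import pickle
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def save_with_metadata(model, directory, version):
    """Pickle the model and write a JSON sidecar with version and checksum."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    model_path = directory / f"model_{version}.pkl"
    payload = pickle.dumps(model)
    model_path.write_bytes(payload)
    metadata = {
        "version": version,
        "saved_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    model_path.with_suffix(".json").write_text(json.dumps(metadata))
    return model_path

def load_verified(model_path):
    """Load the model only if the file still matches its recorded checksum."""
    model_path = Path(model_path)
    metadata = json.loads(model_path.with_suffix(".json").read_text())
    payload = model_path.read_bytes()
    if hashlib.sha256(payload).hexdigest() != metadata["sha256"]:
        raise ValueError("model file does not match recorded checksum")
    return pickle.loads(payload), metadata

# A plain dict stands in for a trained model in this sketch
fake_model = {"weights": [0.1, 0.2, 0.3], "bias": 0.5}
with tempfile.TemporaryDirectory() as tmp:
    path = save_with_metadata(fake_model, tmp, version="v1")
    restored, meta = load_verified(path)

print(restored == fake_model, meta["version"])
```

The checksum catches corrupted or accidentally replaced files. It does not make an untrusted pickle safe to load, so the source of the file still has to be trusted.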

Rarity: Very common | Difficulty: Easy

12. How do you create a REST API for model serving?

Answer: A REST API exposes the model over HTTP so other applications can request predictions.

from flask import Flask, request, jsonify
import joblib
import numpy as np

app = Flask(__name__)

# Load the model once at startup
model = joblib.load('model.joblib')

@app.route('/predict', methods=['POST'])
def predict():
    try:
        # Get data from the request
        data = request.get_json()
        features = np.array(data['features']).reshape(1, -1)
        
        # Make the prediction
        prediction = model.predict(features)
        probability = model.predict_proba(features)
        
        # Return the response
        return jsonify({
            'prediction': int(prediction[0]),
            'probability': probability[0].tolist(),
            'status': 'success'
        })
    
    except Exception as e:
        return jsonify({
            'error': str(e),
            'status': 'error'
        }), 400

@app.route('/health', methods=['GET'])
def health():
    return jsonify({'status': 'healthy'})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

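# FastAPI alternative, kept in its own file (built-in request validation and async support)
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI()
model = joblib.load('model.joblib')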

class PredictionRequest(BaseModel):
    features: list

class PredictionResponse(BaseModel):
    prediction: int
    probability: list
    status: str

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    try:
        features = np.array(request.features).reshape(1, -1)
        prediction = model.predict(features)
        probability = model.predict_proba(features)
        
        return PredictionResponse(
            prediction=int(prediction[0]),
            probability=probability[0].tolist(),
            status="success"
        )
    except Exception as e:
        raise HTTPException(status_code=400, detail=str(e))

# Usage (against the Flask app on port 5000):
# curl -X POST "http://localhost:5000/predict" \
#      -H "Content-Type: application/json" \
#      -d '{"features": [5.1, 3.5, 1.4, 0.2]}'
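
Both endpoints above return a generic 400 for any failure. A framework-independent sketch of explicit payload validation follows; it assumes the four-feature iris model from the earlier example, and the helper name and error messages are hypothetical:

```python
N_FEATURES = 4  # assumed: the iris model from the earlier example expects 4 inputs

def parse_features(payload):
    """Return a list of floats or raise ValueError with a client-facing message."""
    if not isinstance(payload, dict) or "features" not in payload:
        raise ValueError("request body must be a JSON object with a 'features' key")
    features = payload["features"]
    if not isinstance(features, list) or len(features) != N_FEATURES:
        raise ValueError(f"'features' must be a list of {N_FEATURES} numbers")
    for value in features:
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("every feature must be a number")
    return [float(value) for value in features]

print(parse_features({"features": [5.1, 3.5, 1.4, 0.2]}))
```

Calling a helper like this at the top of the handler lets the API tell clients exactly what was wrong with a request instead of surfacing a NumPy or scikit-learn error message.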

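Rarity: Very common | Difficulty: Medium

13. What is Docker, and why is it useful for ML deployment?

Answer: A Docker container packages an application together with all of its dependencies.

Benefits:
- Reproducibility
- Consistency across environments
- Easy deployment
- Isolation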

# Dockerfile for the ML model
FROM python:3.9-slim

WORKDIR /app

# Copy and install requirements first so this layer is cached between code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the model and code
COPY model.joblib .
COPY app.py .

# Expose the port
EXPOSE 5000

# Run the application
CMD ["python", "app.py"]

# Build the Docker image
docker build -t ml-model:v1 .

# Run the container
docker run -p 5000:5000 ml-model:v1

# Docker Compose for a multi-container setup

# docker-compose.yml
version: '3.8'
services:
  model:
    build: .
    ports:
      - "5000:5000"
    environment:
      - MODEL_PATH=/app/model.joblib
    volumes:
      - ./models:/app/models

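Rarity: Common | Difficulty: Medium

14. How do you monitor model performance in production?

Answer: Monitoring detects model performance degradation and helps keep the service reliable.

What to monitor:
- Prediction metrics: accuracy, latency
- Data drift: changes in the input distribution
- Model drift: declining predictive performance over time
- System metrics: CPU, memory, errors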

import logging
from datetime import datetime
import numpy as np

class ModelMonitor:
    def __init__(self, model):
        self.model = model
        self.predictions = []
        self.actuals = []
        self.latencies = []
        self.input_stats = []
        
        # Logging setup
        logging.basicConfig(
            filename='model_monitor.log',
            level=logging.INFO,
            format='%(asctime)s - %(message)s'
        )
    
    def predict(self, X):
        import time
        
        # Track input statistics
        self.input_stats.append({
            'mean': X.mean(),
            'std': X.std(),
            'min': X.min(),
            'max': X.max()
        })
        
        # Measure latency
        start = time.time()
        prediction = self.model.predict(X)
        latency = time.time() - start
        
        self.predictions.append(prediction)
        self.latencies.append(latency)
        
        # Log the prediction
        logging.info(f"Prediction: {prediction}, Latency: {latency:.3f}s")
        
        # Warn if latency is too high
        if latency > 1.0:
            logging.warning(f"High latency detected: {latency:.3f}s")
        
        return prediction
    
    def log_actual(self, y_true):
        self.actuals.append(y_true)
        
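        # Compute rolling accuracy once enough labels have arrived
        # (assumes predict() and log_actual() are called once per sample, in the same order)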
        if len(self.actuals) >= 100:
            accuracy = np.mean(
                np.array(self.predictions[-100:]) == np.array(self.actuals[-100:])
            )
            logging.info(f"Rolling accuracy (last 100): {accuracy:.3f}")
            
            if accuracy < 0.8:
                logging.error(f"Model performance degraded: {accuracy:.3f}")
    
    def check_data_drift(self, reference_stats):
        if not self.input_stats:
            return
        
        current_stats = self.input_stats[-1]
        
        # Simple drift detection (compare means)
        mean_diff = abs(current_stats['mean'] - reference_stats['mean'])
        if mean_diff > 2 * reference_stats['std']:
            logging.warning(f"Data drift detected: mean diff = {mean_diff:.3f}")

# Usage
monitor = ModelMonitor(model)
prediction = monitor.predict(X_new)
# Later, once the true label is available
monitor.log_actual(y_true)
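
Comparing means misses changes in the shape of a distribution. One commonly used alternative is the Population Stability Index (PSI), sketched here in plain Python for a single feature. The equal-width binning, the bin count, and the epsilon are assumptions of this sketch; the thresholds often quoted for PSI (about 0.1 and 0.25) are rules of thumb rather than fixed standards.

```python
import math
import random

def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of one feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp values outside the reference range into the edge bins
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))

random.seed(0)
reference = [random.gauss(0, 1) for _ in range(5000)]
same = [random.gauss(0, 1) for _ in range(5000)]
shifted = [random.gauss(1, 1) for _ in range(5000)]

psi_same = psi(reference, same)
psi_shifted = psi(reference, shifted)
print(f"same: {psi_same:.3f}, shifted: {psi_shifted:.3f}")
```

A sample drawn from the same distribution scores close to zero, while a one-standard-deviation shift in the mean produces a much larger value.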

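Rarity: Common | Difficulty: Medium

15. What is CI/CD for machine learning?

Answer: CI/CD automates the testing and deployment of ML models.

Continuous Integration:
- Automated testing
- Code quality checks
- Model validation
Continuous Deployment:
- Automated deployment
- Rollback capability
- A/B testing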
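
The model validation step in CI can be as simple as a gate that compares a candidate model's metrics with the current baseline. A hedged sketch: the function name, the metric dictionaries, and both thresholds are hypothetical and would come from your own project.

```python
def validation_gate(candidate, baseline, min_accuracy=0.80, max_regression=0.01):
    """Return (passed, reasons) for a candidate model's metrics."""
    reasons = []
    if candidate["accuracy"] < min_accuracy:
        reasons.append(
            f"accuracy {candidate['accuracy']:.3f} is below the floor {min_accuracy:.3f}"
        )
    if candidate["accuracy"] < baseline["accuracy"] - max_regression:
        reasons.append(
            f"accuracy regressed from {baseline['accuracy']:.3f} "
            f"to {candidate['accuracy']:.3f}"
        )
    return (not reasons, reasons)

baseline = {"accuracy": 0.91}
ok, why = validation_gate({"accuracy": 0.92}, baseline)
print(ok, why)
bad, why_bad = validation_gate({"accuracy": 0.85}, baseline)
print(bad, why_bad)
```

In a pipeline, a script like this would exit with a non-zero status when the gate fails, which stops the deployment job from running.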

# .github/workflows/ml-pipeline.yml
name: ML Pipeline

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      
      - name: Set up
