Junior Machine Learning Engineer Interview Questions: Complete Guide

By Milad Bonakdar
Master ML engineering fundamentals with essential interview questions covering Python, ML algorithms, model training, deployment basics, and MLOps for junior machine learning engineers.
Introduction
Machine Learning Engineers build, deploy, and maintain ML systems in production. Junior ML engineers are expected to have strong programming skills, understanding of ML algorithms, experience with ML frameworks, and knowledge of deployment practices.
This guide covers essential interview questions for Junior Machine Learning Engineers. We explore Python programming, ML algorithms, model training and evaluation, deployment basics, and MLOps fundamentals to help you prepare for your first ML engineering role.
Python & Programming (5 Questions)
1. How do you handle large datasets that don't fit in memory?
Answer: Several techniques handle data larger than available RAM:
- Batch Processing: Process data in chunks
- Generators: Yield data on-demand
- Dask/Ray: Distributed computing frameworks
- Database Queries: Load only needed data
- Memory-Mapped Files: Access disk as if in memory
- Data Streaming: Process data as it arrives
import pandas as pd
import numpy as np

# Bad: load the entire dataset into memory
# df = pd.read_csv('large_file.csv')  # may crash on large files

# Good: read in chunks
chunk_size = 10000
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # Process each chunk
    processed = chunk[chunk['value'] > 0]
    # Save or aggregate results
    processed.to_csv('output.csv', mode='a', header=False)

# Using generators (process_line is assumed to parse one line into features)
def data_generator(filename, batch_size=32):
    while True:  # loop over the file indefinitely for training
        batch = []
        with open(filename, 'r') as f:
            for line in f:
                batch.append(process_line(line))
                if len(batch) == batch_size:
                    yield np.array(batch)
                    batch = []

# Dask for distributed computing
import dask.dataframe as dd
ddf = dd.read_csv('large_file.csv')
result = ddf.groupby('category').mean().compute()

Rarity: Very Common | Difficulty: Medium
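The memory-mapped files technique listed above can be sketched with numpy.memmap. This is a minimal illustration; the file name, shape, and dtype are assumptions and must match how the array was originally written to disk.

import numpy as np

# Memory-mapped access: the array lives on disk and only the slices
# you touch are paged into RAM (file and shape are hypothetical).
features = np.memmap('features.dat', dtype='float32', mode='r', shape=(10_000_000, 20))

# Work on one slice at a time instead of loading everything
batch = features[0:10_000]  # only this block is read from disk
print(batch.mean(axis=0))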
2. Explain decorators in Python and give an ML use case.
Answer: Decorators modify or enhance functions without changing their code.
- Use Cases in ML:
- Timing function execution
- Logging predictions
- Caching results
- Input validation
import time
import functools
import numpy as np

# Timing decorator
def timer(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(f"{func.__name__} took {end - start:.2f} seconds")
        return result
    return wrapper

# Logging decorator
def log_predictions(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        predictions = func(*args, **kwargs)
        print(f"Made {len(predictions)} predictions")
        print(f"Prediction distribution: {np.bincount(predictions)}")
        return predictions
    return wrapper

# Usage
@timer
@log_predictions
def predict_batch(model, X):
    return model.predict(X)

# Caching decorator (memoization)
def cache_results(func):
    cache = {}
    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

@cache_results
def expensive_feature_engineering(data_id):
    # Expensive computation (placeholder for real feature logic)
    return processed_features

Rarity: Common | Difficulty: Medium
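The input-validation use case from the list above can also be handled with a decorator. Here is a minimal sketch; the expected feature count and the function name predict_validated are hypothetical.

import functools
import numpy as np

# Hypothetical validation decorator: reject inputs with NaNs or the
# wrong number of features before they reach the model.
def validate_input(n_features):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(model, X, *args, **kwargs):
            X = np.asarray(X)
            if X.ndim != 2 or X.shape[1] != n_features:
                raise ValueError(f"Expected shape (n, {n_features}), got {X.shape}")
            if np.isnan(X).any():
                raise ValueError("Input contains NaN values")
            return func(model, X, *args, **kwargs)
        return wrapper
    return decorator

@validate_input(n_features=4)
def predict_validated(model, X):
    return model.predict(X)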
3. What is the difference between @staticmethod and @classmethod?
Answer: Both define methods that don't require an instance.
- @staticmethod: No access to class or instance
- @classmethod: Receives class as first argument
class MLModel:
    model_type = "classifier"

    def __init__(self, name):
        self.name = name

    # Regular method - needs an instance (assumes self.model is set elsewhere)
    def predict(self, X):
        return self.model.predict(X)

    # Static method - utility function
    @staticmethod
    def preprocess_data(X):
        # No access to self or cls
        return (X - X.mean()) / X.std()

    # Class method - factory pattern
    @classmethod
    def create_default(cls):
        # Has access to cls
        return cls(name=f"default_{cls.model_type}")

    @classmethod
    def from_config(cls, config):
        return cls(name=config['name'])

# Usage
# Static method - no instance needed
processed = MLModel.preprocess_data(X_train)

# Class method - creates an instance
model = MLModel.create_default()
model2 = MLModel.from_config({'name': 'my_model'})

Rarity: Medium | Difficulty: Medium
4. How do you handle exceptions in ML pipelines?
Answer: Proper error handling prevents pipeline failures and aids debugging.
import logging
from typing import Optional

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class ModelTrainingError(Exception):
    """Custom exception for model training failures"""
    pass

def train_model(X, y, model_type='random_forest'):
    try:
        logger.info(f"Starting training with {model_type}")
        # Validate inputs
        if X.shape[0] != y.shape[0]:
            raise ValueError("X and y must have the same number of samples")
        if X.shape[0] < 100:
            raise ModelTrainingError("Insufficient training data")
        # Train model
        if model_type == 'random_forest':
            from sklearn.ensemble import RandomForestClassifier
            model = RandomForestClassifier()
        else:
            raise ValueError(f"Unknown model type: {model_type}")
        model.fit(X, y)
        logger.info("Training completed successfully")
        return model
    except ValueError as e:
        logger.error(f"Validation error: {e}")
        raise
    except ModelTrainingError as e:
        logger.error(f"Training error: {e}")
        # Could fall back to a simpler model (train_fallback_model defined elsewhere)
        return train_fallback_model(X, y)
    except Exception as e:
        logger.error(f"Unexpected error: {e}")
        raise
    finally:
        logger.info("Training attempt finished")

# Context manager for resource management
class ModelLoader:
    def __init__(self, model_path):
        self.model_path = model_path
        self.model = None

    def __enter__(self):
        logger.info(f"Loading model from {self.model_path}")
        self.model = load_model(self.model_path)  # load_model defined elsewhere
        return self.model

    def __exit__(self, exc_type, exc_val, exc_tb):
        logger.info("Cleaning up resources")
        if self.model:
            del self.model
        return False  # Don't suppress exceptions

# Usage
with ModelLoader('model.pkl') as model:
    predictions = model.predict(X_test)

Rarity: Common | Difficulty: Medium
5. What are Python generators and why are they useful in ML?
Answer: Generators yield values one at a time, saving memory.
- Benefits:
- Memory efficient
- Lazy evaluation
- Infinite sequences
- ML Use Cases:
- Data loading
- Batch processing
- Data augmentation
import numpy as np

# Generator for batch processing
def batch_generator(X, y, batch_size=32):
    n_samples = len(X)
    indices = np.arange(n_samples)
    np.random.shuffle(indices)
    for start_idx in range(0, n_samples, batch_size):
        end_idx = min(start_idx + batch_size, n_samples)
        batch_indices = indices[start_idx:end_idx]
        yield X[batch_indices], y[batch_indices]

# Usage in training
for epoch in range(10):
    for X_batch, y_batch in batch_generator(X_train, y_train):
        model.train_on_batch(X_batch, y_batch)

# Data augmentation generator
def augment_images(images, labels):
    for img, label in zip(images, labels):
        # Original
        yield img, label
        # Flipped
        yield np.fliplr(img), label
        # Rotated
        yield np.rot90(img), label

# Infinite generator for training
def infinite_batch_generator(X, y, batch_size=32):
    while True:
        indices = np.random.choice(len(X), batch_size)
        yield X[indices], y[indices]

# Use with steps_per_epoch
gen = infinite_batch_generator(X_train, y_train)
# model.fit(gen, steps_per_epoch=100, epochs=10)

Rarity: Common | Difficulty: Medium
ML Algorithms & Theory (5 Questions)
6. Explain the difference between bagging and boosting.
Answer: Both are ensemble methods but work differently:
- Bagging (Bootstrap Aggregating):
- Parallel training on random subsets
- Reduces variance
- Example: Random Forest
- Boosting:
- Sequential training, each model corrects previous errors
- Reduces bias
- Examples: AdaBoost, Gradient Boosting, XGBoost
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Generate data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Bagging - Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_scores = cross_val_score(rf, X, y, cv=5)
print(f"Random Forest CV: {rf_scores.mean():.3f} (+/- {rf_scores.std():.3f})")

# Boosting - Gradient Boosting
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_scores = cross_val_score(gb, X, y, cv=5)
print(f"Gradient Boosting CV: {gb_scores.mean():.3f} (+/- {gb_scores.std():.3f})")

# XGBoost (advanced boosting)
import xgboost as xgb
xgb_model = xgb.XGBClassifier(n_estimators=100, random_state=42)
xgb_scores = cross_val_score(xgb_model, X, y, cv=5)
print(f"XGBoost CV: {xgb_scores.mean():.3f} (+/- {xgb_scores.std():.3f})")

Rarity: Very Common | Difficulty: Medium
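AdaBoost is listed above as a boosting example but not shown; it follows the same scikit-learn API. A minimal sketch reusing the same X and y:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# AdaBoost reweights misclassified samples so each new weak learner
# focuses on the examples the previous ones got wrong.
ada = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_scores = cross_val_score(ada, X, y, cv=5)
print(f"AdaBoost CV: {ada_scores.mean():.3f} (+/- {ada_scores.std():.3f})")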
7. How do you handle imbalanced datasets?
Answer: Imbalanced data can bias models toward the majority class.
- Techniques:
- Resampling: SMOTE, undersampling
- Class weights: Penalize misclassification
- Ensemble methods: Balanced Random Forest
- Evaluation: Use F1, precision, recall (not accuracy)
- Threshold adjustment: Optimize decision threshold
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

# Create an imbalanced dataset
X, y = make_classification(
    n_samples=1000, n_features=20,
    weights=[0.95, 0.05],  # 95% class 0, 5% class 1
    random_state=42
)
print(f"Original distribution: {Counter(y)}")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# 1. Class weights
model_weighted = RandomForestClassifier(class_weight='balanced', random_state=42)
model_weighted.fit(X_train, y_train)
print("\nWith class weights:")
print(classification_report(y_test, model_weighted.predict(X_test)))

# 2. SMOTE (oversampling the minority class)
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
print(f"After SMOTE: {Counter(y_train_smote)}")
model_smote = RandomForestClassifier(random_state=42)
model_smote.fit(X_train_smote, y_train_smote)
print("\nWith SMOTE:")
print(classification_report(y_test, model_smote.predict(X_test)))

# 3. Threshold adjustment
y_proba = model_weighted.predict_proba(X_test)[:, 1]
threshold = 0.3  # a lower threshold favors the minority class
y_pred_adjusted = (y_proba >= threshold).astype(int)
print("\nWith adjusted threshold:")
print(classification_report(y_test, y_pred_adjusted))

Rarity: Very Common | Difficulty: Medium
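RandomUnderSampler is imported above but never used. Here is a minimal sketch of the undersampling option mentioned in the list, reusing X_train, y_train, and y_test from the example:

from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from collections import Counter

# 4. Random undersampling: drop majority-class samples until classes balance.
# Cheap and fast, but it discards data, so it suits large datasets best.
undersampler = RandomUnderSampler(random_state=42)
X_train_under, y_train_under = undersampler.fit_resample(X_train, y_train)
print(f"After undersampling: {Counter(y_train_under)}")

model_under = RandomForestClassifier(random_state=42)
model_under.fit(X_train_under, y_train_under)
print(classification_report(y_test, model_under.predict(X_test)))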
8. What is cross-validation and why is it important?
Answer: Cross-validation evaluates model performance more reliably than single train-test split.
- Types:
- K-Fold: Split into k folds
- Stratified K-Fold: Preserves class distribution
- Time Series Split: Respects temporal order
- Benefits:
- More robust performance estimate
- Uses all data for training and validation
- Detects overfitting
from sklearn.model_selection import (
    cross_val_score, KFold, StratifiedKFold,
    TimeSeriesSplit, cross_validate
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Load data
data = load_iris()
X, y = data.data, data.target
model = RandomForestClassifier(random_state=42)

# Standard K-Fold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold)
print(f"K-Fold scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std():.3f})")

# Stratified K-Fold (preserves class distribution)
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
stratified_scores = cross_val_score(model, X, y, cv=stratified_kfold)
print(f"\nStratified K-Fold: {stratified_scores.mean():.3f}")

# Time Series Split (for temporal data; shown here only to illustrate the API,
# since iris is not a time series)
tscv = TimeSeriesSplit(n_splits=5)
ts_scores = cross_val_score(model, X, y, cv=tscv)
print(f"Time Series CV: {ts_scores.mean():.3f}")

# Multiple metrics
cv_results = cross_validate(
    model, X, y, cv=5,
    scoring=['accuracy', 'precision_macro', 'recall_macro', 'f1_macro'],
    return_train_score=True
)
print(f"\nAccuracy: {cv_results['test_accuracy'].mean():.3f}")
print(f"Precision: {cv_results['test_precision_macro'].mean():.3f}")
print(f"Recall: {cv_results['test_recall_macro'].mean():.3f}")
print(f"F1: {cv_results['test_f1_macro'].mean():.3f}")

Rarity: Very Common | Difficulty: Easy
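To make the "uses all data for training and validation" point concrete, here is a minimal manual loop that mirrors what cross_val_score does, reusing X, y, and model from above:

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
import numpy as np

# Each sample lands in exactly one validation fold and in the training
# set of every other fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[val_idx])
    fold_scores.append(accuracy_score(y[val_idx], y_pred))
    print(f"Fold {fold}: {fold_scores[-1]:.3f}")
print(f"Mean accuracy: {np.mean(fold_scores):.3f}")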
9. Explain precision, recall, and F1-score.
Answer: Classification metrics for evaluating model performance:
- Precision: Of predicted positives, how many are correct
- Formula: TP / (TP + FP)
- Use when: False positives are costly
- Recall: Of actual positives, how many were found
- Formula: TP / (TP + FN)
- Use when: False negatives are costly
- F1-Score: Harmonic mean of precision and recall
- Formula: 2 × (Precision × Recall) / (Precision + Recall)
- Use when: Need balance between precision and recall
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)
import numpy as np

# Example predictions
y_true = np.array([0, 1, 1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1, 1, 0])

# Calculate metrics
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"Precision: {precision:.3f}")  # 0.800
print(f"Recall: {recall:.3f}")        # 0.800
print(f"F1-Score: {f1:.3f}")          # 0.800

# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print(f"\nConfusion Matrix:\n{cm}")
# [[4 1]
#  [1 4]]

# Classification report (all metrics)
print("\nClassification Report:")
print(classification_report(y_true, y_pred))

# Trade-off example (assumes a fitted classifier `model` and a held-out X_test, y_test)
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

# Get probability predictions
y_proba = model.predict_proba(X_test)[:, 1]

# Calculate precision and recall at different thresholds
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)

# Find the optimal threshold (maximize F1); drop the last point, which has no threshold
f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-10)
optimal_idx = np.argmax(f1_scores)
optimal_threshold = thresholds[optimal_idx]
print(f"Optimal threshold: {optimal_threshold:.3f}")

Rarity: Very Common | Difficulty: Easy
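To tie the formulas to the numbers above, the same results can be computed by hand from the true-positive, false-positive, and false-negative counts:

import numpy as np

y_true = np.array([0, 1, 1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1, 1, 0])

# Count true positives, false positives, and false negatives directly
TP = np.sum((y_true == 1) & (y_pred == 1))  # 4
FP = np.sum((y_true == 0) & (y_pred == 1))  # 1
FN = np.sum((y_true == 1) & (y_pred == 0))  # 1

precision = TP / (TP + FP)                           # 4 / 5 = 0.8
recall = TP / (TP + FN)                              # 4 / 5 = 0.8
f1 = 2 * precision * recall / (precision + recall)   # 0.8
print(precision, recall, f1)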
10. What is regularization and when would you use it?
Answer: Regularization prevents overfitting by penalizing model complexity.
- Types:
- L1 (Lasso): Penalizes the sum of absolute coefficient values; can shrink some coefficients to exactly zero (implicit feature selection)
- L2 (Ridge): Penalizes the sum of squared coefficients; shrinks them toward zero without eliminating them
- Elastic Net: Combines the L1 and L2 penalties
- When to use:
- High variance (overfitting)
- Many features
- Multicollinearity
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Generate data with many features
X, y = make_regression(
    n_samples=100, n_features=50,
    n_informative=10, noise=10, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# No regularization
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_score = r2_score(y_test, lr.predict(X_test))
print(f"Linear Regression R²: {lr_score:.3f}")

# L2 regularization (Ridge)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
ridge_score = r2_score(y_test, ridge.predict(X_test))
print(f"Ridge R²: {ridge_score:.3f}")

# L1 regularization (Lasso)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
lasso_score = r2_score(y_test, lasso.predict(X_test))
print(f"Lasso R²: {lasso_score:.3f}")
print(f"Lasso non-zero coefficients: {np.sum(lasso.coef_ != 0)}")

# Elastic Net (L1 + L2)
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train, y_train)
elastic_score = r2_score(y_test, elastic.predict(X_test))
print(f"Elastic Net R²: {elastic_score:.3f}")

# Hyperparameter tuning for alpha
from sklearn.model_selection import GridSearchCV
param_grid = {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(Ridge(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(f"\nBest alpha: {grid.best_params_['alpha']}")
print(f"Best CV score: {grid.best_score_:.3f}")

Rarity: Very Common | Difficulty: Medium
Model Training & Deployment (5 Questions)
11. How do you save and load models in production?
Answer: Model serialization enables deployment and reuse.
import joblib
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Train model
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier()
model.fit(X, y)

# Method 1: joblib (recommended for scikit-learn)
joblib.dump(model, 'model.joblib')
loaded_model = joblib.load('model.joblib')

# Method 2: pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

# For TensorFlow/Keras
import tensorflow as tf
# Save the entire model
# keras_model.save('model.h5')
# loaded_model = tf.keras.models.load_model('model.h5')
# Save weights only
# keras_model.save_weights('model_weights.h5')
# new_model.load_weights('model_weights.h5')

# For PyTorch
import torch
# Save the model state dict
# torch.save(model.state_dict(), 'model.pth')
# model.load_state_dict(torch.load('model.pth'))
# Save the entire model
# torch.save(model, 'model_complete.pth')
# model = torch.load('model_complete.pth')

# Version control for models
import datetime
model_version = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
model_path = f'models/model_{model_version}.joblib'
joblib.dump(model, model_path)
print(f"Model saved to {model_path}")

Rarity: Very Common | Difficulty: Easy
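In practice it also helps to store metadata next to the serialized artifact so you can trace what produced it. A minimal sketch reusing model_path and model_version from above; the field names and JSON file convention are assumptions, not a fixed standard:

import json
import sklearn

# Hypothetical metadata file saved alongside the model artifact
metadata = {
    "model_path": model_path,
    "trained_at": model_version,
    "sklearn_version": sklearn.__version__,  # pickled models are version-sensitive
    "feature_names": ["sepal_length", "sepal_width", "petal_length", "petal_width"],
}
with open(f'models/model_{model_version}.json', 'w') as f:
    json.dump(metadata, f, indent=2)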
12. How do you create a REST API for model serving?
Answer: REST APIs make models accessible to applications.
from flask import Flask, request, jsonify
import joblib
import numpy as np

app = Flask(__name__)

# Load model at startup
model = joblib.load('model.joblib')

@app.route('/predict', methods=['POST'])
def predict():
    try:
        # Get data from request
        data = request.get_json()
        features = np.array(data['features']).reshape(1, -1)
        # Make prediction
        prediction = model.predict(features)
        probability = model.predict_proba(features)
        # Return response
        return jsonify({
            'prediction': int(prediction[0]),
            'probability': probability[0].tolist(),
            'status': 'success'
        })
    except Exception as e:
        return jsonify({
            'error': str(e),
            'status': 'error'
        }), 400

@app.route('/health', methods=['GET'])
def health():
    return jsonify({'status': 'healthy'})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

# FastAPI alternative (modern, faster)
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class PredictionRequest(BaseModel):
    features: list

class PredictionResponse(BaseModel):
    prediction: int
    probability: list
    status: str

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    try:
        features = np.array(request.features).reshape(1, -1)
        prediction = model.predict(features)
        probability = model.predict_proba(features)
        return PredictionResponse(
            prediction=int(prediction[0]),
            probability=probability[0].tolist(),
            status="success"
        )
    except Exception as e:
        raise HTTPException(status_code=400, detail=str(e))

# Usage:
# curl -X POST "http://localhost:5000/predict" \
#   -H "Content-Type: application/json" \
#   -d '{"features": [5.1, 3.5, 1.4, 0.2]}'

Rarity: Very Common | Difficulty: Medium
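A client can call the same endpoint from Python as well. A minimal sketch using the requests library, mirroring the URL and feature values from the curl example above:

import requests

# Hypothetical client call to the prediction endpoint shown above
response = requests.post(
    "http://localhost:5000/predict",
    json={"features": [5.1, 3.5, 1.4, 0.2]},
    timeout=5,
)
response.raise_for_status()
print(response.json())  # e.g. {'prediction': 0, 'probability': [...], 'status': 'success'}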
13. What is Docker and why is it useful for ML deployment?
Answer: Docker containers package applications with all dependencies.
- Benefits:
- Reproducibility
- Consistency across environments
- Easy deployment
- Isolation
# Dockerfile for ML model
FROM python:3.9-slim

WORKDIR /app

# Copy requirements
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and code
COPY model.joblib .
COPY app.py .

# Expose port
EXPOSE 5000

# Run application
CMD ["python", "app.py"]

# Build Docker image
docker build -t ml-model:v1 .

# Run container
docker run -p 5000:5000 ml-model:v1

# Docker Compose for multi-container setup
# docker-compose.yml
version: '3.8'
services:
  model:
    build: .
    ports:
      - "5000:5000"
    environment:
      - MODEL_PATH=/app/model.joblib
    volumes:
      - ./models:/app/models

Rarity: Common | Difficulty: Medium
14. How do you monitor model performance in production?
Answer: Monitoring detects model degradation and ensures reliability.
- What to Monitor:
- Prediction metrics: Accuracy, latency
- Data drift: Input distribution changes
- Model drift: Performance degradation
- System metrics: CPU, memory, errors
import logging
import time
from datetime import datetime
import numpy as np

class ModelMonitor:
    def __init__(self, model):
        self.model = model
        self.predictions = []
        self.actuals = []
        self.latencies = []
        self.input_stats = []
        # Setup logging
        logging.basicConfig(
            filename='model_monitor.log',
            level=logging.INFO,
            format='%(asctime)s - %(message)s'
        )

    def predict(self, X):
        # Track input statistics
        self.input_stats.append({
            'mean': X.mean(),
            'std': X.std(),
            'min': X.min(),
            'max': X.max()
        })
        # Measure latency
        start = time.time()
        prediction = self.model.predict(X)
        latency = time.time() - start
        self.predictions.append(prediction)
        self.latencies.append(latency)
        # Log prediction
        logging.info(f"Prediction: {prediction}, Latency: {latency:.3f}s")
        # Alert if latency too high
        if latency > 1.0:
            logging.warning(f"High latency detected: {latency:.3f}s")
        return prediction

    def log_actual(self, y_true):
        self.actuals.append(y_true)
        # Calculate accuracy if we have enough data
        if len(self.actuals) >= 100:
            accuracy = np.mean(
                np.array(self.predictions[-100:]) == np.array(self.actuals[-100:])
            )
            logging.info(f"Rolling accuracy (last 100): {accuracy:.3f}")
            if accuracy < 0.8:
                logging.error(f"Model performance degraded: {accuracy:.3f}")

    def check_data_drift(self, reference_stats):
        if not self.input_stats:
            return
        current_stats = self.input_stats[-1]
        # Simple drift detection (compare means)
        mean_diff = abs(current_stats['mean'] - reference_stats['mean'])
        if mean_diff > 2 * reference_stats['std']:
            logging.warning(f"Data drift detected: mean diff = {mean_diff:.3f}")

# Usage
monitor = ModelMonitor(model)
prediction = monitor.predict(X_new)
# Later, when the actual label is available
monitor.log_actual(y_true)

Rarity: Common | Difficulty: Medium
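The mean-comparison drift check above is deliberately simple. A common next step is a two-sample statistical test per feature; here is a minimal sketch using scipy's Kolmogorov-Smirnov test. The 0.05 significance level and the array names in the usage comment are assumptions.

import numpy as np
from scipy.stats import ks_2samp

def detect_feature_drift(reference_X, current_X, alpha=0.05):
    """Flag features whose current distribution differs from the reference.

    Runs a two-sample Kolmogorov-Smirnov test per feature column and
    reports those with a p-value below alpha.
    """
    drifted = []
    for i in range(reference_X.shape[1]):
        statistic, p_value = ks_2samp(reference_X[:, i], current_X[:, i])
        if p_value < alpha:
            drifted.append((i, p_value))
    return drifted

# Hypothetical usage with a reference sample saved at training time
# drifted_features = detect_feature_drift(X_train, X_recent)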
15. What is CI/CD for machine learning?
Answer: CI/CD automates testing and deployment of ML models.
- Continuous Integration:
- Automated testing
- Code quality checks
- Model validation
- Continuous Deployment:
- Automated deployment
- Rollback capabilities
- A/B testing
# .github/workflows/ml-pipeline.yml
name: ML Pipeline

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.9
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest
      - name: Run tests
        run: pytest tests/
      - name: Train model
        run: python train.py
      - name: Validate model
        run: python validate_model.py

  deploy:
    needs: test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - name: Deploy to production
        run: |
          docker build -t ml-model:latest .
          docker push ml-model:latest

# tests/test_model.py
import pytest
import numpy as np
import joblib
from sklearn.datasets import load_iris

def test_model_accuracy():
    from train import train_model
    X, y = load_iris(return_X_y=True)
    model, accuracy = train_model(X, y)
    assert accuracy > 0.9, f"Model accuracy {accuracy} below threshold"

def test_model_prediction_shape():
    model = joblib.load('model.joblib')
    X_test = np.random.rand(10, 4)
    predictions = model.predict(X_test)
    assert predictions.shape == (10,), "Unexpected prediction shape"

def test_model_prediction_range():
    model = joblib.load('model.joblib')
    X_test = np.random.rand(10, 4)
    predictions = model.predict(X_test)
    assert all(p in [0, 1, 2] for p in predictions), "Invalid predictions"

Rarity: Medium | Difficulty: Hard
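The workflow above runs validate_model.py as a quality gate, but the script itself is not shown. A minimal hypothetical sketch; the 0.9 threshold, file names, and exit-code convention are assumptions:

# validate_model.py - hypothetical validation gate run by the CI pipeline
import sys
import joblib
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

THRESHOLD = 0.9  # assumed minimum acceptable accuracy

def main():
    model = joblib.load('model.joblib')
    X, y = load_iris(return_X_y=True)
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"Cross-validated accuracy: {score:.3f}")
    if score < THRESHOLD:
        # A non-zero exit code fails the CI job and blocks deployment
        sys.exit(1)

if __name__ == '__main__':
    main()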




