Junior Data Scientist Interview Questions: Complete Guide

By Milad Bonakdar
Master data science fundamentals with essential interview questions covering statistics, Python, machine learning basics, data manipulation, and visualization for junior data scientists.
Introduction
Data science combines statistics, programming, and domain knowledge to extract insights from data. Junior data scientists are expected to have a solid foundation in Python, statistics, machine learning basics, and data manipulation tools.
This guide covers essential interview questions for Junior Data Scientists. We explore Python programming, statistics fundamentals, data manipulation with pandas, machine learning concepts, data visualization, and SQL to help you prepare for your first data science role.
Python Fundamentals (5 Questions)
1. What is the difference between a list and a tuple in Python?
Answer:
- List: Mutable (can be modified), defined with square brackets []
- Tuple: Immutable (cannot be modified), defined with parentheses ()
- Performance: Tuples are slightly faster and use less memory
- Use Cases:
- Lists: When you need to modify data
- Tuples: For fixed collections, dictionary keys, function returns
# List - mutable
my_list = [1, 2, 3]
my_list[0] = 10 # Works
my_list.append(4) # Works
print(my_list) # [10, 2, 3, 4]
# Tuple - immutable
my_tuple = (1, 2, 3)
# my_tuple[0] = 10 # Error: tuples are immutable
# my_tuple.append(4) # Error: no append method
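# Because tuples are immutable (and therefore hashable), they can serve as
# dictionary keys - a quick sketch of the use case mentioned above; lists cannot
coordinates = {(40.7, -74.0): "New York", (51.5, -0.1): "London"}
print(coordinates[(40.7, -74.0)]) # New York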
# Tuple unpacking
x, y, z = (1, 2, 3)
print(x, y, z) # 1 2 3
Rarity: Very Common | Difficulty: Easy
2. Explain list comprehension and give an example.
Answer: List comprehension provides a concise way to create lists based on existing iterables.
- Syntax: [expression for item in iterable if condition]
- Benefits: More readable, often faster than loops
# Traditional loop
squares = []
for i in range(10):
    squares.append(i ** 2)
# List comprehension
squares = [i ** 2 for i in range(10)]
print(squares) # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
# With condition
even_squares = [i ** 2 for i in range(10) if i % 2 == 0]
print(even_squares) # [0, 4, 16, 36, 64]
# Nested comprehension
matrix = [[i * j for j in range(3)] for i in range(3)]
print(matrix) # [[0, 0, 0], [0, 1, 2], [0, 2, 4]]
# Dictionary comprehension
squares_dict = {i: i ** 2 for i in range(5)}
print(squares_dict) # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
Rarity: Very Common | Difficulty: Easy
3. What are lambda functions and when would you use them?
Answer: Lambda functions are anonymous, single-expression functions.
- Syntax: lambda arguments: expression
- Use Cases: Short functions, callbacks, sorting, filtering
# Regular function
def square(x):
    return x ** 2
# Lambda function
square_lambda = lambda x: x ** 2
print(square_lambda(5)) # 25
# With map
numbers = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x ** 2, numbers))
print(squared) # [1, 4, 9, 16, 25]
# With filter
evens = list(filter(lambda x: x % 2 == 0, numbers))
print(evens) # [2, 4]
# Sorting with key
students = [('Alice', 85), ('Bob', 92), ('Charlie', 78)]
sorted_students = sorted(students, key=lambda x: x[1], reverse=True)
print(sorted_students) # [('Bob', 92), ('Alice', 85), ('Charlie', 78)]
Rarity: Very Common | Difficulty: Easy
4. Explain the difference between append() and extend() for lists.
Answer:
- append(): Adds a single element to the end of the list
- extend(): Adds multiple elements from an iterable to the end
# append - adds single element
list1 = [1, 2, 3]
list1.append(4)
print(list1) # [1, 2, 3, 4]
list1.append([5, 6])
print(list1) # [1, 2, 3, 4, [5, 6]] - list as single element
# extend - adds multiple elements
list2 = [1, 2, 3]
list2.extend([4, 5, 6])
print(list2) # [1, 2, 3, 4, 5, 6]
# Alternative to extend
list3 = [1, 2, 3]
list3 += [4, 5, 6]
print(list3) # [1, 2, 3, 4, 5, 6]
Rarity: Common | Difficulty: Easy
5. What are *args and **kwargs?
Answer: They allow functions to accept variable numbers of arguments.
- *args: Variable number of positional arguments (collected into a tuple)
- **kwargs: Variable number of keyword arguments (collected into a dictionary)
# *args - positional arguments
def sum_all(*args):
    return sum(args)
print(sum_all(1, 2, 3)) # 6
print(sum_all(1, 2, 3, 4, 5)) # 15
# **kwargs - keyword arguments
def print_info(**kwargs):
    for key, value in kwargs.items():
        print(f"{key}: {value}")
print_info(name="Alice", age=25, city="NYC")
# name: Alice
# age: 25
# city: NYC
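# The same symbols also unpack sequences and dicts at the call site
# (a small sketch reusing the functions defined above)
nums = [1, 2, 3]
print(sum_all(*nums)) # 6
info = {"name": "Bob", "city": "LA"}
print_info(**info)
# name: Bob
# city: LA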
# Combined
def flexible_function(*args, **kwargs):
    print("Positional:", args)
    print("Keyword:", kwargs)
flexible_function(1, 2, 3, name="Alice", age=25)
# Positional: (1, 2, 3)
# Keyword: {'name': 'Alice', 'age': 25}
Rarity: Common | Difficulty: Medium
Statistics & Probability (5 Questions)
6. What is the difference between mean, median, and mode?
Answer:
- Mean: Average of all values (sum / count)
- Median: Middle value when sorted
- Mode: Most frequently occurring value
- When to use:
- Mean: Normally distributed data
- Median: Skewed data or outliers present
- Mode: Categorical data
import numpy as np
from scipy import stats
data = [1, 2, 2, 3, 4, 5, 100]
# Mean - affected by outliers
mean = np.mean(data)
print(f"Mean: {mean}") # 16.71
# Median - robust to outliers
median = np.median(data)
print(f"Median: {median}") # 3
# Mode
mode = stats.mode(data, keepdims=True)
print(f"Mode: {mode.mode[0]}") # 2Rarity: Very Common Difficulty: Easy
7. Explain variance and standard deviation.
Answer:
- Variance: Average squared deviation from the mean
- Standard Deviation: Square root of variance (same units as data)
- Purpose: Measure spread/dispersion of data
import numpy as np
data = [2, 4, 4, 4, 5, 5, 7, 9]
# Variance
variance = np.var(data, ddof=1) # ddof=1 for sample variance
print(f"Variance: {variance}") # 4.57
# Standard deviation
std_dev = np.std(data, ddof=1)
print(f"Std Dev: {std_dev}") # 2.14
# Manual calculation
mean = np.mean(data)
variance_manual = sum((x - mean) ** 2 for x in data) / (len(data) - 1)
print(f"Manual Variance: {variance_manual}")Rarity: Very Common Difficulty: Easy
8. What is a p-value and how do you interpret it?
Answer: The p-value is the probability of obtaining results at least as extreme as observed, assuming the null hypothesis is true.
- Interpretation:
- p < 0.05: Reject null hypothesis (statistically significant)
- p ≥ 0.05: Fail to reject null hypothesis
- Note: p-value doesn't measure effect size or importance
from scipy import stats
# Example: Testing if a coin is fair
# Null hypothesis: coin is fair (p = 0.5)
# We got 65 heads out of 100 flips
observed_heads = 65
n_flips = 100
expected_proportion = 0.5
# Binomial test (binom_test was removed from recent SciPy versions; use binomtest)
result = stats.binomtest(observed_heads, n_flips, expected_proportion)
p_value = result.pvalue
print(f"P-value: {p_value:.4f}") # ~0.0035 (two-sided)
if p_value < 0.05:
    print("Reject null hypothesis - coin is likely biased")
else:
    print("Fail to reject null hypothesis - coin appears fair")
Rarity: Very Common | Difficulty: Medium
9. What is the Central Limit Theorem?
Answer: The Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as sample size increases, regardless of the population's distribution.
- Key Points:
- Works for any distribution (if sample size is large enough)
- Typically n ≥ 30 is considered sufficient
- Enables hypothesis testing and confidence intervals
import numpy as np
import matplotlib.pyplot as plt
# Population with non-normal distribution (exponential)
population = np.random.exponential(scale=2, size=10000)
# Take many samples and calculate their means
sample_means = []
for _ in range(1000):
    sample = np.random.choice(population, size=30)
    sample_means.append(np.mean(sample))
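# Optional visualization (uses the matplotlib import above): the histogram of
# sample_means should look roughly bell-shaped even though the population is exponential
plt.hist(sample_means, bins=30)
plt.title("Distribution of sample means")
plt.show()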
# Sample means are normally distributed (CLT)
print(f"Population mean: {np.mean(population):.2f}")
print(f"Mean of sample means: {np.mean(sample_means):.2f}")
print(f"Std of sample means: {np.std(sample_means):.2f}")Rarity: Common Difficulty: Medium
10. What is correlation vs causation?
Answer:
- Correlation: Statistical relationship between two variables
- Causation: One variable directly causes changes in another
- Key Point: Correlation does NOT imply causation
- Reasons:
- Confounding variables
- Reverse causation
- Coincidence
import numpy as np
import pandas as pd
# Example: Ice cream sales and drowning deaths are correlated
# But ice cream doesn't cause drowning (confounding variable: temperature)
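# A quick simulation of that confounder idea (all numbers are made up for illustration):
# temperature drives both variables, so they correlate with no causal link between them
np.random.seed(0)
temperature = np.random.uniform(15, 35, 200)
ice_cream_sales = 10 * temperature + np.random.normal(0, 20, 200)
drownings = 0.5 * temperature + np.random.normal(0, 2, 200)
print(f"Ice cream vs drownings r: {np.corrcoef(ice_cream_sales, drownings)[0, 1]:.2f}") # strongly positive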
# Correlation coefficient
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
correlation = np.corrcoef(x, y)[0, 1]
print(f"Correlation: {correlation:.2f}") # 0.82
# Pearson correlation
from scipy.stats import pearsonr
corr, p_value = pearsonr(x, y)
print(f"Pearson r: {corr:.2f}, p-value: {p_value:.3f}")Rarity: Very Common Difficulty: Easy
Data Manipulation with Pandas (5 Questions)
11. How do you read a CSV file and display basic information?
Answer: Use pandas to read and explore data.
import pandas as pd
# Read CSV
df = pd.read_csv('data.csv')
# Basic information
print(df.head()) # First 5 rows
print(df.tail()) # Last 5 rows
print(df.shape) # (rows, columns)
print(df.info()) # Data types and non-null counts
print(df.describe()) # Statistical summary
# Column names and types
print(df.columns)
print(df.dtypes)
# Check for missing values
print(df.isnull().sum())
# Specific columns
print(df[['column1', 'column2']].head())
Rarity: Very Common | Difficulty: Easy
12. How do you handle missing values in a DataFrame?
Answer: Multiple strategies for handling missing data:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})
# Check missing values
print(df.isnull().sum())
# Drop rows with any missing values
df_dropped = df.dropna()
# Drop columns with any missing values
df_dropped_cols = df.dropna(axis=1)
# Fill with specific value
df_filled = df.fillna(0)
# Fill with mean / median (on a copy, so the later examples still see the original NaNs)
df_stat = df.copy()
df_stat['A'] = df_stat['A'].fillna(df['A'].mean())
df_stat['B'] = df_stat['B'].fillna(df['B'].median())
# Forward fill (use previous value)
df_ffill = df.ffill()
# Backward fill (use next value)
df_bfill = df.bfill()
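# scikit-learn also offers imputation (a sketch, assuming scikit-learn is installed)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)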
# Interpolate
df_interpolated = df.interpolate()
Rarity: Very Common | Difficulty: Easy
13. How do you filter and select data in pandas?
Answer: Multiple ways to filter and select data:
import pandas as pd
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 75000, 55000],
    'department': ['IT', 'HR', 'IT', 'Finance']
})
# Select columns
print(df['name']) # Single column (Series)
print(df[['name', 'age']]) # Multiple columns (DataFrame)
# Filter rows
high_salary = df[df['salary'] > 55000]
print(high_salary)
# Multiple conditions
it_high_salary = df[(df['department'] == 'IT') & (df['salary'] > 50000)]
print(it_high_salary)
# Using .loc (label-based; slice end is inclusive)
print(df.loc[0:2, ['name', 'age']])
# Using .iloc (position-based; slice end is exclusive, like regular Python slicing)
print(df.iloc[0:2, 0:2])
# Query method
result = df.query('age > 28 and salary > 55000')
print(result)
# isin method
it_or_hr = df[df['department'].isin(['IT', 'HR'])]
print(it_or_hr)
Rarity: Very Common | Difficulty: Easy
14. How do you group and aggregate data?
Answer:
Use groupby() for aggregation operations:
import pandas as pd
df = pd.DataFrame({
    'department': ['IT', 'HR', 'IT', 'Finance', 'HR', 'IT'],
    'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'salary': [50000, 45000, 60000, 55000, 48000, 65000],
    'age': [25, 30, 35, 28, 32, 40]
})
# Group by single column
dept_avg_salary = df.groupby('department')['salary'].mean()
print(dept_avg_salary)
# Multiple aggregations
dept_stats = df.groupby('department').agg({
    'salary': ['mean', 'min', 'max'],
    'age': 'mean'
})
print(dept_stats)
# Custom aggregation
dept_custom = df.groupby('department').agg({
    'salary': lambda x: x.max() - x.min(),
    'employee': 'count'
})
print(dept_custom)
# Multiple group by columns
result = df.groupby(['department', 'age'])['salary'].sum()
print(result)
Rarity: Very Common | Difficulty: Medium
15. How do you merge or join DataFrames?
Answer:
Use merge(), join(), or concat():
import pandas as pd
# Sample DataFrames
df1 = pd.DataFrame({
    'employee_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
    'employee_id': [1, 2, 3, 5],
    'salary': [50000, 60000, 75000, 55000]
})
# Inner join (only matching rows)
inner = pd.merge(df1, df2, on='employee_id', how='inner')
print(inner)
# Left join (all rows from left)
left = pd.merge(df1, df2, on='employee_id', how='left')
print(left)
# Right join (all rows from right)
right = pd.merge(df1, df2, on='employee_id', how='right')
print(right)
# Outer join (all rows from both)
outer = pd.merge(df1, df2, on='employee_id', how='outer')
print(outer)
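# join() - the index-based alternative mentioned above (a quick sketch)
joined = df1.set_index('employee_id').join(df2.set_index('employee_id'))
print(joined)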
# Concatenate vertically (rows are stacked; non-matching columns are filled with NaN)
df3 = pd.concat([df1, df2], ignore_index=True)
print(df3)
# Concatenate horizontally
df4 = pd.concat([df1, df2], axis=1)
print(df4)
Rarity: Very Common | Difficulty: Medium
Machine Learning Basics (5 Questions)
16. What is the difference between supervised and unsupervised learning?
Answer:
- Supervised Learning:
- Has labeled training data (input-output pairs)
- Goal: Learn mapping from inputs to outputs
- Examples: Classification, Regression
- Algorithms: Linear Regression, Decision Trees, SVM
- Unsupervised Learning:
- No labeled data (only inputs)
- Goal: Find patterns or structure in data
- Examples: Clustering, Dimensionality Reduction
- Algorithms: K-Means, PCA, Hierarchical Clustering
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans
import numpy as np
# Supervised Learning - Linear Regression
X_train = np.array([[1], [2], [3], [4], [5]])
y_train = np.array([2, 4, 6, 8, 10])
model = LinearRegression()
model.fit(X_train, y_train)
prediction = model.predict([[6]])
print(f"Supervised prediction: {prediction[0]}") # 12
# Unsupervised Learning - K-Means Clustering
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
kmeans = KMeans(n_clusters=2, random_state=42)
clusters = kmeans.fit_predict(X)
print(f"Cluster assignments: {clusters}")Rarity: Very Common Difficulty: Easy
17. What is overfitting and how do you prevent it?
Answer: Overfitting occurs when a model learns training data too well, including noise, and performs poorly on new data.
- Signs:
- High training accuracy, low test accuracy
- Model too complex for the data
- Prevention:
- More training data
- Cross-validation
- Regularization (L1, L2)
- Simpler models
- Early stopping
- Dropout (neural networks)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
# Generate data
X = np.random.rand(100, 1) * 10
y = 2 * X + 3 + np.random.randn(100, 1) * 2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Overfitting example - high degree polynomial
poly = PolynomialFeatures(degree=15)
X_poly = poly.fit_transform(X_train)
# Regularization to prevent overfitting
# Ridge (L2 regularization)
ridge = Ridge(alpha=1.0)
ridge.fit(X_poly, y_train)
# Lasso (L1 regularization)
lasso = Lasso(alpha=0.1)
lasso.fit(X_poly, y_train)
print(f"Ridge score: {ridge.score(X_poly, y_train)}")
print(f"Lasso score: {lasso.score(X_poly, y_train)}")Rarity: Very Common Difficulty: Medium
18. Explain the train-test split and why it's important.
Answer: Train-test split divides data into training and testing sets to evaluate model performance on unseen data.
- Purpose: Prevent overfitting, estimate real-world performance
- Typical Split: 70-30 or 80-20 (train-test)
- Cross-Validation: More robust evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
# Train model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
# Evaluate
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"Training accuracy: {train_score:.2f}")
print(f"Test accuracy: {test_score:.2f}")
# Cross-validation (more robust)
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"CV scores: {cv_scores}")
print(f"Mean CV score: {cv_scores.mean():.2f}")Rarity: Very Common Difficulty: Easy
19. What evaluation metrics do you use for classification?
Answer: Different metrics for different scenarios:
- Accuracy: Overall correctness (good for balanced datasets)
- Precision: Of predicted positives, how many are correct
- Recall: Of actual positives, how many were found
- F1-Score: Harmonic mean of precision and recall
- Confusion Matrix: Detailed breakdown of predictions
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)
# Train model
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(f"\nConfusion Matrix:\n{cm}")
# Classification report
print(f"\n{classification_report(y_test, y_pred)}")Rarity: Very Common Difficulty: Medium
20. What is the difference between classification and regression?
Answer:
- Classification:
- Predicts discrete categories/classes
- Output: Class label
- Examples: Spam detection, image classification
- Algorithms: Logistic Regression, Decision Trees, SVM
- Metrics: Accuracy, Precision, Recall, F1
- Regression:
- Predicts continuous numerical values
- Output: Number
- Examples: House price prediction, temperature forecasting
- Algorithms: Linear Regression, Random Forest Regressor
- Metrics: MSE, RMSE, MAE, R²
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Regression example
X_reg = np.array([[1], [2], [3], [4], [5]])
y_reg = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
reg_model = LinearRegression()
reg_model.fit(X_reg, y_reg)
y_pred_reg = reg_model.predict([[6]])
print(f"Regression prediction: {y_pred_reg[0]:.2f}") # Continuous value
# Classification example
X_clf = np.array([[1], [2], [3], [4], [5]])
y_clf = np.array([0, 0, 1, 1, 1]) # Binary classes
clf_model = LogisticRegression()
clf_model.fit(X_clf, y_clf)
y_pred_clf = clf_model.predict([[3.5]])
print(f"Classification prediction: {y_pred_clf[0]}") # Class label (0 or 1)Rarity: Very Common Difficulty: Easy




