Junior Data Scientist Interview Questions: Complete Guide

By Milad Bonakdar
Master data science fundamentals with essential interview questions covering statistics, Python, machine learning basics, data manipulation, and visualization for junior data scientists.
Introduction
Data science combines statistics, programming, and domain knowledge to extract insights from data. Junior data scientists are expected to have a solid foundation in Python, statistics, machine learning basics, and data manipulation tools.
This guide covers essential interview questions for Junior Data Scientists. We explore Python programming, statistics fundamentals, data manipulation with pandas, machine learning concepts, data visualization, and SQL to help you prepare for your first data science role.
Python Fundamentals (5 Questions)
1. What is the difference between a list and a tuple in Python?
Answer:
- List: Mutable (can be modified), defined with square brackets []
- Tuple: Immutable (cannot be modified), defined with parentheses ()
- Performance: Tuples are slightly faster and use less memory
- Use Cases:
- Lists: When you need to modify data
- Tuples: For fixed collections, dictionary keys, function returns
# List - mutable
my_list = [1, 2, 3]
my_list[0] = 10 # Works
my_list.append(4) # Works
print(my_list) # [10, 2, 3, 4]
# Tuple - immutable
my_tuple = (1, 2, 3)
# my_tuple[0] = 10 # Error: tuples are immutable
# my_tuple.append(4) # Error: no append method
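# Because tuples are immutable (and therefore hashable), they can serve as
# dictionary keys - a quick sketch of the use case mentioned above; lists cannot
coordinates = {(40.7, -74.0): "New York", (51.5, -0.1): "London"}
print(coordinates[(40.7, -74.0)]) # New York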
# Tuple unpacking
x, y, z = (1, 2, 3)
print(x, y, z) # 1 2 3
Rarity: Very Common | Difficulty: Easy
2. Explain list comprehension and give an example.
Answer: List comprehension provides a concise way to create lists based on existing iterables.
- Syntax: [expression for item in iterable if condition]
- Benefits: More readable, often faster than loops
# Traditional loop
squares = []
for i in range(10):
    squares.append(i ** 2)
# List comprehension
squares = [i ** 2 for i in range(10)]
print(squares) # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
# With condition
even_squares = [i ** 2 for i in range(10) if i % 2 == 0]
print(even_squares) # [0, 4, 16, 36, 64]
# Nested comprehension
matrix = [[i * j for j in range(3)] for i in range(3)]
print(matrix) # [[0, 0, 0], [0, 1, 2], [0, 2, 4]]
# Dictionary comprehension
squares_dict = {i: i ** 2 for i in range(5)}
print(squares_dict) # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
Rarity: Very Common | Difficulty: Easy
3. What are lambda functions and when would you use them?
Answer: Lambda functions are anonymous, single-expression functions.
- Syntax: lambda arguments: expression
- Use Cases: Short functions, callbacks, sorting, filtering
# Regular function
def square(x):
    return x ** 2
# Lambda function
square_lambda = lambda x: x ** 2
print(square_lambda(5)) # 25
# With map
numbers = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x ** 2, numbers))
print(squared) # [1, 4, 9, 16, 25]
# With filter
evens = list(filter(lambda x: x % 2 == 0, numbers))
print(evens) # [2, 4]
# Sorting with key
students = [('Alice', 85), ('Bob', 92), ('Charlie', 78)]
sorted_students = sorted(students, key=lambda x: x[1], reverse=True)
print(sorted_students) # [('Bob', 92), ('Alice', 85), ('Charlie', 78)]
Rarity: Very Common | Difficulty: Easy
4. Explain the difference between append() and extend() for lists.
Answer:
- append(): Adds a single element to the end of the list
- extend(): Adds multiple elements from an iterable to the end
# append - adds single element
list1 = [1, 2, 3]
list1.append(4)
print(list1) # [1, 2, 3, 4]
list1.append([5, 6])
print(list1) # [1, 2, 3, 4, [5, 6]] - list as single element
# extend - adds multiple elements
list2 = [1, 2, 3]
list2.extend([4, 5, 6])
print(list2) # [1, 2, 3, 4, 5, 6]
# Alternative to extend
list3 = [1, 2, 3]
list3 += [4, 5, 6]
print(list3) # [1, 2, 3, 4, 5, 6]
Rarity: Common | Difficulty: Easy
5. What are *args and **kwargs?
Answer: They allow functions to accept variable numbers of arguments.
- *args: Variable number of positional arguments (collected into a tuple)
- **kwargs: Variable number of keyword arguments (collected into a dictionary)
# *args - positional arguments
def sum_all(*args):
    return sum(args)
print(sum_all(1, 2, 3)) # 6
print(sum_all(1, 2, 3, 4, 5)) # 15
# **kwargs - keyword arguments
def print_info(**kwargs):
    for key, value in kwargs.items():
        print(f"{key}: {value}")
print_info(name="Alice", age=25, city="NYC")
# name: Alice
# age: 25
# city: NYC
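# The same symbols also unpack sequences and dicts at the call site
# (a small sketch reusing the functions defined above)
nums = [1, 2, 3]
print(sum_all(*nums)) # 6
info = {"name": "Bob", "city": "LA"}
print_info(**info)
# name: Bob
# city: LA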
# Combined
def flexible_function(*args, **kwargs):
    print("Positional:", args)
    print("Keyword:", kwargs)
flexible_function(1, 2, 3, name="Alice", age=25)
# Positional: (1, 2, 3)
# Keyword: {'name': 'Alice', 'age': 25}
Rarity: Common | Difficulty: Medium
Statistics & Probability (5 Questions)
6. What is the difference between mean, median, and mode?
Answer:
- Mean: Average of all values (sum / count)
- Median: Middle value when sorted
- Mode: Most frequently occurring value
- When to use:
- Mean: Normally distributed data
- Median: Skewed data or outliers present
- Mode: Categorical data
import numpy as np
from scipy import stats
data = [1, 2, 2, 3, 4, 5, 100]
# Mean - affected by outliers
mean = np.mean(data)
print(f"Mean: {mean}") # 16.71
# Median - robust to outliers
median = np.median(data)
print(f"Median: {median}") # 3
# Mode
mode = stats.mode(data, keepdims=True)
print(f"Mode: {mode.mode[0]}") # 2Rarity: Very Common Difficulty: Easy
7. Explain variance and standard deviation.
Answer:
- Variance: Average squared deviation from the mean
- Standard Deviation: Square root of variance (same units as data)
- Purpose: Measure spread/dispersion of data
import numpy as np
data = [2, 4, 4, 4, 5, 5, 7, 9]
# Variance
variance = np.var(data, ddof=1) # ddof=1 for sample variance
print(f"Variance: {variance}") # 4.57
# Standard deviation
std_dev = np.std(data, ddof=1)
print(f"Std Dev: {std_dev}") # 2.14
# Manual calculation
mean = np.mean(data)
variance_manual = sum((x - mean) ** 2 for x in data) / (len(data) - 1)
print(f"Manual Variance: {variance_manual}")Rarity: Very Common Difficulty: Easy
8. What is a p-value and how do you interpret it?
Answer: The p-value is the probability of obtaining results at least as extreme as observed, assuming the null hypothesis is true.
- Interpretation:
- p < 0.05: Reject null hypothesis (statistically significant)
- p ≥ 0.05: Fail to reject null hypothesis
- Note: p-value doesn't measure effect size or importance
from scipy import stats
# Example: Testing if a coin is fair
# Null hypothesis: coin is fair (p = 0.5)
# We got 65 heads out of 100 flips
observed_heads = 65
n_flips = 100
expected_proportion = 0.5
# Binomial test (binom_test was removed from recent SciPy versions; use binomtest)
result = stats.binomtest(observed_heads, n_flips, expected_proportion)
p_value = result.pvalue
print(f"P-value: {p_value:.4f}") # ~0.0035 (two-sided)
if p_value < 0.05:
    print("Reject null hypothesis - coin is likely biased")
else:
    print("Fail to reject null hypothesis - coin appears fair")
Rarity: Very Common | Difficulty: Medium
9. What is the Central Limit Theorem?
Answer: The Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as sample size increases, regardless of the population's distribution.
- Key Points:
- Works for any distribution (if sample size is large enough)
- Typically n ≥ 30 is considered sufficient
- Enables hypothesis testing and confidence intervals
import numpy as np
import matplotlib.pyplot as plt
# Population with non-normal distribution (exponential)
population = np.random.exponential(scale=2, size=10000)
# Take many samples and calculate their means
sample_means = []
for _ in range(1000):
    sample = np.random.choice(population, size=30)
    sample_means.append(np.mean(sample))
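# Optional visualization (uses the matplotlib import above): the histogram of
# sample_means should look roughly bell-shaped even though the population is exponential
plt.hist(sample_means, bins=30)
plt.title("Distribution of sample means")
plt.show()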
# Sample means are normally distributed (CLT)
print(f"Population mean: {np.mean(population):.2f}")
print(f"Mean of sample means: {np.mean(sample_means):.2f}")
print(f"Std of sample means: {np.std(sample_means):.2f}")Rarity: Common Difficulty: Medium
10. What is correlation vs causation?
Answer:
- Correlation: Statistical relationship between two variables
- Causation: One variable directly causes changes in another
- Key Point: Correlation does NOT imply causation
- Reasons:
- Confounding variables
- Reverse causation
- Coincidence
import numpy as np
import pandas as pd
# Example: Ice cream sales and drowning deaths are correlated
# But ice cream doesn't cause drowning (confounding variable: temperature)
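# A quick simulation of that confounder idea (all numbers are made up for illustration):
# temperature drives both variables, so they correlate with no causal link between them
np.random.seed(0)
temperature = np.random.uniform(15, 35, 200)
ice_cream_sales = 10 * temperature + np.random.normal(0, 20, 200)
drownings = 0.5 * temperature + np.random.normal(0, 2, 200)
print(f"Ice cream vs drownings r: {np.corrcoef(ice_cream_sales, drownings)[0, 1]:.2f}") # strongly positive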
# Correlation coefficient
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
correlation = np.corrcoef(x, y)[0, 1]
print(f"Correlation: {correlation:.2f}") # 0.82
# Pearson correlation
from scipy.stats import pearsonr
corr, p_value = pearsonr(x, y)
print(f"Pearson r: {corr:.2f}, p-value: {p_value:.3f}")Rarity: Very Common Difficulty: Easy
Data Manipulation with Pandas (5 Questions)
11. How do you read a CSV file and display basic information?
Answer: Use pandas to read and explore data.
import pandas as pd
# Read CSV
df = pd.read_csv('data.csv')
# Basic information
print(df.head()) # First 5 rows
print(df.tail()) # Last 5 rows
print(df.shape) # (rows, columns)
print(df.info()) # Data types and non-null counts
print(df.describe()) # Statistical summary
# Column names and types
print(df.columns)
print(df.dtypes)
# Check for missing values
print(df.isnull().sum())
# Specific columns
print(df[['column1', 'column2']].head())
Rarity: Very Common | Difficulty: Easy
12. How do you handle missing values in a DataFrame?
Answer: Multiple strategies for handling missing data:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})
# Check missing values
print(df.isnull().sum())
# Drop rows with any missing values
df_dropped = df.dropna()
# Drop columns with any missing values
df_dropped_cols = df.dropna(axis=1)
# Fill with specific value
df_filled = df.fillna(0)
# Fill with mean / median (on a copy, so the later examples still see the original NaNs)
df_stat = df.copy()
df_stat['A'] = df_stat['A'].fillna(df['A'].mean())
df_stat['B'] = df_stat['B'].fillna(df['B'].median())
# Forward fill (use previous value)
df_ffill = df.ffill()
# Backward fill (use next value)
df_bfill = df.bfill()
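# scikit-learn also offers imputation (a sketch, assuming scikit-learn is installed)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)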
# Interpolate
df_interpolated = df.interpolate()
Rarity: Very Common | Difficulty: Easy
13. How do you filter and select data in pandas?
Answer: Multiple ways to filter and select data:
import pandas as pd
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 75000, 55000],
    'department': ['IT', 'HR', 'IT', 'Finance']
})
# Select columns
print(df['name']) # Single column (Series)
print(df[['name', 'age']]) # Multiple columns (DataFrame)
# Filter rows
high_salary = df[df['salary'] > 55000]
print(high_salary)
# Multiple conditions
it_high_salary = df[(df['department'] == 'IT') & (df['salary'] > 50000)]
print(it_high_salary)
# Using .loc (label-based; slice end is inclusive)
print(df.loc[0:2, ['name', 'age']])
# Using .iloc (position-based; slice end is exclusive, like regular Python slicing)
print(df.iloc[0:2, 0:2])
# Query method
result = df.query('age > 28 and salary > 55000')
print(result)
# isin method
it_or_hr = df[df['department'].isin(['IT', 'HR'])]
print(it_or_hr)
Rarity: Very Common | Difficulty: Easy
14. How do you group and aggregate data?
Answer:
Use groupby() for aggregation operations:
import pandas as pd
df = pd.DataFrame({
    'department': ['IT', 'HR', 'IT', 'Finance', 'HR', 'IT'],
    'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'salary': [50000, 45000, 60000, 55000, 48000, 65000],
    'age': [25, 30, 35, 28, 32, 40]
})
# Group by single column
dept_avg_salary = df.groupby('department')['salary'].mean()
print(dept_avg_salary)
# Multiple aggregations
dept_stats = df.groupby('department').agg({
    'salary': ['mean', 'min', 'max'],
    'age': 'mean'
})
print(dept_stats)
# Custom aggregation
dept_custom = df.groupby('department').agg({
    'salary': lambda x: x.max() - x.min(),
    'employee': 'count'
})
print(dept_custom)
# Multiple group by columns
result = df.groupby(['department', 'age'])['salary'].sum()
print(result)
Rarity: Very Common | Difficulty: Medium
15. How do you merge or join DataFrames?
Answer:
Use merge(), join(), or concat():
import pandas as pd
# Sample DataFrames
df1 = pd.DataFrame({
    'employee_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
    'employee_id': [1, 2, 3, 5],
    'salary': [50000, 60000, 75000, 55000]
})
# Inner join (only matching rows)
inner = pd.merge(df1, df2, on='employee_id', how='inner')
print(inner)
# Left join (all rows from left)
left = pd.merge(df1, df2, on='employee_id', how='left')
print(left)
# Right join (all rows from right)
right = pd.merge(df1, df2, on='employee_id', how='right')
print(right)
# Outer join (all rows from both)
outer = pd.merge(df1, df2, on='employee_id', how='outer')
print(outer)
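# join() - the index-based alternative mentioned above (a quick sketch)
joined = df1.set_index('employee_id').join(df2.set_index('employee_id'))
print(joined)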
# Concatenate vertically (rows are stacked; non-matching columns are filled with NaN)
df3 = pd.concat([df1, df2], ignore_index=True)
print(df3)
# Concatenate horizontally
df4 = pd.concat([df1, df2], axis=1)
print(df4)
Rarity: Very Common | Difficulty: Medium
Machine Learning Basics (5 Questions)
16. What is the difference between supervised and unsupervised learning?
Answer:
- Supervised Learning:
- Has labeled training data (input-output pairs)
- Goal: Learn mapping from inputs to outputs
- Examples: Classification, Regression
- Algorithms: Linear Regression, Decision Trees, SVM
- Unsupervised Learning:
- No labeled data (only inputs)
- Goal: Find patterns or structure in data
- Examples: Clustering, Dimensionality Reduction
- Algorithms: K-Means, PCA, Hierarchical Clustering
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans
import numpy as np
# Supervised Learning - Linear Regression
X_train = np.array([[1], [2], [3], [4], [5]])
y_train = np.array([2, 4, 6, 8, 10])
model = LinearRegression()
model.fit(X_train, y_train)
prediction = model.predict([[6]])
print(f"Supervised prediction: {prediction[0]}") # 12
# Unsupervised Learning - K-Means Clustering
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
kmeans = KMeans(n_clusters=2, random_state=42)
clusters = kmeans.fit_predict(X)
print(f"Cluster assignments: {clusters}")Rarity: Very Common Difficulty: Easy
17. What is overfitting and how do you prevent it?
Answer: Overfitting occurs when a model learns training data too well, including noise, and performs poorly on new data.
- Signs:
- High training accuracy, low test accuracy
- Model too complex for the data
- Prevention:
- More training data
- Cross-validation
- Regularization (L1, L2)
- Simpler models
- Early stopping
- Dropout (neural networks)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
# Generate data
X = np.random.rand(100, 1) * 10
y = 2 * X + 3 + np.random.randn(100, 1) * 2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Overfitting example - high degree polynomial
poly = PolynomialFeatures(degree=15)
X_poly = poly.fit_transform(X_train)
# Regularization to prevent overfitting
# Ridge (L2 regularization)
ridge = Ridge(alpha=1.0)
ridge.fit(X_poly, y_train)
# Lasso (L1 regularization)
lasso = Lasso(alpha=0.1)
lasso.fit(X_poly, y_train)
print(f"Ridge score: {ridge.score(X_poly, y_train)}")
print(f"Lasso score: {lasso.score(X_poly, y_train)}")Rarity: Very Common Difficulty: Medium
18. Explain the train-test split and why it's important.
Answer: Train-test split divides data into training and testing sets to evaluate model performance on unseen data.
- Purpose: Prevent overfitting, estimate real-world performance
- Typical Split: 70-30 or 80-20 (train-test)
- Cross-Validation: More robust evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
# Train model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
# Evaluate
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"Training accuracy: {train_score:.2f}")
print(f"Test accuracy: {test_score:.2f}")
# Cross-validation (more robust)
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"CV scores: {cv_scores}")
print(f"Mean CV score: {cv_scores.mean():.2f}")Rarity: Very Common Difficulty: Easy
19. What evaluation metrics do you use for classification?
Answer: Different metrics for different scenarios:
- Accuracy: Overall correctness (good for balanced datasets)
- Precision: Of predicted positives, how many are correct
- Recall: Of actual positives, how many were found
- F1-Score: Harmonic mean of precision and recall
- Confusion Matrix: Detailed breakdown of predictions
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)
# Train model
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(f"\nConfusion Matrix:\n{cm}")
# Classification report
print(f"\n{classification_report(y_test, y_pred)}")Rarity: Very Common Difficulty: Medium
20. What is the difference between classification and regression?
Answer:
- Classification:
- Predicts discrete categories/classes
- Output: Class label
- Examples: Spam detection, image classification
- Algorithms: Logistic Regression, Decision Trees, SVM
- Metrics: Accuracy, Precision, Recall, F1
- Regression:
- Predicts continuous numerical values
- Output: Number
- Examples: House price prediction, temperature forecasting
- Algorithms: Linear Regression, Random Forest Regressor
- Metrics: MSE, RMSE, MAE, R²
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Regression example
X_reg = np.array([[1], [2], [3], [4], [5]])
y_reg = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
reg_model = LinearRegression()
reg_model.fit(X_reg, y_reg)
y_pred_reg = reg_model.predict([[6]])
print(f"Regression prediction: {y_pred_reg[0]:.2f}") # Continuous value
# Classification example
X_clf = np.array([[1], [2], [3], [4], [5]])
y_clf = np.array([0, 0, 1, 1, 1]) # Binary classes
clf_model = LogisticRegression()
clf_model.fit(X_clf, y_clf)
y_pred_clf = clf_model.predict([[3.5]])
print(f"Classification prediction: {y_pred_clf[0]}") # Class label (0 or 1)Rarity: Very Common Difficulty: Easy




