November 25, 2025
20 min read

Junior Site Reliability Engineer Interview Questions: Complete Guide

interview
career-advice
job-search
entry-level
Milad Bonakdar


Master essential SRE fundamentals with comprehensive interview questions covering monitoring, incident response, SLOs, automation, Linux troubleshooting, and reliability practices for junior SRE roles.


Introduction

Site Reliability Engineering (SRE) combines software engineering and systems administration to build and run large-scale, reliable systems. As a junior SRE, you'll focus on monitoring, incident response, automation, and learning reliability practices that keep services running smoothly.

This guide covers essential interview questions for junior SREs, organized by topic to help you prepare effectively. Each question includes detailed answers, practical examples, and hands-on scenarios.


SRE Fundamentals

1. What is Site Reliability Engineering and how does it differ from DevOps?

Answer: SRE is Google's approach to running production systems reliably at scale.

Key Principles:

  • Treat operations as a software problem
  • Maximum 50% time on operational work (toil)
  • Error budgets for balancing reliability and velocity
  • Blameless postmortems
  • Gradual rollouts and automated rollbacks

SRE vs DevOps:

| Aspect   | SRE                               | DevOps                          |
|----------|-----------------------------------|---------------------------------|
| Focus    | Reliability and scalability       | Collaboration and automation    |
| Metrics  | SLIs, SLOs, error budgets         | Deployment frequency, lead time |
| Approach | Prescriptive (specific practices) | Philosophy (cultural movement)  |
| Toil     | Explicitly limited to 50%         | Not specifically defined        |

SRE implements DevOps principles with specific practices and metrics.

Rarity: Very Common
Difficulty: Easy


2. Explain SLIs, SLOs, and error budgets.

Answer: These are core SRE concepts for measuring and managing reliability:

SLI (Service Level Indicator):

  • Quantitative measure of service level
  • Examples: Latency, availability, error rate

SLO (Service Level Objective):

  • Target value for an SLI
  • Example: "99.9% of requests succeed"

Error Budget:

  • Allowed failure rate (100% - SLO)
  • Used to balance reliability vs feature velocity

# Example SLI/SLO calculation
def calculate_error_budget(total_requests, failed_requests, slo_target=0.999):
    """
    Calculate error budget consumption
    
    SLO: 99.9% success rate
    Error Budget: 0.1% allowed failures
    """
    success_rate = (total_requests - failed_requests) / total_requests
    error_rate = failed_requests / total_requests
    
    # Error budget: how much of allowed 0.1% we've used
    allowed_errors = total_requests * (1 - slo_target)
    budget_consumed = (failed_requests / allowed_errors) * 100
    
    return {
        'success_rate': success_rate,
        'error_rate': error_rate,
        'slo_target': slo_target,
        'slo_met': success_rate >= slo_target,
        'error_budget_consumed': budget_consumed,
        'remaining_budget': max(0, 100 - budget_consumed)
    }

# Example
result = calculate_error_budget(
    total_requests=1000000,
    failed_requests=500,  # 0.05% error rate
    slo_target=0.999
)

print(f"Success Rate: {result['success_rate']:.4%}")
print(f"SLO Met: {result['slo_met']}")
print(f"Error Budget Consumed: {result['error_budget_consumed']:.1f}%")

Rarity: Very Common
Difficulty: Medium


3. What is toil and how do you reduce it?

Answer: Toil is repetitive, manual operational work that:

  • Is manual (requires human action)
  • Is repetitive
  • Can be automated
  • Has no enduring value
  • Grows linearly with service growth

Examples of toil:

  • Manually restarting services
  • Copying files between servers
  • Manually scaling resources
  • Repetitive ticket responses

Toil reduction strategies:

#!/bin/bash
# auto-restart-service.sh - Example: automate service restarts

SERVICE_NAME="myapp"
MAX_RETRIES=3
RETRY_DELAY=5

check_service() {
    systemctl is-active --quiet $SERVICE_NAME
    return $?
}

restart_service() {
    echo "$(date): Restarting $SERVICE_NAME"
    systemctl restart $SERVICE_NAME
    
    # Wait for service to stabilize
    sleep $RETRY_DELAY
    
    if check_service; then
        echo "$(date): $SERVICE_NAME restarted successfully"
        # Send notification
        curl -X POST https://alerts.company.com/webhook \
          -d "{\"message\": \"$SERVICE_NAME auto-restarted\"}"
        return 0
    else
        return 1
    fi
}

# Main logic
if ! check_service; then
    echo "$(date): $SERVICE_NAME is down"
    
    for i in $(seq 1 $MAX_RETRIES); do
        echo "Attempt $i of $MAX_RETRIES"
        if restart_service; then
            exit 0
        fi
        sleep $RETRY_DELAY
    done
    
    # All retries failed - escalate
    echo "$(date): Failed to restart $SERVICE_NAME after $MAX_RETRIES attempts"
    curl -X POST https://pagerduty.com/api/incidents \
      -d "{\"service\": \"$SERVICE_NAME\", \"severity\": \"critical\"}"
    exit 1
fi

SRE Goal: Keep toil below 50% of time, automate the rest.

Rarity: Very Common
Difficulty: Easy-Medium


Monitoring & Observability

4. What's the difference between monitoring and observability?

Answer: Monitoring: Collecting predefined metrics and alerts

  • Known-unknowns: You know what to watch for
  • Dashboards, alerts, metrics
  • Examples: CPU, memory, request rate

Observability: Understanding system state from outputs

  • Unknown-unknowns: Debug issues you didn't anticipate
  • Logs, metrics, traces combined
  • Can answer arbitrary questions

Three Pillars of Observability:

  1. Metrics: Aggregated numbers (CPU, latency)
  2. Logs: Discrete events
  3. Traces: Request flow through system

Example: Prometheus + Grafana + Loki

# Prometheus scrape config
scrape_configs:
  - job_name: 'api-server'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
    scrape_interval: 15s
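The three pillars become most useful when they can be correlated. A minimal sketch, assuming a JSON-logs setup, of attaching one trace ID to every log line a request produces so logs can later be joined with traces (the field names here are illustrative, not a standard):

```python
import json
import time
import uuid

def structured_log(level, message, trace_id, **fields):
    """Emit one JSON log line; the shared trace_id is what lets a log
    aggregator join these lines with the distributed trace."""
    record = {
        "ts": time.time(),
        "level": level,
        "message": message,
        "trace_id": trace_id,
        **fields,
    }
    print(json.dumps(record))
    return record

# One trace ID per request, reused by every log line that request produces
trace_id = uuid.uuid4().hex
structured_log("info", "request received", trace_id, path="/api/orders")
structured_log("warning", "cache miss", trace_id, key="orders:123")
```

Searching the aggregator for that trace_id returns every log line the request produced, which is the "answer arbitrary questions" property observability promises.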

Rarity: Common
Difficulty: Medium


5. How do you set up effective alerts?

Answer: Good alerts are actionable, meaningful, and don't cause fatigue.

Alert Best Practices:

1. Alert on symptoms, not causes:

# Bad: Alert on high CPU
- alert: HighCPU
  expr: cpu_usage > 80
  
# Good: Alert on user impact
- alert: HighLatency
  expr: http_request_duration_seconds{quantile="0.95"} > 1
  for: 5m
  annotations:
    summary: "API latency is high"
    description: "95th percentile latency is {{ $value }}s"

2. Include runbook links:

- alert: DatabaseConnectionPoolExhausted
  expr: db_connection_pool_active / db_connection_pool_max > 0.9
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Database connection pool nearly exhausted"
    runbook: "https://wiki.company.com/runbooks/db-connections"

3. Use appropriate severity levels:

# Page-worthy (wakes someone up)
- alert: ServiceDown
  expr: up{job="api"} == 0
  for: 1m
  labels:
    severity: critical
    
# Ticket-worthy (handle during business hours)
- alert: DiskSpaceWarning
  expr: disk_free_percent < 20
  for: 30m
  labels:
    severity: warning

4. Avoid alert fatigue:

  • Use for: duration to avoid flapping
  • Group related alerts
  • Set appropriate thresholds
  • Review and tune regularly
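The effect of the for: duration above can be sketched in a few lines: fire only after the condition has held continuously for a full window, so a transient spike never pages anyone (a toy model of Prometheus's behavior, not its implementation):

```python
import time

class DebouncedAlert:
    """Fire only after a condition has held continuously for hold_seconds,
    mimicking the `for:` clause that suppresses flapping alerts."""

    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds
        self.breach_started = None  # when the condition first became true

    def evaluate(self, condition_true, now=None):
        now = time.time() if now is None else now
        if not condition_true:
            self.breach_started = None  # condition cleared; timer resets
            return False
        if self.breach_started is None:
            self.breach_started = now
        return now - self.breach_started >= self.hold_seconds

# A short spike does not fire a 5-minute (300s) alert
alert = DebouncedAlert(hold_seconds=300)
alert.evaluate(True, now=0)                  # breach starts
assert alert.evaluate(True, now=2) is False  # held for only 2s: no page
alert.evaluate(False, now=3)                 # condition clears, timer resets
```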

Rarity: Very Common
Difficulty: Medium


Incident Response

6. Walk through your incident response process.

Answer: Structured incident response minimizes impact and recovery time:

Incident Response Steps:


1. Detection:

  • Alert fires or user reports issue
  • Acknowledge alert
  • Create incident channel

2. Triage:

# Quick assessment checklist
- What's the user impact?
- How many users affected?
- What services are impacted?
- Is it getting worse?

3. Mitigation:

# Common mitigation strategies
- Rollback recent deployment
- Scale up resources
- Disable problematic feature
- Failover to backup system
- Rate limit traffic

4. Resolution:

  • Fix root cause
  • Verify metrics return to normal
  • Monitor for recurrence

5. Postmortem (Blameless):

# Incident Postmortem Template

## Summary
Brief description of what happened

## Impact
- Duration: 2024-11-25 10:00 - 10:45 UTC (45 minutes)
- Users affected: ~10,000 (5% of total)
- Services impacted: API, Web Frontend

## Root Cause
Database connection pool exhausted due to slow queries

## Timeline
- 10:00: Alert fired for high API latency
- 10:05: On-call engineer acknowledged
- 10:10: Identified database as bottleneck
- 10:15: Increased connection pool size (mitigation)
- 10:30: Identified and killed slow queries
- 10:45: Service fully recovered

## Resolution
- Immediate: Increased connection pool from 100 to 200
- Short-term: Added query timeout (30s)
- Long-term: Optimize slow queries, add query monitoring

## Action Items
- [ ] Add alert for slow queries (Owner: Alice, Due: 2024-12-01)
- [ ] Implement query timeout in application (Owner: Bob, Due: 2024-12-05)
- [ ] Review and optimize top 10 slowest queries (Owner: Charlie, Due: 2024-12-10)

## Lessons Learned
- Connection pool monitoring was insufficient
- Query performance degradation went unnoticed

Rarity: Very Common
Difficulty: Medium


7. How do you troubleshoot a service that's experiencing high latency?

Answer: Systematic debugging approach:

# 1. Verify the problem
curl -w "@curl-format.txt" -o /dev/null -s https://api.example.com/health

# curl-format.txt:
#     time_namelookup:  %{time_namelookup}s\n
#        time_connect:  %{time_connect}s\n
#     time_appconnect:  %{time_appconnect}s\n
#    time_pretransfer:  %{time_pretransfer}s\n
#       time_redirect:  %{time_redirect}s\n
#  time_starttransfer:  %{time_starttransfer}s\n
#                     ----------\n
#          time_total:  %{time_total}s\n

# 2. Check application metrics
# - Request rate (sudden spike?)
# - Error rate (errors causing retries?)
# - Resource usage (CPU, memory)

# 3. Check dependencies
# - Database query time
# - External API calls
# - Cache hit rate

# 4. Check infrastructure
top  # CPU usage
free -h  # Memory
iostat  # Disk I/O
netstat -s  # Network stats

# 5. Check logs for errors
tail -f /var/log/app/error.log | grep -i "timeout\|slow\|error"

# 6. Profile the application
# Python example
import cProfile
cProfile.run('my_function()')

# 7. Check database
# Slow query log
SELECT * FROM mysql.slow_log ORDER BY query_time DESC LIMIT 10;

# Active queries
SELECT pid, query, state, query_start 
FROM pg_stat_activity 
WHERE state = 'active';

Common causes:

  • Database slow queries
  • External API timeouts
  • Memory pressure (GC pauses)
  • Network issues
  • Resource exhaustion
  • Inefficient code paths
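Since most of these checks revolve around percentile latency, it helps to be precise about what "p95" means. A small sketch computing a nearest-rank percentile from raw latency samples (the sample values are made up):

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # 1-indexed rank, rounded up (ceil without importing math)
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[rank - 1]

# 90 fast requests plus 10 slow outliers
latencies_ms = [50] * 90 + [900, 920, 940, 960, 980,
                            1000, 1050, 1100, 1150, 1200]
print(percentile(latencies_ms, 50))  # 50: the median hides the problem
print(percentile(latencies_ms, 95))  # 980: the tail exposes the outliers
```

This is why SLOs are usually written against p95/p99 rather than the average: a small fraction of very slow requests barely moves the mean but dominates the tail.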

Rarity: Very Common
Difficulty: Medium


Automation & Scripting

8. Write a script to check if a service is healthy and restart it if needed.

Answer: Health check and auto-remediation script:

#!/usr/bin/env python3
"""
Service health checker with auto-restart capability
"""
import requests
import subprocess
import time
import sys
from datetime import datetime

class ServiceMonitor:
    def __init__(self, service_name, health_url, max_retries=3):
        self.service_name = service_name
        self.health_url = health_url
        self.max_retries = max_retries
    
    def check_health(self):
        """Check if service is healthy"""
        try:
            response = requests.get(
                self.health_url,
                timeout=5
            )
            return response.status_code == 200
        except requests.exceptions.RequestException as e:
            print(f"{datetime.now()}: Health check failed: {e}")
            return False
    
    def restart_service(self):
        """Restart the service using systemctl"""
        try:
            print(f"{datetime.now()}: Restarting {self.service_name}")
            subprocess.run(
                ['systemctl', 'restart', self.service_name],
                check=True,
                capture_output=True
            )
            time.sleep(10)  # Wait for service to start
            return True
        except subprocess.CalledProcessError as e:
            print(f"{datetime.now()}: Restart failed: {e.stderr}")
            return False
    
    def send_alert(self, message, severity='warning'):
        """Send alert to monitoring system"""
        try:
            requests.post(
                'https://alerts.company.com/webhook',
                json={
                    'service': self.service_name,
                    'message': message,
                    'severity': severity,
                    'timestamp': datetime.now().isoformat()
                },
                timeout=5
            )
        except Exception as e:
            print(f"Failed to send alert: {e}")
    
    def monitor(self):
        """Main monitoring loop"""
        if self.check_health():
            print(f"{datetime.now()}: {self.service_name} is healthy")
            return 0
        
        print(f"{datetime.now()}: {self.service_name} is unhealthy")
        self.send_alert(f"{self.service_name} is down", severity='warning')
        
        # Attempt restart
        for attempt in range(1, self.max_retries + 1):
            print(f"Restart attempt {attempt}/{self.max_retries}")
            
            if self.restart_service() and self.check_health():
                print(f"{datetime.now()}: Service recovered")
                self.send_alert(
                    f"{self.service_name} auto-recovered after restart",
                    severity='info'
                )
                return 0
            
            time.sleep(5)
        
        # All retries failed
        print(f"{datetime.now()}: Failed to recover service")
        self.send_alert(
            f"{self.service_name} failed to recover after {self.max_retries} attempts",
            severity='critical'
        )
        return 1

if __name__ == '__main__':
    monitor = ServiceMonitor(
        service_name='myapp',
        health_url='http://localhost:8080/health'
    )
    sys.exit(monitor.monitor())

Rarity: Common
Difficulty: Medium


Reliability Practices

9. What is a runbook and why is it important?

Answer: A runbook is a documented procedure for handling operational tasks and incidents.

Runbook Structure:

# Runbook: High API Latency

## Symptoms
- API 95th percentile latency > 1 second
- User complaints about slow page loads
- Alert: "HighAPILatency" firing

## Severity
**Warning** - Degrades user experience but service is functional

## Investigation Steps

### 1. Check current metrics
# Check latency distribution
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))'

# Check request rate
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=rate(http_requests_total[5m])'

2. Identify bottleneck

  • Check database query time
  • Check external API calls
  • Check cache hit rate
  • Review recent deployments

3. Check dependencies

# Database connections
mysql -e "SHOW PROCESSLIST;"

# Redis latency
redis-cli --latency

# External APIs
curl -w "%{time_total}\n" -o /dev/null -s https://external-api.com/health

Mitigation Steps

Quick fixes (< 5 minutes)

  1. Scale up application instances
     kubectl scale deployment api --replicas=10
  2. Relax the cache eviction policy so hot keys stay resident
     redis-cli CONFIG SET maxmemory-policy allkeys-lru

If issue persists

  1. Rollback recent deployment
     kubectl rollout undo deployment/api
  2. Enable rate limiting
     kubectl apply -f rate-limit-config.yaml

Resolution

  • Fix root cause (slow query, inefficient code)
  • Deploy fix
  • Monitor for 30 minutes
  • Scale back to normal capacity

Escalation

If unable to resolve within 30 minutes:

  • Escalate to: @backend-team
  • Slack channel: #incidents
  • On-call: Use PagerDuty escalation policy

Why runbooks matter:

  • Faster incident response
  • Consistent procedures
  • Knowledge sharing
  • Reduced stress during incidents
  • Training tool for new team members

Rarity: Common
Difficulty: Easy


10. Explain the concept of graceful degradation.

Answer: Graceful degradation means a system continues to operate at reduced capacity when components fail, rather than failing completely.

Strategies:

1. Feature Flags:

# Disable non-critical features during high load
class FeatureFlags:
    def __init__(self):
        self.flags = {
            'recommendations': True,
            'analytics': True,
            'search_autocomplete': True
        }
    
    def is_enabled(self, feature):
        # Disable non-critical features if error budget low
        if self.error_budget_low():
            non_critical = ['analytics', 'search_autocomplete']
            if feature in non_critical:
                return False
        return self.flags.get(feature, False)
    
    def error_budget_low(self):
        # Check if error budget < 20%; get_error_budget() is assumed to
        # query your SLO/monitoring system
        return get_error_budget() < 0.2

# Usage
flags = FeatureFlags()
if flags.is_enabled('recommendations'):
    show_recommendations()
else:
    # Gracefully degrade - show static content
    show_popular_items()

2. Circuit Breaker:

from enum import Enum
import time

class CircuitState(Enum):
    CLOSED = "closed"  # Normal operation
    OPEN = "open"      # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing recovery

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failures = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED
    
    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                # Return fallback instead of calling failing service
                return self.fallback()
        
        try:
            result = func(*args, **kwargs)
            self.on_success()
            return result
        except Exception as e:
            self.on_failure()
            return self.fallback()
    
    def on_success(self):
        self.failures = 0
        self.state = CircuitState.CLOSED
    
    def on_failure(self):
        self.failures += 1
        self.last_failure_time = time.time()
        if self.failures >= self.failure_threshold:
            self.state = CircuitState.OPEN
    
    def fallback(self):
        # Return cached data or default response
        return {"status": "degraded", "data": []}

# Usage
breaker = CircuitBreaker()
result = breaker.call(external_api_call, user_id=123)

3. Timeouts and Retries:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_resilient_session():
    session = requests.Session()
    
    # Retry strategy
    retry = Retry(
        total=3,
        backoff_factor=0.3,
        status_forcelist=[500, 502, 503, 504]
    )
    
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    
    return session

# Usage with timeout
session = create_resilient_session()
try:
    response = session.get('https://api.example.com/data', timeout=5)
except requests.exceptions.Timeout:
    # Gracefully degrade - fall back to cached data
    # (get_cached_data is an application-specific cache lookup)
    response = get_cached_data()

Rarity: Common
Difficulty: Medium


Containerization Basics

11. What is Docker and how does it differ from virtual machines?

Answer: Docker is a containerization platform that packages applications with their dependencies.

Containers vs Virtual Machines:


Key Differences:

| Feature        | Containers     | Virtual Machines   |
|----------------|----------------|--------------------|
| Startup Time   | Seconds        | Minutes            |
| Size           | MBs            | GBs                |
| Resource Usage | Lightweight    | Heavy              |
| Isolation      | Process-level  | Hardware-level     |
| OS             | Shares host OS | Separate OS per VM |

Docker Basics:

# Pull an image
docker pull nginx:latest

# Run a container
docker run -d \
  --name my-nginx \
  -p 8080:80 \
  nginx:latest

# List running containers
docker ps

# View logs
docker logs my-nginx

# Execute command in container
docker exec -it my-nginx bash

# Stop container
docker stop my-nginx

# Remove container
docker rm my-nginx

Dockerfile Example:

# Base image
FROM python:3.9-slim

# Set working directory
WORKDIR /app

# Copy requirements
COPY requirements.txt .

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Expose port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=3s \
  CMD curl -f http://localhost:8000/health || exit 1

# Run application
CMD ["python", "app.py"]

Build and Run:

# Build image
docker build -t myapp:1.0 .

# Run with environment variables
docker run -d \
  --name myapp \
  -p 8000:8000 \
  -e DATABASE_URL=postgres://db:5432/mydb \
  -e LOG_LEVEL=info \
  myapp:1.0

# View resource usage
docker stats myapp

Docker Compose (Multi-container):

version: '3.8'

services:
  web:
    build: .
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql://db:5432/mydb
    depends_on:
      - db
      - redis
    restart: unless-stopped
  
  db:
    image: postgres:14
    environment:
      - POSTGRES_DB=mydb
      - POSTGRES_PASSWORD=secret
    volumes:
      - postgres_data:/var/lib/postgresql/data
  
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

volumes:
  postgres_data:

Run with Docker Compose:

# Start all services
docker-compose up -d

# View logs
docker-compose logs -f web

# Scale service
docker-compose up -d --scale web=3

# Stop all services
docker-compose down

Best Practices:

  • Use official base images
  • Minimize layer count
  • Don't run as root
  • Use .dockerignore
  • Tag images properly
  • Scan for vulnerabilities

Rarity: Very Common
Difficulty: Easy-Medium


Version Control & Deployment

12. Explain Git workflows and how you handle deployments.

Answer: Git is essential for version control and deployment automation.

Common Git Workflow:


Basic Git Commands:

# Clone repository
git clone https://github.com/company/repo.git
cd repo

# Create feature branch
git checkout -b feature/add-monitoring

# Make changes and commit
git add .
git commit -m "Add Prometheus monitoring"

# Push to remote
git push origin feature/add-monitoring

# Update from main
git checkout main
git pull origin main
git checkout feature/add-monitoring
git rebase main

# Merge feature (after PR approval)
git checkout main
git merge feature/add-monitoring
git push origin main

Branching Strategy:

1. Gitflow:

  • main: Production-ready code
  • develop: Integration branch
  • feature/*: New features
  • release/*: Release preparation
  • hotfix/*: Emergency fixes

2. Trunk-Based Development:

# Short-lived feature branches
git checkout -b feature/quick-fix
# Work for < 1 day
git push origin feature/quick-fix
# Merge to main immediately after review

Deployment Workflow:

1. CI/CD Pipeline (GitHub Actions):

name: Deploy to Production

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Run tests
        run: |
          npm install
          npm test
      
      - name: Run linting
        run: npm run lint
  
  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Build Docker image
        run: |
          docker build -t myapp:${{ github.sha }} .
          docker tag myapp:${{ github.sha }} myapp:latest
      
      - name: Push to registry
        run: |
          echo ${{ secrets.DOCKER_PASSWORD }} | docker login -u ${{ secrets.DOCKER_USERNAME }} --password-stdin
          docker push myapp:${{ github.sha }}
          docker push myapp:latest
  
  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to production
        run: |
          kubectl set image deployment/myapp \
            myapp=myapp:${{ github.sha }}
          kubectl rollout status deployment/myapp

2. Deployment Script:

#!/bin/bash
# deploy.sh - Simple deployment script

set -e  # Exit on error

# Configuration
APP_NAME="myapp"
VERSION=$(git rev-parse --short HEAD)
ENVIRONMENT=${1:-staging}

echo "Deploying $APP_NAME version $VERSION to $ENVIRONMENT"

# Run tests
echo "Running tests..."
npm test

# Build application
echo "Building application..."
npm run build

# Build Docker image
echo "Building Docker image..."
docker build -t $APP_NAME:$VERSION .

# Push to registry
echo "Pushing to registry..."
docker tag $APP_NAME:$VERSION registry.company.com/$APP_NAME:$VERSION
docker push registry.company.com/$APP_NAME:$VERSION

# Deploy to Kubernetes
echo "Deploying to Kubernetes..."
kubectl set image deployment/$APP_NAME \
  $APP_NAME=registry.company.com/$APP_NAME:$VERSION \
  --namespace=$ENVIRONMENT

# Wait for rollout
kubectl rollout status deployment/$APP_NAME \
  --namespace=$ENVIRONMENT \
  --timeout=5m

echo "Deployment complete!"

# Run smoke tests
echo "Running smoke tests..."
curl -f https://$ENVIRONMENT.company.com/health || {
  echo "Health check failed! Rolling back..."
  kubectl rollout undo deployment/$APP_NAME --namespace=$ENVIRONMENT
  exit 1
}

echo "Deployment successful!"

3. Rollback Procedure:

# View deployment history
kubectl rollout history deployment/myapp

# Rollback to previous version
kubectl rollout undo deployment/myapp

# Rollback to specific revision
kubectl rollout undo deployment/myapp --to-revision=3

# Check rollback status
kubectl rollout status deployment/myapp

Deployment Best Practices:

  • Always test before deploying
  • Use feature flags for risky changes
  • Deploy during low-traffic periods
  • Have rollback plan ready
  • Monitor after deployment
  • Use blue-green or canary deployments
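The canary bullet above boils down to one decision: does the canary's error rate look significantly worse than the stable fleet's? A toy sketch of that comparison (the thresholds are illustrative; real canary analysis uses more signals than a single error ratio):

```python
def canary_healthy(canary_errors, canary_total, stable_errors, stable_total,
                   max_ratio=2.0, min_requests=100):
    """Return True when it is safe to keep promoting the canary:
    its error rate must not exceed max_ratio times the stable rate."""
    if canary_total < min_requests:
        return False  # not enough traffic on the canary to judge yet
    canary_rate = canary_errors / canary_total
    stable_rate = stable_errors / stable_total
    if stable_rate == 0:
        return canary_rate == 0  # stable fleet is perfect; canary must be too
    return canary_rate <= stable_rate * max_ratio

# Stable fleet runs at 0.1% errors. A canary at 0.15% passes;
# a canary at 1% fails and should be rolled back.
print(canary_healthy(15, 10000, 100, 100000))   # True
print(canary_healthy(100, 10000, 100, 100000))  # False
```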

Rarity: Very Common
Difficulty: Medium


Linux Fundamentals

13. How do you troubleshoot a Linux server that's running slow?

Answer: Systematic approach to diagnosing performance issues:

Troubleshooting Checklist:

1. Check System Load:

# View current load average
uptime
# Output: 15:30:45 up 10 days, 3:25, 2 users, load average: 2.5, 2.0, 1.8
# Load average: 1min, 5min, 15min

# View detailed system stats
top
# Press '1' to see per-CPU stats
# Press 'M' to sort by memory
# Press 'P' to sort by CPU

# Better alternative to top
htop
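Load average is only meaningful relative to core count: 2.5 is heavy on a 2-core box and negligible on 32 cores. A small Python sketch of that comparison using only the standard library (the 1.0 per-core threshold is a common rule of thumb, not a hard limit):

```python
import os

def load_status(threshold=1.0):
    """Compare the 1-minute load average to the number of CPU cores.
    A sustained per-core load above `threshold` suggests the run queue
    is backing up (CPU contention, or tasks stuck in uninterruptible I/O)."""
    load_1m, load_5m, load_15m = os.getloadavg()
    cores = os.cpu_count() or 1
    per_core = load_1m / cores
    status = "overloaded" if per_core > threshold else "ok"
    return {"load_1m": load_1m, "cores": cores,
            "per_core": round(per_core, 2), "status": status}

print(load_status())
```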

2. Check CPU Usage:

# CPU usage per process
ps aux --sort=-%cpu | head -10

# Real-time CPU monitoring
mpstat 1 5  # 5 samples, 1 second apart

# CPU usage by core
lscpu

3. Check Memory:

# Memory usage
free -h
#               total        used        free      shared  buff/cache   available
# Mem:           15Gi       8.2Gi       2.1Gi       324Mi       5.0Gi       6.5Gi
# Swap:         2.0Gi       512Mi       1.5Gi

# Memory usage per process
ps aux --sort=-%mem | head -10

# Detailed memory info
cat /proc/meminfo

# Check for OOM (Out of Memory) kills
dmesg | grep -i "out of memory"
journalctl -k | grep -i "killed process"

4. Check Disk I/O:

# Disk usage
df -h

# Inode usage (can run out even with disk space)
df -i

# Disk I/O statistics
iostat -x 1 5

# Find large files
du -sh /* | sort -rh | head -10

# Find files using most disk I/O
iotop

5. Check Network:

# Network connections
netstat -tuln  # Listening ports
netstat -tun   # Established connections

# Network statistics
netstat -s

# Network interface stats
ifconfig
ip addr show

# Bandwidth usage
iftop

# Check DNS resolution
nslookup google.com
dig google.com

6. Check Running Processes:

# All processes
ps aux

# Process tree
pstree

# Find process by name
pgrep -a nginx

# Kill process
kill -9 <PID>
pkill -9 nginx

# Check for zombie processes (state column is "Z")
ps aux | awk '$8 ~ /Z/'

7. Check Logs:

# System logs
tail -f /var/log/syslog
journalctl -f  # Follow systemd logs

# Application logs
tail -f /var/log/nginx/error.log

# Search for errors
grep -i error /var/log/syslog | tail -20

# Check auth logs
tail -f /var/log/auth.log

Common Issues and Solutions:

High CPU:

# Find CPU-intensive process
top -o %CPU

# If it's a runaway process
kill -15 <PID>  # Graceful
kill -9 <PID>   # Force kill

# If it's legitimate high load
# - Scale horizontally (add servers)
# - Optimize application code
# - Add caching

High Memory:

# Clear cache (safe, kernel will rebuild)
sync
echo 3 > /proc/sys/vm/drop_caches

# Find memory leaks
valgrind --leak-check=full ./myapp

# Restart memory-hungry service
systemctl restart myapp

Disk Full:

# Find large files
find / -type f -size +100M -exec ls -lh {} \;

# Clean up logs
journalctl --vacuum-time=7d  # Keep last 7 days
find /var/log -name "*.log" -mtime +30 -delete

# Clean package cache (Ubuntu/Debian)
apt-get clean
apt-get autoclean

Slow Network:

# Test connectivity
ping -c 5 google.com

# Test bandwidth
iperf3 -c server.example.com

# Trace route
traceroute google.com

# Check firewall
iptables -L -n

Quick Diagnostic Script:

#!/bin/bash
# quick-diag.sh - Quick system diagnostics

echo "=== System Load ==="
uptime

echo -e "\n=== CPU Usage ==="
top -bn1 | head -20

echo -e "\n=== Memory Usage ==="
free -h

echo -e "\n=== Disk Usage ==="
df -h

echo -e "\n=== Top CPU Processes ==="
ps aux --sort=-%cpu | head -6

echo -e "\n=== Top Memory Processes ==="
ps aux --sort=-%mem | head -6

echo -e "\n=== Listening Sockets ==="
netstat -tuln | tail -n +3 | wc -l

echo -e "\n=== Recent Errors ==="
journalctl -p err -n 10 --no-pager

Rarity: Very Common
Difficulty: Medium


Conclusion

Preparing for a junior SRE interview requires understanding reliability principles, hands-on troubleshooting skills, and automation mindset. Focus on:

  1. SRE Fundamentals: SLIs, SLOs, error budgets, toil reduction
  2. Monitoring: Setting up effective alerts and observability
  3. Incident Response: Structured approach to handling outages
  4. Automation: Scripting to reduce toil and improve reliability
  5. Reliability Practices: Runbooks, graceful degradation, resilience patterns

Practice these concepts in real environments, participate in on-call rotations if possible, and always think about how to make systems more reliable and automated. Good luck!
