November 25, 2025
20 min read

Junior Site Reliability Engineer Interview Questions: Complete Guide

interview
career-advice
job-search
entry-level
Milad Bonakdar


Master essential SRE fundamentals with comprehensive interview questions covering monitoring, incident response, SLOs, automation, Linux troubleshooting, and reliability practices for junior SRE roles.


Introduction

Site Reliability Engineering (SRE) combines software engineering and systems administration to build and run large-scale, reliable systems. As a junior SRE, you'll focus on monitoring, incident response, automation, and learning reliability practices that keep services running smoothly.

This guide covers essential interview questions for junior SREs, organized by topic to help you prepare effectively. Each question includes detailed answers, practical examples, and hands-on scenarios.


SRE Fundamentals

1. What is Site Reliability Engineering and how does it differ from DevOps?

Answer: SRE is Google's approach to running production systems reliably at scale.

Key Principles:

  • Treat operations as a software problem
  • Maximum 50% time on operational work (toil)
  • Error budgets for balancing reliability and velocity
  • Blameless postmortems
  • Gradual rollouts and automated rollbacks

SRE vs DevOps:

| Aspect   | SRE                               | DevOps                          |
|----------|-----------------------------------|---------------------------------|
| Focus    | Reliability and scalability       | Collaboration and automation    |
| Metrics  | SLIs, SLOs, error budgets         | Deployment frequency, lead time |
| Approach | Prescriptive (specific practices) | Philosophy (cultural movement)  |
| Toil     | Explicitly limited to 50%         | Not specifically defined        |

SRE implements DevOps principles with specific practices and metrics.

Rarity: Very Common
Difficulty: Easy


2. Explain SLIs, SLOs, and error budgets.

Answer: These are core SRE concepts for measuring and managing reliability:

SLI (Service Level Indicator):

  • Quantitative measure of service level
  • Examples: Latency, availability, error rate

SLO (Service Level Objective):

  • Target value for an SLI
  • Example: "99.9% of requests succeed"

Error Budget:

  • Allowed failure rate (100% - SLO)
  • Used to balance reliability vs feature velocity

# Example SLI/SLO calculation
def calculate_error_budget(total_requests, failed_requests, slo_target=0.999):
    """
    Calculate error budget consumption
    
    SLO: 99.9% success rate
    Error Budget: 0.1% allowed failures
    """
    success_rate = (total_requests - failed_requests) / total_requests
    error_rate = failed_requests / total_requests
    
    # Error budget: how much of allowed 0.1% we've used
    allowed_errors = total_requests * (1 - slo_target)
    budget_consumed = (failed_requests / allowed_errors) * 100
    
    return {
        'success_rate': success_rate,
        'error_rate': error_rate,
        'slo_target': slo_target,
        'slo_met': success_rate >= slo_target,
        'error_budget_consumed': budget_consumed,
        'remaining_budget': max(0, 100 - budget_consumed)
    }

# Example
result = calculate_error_budget(
    total_requests=1000000,
    failed_requests=500,  # 0.05% error rate
    slo_target=0.999
)

print(f"Success Rate: {result['success_rate']:.4%}")
print(f"SLO Met: {result['slo_met']}")
print(f"Error Budget Consumed: {result['error_budget_consumed']:.1f}%")

Rarity: Very Common
Difficulty: Medium


3. What is toil and how do you reduce it?

Answer: Toil is repetitive, manual operational work that:

  • Is manual (requires human action)
  • Is repetitive
  • Can be automated
  • Has no enduring value
  • Grows linearly with service growth

Examples of toil:

  • Manually restarting services
  • Copying files between servers
  • Manually scaling resources
  • Repetitive ticket responses

Toil reduction strategies:

#!/bin/bash
# auto-restart-service.sh - Example: automate service restarts

SERVICE_NAME="myapp"
MAX_RETRIES=3
RETRY_DELAY=5

check_service() {
    systemctl is-active --quiet $SERVICE_NAME
    return $?
}

restart_service() {
    echo "$(date): Restarting $SERVICE_NAME"
    systemctl restart $SERVICE_NAME
    
    # Wait for service to stabilize
    sleep $RETRY_DELAY
    
    if check_service; then
        echo "$(date): $SERVICE_NAME restarted successfully"
        # Send notification
        curl -X POST https://alerts.company.com/webhook \
          -d "{\"message\": \"$SERVICE_NAME auto-restarted\"}"
        return 0
    else
        return 1
    fi
}

# Main logic
if ! check_service; then
    echo "$(date): $SERVICE_NAME is down"
    
    for i in $(seq 1 $MAX_RETRIES); do
        echo "Attempt $i of $MAX_RETRIES"
        if restart_service; then
            exit 0
        fi
        sleep $RETRY_DELAY
    done
    
    # All retries failed - escalate
    echo "$(date): Failed to restart $SERVICE_NAME after $MAX_RETRIES attempts"
    curl -X POST https://pagerduty.com/api/incidents \
      -d "{\"service\": \"$SERVICE_NAME\", \"severity\": \"critical\"}"
    exit 1
fi

SRE Goal: Keep toil below 50% of time, automate the rest.

Rarity: Very Common
Difficulty: Easy-Medium


Monitoring & Observability

4. What's the difference between monitoring and observability?

Answer: Monitoring: Collecting predefined metrics and alerts

  • Known-unknowns: You know what to watch for
  • Dashboards, alerts, metrics
  • Examples: CPU, memory, request rate

Observability: Understanding system state from outputs

  • Unknown-unknowns: Debug issues you didn't anticipate
  • Logs, metrics, traces combined
  • Can answer arbitrary questions

Three Pillars of Observability:

  1. Metrics: Aggregated numbers (CPU, latency)
  2. Logs: Discrete events
  3. Traces: Request flow through system

Example: Prometheus + Grafana + Loki

# Prometheus scrape config
scrape_configs:
  - job_name: 'api-server'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
    scrape_interval: 15s
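The three pillars become most useful when they can be correlated. A minimal sketch, assuming a JSON-logs setup, of attaching one trace ID to every log line a request produces so logs can later be joined with traces (the field names here are illustrative, not a standard):

```python
import json
import time
import uuid

def structured_log(level, message, trace_id, **fields):
    """Emit one JSON log line; the shared trace_id is what lets a log
    aggregator join these lines with the distributed trace."""
    record = {
        "ts": time.time(),
        "level": level,
        "message": message,
        "trace_id": trace_id,
        **fields,
    }
    print(json.dumps(record))
    return record

# One trace ID per request, reused by every log line that request produces
trace_id = uuid.uuid4().hex
structured_log("info", "request received", trace_id, path="/api/orders")
structured_log("warning", "cache miss", trace_id, key="orders:123")
```

Searching the aggregator for that trace_id returns every log line the request produced, which is the "answer arbitrary questions" property observability promises.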

Rarity: Common
Difficulty: Medium


5. How do you set up effective alerts?

Answer: Good alerts are actionable, meaningful, and don't cause fatigue.

Alert Best Practices:

1. Alert on symptoms, not causes:

# Bad: Alert on high CPU
- alert: HighCPU
  expr: cpu_usage > 80
  
# Good: Alert on user impact
- alert: HighLatency
  expr: http_request_duration_seconds{quantile="0.95"} > 1
  for: 5m
  annotations:
    summary: "API latency is high"
    description: "95th percentile latency is {{ $value }}s"

2. Include runbook links:

- alert: DatabaseConnectionPoolExhausted
  expr: db_connection_pool_active / db_connection_pool_max > 0.9
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Database connection pool nearly exhausted"
    runbook: "https://wiki.company.com/runbooks/db-connections"

3. Use appropriate severity levels:

# Page-worthy (wakes someone up)
- alert: ServiceDown
  expr: up{job="api"} == 0
  for: 1m
  labels:
    severity: critical
    
# Ticket-worthy (handle during business hours)
- alert: DiskSpaceWarning
  expr: disk_free_percent < 20
  for: 30m
  labels:
    severity: warning

4. Avoid alert fatigue:

  • Use for: duration to avoid flapping
  • Group related alerts
  • Set appropriate thresholds
  • Review and tune regularly
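The effect of the for: duration above can be sketched in a few lines: fire only after the condition has held continuously for a full window, so a transient spike never pages anyone (a toy model of Prometheus's behavior, not its implementation):

```python
import time

class DebouncedAlert:
    """Fire only after a condition has held continuously for hold_seconds,
    mimicking the `for:` clause that suppresses flapping alerts."""

    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds
        self.breach_started = None  # when the condition first became true

    def evaluate(self, condition_true, now=None):
        now = time.time() if now is None else now
        if not condition_true:
            self.breach_started = None  # condition cleared; timer resets
            return False
        if self.breach_started is None:
            self.breach_started = now
        return now - self.breach_started >= self.hold_seconds

# A short spike does not fire a 5-minute (300s) alert
alert = DebouncedAlert(hold_seconds=300)
alert.evaluate(True, now=0)                  # breach starts
assert alert.evaluate(True, now=2) is False  # held for only 2s: no page
alert.evaluate(False, now=3)                 # condition clears, timer resets
```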

Rarity: Very Common
Difficulty: Medium


Incident Response

6. Walk through your incident response process.

Answer: Structured incident response minimizes impact and recovery time:

Incident Response Steps:


1. Detection:

  • Alert fires or user reports issue
  • Acknowledge alert
  • Create incident channel

2. Triage:

# Quick assessment checklist
- What's the user impact?
- How many users affected?
- What services are impacted?
- Is it getting worse?

3. Mitigation:

# Common mitigation strategies
- Rollback recent deployment
- Scale up resources
- Disable problematic feature
- Failover to backup system
- Rate limit traffic

4. Resolution:

  • Fix root cause
  • Verify metrics return to normal
  • Monitor for recurrence

5. Postmortem (Blameless):

# Incident Postmortem Template

## Summary
Brief description of what happened

## Impact
- Duration: 2024-11-25 10:00 - 10:45 UTC (45 minutes)
- Users affected: ~10,000 (5% of total)
- Services impacted: API, Web Frontend

## Root Cause
Database connection pool exhausted due to slow queries

## Timeline
- 10:00: Alert fired for high API latency
- 10:05: On-call engineer acknowledged
- 10:10: Identified database as bottleneck
- 10:15: Increased connection pool size (mitigation)
- 10:30: Identified and killed slow queries
- 10:45: Service fully recovered

## Resolution
- Immediate: Increased connection pool from 100 to 200
- Short-term: Added query timeout (30s)
- Long-term: Optimize slow queries, add query monitoring

## Action Items
- [ ] Add alert for slow queries (Owner: Alice, Due: 2024-12-01)
- [ ] Implement query timeout in application (Owner: Bob, Due: 2024-12-05)
- [ ] Review and optimize top 10 slowest queries (Owner: Charlie, Due: 2024-12-10)

## Lessons Learned
- Connection pool monitoring was insufficient
- Query performance degradation went unnoticed

Rarity: Very Common
Difficulty: Medium


7. How do you troubleshoot a service that's experiencing high latency?

Answer: Systematic debugging approach:

# 1. Verify the problem
curl -w "@curl-format.txt" -o /dev/null -s https://api.example.com/health

# curl-format.txt:
#     time_namelookup:  %{time_namelookup}s\n
#        time_connect:  %{time_connect}s\n
#     time_appconnect:  %{time_appconnect}s\n
#    time_pretransfer:  %{time_pretransfer}s\n
#       time_redirect:  %{time_redirect}s\n
#  time_starttransfer:  %{time_starttransfer}s\n
#                     ----------\n
#          time_total:  %{time_total}s\n

# 2. Check application metrics
# - Request rate (sudden spike?)
# - Error rate (errors causing retries?)
# - Resource usage (CPU, memory)

# 3. Check dependencies
# - Database query time
# - External API calls
# - Cache hit rate

# 4. Check infrastructure
top  # CPU usage
free -h  # Memory
iostat  # Disk I/O
netstat -s  # Network stats

# 5. Check logs for errors
tail -f /var/log/app/error.log | grep -i "timeout\|slow\|error"

# 6. Profile the application
# Python example
import cProfile
cProfile.run('my_function()')

# 7. Check database
# Slow query log
SELECT * FROM mysql.slow_log ORDER BY query_time DESC LIMIT 10;

# Active queries
SELECT pid, query, state, query_start 
FROM pg_stat_activity 
WHERE state = 'active';

Common causes:

  • Database slow queries
  • External API timeouts
  • Memory pressure (GC pauses)
  • Network issues
  • Resource exhaustion
  • Inefficient code paths
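Since most of these checks revolve around percentile latency, it helps to be precise about what "p95" means. A small sketch computing a nearest-rank percentile from raw latency samples (the sample values are made up):

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # 1-indexed rank, rounded up (ceil without importing math)
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[rank - 1]

# 90 fast requests plus 10 slow outliers
latencies_ms = [50] * 90 + [900, 920, 940, 960, 980,
                            1000, 1050, 1100, 1150, 1200]
print(percentile(latencies_ms, 50))  # 50: the median hides the problem
print(percentile(latencies_ms, 95))  # 980: the tail exposes the outliers
```

This is why SLOs are usually written against p95/p99 rather than the average: a small fraction of very slow requests barely moves the mean but dominates the tail.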

Rarity: Very Common
Difficulty: Medium


Automation & Scripting

8. Write a script to check if a service is healthy and restart it if needed.

Answer: Health check and auto-remediation script:

#!/usr/bin/env python3
"""
Service health checker with auto-restart capability
"""
import requests
import subprocess
import time
import sys
from datetime import datetime

class ServiceMonitor:
    def __init__(self, service_name, health_url, max_retries=3):
        self.service_name = service_name
        self.health_url = health_url
        self.max_retries = max_retries
    
    def check_health(self):
        """Check if service is healthy"""
        try:
            response = requests.get(
                self.health_url,
                timeout=5
            )
            return response.status_code == 200
        except requests.exceptions.RequestException as e:
            print(f"{datetime.now()}: Health check failed: {e}")
            return False
    
    def restart_service(self):
        """Restart the service using systemctl"""
        try:
            print(f"{datetime.now()}: Restarting {self.service_name}")
            subprocess.run(
                ['systemctl', 'restart', self.service_name],
                check=True,
                capture_output=True
            )
            time.sleep(10)  # Wait for service to start
            return True
        except subprocess.CalledProcessError as e:
            print(f"{datetime.now()}: Restart failed: {e.stderr}")
            return False
    
    def send_alert(self, message, severity='warning'):
        """Send alert to monitoring system"""
        try:
            requests.post(
                'https://alerts.company.com/webhook',
                json={
                    'service': self.service_name,
                    'message': message,
                    'severity': severity,
                    'timestamp': datetime.now().isoformat()
                },
                timeout=5
            )
        except Exception as e:
            print(f"Failed to send alert: {e}")
    
    def monitor(self):
        """Main monitoring loop"""
        if self.check_health():
            print(f"{datetime.now()}: {self.service_name} is healthy")
            return 0
        
        print(f"{datetime.now()}: {self.service_name} is unhealthy")
        self.send_alert(f"{self.service_name} is down", severity='warning')
        
        # Attempt restart
        for attempt in range(1, self.max_retries + 1):
            print(f"Restart attempt {attempt}/{self.max_retries}")
            
            if self.restart_service() and self.check_health():
                print(f"{datetime.now()}: Service recovered")
                self.send_alert(
                    f"{self.service_name} auto-recovered after restart",
                    severity='info'
                )
                return 0
            
            time.sleep(5)
        
        # All retries failed
        print(f"{datetime.now()}: Failed to recover service")
        self.send_alert(
            f"{self.service_name} failed to recover after {self.max_retries} attempts",
            severity='critical'
        )
        return 1

if __name__ == '__main__':
    monitor = ServiceMonitor(
        service_name='myapp',
        health_url='http://localhost:8080/health'
    )
    sys.exit(monitor.monitor())

Rarity: Common
Difficulty: Medium


Reliability Practices

9. What is a runbook and why is it important?

Answer: A runbook is a documented procedure for handling operational tasks and incidents.

Runbook Structure:

# Runbook: High API Latency

## Symptoms
- API 95th percentile latency > 1 second
- User complaints about slow page loads
- Alert: "HighAPILatency" firing

## Severity
**Warning** - Degrades user experience but service is functional

## Investigation Steps

### 1. Check current metrics
# Check latency distribution
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))'

# Check request rate
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=rate(http_requests_total[5m])'

2. Identify bottleneck

  • Check database query time
  • Check external API calls
  • Check cache hit rate
  • Review recent deployments

3. Check dependencies

# Database connections
mysql -e "SHOW PROCESSLIST;"

# Redis latency
redis-cli --latency

# External APIs
curl -w "%{time_total}\n" -o /dev/null -s https://external-api.com/health

Mitigation Steps

Quick fixes (< 5 minutes)

  1. Scale up application instances
     kubectl scale deployment api --replicas=10
  2. Relax the cache eviction policy so hot keys stay resident
     redis-cli CONFIG SET maxmemory-policy allkeys-lru

If issue persists

  1. Rollback recent deployment
     kubectl rollout undo deployment/api
  2. Enable rate limiting
     kubectl apply -f rate-limit-config.yaml

Resolution

  • Fix root cause (slow query, inefficient code)
  • Deploy fix
  • Monitor for 30 minutes
  • Scale back to normal capacity

Escalation

If unable to resolve within 30 minutes:

  • Escalate to: @backend-team
  • Slack channel: #incidents
  • On-call: Use PagerDuty escalation policy

Why runbooks matter:

  • Faster incident response
  • Consistent procedures
  • Knowledge sharing
  • Reduced stress during incidents
  • Training tool for new team members

Rarity: Common
Difficulty: Easy


10. Explain the concept of graceful degradation.

Answer: Graceful degradation means a system continues to operate at reduced capacity when components fail, rather than failing completely.

Strategies:

1. Feature Flags:

# Disable non-critical features during high load
class FeatureFlags:
    def __init__(self):
        self.flags = {
            'recommendations': True,
            'analytics': True,
            'search_autocomplete': True
        }
    
    def is_enabled(self, feature):
        # Disable non-critical features if error budget low
        if self.error_budget_low():
            non_critical = ['analytics', 'search_autocomplete']
            if feature in non_critical:
                return False
        return self.flags.get(feature, False)
    
    def error_budget_low(self):
        # Check if error budget < 20%; get_error_budget() is assumed to
        # query your SLO/monitoring system
        return get_error_budget() < 0.2

# Usage
flags = FeatureFlags()
if flags.is_enabled('recommendations'):
    show_recommendations()
else:
    # Gracefully degrade - show static content
    show_popular_items()

2. Circuit Breaker:

from enum import Enum
import time

class CircuitState(Enum):
    CLOSED = "closed"  # Normal operation
    OPEN = "open"      # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing recovery

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failures = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED
    
    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                # Return fallback instead of calling failing service
                return self.fallback()
        
        try:
            result = func(*args, **kwargs)
            self.on_success()
            return result
        except Exception as e:
            self.on_failure()
            return self.fallback()
    
    def on_success(self):
        self.failures = 0
        self.state = CircuitState.CLOSED
    
    def on_failure(self):
        self.failures += 1
        self.last_failure_time = time.time()
        if self.failures >= self.failure_threshold:
            self.state = CircuitState.OPEN
    
    def fallback(self):
        # Return cached data or default response
        return {"status": "degraded", "data": []}

# Usage
breaker = CircuitBreaker()
result = breaker.call(external_api_call, user_id=123)

3. Timeouts and Retries:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_resilient_session():
    session = requests.Session()
    
    # Retry strategy
    retry = Retry(
        total=3,
        backoff_factor=0.3,
        status_forcelist=[500, 502, 503, 504]
    )
    
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    
    return session

# Usage with timeout
session = create_resilient_session()
try:
    response = session.get('https://api.example.com/data', timeout=5)
except requests.exceptions.Timeout:
    # Gracefully degrade - fall back to cached data
    # (get_cached_data is an application-specific cache lookup)
    response = get_cached_data()

Rarity: Common
Difficulty: Medium


Containerization Basics

11. What is Docker and how does it differ from virtual machines?

Answer: Docker is a containerization platform that packages applications with their dependencies.

Containers vs Virtual Machines:


Key Differences:

| Feature        | Containers     | Virtual Machines   |
|----------------|----------------|--------------------|
| Startup Time   | Seconds        | Minutes            |
| Size           | MBs            | GBs                |
| Resource Usage | Lightweight    | Heavy              |
| Isolation      | Process-level  | Hardware-level     |
| OS             | Shares host OS | Separate OS per VM |

Docker Basics:

# Pull an image
docker pull nginx:latest

# Run a container
docker run -d \
  --name my-nginx \
  -p 8080:80 \
  nginx:latest

# List running containers
docker ps

# View logs
docker logs my-nginx

# Execute command in container
docker exec -it my-nginx bash

# Stop container
docker stop my-nginx

# Remove container
docker rm my-nginx

Dockerfile Example:

# Base image
FROM python:3.9-slim

# Set working directory
WORKDIR /app

# Copy requirements
COPY requirements.txt .

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Expose port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=3s \
  CMD curl -f http://localhost:8000/health || exit 1

# Run application
CMD ["python", "app.py"]

Build and Run:

# Build image
docker build -t myapp:1.0 .

# Run with environment variables
docker run -d \
  --name myapp \
  -p 8000:8000 \
  -e DATABASE_URL=postgres://db:5432/mydb \
  -e LOG_LEVEL=info \
  myapp:1.0

# View resource usage
docker stats myapp

Docker Compose (Multi-container):

version: '3.8'

services:
  web:
    build: .
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql://db:5432/mydb
    depends_on:
      - db
      - redis
    restart: unless-stopped
  
  db:
    image: postgres:14
    environment:
      - POSTGRES_DB=mydb
      - POSTGRES_PASSWORD=secret
    volumes:
      - postgres_data:/var/lib/postgresql/data
  
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

volumes:
  postgres_data:

Run with Docker Compose:

# Start all services
docker-compose up -d

# View logs
docker-compose logs -f web

# Scale service
docker-compose up -d --scale web=3

# Stop all services
docker-compose down

Best Practices:

  • Use official base images
  • Minimize layer count
  • Don't run as root
  • Use .dockerignore
  • Tag images properly
  • Scan for vulnerabilities

Rarity: Very Common
Difficulty: Easy-Medium


Version Control & Deployment

12. Explain Git workflows and how you handle deployments.

Answer: Git is essential for version control and deployment automation.

Common Git Workflow:


Basic Git Commands:

# Clone repository
git clone https://github.com/company/repo.git
cd repo

# Create feature branch
git checkout -b feature/add-monitoring

# Make changes and commit
git add .
git commit -m "Add Prometheus monitoring"

# Push to remote
git push origin feature/add-monitoring

# Update from main
git checkout main
git pull origin main
git checkout feature/add-monitoring
git rebase main

# Merge feature (after PR approval)
git checkout main
git merge feature/add-monitoring
git push origin main

Branching Strategy:

1. Gitflow:

  • main: Production-ready code
  • develop: Integration branch
  • feature/*: New features
  • release/*: Release preparation
  • hotfix/*: Emergency fixes

2. Trunk-Based Development:

# Short-lived feature branches
git checkout -b feature/quick-fix
# Work for < 1 day
git push origin feature/quick-fix
# Merge to main immediately after review

Deployment Workflow:

1. CI/CD Pipeline (GitHub Actions):

name: Deploy to Production

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Run tests
        run: |
          npm install
          npm test
      
      - name: Run linting
        run: npm run lint
  
  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Build Docker image
        run: |
          docker build -t myapp:${{ github.sha }} .
          docker tag myapp:${{ github.sha }} myapp:latest
      
      - name: Push to registry
        run: |
          echo ${{ secrets.DOCKER_PASSWORD }} | docker login -u ${{ secrets.DOCKER_USERNAME }} --password-stdin
          docker push myapp:${{ github.sha }}
          docker push myapp:latest
  
  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to production
        run: |
          kubectl set image deployment/myapp \
            myapp=myapp:${{ github.sha }}
          kubectl rollout status deployment/myapp

2. Deployment Script:

#!/bin/bash
# deploy.sh - Simple deployment script

set -e  # Exit on error

# Configuration
APP_NAME="myapp"
VERSION=$(git rev-parse --short HEAD)
ENVIRONMENT=${1:-staging}

echo "Deploying $APP_NAME version $VERSION to $ENVIRONMENT"

# Run tests
echo "Running tests..."
npm test

# Build application
echo "Building application..."
npm run build

# Build Docker image
echo "Building Docker image..."
docker build -t $APP_NAME:$VERSION .

# Push to registry
echo "Pushing to registry..."
docker tag $APP_NAME:$VERSION registry.company.com/$APP_NAME:$VERSION
docker push registry.company.com/$APP_NAME:$VERSION

# Deploy to Kubernetes
echo "Deploying to Kubernetes..."
kubectl set image deployment/$APP_NAME \
  $APP_NAME=registry.company.com/$APP_NAME:$VERSION \
  --namespace=$ENVIRONMENT

# Wait for rollout
kubectl rollout status deployment/$APP_NAME \
  --namespace=$ENVIRONMENT \
  --timeout=5m

echo "Deployment complete!"

# Run smoke tests
echo "Running smoke tests..."
curl -f https://$ENVIRONMENT.company.com/health || {
  echo "Health check failed! Rolling back..."
  kubectl rollout undo deployment/$APP_NAME --namespace=$ENVIRONMENT
  exit 1
}

echo "Deployment successful!"

3. Rollback Procedure:

# View deployment history
kubectl rollout history deployment/myapp

# Rollback to previous version
kubectl rollout undo deployment/myapp

# Rollback to specific revision
kubectl rollout undo deployment/myapp --to-revision=3

# Check rollback status
kubectl rollout status deployment/myapp

Deployment Best Practices:

  • Always test before deploying
  • Use feature flags for risky changes
  • Deploy during low-traffic periods
  • Have rollback plan ready
  • Monitor after deployment
  • Use blue-green or canary deployments
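The canary bullet above boils down to one decision: does the canary's error rate look significantly worse than the stable fleet's? A toy sketch of that comparison (the thresholds are illustrative; real canary analysis uses more signals than a single error ratio):

```python
def canary_healthy(canary_errors, canary_total, stable_errors, stable_total,
                   max_ratio=2.0, min_requests=100):
    """Return True when it is safe to keep promoting the canary:
    its error rate must not exceed max_ratio times the stable rate."""
    if canary_total < min_requests:
        return False  # not enough traffic on the canary to judge yet
    canary_rate = canary_errors / canary_total
    stable_rate = stable_errors / stable_total
    if stable_rate == 0:
        return canary_rate == 0  # stable fleet is perfect; canary must be too
    return canary_rate <= stable_rate * max_ratio

# Stable fleet runs at 0.1% errors. A canary at 0.15% passes;
# a canary at 1% fails and should be rolled back.
print(canary_healthy(15, 10000, 100, 100000))   # True
print(canary_healthy(100, 10000, 100, 100000))  # False
```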

Rarity: Very Common
Difficulty: Medium


Linux Fundamentals

13. How do you troubleshoot a Linux server that's running slow?

Answer: Systematic approach to diagnosing performance issues:

Troubleshooting Checklist:

1. Check System Load:

# View current load average
uptime
# Output: 15:30:45 up 10 days, 3:25, 2 users, load average: 2.5, 2.0, 1.8
# Load average: 1min, 5min, 15min

# View detailed system stats
top
# Press '1' to see per-CPU stats
# Press 'M' to sort by memory
# Press 'P' to sort by CPU

# Better alternative to top
htop
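Load average is only meaningful relative to core count: 2.5 is heavy on a 2-core box and negligible on 32 cores. A small Python sketch of that comparison using only the standard library (the 1.0 per-core threshold is a common rule of thumb, not a hard limit):

```python
import os

def load_status(threshold=1.0):
    """Compare the 1-minute load average to the number of CPU cores.
    A sustained per-core load above `threshold` suggests the run queue
    is backing up (CPU contention, or tasks stuck in uninterruptible I/O)."""
    load_1m, load_5m, load_15m = os.getloadavg()
    cores = os.cpu_count() or 1
    per_core = load_1m / cores
    status = "overloaded" if per_core > threshold else "ok"
    return {"load_1m": load_1m, "cores": cores,
            "per_core": round(per_core, 2), "status": status}

print(load_status())
```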

2. Check CPU Usage:

# CPU usage per process
ps aux --sort=-%cpu | head -10

# Real-time CPU monitoring
mpstat 1 5  # 5 samples, 1 second apart

# CPU usage by core
lscpu

3. Check Memory:

# Memory usage
free -h
#               total        used        free      shared  buff/cache   available
# Mem:           15Gi       8.2Gi       2.1Gi       324Mi       5.0Gi       6.5Gi
# Swap:         2.0Gi       512Mi       1.5Gi

# Memory usage per process
ps aux --sort=-%mem | head -10

# Detailed memory info
cat /proc/meminfo

# Check for OOM (Out of Memory) kills
dmesg | grep -i "out of memory"
journalctl -k | grep -i "killed process"

4. Check Disk I/O:

# Disk usage
df -h

# Inode usage (can run out even with disk space)
df -i

# Disk I/O statistics
iostat -x 1 5

# Find large files
du -sh /* | sort -rh | head -10

# Find files using most disk I/O
iotop

5. Check Network:

# Network connections
netstat -tuln  # Listening ports
netstat -tun   # Established connections

# Network statistics
netstat -s

# Network interface stats
ifconfig
ip addr show

# Bandwidth usage
iftop

# Check DNS resolution
nslookup google.com
dig google.com

6. Check Running Processes:

# All processes
ps aux

# Process tree
pstree

# Find process by name
pgrep -a nginx

# Kill process
kill -9 <PID>
pkill -9 nginx

# Check for zombie processes (state column is "Z")
ps aux | awk '$8 ~ /Z/'

7. Check Logs:

# System logs
tail -f /var/log/syslog
journalctl -f  # Follow systemd logs

# Application logs
tail -f /var/log/nginx/error.log

# Search for errors
grep -i error /var/log/syslog | tail -20

# Check auth logs
tail -f /var/log/auth.log

Common Issues and Solutions:

High CPU:

# Find CPU-intensive process
top -o %CPU

# If it's a runaway process
kill -15 <PID>  # Graceful
kill -9 <PID>   # Force kill

# If it's legitimate high load
# - Scale horizontally (add servers)
# - Optimize application code
# - Add caching

High Memory:

# Clear cache (safe, kernel will rebuild)
sync
echo 3 > /proc/sys/vm/drop_caches

# Find memory leaks
valgrind --leak-check=full ./myapp

# Restart memory-hungry service
systemctl restart myapp

Disk Full:

# Find large files
find / -type f -size +100M -exec ls -lh {} \;

# Clean up logs
journalctl --vacuum-time=7d  # Keep last 7 days
find /var/log -name "*.log" -mtime +30 -delete

# Clean package cache (Ubuntu/Debian)
apt-get clean
apt-get autoclean

Slow Network:

# Test connectivity
ping -c 5 google.com

# Test bandwidth
iperf3 -c server.example.com

# Trace route
traceroute google.com

# Check firewall
iptables -L -n

Quick Diagnostic Script:

#!/bin/bash
# quick-diag.sh - Quick system diagnostics

echo "=== System Load ==="
uptime

echo -e "\n=== CPU Usage ==="
top -bn1 | head -20

echo -e "\n=== Memory Usage ==="
free -h

echo -e "\n=== Disk Usage ==="
df -h

echo -e "\n=== Top CPU Processes ==="
ps aux --sort=-%cpu | head -6

echo -e "\n=== Top Memory Processes ==="
ps aux --sort=-%mem | head -6

echo -e "\n=== Listening Sockets ==="
netstat -tuln | tail -n +3 | wc -l

echo -e "\n=== Recent Errors ==="
journalctl -p err -n 10 --no-pager

Rarity: Very Common
Difficulty: Medium


Conclusion

Preparing for a junior SRE interview requires understanding reliability principles, hands-on troubleshooting skills, and automation mindset. Focus on:

  1. SRE Fundamentals: SLIs, SLOs, error budgets, toil reduction
  2. Monitoring: Setting up effective alerts and observability
  3. Incident Response: Structured approach to handling outages
  4. Automation: Scripting to reduce toil and improve reliability
  5. Reliability Practices: Runbooks, graceful degradation, resilience patterns

Practice these concepts in real environments, participate in on-call rotations if possible, and always think about how to make systems more reliable and automated. Good luck!
