Junior SRE Interview Questions and Answers

Author: Milad Bonakdar
Prepare for junior SRE interviews with practical questions on SLOs, error budgets, alerting, incidents, Linux troubleshooting, automation, and Kubernetes basics.
Junior SRE interview focus
Junior SRE interviews usually test whether you can reason from user impact to system symptoms: SLOs, error budgets, alerts, incidents, Linux signals, automation, containers, and basic Kubernetes operations. You do not need to sound like a principal engineer. You need to show that you can investigate carefully, communicate clearly during incidents, and automate repetitive operational work without hiding risk.
Use these questions to practice concise answers, then connect each answer to a real example from projects, labs, internships, or on-call shadowing.
SRE Fundamentals
1. What is Site Reliability Engineering and how does it differ from DevOps?
Answer: SRE is a reliability-focused way to run production systems by applying software engineering to operations work.
Key Principles:
- Treat operations as a software problem
- Keep repetitive operational work visible and reduce it with automation
- Use error budgets to balance reliability and velocity
- Run blameless postmortems
- Prefer gradual rollouts and automated rollbacks
SRE vs DevOps:
A good junior answer: SRE turns reliability into measurable engineering work. DevOps is a broader culture of collaboration and automation; SRE makes that culture concrete with SLOs, error budgets, incident review, and toil reduction.
Rarity: Very Common
Difficulty: Easy
2. Explain SLIs, SLOs, and error budgets.
Answer: These are core SRE concepts for measuring reliability from the user's point of view:
SLI (Service Level Indicator):
- Quantitative measure of service level
- Examples: request success rate, latency, availability, freshness, throughput
SLO (Service Level Objective):
- Target value for an SLI over a time window
- Example: "99.9% of valid API requests succeed over 30 days"
Error Budget:
- The gap between perfect reliability (100%) and the SLO; for example, a 99.9% SLO leaves a 0.1% error budget
- Used to decide when to slow risky changes, improve reliability, or continue shipping
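The arithmetic behind an error budget can be sketched in a few lines of Python. This is a minimal illustration, assuming a request-based SLO; the function names and the sample numbers are hypothetical and not taken from any particular tool.

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget usage for a request-based SLO.

    slo is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    # The budget is the share of requests that is allowed to fail.
    allowed_failures = (1 - slo) * total_requests
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": consumed,  # 1.0 means the budget is fully spent
    }


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Translate a time-based availability SLO into minutes of allowed downtime."""
    return (1 - slo) * window_days * 24 * 60
```

With a 99.9% SLO and one million requests in the window, roughly 1,000 requests may fail; 400 failures would consume about 40% of the budget. The same SLO expressed as availability allows roughly 43 minutes of downtime over 30 days.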
Rarity: Very Common
Difficulty: Medium
3. What is toil and how do you reduce it?
Answer: Toil is operational work that:
- Is manual (requires human action)
- Is repetitive
- Can be automated
- Has no enduring value
- Grows linearly with service growth
Examples of toil:
- Manually restarting services
- Copying files between servers
- Manually scaling resources
- Repetitive ticket responses
Toil reduction strategies:
Junior interview framing: Do not say every manual task is bad. First confirm the task is safe to automate, document the current steps, add logging and rollback behavior, then measure whether the automation actually reduces interruptions.
Rarity: Very Common
Difficulty: Easy-Medium
Monitoring & Observability
4. What's the difference between monitoring and observability?
Answer:
Monitoring: Collecting predefined metrics and alerting on them
- Known-unknowns: You know what to watch for
- Dashboards, alerts, metrics
- Examples: CPU, memory, request rate
Observability: Understanding system state from outputs
- Unknown-unknowns: Debug issues you didn't anticipate
- Logs, metrics, traces combined
- Can answer arbitrary questions
Common telemetry signals:
- Metrics: Aggregated numbers such as request rate, errors, saturation, and latency
- Logs: Discrete events with enough context to explain what happened
- Traces: Request flow across services, useful for dependency and latency analysis
In an interview, explain how you would combine them. For example: metrics show API latency rising, traces show most slow spans wait on the database, and logs show repeated query timeouts after a deployment.
Example stack: Prometheus for metrics, Grafana for dashboards, and Loki for logs
Rarity: Common
Difficulty: Medium
5. How do you set up effective alerts?
Answer: Good alerts are actionable, tied to user impact, and tuned enough that responders trust them.
Alert Best Practices:
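1. Alert on symptoms, not causes: page on what users feel, such as error rate or latency, rather than on a single host's CPU
2. Include runbook links: every alert should point to the first steps a responder takes
3. Use appropriate severity levels: match the notification channel to the urgency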
4. Avoid alert fatigue:
- Use a for: duration in alert rules to avoid flapping
- Group related alerts
- Page on symptoms that need immediate human action
- Send lower-priority causes to tickets or dashboards
- Review alerts after incidents and remove noisy ones
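To make the flapping point concrete, here is a small Python sketch of the idea behind a "for" duration: the alert only fires once the condition has held continuously for a set time. It is a simplification for illustration, not how Prometheus or any other alerting system is implemented, and the class name and thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingAlert:
    threshold: float                        # fire when the value is above this
    for_seconds: float                      # condition must hold this long first
    breach_started: Optional[float] = None  # when the current breach began

    def observe(self, value: float, now: float) -> bool:
        """Record one sample; return True if the alert should be firing."""
        if value <= self.threshold:
            # Condition cleared: reset the timer so a short spike never pages.
            self.breach_started = None
            return False
        if self.breach_started is None:
            self.breach_started = now
        return now - self.breach_started >= self.for_seconds
```

With a 5% error-rate threshold and a 300-second duration, a two-minute spike that recovers never pages anyone, while a sustained breach fires after five minutes. The trade-off is slower detection, so the duration should be short for symptoms with high user impact.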
Rarity: Very Common
Difficulty: Medium
Incident Response
6. Walk through your incident response process.
Answer: Structured incident response reduces customer impact and keeps the team aligned:
Incident Response Steps:
1. Detection:
- Alert fires or user reports issue
- Acknowledge alert
- Create incident channel
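2. Triage:
- Assess severity and scope: who is affected and how badly
- Check recent changes such as deployments or configuration updates
- Agree on who leads the incident and where updates are posted
3. Mitigation:
- Reduce user impact first: roll back, fail over, or scale up
- Prefer the fastest safe action over a complete fix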
- Fix root cause
- Verify metrics return to normal
- Monitor for recurrence
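5. Post-incident review:
- Write a blameless postmortem with a timeline and contributing factors
- Track follow-up actions to completion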
Rarity: Very Common
Difficulty: Medium
7. How do you troubleshoot a service that's experiencing high latency?
Answer: Systematic debugging approach:
Common causes:
- Database slow queries
- External API timeouts
- Memory pressure (GC pauses)
- Network issues
- Resource exhaustion
- Inefficient code paths
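A useful first step is to confirm what "high latency" means by looking at percentiles rather than the average, because several of the causes above only affect a fraction of requests. The sketch below uses Python's standard statistics module; the function name and sample data are hypothetical.

```python
import statistics


def latency_summary(samples_ms: list) -> dict:
    """Return mean and tail percentiles for a list of request latencies (ms)."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples to compute percentiles")
    # quantiles(n=100) returns 99 cut points; index 49 is p50, 94 is p95, 98 is p99.
    cuts = statistics.quantiles(samples_ms, n=100)
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Hypothetical sample: most requests are fast, a few wait on a slow dependency.
samples = [10.0] * 95 + [500.0] * 5
summary = latency_summary(samples)
```

In this sample the median stays at 10 ms while the 99th percentile reaches 500 ms, a pattern that points toward an intermittent dependency or a slow code path rather than a uniformly overloaded host.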
Rarity: Very Common
Difficulty: Medium
Automation & Scripting
8. Write a script to check if a service is healthy and restart it if needed.
Answer: Health check and auto-remediation script:
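A minimal Python sketch of such a script follows. It assumes the service exposes an HTTP health endpoint and runs as a systemd unit; the URL and unit name in the usage comment are hypothetical placeholders. The check and restart actions are passed in as functions so the logic can be tested without touching a real service.

```python
import subprocess
import time
import urllib.request
from typing import Callable


def http_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


def restart_service(name: str) -> bool:
    """Restart a systemd unit; True if systemctl exited with status 0."""
    result = subprocess.run(["systemctl", "restart", name], check=False)
    return result.returncode == 0


def check_and_remediate(
    check: Callable[[], bool],
    restart: Callable[[], bool],
    attempts: int = 3,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Restart only after several consecutive failed checks, then verify."""
    for attempt in range(attempts):
        if check():
            return "healthy"
        if attempt < attempts - 1:
            sleep(delay)  # avoid restarting on a single transient failure
    if not restart():
        return "restart_failed"
    sleep(delay)  # give the service time to come back up
    return "recovered" if check() else "still_unhealthy"


# Example wiring (hypothetical URL and unit name):
# outcome = check_and_remediate(
#     check=lambda: http_healthy("http://localhost:8080/health"),
#     restart=lambda: restart_service("myapp"),
# )
# print(outcome)
```

In an interview, mention the safeguards: several failed checks are required before any restart, the result is verified afterwards, and a "still_unhealthy" or "restart_failed" outcome should page a human instead of looping on restarts.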
Rarity: Common
Difficulty: Medium
Reliability Practices
9. What is a runbook and why is it important?
Answer: A runbook is a documented procedure for handling operational tasks and incidents.
Runbook Structure:
2. Identify bottleneck
- Check database query time
- Check external API calls
- Check cache hit rate
- Review recent deployments
3. Check dependencies
Mitigation Steps
Quick fixes (< 5 minutes)
- Scale up application instances
- Increase cache TTL temporarily
If issue persists
- Rollback recent deployment
- Enable rate limiting
Resolution
- Fix root cause (slow query, inefficient code)
- Deploy fix
- Monitor for 30 minutes
- Scale back to normal capacity
Escalation
If unable to resolve within 30 minutes:
- Escalate to: @backend-team
- Slack channel: #incidents
- On-call: Use PagerDuty escalation policy
Related
2. Circuit Breaker:
3. Timeouts and Retries:
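The circuit breaker and retry patterns named above can be sketched together in Python. This is an illustrative, single-threaded simplification with hypothetical names and default values; production code would normally rely on an established library and add locking and metrics.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call to protect a failing dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result


def retry_with_backoff(func, attempts=3, base_delay=0.5, max_delay=8.0,
                       retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Retry transient failures with exponential backoff and full jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter avoids synchronized retries
```

The two patterns complement each other: retries absorb brief, transient errors, while the breaker stops retries from piling load onto a dependency that is already failing. Every call should also carry its own timeout so that a slow dependency cannot hold requests open indefinitely.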
Rarity: Common
Difficulty: Medium
Containers and Kubernetes Basics
11. What container and Kubernetes basics should a junior SRE know?
Answer: Junior SREs are often expected to understand how an application is packaged, started, checked for health, and rolled back. You do not need deep cluster internals for a junior role, but you should be comfortable with images, containers, logs, resource usage, deployments, and health probes.
Containers vs Virtual Machines:
Key Differences:
Docker basics:
Dockerfile example:
Build and Run:
Kubernetes health probes:
How to explain probes:
- Readiness probe: tells Kubernetes whether the pod should receive traffic
- Liveness probe: tells Kubernetes whether the container should be restarted
- Startup probe: useful when an application starts slowly and would otherwise fail liveness checks too early
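From the application side, the three probes usually map to small HTTP endpoints. The Python sketch below shows one way an application might answer them; the endpoint paths, the AppState fields, and the port in the comment are assumptions for illustration, not Kubernetes requirements.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class AppState:
    def __init__(self):
        self.started = False          # initialization has finished
        self.dependencies_ok = False  # e.g. the database is reachable
        self.deadlocked = False       # the process is stuck and needs a restart


def probe_status(path: str, state: AppState) -> int:
    """Map a probe path to the HTTP status code the application returns."""
    if path == "/startupz":
        return 200 if state.started else 503
    if path == "/livez":
        # Liveness should fail only when a restart would actually help.
        return 503 if state.deadlocked else 200
    if path == "/readyz":
        # Readiness can fail temporarily without causing a restart.
        return 200 if state.started and state.dependencies_ok else 503
    return 404


STATE = AppState()


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path, STATE))
        self.end_headers()


# To serve the endpoints locally (hypothetical port):
# HTTPServer(("0.0.0.0", 8080), ProbeHandler).serve_forever()
```

A point worth making in an interview: a lost database connection should fail readiness, which removes the pod from traffic, but not liveness, because restarting the container does not fix the database.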
Docker Compose (multi-container local setup):
Run with Docker Compose:
Best Practices:
- Use official base images
- Minimize layer count
- Don't run as root
- Use .dockerignore
- Tag images properly
- Scan for vulnerabilities
Rarity: Very Common
Difficulty: Easy-Medium
Version Control & Deployment
12. Explain Git workflows and how you handle deployments.
Answer: Git is essential for version control and deployment automation.
Common Git Workflow:
Basic Git Commands:
Branching Strategy:
1. Gitflow:
- main: Production-ready code
- develop: Integration branch
- feature/*: New features
- release/*: Release preparation
- hotfix/*: Emergency fixes
2. Trunk-Based Development:
Deployment Workflow:
1. CI/CD Pipeline (GitHub Actions):
2. Deployment Script:
3. Rollback Procedure:
Deployment Best Practices:
- Always test before deploying
- Use feature flags for risky changes
- Deploy during low-traffic periods
- Have rollback plan ready
- Monitor after deployment
- Use blue-green or canary deployments
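The "monitor after deployment" and canary points can be combined into a simple automated gate. The Python sketch below compares the canary's error rate with the baseline's; the function name and thresholds are hypothetical, and real canary analysis usually looks at more signals, such as latency and saturation.

```python
def canary_verdict(
    baseline_errors: int,
    baseline_total: int,
    canary_errors: int,
    canary_total: int,
    min_requests: int = 500,
    max_abs_increase: float = 0.01,
) -> str:
    """Return "wait", "rollback", or "promote" for a canary deployment."""
    # Too little traffic gives a noisy error rate, so keep observing.
    if canary_total < min_requests or baseline_total < min_requests:
        return "wait"
    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    # Roll back if the canary's error rate exceeds the baseline by more
    # than the allowed absolute margin (1 percentage point by default).
    if canary_rate - baseline_rate > max_abs_increase:
        return "rollback"
    return "promote"
```

For example, a canary with 30 errors in 1,000 requests against a baseline of 10 errors in 10,000 yields "rollback", while a canary that has served only 100 requests yields "wait" regardless of its error count.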
Rarity: Very Common
Difficulty: Medium
Linux Fundamentals
13. How do you troubleshoot a Linux server that's running slow?
Answer: Systematic approach to diagnosing performance issues:
Troubleshooting Checklist:
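1. Check System Load: uptime or top; compare the load averages with the number of CPU cores
2. Check CPU Usage: top or mpstat; look for high user, system, or iowait time
3. Check Memory: free -h; look at available memory and swap use
4. Check Disk I/O: iostat -x for device utilization, df -h for full filesystems
5. Check Network: ss -s for socket counts, ping or mtr for latency and packet loss
6. Check Running Processes: ps aux --sort=-%cpu to find the heaviest processes
7. Check Logs: journalctl -p err and dmesg for errors such as OOM kills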
Common Issues and Solutions:
High CPU:
High Memory:
Disk Full:
Slow Network:
Quick Diagnostic Script:
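A minimal Python sketch of such a script, using only the standard library, is shown below. It assumes a Linux host with /proc/meminfo available; the function names and the warning thresholds are hypothetical and should be tuned per system.

```python
import os
import shutil


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo content into {field: value}; most values are in kB."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def memory_used_percent(meminfo: dict) -> float:
    """Share of memory that is not available for new allocations."""
    total = meminfo["MemTotal"]
    return 100.0 * (total - meminfo["MemAvailable"]) / total


def disk_used_percent(path: str = "/") -> float:
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def load_per_cpu() -> float:
    """One-minute load average divided by CPU count; above 1.0 suggests queuing."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


def report(load_limit: float = 1.0, mem_limit: float = 90.0,
           disk_limit: float = 90.0) -> list:
    """Return a list of human-readable warnings; empty means nothing stood out."""
    warnings = []
    with open("/proc/meminfo") as f:
        mem = memory_used_percent(parse_meminfo(f.read()))
    if load_per_cpu() > load_limit:
        warnings.append("high load relative to CPU count")
    if mem > mem_limit:
        warnings.append(f"memory {mem:.0f}% used")
    disk = disk_used_percent("/")
    if disk > disk_limit:
        warnings.append(f"root filesystem {disk:.0f}% used")
    return warnings
```

Calling report() returns an empty list when nothing crosses a threshold. A script like this only narrows the search; the checklist above is still needed to find which process, disk, or connection is responsible.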
Rarity: Very Common
Difficulty: Medium
Conclusion
Preparing for a junior SRE interview means connecting reliability concepts to practical operator behavior. Focus on:
- SRE Fundamentals: SLIs, SLOs, error budgets, toil reduction
- Monitoring: Setting up effective alerts and observability
- Incident Response: Structured approach to handling outages
- Automation: Scripting to reduce toil and improve reliability
- Containers and Kubernetes: Images, logs, rollouts, probes, and rollback commands
Practice these concepts in a small lab if you can: run a service, break it, observe the symptoms, write a runbook, and automate one safe recovery step. That gives you concrete stories for the interview without exaggerating your experience.


