Senior DevOps Engineer Interview Questions for Production Systems

Author: Milad Bonakdar
Prepare for senior DevOps interviews with practical questions on Kubernetes, Terraform state, GitOps, security, observability, incident response, and production trade-offs.
Senior DevOps interview focus
Senior DevOps interviews usually test whether you can run production systems, not just name tools. Expect scenario-based questions about Kubernetes failures, Terraform state safety, GitOps rollouts, cloud resilience, security controls, observability, incident response, and the trade-offs behind each decision.
Use this guide to practice answers that show judgment: what you would check first, which risk you would reduce, how you would validate a fix, and how you would explain the trade-off to engineering, security, or product stakeholders.
Advanced Kubernetes
1. Explain Kubernetes architecture and the role of key components.
Answer: Kubernetes uses a control plane and worker node architecture. A strong senior answer should explain both the components and how desired state moves through the system.
Control Plane Components:
- API Server: Entry point for Kubernetes API requests and admission control
- etcd: Distributed key-value store for cluster state
- Scheduler: Assigns pods to nodes based on resources, constraints, affinity, taints, and tolerations
- Controller Manager: Runs controllers that reconcile actual state toward desired state
- Cloud Controller Manager: Integrates cluster behavior with cloud provider APIs
Node Components:
- kubelet: Node agent that starts pods and reports node/pod status
- kube-proxy: Maintains service networking rules where used by the cluster
- Container Runtime: Runs containers through the Kubernetes CRI, commonly containerd or CRI-O
How it works:
- A user or controller submits a manifest through the API server.
- The API server authenticates and authorizes the request, runs admission control and validation, and then stores the desired state in etcd.
- The scheduler assigns pending pods to nodes.
- The kubelet on each node creates containers through the runtime and reports status.
- Controllers continue reconciling until the actual state matches the desired state.
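The reconciliation idea in the last step can be sketched as a toy loop. This is a simplified model, not Kubernetes code: `ReplicaState`, `reconcile`, and `run_until_converged` are hypothetical names, and the sketch assumes every action succeeds immediately.

```python
from dataclasses import dataclass


@dataclass
class ReplicaState:
    desired: int  # replicas requested in the stored spec
    actual: int   # replicas currently running


def reconcile(state: ReplicaState) -> list[str]:
    """Return the actions one reconcile pass would take (toy model)."""
    if state.actual < state.desired:
        return [f"create pod {i}" for i in range(state.actual, state.desired)]
    if state.actual > state.desired:
        return [f"delete pod {i}" for i in range(state.desired, state.actual)]
    return []  # actual already matches desired


def run_until_converged(state: ReplicaState, max_passes: int = 10) -> int:
    """Run reconcile passes until nothing is left to do; return passes used."""
    for passes in range(max_passes):
        actions = reconcile(state)
        if not actions:
            return passes
        # Assumption for the sketch: every action succeeds right away.
        state.actual = state.desired
    return max_passes
```

Real controllers are level-triggered in the same way: they compare observed state with desired state on every pass instead of relying on having seen each individual event.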
Rarity: Very Common
Difficulty: Hard
2. How do you troubleshoot a pod stuck in CrashLoopBackOff?
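Answer: CrashLoopBackOff means the container starts, exits, and is restarted by the kubelet with an increasing back-off delay, so the pod has already been scheduled and its image has already been pulled. Start by separating application, configuration, probe, resource, and dependency problems. A senior answer should show a repeatable triage path and avoid blindly restarting pods before capturing evidence.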
Common causes:
- Application crashes on startup
- Missing environment variables
- Incorrect liveness probe configuration
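- Memory limit too low (container is OOMKilled)
- Wrong image tag, command, or entrypoint (a failed image pull shows up as ImagePullBackOff, not CrashLoopBackOff)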
- Missing dependencies
- Bad rollout or incompatible config change
- Database, DNS, or network dependency unavailable at startup
Example fix:
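The original example is not available here, so the following is only a sketch of the first triage step: turning one entry of a pod's `status.containerStatuses` into a first hypothesis. The function name `classify_crashloop` and the wording of the hints are assumptions; the field names (`state.waiting.reason`, `lastState.terminated.reason`, `lastState.terminated.exitCode`) are the ones the Kubernetes API reports.

```python
def classify_crashloop(container_status: dict) -> str:
    """Suggest a first hypothesis from one status.containerStatuses entry."""
    waiting = container_status.get("state", {}).get("waiting", {})
    if waiting.get("reason") != "CrashLoopBackOff":
        return "not in CrashLoopBackOff: check pod events for the actual state"

    last = container_status.get("lastState", {}).get("terminated", {})
    reason = last.get("reason")
    exit_code = last.get("exitCode")

    if reason == "OOMKilled":
        return "out of memory: compare the memory limit with real usage"
    if exit_code == 137:
        # 137 = 128 + 9, meaning the process received SIGKILL.
        return "killed with SIGKILL: check probe failures and node pressure"
    if exit_code in (126, 127):
        # Shell conventions: 126 = not executable, 127 = command not found.
        return "command not runnable: check image, command, and args"
    if exit_code == 0:
        return "clean exit: the container may not run a long-lived process"
    return "application error: read logs from the previous container"
```

The status can be read with `kubectl get pod <pod> -o json`, and `kubectl logs <pod> --previous` shows output from the container instance that crashed. The fix then depends on the hypothesis: raise the memory limit, correct the probe, restore the missing configuration, or roll back the release.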
Rarity: Very Common
Difficulty: Medium
3. Explain Kubernetes networking: Services, Ingress, and Network Policies.
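Answer: Kubernetes networking is easiest to explain in three layers: stable service endpoints, HTTP routing at the edge, and traffic restrictions between pods.
Services: Give a set of pods a stable virtual IP and DNS name. The main types are ClusterIP (internal only, the default), NodePort (a port on every node), LoadBalancer (a cloud load balancer), and ExternalName (a DNS alias).
Ingress: Routes HTTP/HTTPS traffic to Services by host and path and can terminate TLS. It only takes effect when an ingress controller is running in the cluster.
Network Policies: Control pod-to-pod communication. Pods accept all traffic by default; once a policy selects a pod, only explicitly allowed traffic is permitted. Enforcement depends on a CNI plugin that supports network policies.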
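The isolation rule for network policies can be sketched as a small evaluator. This is a simplified model under stated assumptions: both pods are in the same namespace, only `matchLabels`-style selectors are supported, and the keys `pod_selector` and `allow_from` are hypothetical stand-ins for the real manifest fields.

```python
def matches(selector: dict, labels: dict) -> bool:
    """True if every key/value pair in the selector appears in labels.

    An empty selector matches every pod, as in Kubernetes.
    """
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(policies: list[dict], src_labels: dict, dst_labels: dict) -> bool:
    """Simplified ingress decision for two pods in the same namespace."""
    selecting = [p for p in policies if matches(p["pod_selector"], dst_labels)]
    if not selecting:
        return True  # no policy selects the destination, so it is not isolated
    # Policies are additive: a single matching allow rule is enough.
    return any(
        matches(allowed, src_labels)
        for policy in selecting
        for allowed in policy["allow_from"]
    )
```

A policy with an empty `pod_selector` and an empty `allow_from` list behaves like a default-deny rule in this model, which mirrors the usual starting point before specific flows are opened.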
Rarity: Very Common
Difficulty: Hard
4. How do you implement autoscaling in Kubernetes?
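Answer: Kubernetes autoscaling works at three levels: the number of pod replicas, the resources each pod requests, and the number of nodes in the cluster.
Horizontal Pod Autoscaler (HPA): Changes the replica count of a workload based on observed metrics such as CPU, memory, or custom and external metrics.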
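The core HPA calculation documented by Kubernetes is desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), skipped when the ratio is within a tolerance of 1.0 (10% by default). A minimal sketch of that formula follows; the function name and parameter names are assumptions, and the real controller also handles unready pods, missing metrics, and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Apply the basic HPA formula, then clamp to the configured bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do not scale
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))
```

For example, four replicas averaging 90% CPU against a 60% target give a ratio of 1.5, so the sketch asks for six replicas.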
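Vertical Pod Autoscaler (VPA): Adjusts CPU and memory requests based on observed usage. Avoid combining it with an HPA that scales on the same CPU or memory metric.
Cluster Autoscaler: Adds nodes when pods cannot be scheduled because of insufficient resources, and removes underutilized nodes whose pods can be rescheduled elsewhere.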
Rarity: Common
Difficulty: Medium
Advanced Terraform
5. Explain Terraform state management and best practices.
Answer: Terraform state maps configuration to real infrastructure, tracks metadata, and lets Terraform calculate safe changes. In a team, state should be remote, encrypted, locked during runs, versioned where possible, and split by environment or blast radius.
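Remote State Configuration: Store state in a remote backend such as S3, GCS, Azure Blob Storage, or HCP Terraform, with encryption and versioning enabled and access limited to the pipeline and a small set of operators.
State Locking: The backend should take a lock for every operation that can write state so that two runs cannot modify it at the same time. The S3 backend, for example, has commonly been paired with a DynamoDB table for this purpose.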
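The idea behind state locking is mutual exclusion through an atomic "create only if absent" operation. The sketch below illustrates that idea with a local lock file; it is a conceptual model, not how any Terraform backend is implemented, and `acquire_lock` and `release_lock` are hypothetical helpers.

```python
import os
import tempfile


def acquire_lock(lock_path: str, owner: str) -> bool:
    """Create the lock file atomically; False means another run holds it."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(owner)  # record the holder to help with stale locks
    return True


def release_lock(lock_path: str) -> None:
    """Remove the lock file so the next run can proceed."""
    os.remove(lock_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "state.lock")
        print(acquire_lock(path, "ci-run-1"))  # first run takes the lock
        print(acquire_lock(path, "ci-run-2"))  # second run is refused
        release_lock(path)
```

Recording the owner matters in practice: when a run crashes and leaves a stale lock, the team needs to know who held it before forcing an unlock.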
Best Practices:
1. Never commit state files to Git
2. Use separate backends for critical environments
Workspaces are useful for some workflows, but production isolation is usually clearer with separate root modules, separate backend keys, and separate access controls.
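3. Import existing resources intentionally, using terraform import or import blocks, and review the resulting plan before applying
4. Use state commands such as terraform state mv and terraform state rm instead of editing state JSON
5. Back up state before major changes, for example with terraform state pull or backend versioning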
6. Detect and handle drift
Run terraform plan in CI, review out-of-band changes, and decide whether to import, update code, or revert the manual change. Avoid using -target as a normal workflow because it can hide dependencies.
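One way to automate the drift check in CI is to rely on `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The wrapper below is a sketch: `check_drift`, `interpret_plan_exit_code`, and the status strings are assumptions, and it expects `terraform` to be on the PATH in an initialized working directory.

```python
import subprocess


def interpret_plan_exit_code(code: int) -> str:
    """Map the -detailed-exitcode result to a status for the pipeline."""
    if code == 0:
        return "in_sync"           # plan succeeded with an empty diff
    if code == 2:
        return "changes_detected"  # plan succeeded with a non-empty diff
    return "error"                 # exit code 1 or anything unexpected


def check_drift(workdir: str) -> str:
    """Run a read-only plan and report whether code and reality differ."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(result.returncode)
```

A scheduled job can call `check_drift` per environment and open a ticket or notify the owning team when the result is `changes_detected`, leaving the import, update, or revert decision to a human.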
Rarity: Very Common
Difficulty: Hard
6. How do you structure Terraform code for large projects?
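Answer: Split the codebase into reusable modules and thin per-environment root configurations so that each environment has its own state, its own access controls, and a limited blast radius.
Directory Structure: Keep reusable modules in one directory and one root configuration per environment (for example dev, staging, and production), each with its own backend configuration.
Module Example: A module should expose a small set of input variables and outputs, hide its internal resources, and avoid hard-coded environment values.
Using Modules: Environment roots call modules with pinned versions and environment-specific inputs, so a module change can be promoted from one environment to the next.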
Rarity: Common
Difficulty: Hard
Cloud Architecture
7. Design a highly available multi-region architecture on AWS.
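Answer: A multi-region design keeps the service available when a whole region fails. Start from the RTO and RPO the business needs, because they decide whether active-passive is enough or active-active is worth its cost and complexity.
Key Components:
1. DNS and Traffic Management: Use Route 53 health checks with failover or latency-based routing (or a global entry point such as Global Accelerator) to steer users away from an unhealthy region.
2. Database Replication: Replicate the primary datastore across regions, for example with Aurora Global Database, cross-region read replicas, or DynamoDB global tables, and be explicit about replication lag and the promotion procedure.
3. Data Replication: Replicate object storage and other stateful data, for example with S3 Cross-Region Replication, and include secrets, container images, and configuration in the plan.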
Design Principles:
- Active-active or active-passive setup
- Automated failover with health checks
- Data replication with minimal lag
- Consistent deployment across regions
- Monitoring and alerting for both regions
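The "automated failover with health checks" principle usually needs a threshold so that one failed probe does not move traffic. A minimal sketch of that decision follows; the function name, the region names in the usage example, and the default threshold of three consecutive failures are all assumptions.

```python
from typing import Optional


def pick_active_region(health_history: dict[str, list[bool]],
                       preferred: list[str],
                       failure_threshold: int = 3) -> Optional[str]:
    """Return the first preferred region that is not considered down.

    A region counts as down only after `failure_threshold` consecutive
    failed checks; regions with less history are treated as healthy.
    """
    for region in preferred:
        recent = health_history.get(region, [])[-failure_threshold:]
        is_down = len(recent) == failure_threshold and not any(recent)
        if not is_down:
            return region
    return None  # every region is down: escalate to a human
```

For example, `pick_active_region({"us-east-1": [True, False, False, False], "eu-west-1": [True, True, True]}, ["us-east-1", "eu-west-1"])` selects `eu-west-1`, because the primary has failed its last three checks.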
Rarity: Common
Difficulty: Hard
GitOps & CI/CD
8. Explain GitOps and how to implement it with ArgoCD.
Answer: GitOps uses Git as the source of truth for declarative infrastructure and application configuration. A controller such as Argo CD continuously compares Git with the cluster and reconciles drift.
Principles:
- Declarative configuration in Git
- Automated synchronization
- Version control for all changes
- Continuous reconciliation
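Argo CD Implementation: Define an Application resource that points at a Git repository, path, and target revision, and at a destination cluster and namespace. Enable automated sync with pruning and self-healing only where the team is ready for Git to overwrite manual changes.
Directory Structure: Keep a shared base with per-environment overlays (for example dev, staging, and production) so that differences between environments are explicit and reviewable.
Kustomization: Each overlay's kustomization.yaml references the base and applies environment-specific patches such as replica counts, image tags, and resource limits.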
Benefits:
- Git as audit trail
- Easy rollbacks (git revert)
- Declarative desired state
- Automated drift detection
- Multi-cluster management
Senior-level considerations:
- Use sync waves or hooks when resources must be applied in order, such as CRDs before custom resources or migrations before deployments.
- Add health checks, smoke tests, and progressive rollout controls so auto-sync does not turn a bad commit into a wider incident.
- Separate application repos from environment manifests when teams need different review and ownership boundaries.
- Protect production with branch rules, signed commits where required, policy checks, and clear rollback runbooks.
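Sync waves in Argo CD are set with the `argocd.argoproj.io/sync-wave` annotation: lower numbers are applied first and the default wave is 0. The sketch below only illustrates the grouping idea; `group_by_wave` is a hypothetical helper, and the real controller also orders by phase, kind, and name and waits for each wave to become healthy.

```python
SYNC_WAVE = "argocd.argoproj.io/sync-wave"


def wave_of(resource: dict) -> int:
    """Read the sync-wave annotation, defaulting to wave 0."""
    annotations = resource.get("metadata", {}).get("annotations", {})
    return int(annotations.get(SYNC_WAVE, "0"))


def group_by_wave(resources: list[dict]) -> list[list[str]]:
    """Return resource names grouped by wave, lowest wave first."""
    waves: dict[int, list[str]] = {}
    for resource in resources:
        name = resource["metadata"]["name"]
        waves.setdefault(wave_of(resource), []).append(name)
    return [waves[wave] for wave in sorted(waves)]
```

With a CRD in wave -1, a migration job in the default wave, and a Deployment in wave 1, the grouping yields the CRD first, then the migration, then the Deployment.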
Rarity: Common
Difficulty: Medium
Security & Compliance
9. How do you implement security best practices in Kubernetes?
Answer: Use a layered model: control access to the API, isolate workloads, protect secrets, verify images, enforce admission policies, and audit changes. The goal is to reduce both accidental exposure and attacker movement after a compromise.
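1. Pod Security Standards: Enforce the baseline or restricted profile per namespace with Pod Security Admission labels, and reserve the privileged profile for system workloads that need it.
2. RBAC (Role-Based Access Control): Grant least-privilege Roles and RoleBindings scoped to namespaces, avoid cluster-admin for humans and workloads, and review bindings regularly.
3. Network Policies: Start from a default-deny policy per namespace and allow only the flows each workload needs.
4. Secrets Management: Keep secrets out of Git and container images, restrict who can read Secret objects, and rotate credentials.
5. Security Context: Run as non-root, use a read-only root filesystem where possible, drop unneeded Linux capabilities, and disallow privilege escalation.
6. Image Scanning: Scan images in CI and in the registry, and block deployments that contain critical vulnerabilities without an approved exception.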
Additional controls to mention in senior interviews:
- Encrypt secrets at rest and prefer external secret stores for rotation and auditability.
- Use namespace boundaries, least-privilege service accounts, and short-lived credentials.
- Enforce admission policies for approved registries, signed images, required labels, and restricted pod settings.
- Keep cluster, node, and add-on versions current through tested upgrade windows.
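An admission policy is, at its core, a function from a pod spec to a list of violations. The sketch below shows that shape under stated assumptions: `registry.example.com/` is a hypothetical approved registry, `pod_violations` is a hypothetical helper, and only container-level `securityContext` fields are checked, whereas real policy engines also consider pod-level settings and init containers.

```python
APPROVED_REGISTRIES = ("registry.example.com/",)  # hypothetical allow-list


def pod_violations(pod_spec: dict) -> list[str]:
    """Return policy violations for the containers in a pod spec."""
    violations = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        context = container.get("securityContext", {})
        if not container.get("image", "").startswith(APPROVED_REGISTRIES):
            violations.append(f"{name}: image not from an approved registry")
        if context.get("privileged", False):
            violations.append(f"{name}: privileged containers are not allowed")
        # Treat an unset field as a violation so the safe value is explicit.
        if context.get("allowPrivilegeEscalation", True):
            violations.append(f"{name}: allowPrivilegeEscalation must be false")
        if not context.get("runAsNonRoot", False):
            violations.append(f"{name}: runAsNonRoot must be true")
    return violations
```

In a cluster this logic would normally live in an admission controller or policy engine, so that non-compliant pods are rejected at the API server instead of being found later in an audit.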
Rarity: Very Common
Difficulty: Hard
Observability & SRE
10. Design a comprehensive observability stack.
Answer: Design observability around user-facing SLOs, not just dashboards. Metrics, logs, and traces should help answer: is the service healthy, who is affected, what changed, and what action should the on-call engineer take?
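Architecture:
1. Metrics (Prometheus + Grafana): Prometheus scrapes and stores time series from services and infrastructure; Grafana provides dashboards on top of them.
2. Logging (Loki): Collect structured logs with a node-level agent and store them in Loki, which indexes labels instead of full log text to keep storage costs down.
3. Tracing with OpenTelemetry: Instrument services with OpenTelemetry SDKs and export spans through the OpenTelemetry Collector to a tracing backend, propagating trace context across service calls.
4. Alerting Rules: Define alerting rules in Prometheus and route notifications through Alertmanager with grouping, inhibition, and on-call routing.
5. SLO Monitoring: Define SLIs such as availability and latency, set SLO targets, and track error-budget consumption over time.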
Senior-level alerting rules:
- Page on symptoms that threaten the SLO, not every low-level metric.
- Use burn-rate alerts for fast and slow error-budget consumption.
- Keep dashboards useful for investigation, while alerts stay actionable.
- Include deployment, config, dependency, and infrastructure change events in the same investigation path.
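Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below uses the common multi-window pattern of paging only when both a long and a short window exceed the threshold; the function names are assumptions, and the default threshold of 14.4 corresponds to spending 2% of a 30-day budget in one hour.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_page(long_window_errors: float, short_window_errors: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast.

    The long window shows the problem is significant; the short window
    shows it is still happening, which avoids paging after recovery.
    """
    return (burn_rate(long_window_errors, slo_target) >= threshold
            and burn_rate(short_window_errors, slo_target) >= threshold)
```

A slower burn, for instance a lower threshold over a multi-hour window, is usually routed to a ticket instead of a page.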
Rarity: Common
Difficulty: Hard
Disaster Recovery
11. How do you implement disaster recovery for a Kubernetes cluster?
Answer: A good DR answer starts with business targets. Define RTO and RPO per service, know which dependencies must be restored first, test the restore path, and keep runbooks current.
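1. Backup Strategy: Back up cluster resources and persistent volumes on a schedule, for example with Velero, and store backups outside the cluster and ideally outside the region.
2. etcd Backup: On self-managed control planes, take regular etcd snapshots (etcdctl snapshot save) and copy them off the control-plane nodes. On managed Kubernetes the provider operates etcd, so focus on resources and data.
3. Restore Procedure: Restore into a clean cluster, then verify workloads, data integrity, DNS, certificates, and secrets before sending traffic.
4. Multi-Region Failover: Keep a standby cluster in a second region deployed from the same Git source, with replicated data and health-checked DNS failover.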
5. RTO/RPO Targets:
- RTO (Recovery Time Objective): how long the service can be unavailable
- RPO (Recovery Point Objective): how much data loss the business can tolerate
- Regular DR drills with restore validation
- Documented runbooks with owners and escalation paths
- Automated failover where the failure mode is well understood
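RTO and RPO targets become useful once they are compared with measured numbers. The sketch below makes a conservative estimate, assuming the worst-case data loss equals the backup interval plus the time a backup takes to complete, and that restore time comes from the latest DR drill; the function names and the example figures are assumptions.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta) -> timedelta:
    """Conservative estimate: failure just before the next backup completes."""
    return backup_interval + backup_duration


def meets_targets(backup_interval: timedelta, backup_duration: timedelta,
                  measured_restore: timedelta, rpo: timedelta,
                  rto: timedelta) -> dict[str, bool]:
    """Compare the backup schedule and the last drill against the targets."""
    loss = worst_case_data_loss(backup_interval, backup_duration)
    return {"rpo_met": loss <= rpo, "rto_met": measured_restore <= rto}
```

With six-hourly backups that take 30 minutes, a 45-minute measured restore, a 4-hour RPO, and a 1-hour RTO, the check reports that the RTO is met but the RPO is not, which points at the backup frequency as the gap.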
Rarity: Common
Difficulty: Hard
Service Mesh
12. Explain service mesh architecture and when to use it.
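Answer: A service mesh provides a dedicated infrastructure layer for service-to-service communication, moving concerns such as encryption, retries, routing, and telemetry out of application code.
Core Components: A data plane of proxies (commonly Envoy sidecars) that handle traffic for each workload, and a control plane that distributes configuration and certificates to those proxies.
Istio Implementation: Traffic behavior is declared with resources such as VirtualService (routing rules, traffic splitting, retries) and DestinationRule (subsets and per-destination policies).
Circuit Breaking: In Istio this is configured on a DestinationRule through connection pool limits and outlier detection, which temporarily ejects hosts that keep failing.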
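The circuit breaking pattern itself can be sketched in a few lines. This is a generic illustration of the state machine (closed, open, half-open), not how Istio or Envoy implement it; the class, its parameters, and the injectable `clock` are assumptions made so the behavior is easy to test.

```python
import time


class CircuitBreaker:
    """Open after repeated failures, then allow a trial call after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: once the timeout has passed, let a trial request through.
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # close the circuit again

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed trial
```

The value of doing this in the mesh instead of in each service is consistency: every caller gets the same protection without every team re-implementing it.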
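Mutual TLS: The mesh issues workload certificates and can encrypt and authenticate traffic between services. In Istio, a PeerAuthentication policy in STRICT mode rejects plaintext traffic.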
When to Use:
- Microservices with complex communication patterns
- Need for advanced traffic management
- Security requirements (mTLS, authorization)
- Observability across services
- Gradual rollouts and A/B testing
Trade-offs:
- Added complexity and latency
- Resource overhead (sidecar proxies)
- Learning curve
- Debugging challenges
Rarity: Common
Difficulty: Hard
Cost Optimization
13. How do you optimize cloud infrastructure costs?
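Answer: Treat cost as an ongoing engineering signal: make spend visible per team and service, remove waste first, and only then commit to discounts for the baseline you are confident will remain.
1. Right-Sizing Resources: Compare requested CPU and memory with actual utilization over a representative period and resize instances, containers, and databases accordingly.
2. Reserved Instances & Savings Plans: Commit to the steady baseline only, and leave variable load on on-demand or spot capacity.
3. Auto-Scaling Policies: Scale on demand signals and on schedules so capacity follows traffic, including scaling non-production environments down outside working hours.
4. Storage Optimization: Apply lifecycle policies, move cold data to cheaper tiers, and delete unattached volumes and stale snapshots.
5. Spot Instances for Non-Critical Workloads: Use spot capacity for interruption-tolerant work such as batch jobs, CI runners, and stateless workers, and handle interruption notices gracefully.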
Cost Optimization Checklist:
- Right-size instances based on actual usage
- Use Reserved Instances for predictable workloads
- Implement auto-scaling
- Delete unused resources (EBS volumes, snapshots, IPs)
- Use spot instances for batch jobs
- Optimize data transfer costs
- Implement S3 lifecycle policies
- Use CloudFront for static content
- Monitor and set budget alerts
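The decision between a commitment and on-demand pricing comes down to a break-even utilization. The sketch below shows the arithmetic under simplifying assumptions: a committed price is paid for every hour regardless of use, and the prices in the usage example are hypothetical, not real AWS rates.

```python
def break_even_utilization(on_demand_hourly: float,
                           committed_hourly: float) -> float:
    """Fraction of hours a resource must run for the commitment to pay off."""
    return committed_hourly / on_demand_hourly


def cheaper_option(on_demand_hourly: float, committed_hourly: float,
                   expected_utilization: float) -> str:
    """Compare effective hourly cost at the expected utilization."""
    # On-demand is billed only for the hours actually used.
    on_demand_cost = on_demand_hourly * expected_utilization
    return "commitment" if committed_hourly < on_demand_cost else "on-demand"
```

With a hypothetical on-demand rate of 0.10 per hour and a committed rate of 0.06 per hour, the break-even is about 60% utilization: a workload expected to run 90% of the time favors the commitment, while one running 40% of the time does not.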
Rarity: Very Common
Difficulty: Medium
Performance Tuning
14. How do you troubleshoot and optimize application performance?
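Answer: Measure before changing anything. Establish a baseline, find the bottleneck, change one thing, and confirm the effect under realistic load.
1. Profiling and Monitoring: Use latency percentiles, resource metrics, traces, and profilers to locate where time is actually spent.
2. Database Optimization: Inspect slow queries and their execution plans, add or fix indexes, remove N+1 query patterns, and check connection limits.
3. Caching Strategy: Cache expensive, frequently read data with explicit TTLs and an invalidation plan, and measure the hit rate.
4. Load Testing: Reproduce realistic traffic with a load-testing tool such as k6, Locust, or JMeter, and test beyond expected peak to find the breaking point.
5. Application-Level Optimization: Use non-blocking I/O where it fits, pool connections, batch calls, and avoid unnecessary serialization.
6. CDN and Static Asset Optimization: Serve static assets from a CDN with compression and long cache lifetimes on versioned file names.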
Performance Optimization Checklist:
- Profile code to identify bottlenecks
- Optimize database queries and add indexes
- Implement caching (Redis, Memcached)
- Use CDN for static assets
- Enable compression (gzip, brotli)
- Optimize images and assets
- Use async/await for I/O operations
- Implement connection pooling
- Monitor application metrics
- Load test before production
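Latency should be judged by percentiles, because averages hide the slow tail that users notice. The sketch below computes a percentile with the nearest-rank method; the function name is an assumption, and monitoring systems often use interpolation or histogram buckets instead, so their values can differ slightly.

```python
def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range 1..100")
    ordered = sorted(samples)
    # Ceiling division with integers avoids floating-point rounding issues.
    rank = -(-p * len(ordered) // 100)
    return ordered[rank - 1]
```

For 99 requests at 100 ms and one at 5000 ms, the mean is 149 ms, while the median is 100 ms and the maximum is 5000 ms; comparing p50, p95, and p99 before and after a change shows whether the tail actually improved.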
Rarity: Very Common
Difficulty: Hard
Incident Management
15. Describe your approach to incident management and post-mortems.
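Answer: Effective incident management restores service quickly, keeps stakeholders informed, and turns what was learned into changes that make recurrence less likely.
Incident Response Process:
1. Detection and Alert: Alerts tied to user-facing symptoms page the on-call engineer, who acknowledges and declares an incident when impact is confirmed.
2. Incident Classification: Assign a severity based on user impact and scope, for example from a full outage down to a minor degradation, and let severity drive who is paged and how often updates go out.
3. Communication Template: State what is affected, the user impact, the current status, what is being done, and when the next update will be posted.
4. Post-Mortem Template: Cover the summary, impact, timeline, root cause and contributing factors, what went well and what did not, and action items.
5. Runbook Example: A runbook lists the symptoms, diagnostic commands, mitigation steps, rollback procedure, and escalation contacts for one failure mode.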
Best Practices:
- Blameless post-mortems
- Focus on systems, not individuals
- Document action items with owners and deadlines
- Share learnings across organization
- Regular incident response drills
- Maintain updated runbooks
- Track MTTR (Mean Time To Recovery)
- Implement preventive measures
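MTTR is only comparable over time if it is computed the same way each time. The sketch below averages the time from detection to resolution; the function name and the choice of detection as the starting point are assumptions, since some teams measure from the start of impact instead.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(
    incidents: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average of (resolved - detected) across incidents."""
    if not incidents:
        raise ValueError("no incidents")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so timedeltas can be summed
    )
    return total / len(incidents)
```

Because a single long outage can dominate the mean, it helps to review the distribution or the median alongside MTTR when reporting trends.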
Rarity: Very Common
Difficulty: Medium
Conclusion
Senior DevOps engineer interviews reward practical production judgment. Prepare to explain not only what tool you would use, but why it fits the risk, team, and service maturity.
- Kubernetes: Architecture, networking, troubleshooting, security
- Terraform: State management, modules, best practices
- Cloud Architecture: Multi-region, high availability, disaster recovery
- GitOps: Declarative infrastructure, ArgoCD, automation
- Security: RBAC, network policies, secrets management, compliance
- Observability: Metrics, logs, traces, SLOs, alerting
- SRE Practices: Reliability, incident response, capacity planning
When you practice, turn each answer into a short scenario: the failure signal, your first checks, the likely root causes, the mitigation, the long-term fix, and the trade-off you would explain to the team. That is the difference between a memorized DevOps answer and a senior-level interview answer.


