Senior DevOps Engineer Interview Questions: Complete Guide

Milad Bonakdar
Master advanced DevOps concepts with comprehensive interview questions covering Kubernetes, Terraform, cloud architecture, GitOps, security, SRE practices, and high availability for senior DevOps engineers.
Introduction
Senior DevOps engineers are expected to architect scalable infrastructure, implement advanced automation, ensure security and compliance, and drive DevOps culture across organizations. This role demands deep expertise in container orchestration, infrastructure as code, cloud architecture, and site reliability engineering.
This comprehensive guide covers essential interview questions for senior DevOps engineers, focusing on advanced concepts, production systems, and strategic thinking. Each question includes detailed explanations and practical examples.
Advanced Kubernetes
1. Explain Kubernetes architecture and the role of key components.
Answer: Kubernetes follows a control plane/worker node architecture:
Control Plane Components:
- API Server: Frontend for Kubernetes control plane, handles all REST requests
- etcd: Distributed key-value store for cluster state
- Scheduler: Assigns pods to nodes based on resource requirements
- Controller Manager: Runs controller processes (replication, endpoints, etc.)
- Cloud Controller Manager: Integrates with cloud provider APIs
Node Components:
- kubelet: Agent that ensures containers are running in pods
- kube-proxy: Maintains network rules for pod communication
- Container Runtime: Runs containers (Docker, containerd, CRI-O)
How it works:
- User submits deployment via kubectl
- API Server validates and stores in etcd
- Scheduler assigns pods to nodes
- kubelet on node creates containers
- kube-proxy configures networking
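The same flow can be traced end to end with standard kubectl commands. A minimal sketch (the manifest web.yaml and deployment name web are placeholders):
# Submit a Deployment through the API server
kubectl apply -f web.yaml
# The scheduler picks nodes; -o wide shows where each pod landed
kubectl get pods -l app=web -o wide
# Events show scheduling, image pulls, and container starts in order
kubectl get events --sort-by=.metadata.creationTimestamp
# Read the object back exactly as stored via the API server (backed by etcd)
kubectl get deployment web -o yaml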
Rarity: Very Common
Difficulty: Hard
2. How do you troubleshoot a pod stuck in CrashLoopBackOff?
Answer: Systematic debugging approach:
# 1. Check pod status and events
kubectl describe pod <pod-name>
# Look for: Image pull errors, resource limits, failed health checks
# 2. Check logs
kubectl logs <pod-name>
kubectl logs <pod-name> --previous # Previous container logs
# 3. Check resource constraints
kubectl top pod <pod-name>
kubectl describe node <node-name>
# 4. Check liveness/readiness probes
kubectl get pod <pod-name> -o yaml | grep -A 10 livenessProbe
# 5. Exec into container (if it stays up briefly)
kubectl exec -it <pod-name> -- /bin/sh
# 6. Check image
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].image}'
docker pull <image> # Test locally
# 7. Check ConfigMaps/Secrets
kubectl get configmap
kubectl get secret
# 8. Review deployment/pod spec
kubectl get deployment <deployment-name> -o yaml
Common causes:
- Application crashes on startup
- Missing environment variables
- Incorrect liveness probe configuration
- Insufficient resources (OOMKilled)
- Image pull errors
- Missing dependencies
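Several of these causes are visible directly in the pod status; for example, an OOMKill can be confirmed from the last terminated state (a quick check, <pod-name> is a placeholder):
# Reason for the previous termination (e.g., OOMKilled, Error)
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
# Exit code of the previous termination (137 usually means SIGKILL/OOM)
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'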
Example fix:
# Increase resource limits
resources:
limits:
memory: "512Mi"
cpu: "500m"
requests:
memory: "256Mi"
cpu: "250m"
# Adjust probe timing
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30 # Give app time to start
periodSeconds: 10
failureThreshold: 3
Rarity: Very Common
Difficulty: Medium
3. Explain Kubernetes networking: Services, Ingress, and Network Policies.
Answer: Kubernetes networking layers:
Services: Types of service exposure:
# ClusterIP (internal only)
apiVersion: v1
kind: Service
metadata:
name: backend
spec:
type: ClusterIP
selector:
app: backend
ports:
- port: 80
targetPort: 8080
# NodePort (external access via node IP)
spec:
type: NodePort
ports:
- port: 80
targetPort: 8080
nodePort: 30080
# LoadBalancer (cloud load balancer)
spec:
type: LoadBalancer
ports:
- port: 80
targetPort: 8080
Ingress: HTTP/HTTPS routing:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: app-ingress
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
spec:
rules:
- host: api.example.com
http:
paths:
- path: /v1
pathType: Prefix
backend:
service:
name: api-v1
port:
number: 80
- path: /v2
pathType: Prefix
backend:
service:
name: api-v2
port:
number: 80
tls:
- hosts:
- api.example.com
secretName: api-tls
Network Policies: Control pod-to-pod communication:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: backend-policy
spec:
podSelector:
matchLabels:
app: backend
policyTypes:
- Ingress
- Egress
ingress:
- from:
- podSelector:
matchLabels:
app: frontend
ports:
- protocol: TCP
port: 8080
egress:
- to:
- podSelector:
matchLabels:
app: database
ports:
- protocol: TCP
port: 5432
Rarity: Very Common
Difficulty: Hard
4. How do you implement autoscaling in Kubernetes?
Answer: Multiple autoscaling strategies:
Horizontal Pod Autoscaler (HPA):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: app
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 15
Vertical Pod Autoscaler (VPA):
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: app-vpa
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: app
updatePolicy:
updateMode: "Auto" # or "Recreate", "Initial", "Off"
resourcePolicy:
containerPolicies:
- containerName: app
minAllowed:
cpu: 100m
memory: 128Mi
maxAllowed:
cpu: 2
memory: 2Gi
Cluster Autoscaler: Adjusts the number of nodes based on pending pods and node utilization. It is typically configured through flags on the cluster-autoscaler Deployment (or via node-group auto-discovery tags), for example:
# AWS example: cluster-autoscaler container args
command:
- ./cluster-autoscaler
- --cloud-provider=aws
- --nodes=2:10:my-node-group-asg
- --scale-down-delay-after-add=10m
- --scale-down-unneeded-time=10m
Rarity: Common
Difficulty: Medium
Advanced Terraform
5. Explain Terraform state management and best practices.
Answer: Terraform state maps your configuration to the real resources it manages, so storing, locking, and protecting it correctly is critical for day-to-day operations.
Remote State Configuration:
# backend.tf
terraform {
backend "s3" {
bucket = "my-terraform-state"
key = "prod/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-locks"
}
}
State Locking:
# DynamoDB table for state locking
resource "aws_dynamodb_table" "terraform_locks" {
name = "terraform-locks"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
}
Best Practices:
1. Never commit state files to Git
# .gitignore
*.tfstate
*.tfstate.*
.terraform/
2. Use workspaces for environments
terraform workspace new dev
terraform workspace new staging
terraform workspace new prod
terraform workspace select dev
terraform apply
3. Import existing resources
# Import existing EC2 instance
terraform import aws_instance.web i-1234567890abcdef0
# Verify
terraform plan
4. State manipulation (use carefully)
# List resources in state
terraform state list
# Show specific resource
terraform state show aws_instance.web
# Move resource in state
terraform state mv aws_instance.old aws_instance.new
# Remove resource from state (doesn't delete)
terraform state rm aws_instance.web
5. Backup state before major changes
terraform state pull > backup.tfstate
Rarity: Very Common
Difficulty: Hard
6. How do you structure Terraform code for large projects?
Answer: Modular structure for maintainability:
Directory Structure:
terraform/
├── environments/
│ ├── dev/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ ├── terraform.tfvars
│ │ └── backend.tf
│ ├── staging/
│ └── prod/
├── modules/
│ ├── vpc/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ ├── outputs.tf
│ │ └── README.md
│ ├── eks/
│ ├── rds/
│ └── s3/
└── global/
├── iam/
└── route53/
Module Example:
# modules/vpc/main.tf
resource "aws_vpc" "main" {
cidr_block = var.vpc_cidr
enable_dns_hostnames = true
enable_dns_support = true
tags = merge(
var.tags,
{
Name = "${var.environment}-vpc"
}
)
}
resource "aws_subnet" "private" {
count = length(var.private_subnet_cidrs)
vpc_id = aws_vpc.main.id
cidr_block = var.private_subnet_cidrs[count.index]
availability_zone = var.availability_zones[count.index]
tags = merge(
var.tags,
{
Name = "${var.environment}-private-${count.index + 1}"
Type = "private"
}
)
}
# modules/vpc/variables.tf
variable "vpc_cidr" {
description = "CIDR block for VPC"
type = string
}
variable "environment" {
description = "Environment name"
type = string
}
variable "private_subnet_cidrs" {
description = "CIDR blocks for private subnets"
type = list(string)
}
variable "availability_zones" {
description = "Availability zones"
type = list(string)
}
variable "tags" {
description = "Common tags"
type = map(string)
default = {}
}
# modules/vpc/outputs.tf
output "vpc_id" {
value = aws_vpc.main.id
}
output "private_subnet_ids" {
value = aws_subnet.private[*].id
}
Using Modules:
# environments/prod/main.tf
module "vpc" {
source = "../../modules/vpc"
vpc_cidr = "10.0.0.0/16"
environment = "prod"
private_subnet_cidrs = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
tags = {
Project = "MyApp"
ManagedBy = "Terraform"
}
}
module "eks" {
source = "../../modules/eks"
cluster_name = "prod-cluster"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnet_ids
node_group_size = 3
}
Rarity: Common
Difficulty: Hard
Cloud Architecture
7. Design a highly available multi-region architecture on AWS.
Answer: Multi-region architecture for high availability:
Key Components:
1. DNS and Traffic Management:
# Route 53 with health checks
resource "aws_route53_health_check" "primary" {
fqdn = "api.example.com"
port = 443
type = "HTTPS"
resource_path = "/health"
failure_threshold = 3
request_interval = 30
}
resource "aws_route53_record" "api" {
zone_id = aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
failover_routing_policy {
type = "PRIMARY"
}
set_identifier = "primary"
health_check_id = aws_route53_health_check.primary.id
alias {
name = aws_lb.primary.dns_name
zone_id = aws_lb.primary.zone_id
evaluate_target_health = true
}
}
2. Database Replication:
# RDS with cross-region read replica
resource "aws_db_instance" "primary" {
identifier = "prod-db-primary"
engine = "postgres"
instance_class = "db.r5.xlarge"
multi_az = true
backup_retention_period = 7
provider = aws.us-east-1
}
resource "aws_db_instance" "replica" {
identifier = "prod-db-replica"
replicate_source_db = aws_db_instance.primary.arn
instance_class = "db.r5.xlarge"
auto_minor_version_upgrade = false
provider = aws.us-west-2
}
3. Data Replication:
# S3 cross-region replication
resource "aws_s3_bucket_replication_configuration" "replication" {
bucket = aws_s3_bucket.source.id
role = aws_iam_role.replication.arn
rule {
id = "replicate-all"
status = "Enabled"
destination {
bucket = aws_s3_bucket.destination.arn
storage_class = "STANDARD"
}
}
}
Design Principles:
- Active-active or active-passive setup
- Automated failover with health checks
- Data replication with minimal lag
- Consistent deployment across regions
- Monitoring and alerting for both regions
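To exercise the failover path described above during a DR drill, the health check and the DNS answer can be spot-checked from the CLI. A sketch (the health-check ID is a placeholder):
# Confirm the primary health check is currently passing
aws route53 get-health-check-status --health-check-id <health-check-id>
# See which endpoint the failover record resolves to right now
dig +short api.example.com
# Break the primary /health endpoint intentionally, wait for the failure
# threshold to be crossed, then re-run dig; it should return the secondary target
dig +short api.example.com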
Rarity: Common
Difficulty: Hard
GitOps & CI/CD
8. Explain GitOps and how to implement it with ArgoCD.
Answer: GitOps uses Git as the single source of truth for declarative infrastructure and applications.
Principles:
- Declarative configuration in Git
- Automated synchronization
- Version control for all changes
- Continuous reconciliation
ArgoCD Implementation:
# Application manifest
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: myapp
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/org/app-manifests
targetRevision: main
path: k8s/overlays/production
destination:
server: https://kubernetes.default.svc
namespace: production
syncPolicy:
automated:
prune: true
selfHeal: true
allowEmpty: false
syncOptions:
- CreateNamespace=true
retry:
limit: 5
backoff:
duration: 5s
factor: 2
maxDuration: 3m
Directory Structure:
app-manifests/
├── base/
│ ├── deployment.yaml
│ ├── service.yaml
│ └── kustomization.yaml
└── overlays/
├── dev/
│ ├── kustomization.yaml
│ └── patches/
├── staging/
└── production/
├── kustomization.yaml
├── replicas.yaml
└── resources.yaml
Kustomization:
# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
bases:
- ../../base
replicas:
- name: myapp
count: 5
resources:
- ingress.yaml
patches:
- path: resources.yaml
target:
kind: Deployment
name: myapp
Benefits:
- Git as audit trail
- Easy rollbacks (git revert)
- Declarative desired state
- Automated drift detection
- Multi-cluster management
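Rollbacks then become ordinary Git operations plus a sync. A minimal sketch with the argocd CLI, assuming the myapp Application defined above:
# Revert the bad change in the manifests repository
git revert <bad-commit-sha>
git push origin main
# ArgoCD reconciles automatically; sync and inspect explicitly if needed
argocd app sync myapp
argocd app history myapp
# Or roll back to a previously synced revision by its history ID
argocd app rollback myapp <history-id>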
Rarity: Common
Difficulty: Medium
Security & Compliance
9. How do you implement security best practices in Kubernetes?
Answer: Multi-layered security approach:
1. Pod Security Standards:
apiVersion: v1
kind: Namespace
metadata:
name: production
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/warn: restricted
2. RBAC (Role-Based Access Control):
# Role for developers
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
namespace: production
name: developer
rules:
- apiGroups: ["", "apps"]
resources: ["pods", "deployments", "services"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["pods/log"]
verbs: ["get"]
# RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: developer-binding
namespace: production
subjects:
- kind: Group
name: developers
apiGroup: rbac.authorization.k8s.io
roleRef:
kind: Role
name: developer
apiGroup: rbac.authorization.k8s.io
3. Network Policies:
# Default deny all ingress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-ingress
namespace: production
spec:
podSelector: {}
policyTypes:
- Ingress
4. Secrets Management:
# External Secrets Operator
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: app-secrets
spec:
refreshInterval: 1h
secretStoreRef:
name: aws-secrets-manager
kind: SecretStore
target:
name: app-secrets
creationPolicy: Owner
data:
- secretKey: database-password
remoteRef:
key: prod/database
property: password
5. Security Context:
apiVersion: v1
kind: Pod
metadata:
name: secure-pod
spec:
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 2000
seccompProfile:
type: RuntimeDefault
containers:
- name: app
image: myapp:1.0
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
volumeMounts:
- name: tmp
mountPath: /tmp
volumes:
- name: tmp
emptyDir: {}
6. Image Scanning:
# Admission controller with OPA
apiVersion: v1
kind: ConfigMap
metadata:
name: opa-policy
data:
policy.rego: |
package kubernetes.admission
deny[msg] {
input.request.kind.kind == "Pod"
image := input.request.object.spec.containers[_].image
not startswith(image, "registry.company.com/")
msg := sprintf("Image %v is not from approved registry", [image])
}
Rarity: Very Common
Difficulty: Hard
Observability & SRE
10. Design a comprehensive observability stack.
Answer: Three pillars of observability: Metrics, Logs, Traces
Architecture:
1. Metrics (Prometheus + Grafana):
# ServiceMonitor for app metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: app-metrics
spec:
selector:
matchLabels:
app: myapp
endpoints:
- port: metrics
interval: 30s
path: /metrics
2. Logging (Loki):
# Promtail config for log collection
apiVersion: v1
kind: ConfigMap
metadata:
name: promtail-config
data:
promtail.yaml: |
server:
http_listen_port: 9080
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: kubernetes-pods
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
target_label: app
- source_labels: [__meta_kubernetes_namespace]
target_label: namespace
3. Tracing (Jaeger):
# Application instrumentation
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Setup tracing
trace.set_tracer_provider(TracerProvider())
jaeger_exporter = JaegerExporter(
agent_host_name="jaeger-agent",
agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(
BatchSpanProcessor(jaeger_exporter)
)
tracer = trace.get_tracer(__name__)
# Use in code
with tracer.start_as_current_span("process_request"):
# Your code here
pass
4. Alerting Rules:
# PrometheusRule
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: app-alerts
spec:
groups:
- name: app
interval: 30s
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value | humanizePercentage }}"
- alert: HighLatency
expr: |
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
) > 1
for: 10m
labels:
severity: warning
annotations:
summary: "High latency detected"5. SLO Monitoring:
# SLO definition
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
name: api-availability
spec:
service: "api"
labels:
team: "platform"
slos:
- name: "requests-availability"
objective: 99.9
description: "API requests should succeed"
sli:
events:
errorQuery: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
totalQuery: sum(rate(http_requests_total[{{.window}}]))
alerting:
pageAlert:
labels:
severity: critical
ticketAlert:
labels:
severity: warning
Rarity: Common
Difficulty: Hard
Disaster Recovery
11. How do you implement disaster recovery for a Kubernetes cluster?
Answer: Comprehensive DR strategy:
1. Backup Strategy:
# Velero backup schedule
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: daily-backup
namespace: velero
spec:
schedule: "0 2 * * *" # 2 AM daily
template:
includedNamespaces:
- production
- staging
excludedResources:
- events
- events.events.k8s.io
storageLocation: aws-s3
volumeSnapshotLocations:
- aws-ebs
ttl: 720h # 30 days
2. etcd Backup:
#!/bin/bash
# Automated etcd backup script
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db
# Upload to S3
aws s3 cp /backup/etcd-snapshot-*.db s3://etcd-backups/
# Cleanup old backups
find /backup -name "etcd-snapshot-*.db" -mtime +7 -delete
3. Restore Procedure:
# Restore etcd from snapshot
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
--data-dir=/var/lib/etcd-restore \
--initial-cluster=etcd-0=https://10.0.1.10:2380 \
--initial-advertise-peer-urls=https://10.0.1.10:2380
# Restore application with Velero
velero restore create --from-backup daily-backup-20231125
velero restore describe <restore-name>
4. Multi-Region Failover:
# Terraform for multi-region setup
module "primary_cluster" {
source = "./modules/eks"
region = "us-east-1"
# ... configuration
}
module "dr_cluster" {
source = "./modules/eks"
region = "us-west-2"
# ... configuration
}
# Route 53 health check and failover
resource "aws_route53_health_check" "primary" {
fqdn = module.primary_cluster.endpoint
port = 443
type = "HTTPS"
resource_path = "/healthz"
failure_threshold = 3
}
5. RTO/RPO Targets:
- RTO (Recovery Time Objective): < 1 hour
- RPO (Recovery Point Objective): < 15 minutes
- Regular DR drills (monthly)
- Documented runbooks
- Automated failover where possible
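A backup only counts if it restores. A sketch of checks a monthly DR drill might run (backup and snapshot names are placeholders):
# Verify recent Velero backups completed successfully
velero backup get
velero backup describe daily-backup-20231125 --details
# Verify etcd snapshot integrity before relying on it
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-20231125.db --write-out=table
# Test-restore into a scratch namespace to validate the procedure end to end
velero restore create drill-restore --from-backup daily-backup-20231125 \
--namespace-mappings production:dr-drill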
Rarity: Common
Difficulty: Hard
Service Mesh
12. Explain service mesh architecture and when to use it.
Answer: A service mesh provides an infrastructure layer for service-to-service communication, moving traffic management, security, and telemetry out of application code.
Core Components: a data plane of sidecar proxies (e.g., Envoy) that mediate traffic between services, and a control plane (istiod in Istio) that configures those proxies.
Istio Implementation:
# Virtual Service for traffic routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: reviews
spec:
hosts:
- reviews
http:
- match:
- headers:
end-user:
exact: jason
route:
- destination:
host: reviews
subset: v2
- route:
- destination:
host: reviews
subset: v1
weight: 80
- destination:
host: reviews
subset: v2
weight: 20
# Destination Rule
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: reviews
spec:
host: reviews
trafficPolicy:
loadBalancer:
simple: LEAST_REQUEST
subsets:
- name: v1
labels:
version: v1
- name: v2
labels:
version: v2
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
http1MaxPendingRequests: 50
http2MaxRequests: 100
Circuit Breaking:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: backend
spec:
host: backend
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
http1MaxPendingRequests: 10
maxRequestsPerConnection: 2
outlierDetection:
consecutiveErrors: 5
interval: 30s
baseEjectionTime: 30s
maxEjectionPercent: 50
Mutual TLS:
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: production
spec:
mtls:
mode: STRICT
# Authorization Policy
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: frontend-policy
spec:
selector:
matchLabels:
app: frontend
action: ALLOW
rules:
- from:
- source:
principals: ["cluster.local/ns/default/sa/gateway"]
to:
- operation:
methods: ["GET", "POST"]When to Use:
- Microservices with complex communication patterns
- Need for advanced traffic management
- Security requirements (mTLS, authorization)
- Observability across services
- Gradual rollouts and A/B testing
Trade-offs:
- Added complexity and latency
- Resource overhead (sidecar proxies)
- Learning curve
- Debugging challenges
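Adoption is usually incremental: enable sidecar injection for one namespace, then verify mesh health before expanding. A sketch assuming istioctl is installed:
# Opt a namespace into sidecar injection; restart workloads to pick it up
kubectl label namespace production istio-injection=enabled
kubectl rollout restart deployment -n production
# Validate configuration and confirm proxies are in sync with the control plane
istioctl analyze -n production
istioctl proxy-status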
Rarity: Common
Difficulty: Hard
Cost Optimization
13. How do you optimize cloud infrastructure costs?
Answer: Cost optimization requires continuous monitoring and strategic decisions.
1. Right-Sizing Resources:
# AWS Cost Explorer analysis script
import boto3
import pandas as pd
from datetime import datetime, timedelta
class CostOptimizer:
def __init__(self):
self.ce_client = boto3.client('ce')
self.ec2_client = boto3.client('ec2')
self.cloudwatch = boto3.client('cloudwatch')
def analyze_underutilized_instances(self, days=30):
"""Find underutilized EC2 instances"""
end_date = datetime.now()
start_date = end_date - timedelta(days=days)
instances = self.ec2_client.describe_instances()
recommendations = []
for reservation in instances['Reservations']:
for instance in reservation['Instances']:
instance_id = instance['InstanceId']
# Get CPU utilization
cpu_stats = self.cloudwatch.get_metric_statistics(
Namespace='AWS/EC2',
MetricName='CPUUtilization',
Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
StartTime=start_date,
EndTime=end_date,
Period=3600,
Statistics=['Average']
)
avg_cpu = sum(d['Average'] for d in cpu_stats['Datapoints']) / len(cpu_stats['Datapoints']) if cpu_stats['Datapoints'] else 0
if avg_cpu < 10:
recommendations.append({
'instance_id': instance_id,
'instance_type': instance['InstanceType'],
'avg_cpu': avg_cpu,
'recommendation': 'Consider downsizing or terminating'
})
return pd.DataFrame(recommendations)
def get_cost_by_service(self, days=30):
"""Get cost breakdown by service"""
end_date = datetime.now().strftime('%Y-%m-%d')
start_date = (datetime.now() - timedelta(days=days)).strftime('%Y-%m-%d')
response = self.ce_client.get_cost_and_usage(
TimePeriod={'Start': start_date, 'End': end_date},
Granularity='MONTHLY',
Metrics=['UnblendedCost'],
GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
)
costs = []
for result in response['ResultsByTime']:
for group in result['Groups']:
costs.append({
'service': group['Keys'][0],
'cost': float(group['Metrics']['UnblendedCost']['Amount'])
})
return pd.DataFrame(costs).sort_values('cost', ascending=False)
# Usage
optimizer = CostOptimizer()
underutilized = optimizer.analyze_underutilized_instances()
print("Underutilized Instances:")
print(underutilized)
costs = optimizer.get_cost_by_service()
print("\nTop 10 Services by Cost:")
print(costs.head(10))
2. Reserved Instances & Savings Plans:
# Terraform for Reserved Instances
resource "aws_ec2_capacity_reservation" "production" {
instance_type = "m5.xlarge"
instance_platform = "Linux/UNIX"
availability_zone = "us-east-1a"
instance_count = 10
tags = {
Environment = "production"
CostCenter = "engineering"
}
}
3. Auto-Scaling Policies:
# Kubernetes HPA with custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: cost-optimized-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
minReplicas: 2
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "1000"
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 15
- type: Pods
value: 4
periodSeconds: 15
selectPolicy: Max
4. Storage Optimization:
# S3 lifecycle policy
aws s3api put-bucket-lifecycle-configuration \
--bucket my-bucket \
--lifecycle-configuration '{
"Rules": [
{
"Id": "Archive old logs",
"Status": "Enabled",
"Filter": {"Prefix": "logs/"},
"Transitions": [
{
"Days": 30,
"StorageClass": "STANDARD_IA"
},
{
"Days": 90,
"StorageClass": "GLACIER"
}
],
"Expiration": {"Days": 365}
}
]
}'
5. Spot Instances for Non-Critical Workloads:
# Kubernetes with spot instances
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
name: spot-provisioner
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot"]
- key: node.kubernetes.io/instance-type
operator: In
values: ["m5.large", "m5.xlarge", "m5.2xlarge"]
limits:
resources:
cpu: 1000
ttlSecondsAfterEmpty: 30
ttlSecondsUntilExpired: 604800
Cost Optimization Checklist:
- Right-size instances based on actual usage
- Use Reserved Instances for predictable workloads
- Implement auto-scaling
- Delete unused resources (EBS volumes, snapshots, IPs)
- Use spot instances for batch jobs
- Optimize data transfer costs
- Implement S3 lifecycle policies
- Use CloudFront for static content
- Monitor and set budget alerts
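Several checklist items can be audited with one-off CLI queries. A sketch for spotting obviously idle resources (region and profile flags omitted):
# Unattached EBS volumes (still billed)
aws ec2 describe-volumes --filters Name=status,Values=available \
--query 'Volumes[].{ID:VolumeId,Size:Size,AZ:AvailabilityZone}' --output table
# Elastic IPs not associated with anything (billed while idle)
aws ec2 describe-addresses --query 'Addresses[?AssociationId==`null`].PublicIp'
# Snapshots owned by this account, oldest candidates for cleanup
aws ec2 describe-snapshots --owner-ids self \
--query 'Snapshots[].{ID:SnapshotId,Started:StartTime}' --output table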
Rarity: Very Common
Difficulty: Medium
Performance Tuning
14. How do you troubleshoot and optimize application performance?
Answer: Systematic approach to performance optimization:
1. Profiling and Monitoring:
# Python application profiling
import cProfile
import pstats
from functools import wraps
import time
def profile_function(func):
"""Decorator to profile function performance"""
@wraps(func)
def wrapper(*args, **kwargs):
profiler = cProfile.Profile()
profiler.enable()
start_time = time.time()
result = func(*args, **kwargs)
end_time = time.time()
profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10)
print(f"\nTotal execution time: {end_time - start_time:.4f} seconds")
return result
return wrapper
@profile_function
def slow_function():
# Your code here
pass
2. Database Optimization:
-- Identify slow queries
SELECT
query,
calls,
total_time,
mean_time,
max_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;
-- Add indexes for frequently queried columns
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
CREATE INDEX CONCURRENTLY idx_orders_user_created
ON orders(user_id, created_at DESC);
-- Analyze query plan
EXPLAIN ANALYZE
SELECT u.name, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2024-01-01'
GROUP BY u.id, u.name
HAVING COUNT(o.id) > 5;
3. Caching Strategy:
# Redis caching implementation
import redis
import json
from functools import wraps
class CacheManager:
def __init__(self, redis_host='localhost', redis_port=6379):
self.redis_client = redis.Redis(
host=redis_host,
port=redis_port,
decode_responses=True
)
def cache(self, ttl=3600):
"""Decorator for caching function results"""
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
# Generate cache key
cache_key = f"{func.__name__}:{str(args)}:{str(kwargs)}"
# Try to get from cache
cached_result = self.redis_client.get(cache_key)
if cached_result:
return json.loads(cached_result)
# Execute function
result = func(*args, **kwargs)
# Store in cache
self.redis_client.setex(
cache_key,
ttl,
json.dumps(result)
)
return result
return wrapper
return decorator
cache_manager = CacheManager()
@cache_manager.cache(ttl=300)
def expensive_database_query(user_id):
# Expensive operation
return query_database(user_id)
4. Load Testing:
# Locust load testing
import random
from locust import HttpUser, task, between
class WebsiteUser(HttpUser):
wait_time = between(1, 3)
@task(3)
def view_homepage(self):
self.client.get("/")
@task(2)
def view_product(self):
product_id = random.randint(1, 1000)
self.client.get(f"/products/{product_id}")
@task(1)
def search(self):
self.client.get("/search?q=test")
def on_start(self):
# Login
self.client.post("/login", {
"username": "test",
"password": "test"
})
# Run: locust -f loadtest.py --host=https://example.com
5. Application-Level Optimization:
# Async processing for I/O-bound operations
import asyncio
import aiohttp
async def fetch_data(session, url):
async with session.get(url) as response:
return await response.json()
async def fetch_all_data(urls):
async with aiohttp.ClientSession() as session:
tasks = [fetch_data(session, url) for url in urls]
results = await asyncio.gather(*tasks)
return results
# Usage
urls = ['http://api1.com', 'http://api2.com', 'http://api3.com']
results = asyncio.run(fetch_all_data(urls))
6. CDN and Static Asset Optimization:
# Nginx configuration for performance
http {
# Enable gzip compression
gzip on;
gzip_vary on;
gzip_min_length 1024;
gzip_types text/plain text/css text/xml text/javascript
application/x-javascript application/xml+rss
application/json application/javascript;
# Browser caching
location ~* \.(jpg|jpeg|png|gif|ico|css|js|woff|woff2)$ {
expires 1y;
add_header Cache-Control "public, immutable";
}
# Proxy caching
proxy_cache_path /var/cache/nginx levels=1:2
keys_zone=my_cache:10m max_size=1g
inactive=60m use_temp_path=off;
location / {
proxy_cache my_cache;
proxy_cache_valid 200 60m;
proxy_cache_use_stale error timeout http_500 http_502 http_503;
proxy_pass http://backend;
}
}
Performance Optimization Checklist:
- Profile code to identify bottlenecks
- Optimize database queries and add indexes
- Implement caching (Redis, Memcached)
- Use CDN for static assets
- Enable compression (gzip, brotli)
- Optimize images and assets
- Use async/await for I/O operations
- Implement connection pooling
- Monitor application metrics
- Load test before production
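Some of these items are quick to verify from outside the system. A sketch using curl against a placeholder host:
# Confirm compression is actually negotiated for text assets
curl -sI -H "Accept-Encoding: gzip, br" https://example.com/app.js | grep -i content-encoding
# Confirm long-lived caching headers on static assets
curl -sI https://example.com/static/logo.png | grep -iE 'cache-control|expires'
# Rough end-to-end timing breakdown for a single request
curl -s -o /dev/null -w 'dns=%{time_namelookup} connect=%{time_connect} ttfb=%{time_starttransfer} total=%{time_total}\n' https://example.com/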
Rarity: Very Common
Difficulty: Hard
Incident Management
15. Describe your approach to incident management and post-mortems.
Answer: Effective incident management minimizes downtime and prevents recurrence.
Incident Response Process:
1. Detection and Alert:
# PagerDuty integration with Prometheus
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-config
data:
alertmanager.yml: |
global:
resolve_timeout: 5m
route:
group_by: ['alertname', 'cluster']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'pagerduty'
routes:
- match:
severity: critical
receiver: 'pagerduty'
continue: true
- match:
severity: warning
receiver: 'slack'
receivers:
- name: 'pagerduty'
pagerduty_configs:
- service_key: '<pagerduty-integration-key>'
description: '{{ .GroupLabels.alertname }}'
severity: '{{ .CommonLabels.severity }}'
- name: 'slack'
slack_configs:
- api_url: '<slack-webhook-url>'
channel: '#alerts'
title: 'Alert: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
2. Incident Classification:
# Incident severity classification
class IncidentSeverity:
SEV1 = {
'name': 'Critical',
'description': 'Complete service outage affecting all users',
'response_time': '15 minutes',
'notification': ['on-call', 'management', 'stakeholders']
}
SEV2 = {
'name': 'High',
'description': 'Major functionality impaired, affecting many users',
'response_time': '30 minutes',
'notification': ['on-call', 'team-lead']
}
SEV3 = {
'name': 'Medium',
'description': 'Minor functionality impaired, limited user impact',
'response_time': '2 hours',
'notification': ['on-call']
}
SEV4 = {
'name': 'Low',
'description': 'No immediate user impact',
'response_time': '24 hours',
'notification': ['team']
}
class IncidentManager:
def __init__(self):
self.incidents = []
def create_incident(self, title, severity, description):
incident = {
'id': generate_incident_id(),
'title': title,
'severity': severity,
'description': description,
'status': 'investigating',
'created_at': datetime.now(),
'timeline': [],
'responders': []
}
self.incidents.append(incident)
self.notify_responders(incident)
return incident
def update_timeline(self, incident_id, update):
incident = self.get_incident(incident_id)
incident['timeline'].append({
'timestamp': datetime.now(),
'update': update
})
def resolve_incident(self, incident_id, resolution):
incident = self.get_incident(incident_id)
incident['status'] = 'resolved'
incident['resolved_at'] = datetime.now()
incident['resolution'] = resolution
# Trigger post-mortem creation
self.create_postmortem(incident)
3. Communication Template:
# Incident Communication Template
## Initial Notification
**Subject:** [SEV1] Production Outage - API Service Down
**Status:** Investigating
**Impact:** All API requests failing, affecting 100% of users
**Started:** 2024-11-25 14:30 UTC
**Next Update:** 15 minutes
We are investigating reports of API service unavailability.
Our team is actively working on resolution.
## Update
**Status:** Identified
**Root Cause:** Database connection pool exhausted
**Action:** Restarting application servers and increasing pool size
**ETA:** 10 minutes
## Resolution
**Status:** Resolved
**Duration:** 45 minutes (14:30 - 15:15 UTC)
**Resolution:** Increased database connection pool from 50 to 200
**Next Steps:** Post-mortem scheduled for tomorrow
Thank you for your patience.
4. Post-Mortem Template:
# Post-Mortem: API Service Outage
**Date:** 2024-11-25
**Duration:** 45 minutes
**Severity:** SEV1
**Impact:** 100% of API requests failed
## Summary
On November 25, 2024, our API service experienced a complete outage
lasting 45 minutes due to database connection pool exhaustion.
## Timeline
- **14:30 UTC:** Alerts triggered for high error rate
- **14:32 UTC:** On-call engineer paged
- **14:35 UTC:** Incident declared SEV1
- **14:40 UTC:** Root cause identified (connection pool exhausted)
- **14:45 UTC:** Mitigation applied (increased pool size)
- **15:00 UTC:** Service partially restored
- **15:15 UTC:** Full service restoration confirmed
## Root Cause
Database connection pool was configured with a maximum of 50 connections.
Traffic spike from marketing campaign exceeded this limit, causing all
new requests to fail.
## Resolution
1. Increased connection pool size from 50 to 200
2. Restarted application servers
3. Verified service restoration
## Impact
- **Users Affected:** ~50,000
- **Failed Requests:** ~2.5 million
- **Revenue Impact:** Estimated $25,000
## Action Items
1. [ ] Implement auto-scaling for connection pool (Owner: @john, Due: Dec 1)
2. [ ] Add alerting for connection pool utilization (Owner: @jane, Due: Nov 28)
3. [ ] Load test with 2x expected traffic (Owner: @bob, Due: Dec 5)
4. [ ] Document connection pool tuning (Owner: @alice, Due: Nov 30)
5. [ ] Review capacity planning process (Owner: @team, Due: Dec 10)
## Lessons Learned
**What Went Well:**
- Quick detection through monitoring
- Clear communication to stakeholders
- Fast root cause identification
**What Went Wrong:**
- Connection pool not sized for peak traffic
- No alerting on connection pool metrics
- Insufficient load testing
**Where We Got Lucky:**
- Issue occurred during business hours
- Database itself remained healthy
- Quick rollback option available
5. Runbook Example:
# Incident Runbook
name: "API Service Down"
severity: SEV1
symptoms:
- High error rate (>5%)
- Increased latency (>1s)
- Failed health checks
investigation_steps:
- step: "Check service status"
command: "kubectl get pods -n production"
- step: "Check logs"
command: "kubectl logs -n production -l app=api --tail=100"
- step: "Check database connectivity"
command: "kubectl exec -it api-pod -- nc -zv database 5432"
- step: "Check resource usage"
command: "kubectl top pods -n production"
mitigation_steps:
- action: "Restart pods"
command: "kubectl rollout restart deployment/api -n production"
- action: "Scale up replicas"
command: "kubectl scale deployment/api --replicas=10 -n production"
- action: "Failover to backup region"
command: "aws route53 change-resource-record-sets --hosted-zone-id Z123 ..."
escalation:
- level: 1
contact: "on-call-engineer"
timeout: "15 minutes"
- level: 2
contact: "team-lead"
timeout: "30 minutes"
- level: 3
contact: "engineering-director"
timeout: "1 hour"Best Practices:
- Blameless post-mortems
- Focus on systems, not individuals
- Document action items with owners and deadlines
- Share learnings across organization
- Regular incident response drills
- Maintain updated runbooks
- Track MTTR (Mean Time To Recovery)
- Implement preventive measures
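Drills keep runbooks honest. A sketch of a simple game-day exercise run against a non-production environment (namespace and label are placeholders):
# Simulate the runbook's failure mode (pod loss) in staging
kubectl delete pod -n staging -l app=api --wait=false
# Confirm alerts fire and the runbook's investigation steps still work
kubectl get pods -n staging -l app=api
kubectl logs -n staging -l app=api --tail=50
# Record detection and recovery timestamps to track MTTR over time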
Rarity: Very Common
Difficulty: Medium
Conclusion
Senior DevOps engineer interviews require deep technical expertise and hands-on experience. Key areas to master:
- Kubernetes: Architecture, networking, troubleshooting, security
- Terraform: State management, modules, best practices
- Cloud Architecture: Multi-region, high availability, disaster recovery
- GitOps: Declarative infrastructure, ArgoCD, automation
- Security: RBAC, network policies, secrets management, compliance
- Observability: Metrics, logs, traces, SLOs, alerting
- SRE Practices: Reliability, incident response, capacity planning
Focus on production experience, architectural decisions, and trade-offs. Be prepared to discuss real-world scenarios and how you've solved complex problems. Good luck!




