November 25, 2025
25 min read

Senior DevOps Engineer Interview Questions: Complete Guide

interview
career-advice
job-search

Milad Bonakdar


Master advanced DevOps concepts with comprehensive interview questions covering Kubernetes, Terraform, cloud architecture, GitOps, security, SRE practices, and high availability for senior DevOps engineers.


Introduction

Senior DevOps engineers are expected to architect scalable infrastructure, implement advanced automation, ensure security and compliance, and drive DevOps culture across organizations. This role demands deep expertise in container orchestration, infrastructure as code, cloud architecture, and site reliability engineering.

This comprehensive guide covers essential interview questions for senior DevOps engineers, focusing on advanced concepts, production systems, and strategic thinking. Each question includes detailed explanations and practical examples.


Advanced Kubernetes

1. Explain Kubernetes architecture and the role of key components.

Answer: Kubernetes separates a control plane from worker nodes:

Control Plane Components:

  • API Server: Frontend for Kubernetes control plane, handles all REST requests
  • etcd: Distributed key-value store for cluster state
  • Scheduler: Assigns pods to nodes based on resource requirements
  • Controller Manager: Runs controller processes (replication, endpoints, etc.)
  • Cloud Controller Manager: Integrates with cloud provider APIs

Node Components:

  • kubelet: Agent that ensures containers are running in pods
  • kube-proxy: Maintains network rules for pod communication
  • Container Runtime: Runs containers (Docker, containerd, CRI-O)

How it works:

  1. User submits deployment via kubectl
  2. API Server validates and stores in etcd
  3. Scheduler assigns pods to nodes
  4. kubelet on node creates containers
  5. kube-proxy configures networking
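
You can watch this flow end to end with a few commands (a minimal sketch; the deployment name and image are placeholders):

# Submit a Deployment and watch the control plane react
kubectl create deployment demo --image=nginx --replicas=2

# The scheduler assigns pods to nodes; kubelet reports container status
kubectl get pods -o wide --watch

# Events show scheduling, image pulls, and container starts
kubectl get events --sort-by=.metadata.creationTimestamp | tail -n 20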

Rarity: Very Common
Difficulty: Hard


2. How do you troubleshoot a pod stuck in CrashLoopBackOff?

Answer: Systematic debugging approach:

# 1. Check pod status and events
kubectl describe pod <pod-name>
# Look for: Image pull errors, resource limits, failed health checks

# 2. Check logs
kubectl logs <pod-name>
kubectl logs <pod-name> --previous  # Previous container logs

# 3. Check resource constraints
kubectl top pod <pod-name>
kubectl describe node <node-name>

# 4. Check liveness/readiness probes
kubectl get pod <pod-name> -o yaml | grep -A 10 livenessProbe

# 5. Exec into container (if it stays up briefly)
kubectl exec -it <pod-name> -- /bin/sh

# 6. Check image
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].image}'
docker pull <image>  # Test locally

# 7. Check ConfigMaps/Secrets
kubectl get configmap
kubectl get secret

# 8. Review deployment/pod spec
kubectl get deployment <deployment-name> -o yaml

Common causes:

  • Application crashes on startup
  • Missing environment variables
  • Incorrect liveness probe configuration
  • Insufficient resources (OOMKilled)
  • Image pull errors
  • Missing dependencies
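
The fastest way to narrow these down is the last terminated state of the container (a minimal check; substitute your pod name):

# Why did the previous container die? (OOMKilled, Error, exit code, ...)
kubectl get pod <pod-name> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
kubectl get pod <pod-name> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'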

Example fix:

# Increase resource limits
resources:
  limits:
    memory: "512Mi"
    cpu: "500m"
  requests:
    memory: "256Mi"
    cpu: "250m"

# Adjust probe timing
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30  # Give app time to start
  periodSeconds: 10
  failureThreshold: 3

Rarity: Very Common
Difficulty: Medium


3. Explain Kubernetes networking: Services, Ingress, and Network Policies.

Answer: Kubernetes networking layers:

Services: Each Service type exposes pods with a different scope:

# ClusterIP (internal only)
apiVersion: v1
kind: Service
metadata:
  name: backend
spec:
  type: ClusterIP
  selector:
    app: backend
  ports:
    - port: 80
      targetPort: 8080

# NodePort (external access via node IP)
spec:
  type: NodePort
  ports:
    - port: 80
      targetPort: 8080
      nodePort: 30080

# LoadBalancer (cloud load balancer)
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 8080

Ingress: HTTP/HTTPS routing:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /v1
        pathType: Prefix
        backend:
          service:
            name: api-v1
            port:
              number: 80
      - path: /v2
        pathType: Prefix
        backend:
          service:
            name: api-v2
            port:
              number: 80
  tls:
  - hosts:
    - api.example.com
    secretName: api-tls

Network Policies: Control pod-to-pod communication:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-policy
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database
    ports:
    - protocol: TCP
      port: 5432

Rarity: Very Common
Difficulty: Hard


4. How do you implement autoscaling in Kubernetes?

Answer: Multiple autoscaling strategies:

Horizontal Pod Autoscaler (HPA):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15

Vertical Pod Autoscaler (VPA):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  updatePolicy:
    updateMode: "Auto"  # or "Recreate", "Initial", "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2
        memory: 2Gi

Cluster Autoscaler: Automatically adjusts cluster size based on pending pods:

# Cluster Autoscaler is configured via flags on its own deployment (AWS example)
spec:
  containers:
  - name: cluster-autoscaler
    command:
    - ./cluster-autoscaler
    - --cloud-provider=aws
    - --nodes=2:10:my-node-group          # min:max:node-group
    - --scale-down-delay-after-add=10m
    - --scale-down-unneeded-time=10m

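To confirm scaling actually happens, watch the autoscalers' status and events (a small sketch, assuming the HPA name from the manifest above):

# Current vs. target utilization and replica counts
kubectl get hpa app-hpa --watch

# Scaling events and the reasons behind them
kubectl describe hpa app-hpa

# Pending pods are what trigger the Cluster Autoscaler to add nodes
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
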
Rarity: Common
Difficulty: Medium


Advanced Terraform

5. Explain Terraform state management and best practices.

Answer: Terraform state maps your configuration to real infrastructure; storing and sharing it safely is critical for team operations.

Remote State Configuration:

# backend.tf
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

State Locking:

# DynamoDB table for state locking
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

Best Practices:

1. Never commit state files to Git

# .gitignore
*.tfstate
*.tfstate.*
.terraform/

2. Use workspaces for environments

terraform workspace new dev
terraform workspace new staging
terraform workspace new prod

terraform workspace select dev
terraform apply

3. Import existing resources

# Import existing EC2 instance
terraform import aws_instance.web i-1234567890abcdef0

# Verify
terraform plan

4. State manipulation (use carefully)

# List resources in state
terraform state list

# Show specific resource
terraform state show aws_instance.web

# Move resource in state
terraform state mv aws_instance.old aws_instance.new

# Remove resource from state (doesn't delete)
terraform state rm aws_instance.web

5. Backup state before major changes

terraform state pull > backup.tfstate

Rarity: Very Common
Difficulty: Hard


6. How do you structure Terraform code for large projects?

Answer: Modular structure for maintainability:

Directory Structure:

terraform/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   ├── staging/
│   └── prod/
├── modules/
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   ├── eks/
│   ├── rds/
│   └── s3/
└── global/
    ├── iam/
    └── route53/

Module Example:

# modules/vpc/main.tf
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = merge(
    var.tags,
    {
      Name = "${var.environment}-vpc"
    }
  )
}

resource "aws_subnet" "private" {
  count             = length(var.private_subnet_cidrs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.private_subnet_cidrs[count.index]
  availability_zone = var.availability_zones[count.index]

  tags = merge(
    var.tags,
    {
      Name = "${var.environment}-private-${count.index + 1}"
      Type = "private"
    }
  )
}

# modules/vpc/variables.tf
variable "vpc_cidr" {
  description = "CIDR block for VPC"
  type        = string
}

variable "environment" {
  description = "Environment name"
  type        = string
}

variable "private_subnet_cidrs" {
  description = "CIDR blocks for private subnets"
  type        = list(string)
}

variable "availability_zones" {
  description = "Availability zones"
  type        = list(string)
}

variable "tags" {
  description = "Common tags"
  type        = map(string)
  default     = {}
}

# modules/vpc/outputs.tf
output "vpc_id" {
  value = aws_vpc.main.id
}

output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}

Using Modules:

# environments/prod/main.tf
module "vpc" {
  source = "../../modules/vpc"

  vpc_cidr             = "10.0.0.0/16"
  environment          = "prod"
  private_subnet_cidrs = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  availability_zones   = ["us-east-1a", "us-east-1b", "us-east-1c"]

  tags = {
    Project   = "MyApp"
    ManagedBy = "Terraform"
  }
}

module "eks" {
  source = "../../modules/eks"

  cluster_name    = "prod-cluster"
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnet_ids
  node_group_size = 3
}
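
Day-to-day work then happens per environment (a sketch, assuming each environment directory has its own backend.tf and tfvars as shown above):

cd environments/prod

# State stays isolated because each environment configures its own backend
terraform init
terraform plan -var-file=terraform.tfvars -out=tfplan
terraform apply tfplan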

Rarity: Common
Difficulty: Hard


Cloud Architecture

7. Design a highly available multi-region architecture on AWS.

Answer: Multi-region architecture for high availability:


Key Components:

1. DNS and Traffic Management:

# Route 53 with health checks
resource "aws_route53_health_check" "primary" {
  fqdn              = "api.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "api" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"

  failover_routing_policy {
    type = "PRIMARY"
  }

  set_identifier = "primary"
  health_check_id = aws_route53_health_check.primary.id

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }
}

2. Database Replication:

# RDS with cross-region read replica
resource "aws_db_instance" "primary" {
  identifier           = "prod-db-primary"
  engine               = "postgres"
  instance_class       = "db.r5.xlarge"
  multi_az             = true
  backup_retention_period = 7

  provider = aws.us-east-1
}

resource "aws_db_instance" "replica" {
  identifier             = "prod-db-replica"
  replicate_source_db    = aws_db_instance.primary.arn
  instance_class         = "db.r5.xlarge"
  auto_minor_version_upgrade = false

  provider = aws.us-west-2
}

3. Data Replication:

# S3 cross-region replication
resource "aws_s3_bucket_replication_configuration" "replication" {
  bucket = aws_s3_bucket.source.id
  role   = aws_iam_role.replication.arn

  rule {
    id     = "replicate-all"
    status = "Enabled"

    destination {
      bucket        = aws_s3_bucket.destination.arn
      storage_class = "STANDARD"
    }
  }
}

Design Principles:

  • Active-active or active-passive setup
  • Automated failover with health checks
  • Data replication with minimal lag
  • Consistent deployment across regions
  • Monitoring and alerting for both regions
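
Failover is worth verifying from the outside as part of regular drills (a rough smoke test; the hostname and health path match the examples above):

# Which endpoint is DNS currently resolving to?
dig +short api.example.com

# Is the active region healthy end to end?
curl -sf -o /dev/null \
  -w "HTTP %{http_code} in %{time_total}s\n" \
  https://api.example.com/health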

Rarity: Common
Difficulty: Hard


GitOps & CI/CD

8. Explain GitOps and how to implement it with ArgoCD.

Answer: GitOps uses Git as the single source of truth for declarative infrastructure and applications.

Principles:

  1. Declarative configuration in Git
  2. Automated synchronization
  3. Version control for all changes
  4. Continuous reconciliation

ArgoCD Implementation:

# Application manifest
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/app-manifests
    targetRevision: main
    path: k8s/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: false
    syncOptions:
    - CreateNamespace=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

Directory Structure:

app-manifests/
├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── kustomization.yaml
└── overlays/
    ├── dev/
    │   ├── kustomization.yaml
    │   └── patches/
    ├── staging/
    └── production/
        ├── kustomization.yaml
        ├── replicas.yaml
        └── resources.yaml

Kustomization:

# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- ../../base
- ingress.yaml

replicas:
- name: myapp
  count: 5

patches:
- path: resources.yaml
  target:
    kind: Deployment
    name: myapp

Benefits:

  • Git as audit trail
  • Easy rollbacks (git revert)
  • Declarative desired state
  • Automated drift detection
  • Multi-cluster management
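
Rollbacks illustrate the model well: revert the commit and let ArgoCD reconcile (a sketch with the argocd CLI; the app name matches the manifest above):

# Undo the change in Git, which remains the source of truth
git revert <bad-commit-sha>
git push origin main

# ArgoCD syncs automatically; you can also trigger and inspect it explicitly
argocd app sync myapp
argocd app get myapp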

Rarity: Common
Difficulty: Medium


Security & Compliance

9. How do you implement security best practices in Kubernetes?

Answer: Multi-layered security approach:

1. Pod Security Standards:

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

2. RBAC (Role-Based Access Control):

# Role for developers
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: developer
rules:
- apiGroups: ["", "apps"]
  resources: ["pods", "deployments", "services"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["pods/log"]
  verbs: ["get"]

# RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-binding
  namespace: production
subjects:
- kind: Group
  name: developers
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io

3. Network Policies:

# Default deny all ingress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress

4. Secrets Management:

# External Secrets Operator
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: app-secrets
    creationPolicy: Owner
  data:
  - secretKey: database-password
    remoteRef:
      key: prod/database
      property: password

5. Security Context:

apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: myapp:1.0
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL
    volumeMounts:
    - name: tmp
      mountPath: /tmp
  volumes:
  - name: tmp
    emptyDir: {}

6. Image Scanning:

# Admission controller with OPA
apiVersion: v1
kind: ConfigMap
metadata:
  name: opa-policy
data:
  policy.rego: |
    package kubernetes.admission
    
    deny[msg] {
      input.request.kind.kind == "Pod"
      image := input.request.object.spec.containers[_].image
      not startswith(image, "registry.company.com/")
      msg := sprintf("Image %v is not from approved registry", [image])
    }
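
Once these controls are in place, it helps to audit what a given identity can actually do (a quick check against the developer Role defined earlier; the username is a placeholder):

# Everything a member of the developers group may do in production
kubectl auth can-i --list --as=jane --as-group=developers -n production

# Spot checks: read-only should be allowed, writes should not
kubectl auth can-i get pods --as=jane --as-group=developers -n production
kubectl auth can-i delete deployments --as=jane --as-group=developers -n production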

Rarity: Very Common
Difficulty: Hard


Observability & SRE

10. Design a comprehensive observability stack.

Answer: Three pillars of observability: Metrics, Logs, Traces

Architecture:


1. Metrics (Prometheus + Grafana):

# ServiceMonitor for app metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

2. Logging (Loki):

# Promtail config for log collection
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
data:
  promtail.yaml: |
    server:
      http_listen_port: 9080
    
    clients:
      - url: http://loki:3100/loki/api/v1/push
    
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            target_label: app
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
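
With metrics and logs flowing, both can be queried directly during an investigation (a sketch; the Prometheus and Loki addresses are assumptions based on the stack above):

# Ad-hoc PromQL through the Prometheus HTTP API
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m]))'

# The same app's logs through Loki's logcli
logcli --addr=http://loki:3100 query '{app="myapp"}' --since=1h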

3. Tracing (Jaeger):

# Application instrumentation
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Setup tracing
trace.set_tracer_provider(TracerProvider())
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger-agent",
    agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

tracer = trace.get_tracer(__name__)

# Use in code
with tracer.start_as_current_span("process_request"):
    # Your code here
    pass

4. Alerting Rules:

# PrometheusRule
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
spec:
  groups:
  - name: app
    interval: 30s
    rules:
    - alert: HighErrorRate
      expr: |
        sum(rate(http_requests_total{status=~"5.."}[5m]))
        /
        sum(rate(http_requests_total[5m]))
        > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate detected"
        description: "Error rate is {{ $value | humanizePercentage }}"
    
    - alert: HighLatency
      expr: |
        histogram_quantile(0.95,
          rate(http_request_duration_seconds_bucket[5m])
        ) > 1
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High latency detected"

5. SLO Monitoring:

# SLO definition
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: api-availability
spec:
  service: "api"
  labels:
    team: "platform"
  slos:
    - name: "requests-availability"
      objective: 99.9
      description: "API requests should succeed"
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
          totalQuery: sum(rate(http_requests_total[{{.window}}]))
      alerting:
        pageAlert:
          labels:
            severity: critical
        ticketAlert:
          labels:
            severity: warning

Rarity: Common
Difficulty: Hard


Disaster Recovery

11. How do you implement disaster recovery for a Kubernetes cluster?

Answer: Comprehensive DR strategy:

1. Backup Strategy:

# Velero backup schedule
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    includedNamespaces:
    - production
    - staging
    excludedResources:
    - events
    - events.events.k8s.io
    storageLocation: aws-s3
    volumeSnapshotLocations:
    - aws-ebs
    ttl: 720h  # 30 days

2. etcd Backup:

#!/bin/bash
# Automated etcd backup script

ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db

# Upload to S3
aws s3 cp /backup/etcd-snapshot-*.db s3://etcd-backups/

# Cleanup old backups
find /backup -name "etcd-snapshot-*.db" -mtime +7 -delete

3. Restore Procedure:

# Restore etcd from snapshot
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --data-dir=/var/lib/etcd-restore \
  --initial-cluster=etcd-0=https://10.0.1.10:2380 \
  --initial-advertise-peer-urls=https://10.0.1.10:2380

# Restore application with Velero
velero restore create --from-backup daily-backup-20231125
velero restore describe <restore-name>

4. Multi-Region Failover:

# Terraform for multi-region setup
module "primary_cluster" {
  source = "./modules/eks"
  region = "us-east-1"
  # ... configuration
}

module "dr_cluster" {
  source = "./modules/eks"
  region = "us-west-2"
  # ... configuration
}

# Route 53 health check and failover
resource "aws_route53_health_check" "primary" {
  fqdn              = module.primary_cluster.endpoint
  port              = 443
  type              = "HTTPS"
  resource_path     = "/healthz"
  failure_threshold = 3
}

5. RTO/RPO Targets:

  • RTO (Recovery Time Objective): < 1 hour
  • RPO (Recovery Point Objective): < 15 minutes
  • Regular DR drills (monthly)
  • Documented runbooks
  • Automated failover where possible
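
A DR drill can be scripted around these targets; a minimal restore test into a scratch namespace might look like this (assuming Velero and the backup names above):

# Confirm recent backups completed
velero backup get

# Restore into a test namespace rather than over production
velero restore create drill-$(date +%Y%m%d) \
  --from-backup daily-backup-20231125 \
  --namespace-mappings production:dr-test

# Verify workloads came up and time the process (this feeds the RTO number)
kubectl get pods -n dr-test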

Rarity: Common
Difficulty: Hard


Service Mesh

12. Explain service mesh architecture and when to use it.

Answer: A service mesh provides an infrastructure layer for service-to-service communication.

Core Components: a control plane that distributes configuration and certificates, and a data plane of sidecar proxies (e.g., Envoy) running alongside each service.

Istio Implementation:

# Virtual Service for traffic routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - match:
    - headers:
        end-user:
          exact: jason
    route:
    - destination:
        host: reviews
        subset: v2
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 80
    - destination:
        host: reviews
        subset: v2
      weight: 20

# Destination Rule
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
    trafficPolicy:
      connectionPool:
        tcp:
          maxConnections: 100
        http:
          http1MaxPendingRequests: 50
          http2MaxRequests: 100

Circuit Breaking:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: backend
spec:
  host: backend
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 10
        maxRequestsPerConnection: 2
    outlierDetection:
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

Mutual TLS:

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT

# Authorization Policy
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: frontend-policy
spec:
  selector:
    matchLabels:
      app: frontend
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/default/sa/gateway"]
    to:
    - operation:
        methods: ["GET", "POST"]

When to Use:

  • Microservices with complex communication patterns
  • Need for advanced traffic management
  • Security requirements (mTLS, authorization)
  • Observability across services
  • Gradual rollouts and A/B testing

Trade-offs:

  • Added complexity and latency
  • Resource overhead (sidecar proxies)
  • Learning curve
  • Debugging challenges
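
Adoption is usually incremental: enable sidecar injection per namespace and validate before tightening policies (a sketch with istioctl; names are illustrative):

# Opt the namespace into automatic sidecar injection
kubectl label namespace production istio-injection=enabled

# Restart workloads so pods pick up the Envoy sidecar
kubectl rollout restart deployment -n production

# Lint mesh configuration for common mistakes
istioctl analyze -n production

# Inspect effective routes for a specific pod
istioctl proxy-config routes <frontend-pod-name> -n production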

Rarity: Common
Difficulty: Hard


Cost Optimization

13. How do you optimize cloud infrastructure costs?

Answer: Cost optimization requires continuous monitoring and strategic decisions.

1. Right-Sizing Resources:

# AWS Cost Explorer analysis script
import boto3
import pandas as pd
from datetime import datetime, timedelta

class CostOptimizer:
    def __init__(self):
        self.ce_client = boto3.client('ce')
        self.ec2_client = boto3.client('ec2')
        self.cloudwatch = boto3.client('cloudwatch')
    
    def analyze_underutilized_instances(self, days=30):
        """Find underutilized EC2 instances"""
        end_date = datetime.now()
        start_date = end_date - timedelta(days=days)
        
        instances = self.ec2_client.describe_instances()
        recommendations = []
        
        for reservation in instances['Reservations']:
            for instance in reservation['Instances']:
                instance_id = instance['InstanceId']
                
                # Get CPU utilization
                cpu_stats = self.cloudwatch.get_metric_statistics(
                    Namespace='AWS/EC2',
                    MetricName='CPUUtilization',
                    Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
                    StartTime=start_date,
                    EndTime=end_date,
                    Period=3600,
                    Statistics=['Average']
                )
                
                avg_cpu = sum(d['Average'] for d in cpu_stats['Datapoints']) / len(cpu_stats['Datapoints']) if cpu_stats['Datapoints'] else 0
                
                if avg_cpu < 10:
                    recommendations.append({
                        'instance_id': instance_id,
                        'instance_type': instance['InstanceType'],
                        'avg_cpu': avg_cpu,
                        'recommendation': 'Consider downsizing or terminating'
                    })
        
        return pd.DataFrame(recommendations)
    
    def get_cost_by_service(self, days=30):
        """Get cost breakdown by service"""
        end_date = datetime.now().strftime('%Y-%m-%d')
        start_date = (datetime.now() - timedelta(days=days)).strftime('%Y-%m-%d')
        
        response = self.ce_client.get_cost_and_usage(
            TimePeriod={'Start': start_date, 'End': end_date},
            Granularity='MONTHLY',
            Metrics=['UnblendedCost'],
            GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
        )
        
        costs = []
        for result in response['ResultsByTime']:
            for group in result['Groups']:
                costs.append({
                    'service': group['Keys'][0],
                    'cost': float(group['Metrics']['UnblendedCost']['Amount'])
                })
        
        return pd.DataFrame(costs).sort_values('cost', ascending=False)

# Usage
optimizer = CostOptimizer()
underutilized = optimizer.analyze_underutilized_instances()
print("Underutilized Instances:")
print(underutilized)

costs = optimizer.get_cost_by_service()
print("\nTop 10 Services by Cost:")
print(costs.head(10))

2. Reserved Instances & Savings Plans:

# Note: Terraform manages On-Demand Capacity Reservations; Reserved Instances
# and Savings Plans are billing commitments purchased separately (console/CLI)
resource "aws_ec2_capacity_reservation" "production" {
  instance_type     = "m5.xlarge"
  instance_platform = "Linux/UNIX"
  availability_zone = "us-east-1a"
  instance_count    = 10
  
  tags = {
    Environment = "production"
    CostCenter  = "engineering"
  }
}

3. Auto-Scaling Policies:

# Kubernetes HPA with custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cost-optimized-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
      selectPolicy: Max

4. Storage Optimization:

# S3 lifecycle policy
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-bucket \
  --lifecycle-configuration '{
    "Rules": [
      {
        "Id": "Archive old logs",
        "Status": "Enabled",
        "Filter": {"Prefix": "logs/"},
        "Transitions": [
          {
            "Days": 30,
            "StorageClass": "STANDARD_IA"
          },
          {
            "Days": 90,
            "StorageClass": "GLACIER"
          }
        ],
        "Expiration": {"Days": 365}
      }
    ]
  }'

5. Spot Instances for Non-Critical Workloads:

# Kubernetes with spot instances
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spot-provisioner
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.large", "m5.xlarge", "m5.2xlarge"]
  limits:
    resources:
      cpu: 1000
  ttlSecondsAfterEmpty: 30
  ttlSecondsUntilExpired: 604800

Cost Optimization Checklist:

  • Right-size instances based on actual usage
  • Use Reserved Instances for predictable workloads
  • Implement auto-scaling
  • Delete unused resources (EBS volumes, snapshots, IPs)
  • Use spot instances for batch jobs
  • Optimize data transfer costs
  • Implement S3 lifecycle policies
  • Use CloudFront for static content
  • Monitor and set budget alerts
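
Several checklist items can be scripted; for example, hunting for obviously idle resources (a sketch with the AWS CLI; filters are illustrative):

# Unattached EBS volumes (review before deleting)
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[].{ID:VolumeId,Size:Size,Created:CreateTime}' \
  --output table

# Elastic IPs and old snapshots are other common sources of idle spend
aws ec2 describe-addresses --output table
aws ec2 describe-snapshots --owner-ids self --query 'length(Snapshots)'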

Rarity: Very Common
Difficulty: Medium


Performance Tuning

14. How do you troubleshoot and optimize application performance?

Answer: Systematic approach to performance optimization:

1. Profiling and Monitoring:

# Python application profiling
import cProfile
import pstats
from functools import wraps
import time

def profile_function(func):
    """Decorator to profile function performance"""
    @wraps(func)
    def wrapper(*args, **kwargs):
        profiler = cProfile.Profile()
        profiler.enable()
        
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        
        profiler.disable()
        
        stats = pstats.Stats(profiler)
        stats.sort_stats('cumulative')
        stats.print_stats(10)
        
        print(f"\nTotal execution time: {end_time - start_time:.4f} seconds")
        
        return result
    return wrapper

@profile_function
def slow_function():
    # Your code here
    pass

2. Database Optimization:

-- Identify slow queries
SELECT 
    query,
    calls,
    total_time,
    mean_time,
    max_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;

-- Add indexes for frequently queried columns
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
CREATE INDEX CONCURRENTLY idx_orders_user_created 
    ON orders(user_id, created_at DESC);

-- Analyze query plan
EXPLAIN ANALYZE
SELECT u.name, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2024-01-01'
GROUP BY u.id, u.name
HAVING COUNT(o.id) > 5;

3. Caching Strategy:

# Redis caching implementation
import redis
import json
from functools import wraps

class CacheManager:
    def __init__(self, redis_host='localhost', redis_port=6379):
        self.redis_client = redis.Redis(
            host=redis_host,
            port=redis_port,
            decode_responses=True
        )
    
    def cache(self, ttl=3600):
        """Decorator for caching function results"""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                # Generate cache key
                cache_key = f"{func.__name__}:{str(args)}:{str(kwargs)}"
                
                # Try to get from cache
                cached_result = self.redis_client.get(cache_key)
                if cached_result:
                    return json.loads(cached_result)
                
                # Execute function
                result = func(*args, **kwargs)
                
                # Store in cache
                self.redis_client.setex(
                    cache_key,
                    ttl,
                    json.dumps(result)
                )
                
                return result
            return wrapper
        return decorator

cache_manager = CacheManager()

@cache_manager.cache(ttl=300)
def expensive_database_query(user_id):
    # Expensive operation
    return query_database(user_id)

4. Load Testing:

# Locust load testing
import random

from locust import HttpUser, task, between

class WebsiteUser(HttpUser):
    wait_time = between(1, 3)
    
    @task(3)
    def view_homepage(self):
        self.client.get("/")
    
    @task(2)
    def view_product(self):
        product_id = random.randint(1, 1000)
        self.client.get(f"/products/{product_id}")
    
    @task(1)
    def search(self):
        self.client.get("/search?q=test")
    
    def on_start(self):
        # Login
        self.client.post("/login", {
            "username": "test",
            "password": "test"
        })

# Run: locust -f loadtest.py --host=https://example.com

5. Application-Level Optimization:

# Async processing for I/O-bound operations
import asyncio
import aiohttp

async def fetch_data(session, url):
    async with session.get(url) as response:
        return await response.json()

async def fetch_all_data(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_data(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        return results

# Usage
urls = ['http://api1.com', 'http://api2.com', 'http://api3.com']
results = asyncio.run(fetch_all_data(urls))

6. CDN and Static Asset Optimization:

# Nginx configuration for performance
http {
    # Enable gzip compression
    gzip on;
    gzip_vary on;
    gzip_min_length 1024;
    gzip_types text/plain text/css text/xml text/javascript
               application/x-javascript application/xml+rss
               application/json application/javascript;

    # Proxy cache storage (shared zone "my_cache")
    proxy_cache_path /var/cache/nginx levels=1:2
                     keys_zone=my_cache:10m max_size=1g
                     inactive=60m use_temp_path=off;

    server {
        listen 80;

        # Browser caching for static assets
        location ~* \.(jpg|jpeg|png|gif|ico|css|js|woff|woff2)$ {
            expires 1y;
            add_header Cache-Control "public, immutable";
        }

        # Proxy caching for dynamic content
        location / {
            proxy_cache my_cache;
            proxy_cache_valid 200 60m;
            proxy_cache_use_stale error timeout http_500 http_502 http_503;
            proxy_pass http://backend;
        }
    }
}

Performance Optimization Checklist:

  • Profile code to identify bottlenecks
  • Optimize database queries and add indexes
  • Implement caching (Redis, Memcached)
  • Use CDN for static assets
  • Enable compression (gzip, brotli)
  • Optimize images and assets
  • Use async/await for I/O operations
  • Implement connection pooling
  • Monitor application metrics
  • Load test before production
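
A few of these checks take seconds from the command line (a sketch; the URLs and asset paths are placeholders):

# Where does the time go? DNS, connect, time-to-first-byte, total
curl -s -o /dev/null \
  -w "dns=%{time_namelookup}s connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n" \
  https://example.com/

# Is compression actually on, and are cache headers present on static assets?
curl -s -D - -o /dev/null -H "Accept-Encoding: gzip" https://example.com/ | grep -i content-encoding
curl -s -D - -o /dev/null https://example.com/static/app.js | grep -iE 'cache-control|expires'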

Rarity: Very Common
Difficulty: Hard


Incident Management

15. Describe your approach to incident management and post-mortems.

Answer: Effective incident management minimizes downtime and prevents recurrence.

Incident Response Process:

1. Detection and Alert:

# PagerDuty integration with Prometheus
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
    
    route:
      group_by: ['alertname', 'cluster']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'pagerduty'
      routes:
      - match:
          severity: critical
        receiver: 'pagerduty'
        continue: true
      - match:
          severity: warning
        receiver: 'slack'
    
    receivers:
    - name: 'pagerduty'
      pagerduty_configs:
      - service_key: '<pagerduty-integration-key>'
        description: '{{ .GroupLabels.alertname }}'
        severity: '{{ .CommonLabels.severity }}'
    
    - name: 'slack'
      slack_configs:
      - api_url: '<slack-webhook-url>'
        channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

2. Incident Classification:

# Incident severity classification
from datetime import datetime

class IncidentSeverity:
    SEV1 = {
        'name': 'Critical',
        'description': 'Complete service outage affecting all users',
        'response_time': '15 minutes',
        'notification': ['on-call', 'management', 'stakeholders']
    }
    
    SEV2 = {
        'name': 'High',
        'description': 'Major functionality impaired, affecting many users',
        'response_time': '30 minutes',
        'notification': ['on-call', 'team-lead']
    }
    
    SEV3 = {
        'name': 'Medium',
        'description': 'Minor functionality impaired, limited user impact',
        'response_time': '2 hours',
        'notification': ['on-call']
    }
    
    SEV4 = {
        'name': 'Low',
        'description': 'No immediate user impact',
        'response_time': '24 hours',
        'notification': ['team']
    }

class IncidentManager:
    def __init__(self):
        self.incidents = []
    
    def create_incident(self, title, severity, description):
        incident = {
            'id': generate_incident_id(),
            'title': title,
            'severity': severity,
            'description': description,
            'status': 'investigating',
            'created_at': datetime.now(),
            'timeline': [],
            'responders': []
        }
        
        self.incidents.append(incident)
        self.notify_responders(incident)
        
        return incident
    
    def update_timeline(self, incident_id, update):
        incident = self.get_incident(incident_id)
        incident['timeline'].append({
            'timestamp': datetime.now(),
            'update': update
        })
    
    def resolve_incident(self, incident_id, resolution):
        incident = self.get_incident(incident_id)
        incident['status'] = 'resolved'
        incident['resolved_at'] = datetime.now()
        incident['resolution'] = resolution
        
        # Trigger post-mortem creation
        self.create_postmortem(incident)

3. Communication Template:

# Incident Communication Template

## Initial Notification
**Subject:** [SEV1] Production Outage - API Service Down

**Status:** Investigating
**Impact:** All API requests failing, affecting 100% of users
**Started:** 2024-11-25 14:30 UTC
**Next Update:** 15 minutes

We are investigating reports of API service unavailability. 
Our team is actively working on resolution.

## Update
**Status:** Identified
**Root Cause:** Database connection pool exhausted
**Action:** Restarting application servers and increasing pool size
**ETA:** 10 minutes

## Resolution
**Status:** Resolved
**Duration:** 45 minutes (14:30 - 15:15 UTC)
**Resolution:** Increased database connection pool from 50 to 200
**Next Steps:** Post-mortem scheduled for tomorrow

Thank you for your patience.

4. Post-Mortem Template:

# Post-Mortem: API Service Outage

**Date:** 2024-11-25
**Duration:** 45 minutes
**Severity:** SEV1
**Impact:** 100% of API requests failed

## Summary
On November 25, 2024, our API service experienced a complete outage 
lasting 45 minutes due to database connection pool exhaustion.

## Timeline
- **14:30 UTC:** Alerts triggered for high error rate
- **14:32 UTC:** On-call engineer paged
- **14:35 UTC:** Incident declared SEV1
- **14:40 UTC:** Root cause identified (connection pool exhausted)
- **14:45 UTC:** Mitigation applied (increased pool size)
- **15:00 UTC:** Service partially restored
- **15:15 UTC:** Full service restoration confirmed

## Root Cause
Database connection pool was configured with a maximum of 50 connections.
Traffic spike from marketing campaign exceeded this limit, causing all
new requests to fail.

## Resolution
1. Increased connection pool size from 50 to 200
2. Restarted application servers
3. Verified service restoration

## Impact
- **Users Affected:** ~50,000
- **Failed Requests:** ~2.5 million
- **Revenue Impact:** Estimated $25,000

## Action Items
1. [ ] Implement auto-scaling for connection pool (Owner: @john, Due: Dec 1)
2. [ ] Add alerting for connection pool utilization (Owner: @jane, Due: Nov 28)
3. [ ] Load test with 2x expected traffic (Owner: @bob, Due: Dec 5)
4. [ ] Document connection pool tuning (Owner: @alice, Due: Nov 30)
5. [ ] Review capacity planning process (Owner: @team, Due: Dec 10)

## Lessons Learned
**What Went Well:**
- Quick detection through monitoring
- Clear communication to stakeholders
- Fast root cause identification

**What Went Wrong:**
- Connection pool not sized for peak traffic
- No alerting on connection pool metrics
- Insufficient load testing

**Where We Got Lucky:**
- Issue occurred during business hours
- Database itself remained healthy
- Quick rollback option available

5. Runbook Example:

# Incident Runbook
name: "API Service Down"
severity: SEV1
symptoms:
  - High error rate (>5%)
  - Increased latency (>1s)
  - Failed health checks

investigation_steps:
  - step: "Check service status"
    command: "kubectl get pods -n production"
  
  - step: "Check logs"
    command: "kubectl logs -n production -l app=api --tail=100"
  
  - step: "Check database connectivity"
    command: "kubectl exec -it api-pod -- nc -zv database 5432"
  
  - step: "Check resource usage"
    command: "kubectl top pods -n production"

mitigation_steps:
  - action: "Restart pods"
    command: "kubectl rollout restart deployment/api -n production"
    
  - action: "Scale up replicas"
    command: "kubectl scale deployment/api --replicas=10 -n production"
  
  - action: "Failover to backup region"
    command: "aws route53 change-resource-record-sets --hosted-zone-id Z123 ..."

escalation:
  - level: 1
    contact: "on-call-engineer"
    timeout: "15 minutes"
  
  - level: 2
    contact: "team-lead"
    timeout: "30 minutes"
  
  - level: 3
    contact: "engineering-director"
    timeout: "1 hour"

Best Practices:

  • Blameless post-mortems
  • Focus on systems, not individuals
  • Document action items with owners and deadlines
  • Share learnings across organization
  • Regular incident response drills
  • Maintain updated runbooks
  • Track MTTR (Mean Time To Recovery)
  • Implement preventive measures

Rarity: Very Common
Difficulty: Medium


Conclusion

Senior DevOps engineer interviews require deep technical expertise and hands-on experience. Key areas to master:

  1. Kubernetes: Architecture, networking, troubleshooting, security
  2. Terraform: State management, modules, best practices
  3. Cloud Architecture: Multi-region, high availability, disaster recovery
  4. GitOps: Declarative infrastructure, ArgoCD, automation
  5. Security: RBAC, network policies, secrets management, compliance
  6. Observability: Metrics, logs, traces, SLOs, alerting
  7. SRE Practices: Reliability, incident response, capacity planning

Focus on production experience, architectural decisions, and trade-offs. Be prepared to discuss real-world scenarios and how you've solved complex problems. Good luck!
