Senior DevOps Engineer Production Systems Interview Questions

Milad Bonakdar
Author
Prepare for senior DevOps interviews with hands-on questions covering Kubernetes, Terraform state, GitOps, security, observability, incident response, and production-system trade-offs.
Introduction
A senior DevOps engineer is expected to architect scalable infrastructure, implement advanced automation, ensure security and compliance, and drive DevOps culture across the organization. The role demands deep expertise in container orchestration, infrastructure as code, cloud architecture, and site reliability engineering.
This guide covers essential interview questions for senior DevOps engineers, focusing on advanced concepts, production systems, and strategic thinking. Each question includes a detailed explanation and practical examples.
Advanced Kubernetes
1. Explain the Kubernetes architecture and the roles of its key components.
Answer: Kubernetes follows a control-plane / node architecture:
Control plane components:
- API Server: the front end of the Kubernetes control plane; handles all REST requests
- etcd: distributed key-value store holding the cluster state
- Scheduler: assigns Pods to nodes based on resource requirements
- Controller Manager: runs the controller processes (replication, endpoints, etc.)
- Cloud Controller Manager: integrates with cloud provider APIs
Node components:
- kubelet: agent that ensures containers are running in Pods
- kube-proxy: maintains network rules for Pod communication
- Container Runtime: runs the containers (Docker, containerd, CRI-O)
How it works:
- A user submits a Deployment via kubectl
- The API Server validates it and persists it in etcd
- The Scheduler assigns Pods to nodes
- The kubelet on each node creates the containers
- kube-proxy configures the networking
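On a kubeadm-style cluster (an assumption; managed offerings hide some of these components), most of the pieces above can be inspected directly:
# Control plane components typically run as static pods in kube-system
kubectl get pods -n kube-system -o wide
# Nodes, kubelet versions, and container runtimes
kubectl get nodes -o wide
# API server endpoint and core add-ons
kubectl cluster-info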
Frequency: Very common   Difficulty: Hard
2. How do you troubleshoot a Pod stuck in CrashLoopBackOff?
Answer: Use a systematic debugging approach:
# 1. Check Pod status and events
kubectl describe pod <pod-name>
# Look for: image pull errors, resource limits, failing health checks
# 2. Check the logs
kubectl logs <pod-name>
kubectl logs <pod-name> --previous  # logs from the previous container
# 3. Check resource constraints
kubectl top pod <pod-name>
kubectl describe node <node-name>
# 4. Check liveness/readiness probes
kubectl get pod <pod-name> -o yaml | grep -A 10 livenessProbe
# 5. Exec into the container (if it stays up briefly)
kubectl exec -it <pod-name> -- /bin/sh
# 6. Check the image
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].image}'
docker pull <image>  # test the pull locally
# 7. Check ConfigMaps/Secrets
kubectl get configmap
kubectl get secret
# 8. Review the Deployment/Pod spec
kubectl get deployment <deployment-name> -o yaml
Common causes:
- The application crashes on startup
- Missing environment variables
- Misconfigured liveness probes
- Insufficient resources (OOMKilled)
- Image pull errors
- Missing dependencies
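For the OOMKilled case in particular, the last termination state is worth a direct look; a small sketch (assumes a single-container Pod, hence index 0):
# Reason and exit code of the previous container termination
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# Recent namespace events, oldest first
kubectl get events --sort-by=.lastTimestamp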
Example fixes:
# Increase resource limits
resources:
limits:
memory: "512Mi"
cpu: "500m"
requests:
memory: "256Mi"
cpu: "250m"
# Adjust probe timing
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30  # give the application time to start
periodSeconds: 10
failureThreshold: 3
Frequency: Very common   Difficulty: Medium
3. Explain Kubernetes networking: Services, Ingress, and Network Policies.
Answer: The Kubernetes networking layers:
Services: types of Service exposure:
# ClusterIP (internal only)
apiVersion: v1
kind: Service
metadata:
name: backend
spec:
type: ClusterIP
selector:
app: backend
ports:
- port: 80
targetPort: 8080
# NodePort (external access via a node IP)
spec:
type: NodePort
ports:
- port: 80
targetPort: 8080
nodePort: 30080
# LoadBalancer (cloud load balancer)
spec:
type: LoadBalancer
ports:
- port: 80
targetPort: 8080
Ingress: HTTP/HTTPS routing:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: app-ingress
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
spec:
rules:
- host: api.example.com
http:
paths:
- path: /v1
pathType: Prefix
backend:
service:
name: api-v1
port:
number: 80
- path: /v2
pathType: Prefix
backend:
service:
name: api-v2
port:
number: 80
tls:
- hosts:
- api.example.com
secretName: api-tls
Network Policies: control Pod-to-Pod communication:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: backend-policy
spec:
podSelector:
matchLabels:
app: backend
policyTypes:
- Ingress
- Egress
ingress:
- from:
- podSelector:
matchLabels:
app: frontend
ports:
- protocol: TCP
port: 8080
egress:
- to:
- podSelector:
matchLabels:
app: database
ports:
- protocol: TCP
port: 5432
Frequency: Very common   Difficulty: Hard
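A quick way to exercise Services and Network Policies like the ones above from inside the cluster is a throwaway client Pod (a sketch; the service and policy names assume the manifests above and the default namespace):
# DNS resolution and HTTP reachability of the ClusterIP service
kubectl run tmp --rm -it --image=busybox --restart=Never -- wget -qO- http://backend.default.svc.cluster.local
# Confirm which Pods the policy selects and what it allows
kubectl describe networkpolicy backend-policy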
4. How do you implement autoscaling in Kubernetes?
Answer: There are several autoscaling strategies:
Horizontal Pod Autoscaler (HPA):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: app
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 15
Vertical Pod Autoscaler (VPA):
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: app-vpa
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: app
updatePolicy:
updateMode: "Auto"  # or "Recreate", "Initial", "Off"
resourcePolicy:
containerPolicies:
- containerName: app
minAllowed:
cpu: 100m
memory: 128Mi
maxAllowed:
cpu: 2
memory: 2Gi
Cluster Autoscaler: automatically resizes the cluster based on pending Pods:
# AWS example: the Cluster Autoscaler is configured through flags on its own
# Deployment rather than a standalone ConfigMap (illustrative container args;
# "my-node-group" is a placeholder)
command:
- ./cluster-autoscaler
- --cloud-provider=aws
- --nodes=2:10:my-node-group
- --scale-down-delay-after-add=10m
- --scale-down-unneeded-time=10m
Frequency: Common   Difficulty: Medium
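A fast way to sanity-check HPA behaviour before writing full manifests like the ones above (a sketch; the Deployment name is assumed):
# Imperative equivalent of the CPU target above
kubectl autoscale deployment app --cpu-percent=70 --min=2 --max=10
# Watch current vs. target utilisation and replica counts
kubectl get hpa app --watch
# VPA recommendations, if the VPA components are installed
kubectl describe vpa app-vpa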
Advanced Terraform
5. Explain Terraform state management and best practices.
Answer: Terraform state tracks your infrastructure and is critical to every operation.
Remote state configuration:
# backend.tf
terraform {
backend "s3" {
bucket = "my-terraform-state"
key = "prod/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-locks"
}
}
State locking:
# DynamoDB table for state locking
resource "aws_dynamodb_table" "terraform_locks" {
name = "terraform-locks"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
}
Best practices:
1. Never commit state files to Git
# .gitignore
*.tfstate
*.tfstate.*
.terraform/
2. Use workspaces for environment isolation
terraform workspace new dev
terraform workspace new staging
terraform workspace new prod
terraform workspace select dev
terraform apply
3. Import existing resources
# Import an existing EC2 instance
terraform import aws_instance.web i-1234567890abcdef0
# Verify
terraform plan
4. State operations (use with care)
# List resources in the state
terraform state list
# Show a specific resource
terraform state show aws_instance.web
# Move a resource within the state
terraform state mv aws_instance.old aws_instance.new
# Remove a resource from state (without destroying it)
terraform state rm aws_instance.web
5. Back up the state before major changes
terraform state pull > backup.tfstate
Frequency: Very common   Difficulty: Hard
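If a lock is ever left behind (for example after a killed CI job), Terraform prints the lock ID and it can be cleared manually; a sketch worth double-checking before running:
# Only after confirming no other apply is in progress
terraform force-unlock <LOCK_ID>
# Pair any risky state operation with a fresh backup
terraform state pull > pre-change.tfstate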
6. How do you structure Terraform code for a large project?
Answer: Use a modular structure for maintainability:
Directory structure:
terraform/
├── environments/
│ ├── dev/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ ├── terraform.tfvars
│ │ └── backend.tf
│ ├── staging/
│ └── prod/
├── modules/
│ ├── vpc/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ ├── outputs.tf
│ │ └── README.md
│ ├── eks/
│ ├── rds/
│ └── s3/
└── global/
├── iam/
└── route53/
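With this layout, each environment is initialised and planned from its own directory against its own backend; a sketch of the day-to-day workflow:
cd environments/prod
terraform init
terraform fmt -recursive && terraform validate
terraform plan -out=tfplan
terraform apply tfplan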
Module example:
# modules/vpc/main.tf
resource "aws_vpc" "main" {
cidr_block = var.vpc_cidr
enable_dns_hostnames = true
enable_dns_support = true
tags = merge(
var.tags,
{
Name = "${var.environment}-vpc"
}
)
}
resource "aws_subnet" "private" {
count = length(var.private_subnet_cidrs)
vpc_id = aws_vpc.main.id
cidr_block = var.private_subnet_cidrs[count.index]
availability_zone = var.availability_zones[count.index]
tags = merge(
var.tags,
{
Name = "${var.environment}-private-${count.index + 1}"
Type = "private"
}
)
}
# modules/vpc/variables.tf
variable "vpc_cidr" {
description = "VPC 的 CIDR 块"
type = string
}
variable "environment" {
description = "环境名称"
type = string
}
variable "private_subnet_cidrs" {
description = "私有子网的 CIDR 块"
type = list(string)
}
variable "availability_zones" {
description = "可用区"
type = list(string)
}
variable "tags" {
description = "通用标签"
type = map(string)
default = {}
}
# modules/vpc/outputs.tf
output "vpc_id" {
value = aws_vpc.main.id
}
output "private_subnet_ids" {
value = aws_subnet.private[*].id
}
Using the modules:
# environments/prod/main.tf
module "vpc" {
source = "../../modules/vpc"
vpc_cidr = "10.0.0.0/16"
environment = "prod"
private_subnet_cidrs = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
tags = {
Project = "MyApp"
ManagedBy = "Terraform"
}
}
module "eks" {
source = "../../modules/eks"
cluster_name = "prod-cluster"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnet_ids
node_group_size = 3
}
Frequency: Common   Difficulty: Hard
Cloud Architecture
7. Design a highly available multi-region architecture on AWS.
Answer: A multi-region architecture for high availability:
Key components:
1. DNS and traffic management:
# Route 53 with health checks
resource "aws_route53_health_check" "primary" {
fqdn = "api.example.com"
port = 443
type = "HTTPS"
resource_path = "/health"
failure_threshold = 3
request_interval = 30
}
resource "aws_route53_record" "api" {
zone_id = aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
failover_routing_policy {
type = "PRIMARY"
}
set_identifier = "primary"
health_check_id = aws_route53_health_check.primary.id
alias {
name = aws_lb.primary.dns_name
zone_id = aws_lb.primary.zone_id
evaluate_target_health = true
}
}
2. Database replication:
# RDS with a cross-region read replica
resource "aws_db_instance" "primary" {
identifier = "prod-db-primary"
engine = "postgres"
instance_class = "db.r5.xlarge"
multi_az = true
backup_retention_period = 7
provider = aws.us-east-1
}
resource "aws_db_instance" "replica" {
identifier = "prod-db-replica"
replicate_source_db = aws_db_instance.primary.arn
instance_class = "db.r5.xlarge"
auto_minor_version_upgrade = false
provider = aws.us-west-2
}
3. Data replication:
# S3 cross-region replication
resource "aws_s3_bucket_replication_configuration" "replication" {
bucket = aws_s3_bucket.source.id
role = aws_iam_role.replication.arn
rule {
id = "replicate-all"
status = "Enabled"
destination {
bucket = aws_s3_bucket.destination.arn
storage_class = "STANDARD"
}
}
}
Design principles:
- Active-active or active-passive setup
- Automated failover driven by health checks
- Data replication with minimal lag
- Consistent deployments across regions
- Monitoring and alerting in both regions
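Two quick checks that the failover path is actually wired up (a sketch; the health check ID and domain are placeholders):
# Current status of the primary health check
aws route53 get-health-check-status --health-check-id <health-check-id>
# See which endpoint DNS currently resolves to
dig +short api.example.com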
Frequency: Common   Difficulty: Hard
GitOps & CI/CD
8. Explain GitOps and how to implement it with ArgoCD.
Answer: GitOps uses Git as the single source of truth for declarative infrastructure and applications.
Principles:
- Declarative configuration in Git
- Automated synchronization
- Version control for every change
- Continuous reconciliation
ArgoCD implementation:
# Application manifest
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: myapp
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/org/app-manifests
targetRevision: main
path: k8s/overlays/production
destination:
server: https://kubernetes.default.svc
namespace: production
syncPolicy:
automated:
prune: true
selfHeal: true
allowEmpty: false
syncOptions:
- CreateNamespace=true
retry:
limit: 5
backoff:
duration: 5s
factor: 2
maxDuration: 3m
Directory structure:
app-manifests/
├── base/
│ ├── deployment.yaml
│ ├── service.yaml
│ └── kustomization.yaml
└── overlays/
├── dev/
│ ├── kustomization.yaml
│ └── patches/
├── staging/
└── production/
├── kustomization.yaml
├── replicas.yaml
└── resources.yaml
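Overlays can be rendered locally before ArgoCD ever sees them, which catches most kustomize mistakes early (a sketch, run from the manifest repo root):
# Render the production overlay exactly as it will be applied
kubectl kustomize overlays/production
# Or with the standalone binary
kustomize build overlays/production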
Kustomization:
# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
bases:
- ../../base
replicas:
- name: myapp
count: 5
resources:
- ingress.yaml
patches:
- path: resources.yaml
target:
kind: Deployment
name: myapp
Benefits:
- Git as the audit trail
- Easy rollbacks (git revert)
- Declarative desired state
- Automatic drift detection
- Multi-cluster management
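Day-to-day interaction usually goes through the argocd CLI; a short sketch (the app name matches the Application manifest above):
# Compare live state against Git
argocd app diff myapp
# Trigger a sync manually and check health/sync status
argocd app sync myapp
argocd app get myapp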
Frequency: Common   Difficulty: Medium
安全 & 合规
9. How do you implement security best practices in Kubernetes?
Answer: Apply security in multiple layers:
1. Pod Security Standards:
apiVersion: v1
kind: Namespace
metadata:
name: production
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/warn: restricted
2. RBAC (Role-Based Access Control):
# Role for developers
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
namespace: production
name: developer
rules:
- apiGroups: ["", "apps"]
resources: ["pods", "deployments", "services"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["pods/log"]
verbs: ["get"]
# RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: developer-binding
namespace: production
subjects:
- kind: Group
name: developers
apiGroup: rbac.authorization.k8s.io
roleRef:
kind: Role
name: developer
apiGroup: rbac.authorization.k8s.io
3. Network Policies:
# Deny all ingress by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-ingress
namespace: production
spec:
podSelector: {}
policyTypes:
- Ingress
4. Secrets management:
# External Secrets Operator
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: app-secrets
spec:
refreshInterval: 1h
secretStoreRef:
name: aws-secrets-manager
kind: SecretStore
target:
name: app-secrets
creationPolicy: Owner
data:
- secretKey: database-password
remoteRef:
key: prod/database
property: password
5. Security Context:
apiVersion: v1
kind: Pod
metadata:
name: secure-pod
spec:
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 2000
seccompProfile:
type: RuntimeDefault
containers:
- name: app
image: myapp:1.0
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
volumeMounts:
- name: tmp
mountPath: /tmp
volumes:
- name: tmp
emptyDir: {}
6. Image scanning and admission control:
# Admission controller with OPA
apiVersion: v1
kind: ConfigMap
metadata:
name: opa-policy
data:
policy.rego: |
package kubernetes.admission
deny[msg] {
input.request.kind.kind == "Pod"
image := input.request.object.spec.containers[_].image
not startswith(image, "registry.company.com/")
msg := sprintf("Image %v is not from approved registry", [image])
}
Frequency: Very common   Difficulty: Hard
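Two verification habits that pair well with the controls above (a sketch; the user and group names are placeholders):
# Check what a member of the developers group can actually do
kubectl auth can-i delete deployments --as=jane --as-group=developers -n production
# Server-side dry run of enforcing the restricted level, to surface violating Pods
kubectl label --dry-run=server --overwrite ns production pod-security.kubernetes.io/enforce=restricted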
Observability & SRE
10. Design a comprehensive observability stack.
Answer: The three pillars of observability: metrics, logs, and traces.
Architecture:
1. Metrics (Prometheus + Grafana):
# ServiceMonitor for application metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: app-metrics
spec:
selector:
matchLabels:
app: myapp
endpoints:
- port: metrics
interval: 30s
path: /metrics
2. Logs (Loki):
# Promtail configuration for log collection
apiVersion: v1
kind: ConfigMap
metadata:
name: promtail-config
data:
promtail.yaml: |
server:
http_listen_port: 9080
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: kubernetes-pods
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
target_label: app
- source_labels: [__meta_kubernetes_namespace]
target_label: namespace
3. Traces (Jaeger):
# Application instrumentation
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Set up tracing
trace.set_tracer_provider(TracerProvider())
jaeger_exporter = JaegerExporter(
agent_host_name="jaeger-agent",
agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(
BatchSpanProcessor(jaeger_exporter)
)
tracer = trace.get_tracer(__name__)
# Use it in your code
with tracer.start_as_current_span("process_request"):
    # your code here
    pass
4. Alerting rules:
# PrometheusRule
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: app-alerts
spec:
groups:
- name: app
interval: 30s
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "检测到高错误率"
description: "错误率为 {{ $value | humanizePercentage }}"
- alert: HighLatency
expr: |
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
) > 1
for: 10m
labels:
severity: warning
annotations:
summary: "检测到高延迟"5. SLO 监控:
# SLO 定义
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
name: api-availability
spec:
service: "api"
labels:
team: "platform"
slos:
- name: "requests-availability"
objective: 99.9
description: "API 请求应该成功"
sli:
events:
errorQuery: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
totalQuery: sum(rate(http_requests_total[{{.window}}]))
alerting:
pageAlert:
labels:
severity: critical
ticketAlert:
labels:
severity: warning
Frequency: Common   Difficulty: Hard
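Alerting expressions are easy to validate before they ever reach Prometheus (a sketch; the exported rules file and service port name are assumptions):
# Static validation of the rule groups exported from the PrometheusRule
promtool check rules app-alerts.rules.yaml
# Spot-check that the app is actually exposing metrics
kubectl port-forward svc/myapp 9090:metrics &
curl -s localhost:9090/metrics | head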
Disaster Recovery
11. How do you implement disaster recovery for a Kubernetes cluster?
Answer: A comprehensive DR strategy:
1. Backup strategy:
# Velero backup schedule
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: daily-backup
namespace: velero
spec:
schedule: "0 2 * * *" # 每天凌晨 2 点
template:
includedNamespaces:
- production
- staging
excludedResources:
- events
- events.events.k8s.io
storageLocation: aws-s3
volumeSnapshotLocations:
- aws-ebs
ttl: 720h  # 30 days
2. etcd backups:
#!/bin/bash
# Automated etcd backup script
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db
# Upload to S3
aws s3 cp /backup/etcd-snapshot-*.db s3://etcd-backups/
# Clean up old backups
find /backup -name "etcd-snapshot-*.db" -mtime +7 -delete
3. Restore procedure:
# Restore etcd from a snapshot
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
--data-dir=/var/lib/etcd-restore \
--initial-cluster=etcd-0=https://10.0.1.10:2380 \
--initial-advertise-peer-urls=https://10.0.1.10:2380
# Restore applications with Velero
velero restore create --from-backup daily-backup-20231125
velero restore describe <restore-name>
4. Multi-region failover:
# Terraform for a multi-region setup
module "primary_cluster" {
source = "./modules/eks"
region = "us-east-1"
# ... configuration
}
module "dr_cluster" {
source = "./modules/eks"
region = "us-west-2"
# ... configuration
}
# Route 53 health check and failover
resource "aws_route53_health_check" "primary" {
fqdn = module.primary_cluster.endpoint
port = 443
type = "HTTPS"
resource_path = "/healthz"
failure_threshold = 3
}
5. RTO/RPO objectives:
- RTO (Recovery Time Objective): < 1 hour
- RPO (Recovery Point Objective): < 15 minutes
- Regular DR drills (monthly)
- Documented runbooks
- Automated failover wherever possible
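Backups are only as good as the last time someone checked them, so schedules and completed backups are worth verifying routinely (a sketch; the backup name is a placeholder):
# Confirm the schedule exists and recent backups completed
velero schedule get
velero backup get
velero backup describe daily-backup-<timestamp> --details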
Frequency: Common   Difficulty: Hard
Service Mesh
12. Explain service mesh architecture and when to use it.
Answer: A service mesh provides an infrastructure layer for service-to-service communication.
Core components:
Istio implementation:
# VirtualService for traffic routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: reviews
spec:
hosts:
- reviews
http:
- match:
- headers:
end-user:
exact: jason
route:
- destination:
host: reviews
subset: v2
- route:
- destination:
host: reviews
subset: v1
weight: 80
- destination:
host: reviews
subset: v2
weight: 20
# Destination Rule
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: reviews
spec:
host: reviews
trafficPolicy:
loadBalancer:
simple: LEAST_REQUEST
subsets:
- name: v1
labels:
version: v1
- name: v2
labels:
version: v2
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
http1MaxPendingRequests: 50
http2MaxRequests: 100
Circuit breaking:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: backend
spec:
host: backend
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
http1MaxPendingRequests: 10
maxRequestsPerConnection: 2
outlierDetection:
consecutiveErrors: 5
interval: 30s
baseEjectionTime: 30s
maxEjectionPercent: 50
Mutual TLS:
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: production
spec:
mtls:
mode: STRICT
# Authorization Policy
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: frontend-policy
spec:
selector:
matchLabels:
app: frontend
action: ALLOW
rules:
- from:
- source:
principals: ["cluster.local/

