November 25, 2025
13 min read

Senior AWS Cloud Engineer Interview Questions and Answers

Tags: interview, career-advice, job-search

Author: Milad Bonakdar

Prepare for senior AWS cloud engineer interviews with practical questions on architecture, networking, Auto Scaling, Lambda, cost optimization, IAM security, RDS, and production troubleshooting.


Introduction

Senior AWS cloud engineer interviews usually test how you make production trade-offs, not whether you can name services. Be ready to explain a design, defend the security model, estimate cost impact, plan failure recovery, and show how you would operate the system after launch.

This guide focuses on senior-level AWS interview questions with practical answers for architecture, networking, compute, cost optimization, IAM security, databases, monitoring, and troubleshooting.


Architecture & Design

1. Design a highly available multi-tier web application on AWS.

Answer: A production-ready multi-tier architecture requires redundancy, scalability, and security:


Key Components:

1. DNS & CDN:

# Route 53 for DNS with health checks
aws route53 create-health-check \
  --health-check-config IPAddress=203.0.113.1,Port=443,Type=HTTPS

# CloudFront for global content delivery
aws cloudfront create-distribution \
  --origin-domain-name myapp.example.com

2. Load Balancing & Auto Scaling:

# Create Application Load Balancer
aws elbv2 create-load-balancer \
  --name my-alb \
  --subnets subnet-12345 subnet-67890 \
  --security-groups sg-12345

# Create Auto Scaling Group
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name my-asg \
  --launch-template LaunchTemplateName=my-template \
  --min-size 2 \
  --max-size 10 \
  --desired-capacity 4 \
  --target-group-arns arn:aws:elasticloadbalancing:...

3. Database & Caching:

  • RDS Multi-AZ for high availability
  • Read replicas for read scaling
  • ElastiCache for session/data caching

Design Principles:

  • Deploy across multiple AZs
  • Use managed services when possible
  • Implement auto scaling
  • Separate tiers with security groups
  • Use S3 for static content
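One way to make "deploy across multiple AZs" concrete is a static-stability check: provision enough capacity that losing one AZ still leaves you able to serve peak load. A minimal sketch, with assumed traffic numbers (not from any real workload):

```python
# Static-stability capacity check: if one AZ fails, the remaining AZs must
# still cover peak load. All numbers here are illustrative assumptions.
import math

def min_instances_per_az(peak_load_rps: int, instance_capacity_rps: int,
                         az_count: int) -> int:
    """Instances per AZ so that az_count - 1 surviving AZs serve peak load."""
    surviving_azs = az_count - 1
    needed_total = math.ceil(peak_load_rps / instance_capacity_rps)
    return math.ceil(needed_total / surviving_azs)

# Example: 9,000 rps peak, 1,000 rps per instance, 3 AZs
# -> 9 instances must fit in 2 surviving AZs -> 5 per AZ (15 provisioned)
print(min_instances_per_az(9000, 1000, 3))
```

Mentioning this kind of arithmetic in an interview shows you plan for AZ failure as a capacity question, not just a checkbox.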

Rarity: Very Common
Difficulty: Hard


2. Explain VPC Peering and when to use it.

Answer: VPC Peering creates a private network connection between two VPCs over the AWS backbone.

Characteristics:

  • Private connectivity (no internet)
  • No single point of failure
  • No bandwidth bottleneck
  • Supports cross-region peering
  • Non-transitive (A↔B, B↔C doesn't mean A↔C)

Use Cases:

  • Connect production and management VPCs
  • Share resources across VPCs
  • Multi-account architectures
  • Hybrid cloud connectivity

# Create VPC peering connection
aws ec2 create-vpc-peering-connection \
  --vpc-id vpc-1a2b3c4d \
  --peer-vpc-id vpc-5e6f7a8b \
  --peer-region us-west-2

# Accept peering connection
aws ec2 accept-vpc-peering-connection \
  --vpc-peering-connection-id pcx-1234567890abcdef0

# Update route tables
aws ec2 create-route \
  --route-table-id rtb-12345 \
  --destination-cidr-block 10.1.0.0/16 \
  --vpc-peering-connection-id pcx-1234567890abcdef0
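VPC peering also requires non-overlapping CIDR blocks, so a pre-check saves a failed attempt. A quick sketch using the standard-library ipaddress module (the example CIDRs are assumptions):

```python
# Peering fails if the two VPC CIDR blocks overlap; check before creating
# the connection. Example CIDRs are illustrative.
import ipaddress

def cidrs_overlap(cidr_a: str, cidr_b: str) -> bool:
    """Return True if the two CIDR blocks overlap (peering would fail)."""
    a = ipaddress.ip_network(cidr_a)
    b = ipaddress.ip_network(cidr_b)
    return a.overlaps(b)

print(cidrs_overlap("10.0.0.0/16", "10.1.0.0/16"))    # False: safe to peer
print(cidrs_overlap("10.0.0.0/16", "10.0.128.0/20"))  # True: overlapping
```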

Alternatives:

  • Transit Gateway: Hub-and-spoke, transitive routing
  • PrivateLink: Service-to-service connectivity
  • VPN: Encrypted connectivity

Rarity: Common
Difficulty: Medium


Advanced Compute

3. How does Auto Scaling work and how do you optimize it?

Answer: Auto Scaling automatically adjusts capacity based on demand.

Scaling Policies:

1. Target Tracking:

{
  "TargetValue": 70.0,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ASGAverageCPUUtilization"
  }
}

2. Step Scaling:

{
  "AdjustmentType": "PercentChangeInCapacity",
  "MetricAggregationType": "Average",
  "StepAdjustments": [
    {
      "MetricIntervalLowerBound": 0,
      "MetricIntervalUpperBound": 10,
      "ScalingAdjustment": 10
    },
    {
      "MetricIntervalLowerBound": 10,
      "ScalingAdjustment": 30
    }
  ]
}
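To show you understand how a step policy resolves, it helps to walk through one. A simplified sketch of the policy above (intervals are offsets from the alarm threshold; AWS's exact rounding rules for PercentChangeInCapacity are omitted here):

```python
# Simplified step-scaling evaluation: pick the adjustment whose interval
# contains (metric - alarm threshold). Rounding rules are simplified.

def step_adjustment_pct(metric: float, threshold: float, steps: list) -> int:
    """Return the percent adjustment for the breached interval, else 0."""
    breach = metric - threshold
    for step in steps:
        lower = step.get("MetricIntervalLowerBound", float("-inf"))
        upper = step.get("MetricIntervalUpperBound", float("inf"))
        if lower <= breach < upper:
            return step["ScalingAdjustment"]
    return 0

steps = [
    {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 10,
     "ScalingAdjustment": 10},
    {"MetricIntervalLowerBound": 10, "ScalingAdjustment": 30},
]
# CPU at 75% with a 70% alarm threshold -> breach of 5 -> +10% of capacity
print(step_adjustment_pct(75, 70, steps))  # 10
# CPU at 85% -> breach of 15 -> +30%
print(step_adjustment_pct(85, 70, steps))  # 30
```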

3. Scheduled Scaling:

aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name my-asg \
  --scheduled-action-name scale-up-morning \
  --recurrence "0 8 * * *" \
  --desired-capacity 10

Optimization Strategies:

  • Use predictive scaling for known patterns
  • Set appropriate cooldown periods
  • Monitor scaling metrics
  • Use mixed instance types
  • Implement lifecycle hooks for graceful shutdown

Rarity: Very Common
Difficulty: Medium-Hard


Serverless & Advanced Services

4. When would you use Lambda vs EC2?

Answer: Choose based on workload characteristics:

Use Lambda when:

  • Event-driven workloads
  • Short-running tasks (< 15 minutes)
  • Variable/unpredictable traffic
  • Want zero server management
  • Cost optimization for sporadic use

Use EC2 when:

  • Long-running processes
  • Need full OS control
  • Specific software requirements
  • Consistent high load
  • Stateful applications

Lambda Example:

import json
import urllib.parse
import boto3

def lambda_handler(event, context):
    """
    Process an S3 upload event
    """
    s3 = boto3.client('s3')

    # Get bucket and key from the event (S3 URL-encodes object keys)
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])

    # Fetch the uploaded object
    response = s3.get_object(Bucket=bucket, Key=key)
    content = response['Body'].read()

    # process_data is an application-specific placeholder
    process_data(content)

    return {
        'statusCode': 200,
        'body': json.dumps('Processing complete')
    }

Cost Comparison:

  • Lambda: Pay per request + duration
  • EC2: Pay for uptime (even if idle)
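A rough break-even calculation makes the cost comparison concrete. This sketch uses assumed prices (Lambda roughly $0.20 per 1M requests plus $0.0000166667 per GB-second; a t3.small at roughly $0.0208/hour); check current regional pricing before relying on the numbers:

```python
# Rough Lambda vs EC2 monthly cost comparison. All prices are assumptions
# for illustration; verify against current AWS pricing pages.

def lambda_monthly_cost(requests: int, avg_ms: int, memory_gb: float) -> float:
    request_cost = requests / 1_000_000 * 0.20
    gb_seconds = requests * (avg_ms / 1000) * memory_gb
    return request_cost + gb_seconds * 0.0000166667

def ec2_monthly_cost(hourly_rate: float, hours: int = 730) -> float:
    return hourly_rate * hours  # billed for uptime even when idle

# 1M requests/month at 200 ms and 0.5 GB: Lambda is far cheaper than an
# always-on instance; at sustained high volume the comparison flips.
print(f"Lambda: ${lambda_monthly_cost(1_000_000, 200, 0.5):.2f}")
print(f"EC2:    ${ec2_monthly_cost(0.0208):.2f}")
```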

Rarity: Common
Difficulty: Medium


Cost Optimization

5. How do you optimize AWS costs?

Answer: A strong senior answer treats cost optimization as an operating process, not a one-time cleanup:

Strategies:

1. Right-sizing:

# Use AWS Compute Optimizer
aws compute-optimizer get-ec2-instance-recommendations \
  --instance-arns arn:aws:ec2:us-east-1:123456789012:instance/i-1234567890abcdef0

2. Reserved Instances & Savings Plans:

  • 1-year or 3-year commitments
  • Up to 72% savings vs on-demand
  • Use for steady compute after checking Cost Explorer recommendations, existing commitments, and expected roadmap changes

3. Spot Instances:

# Launch spot instances
aws ec2 request-spot-instances \
  --spot-price "0.05" \
  --instance-count 5 \
  --type "one-time" \
  --launch-specification file://specification.json

4. S3 Lifecycle Policies:

{
  "Rules": [
    {
      "Id": "Move to IA after 30 days",
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        }
      ]
    }
  ]
}
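To quantify what the lifecycle rule above buys you, here is a sketch with assumed per-GB-month prices (Standard ~$0.023, Standard-IA ~$0.0125, Glacier Flexible Retrieval ~$0.0036); verify against current S3 pricing, and remember retrieval and transition costs also apply:

```python
# Estimate storage cost under the lifecycle rule above. Prices are assumed
# for illustration and exclude retrieval/transition charges.

PRICES = {"STANDARD": 0.023, "STANDARD_IA": 0.0125, "GLACIER": 0.0036}

def storage_class_for_age(age_days: int) -> str:
    """Mirror the lifecycle rule: IA after 30 days, Glacier after 90."""
    if age_days >= 90:
        return "GLACIER"
    if age_days >= 30:
        return "STANDARD_IA"
    return "STANDARD"

def monthly_cost(size_gb: float, age_days: int) -> float:
    return size_gb * PRICES[storage_class_for_age(age_days)]

# 1 TB object: cost drops as it ages through the tiers
for age in (10, 45, 120):
    print(age, storage_class_for_age(age), f"${monthly_cost(1024, age):.2f}")
```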

5. Auto Scaling:

  • Scale down during off-hours
  • Use predictive scaling

6. Monitoring:

  • AWS Cost Explorer
  • Budget alerts
  • Tag resources for cost allocation
  • Review Cost Optimization Hub and Compute Optimizer recommendations with business context before acting

Rarity: Very Common
Difficulty: Medium


Security & Compliance

6. How do you implement defense in depth on AWS?

Answer: A senior answer should combine preventive controls, detection, and fast response across every layer:

Layers:

1. Network Security:

# VPC with private subnets
# Security groups (allow only necessary ports)
# NACLs for subnet-level control
# WAF for application protection

# Example: Restrict SSH to bastion host only
aws ec2 authorize-security-group-ingress \
  --group-id sg-app-servers \
  --protocol tcp \
  --port 22 \
  --source-group sg-bastion

2. Identity & Access:

  • Prefer federation and temporary credentials for people and workloads
  • Require MFA where long-lived or root credentials still exist
  • Grant least privilege, then review unused permissions regularly
  • Use IAM Access Analyzer to validate policies and identify public, cross-account, or unused access

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-bucket/*",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": "203.0.113.0/24"
        }
      }
    }
  ]
}
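The aws:SourceIp condition above is just a CIDR membership test, which you can approximate with the standard-library ipaddress module (real policy evaluation involves more than this one condition):

```python
# Approximate how the aws:SourceIp condition in the policy above evaluates:
# the request's source IP must fall inside the policy's CIDR block.
import ipaddress

def source_ip_allowed(source_ip: str, policy_cidr: str) -> bool:
    """True if the request's source IP is inside the policy's CIDR."""
    return ipaddress.ip_address(source_ip) in ipaddress.ip_network(policy_cidr)

print(source_ip_allowed("203.0.113.42", "203.0.113.0/24"))  # True: allowed
print(source_ip_allowed("198.51.100.7", "203.0.113.0/24"))  # False: denied
```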

3. Data Protection:

  • Encryption at rest (KMS)
  • Encryption in transit (TLS)
  • S3 bucket policies
  • RDS encryption

4. Monitoring & Logging:

# Enable CloudTrail
aws cloudtrail create-trail \
  --name my-trail \
  --s3-bucket-name my-bucket

# Enable VPC Flow Logs
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-12345 \
  --traffic-type ALL \
  --log-destination-type s3 \
  --log-destination arn:aws:s3:::my-bucket

5. Compliance:

  • AWS Config for compliance monitoring
  • Security Hub for centralized findings
  • GuardDuty for threat detection

Rarity: Very Common
Difficulty: Hard


Database Services

7. Explain RDS Multi-AZ vs Read Replicas and when to use each.

Answer: Both provide redundancy but serve different purposes:

Multi-AZ Deployment:

  • Purpose: High availability and disaster recovery
  • Synchronous replication to standby in different AZ
  • Automatic failover (1-2 minutes)
  • Same endpoint after failover
  • Standard Multi-AZ DB instances do not serve reads from the standby; Multi-AZ DB clusters can provide readable standby instances, so clarify the exact RDS topology
  • Adds cost for standby capacity and storage; estimate it against recovery requirements

# Create Multi-AZ RDS instance
aws rds create-db-instance \
  --db-instance-identifier mydb \
  --db-instance-class db.t3.medium \
  --engine postgres \
  --master-username admin \
  --master-user-password MyPassword123 \
  --allocated-storage 100 \
  --multi-az \
  --backup-retention-period 7

Read Replicas:

  • Purpose: Scale read operations
  • Asynchronous replication
  • Multiple replicas possible (up to 15 for Aurora)
  • Different endpoints for each replica
  • Can be in different regions
  • Can be promoted to standalone DB

# Create read replica
aws rds create-db-instance-read-replica \
  --db-instance-identifier mydb-replica-1 \
  --source-db-instance-identifier mydb \
  --db-instance-class db.t3.medium \
  --availability-zone us-east-1b

# Promote read replica to standalone
aws rds promote-read-replica \
  --db-instance-identifier mydb-replica-1

Comparison Table:

Feature      | Multi-AZ                         | Read Replica
Replication  | Synchronous                      | Asynchronous
Purpose      | HA / disaster recovery           | Read scaling
Failover     | Automatic                        | Manual promotion
Endpoint     | Same after failover              | Different per replica
Regions      | Same region                      | Cross-region supported
Performance  | Availability, not read scaling*  | Improves read throughput
Use Case     | Production databases             | Analytics, reporting

* Multi-AZ DB clusters add readable standbys; standard Multi-AZ instances do not.

Best Practice: Use both together

  • Multi-AZ for high availability
  • Read replicas for read scaling
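When you combine the two, a common pattern is lag-aware read routing: send reads to the freshest replica and fall back to the primary when replicas lag too far behind. A minimal sketch with hypothetical endpoints and hard-coded lag values (in practice lag comes from the ReplicaLag CloudWatch metric):

```python
# Lag-aware read routing sketch. Endpoints and lag values are hypothetical;
# real lag should come from the ReplicaLag CloudWatch metric.

PRIMARY = "mydb.cluster-xxx.us-east-1.rds.amazonaws.com"
REPLICAS = {
    "mydb-replica-1.xxx.us-east-1.rds.amazonaws.com": 0.8,   # lag in seconds
    "mydb-replica-2.xxx.us-east-1.rds.amazonaws.com": 45.0,
}

def choose_read_endpoint(max_lag_seconds: float = 5.0) -> str:
    """Prefer the freshest replica; fall back to the primary if all lag."""
    healthy = {ep: lag for ep, lag in REPLICAS.items() if lag <= max_lag_seconds}
    if healthy:
        return min(healthy, key=healthy.get)
    return PRIMARY

print(choose_read_endpoint())     # replica-1 (lowest acceptable lag)
print(choose_read_endpoint(0.5))  # primary (no replica is fresh enough)
```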

Rarity: Very Common
Difficulty: Medium-Hard


8. How do you implement database migration with minimal downtime?

Answer: Database migration strategies for production systems:

Strategy 1: AWS DMS (Database Migration Service)

# Create replication instance
aws dms create-replication-instance \
  --replication-instance-identifier my-replication-instance \
  --replication-instance-class dms.t3.medium \
  --allocated-storage 100

# Create source endpoint
aws dms create-endpoint \
  --endpoint-identifier source-db \
  --endpoint-type source \
  --engine-name postgres \
  --server-name source-db.example.com \
  --port 5432 \
  --username admin \
  --password MyPassword123

# Create target endpoint
aws dms create-endpoint \
  --endpoint-identifier target-db \
  --endpoint-type target \
  --engine-name aurora-postgresql \
  --server-name target-db.cluster-xxx.us-east-1.rds.amazonaws.com \
  --port 5432 \
  --username admin \
  --password MyPassword123

# Create migration task
aws dms create-replication-task \
  --replication-task-identifier migration-task \
  --source-endpoint-arn arn:aws:dms:us-east-1:123456789012:endpoint:source-db \
  --target-endpoint-arn arn:aws:dms:us-east-1:123456789012:endpoint:target-db \
  --replication-instance-arn arn:aws:dms:us-east-1:123456789012:rep:my-replication-instance \
  --migration-type full-load-and-cdc \
  --table-mappings file://table-mappings.json

Migration Phases:

1. Full Load:

  • Copy existing data
  • Can take hours/days
  • Application still uses source

2. CDC (Change Data Capture):

  • Replicate ongoing changes
  • Keeps target in sync
  • Minimal lag (seconds)

3. Cutover:

# Migration cutover script
import boto3
import time

def perform_cutover():
    """
    Cutover to the new database with minimal downtime.
    The enable/disable/update/restart helpers are application-specific stubs.
    """
    # 1. Enable maintenance mode
    enable_maintenance_mode()
    
    # 2. Wait for replication lag to be zero
    wait_for_replication_sync()
    
    # 3. Update application config
    update_database_endpoint(
        old_endpoint='source-db.example.com',
        new_endpoint='target-db.cluster-xxx.us-east-1.rds.amazonaws.com'
    )
    
    # 4. Restart application
    restart_application()
    
    # 5. Verify connectivity
    verify_database_connection()
    
    # 6. Disable maintenance mode
    disable_maintenance_mode()
    
    print("Cutover complete!")

from datetime import datetime, timedelta

def wait_for_replication_sync(max_lag_seconds=5):
    """Wait for CDC replication lag to drop below the threshold"""
    # Note: describe_replication_tasks reports load progress, not lag.
    # CDC latency is exposed as the CloudWatch CDCLatencyTarget metric.
    cloudwatch = boto3.client('cloudwatch')

    while True:
        response = cloudwatch.get_metric_statistics(
            Namespace='AWS/DMS',
            MetricName='CDCLatencyTarget',
            Dimensions=[
                {'Name': 'ReplicationTaskIdentifier', 'Value': 'migration-task'},
                {'Name': 'ReplicationInstanceIdentifier', 'Value': 'my-replication-instance'}
            ],
            StartTime=datetime.utcnow() - timedelta(minutes=5),
            EndTime=datetime.utcnow(),
            Period=60,
            Statistics=['Average']
        )
        datapoints = sorted(response['Datapoints'], key=lambda d: d['Timestamp'])
        lag = datapoints[-1]['Average'] if datapoints else float('inf')

        if lag < max_lag_seconds:
            print(f"Replication lag: {lag}s - Ready for cutover")
            break

        print(f"Replication lag: {lag}s - Waiting...")
        time.sleep(10)

Strategy 2: Blue-Green Deployment

# Create Aurora clone (instant, copy-on-write)
aws rds restore-db-cluster-to-point-in-time \
  --source-db-cluster-identifier production-cluster \
  --db-cluster-identifier staging-cluster \
  --restore-type copy-on-write \
  --use-latest-restorable-time

# Test on staging
# When ready, swap DNS/endpoints

Downtime Comparison:

  • DMS: < 1 minute (just cutover)
  • Blue-Green: < 30 seconds (DNS switch)
  • Traditional dump/restore: Hours to days

Rarity: Common
Difficulty: Hard


Monitoring & Troubleshooting

9. How do you troubleshoot high AWS costs?

Answer: Cost optimization requires systematic analysis:

Investigation Steps:

1. Use Cost Explorer:

# Get cost breakdown by service
aws ce get-cost-and-usage \
  --time-period Start=2026-04-01,End=2026-04-30 \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE

# Get cost by resource tags
aws ce get-cost-and-usage \
  --time-period Start=2026-04-01,End=2026-04-30 \
  --granularity DAILY \
  --metrics BlendedCost \
  --group-by Type=TAG,Key=Environment

2. Identify Cost Anomalies:

import boto3
from datetime import datetime, timedelta

def analyze_cost_anomalies():
    """
    Identify unusual cost spikes
    """
    ce = boto3.client('ce')
    
    # Get last 30 days of costs
    end_date = datetime.now()
    start_date = end_date - timedelta(days=30)
    
    response = ce.get_cost_and_usage(
        TimePeriod={
            'Start': start_date.strftime('%Y-%m-%d'),
            'End': end_date.strftime('%Y-%m-%d')
        },
        Granularity='DAILY',
        Metrics=['BlendedCost'],
        GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
    )
    
    # Analyze each service
    for result in response['ResultsByTime']:
        date = result['TimePeriod']['Start']
        for group in result['Groups']:
            service = group['Keys'][0]
            cost = float(group['Metrics']['BlendedCost']['Amount'])
            
            # Flag costs > $100/day
            if cost > 100:
                print(f"⚠️  {date}: {service} = ${cost:.2f}")
    
    return response

# Common cost culprits
cost_culprits = {
    'EC2': [
        'Oversized instances',
        'Idle instances',
        'Unattached EBS volumes',
        'Old snapshots'
    ],
    'RDS': [
        'Multi-AZ when not needed',
        'Oversized instances',
        'Excessive backup retention'
    ],
    'S3': [
        'Wrong storage class',
        'No lifecycle policies',
        'Excessive requests'
    ],
    'Data Transfer': [
        'Cross-region traffic',
        'NAT Gateway usage',
        'CloudFront not used'
    ]
}

3. Resource Cleanup Script:

#!/bin/bash
# Find and report unused resources

echo "=== Unattached EBS Volumes ==="
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].[VolumeId,Size,CreateTime]' \
  --output table

echo "=== Idle EC2 Instances (< 5% CPU for 7 days) ==="
# Use CloudWatch to identify
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time $(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 86400 \
  --statistics Average

echo "=== Elastic IPs not attached ==="
aws ec2 describe-addresses \
  --filters "Name=domain,Values=vpc" \
  --query 'Addresses[?AssociationId==null].[PublicIp,AllocationId]' \
  --output table

echo "=== Old Snapshots (> 90 days) ==="
aws ec2 describe-snapshots \
  --owner-ids self \
  --query 'Snapshots[?StartTime<=`'$(date -u -d '90 days ago' +%Y-%m-%d)'`].[SnapshotId,StartTime,VolumeSize]' \
  --output table

4. Set Up Cost Alerts:

# Create budget alert
aws budgets create-budget \
  --account-id 123456789012 \
  --budget file://budget.json \
  --notifications-with-subscribers file://notifications.json

# budget.json
{
  "BudgetName": "Monthly-Budget",
  "BudgetLimit": {
    "Amount": "1000",
    "Unit": "USD"
  },
  "TimeUnit": "MONTHLY",
  "BudgetType": "COST"
}

Quick Wins:

  • Delete unattached EBS volumes
  • Stop/terminate idle EC2 instances
  • Use S3 Intelligent-Tiering
  • Enable S3 lifecycle policies
  • Use Spot instances for non-critical workloads
  • Right-size over-provisioned instances

Rarity: Very Common
Difficulty: Medium


Advanced Networking

10. Explain AWS Transit Gateway and its use cases.

Answer: Transit Gateway is a managed hub that connects VPCs and on-premises networks in a hub-and-spoke topology, simplifying network architecture.

Without Transit Gateway:


Problem: N² connections (mesh topology)

With Transit Gateway:


Solution: Hub-and-spoke (N connections)
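The arithmetic behind this claim is worth having ready in an interview: a full mesh of peering connections grows as n*(n-1)/2, while a Transit Gateway needs one attachment per VPC.

```python
# Connection count: full-mesh VPC peering vs Transit Gateway attachments.

def mesh_connections(n: int) -> int:
    """Peering connections needed for a full mesh of n VPCs."""
    return n * (n - 1) // 2

def tgw_attachments(n: int) -> int:
    """One Transit Gateway attachment per VPC."""
    return n

for n in (4, 10, 50):
    print(n, mesh_connections(n), tgw_attachments(n))
# At 50 VPCs: 1,225 peering connections vs 50 attachments
```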

Key Features:

  • Transitive routing: attached networks can reach each other through the hub
  • Centralized management
  • Supports up to 5,000 attachments per gateway
  • Cross-region peering
  • Route tables for traffic control

Setup:

# Create Transit Gateway
aws ec2 create-transit-gateway \
  --description "Main Transit Gateway" \
  --options AmazonSideAsn=64512,AutoAcceptSharedAttachments=enable

# Attach VPC
aws ec2 create-transit-gateway-vpc-attachment \
  --transit-gateway-id tgw-1234567890abcdef0 \
  --vpc-id vpc-1234567890abcdef0 \
  --subnet-ids subnet-1234567890abcdef0 subnet-0987654321fedcba0

# Create route in VPC route table
aws ec2 create-route \
  --route-table-id rtb-1234567890abcdef0 \
  --destination-cidr-block 10.0.0.0/8 \
  --transit-gateway-id tgw-1234567890abcdef0

# Create Transit Gateway route table
aws ec2 create-transit-gateway-route-table \
  --transit-gateway-id tgw-1234567890abcdef0

# Add route
aws ec2 create-transit-gateway-route \
  --destination-cidr-block 10.1.0.0/16 \
  --transit-gateway-route-table-id tgw-rtb-1234567890abcdef0 \
  --transit-gateway-attachment-id tgw-attach-1234567890abcdef0

Use Cases:

1. Multi-VPC Architecture:

# Example: Centralized egress
vpc_architecture = {
    'production_vpcs': ['vpc-prod-1', 'vpc-prod-2', 'vpc-prod-3'],
    'shared_services': 'vpc-shared',  # NAT, proxies, etc.
    'on_premises': 'vpn-connection'
}

# All production VPCs route internet traffic through shared services VPC
# Centralized security controls, logging, NAT

2. Network Segmentation:

# Separate route tables for different environments
# Production can't reach development
# Development can reach shared services

3. Multi-Region Connectivity:

# Create Transit Gateway in us-east-1
aws ec2 create-transit-gateway --region us-east-1

# Create Transit Gateway in eu-west-1
aws ec2 create-transit-gateway --region eu-west-1

# Peer them
aws ec2 create-transit-gateway-peering-attachment \
  --transit-gateway-id tgw-us-east-1 \
  --peer-transit-gateway-id tgw-eu-west-1 \
  --peer-region eu-west-1

Cost Considerations:

  • Attachments and data processing are billable, so estimate traffic before centralizing everything
  • Centralized inspection, NAT, and cross-region routing can change the bill quickly
  • Check current regional pricing before choosing Transit Gateway over peering or PrivateLink
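To make the cost trade-off tangible, here is a rough monthly estimate under assumed us-east-1 prices (~$0.05/hour per attachment, ~$0.02 per GB processed); VPC peering has no attachment charge, so at small scale it often wins. Verify current regional pricing before deciding:

```python
# Rough Transit Gateway monthly cost estimate. Hourly and per-GB rates are
# assumptions for illustration; check current regional pricing.

def tgw_monthly_cost(attachments: int, gb_processed: float,
                     hourly_rate: float = 0.05, per_gb: float = 0.02,
                     hours: int = 730) -> float:
    return attachments * hourly_rate * hours + gb_processed * per_gb

# 10 VPC attachments pushing 5 TB/month through the hub
print(f"${tgw_monthly_cost(10, 5 * 1024):.2f}")
```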

Alternatives:

  • VPC Peering: Simpler, cheaper for few VPCs
  • PrivateLink: Service-to-service connectivity
  • VPN: Direct connections

Rarity: Common
Difficulty: Hard


Conclusion

Senior AWS cloud engineer interviews require deep technical knowledge and practical experience. Focus on:

  1. Architecture: Multi-tier designs, high availability, disaster recovery
  2. Advanced Networking: VPC peering, Transit Gateway, PrivateLink
  3. Compute: Auto Scaling optimization, Lambda vs EC2 decisions
  4. Cost Optimization: Right-sizing, reserved instances, lifecycle policies
  5. Security: Defense in depth, IAM best practices, encryption
  6. Operational Excellence: Monitoring, logging, automation

Back each answer with a production example: the trade-off you chose, the failure mode you planned for, the metric you monitored, and what you would improve next.
