November 25, 2025

Senior Cloud Engineer AWS Interview Questions: Complete Guide

Milad Bonakdar

Master advanced AWS concepts with comprehensive interview questions covering architecture design, auto scaling, advanced networking, cost optimization, and security for senior cloud engineer roles.


Introduction

Senior AWS cloud engineers are expected to design scalable architectures, optimize costs, implement advanced security, and solve complex cloud challenges. This role requires deep expertise in AWS services, architectural best practices, and hands-on experience with production systems.

This guide covers essential interview questions for senior AWS cloud engineers, focusing on architecture, advanced services, and strategic cloud solutions.


Architecture & Design

1. Design a highly available multi-tier web application on AWS.

Answer: A production-ready multi-tier architecture requires redundancy, scalability, and security:


Key Components:

1. DNS & CDN:

# Route 53 for DNS with health checks
aws route53 create-health-check \
  --caller-reference webapp-health-check-2024 \
  --health-check-config IPAddress=203.0.113.1,Port=443,Type=HTTPS

# CloudFront for global content delivery
aws cloudfront create-distribution \
  --origin-domain-name myapp.example.com

2. Load Balancing & Auto Scaling:

# Create Application Load Balancer
aws elbv2 create-load-balancer \
  --name my-alb \
  --subnets subnet-12345 subnet-67890 \
  --security-groups sg-12345

# Create Auto Scaling Group
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name my-asg \
  --launch-template LaunchTemplateName=my-template \
  --min-size 2 \
  --max-size 10 \
  --desired-capacity 4 \
  --target-group-arns arn:aws:elasticloadbalancing:...

3. Database & Caching:

  • RDS Multi-AZ for high availability
  • Read replicas for read scaling
  • ElastiCache for session/data caching
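
A minimal boto3 sketch of the caching tier above (the name, node size, subnet group, and security group are placeholders to adjust for your workload):

import boto3

elasticache = boto3.client('elasticache')

# Redis replication group with a primary and one replica spread across AZs
elasticache.create_replication_group(
    ReplicationGroupId='webapp-sessions',            # placeholder name
    ReplicationGroupDescription='Session cache for the web tier',
    Engine='redis',
    CacheNodeType='cache.t3.micro',                  # size for your workload
    NumCacheClusters=2,                              # primary + one replica
    AutomaticFailoverEnabled=True,
    MultiAZEnabled=True,
    CacheSubnetGroupName='private-cache-subnets',    # placeholder subnet group
    SecurityGroupIds=['sg-0123456789abcdef0']        # cache-tier security group
)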

Design Principles:

  • Deploy across multiple AZs
  • Use managed services when possible
  • Implement auto scaling
  • Separate tiers with security groups
  • Use S3 for static content

Rarity: Very Common
Difficulty: Hard


2. Explain VPC Peering and when to use it.

Answer: VPC Peering connects two VPCs privately over the AWS backbone network, without traversing the public internet.

Characteristics:

  • Private connectivity (no internet)
  • No single point of failure
  • No bandwidth bottleneck
  • Supports cross-region peering
  • Non-transitive (A↔B, B↔C doesn't mean A↔C)

Use Cases:

  • Connect production and management VPCs
  • Share resources across VPCs
  • Multi-account architectures
  • Cross-region private connectivity between VPCs

# Create VPC peering connection
aws ec2 create-vpc-peering-connection \
  --vpc-id vpc-1a2b3c4d \
  --peer-vpc-id vpc-5e6f7g8h \
  --peer-region us-west-2

# Accept peering connection
aws ec2 accept-vpc-peering-connection \
  --vpc-peering-connection-id pcx-1234567890abcdef0

# Update route tables
aws ec2 create-route \
  --route-table-id rtb-12345 \
  --destination-cidr-block 10.1.0.0/16 \
  --vpc-peering-connection-id pcx-1234567890abcdef0

Alternatives:

  • Transit Gateway: Hub-and-spoke, transitive routing
  • PrivateLink: Service-to-service connectivity
  • VPN: Encrypted connectivity

Rarity: Common
Difficulty: Medium


Advanced Compute

3. How does Auto Scaling work and how do you optimize it?

Answer: Auto Scaling automatically adjusts capacity based on demand.

Scaling Policies:

1. Target Tracking:

{
  "TargetValue": 70.0,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ASGAverageCPUUtilization"
  }
}
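
As one way to attach the target-tracking configuration above, a hedged boto3 sketch (the group and policy names are illustrative):

import boto3

autoscaling = boto3.client('autoscaling')

autoscaling.put_scaling_policy(
    AutoScalingGroupName='my-asg',
    PolicyName='cpu-target-70',
    PolicyType='TargetTrackingScaling',
    TargetTrackingConfiguration={
        'TargetValue': 70.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'ASGAverageCPUUtilization'
        }
    }
)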

2. Step Scaling:

{
  "AdjustmentType": "PercentChangeInCapacity",
  "MetricAggregationType": "Average",
  "StepAdjustments": [
    {
      "MetricIntervalLowerBound": 0,
      "MetricIntervalUpperBound": 10,
      "ScalingAdjustment": 10
    },
    {
      "MetricIntervalLowerBound": 10,
      "ScalingAdjustment": 30
    }
  ]
}

3. Scheduled Scaling:

aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name my-asg \
  --scheduled-action-name scale-up-morning \
  --recurrence "0 8 * * *" \
  --desired-capacity 10

Optimization Strategies:

  • Use predictive scaling for known patterns
  • Set appropriate cooldown periods
  • Monitor scaling metrics
  • Use mixed instance types
  • Implement lifecycle hooks for graceful shutdown
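
For the lifecycle-hook point above, a minimal sketch (the hook name, timeout, and group name are illustrative):

import boto3

autoscaling = boto3.client('autoscaling')

autoscaling.put_lifecycle_hook(
    LifecycleHookName='drain-connections',
    AutoScalingGroupName='my-asg',
    LifecycleTransition='autoscaling:EC2_INSTANCE_TERMINATING',
    HeartbeatTimeout=300,      # seconds the instance waits in Terminating:Wait
    DefaultResult='CONTINUE'   # proceed with termination if nothing responds in time
)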

Rarity: Very Common
Difficulty: Medium-Hard


Serverless & Advanced Services

4. When would you use Lambda vs EC2?

Answer: Choose based on workload characteristics:

Use Lambda when:

  • Event-driven workloads
  • Short-running tasks (< 15 minutes)
  • Variable/unpredictable traffic
  • Want zero server management
  • Cost optimization for sporadic use

Use EC2 when:

  • Long-running processes
  • Need full OS control
  • Specific software requirements
  • Consistent high load
  • Stateful applications

Lambda Example:

import json
import boto3

def lambda_handler(event, context):
    """
    Process S3 upload event
    """
    s3 = boto3.client('s3')
    
    # Get bucket and key from event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    
    # Process file
    response = s3.get_object(Bucket=bucket, Key=key)
    content = response['Body'].read()
    
    # Do something with content
    process_data(content)
    
    return {
        'statusCode': 200,
        'body': json.dumps('Processing complete')
    }

Cost Comparison:

  • Lambda: Pay per request + duration
  • EC2: Pay for uptime (even if idle)
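
A back-of-the-envelope comparison, assuming us-east-1 list prices at the time of writing (roughly $0.20 per million Lambda requests, $0.0000167 per GB-second, and about $0.0416/hour for a t3.medium); the free tier is ignored and current pricing should be verified:

requests_per_month = 1_000_000
avg_duration_s = 0.2           # 200 ms per invocation
memory_gb = 0.5                # 512 MB

lambda_request_cost = (requests_per_month / 1_000_000) * 0.20
lambda_compute_cost = requests_per_month * avg_duration_s * memory_gb * 0.0000166667
lambda_cost = lambda_request_cost + lambda_compute_cost   # ≈ $1.87

ec2_cost = 0.0416 * 730        # t3.medium running all month ≈ $30.37

print(f"Lambda: ${lambda_cost:.2f}/month vs EC2: ${ec2_cost:.2f}/month")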

Rarity: Common
Difficulty: Medium


Cost Optimization

5. How do you optimize AWS costs?

Answer: Cost optimization requires continuous monitoring and adjustment:

Strategies:

1. Right-sizing:

# Use AWS Compute Optimizer
aws compute-optimizer get-ec2-instance-recommendations \
  --instance-arns arn:aws:ec2:us-east-1:123456789012:instance/i-1234567890abcdef0

2. Reserved Instances & Savings Plans:

  • 1-year or 3-year commitments
  • Up to 72% savings vs on-demand
  • Use for predictable workloads
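
To find commitment candidates, Cost Explorer can recommend Savings Plans purchases; a hedged boto3 sketch (the term and payment options shown are just one possible choice):

import boto3

ce = boto3.client('ce')

response = ce.get_savings_plans_purchase_recommendation(
    SavingsPlansType='COMPUTE_SP',
    TermInYears='ONE_YEAR',
    PaymentOption='NO_UPFRONT',
    LookbackPeriodInDays='THIRTY_DAYS'
)

# Each detail describes a recommended hourly commitment and estimated savings
recommendation = response.get('SavingsPlansPurchaseRecommendation', {})
for detail in recommendation.get('SavingsPlansPurchaseRecommendationDetails', []):
    print(detail)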

3. Spot Instances:

# Launch spot instances
aws ec2 request-spot-instances \
  --spot-price "0.05" \
  --instance-count 5 \
  --type "one-time" \
  --launch-specification file://specification.json

4. S3 Lifecycle Policies:

{
  "Rules": [
    {
      "Id": "Move to IA after 30 days",
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        }
      ]
    }
  ]
}
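
The same rules can be applied with boto3; a sketch assuming a placeholder bucket name (the API expects a Filter or Prefix on each rule):

import boto3

s3 = boto3.client('s3')

s3.put_bucket_lifecycle_configuration(
    Bucket='my-bucket',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'Move to IA after 30 days',
                'Status': 'Enabled',
                'Filter': {'Prefix': ''},   # apply to all objects
                'Transitions': [
                    {'Days': 30, 'StorageClass': 'STANDARD_IA'},
                    {'Days': 90, 'StorageClass': 'GLACIER'}
                ]
            }
        ]
    }
)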

5. Auto Scaling:

  • Scale down during off-hours
  • Use predictive scaling

6. Monitoring:

  • AWS Cost Explorer
  • Budget alerts
  • Tag resources for cost allocation

Rarity: Very Common
Difficulty: Medium


Security & Compliance

6. How do you implement defense in depth on AWS?

Answer: Multi-layered security approach:

Layers:

1. Network Security:

# VPC with private subnets
# Security groups (allow only necessary ports)
# NACLs for subnet-level control
# WAF for application protection

# Example: Restrict SSH to bastion host only
aws ec2 authorize-security-group-ingress \
  --group-id sg-app-servers \
  --protocol tcp \
  --port 22 \
  --source-group sg-bastion

2. Identity & Access:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-bucket/*",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": "203.0.113.0/24"
        }
      }
    }
  ]
}

3. Data Protection:

  • Encryption at rest (KMS)
  • Encryption in transit (TLS)
  • S3 bucket policies
  • RDS encryption
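
A minimal sketch of encryption at rest with a customer-managed key (the bucket name is a placeholder):

import boto3

kms = boto3.client('kms')
s3 = boto3.client('s3')

# Customer-managed KMS key for application data
key_id = kms.create_key(Description='Application data key')['KeyMetadata']['KeyId']

# Default SSE-KMS encryption for every new object in the bucket
s3.put_bucket_encryption(
    Bucket='my-bucket',
    ServerSideEncryptionConfiguration={
        'Rules': [{
            'ApplyServerSideEncryptionByDefault': {
                'SSEAlgorithm': 'aws:kms',
                'KMSMasterKeyID': key_id
            }
        }]
    }
)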

4. Monitoring & Logging:

# Enable CloudTrail
aws cloudtrail create-trail \
  --name my-trail \
  --s3-bucket-name my-bucket

# Enable VPC Flow Logs
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-12345 \
  --traffic-type ALL \
  --log-destination-type s3 \
  --log-destination arn:aws:s3:::my-bucket

5. Compliance:

  • AWS Config for compliance monitoring
  • Security Hub for centralized findings
  • GuardDuty for threat detection
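
A short sketch of turning on the detection services in a single account and region (an organization-wide rollout usually goes through a delegated administrator account instead):

import boto3

guardduty = boto3.client('guardduty')
securityhub = boto3.client('securityhub')

# Threat detection
guardduty.create_detector(Enable=True)

# Centralized findings with the default standards enabled
securityhub.enable_security_hub(EnableDefaultStandards=True)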

Rarity: Very Common
Difficulty: Hard


Database Services

7. Explain RDS Multi-AZ vs Read Replicas and when to use each.

Answer: Both provide redundancy but serve different purposes:

Multi-AZ Deployment:

  • Purpose: High availability and disaster recovery
  • Synchronous replication to standby in different AZ
  • Automatic failover (1-2 minutes)
  • Same endpoint after failover
  • No performance benefit for reads
  • Doubles cost (standby instance)

# Create Multi-AZ RDS instance
aws rds create-db-instance \
  --db-instance-identifier mydb \
  --db-instance-class db.t3.medium \
  --engine postgres \
  --master-username admin \
  --master-user-password MyPassword123 \
  --allocated-storage 100 \
  --multi-az \
  --backup-retention-period 7

Read Replicas:

  • Purpose: Scale read operations
  • Asynchronous replication
  • Multiple replicas possible (up to 15 for Aurora)
  • Different endpoints for each replica
  • Can be in different regions
  • Can be promoted to standalone DB

# Create read replica
aws rds create-db-instance-read-replica \
  --db-instance-identifier mydb-replica-1 \
  --source-db-instance-identifier mydb \
  --db-instance-class db.t3.medium \
  --availability-zone us-east-1b

# Promote read replica to standalone
aws rds promote-read-replica \
  --db-instance-identifier mydb-replica-1

Comparison Table:

Feature        Multi-AZ                 Read Replica
Replication    Synchronous              Asynchronous
Purpose        HA/DR                    Read scaling
Failover       Automatic                Manual promotion
Endpoint       Same                     Different
Regions        Same region only         Cross-region supported
Performance    No read benefit          Improves read performance
Use Case       Production databases     Analytics, reporting

Best Practice: Use both together

  • Multi-AZ for high availability
  • Read replicas for read scaling

Rarity: Very Common
Difficulty: Medium-Hard


8. How do you implement database migration with minimal downtime?

Answer: Database migration strategies for production systems:

Strategy 1: AWS DMS (Database Migration Service)

# Create replication instance
aws dms create-replication-instance \
  --replication-instance-identifier my-replication-instance \
  --replication-instance-class dms.t3.medium \
  --allocated-storage 100

# Create source endpoint
aws dms create-endpoint \
  --endpoint-identifier source-db \
  --endpoint-type source \
  --engine-name postgres \
  --server-name source-db.example.com \
  --port 5432 \
  --username admin \
  --password MyPassword123

# Create target endpoint
aws dms create-endpoint \
  --endpoint-identifier target-db \
  --endpoint-type target \
  --engine-name aurora-postgresql \
  --server-name target-db.cluster-xxx.us-east-1.rds.amazonaws.com \
  --port 5432 \
  --username admin \
  --password MyPassword123

# Create migration task
aws dms create-replication-task \
  --replication-task-identifier migration-task \
  --source-endpoint-arn arn:aws:dms:us-east-1:123456789012:endpoint:source-db \
  --target-endpoint-arn arn:aws:dms:us-east-1:123456789012:endpoint:target-db \
  --replication-instance-arn arn:aws:dms:us-east-1:123456789012:rep:my-replication-instance \
  --migration-type full-load-and-cdc \
  --table-mappings file://table-mappings.json

Migration Phases:

1. Full Load:

  • Copy existing data
  • Can take hours/days
  • Application still uses source

2. CDC (Change Data Capture):

  • Replicate ongoing changes
  • Keeps target in sync
  • Minimal lag (seconds)

3. Cutover:

# Migration cutover script
import boto3
import time

def perform_cutover():
    """
    Cutover to new database with minimal downtime
    """
    # 1. Enable maintenance mode
    enable_maintenance_mode()
    
    # 2. Wait for replication lag to be zero
    wait_for_replication_sync()
    
    # 3. Update application config
    update_database_endpoint(
        old_endpoint='source-db.example.com',
        new_endpoint='target-db.cluster-xxx.us-east-1.rds.amazonaws.com'
    )
    
    # 4. Restart application
    restart_application()
    
    # 5. Verify connectivity
    verify_database_connection()
    
    # 6. Disable maintenance mode
    disable_maintenance_mode()
    
    print("Cutover complete!")

def wait_for_replication_sync(poll_seconds=10):
    """Wait until the DMS full load has finished before cutting over"""
    dms = boto3.client('dms')

    while True:
        response = dms.describe_replication_tasks(
            Filters=[{'Name': 'replication-task-id', 'Values': ['migration-task']}]
        )

        stats = response['ReplicationTasks'][0]['ReplicationTaskStats']
        progress = stats['FullLoadProgressPercent']

        if progress == 100:
            # Full load done; check the CDCLatencySource/CDCLatencyTarget
            # CloudWatch metrics (AWS/DMS namespace) before the final cutover
            print("Full load complete - verify CDC latency, then cut over")
            break

        print(f"Full load progress: {progress}% - Waiting...")
        time.sleep(poll_seconds)

Strategy 2: Blue-Green Deployment

# Create Aurora clone (instant, copy-on-write)
aws rds restore-db-cluster-to-point-in-time \
  --source-db-cluster-identifier production-cluster \
  --db-cluster-identifier staging-cluster \
  --restore-type copy-on-write \
  --use-latest-restorable-time

# Test on staging
# When ready, swap DNS/endpoints

Downtime Comparison:

  • DMS: < 1 minute (just cutover)
  • Blue-Green: < 30 seconds (DNS switch)
  • Traditional dump/restore: Hours to days

Rarity: Common
Difficulty: Hard


Monitoring & Troubleshooting

9. How do you troubleshoot high AWS costs?

Answer: Troubleshooting unexpectedly high costs requires systematic analysis:

Investigation Steps:

1. Use Cost Explorer:

# Get cost breakdown by service
aws ce get-cost-and-usage \
  --time-period Start=2024-11-01,End=2024-11-30 \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE

# Get cost by resource tags
aws ce get-cost-and-usage \
  --time-period Start=2024-11-01,End=2024-11-30 \
  --granularity DAILY \
  --metrics BlendedCost \
  --group-by Type=TAG,Key=Environment

2. Identify Cost Anomalies:

import boto3
from datetime import datetime, timedelta

def analyze_cost_anomalies():
    """
    Identify unusual cost spikes
    """
    ce = boto3.client('ce')
    
    # Get last 30 days of costs
    end_date = datetime.now()
    start_date = end_date - timedelta(days=30)
    
    response = ce.get_cost_and_usage(
        TimePeriod={
            'Start': start_date.strftime('%Y-%m-%d'),
            'End': end_date.strftime('%Y-%m-%d')
        },
        Granularity='DAILY',
        Metrics=['BlendedCost'],
        GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
    )
    
    # Analyze each service
    for result in response['ResultsByTime']:
        date = result['TimePeriod']['Start']
        for group in result['Groups']:
            service = group['Keys'][0]
            cost = float(group['Metrics']['BlendedCost']['Amount'])
            
            # Flag costs > $100/day
            if cost > 100:
                print(f"⚠️  {date}: {service} = ${cost:.2f}")
    
    return response

# Common cost culprits
cost_culprits = {
    'EC2': [
        'Oversized instances',
        'Idle instances',
        'Unattached EBS volumes',
        'Old snapshots'
    ],
    'RDS': [
        'Multi-AZ when not needed',
        'Oversized instances',
        'Excessive backup retention'
    ],
    'S3': [
        'Wrong storage class',
        'No lifecycle policies',
        'Excessive requests'
    ],
    'Data Transfer': [
        'Cross-region traffic',
        'NAT Gateway usage',
        'CloudFront not used'
    ]
}

3. Resource Cleanup Script:

#!/bin/bash
# Find and report unused resources

echo "=== Unattached EBS Volumes ==="
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].[VolumeId,Size,CreateTime]' \
  --output table

echo "=== Idle EC2 Instances (< 5% CPU for 7 days) ==="
# Use CloudWatch to identify
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time $(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 86400 \
  --statistics Average

echo "=== Elastic IPs not attached ==="
aws ec2 describe-addresses \
  --filters "Name=domain,Values=vpc" \
  --query 'Addresses[?AssociationId==null].[PublicIp,AllocationId]' \
  --output table

echo "=== Old Snapshots (> 90 days) ==="
aws ec2 describe-snapshots \
  --owner-ids self \
  --query 'Snapshots[?StartTime<=`'$(date -u -d '90 days ago' +%Y-%m-%d)'`].[SnapshotId,StartTime,VolumeSize]' \
  --output table

4. Set Up Cost Alerts:

# Create budget alert
aws budgets create-budget \
  --account-id 123456789012 \
  --budget file://budget.json \
  --notifications-with-subscribers file://notifications.json

# budget.json
{
  "BudgetName": "Monthly-Budget",
  "BudgetLimit": {
    "Amount": "1000",
    "Unit": "USD"
  },
  "TimeUnit": "MONTHLY",
  "BudgetType": "COST"
}

Quick Wins:

  • Delete unattached EBS volumes
  • Stop/terminate idle EC2 instances
  • Use S3 Intelligent-Tiering
  • Enable S3 lifecycle policies
  • Use Spot instances for non-critical workloads
  • Right-size over-provisioned instances
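
A cautious sketch for the first quick win; DRY_RUN stays on so nothing is deleted until the list has been reviewed:

import boto3

DRY_RUN = True
ec2 = boto3.client('ec2')

volumes = ec2.describe_volumes(
    Filters=[{'Name': 'status', 'Values': ['available']}]
)['Volumes']

for vol in volumes:
    print(f"Unattached: {vol['VolumeId']} ({vol['Size']} GiB, created {vol['CreateTime']})")
    if not DRY_RUN:
        ec2.delete_volume(VolumeId=vol['VolumeId'])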

Rarity: Very Common
Difficulty: Medium


Advanced Networking

10. Explain AWS Transit Gateway and its use cases.

Answer: Transit Gateway is a regional routing hub that connects VPCs and on-premises networks in a hub-and-spoke topology, simplifying large network architectures.

Without Transit Gateway:


Problem: N² connections (mesh topology)

With Transit Gateway:


Solution: Hub-and-spoke (N connections)

Key Features:

  • Transitive routing: A→TGW→B→TGW→C works
  • Centralized management
  • Supports up to 5,000 attachments (VPCs, VPNs, Direct Connect gateways)
  • Cross-region peering
  • Route tables for traffic control

Setup:

# Create Transit Gateway
aws ec2 create-transit-gateway \
  --description "Main Transit Gateway" \
  --options AmazonSideAsn=64512,AutoAcceptSharedAttachments=enable

# Attach VPC
aws ec2 create-transit-gateway-vpc-attachment \
  --transit-gateway-id tgw-1234567890abcdef0 \
  --vpc-id vpc-1234567890abcdef0 \
  --subnet-ids subnet-1234567890abcdef0 subnet-0987654321fedcba0

# Create route in VPC route table
aws ec2 create-route \
  --route-table-id rtb-1234567890abcdef0 \
  --destination-cidr-block 10.0.0.0/8 \
  --transit-gateway-id tgw-1234567890abcdef0

# Create Transit Gateway route table
aws ec2 create-transit-gateway-route-table \
  --transit-gateway-id tgw-1234567890abcdef0

# Add route
aws ec2 create-transit-gateway-route \
  --destination-cidr-block 10.1.0.0/16 \
  --transit-gateway-route-table-id tgw-rtb-1234567890abcdef0 \
  --transit-gateway-attachment-id tgw-attach-1234567890abcdef0

Use Cases:

1. Multi-VPC Architecture:

# Example: Centralized egress
vpc_architecture = {
    'production_vpcs': ['vpc-prod-1', 'vpc-prod-2', 'vpc-prod-3'],
    'shared_services': 'vpc-shared',  # NAT, proxies, etc.
    'on_premises': 'vpn-connection'
}

# All production VPCs route internet traffic through shared services VPC
# Centralized security controls, logging, NAT

2. Network Segmentation:

# Separate route tables for different environments
# Production can't reach development
# Development can reach shared services
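
A sketch of how that segmentation is expressed with separate Transit Gateway route tables (all IDs are placeholders): production attachments associate with a production route table that only propagates routes from the shared-services attachment, so production never learns development routes.

import boto3

ec2 = boto3.client('ec2')

# Associate the production VPC attachment with the production route table
ec2.associate_transit_gateway_route_table(
    TransitGatewayRouteTableId='tgw-rtb-0prod000000000000',
    TransitGatewayAttachmentId='tgw-attach-0prod00000000000'
)

# Only propagate routes from the shared-services attachment into that table
ec2.enable_transit_gateway_route_table_propagation(
    TransitGatewayRouteTableId='tgw-rtb-0prod000000000000',
    TransitGatewayAttachmentId='tgw-attach-0shared000000000'
)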

3. Multi-Region Connectivity:

# Create Transit Gateway in us-east-1
aws ec2 create-transit-gateway --region us-east-1

# Create Transit Gateway in eu-west-1
aws ec2 create-transit-gateway --region eu-west-1

# Peer them
aws ec2 create-transit-gateway-peering-attachment \
  --transit-gateway-id tgw-us-east-1 \
  --peer-transit-gateway-id tgw-eu-west-1 \
  --peer-region eu-west-1

Cost Considerations:

  • $0.05/hour per attachment
  • $0.02/GB data processed
  • Can be expensive at scale
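
A rough monthly estimate using the list prices above (verify pricing for your region):

attachments = 10
hours_per_month = 730
data_gb = 5000

attachment_cost = attachments * 0.05 * hours_per_month   # $365.00
data_cost = data_gb * 0.02                                # $100.00
print(f"Estimated Transit Gateway cost: ${attachment_cost + data_cost:.2f}/month")   # ≈ $465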

Alternatives:

  • VPC Peering: Simpler, cheaper for few VPCs
  • PrivateLink: Service-to-service connectivity
  • Site-to-Site VPN / Direct Connect: Hybrid connectivity to on-premises networks

Rarity: Common
Difficulty: Hard


Conclusion

Senior AWS cloud engineer interviews require deep technical knowledge and practical experience. Focus on:

  1. Architecture: Multi-tier designs, high availability, disaster recovery
  2. Advanced Networking: VPC peering, Transit Gateway, PrivateLink
  3. Compute: Auto Scaling optimization, Lambda vs EC2 decisions
  4. Cost Optimization: Right-sizing, reserved instances, lifecycle policies
  5. Security: Defense in depth, IAM best practices, encryption
  6. Operational Excellence: Monitoring, logging, automation

Demonstrate real-world experience with production systems, cost optimization initiatives, and security implementations. Good luck!
