Senior System Administrator Interview Questions: Complete Guide

Milad Bonakdar
Author
Master advanced system administration concepts with comprehensive interview questions covering virtualization, automation, disaster recovery, security, and enterprise IT infrastructure for senior sysadmin roles.
Introduction
Senior System Administrators design, implement, and manage complex IT infrastructure, lead teams, and ensure enterprise-level reliability and security. This role requires deep technical expertise, automation skills, and strategic thinking.
This guide covers essential interview questions for senior system administrators, focusing on advanced concepts and enterprise solutions.
Virtualization & Cloud
1. Explain the difference between Type 1 and Type 2 hypervisors.
Answer:
Type 1 (Bare Metal):
- Runs directly on hardware
- Better performance
- Examples: VMware ESXi, Hyper-V, KVM
Type 2 (Hosted):
- Runs on host OS
- Easier to set up
- Examples: VMware Workstation, VirtualBox
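Both hypervisor types perform best with hardware virtualization extensions (Intel VT-x or AMD-V) enabled. On Linux you can check for the vmx or svm CPU flags in /proc/cpuinfo; a minimal sketch of that check (the helper function is illustrative, not a standard tool):

```python
def has_hw_virt(cpuinfo_text):
    """Return the virtualization extension advertised in /proc/cpuinfo
    text: 'vmx' (Intel VT-x), 'svm' (AMD-V), or None if absent."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = line.split(":", 1)[1].split()
            if "vmx" in flags:
                return "vmx"
            if "svm" in flags:
                return "svm"
    return None

# Sample cpuinfo fragment; on a real host, read /proc/cpuinfo instead
SAMPLE = "processor : 0\nflags\t\t: fpu vmx sse2 aes\n"
print(has_hw_virt(SAMPLE))  # vmx
```

The same check is what tools like virt-host-validate perform before allowing KVM guests.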
KVM Management:
# List VMs
virsh list --all
# Start VM
virsh start vm-name
# Create VM from XML
virsh define vm-config.xml
# Clone VM
virt-clone --original vm1 --name vm2 --auto-clone
# VM resource allocation
virsh setmem vm-name 4G
virsh setvcpus vm-name 4
Rarity: Common
Difficulty: Medium
2. How do you design high availability clusters?
Answer: High Availability (HA) ensures services remain accessible despite failures.
Cluster Types:
Active-Passive Cluster:
- One node active, others standby
- Automatic failover on failure
- Lower resource utilization
Active-Active Cluster:
- All nodes serve traffic
- Better resource utilization
- More complex configuration
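The active-passive model boils down to a priority election: the highest-priority healthy node holds the virtual IP, and when it fails a standby takes over. This mirrors keepalived's VRRP priority election shown later; a small sketch (node names and priorities are hypothetical):

```python
def elect_active(nodes, healthy):
    """Pick the node that should hold the virtual IP: the
    highest-priority node that is currently healthy.
    nodes: list of (name, priority) tuples; higher priority wins."""
    candidates = [n for n in nodes if n[0] in healthy]
    if not candidates:
        return None  # total outage: no healthy node to hold the VIP
    return max(candidates, key=lambda n: n[1])[0]

nodes = [("node1", 100), ("node2", 90)]
print(elect_active(nodes, {"node1", "node2"}))  # node1 (highest priority)
print(elect_active(nodes, {"node2"}))           # node2 (failover)
```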
Pacemaker + Corosync Setup:
# Install cluster software
sudo apt install pacemaker corosync pcs
# Configure cluster authentication
sudo passwd hacluster
sudo pcs cluster auth node1 node2 -u hacluster
# Create cluster (pcs 0.10+ syntax differs: pcs host auth node1 node2, then pcs cluster setup mycluster node1 node2)
sudo pcs cluster setup --name mycluster node1 node2
# Start cluster
sudo pcs cluster start --all
sudo pcs cluster enable --all
# Disable STONITH for testing (enable in production)
sudo pcs property set stonith-enabled=false
# Create virtual IP resource
sudo pcs resource create virtual_ip ocf:heartbeat:IPaddr2 \
ip=192.168.1.100 cidr_netmask=24 \
op monitor interval=30s
# Create web service resource
sudo pcs resource create webserver ocf:heartbeat:apache \
configfile=/etc/apache2/apache2.conf \
statusurl="http://localhost/server-status" \
op monitor interval=1min
# Group resources together
sudo pcs resource group add webgroup virtual_ip webserver
# Set resource constraints
sudo pcs constraint colocation add webserver with virtual_ip INFINITY
sudo pcs constraint order virtual_ip then webserver
# Check cluster status
sudo pcs status
sudo crm_mon -1
Keepalived (Simple HA):
# Install keepalived
sudo apt install keepalived
# Configure on Master
sudo vi /etc/keepalived/keepalived.conf
vrrp_instance VI_1 {
state MASTER
interface eth0
virtual_router_id 51
priority 100
advert_int 1
authentication {
auth_type PASS
auth_pass secret123
}
virtual_ipaddress {
192.168.1.100/24
}
track_script {
chk_nginx
}
}
vrrp_script chk_nginx {
script "/usr/bin/killall -0 nginx"
interval 2
weight 2
}
Database Replication (MySQL):
# Master configuration
[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_do_db = production
# Create replication user
CREATE USER 'repl'@'%' IDENTIFIED BY 'password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
FLUSH PRIVILEGES;
# Get master status
SHOW MASTER STATUS;
# Slave configuration
[mysqld]
server-id = 2
relay-log = /var/log/mysql/mysql-relay-bin
log_bin = /var/log/mysql/mysql-bin.log
read_only = 1
# Configure slave
CHANGE MASTER TO
MASTER_HOST='master-ip',
MASTER_USER='repl',
MASTER_PASSWORD='password',
MASTER_LOG_FILE='mysql-bin.000001',
MASTER_LOG_POS=107;
START SLAVE;
SHOW SLAVE STATUS\G
Health Checks:
#!/bin/bash
# Service health check script
check_service() {
if systemctl is-active --quiet "$1"; then
return 0
else
return 1
fi
}
if ! check_service nginx; then
echo "Nginx down, attempting restart"
systemctl restart nginx
sleep 5
if ! check_service nginx; then
echo "Nginx failed to restart, triggering failover"
# Trigger failover
pcs resource move webgroup node2
fi
fi
Testing Failover:
# Simulate node failure
sudo pcs cluster stop node1
# Verify failover
sudo pcs status
ping 192.168.1.100
# Restore node
sudo pcs cluster start node1
Rarity: Common
Difficulty: Hard
Automation & Scripting
3. How do you automate system administration tasks?
Answer: Automation reduces toil and improves consistency:
Bash Scripting:
#!/bin/bash
# Automated server health check
HOSTNAME=$(hostname)
DATE=$(date '+%Y-%m-%d %H:%M:%S')
REPORT="/var/log/health-check.log"
echo "=== Health Check: $DATE ===" >> $REPORT
# CPU Load
LOAD=$(uptime | awk -F'load average:' '{print $2}')
echo "Load Average: $LOAD" >> $REPORT
# Memory Usage
MEM=$(free -h | grep Mem | awk '{print "Used: "$3" / "$2}')
echo "Memory: $MEM" >> $REPORT
# Disk Usage
echo "Disk Usage:" >> $REPORT
df -h | grep -vE '^Filesystem|tmpfs|cdrom' >> $REPORT
# Failed Services
FAILED=$(systemctl --failed --no-pager)
if [ -n "$FAILED" ]; then
echo "Failed Services:" >> $REPORT
echo "$FAILED" >> $REPORT
fi
# Send alert if critical
DISK_USAGE=$(df -h / | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 90 ]; then
echo "CRITICAL: Disk usage above 90%" | mail -s "Alert: $HOSTNAME" admin@company.com
fi
Ansible Playbook:
---
- name: Configure web servers
hosts: webservers
become: yes
tasks:
- name: Install packages
apt:
name:
- nginx
- python3
- git
state: present
update_cache: yes
- name: Copy nginx config
template:
src: nginx.conf.j2
dest: /etc/nginx/nginx.conf
notify: restart nginx
- name: Ensure nginx is running
service:
name: nginx
state: started
enabled: yes
handlers:
- name: restart nginx
service:
name: nginx
state: restarted
Rarity: Very Common
Difficulty: Medium-Hard
4. How do you manage configuration across hundreds of servers?
Answer: Configuration management at scale requires automation and consistency.
Tool Comparison:
| Tool | Type | Language | Agent | Complexity |
|---|---|---|---|---|
| Ansible | Push | YAML | Agentless | Low |
| Puppet | Pull | Ruby DSL | Agent | High |
| Chef | Pull | Ruby | Agent | High |
| SaltStack | Push/Pull | YAML | Agent/Agentless | Medium |
Ansible at Scale:
# inventory/production
[webservers]
web[01:20].company.com
[databases]
db[01:05].company.com
[loadbalancers]
lb[01:02].company.com
[webservers:vars]
ansible_user=deploy
ansible_become=yes
# playbooks/site.yml
---
- name: Configure all servers
hosts: all
roles:
- common
- security
- monitoring
- name: Configure web servers
hosts: webservers
roles:
- nginx
- php
- application
- name: Configure databases
hosts: databases
roles:
- mysql
- backup
# roles/common/tasks/main.yml
---
- name: Update all packages
apt:
upgrade: dist
update_cache: yes
cache_valid_time: 3600
- name: Install common packages
apt:
name:
- vim
- htop
- curl
- git
state: present
- name: Configure NTP
template:
src: ntp.conf.j2
dest: /etc/ntp.conf
notify: restart ntp
- name: Ensure services are running
service:
name: "{{ item }}"
state: started
enabled: yes
loop:
- ntp
- rsyslog
Dynamic Inventory:
#!/usr/bin/env python3
# dynamic_inventory.py - AWS EC2 dynamic inventory
import json
import boto3
def get_inventory():
ec2 = boto3.client('ec2')
response = ec2.describe_instances()
inventory = {
'_meta': {'hostvars': {}},
'all': {'hosts': []}
}
for reservation in response['Reservations']:
for instance in reservation['Instances']:
if instance['State']['Name'] != 'running':
continue
hostname = instance['PrivateIpAddress']
inventory['all']['hosts'].append(hostname)
# Group by tags
for tag in instance.get('Tags', []):
if tag['Key'] == 'Role':
role = tag['Value']
if role not in inventory:
inventory[role] = {'hosts': []}
inventory[role]['hosts'].append(hostname)
return inventory
if __name__ == '__main__':
print(json.dumps(get_inventory(), indent=2))
Infrastructure as Code Best Practices:
1. Version Control:
# Git workflow
git checkout -b feature/update-nginx-config
# Make changes
git add .
git commit -m "Update nginx SSL configuration"
git push origin feature/update-nginx-config
# Create pull request for review
2. Testing:
# Test playbook syntax
ansible-playbook --syntax-check site.yml
# Dry run
ansible-playbook site.yml --check
# Run on staging first
ansible-playbook -i inventory/staging site.yml
# Deploy to production
ansible-playbook -i inventory/production site.yml
3. Secrets Management:
# Ansible Vault
ansible-vault create secrets.yml
ansible-vault encrypt vars/passwords.yml
ansible-playbook site.yml --ask-vault-pass
# Or use password file
ansible-playbook site.yml --vault-password-file ~/.vault_pass
4. Idempotency:
# Bad - not idempotent
- name: Add line to file
shell: echo "config=value" >> /etc/app.conf
# Good - idempotent
- name: Ensure config line exists
lineinfile:
path: /etc/app.conf
line: "config=value"
state: present
Parallel Execution:
# Run on 10 hosts at a time
ansible-playbook -i inventory site.yml --forks 10
# Limit to specific hosts
ansible-playbook site.yml --limit webservers
# Run specific tags
ansible-playbook site.yml --tags "configuration,deploy"
Rarity: Common
Difficulty: Medium-Hard
Disaster Recovery
5. How do you design a disaster recovery plan?
Answer: Comprehensive DR strategy:
Key Metrics:
- RTO (Recovery Time Objective): Max acceptable downtime
- RPO (Recovery Point Objective): Max acceptable data loss
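As a rule of thumb, the backup schedule bounds your worst-case RPO: if backups run every N hours and take M hours to complete, you can lose up to N + M hours of data. A small sketch of that check (the intervals are illustrative):

```python
def meets_rpo(backup_interval_h, backup_duration_h, rpo_h):
    """Check whether a backup schedule can satisfy an RPO.
    Worst-case data loss is the gap between consistent backups:
    the interval plus the time the backup itself takes to finish."""
    worst_case_loss_h = backup_interval_h + backup_duration_h
    return worst_case_loss_h <= rpo_h

# Nightly backup taking 2h against a 24h RPO: worst case is 26h of loss
print(meets_rpo(24, 2, 24))  # False
# 6-hourly backups taking 1h against a 12h RPO
print(meets_rpo(6, 1, 12))   # True
```

The same reasoning drives decisions like moving from nightly dumps to continuous replication when the RPO is measured in minutes.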
DR Strategy:
1. Backup Strategy:
#!/bin/bash
# Automated backup with retention
BACKUP_SOURCE="/var/www /etc /home"
BACKUP_DEST="/mnt/backup"
REMOTE_SERVER="backup.company.com"
RETENTION_DAYS=30
# Create backup
DATE=$(date +%Y%m%d)
tar -czf $BACKUP_DEST/backup-$DATE.tar.gz $BACKUP_SOURCE
# Sync to remote
rsync -avz --delete $BACKUP_DEST/ $REMOTE_SERVER:/backups/
# Clean old backups
find $BACKUP_DEST -name "backup-*.tar.gz" -mtime +$RETENTION_DAYS -delete
# Verify backup
tar -tzf $BACKUP_DEST/backup-$DATE.tar.gz > /dev/null
if [ $? -eq 0 ]; then
echo "Backup verified successfully"
else
echo "Backup verification failed!" | mail -s "Backup Alert" admin@company.com
fi
2. Database Replication:
# MySQL Master-Slave setup
# On the replica (CHANGE MASTER TO is run on the replica, not the master):
CHANGE MASTER TO
MASTER_HOST='master-server',
MASTER_USER='repl_user',
MASTER_PASSWORD='password',
MASTER_LOG_FILE='mysql-bin.000001',
MASTER_LOG_POS=107;
START SLAVE;
SHOW SLAVE STATUS\G
3. Documentation:
- Recovery procedures
- Contact lists
- System diagrams
- Configuration backups
Rarity: Very Common
Difficulty: Hard
Security Hardening
6. How do you harden a Linux server?
Answer: Multi-layered security approach:
1. System Updates:
# Automated security updates (Ubuntu)
sudo apt install unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades
2. SSH Hardening:
# /etc/ssh/sshd_config
# Change the default port (sshd_config does not allow trailing comments)
Port 2222
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
AllowUsers admin devops
MaxAuthTries 3
ClientAliveInterval 300
ClientAliveCountMax 2
3. Firewall Configuration:
# iptables rules
iptables -P INPUT DROP
iptables -P FORWARD DROP
iptables -P OUTPUT ACCEPT
# Allow established connections
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
# Allow SSH (custom port)
iptables -A INPUT -p tcp --dport 2222 -j ACCEPT
# Allow HTTP/HTTPS
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --dport 443 -j ACCEPT
# Save rules
iptables-save > /etc/iptables/rules.v4
4. Intrusion Detection:
# Install AIDE
sudo apt install aide
sudo aideinit
# Check for changes
sudo aide --check
5. Audit Logging:
# Enable auditd
sudo systemctl enable auditd
sudo systemctl start auditd
# Monitor file access
sudo auditctl -w /etc/passwd -p wa -k passwd_changes
sudo auditctl -w /etc/shadow -p wa -k shadow_changes
Rarity: Very Common
Difficulty: Hard
Performance Optimization
7. How do you optimize server performance?
Answer: Systematic performance tuning:
1. Identify Bottlenecks:
# CPU
mpstat 1 10
# Memory
vmstat 1 10
# Disk I/O
iostat -x 1 10
# Network
iftop
nethogs
2. Optimize Services:
# Nginx tuning
worker_processes auto;
worker_connections 4096;
keepalive_timeout 65;
gzip on;
gzip_types text/plain text/css application/json;
# MySQL tuning
innodb_buffer_pool_size = 4G
max_connections = 200
query_cache_size = 64M  # removed in MySQL 8.0; applies to 5.7 and earlier
3. Kernel Tuning:
# /etc/sysctl.conf
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_fin_timeout = 30
vm.swappiness = 10
fs.file-max = 100000
4. Monitor and Alert:
# Prometheus + Grafana
# Node Exporter for system metrics
# Custom alerts for thresholds
Rarity: Common
Difficulty: Medium-Hard
8. How do you design a comprehensive monitoring and alerting solution?
Answer: Effective monitoring prevents outages and enables quick incident response.
Monitoring Stack Architecture:
- Exporters (node_exporter, mysqld_exporter, nginx exporter) expose metrics on each host
- Prometheus scrapes the exporters, stores metrics, and evaluates alert rules
- Alertmanager groups and routes firing alerts to email, Slack, or PagerDuty
- Grafana queries Prometheus to render dashboards
Prometheus Setup:
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'node'
static_configs:
- targets:
- 'server1:9100'
- 'server2:9100'
- 'server3:9100'
- job_name: 'mysql'
static_configs:
- targets: ['db1:9104']
- job_name: 'nginx'
static_configs:
- targets: ['web1:9113']
alerting:
alertmanagers:
- static_configs:
- targets: ['localhost:9093']
rule_files:
- 'alerts/*.yml'
Alert Rules:
# alerts/system.yml
groups:
- name: system_alerts
interval: 30s
rules:
- alert: HighCPUUsage
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value }}%"
- alert: HighMemoryUsage
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
for: 5m
labels:
severity: critical
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is {{ $value }}%"
- alert: DiskSpaceLow
expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
for: 5m
labels:
severity: critical
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Only {{ $value }}% disk space remaining"
- alert: ServiceDown
expr: up == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Service {{ $labels.job }} down"
description: "{{ $labels.instance }} has been down for more than 2 minutes"
Alertmanager Configuration:
# alertmanager.yml
global:
resolve_timeout: 5m
slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
route:
group_by: ['alertname', 'cluster']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'default'
routes:
- match:
severity: critical
receiver: 'pagerduty'
continue: true
- match:
severity: warning
receiver: 'slack'
receivers:
- name: 'default'
email_configs:
- to: 'team@company.com'
from: 'alerts@company.com'
smarthost: 'smtp.company.com:587'
- name: 'slack'
slack_configs:
- channel: '#alerts'
title: '{{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'pagerduty'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_KEY'
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'instance']
Grafana Dashboard:
{
"dashboard": {
"title": "System Overview",
"panels": [
{
"title": "CPU Usage",
"targets": [
{
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
}
]
},
{
"title": "Memory Usage",
"targets": [
{
"expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100"
}
]
}
]
}
}
SLO/SLA/SLI Concepts:
SLI (Service Level Indicator):
- Quantitative measure of service level
- Examples: Uptime %, latency, error rate
SLO (Service Level Objective):
- Target value for SLI
- Example: 99.9% uptime, p95 latency < 200ms
SLA (Service Level Agreement):
- Contract with consequences
- Example: 99.9% uptime or customer gets refund
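A useful quantity derived from an availability SLO is the error budget: the downtime the objective permits per period. A minimal calculation, assuming a 30-day rolling window:

```python
def error_budget_minutes(slo, period_days=30):
    """Minutes of allowed downtime per period for an availability SLO.
    E.g. a 99.9% SLO leaves 0.1% of the period as error budget."""
    return (1 - slo) * period_days * 24 * 60

print(round(error_budget_minutes(0.999), 1))   # 43.2 minutes (three nines)
print(round(error_budget_minutes(0.9999), 2))  # 4.32 minutes (four nines)
```

Teams often gate risky changes on remaining error budget: once the budget for the window is spent, feature work pauses in favor of reliability work.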
# SLO Example
- alert: SLOViolation
expr: |
(
sum(rate(http_requests_total{status=~"2.."}[30d]))
/
sum(rate(http_requests_total[30d]))
) < 0.999
labels:
severity: critical
annotations:
summary: "SLO violation: Success rate below 99.9%"
Preventing Alert Fatigue:
1. Meaningful Alerts:
- Alert on symptoms, not causes
- Every alert should be actionable
- Remove noisy alerts
2. Alert Grouping:
- Group related alerts
- Use inhibition rules
- Set appropriate thresholds
3. Escalation:
- Warning → Team chat
- Critical → PagerDuty
- Use on-call rotations
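The grouping idea above (Alertmanager's group_by) can be sketched in a few lines: alerts sharing the same grouping labels collapse into one notification. The alert dicts below are illustrative, not Alertmanager's internal format:

```python
from collections import defaultdict

def group_alerts(alerts, group_by=("alertname",)):
    """Bucket alerts by their grouping labels, as Alertmanager's
    group_by does, so one notification covers many instances."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(k) for k in group_by)
        groups[key].append(alert)
    return dict(groups)

alerts = [
    {"labels": {"alertname": "HighCPUUsage", "instance": "server1"}},
    {"labels": {"alertname": "HighCPUUsage", "instance": "server2"}},
    {"labels": {"alertname": "ServiceDown", "instance": "db1"}},
]
grouped = group_alerts(alerts)
print(len(grouped))                     # 2 notifications instead of 3
print(len(grouped[("HighCPUUsage",)]))  # 2 alerts collapsed into one group
```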
Rarity: Common
Difficulty: Hard
Enterprise Infrastructure
9. How do you manage a large-scale Windows environment?
Answer: Centralized management strategies:
Group Policy Management:
# Create GPO
New-GPO -Name "Security Policy" -Comment "Enterprise security settings"
# Link to OU
New-GPLink -Name "Security Policy" -Target "OU=Servers,DC=company,DC=com"
# Configure password policy
Set-ADDefaultDomainPasswordPolicy -Identity company.com `
-MinPasswordLength 12 `
-PasswordHistoryCount 24 `
-MaxPasswordAge 90.00:00:00
# Deploy software via GPO
# Computer Configuration > Policies > Software Settings > Software Installation
WSUS (Windows Update):
# Configure WSUS
Set-ItemProperty -Path "HKLM:\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate" `
-Name "WUServer" -Value "http://wsus.company.com:8530"
# Force update check
wuauclt /detectnow /updatenow
# Note: wuauclt is deprecated on Windows 10 and later; consider the PSWindowsUpdate module
PowerShell Remoting:
# Enable remoting
Enable-PSRemoting -Force
# Execute on multiple servers
Invoke-Command -ComputerName server1,server2,server3 -ScriptBlock {
Get-Service | Where-Object {$_.Status -eq "Stopped"}
}
# Parallel execution (ForEach-Object -Parallel requires PowerShell 7+)
$servers = Get-Content servers.txt
$servers | ForEach-Object -Parallel {
Test-Connection -ComputerName $_ -Count 1
} -ThrottleLimit 10
Rarity: Common
Difficulty: Hard
Conclusion
Senior system administrator interviews require deep technical expertise and leadership experience. Focus on:
- Virtualization: Hypervisors, resource management, migration
- High Availability: Clustering, failover, replication
- Automation: Scripting, configuration management, orchestration
- Configuration Management: Ansible, Puppet, IaC at scale
- Disaster Recovery: Backup strategies, replication, testing
- Security: Hardening, compliance, monitoring
- Performance: Optimization, capacity planning, troubleshooting
- Monitoring: Prometheus, Grafana, alerting, SLO/SLA
- Enterprise Management: AD, GPO, centralized administration
Demonstrate real-world experience with complex infrastructure and strategic decision-making. Good luck!




