Senior System Administrator Interview Questions: Complete Guide

Milad Bonakdar
Author
Master advanced system administration concepts with comprehensive interview questions covering virtualization, automation, disaster recovery, security, and enterprise IT infrastructure for senior sysadmin roles.
Introduction
Senior System Administrators design, implement, and manage complex IT infrastructure, lead teams, and ensure enterprise-level reliability and security. This role requires deep technical expertise, automation skills, and strategic thinking.
This guide covers essential interview questions for senior system administrators, focusing on advanced concepts and enterprise solutions.
Virtualization & Cloud
1. Explain the difference between Type 1 and Type 2 hypervisors.
Answer:
Type 1 (Bare Metal):
- Runs directly on hardware
- Better performance
- Examples: VMware ESXi, Hyper-V, KVM
Type 2 (Hosted):
- Runs on host OS
- Easier to set up
- Examples: VMware Workstation, VirtualBox
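Both hypervisor types perform best with hardware virtualization extensions (Intel VT-x or AMD-V) enabled. On Linux you can check for the vmx or svm CPU flags in /proc/cpuinfo; a minimal sketch of that check (the helper function is illustrative, not a standard tool):

```python
def has_hw_virt(cpuinfo_text):
    """Return the virtualization extension advertised in /proc/cpuinfo
    text: 'vmx' (Intel VT-x), 'svm' (AMD-V), or None if absent."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = line.split(":", 1)[1].split()
            if "vmx" in flags:
                return "vmx"
            if "svm" in flags:
                return "svm"
    return None

# Sample cpuinfo fragment; on a real host, read /proc/cpuinfo instead
SAMPLE = "processor : 0\nflags\t\t: fpu vmx sse2 aes\n"
print(has_hw_virt(SAMPLE))  # vmx
```

The same check is what tools like virt-host-validate perform before allowing KVM guests.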
KVM Management:
# List VMs
virsh list --all
# Start VM
virsh start vm-name
# Create VM from XML
virsh define vm-config.xml
# Clone VM
virt-clone --original vm1 --name vm2 --auto-clone
# VM resource allocation
virsh setmem vm-name 4G
virsh setvcpus vm-name 4
Rarity: Common
Difficulty: Medium
2. How do you design high availability clusters?
Answer: High Availability (HA) ensures services remain accessible despite failures.
Cluster Types:
Active-Passive Cluster:
- One node active, others standby
- Automatic failover on failure
- Lower resource utilization
Active-Active Cluster:
- All nodes serve traffic
- Better resource utilization
- More complex configuration
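The active-passive model boils down to a priority election: the highest-priority healthy node holds the virtual IP, and when it fails a standby takes over. This mirrors keepalived's VRRP priority election shown later; a small sketch (node names and priorities are hypothetical):

```python
def elect_active(nodes, healthy):
    """Pick the node that should hold the virtual IP: the
    highest-priority node that is currently healthy.
    nodes: list of (name, priority) tuples; higher priority wins."""
    candidates = [n for n in nodes if n[0] in healthy]
    if not candidates:
        return None  # total outage: no healthy node to hold the VIP
    return max(candidates, key=lambda n: n[1])[0]

nodes = [("node1", 100), ("node2", 90)]
print(elect_active(nodes, {"node1", "node2"}))  # node1 (highest priority)
print(elect_active(nodes, {"node2"}))           # node2 (failover)
```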
Pacemaker + Corosync Setup:
# Install cluster software
sudo apt install pacemaker corosync pcs
# Configure cluster authentication
sudo passwd hacluster
sudo pcs cluster auth node1 node2 -u hacluster
# Create cluster (pcs 0.10+ syntax differs: pcs host auth node1 node2, then pcs cluster setup mycluster node1 node2)
sudo pcs cluster setup --name mycluster node1 node2
# Start cluster
sudo pcs cluster start --all
sudo pcs cluster enable --all
# Disable STONITH for testing (enable in production)
sudo pcs property set stonith-enabled=false
# Create virtual IP resource
sudo pcs resource create virtual_ip ocf:heartbeat:IPaddr2 \
ip=192.168.1.100 cidr_netmask=24 \
op monitor interval=30s
# Create web service resource
sudo pcs resource create webserver ocf:heartbeat:apache \
configfile=/etc/apache2/apache2.conf \
statusurl="http://localhost/server-status" \
op monitor interval=1min
# Group resources together
sudo pcs resource group add webgroup virtual_ip webserver
# Set resource constraints
sudo pcs constraint colocation add webserver with virtual_ip INFINITY
sudo pcs constraint order virtual_ip then webserver
# Check cluster status
sudo pcs status
sudo crm_mon -1
Keepalived (Simple HA):
# Install keepalived
sudo apt install keepalived
# Configure on Master
sudo vi /etc/keepalived/keepalived.conf
vrrp_instance VI_1 {
state MASTER
interface eth0
virtual_router_id 51
priority 100
advert_int 1
authentication {
auth_type PASS
auth_pass secret123
}
virtual_ipaddress {
192.168.1.100/24
}
track_script {
chk_nginx
}
}
vrrp_script chk_nginx {
script "/usr/bin/killall -0 nginx"
interval 2
weight 2
}
Database Replication (MySQL):
# Master configuration
[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_do_db = production
# Create replication user
CREATE USER 'repl'@'%' IDENTIFIED BY 'password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
FLUSH PRIVILEGES;
# Get master status
SHOW MASTER STATUS;
# Slave configuration
[mysqld]
server-id = 2
relay-log = /var/log/mysql/mysql-relay-bin
log_bin = /var/log/mysql/mysql-bin.log
read_only = 1
# Configure slave
CHANGE MASTER TO
MASTER_HOST='master-ip',
MASTER_USER='repl',
MASTER_PASSWORD='password',
MASTER_LOG_FILE='mysql-bin.000001',
MASTER_LOG_POS=107;
START SLAVE;
SHOW SLAVE STATUS\G
Health Checks:
#!/bin/bash
# Service health check script
check_service() {
if systemctl is-active --quiet "$1"; then
return 0
else
return 1
fi
}
if ! check_service nginx; then
echo "Nginx down, attempting restart"
systemctl restart nginx
sleep 5
if ! check_service nginx; then
echo "Nginx failed to restart, triggering failover"
# Trigger failover
pcs resource move webgroup node2
fi
fi
Testing Failover:
# Simulate node failure
sudo pcs cluster stop node1
# Verify failover
sudo pcs status
ping 192.168.1.100
# Restore node
sudo pcs cluster start node1
Rarity: Common
Difficulty: Hard
Automation & Scripting
3. How do you automate system administration tasks?
Answer: Automation reduces toil and improves consistency:
Bash Scripting:
#!/bin/bash
# Automated server health check
HOSTNAME=$(hostname)
DATE=$(date '+%Y-%m-%d %H:%M:%S')
REPORT="/var/log/health-check.log"
echo "=== Health Check: $DATE ===" >> $REPORT
# CPU Load
LOAD=$(uptime | awk -F'load average:' '{print $2}')
echo "Load Average: $LOAD" >> $REPORT
# Memory Usage
MEM=$(free -h | grep Mem | awk '{print "Used: "$3" / "$2}')
echo "Memory: $MEM" >> $REPORT
# Disk Usage
echo "Disk Usage:" >> $REPORT
df -h | grep -vE '^Filesystem|tmpfs|cdrom' >> $REPORT
# Failed Services
FAILED=$(systemctl --failed --no-pager)
if [ -n "$FAILED" ]; then
echo "Failed Services:" >> $REPORT
echo "$FAILED" >> $REPORT
fi
# Send alert if critical
DISK_USAGE=$(df -h / | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 90 ]; then
echo "CRITICAL: Disk usage above 90%" | mail -s "Alert: $HOSTNAME" admin@company.com
fi
Ansible Playbook:
---
- name: Configure web servers
hosts: webservers
become: yes
tasks:
- name: Install packages
apt:
name:
- nginx
- python3
- git
state: present
update_cache: yes
- name: Copy nginx config
template:
src: nginx.conf.j2
dest: /etc/nginx/nginx.conf
notify: restart nginx
- name: Ensure nginx is running
service:
name: nginx
state: started
enabled: yes
handlers:
- name: restart nginx
service:
name: nginx
state: restarted
Rarity: Very Common
Difficulty: Medium-Hard
4. How do you manage configuration across hundreds of servers?
Answer: Configuration management at scale requires automation and consistency.
Tool Comparison:
| Tool | Type | Language | Agent | Complexity |
|---|---|---|---|---|
| Ansible | Push | YAML | Agentless | Low |
| Puppet | Pull | Ruby DSL | Agent | High |
| Chef | Pull | Ruby | Agent | High |
| SaltStack | Push/Pull | YAML | Agent/Agentless | Medium |
Ansible at Scale:
# inventory/production
[webservers]
web[01:20].company.com
[databases]
db[01:05].company.com
[loadbalancers]
lb[01:02].company.com
[webservers:vars]
ansible_user=deploy
ansible_become=yes
# playbooks/site.yml
---
- name: Configure all servers
hosts: all
roles:
- common
- security
- monitoring
- name: Configure web servers
hosts: webservers
roles:
- nginx
- php
- application
- name: Configure databases
hosts: databases
roles:
- mysql
- backup
# roles/common/tasks/main.yml
---
- name: Update all packages
apt:
upgrade: dist
update_cache: yes
cache_valid_time: 3600
- name: Install common packages
apt:
name:
- vim
- htop
- curl
- git
state: present
- name: Configure NTP
template:
src: ntp.conf.j2
dest: /etc/ntp.conf
notify: restart ntp
- name: Ensure services are running
service:
name: "{{ item }}"
state: started
enabled: yes
loop:
- ntp
- rsyslog
Dynamic Inventory:
#!/usr/bin/env python3
# dynamic_inventory.py - AWS EC2 dynamic inventory
import json
import boto3
def get_inventory():
ec2 = boto3.client('ec2')
response = ec2.describe_instances()
inventory = {
'_meta': {'hostvars': {}},
'all': {'hosts': []}
}
for reservation in response['Reservations']:
for instance in reservation['Instances']:
if instance['State']['Name'] != 'running':
continue
hostname = instance['PrivateIpAddress']
inventory['all']['hosts'].append(hostname)
# Group by tags
for tag in instance.get('Tags', []):
if tag['Key'] == 'Role':
role = tag['Value']
if role not in inventory:
inventory[role] = {'hosts': []}
inventory[role]['hosts'].append(hostname)
return inventory
if __name__ == '__main__':
print(json.dumps(get_inventory(), indent=2))
Infrastructure as Code Best Practices:
1. Version Control:
# Git workflow
git checkout -b feature/update-nginx-config
# Make changes
git add .
git commit -m "Update nginx SSL configuration"
git push origin feature/update-nginx-config
# Create pull request for review
2. Testing:
# Test playbook syntax
ansible-playbook --syntax-check site.yml
# Dry run
ansible-playbook site.yml --check
# Run on staging first
ansible-playbook -i inventory/staging site.yml
# Deploy to production
ansible-playbook -i inventory/production site.yml
3. Secrets Management:
# Ansible Vault
ansible-vault create secrets.yml
ansible-vault encrypt vars/passwords.yml
ansible-playbook site.yml --ask-vault-pass
# Or use password file
ansible-playbook site.yml --vault-password-file ~/.vault_pass
4. Idempotency:
# Bad - not idempotent
- name: Add line to file
shell: echo "config=value" >> /etc/app.conf
# Good - idempotent
- name: Ensure config line exists
lineinfile:
path: /etc/app.conf
line: "config=value"
state: present
Parallel Execution:
# Run on 10 hosts at a time
ansible-playbook -i inventory site.yml --forks 10
# Limit to specific hosts
ansible-playbook site.yml --limit webservers
# Run specific tags
ansible-playbook site.yml --tags "configuration,deploy"
Rarity: Common
Difficulty: Medium-Hard
Disaster Recovery
5. How do you design a disaster recovery plan?
Answer: Comprehensive DR strategy:
Key Metrics:
- RTO (Recovery Time Objective): Max acceptable downtime
- RPO (Recovery Point Objective): Max acceptable data loss
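As a rule of thumb, the backup schedule bounds your worst-case RPO: if backups run every N hours and take M hours to complete, you can lose up to N + M hours of data. A small sketch of that check (the intervals are illustrative):

```python
def meets_rpo(backup_interval_h, backup_duration_h, rpo_h):
    """Check whether a backup schedule can satisfy an RPO.
    Worst-case data loss is the gap between consistent backups:
    the interval plus the time the backup itself takes to finish."""
    worst_case_loss_h = backup_interval_h + backup_duration_h
    return worst_case_loss_h <= rpo_h

# Nightly backup taking 2h against a 24h RPO: worst case is 26h of loss
print(meets_rpo(24, 2, 24))  # False
# 6-hourly backups taking 1h against a 12h RPO
print(meets_rpo(6, 1, 12))   # True
```

The same reasoning drives decisions like moving from nightly dumps to continuous replication when the RPO is measured in minutes.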
DR Strategy:
1. Backup Strategy:
#!/bin/bash
# Automated backup with retention
BACKUP_SOURCE="/var/www /etc /home"
BACKUP_DEST="/mnt/backup"
REMOTE_SERVER="backup.company.com"
RETENTION_DAYS=30
# Create backup
DATE=$(date +%Y%m%d)
tar -czf $BACKUP_DEST/backup-$DATE.tar.gz $BACKUP_SOURCE
# Sync to remote
rsync -avz --delete $BACKUP_DEST/ $REMOTE_SERVER:/backups/
# Clean old backups
find $BACKUP_DEST -name "backup-*.tar.gz" -mtime +$RETENTION_DAYS -delete
# Verify backup
tar -tzf $BACKUP_DEST/backup-$DATE.tar.gz > /dev/null
if [ $? -eq 0 ]; then
echo "Backup verified successfully"
else
echo "Backup verification failed!" | mail -s "Backup Alert" admin@company.com
fi
2. Database Replication:
# MySQL Master-Slave setup
# On the replica (CHANGE MASTER TO is run on the replica, not the master):
CHANGE MASTER TO
MASTER_HOST='master-server',
MASTER_USER='repl_user',
MASTER_PASSWORD='password',
MASTER_LOG_FILE='mysql-bin.000001',
MASTER_LOG_POS=107;
START SLAVE;
SHOW SLAVE STATUS\G
3. Documentation:
- Recovery procedures
- Contact lists
- System diagrams
- Configuration backups
Rarity: Very Common
Difficulty: Hard
Security Hardening
6. How do you harden a Linux server?
Answer: Multi-layered security approach:
1. System Updates:
# Automated security updates (Ubuntu)
sudo apt install unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades
2. SSH Hardening:
# /etc/ssh/sshd_config
# Change the default port (sshd_config does not allow trailing comments)
Port 2222
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
AllowUsers admin devops
MaxAuthTries 3
ClientAliveInterval 300
ClientAliveCountMax 2
3. Firewall Configuration:
# iptables rules
iptables -P INPUT DROP
iptables -P FORWARD DROP
iptables -P OUTPUT ACCEPT
# Allow established connections
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
# Allow SSH (custom port)
iptables -A INPUT -p tcp --dport 2222 -j ACCEPT
# Allow HTTP/HTTPS
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --dport 443 -j ACCEPT
# Save rules
iptables-save > /etc/iptables/rules.v4
4. Intrusion Detection:
# Install AIDE
sudo apt install aide
sudo aideinit
# Check for changes
sudo aide --check
5. Audit Logging:
# Enable auditd
sudo systemctl enable auditd
sudo systemctl start auditd
# Monitor file access
sudo auditctl -w /etc/passwd -p wa -k passwd_changes
sudo auditctl -w /etc/shadow -p wa -k shadow_changes
Rarity: Very Common
Difficulty: Hard
Performance Optimization
7. How do you optimize server performance?
Answer: Systematic performance tuning:
1. Identify Bottlenecks:
# CPU
mpstat 1 10
# Memory
vmstat 1 10
# Disk I/O
iostat -x 1 10
# Network
iftop
nethogs
2. Optimize Services:
# Nginx tuning
worker_processes auto;
worker_connections 4096;
keepalive_timeout 65;
gzip on;
gzip_types text/plain text/css application/json;
# MySQL tuning
innodb_buffer_pool_size = 4G
max_connections = 200
query_cache_size = 64M  # removed in MySQL 8.0; applies to 5.7 and earlier
3. Kernel Tuning:
# /etc/sysctl.conf
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_fin_timeout = 30
vm.swappiness = 10
fs.file-max = 100000
4. Monitor and Alert:
# Prometheus + Grafana
# Node Exporter for system metrics
# Custom alerts for thresholds
Rarity: Common
Difficulty: Medium-Hard
8. How do you design a comprehensive monitoring and alerting solution?
Answer: Effective monitoring prevents outages and enables quick incident response.
Monitoring Stack Architecture:
- Exporters (node_exporter, mysqld_exporter, nginx exporter) expose metrics on each host
- Prometheus scrapes the exporters, stores metrics, and evaluates alert rules
- Alertmanager groups and routes firing alerts to email, Slack, or PagerDuty
- Grafana queries Prometheus to render dashboards
Prometheus Setup:
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'node'
static_configs:
- targets:
- 'server1:9100'
- 'server2:9100'
- 'server3:9100'
- job_name: 'mysql'
static_configs:
- targets: ['db1:9104']
- job_name: 'nginx'
static_configs:
- targets: ['web1:9113']
alerting:
alertmanagers:
- static_configs:
- targets: ['localhost:9093']
rule_files:
- 'alerts/*.yml'
Alert Rules:
# alerts/system.yml
groups:
- name: system_alerts
interval: 30s
rules:
- alert: HighCPUUsage
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value }}%"
- alert: HighMemoryUsage
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
for: 5m
labels:
severity: critical
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is {{ $value }}%"
- alert: DiskSpaceLow
expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
for: 5m
labels:
severity: critical
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Only {{ $value }}% disk space remaining"
- alert: ServiceDown
expr: up == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Service {{ $labels.job }} down"
description: "{{ $labels.instance }} has been down for more than 2 minutes"
Alertmanager Configuration:
# alertmanager.yml
global:
resolve_timeout: 5m
slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
route:
group_by: ['alertname', 'cluster']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'default'
routes:
- match:
severity: critical
receiver: 'pagerduty'
continue: true
- match:
severity: warning
receiver: 'slack'
receivers:
- name: 'default'
email_configs:
- to: 'team@company.com'
from: 'alerts@company.com'
smarthost: 'smtp.company.com:587'
- name: 'slack'
slack_configs:
- channel: '#alerts'
title: '{{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'pagerduty'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_KEY'
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'instance']
Grafana Dashboard:
{
"dashboard": {
"title": "System Overview",
"panels": [
{
"title": "CPU Usage",
"targets": [
{
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
}
]
},
{
"title": "Memory Usage",
"targets": [
{
"expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100"
}
]
}
]
}
}
SLO/SLA/SLI Concepts:
SLI (Service Level Indicator):
- Quantitative measure of service level
- Examples: Uptime %, latency, error rate
SLO (Service Level Objective):
- Target value for SLI
- Example: 99.9% uptime, p95 latency < 200ms
SLA (Service Level Agreement):
- Contract with consequences
- Example: 99.9% uptime or customer gets refund
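A useful quantity derived from an availability SLO is the error budget: the downtime the objective permits per period. A minimal calculation, assuming a 30-day rolling window:

```python
def error_budget_minutes(slo, period_days=30):
    """Minutes of allowed downtime per period for an availability SLO.
    E.g. a 99.9% SLO leaves 0.1% of the period as error budget."""
    return (1 - slo) * period_days * 24 * 60

print(round(error_budget_minutes(0.999), 1))   # 43.2 minutes (three nines)
print(round(error_budget_minutes(0.9999), 2))  # 4.32 minutes (four nines)
```

Teams often gate risky changes on remaining error budget: once the budget for the window is spent, feature work pauses in favor of reliability work.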
# SLO Example
- alert: SLOViolation
expr: |
(
sum(rate(http_requests_total{status=~"2.."}[30d]))
/
sum(rate(http_requests_total[30d]))
) < 0.999
labels:
severity: critical
annotations:
summary: "SLO violation: Success rate below 99.9%"
Preventing Alert Fatigue:
1. Meaningful Alerts:
- Alert on symptoms, not causes
- Every alert should be actionable
- Remove noisy alerts
2. Alert Grouping:
- Group related alerts
- Use inhibition rules
- Set appropriate thresholds
3. Escalation:
- Warning → Team chat
- Critical → PagerDuty
- Use on-call rotations
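The grouping idea above (Alertmanager's group_by) can be sketched in a few lines: alerts sharing the same grouping labels collapse into one notification. The alert dicts below are illustrative, not Alertmanager's internal format:

```python
from collections import defaultdict

def group_alerts(alerts, group_by=("alertname",)):
    """Bucket alerts by their grouping labels, as Alertmanager's
    group_by does, so one notification covers many instances."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(k) for k in group_by)
        groups[key].append(alert)
    return dict(groups)

alerts = [
    {"labels": {"alertname": "HighCPUUsage", "instance": "server1"}},
    {"labels": {"alertname": "HighCPUUsage", "instance": "server2"}},
    {"labels": {"alertname": "ServiceDown", "instance": "db1"}},
]
grouped = group_alerts(alerts)
print(len(grouped))                     # 2 notifications instead of 3
print(len(grouped[("HighCPUUsage",)]))  # 2 alerts collapsed into one group
```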
Rarity: Common
Difficulty: Hard
Enterprise Infrastructure
9. How do you manage a large-scale Windows environment?
Answer: Centralized management strategies:
Group Policy Management:
# Create GPO
New-GPO -Name "Security Policy" -Comment "Enterprise security settings"
# Link to OU
New-GPLink -Name "Security Policy" -Target "OU=Servers,DC=company,DC=com"
# Configure password policy
Set-ADDefaultDomainPasswordPolicy -Identity company.com `
-MinPasswordLength 12 `
-PasswordHistoryCount 24 `
-MaxPasswordAge 90.00:00:00
# Deploy software via GPO
# Computer Configuration > Policies > Software Settings > Software Installation
WSUS (Windows Update):
# Configure WSUS
Set-ItemProperty -Path "HKLM:\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate" `
-Name "WUServer" -Value "http://wsus.company.com:8530"
# Force update check
wuauclt /detectnow /updatenow
# Note: wuauclt is deprecated on Windows 10 and later; consider the PSWindowsUpdate module
PowerShell Remoting:
# Enable remoting
Enable-PSRemoting -Force
# Execute on multiple servers
Invoke-Command -ComputerName server1,server2,server3 -ScriptBlock {
Get-Service | Where-Object {$_.Status -eq "Stopped"}
}
# Parallel execution (ForEach-Object -Parallel requires PowerShell 7+)
$servers = Get-Content servers.txt
$servers | ForEach-Object -Parallel {
Test-Connection -ComputerName $_ -Count 1
} -ThrottleLimit 10
Rarity: Common
Difficulty: Hard
Conclusion
Senior system administrator interviews require deep technical expertise and leadership experience. Focus on:
- Virtualization: Hypervisors, resource management, migration
- High Availability: Clustering, failover, replication
- Automation: Scripting, configuration management, orchestration
- Configuration Management: Ansible, Puppet, IaC at scale
- Disaster Recovery: Backup strategies, replication, testing
- Security: Hardening, compliance, monitoring
- Performance: Optimization, capacity planning, troubleshooting
- Monitoring: Prometheus, Grafana, alerting, SLO/SLA
- Enterprise Management: AD, GPO, centralized administration
Demonstrate real-world experience with complex infrastructure and strategic decision-making. Good luck!




