disaster-recovery-testing
About
This skill enables developers to execute systematic disaster recovery tests to validate recovery procedures and measure RTO/RPO. It's used for annual DR exercises, infrastructure changes, and compliance requirements. The skill helps identify gaps in recovery plans and documents lessons learned from testing.
Documentation
Disaster Recovery Testing
Overview
Implement systematic disaster recovery testing to validate recovery procedures, measure RTO/RPO, identify gaps, and ensure team readiness for actual incidents.
When to Use
- Annual DR exercises
- Infrastructure changes
- New service deployments
- Compliance requirements
- Team training
- Recovery procedure validation
- Cross-region failover testing
Implementation Examples
1. DR Test Plan and Execution
# dr-test-plan.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dr-test-procedures
  namespace: operations
data:
  dr-test-plan.md: |
    # Disaster Recovery Test Plan

    ## Test Objectives
    - Validate backup restoration procedures
    - Verify failover mechanisms
    - Test DNS failover
    - Validate data integrity post-recovery
    - Measure RTO and RPO
    - Train incident response team

    ## Pre-Test Checklist
    - [ ] Notify stakeholders
    - [ ] Schedule 4-6 hour window
    - [ ] Disable alerting to prevent noise
    - [ ] Backup production data
    - [ ] Ensure DR environment is isolated
    - [ ] Have rollback plan ready

    ## Test Scope
    - Primary database failover to standby
    - Application failover to DR site
    - DNS resolution update
    - Load balancer health checks
    - Data synchronization verification

    ## Success Criteria
    - RTO: < 1 hour
    - RPO: < 15 minutes
    - Zero data loss
    - All services operational
    - Alerts functional

    ## Post-Test Activities
    - Document timeline
    - Identify gaps
    - Update procedures
    - Schedule post-mortem
    - Update team documentation
---
apiVersion: batch/v1
kind: Job
metadata:
  name: dr-test-executor
  namespace: operations
spec:
  template:
    spec:
      serviceAccountName: dr-test-sa
      containers:
        - name: executor
          image: alpine:latest
          env:
            - name: BACKUP_BUCKET
              value: "s3://my-backups"
            - name: DR_NAMESPACE
              value: "dr-test"
          command:
            - sh
            - -c
            - |
              apk add --no-cache aws-cli kubectl jq curl postgresql-client
              # Compute the test ID in the shell; $(...) is not expanded in YAML env values
              TEST_ID="dr-test-$(date +%s)"
              echo "Starting DR Test: $TEST_ID"

              # Step 1: Create test namespace
              echo "Creating isolated test environment..."
              kubectl create namespace "$DR_NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -

              # Step 2: Restore database from backup
              echo "Restoring database from latest backup..."
              LATEST_BACKUP=$(aws s3 ls "$BACKUP_BUCKET/databases/" | \
                sort | tail -n 1 | awk '{print $4}')
              aws s3 cp "$BACKUP_BUCKET/databases/$LATEST_BACKUP" - | \
                gunzip | psql postgres://user:pass@dr-db:5432/testdb

              # Step 3: Point the DR deployment at the production image
              # (assumes the base manifests are already applied in $DR_NAMESPACE)
              echo "Deploying application to DR environment..."
              kubectl set image deployment/myapp \
                myapp=myrepo/myapp:production \
                -n "$DR_NAMESPACE"

              # Step 4: Run health checks (seq, not {1..30}: busybox sh has no brace expansion)
              echo "Running health checks..."
              for i in $(seq 1 30); do
                if curl -sf http://myapp-dr/health > /dev/null; then
                  echo "Health check passed"
                  break
                fi
                echo "Waiting for service to be healthy... ($i/30)"
                sleep 10
              done

              # Step 5: Run smoke tests (no -it: Jobs run without a TTY)
              echo "Running smoke tests..."
              kubectl exec deployment/myapp -n "$DR_NAMESPACE" -- \
                npm run test:smoke || exit 1

              # Step 6: Validate data integrity
              echo "Validating data integrity..."
              PROD_RECORD_COUNT=$(psql postgres://user:pass@prod-db:5432/mydb \
                -t -c "SELECT COUNT(*) FROM users;")
              DR_RECORD_COUNT=$(psql postgres://user:pass@dr-db:5432/testdb \
                -t -c "SELECT COUNT(*) FROM users;")
              if [ "$PROD_RECORD_COUNT" -eq "$DR_RECORD_COUNT" ]; then
                echo "Data integrity verified"
              else
                echo "Data integrity check failed"
                exit 1
              fi

              # Step 7: Record metrics
              echo "Recording DR test metrics..."
              kubectl logs deployment/myapp -n "$DR_NAMESPACE" | \
                grep "startup_time" | jq '.' > "/tmp/dr-metrics-$TEST_ID.json"

              echo "DR Test Complete: $TEST_ID"
      restartPolicy: Never
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dr-test-sa
  namespace: operations
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dr-test
rules:
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["create", "get", "list"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    # "set" is not an RBAC verb; kubectl set image needs patch/update
    verbs: ["create", "get", "list", "patch", "update"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create", "get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dr-test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: dr-test
subjects:
  - kind: ServiceAccount
    name: dr-test-sa
    namespace: operations
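To run this example end to end, the manifests above can be applied and the Job watched to completion; a minimal sketch, assuming they are saved as dr-test-plan.yaml (the file name and timeout are assumptions):

# Apply the DR test manifests, wait for the Job, then read its log
kubectl apply -f dr-test-plan.yaml
kubectl wait --for=condition=complete job/dr-test-executor -n operations --timeout=2h
kubectl logs job/dr-test-executor -n operations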
2. DR Test Script
#!/bin/bash
# execute-dr-test.sh - Comprehensive DR test execution
set -euo pipefail
TEST_ID="dr-test-$(date +%Y%m%d-%H%M%S)"
LOG_FILE="/tmp/dr-test-${TEST_ID}.log"
METRICS_FILE="/tmp/dr-metrics-${TEST_ID}.json"
# Logging
exec 1> >(tee -a "$LOG_FILE")
exec 2>&1
log_info() {
  echo "[INFO] $(date '+%Y-%m-%d %H:%M:%S') $1"
}
log_error() {
  echo "[ERROR] $(date '+%Y-%m-%d %H:%M:%S') $1"
}
# Start time
START_TIME=$(date +%s)
log_info "Starting DR Test: $TEST_ID"
# Disable production monitoring
# NOTE: renaming the SNS topic only marks the test window; it does not stop
# notifications by itself. Pair it with a real suppression mechanism (see the
# optional silence below).
log_info "Disabling production alerts..."
aws sns set-topic-attributes \
  --topic-arn "arn:aws:sns:us-east-1:123456789012:prod-alerts" \
  --attribute-name DisplayName \
  --attribute-value "DR Test - Alerts Disabled"
# Phase 1: Backup Validation
log_info "Phase 1: Validating backups..."
if ! aws s3 ls s3://my-backups/databases/ | grep -q "sql.gz"; then
  log_error "No valid backups found"
  exit 1
fi
# Phase 2: Environment Setup
log_info "Phase 2: Setting up DR environment..."
LATEST_BACKUP=$(aws s3 ls s3://my-backups/databases/ | \
sort | tail -n 1 | awk '{print $4}')
log_info "Using backup: $LATEST_BACKUP"
aws s3 cp "s3://my-backups/databases/$LATEST_BACKUP" - | gunzip > /tmp/restore.sql
# Phase 3: Database Restoration
log_info "Phase 3: Restoring database..."
# Restore into mydb so the Phase 6 integrity check queries the restored data
psql -h dr-db.internal -U postgres -d mydb -f /tmp/restore.sql > /dev/null 2>&1
# Phase 4: Application Deployment
log_info "Phase 4: Deploying application..."
kubectl create namespace dr-test --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f dr-deployment.yaml -n dr-test
kubectl rollout status deployment/myapp -n dr-test --timeout=10m
# Phase 5: Health Checks
log_info "Phase 5: Running health checks..."
HEALTH_CHECK_START=$(date +%s)
for i in {1..60}; do
  if curl -sf --max-time 5 http://myapp-dr.internal/health > /dev/null 2>&1; then
    HEALTH_CHECK_TIME=$(($(date +%s) - HEALTH_CHECK_START))
    log_info "Health check passed in ${HEALTH_CHECK_TIME}s"
    break
  fi
  if [ $i -eq 60 ]; then
    log_error "Health check timeout"
    exit 1
  fi
  sleep 10
done
# Phase 6: Data Integrity
log_info "Phase 6: Validating data integrity..."
# ORDER BY must live inside string_agg; a trailing ORDER BY on an aggregate query is invalid
PROD_HASH=$(psql -h prod-db.internal -U postgres -d mydb -t -c \
  "SELECT md5(string_agg(CAST(id AS text), ',' ORDER BY id)) FROM users;")
DR_HASH=$(psql -h dr-db.internal -U postgres -d mydb -t -c \
  "SELECT md5(string_agg(CAST(id AS text), ',' ORDER BY id)) FROM users;")
if [ "$PROD_HASH" = "$DR_HASH" ]; then
  DATA_INTEGRITY="PASS"
  log_info "Data integrity verified"
else
  DATA_INTEGRITY="FAIL"
  log_error "Data integrity check failed: $PROD_HASH != $DR_HASH"
fi
# Phase 7: Smoke Tests
log_info "Phase 7: Running smoke tests..."
SMOKE_TESTS="PASS"
# No -it: this script runs non-interactively
kubectl exec deployment/myapp -n dr-test -- npm run test:smoke || {
  SMOKE_TESTS="FAIL"
  log_error "Smoke tests failed"
}
# Record metrics
END_TIME=$(date +%s)
TOTAL_TIME=$((END_TIME - START_TIME))
RTO=$TOTAL_TIME
# RPO is the age of the restored backup at test start, not the backup timestamp itself
BACKUP_TS=$(date -d "$(aws s3api head-object --bucket my-backups --key databases/$LATEST_BACKUP --query 'LastModified' --output text)" +%s)
RPO=$((START_TIME - BACKUP_TS))
log_info "DR Test Complete"
log_info "Total time: ${TOTAL_TIME}s"
log_info "RTO: ${RTO}s (target: 3600s)"
log_info "RPO: ${RPO}s (target: 900s; backup taken $(date -d @$BACKUP_TS))"
# Generate report
cat > "$METRICS_FILE" <<EOF
{
  "test_id": "$TEST_ID",
  "start_time": $START_TIME,
  "end_time": $END_TIME,
  "rto_seconds": $RTO,
  "rpo_seconds": $RPO,
  "data_integrity": "$DATA_INTEGRITY",
  "health_check": "PASS",
  "smoke_tests": "$SMOKE_TESTS"
}
EOF
log_info "Metrics saved to: $METRICS_FILE"
# Re-enable monitoring
log_info "Re-enabling production alerts..."
aws sns set-topic-attributes \
--topic-arn "arn:aws:sns:us-east-1:123456789012:prod-alerts" \
--attribute-name DisplayName \
--attribute-value "Production Alerts"
log_info "Test artifacts: $LOG_FILE, $METRICS_FILE"
3. DR Test Automation
# scheduled-dr-tests.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: quarterly-dr-test
  namespace: operations
spec:
  # Quarterly on the 1st of Jan/Apr/Jul/Oct at 02:00. ("First Monday" schedules
  # like "0 2 2-8 1,4,7,10 MON" misfire: cron ORs day-of-month and day-of-week
  # when both are restricted.)
  schedule: "0 2 1 1,4,7,10 *"
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        spec:
          serviceAccountName: dr-test-sa
          containers:
            - name: dr-test
              image: myrepo/dr-test:latest
              command:
                - /usr/local/bin/execute-dr-test.sh
              env:
                - name: SLACK_WEBHOOK
                  valueFrom:
                    secretKeyRef:
                      name: dr-notifications
                      key: slack-webhook
                - name: TEST_MODE
                  value: "full"
          restartPolicy: Never
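The same test can be rehearsed on demand between quarterly runs, which is useful after infrastructure changes:

# Create a one-off Job from the CronJob template and follow its output
kubectl create job --from=cronjob/quarterly-dr-test dr-test-manual -n operations
kubectl logs -f job/dr-test-manual -n operations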
Best Practices
✅ DO
- Schedule regular DR tests
- Document procedures in advance
- Test in isolated environments (see the isolation sketch after this list)
- Measure actual RTO/RPO
- Involve all teams
- Automate validation
- Record findings
- Update procedures based on results
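For the isolation point above, a deny-by-default egress policy on the dr-test namespace keeps test traffic away from production. A minimal sketch, assuming a CNI that enforces NetworkPolicy; the in-namespace and DNS exceptions are illustrative and should be adjusted to what the DR app actually needs:

# Restrict dr-test pods to in-namespace traffic plus DNS
kubectl apply -n dr-test -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: dr-test-isolation
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
    - to:
        - podSelector: {}   # same-namespace pods only
    - ports:
        - protocol: UDP
          port: 53          # DNS to anywhere
EOF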
❌ DON'T
- Skip DR testing
- Test during business hours
- Test against production
- Ignore test failures
- Neglect post-test analysis
- Forget to re-enable monitoring
- Use stale backup processes
- Test only once a year
DR Test Levels
- Tabletop: Documentation and discussion
- Simulation: Controlled partial failover
- Full DR: Complete system failover
- Continuous: Ongoing shadow operations
Key Metrics
- RTO: Recovery Time Objective (checked against its target in the sketch below)
- RPO: Recovery Point Objective
- MTTD: Mean Time to Detect
- MTTR: Mean Time to Recover
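The metrics file written by the test script makes RTO and RPO directly checkable; a quick gate using this page's example targets (the path is illustrative):

# Exit non-zero if the recorded RTO or RPO misses its target
jq -e '.rto_seconds <= 3600 and .rpo_seconds <= 900' /tmp/dr-metrics-*.json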
Resources
Quick Install
/plugin add https://github.com/aj-geddes/useful-ai-prompts/tree/main/disaster-recovery-testing
Copy and paste this command in Claude Code to install this skill.
GitHub repository
Related Skills
evaluating-llms-harness
Testing: This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends, including HuggingFace and vLLM models.
webapp-testing
Testing: This Claude Skill provides a Playwright-based toolkit for testing local web applications through Python scripts. It enables frontend verification, UI debugging, screenshot capture, and log viewing while managing server lifecycles. Use it for browser automation tasks, but run scripts directly rather than reading their source code to avoid context pollution.
finishing-a-development-branch
Testing: This skill helps developers complete finished work by verifying that tests pass and then presenting structured integration options. It guides the workflow for merging, creating PRs, or cleaning up branches after implementation is done. Use it when your code is ready and tested to systematically finalize the development process.
llamaindex
Meta: LlamaIndex is a data framework for building RAG-powered LLM applications, specializing in document ingestion, indexing, and querying. It provides key features like vector indices, query engines, and agents, and supports over 300 data connectors. Use it for document Q&A, chatbots, and knowledge retrieval when building data-centric applications.
