Most disaster-recovery plans I see are PDFs. They get written, signed, filed, and never tested. The first time the team finds out the runbook is wrong is during the actual incident. This post is about a DR programme we built and ran — live, in production, every quarter — that consistently fails over in 10 to 15 minutes.
The architecture
- Primary region: us-east-1. Active. EKS, RDS Multi-AZ, S3 with cross-region bucket replication enabled, KMS multi-Region keys replicated into the DR region.
- DR region: us-west-2. Warm standby. Same EKS topology at smaller node count, RDS read replica, S3 destination buckets, EventBridge rules pre-deployed.
- Akamai in front of both regions for edge caching and as a global health-aware origin shield.
- Route 53 with weighted records and health checks for DNS-level failover (the steady-state record setup is sketched after this list).
- AWS Systems Manager (SSM) Automation runbooks for the actual flip — promote replica, scale out DR EKS, update Route 53 weights, validate.
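For the DNS layer, the steady state is a pair of weighted records, one per region, each tied to a health check, with all the weight on the primary. Here is a minimal boto3 sketch of that setup; the zone ID, record name, origins, and health check IDs are placeholders, not our real configuration:

```python
import boto3

route53 = boto3.client("route53")

# Placeholder identifiers: substitute your own zone, origins, and health checks.
HOSTED_ZONE_ID = "Z0123456789EXAMPLE"
RECORD_NAME = "api.example.com."

def upsert_weighted_record(set_identifier, target, weight, health_check_id):
    """Create or update one half of the weighted pair."""
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": f"{set_identifier} weight={weight}",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "CNAME",
                    "SetIdentifier": set_identifier,
                    "Weight": weight,  # all traffic to the primary in steady state
                    "TTL": 60,         # short TTL so a weight change propagates quickly
                    "ResourceRecords": [{"Value": target}],
                    "HealthCheckId": health_check_id,
                },
            }],
        },
    )

# Steady state: primary takes everything; DR stays at weight 0 but is still the
# record Route 53 falls back to if the primary's health check goes unhealthy.
upsert_weighted_record("primary-us-east-1", "origin-east.example.com", 100, "<primary-health-check-id>")
upsert_weighted_record("dr-us-west-2", "origin-west.example.com", 0, "<dr-health-check-id>")
```

The DNS flip in the runbook below is the same call with the weights swapped.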
RTO and RPO targets
- RTO 10–15 minutes. From declared incident to the DR region serving production traffic.
- RPO 60 seconds. Bounded by RDS cross-region read replica lag and S3 replication SLA (a lag check is sketched after this list).
- MTTR for partial degradation: under 5 minutes. Many incidents don’t need a full region flip — they need a service or AZ shift.
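The 60-second RPO is only honest if the replica lag actually stays inside it, so it gets checked rather than assumed. A small sketch of that check against the RDS ReplicaLag metric in CloudWatch; the replica identifier and window are placeholders:

```python
from datetime import datetime, timedelta, timezone

import boto3

DR_REGION = "us-west-2"
REPLICA_ID = "app-db-dr-replica"  # placeholder for the DR-region read replica
RPO_SECONDS = 60

cloudwatch = boto3.client("cloudwatch", region_name=DR_REGION)

def worst_replica_lag(minutes=15):
    """Worst ReplicaLag (in seconds) observed over the last few minutes."""
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="ReplicaLag",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": REPLICA_ID}],
        StartTime=now - timedelta(minutes=minutes),
        EndTime=now,
        Period=60,
        Statistics=["Maximum"],
    )
    points = [p["Maximum"] for p in resp["Datapoints"]]
    return max(points) if points else None

lag = worst_replica_lag()
if lag is None or lag > RPO_SECONDS:
    print(f"RPO at risk: worst observed lag is {lag}")
else:
    print(f"replica lag within RPO: {lag:.0f}s")
```

A standing CloudWatch alarm on the same metric is the continuous version of this check.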
The runbook, end to end
- Declare. Incident commander runs a single SSM Automation document with a confirmation prompt. The document is the runbook.
- Drain primary. Mark primary unhealthy at the load balancer; existing connections drain.
- Promote DB. SSM step promotes the cross-region read replica to standalone primary. Validate writes. (The promote and scale calls are sketched after this list.)
- Scale DR. EKS node groups auto-scale from warm baseline to full capacity. Pre-pulled images keep this under two minutes.
- Flip DNS. Route 53 weighted records updated; Akamai origin priority swapped.
- Validate. Synthetic checks across critical user journeys; if any fail, automated rollback path. Status page updated.
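Under the hood, the promote and scale steps reduce to a handful of API calls. A rough boto3 sketch of what the SSM steps invoke; the resource names and node counts are placeholders, and the real runbook wraps each call in validation and timeout gates:

```python
import boto3

DR_REGION = "us-west-2"
REPLICA_ID = "app-db-dr-replica"  # placeholder identifiers throughout
EKS_CLUSTER = "app-dr"
NODEGROUP = "app-workers"

rds = boto3.client("rds", region_name=DR_REGION)
eks = boto3.client("eks", region_name=DR_REGION)

def promote_replica():
    """Promote the cross-region read replica to a standalone, writable primary."""
    rds.promote_read_replica(DBInstanceIdentifier=REPLICA_ID)
    # Block until the promoted instance is available before validating writes.
    rds.get_waiter("db_instance_available").wait(
        DBInstanceIdentifier=REPLICA_ID,
        WaiterConfig={"Delay": 15, "MaxAttempts": 40},
    )

def scale_dr_nodegroup(desired=12):
    """Scale the warm-standby node group up to full production capacity."""
    eks.update_nodegroup_config(
        clusterName=EKS_CLUSTER,
        nodegroupName=NODEGROUP,
        scalingConfig={"minSize": desired, "maxSize": desired, "desiredSize": desired},
    )

promote_replica()
scale_dr_nodegroup()
# The DNS step reuses the weighted-record helper from earlier with the weights swapped.
```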
A runbook nobody has run is fiction. The whole point of the drill is to find the page where the fiction lives.
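The validate step is where that fiction tends to surface first, so the checks stay deliberately small and boring: hit each critical journey, fail loudly, and let the automation branch to rollback. A sketch of the shape of those checks; the journey names and URLs are invented for illustration:

```python
import sys
import urllib.request

# Hypothetical health endpoints, one per critical user journey.
JOURNEYS = {
    "login": "https://app.example.com/health/login",
    "search": "https://app.example.com/health/search",
    "checkout": "https://app.example.com/health/checkout",
}

def journey_ok(url, timeout=5):
    """A journey passes only if its endpoint answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

failures = [name for name, url in JOURNEYS.items() if not journey_ok(url)]
if failures:
    print(f"validation failed: {', '.join(failures)}")
    sys.exit(1)  # non-zero exit lets the calling automation branch to rollback
print("all critical journeys healthy in the DR region")
```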
What the live drill always uncovers
- Stale IAM. Someone added a permission to the prod role and forgot the DR role. We catch one of these almost every quarter (a role diff is sketched after this list).
- Pre-pull drift. The DR cluster wasn’t pulling the latest image tags. Now part of the daily reconciliation job.
- Hard-coded endpoints. Two services had primary-region URLs in environment variables. Moved to a config service and a parameterised endpoint.
- Quota surprises. AWS service quotas in the DR region had crept under what a full cutover needs. Now monitored explicitly.
- People. We drilled three people on the runbook; on the live day, two of them were unavailable. Cross-train wide, not just deep.
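The stale-IAM finding is the easiest of these to turn into a standing check: diff the prod and DR roles and flag any drift. A minimal sketch; the role names are placeholders, and a real comparison would also diff the inline policy documents themselves:

```python
import boto3

iam = boto3.client("iam")

PROD_ROLE = "app-prod-role"  # placeholder role names
DR_ROLE = "app-dr-role"

def attached_policies(role_name):
    """Managed policy ARNs attached to a role."""
    arns = set()
    for page in iam.get_paginator("list_attached_role_policies").paginate(RoleName=role_name):
        arns.update(p["PolicyArn"] for p in page["AttachedPolicies"])
    return arns

def inline_policies(role_name):
    """Names of inline policies embedded in a role."""
    names = set()
    for page in iam.get_paginator("list_role_policies").paginate(RoleName=role_name):
        names.update(page["PolicyNames"])
    return names

missing_managed = attached_policies(PROD_ROLE) - attached_policies(DR_ROLE)
missing_inline = inline_policies(PROD_ROLE) - inline_policies(DR_ROLE)

if missing_managed or missing_inline:
    print(f"DR role is behind prod: managed={missing_managed}, inline={missing_inline}")
```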
The single highest-leverage habit we’ve built into client DR programmes is the quarterly live drill. Tabletop exercises catch the obvious gaps; live drills catch the IAM, the quotas, the hard-coded endpoints and the human ones.
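Of those, the quota check is the cheapest to automate. A sketch against the Service Quotas API; the quota names and required floors below are invented for illustration, and the real list should come out of a capacity exercise:

```python
import boto3

DR_REGION = "us-west-2"

# Hypothetical floors a full cutover would need; derive real values from capacity planning.
REQUIRED = {
    ("ec2", "Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances"): 512,
    ("elasticloadbalancing", "Application Load Balancers per Region"): 20,
}

quotas = boto3.client("service-quotas", region_name=DR_REGION)

def applied_quota(service_code, quota_name):
    """Look up the currently applied quota value by name."""
    for page in quotas.get_paginator("list_service_quotas").paginate(ServiceCode=service_code):
        for quota in page["Quotas"]:
            if quota["QuotaName"] == quota_name:
                return quota["Value"]
    # Quotas still at the AWS default may not be listed; get_aws_default_service_quota covers those.
    return None

for (service, name), needed in REQUIRED.items():
    value = applied_quota(service, name)
    if value is None or value < needed:
        print(f"quota too low in {DR_REGION}: {service} / {name}: have {value}, need {needed}")
```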
Ransomware-specific additions
- Immutable backups. S3 Object Lock plus a separate AWS account that prod can't reach. Even a fully compromised prod account can't delete the backups (the Object Lock setup is sketched after this list).
- Restore-not-failover plan. If the data itself is poisoned, you don’t want to fail over — you want to restore to a known-good point. Different runbook, different gates.
- Network isolation runbook. A single SSM document that severs east-west connectivity in seconds while preserving observability.
- A clean-room rebuild option. A documented path to stand up production from infra-as-code in a brand-new account. Tested annually, not quarterly.
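The immutability in the first item comes from S3 Object Lock in compliance mode, set up in the backup account with credentials prod never holds. A sketch of that bucket setup, assuming a hypothetical vault bucket name and retention window:

```python
import boto3

# Run with credentials from the isolated backup account, never from prod.
BACKUP_REGION = "us-west-2"
VAULT_BUCKET = "example-backup-vault"  # hypothetical bucket name
RETENTION_DAYS = 35                    # hypothetical window: long enough to detect and respond

s3 = boto3.client("s3", region_name=BACKUP_REGION)

# Object Lock can only be enabled at bucket creation time (versioning comes with it).
s3.create_bucket(
    Bucket=VAULT_BUCKET,
    CreateBucketConfiguration={"LocationConstraint": BACKUP_REGION},
    ObjectLockEnabledForBucket=True,
)

# Compliance mode: no principal, including this account's root user, can shorten the
# retention or delete locked object versions until the window expires.
s3.put_object_lock_configuration(
    Bucket=VAULT_BUCKET,
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": RETENTION_DAYS}},
    },
)
```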
If your DR plan hasn’t been tested in the last 90 days, it’s a hope, not a plan. Talk to us about a tabletop and a live drill.