Eighteen months ago a client brought us in to look at their AWS bill. It was $980,000 a month, growing 6% month-on-month, and nobody on the team could tell us why. We didn’t change a single product behaviour. Twelve weeks in, the bill was $560,000. Here’s the playbook — the parts that work almost everywhere, and the parts that won’t apply to you.
Step 0 — instrument first, cut second
Before touching anything we got the basics in place:
- Tagging policy enforced via SCP. No tag, no spin-up. Existing untagged resources got an “owner: unknown” tag and a 30-day deadline.
- Cost & Usage Reports into Athena with a few canned queries (top 20 services by week, fastest growers, idle EC2, untagged spend).
- Anomaly detection on every account. Cheap, automatic, surfaced two issues we’d have missed.
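The tag enforcement in step one was the keystone. A minimal SCP along these lines denies new EC2 instances that arrive without an `owner` tag — this is an illustrative fragment, not the client’s actual policy, which covered many more services and tag keys:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyRunInstancesWithoutOwnerTag",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "Null": { "aws:RequestTag/owner": "true" }
      }
    }
  ]
}
```

The `Null` condition is the standard way to say “deny when the tag is absent”; launches that do carry the tag fall through to normal IAM evaluation.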
Where the money went
- EC2 right-sizing — 14% saved. Compute Optimizer + actual P95 CPU/memory data. Roughly 40% of fleet was over-provisioned by 1–2 sizes. We moved everything that wasn’t latency-critical to Graviton at the same time and picked up another single-digit win on price/performance.
- Savings Plans & RIs — 11% saved. Three-year Compute Savings Plan covering the steady baseline (~70% of compute), one-year EC2 RIs for a known-stable workload, on-demand for the rest. The mistake people make here is over-committing — if you commit to 100% of current usage, every dip below that line is money spent on nothing.
- S3 lifecycle & storage tiering — 8% saved. Logs older than 30 days → Glacier Instant Retrieval. Old build artefacts → Deep Archive. One bucket (image originals nobody had touched in four years) saved $11k/month on its own.
- Idle resource clean-up — 6% saved. Unattached EBS volumes, idle NAT Gateways in dev VPCs, old AMIs, dangling load balancers. Boring, automatable, recurring.
- One architectural change — 4% saved. Replaced a chatty cross-AZ pattern in the data pipeline with an SQS-fed batched consumer. Killed inter-AZ data transfer for that flow.
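The right-sizing rule behind the biggest line item above is simple enough to sketch. This is a toy version under assumed thresholds (the real job fed Compute Optimizer output plus 14 days of P95 metrics; the thresholds and size ladder here are illustrative):

```python
# Illustrative right-sizing rule: if P95 CPU and memory both sit well
# below capacity, step the instance down the size ladder. Thresholds
# (20% / 40%) and the ladder itself are assumptions for this sketch.

SIZE_LADDER = ["large", "xlarge", "2xlarge", "4xlarge", "8xlarge"]

def recommend_size(family: str, size: str, p95_cpu: float, p95_mem: float) -> str:
    """Return a recommended instance type from P95 utilisation (0.0-1.0)."""
    idx = SIZE_LADDER.index(size)
    worst = max(p95_cpu, p95_mem)  # size for whichever resource is tighter
    if worst < 0.20:
        idx = max(idx - 2, 0)      # drastically over-provisioned: drop two sizes
    elif worst < 0.40:
        idx = max(idx - 1, 0)      # over-provisioned: drop one size
    return f"{family}.{SIZE_LADDER[idx]}"

print(recommend_size("m6g", "4xlarge", 0.18, 0.15))  # m6g.xlarge
print(recommend_size("m6g", "2xlarge", 0.35, 0.55))  # m6g.2xlarge (kept: memory-bound)
```

Sizing on `max(cpu, mem)` matters: plenty of “idle-looking” instances are memory-bound, and shrinking them on CPU numbers alone is how right-sizing exercises cause outages.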
The boring 90% of cloud cost work is tagging, right-sizing and lifecycle. The exciting 10% is architectural — and you can’t see the architectural wins until you’ve done the boring 90%.
What we deliberately did not do
- We didn’t touch dev/staging until prod was stable. Easy savings, but you don’t want a noisy dev change to land at the same time as a real prod migration — the blast radius gets confusing.
- We didn’t move databases. RDS → Aurora migrations are real work for sometimes-marginal savings; we shortlisted them for a later phase.
- We didn’t adopt every shiny suggestion in the AWS console. Half of them assumed workload patterns that didn’t match reality.
Set Savings Plan utilisation targets at 95%, not 100%. The last 5% of coverage costs more in lost flexibility than it saves in commitment discount.
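A toy model shows why the last slice of coverage is a bad trade. The rates and usage samples below are made up (on-demand normalised to 1.0, a ~40% plan discount assumed); the shape of the result is what matters — committing at the trough-to-typical level beats committing at peak:

```python
# Toy commitment model: the committed $/hr is paid whether used or not;
# anything above the commitment runs on-demand. Rates are illustrative.

OD_RATE, SP_RATE = 1.0, 0.6

def hourly_cost(usage: float, commit: float) -> float:
    """Blended hourly cost for a given usage level and commitment level."""
    return commit * SP_RATE + max(usage - commit, 0.0) * OD_RATE

usage_samples = [70, 80, 90, 100, 85, 75]  # fluctuating hourly demand

for commit in (70, 85, 100):
    avg = sum(hourly_cost(u, commit) for u in usage_samples) / len(usage_samples)
    print(f"commit={commit}: avg cost {avg:.1f}")
```

Committing at 100 (the peak) costs 60.0/hr in this model; committing at 85 costs about 54.3/hr, because the occasional on-demand overflow is cheaper than paying the commitment through every dip.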
What stayed cut
Eighteen months later, the bill is $610k/month on roughly 30% more traffic. The savings stuck because we wrote them into platform guardrails: tag enforcement, idle-resource Lambda sweeps, and a quarterly right-sizing job. Cost optimisation isn’t a project, it’s a control loop.
If your AWS bill is growing faster than your traffic and nobody on the team can say why, come and talk to us. We’ll do a free 30-minute review.