Eighteen months ago a client brought us in to look at their AWS bill. It was $980,000 a month, growing 6% month-on-month, and nobody on the team could tell us why. We didn’t change a single product behaviour. Twelve weeks in, the bill was $560,000. Here’s the playbook — the parts that work almost everywhere, and the parts that won’t apply to you.
Step 0 — instrument first, cut second
Before touching anything we got the basics in place:
- Tagging policy enforced via SCP. No tag, no spin-up. Existing untagged resources got an “owner: unknown” tag and a 30-day deadline.
- Cost & Usage Reports into Athena with a few canned queries (top 20 services by week, fastest growers, idle EC2, untagged spend).
- Anomaly detection on every account. Cheap, automatic, surfaced two issues we’d have missed.
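The tag enforcement in step one was the keystone. A minimal SCP along these lines denies new EC2 instances that arrive without an `owner` tag — this is an illustrative fragment, not the client’s actual policy, which covered many more services and tag keys:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyRunInstancesWithoutOwnerTag",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "Null": { "aws:RequestTag/owner": "true" }
      }
    }
  ]
}
```

The `Null` condition is the standard way to say “deny when the tag is absent”; launches that do carry the tag fall through to normal IAM evaluation.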
Where the money went
- EC2 right-sizing — 14% saved. Compute Optimizer + actual P95 CPU/memory data. Roughly 40% of fleet was over-provisioned by 1–2 sizes. We moved everything that wasn’t latency-critical to Graviton at the same time and picked up another single-digit win on price/performance.
- Savings Plans & RIs — 11% saved. Three-year Compute Savings Plan covering the steady baseline (~70% of compute), one-year EC2 RIs for a known-stable workload, on-demand for the rest. The mistake people make here is over-committing — if you commit to 100% of current usage, every dip below that line is money spent on nothing.
- S3 lifecycle & storage tiering — 8% saved. Logs older than 30 days → Glacier Instant Retrieval. Old build artefacts → Deep Archive. One bucket (image originals nobody had touched in four years) saved $11k/month on its own.
- Idle resource clean-up — 6% saved. Unattached EBS volumes, idle NAT Gateways in dev VPCs, old AMIs, dangling load balancers. Boring, automatable, recurring.
- One architectural change — 4% saved. Replaced a chatty cross-AZ pattern in the data pipeline with an SQS-fed batched consumer. Killed inter-AZ data transfer for that flow.
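The right-sizing rule behind the biggest line item above is simple enough to sketch. This is a toy version under assumed thresholds (the real job fed Compute Optimizer output plus 14 days of P95 metrics; the thresholds and size ladder here are illustrative):

```python
# Illustrative right-sizing rule: if P95 CPU and memory both sit well
# below capacity, step the instance down the size ladder. Thresholds
# (20% / 40%) and the ladder itself are assumptions for this sketch.

SIZE_LADDER = ["large", "xlarge", "2xlarge", "4xlarge", "8xlarge"]

def recommend_size(family: str, size: str, p95_cpu: float, p95_mem: float) -> str:
    """Return a recommended instance type from P95 utilisation (0.0-1.0)."""
    idx = SIZE_LADDER.index(size)
    worst = max(p95_cpu, p95_mem)  # size for whichever resource is tighter
    if worst < 0.20:
        idx = max(idx - 2, 0)      # drastically over-provisioned: drop two sizes
    elif worst < 0.40:
        idx = max(idx - 1, 0)      # over-provisioned: drop one size
    return f"{family}.{SIZE_LADDER[idx]}"

print(recommend_size("m6g", "4xlarge", 0.18, 0.15))  # m6g.xlarge
print(recommend_size("m6g", "2xlarge", 0.35, 0.55))  # m6g.2xlarge (kept: memory-bound)
```

Sizing on `max(cpu, mem)` matters: plenty of “idle-looking” instances are memory-bound, and shrinking them on CPU numbers alone is how right-sizing exercises cause outages.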
The boring 90% of cloud cost work is tagging, right-sizing and lifecycle. The exciting 10% is architectural — and you can’t see the architectural wins until you’ve done the boring 90%.
What we deliberately did not do
- We didn’t touch dev/staging until prod was stable. Easy savings, but you don’t want a noisy dev change to land at the same time as a real prod migration — the blast radius gets confusing.
- We didn’t move databases. RDS → Aurora migrations are real work for sometimes-marginal savings; we shortlisted them for a later phase.
- We didn’t adopt every shiny suggestion in the AWS console. Half of them assumed workload patterns that didn’t match reality.
Set Savings Plan utilisation targets at 95%, not 100%. The last 5% of coverage costs more in lost flexibility than it saves in commitment discount.
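A toy model shows why the last slice of coverage is a bad trade. The rates and usage samples below are made up (on-demand normalised to 1.0, a ~40% plan discount assumed); the shape of the result is what matters — committing at the trough-to-typical level beats committing at peak:

```python
# Toy commitment model: the committed $/hr is paid whether used or not;
# anything above the commitment runs on-demand. Rates are illustrative.

OD_RATE, SP_RATE = 1.0, 0.6

def hourly_cost(usage: float, commit: float) -> float:
    """Blended hourly cost for a given usage level and commitment level."""
    return commit * SP_RATE + max(usage - commit, 0.0) * OD_RATE

usage_samples = [70, 80, 90, 100, 85, 75]  # fluctuating hourly demand

for commit in (70, 85, 100):
    avg = sum(hourly_cost(u, commit) for u in usage_samples) / len(usage_samples)
    print(f"commit={commit}: avg cost {avg:.1f}")
```

Committing at 100 (the peak) costs 60.0/hr in this model; committing at 85 costs about 54.3/hr, because the occasional on-demand overflow is cheaper than paying the commitment through every dip.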
What stayed cut
Eighteen months later, the bill is $610k/month on roughly 30% more traffic. The savings stuck because we wrote them into platform guardrails: tag enforcement, idle-resource Lambda sweeps, and a quarterly right-sizing job. Cost optimisation isn’t a project, it’s a control loop.
If your AWS bill is growing faster than your traffic and nobody on the team can say why, come and talk to us. We’ll do a free 30-minute review.