One of the largest CI/CD efforts I’ve led was a set of shared GitLab CI templates that ended up running 300+ microservices for a global telco client. When we started, an average service took 70 minutes from git push to a deployed pod. When we finished, the median was under 10. Here’s exactly what we changed.
The diagnosis
Every team had copied the same 600-line .gitlab-ci.yml with small mutations. Builds re-pulled the same dependencies on every job. Tests ran serially. Security scans were tacked on at the end of pipelines as separate, unparallelised stages. Deploys were a hand-rolled mix of kubectl, Helm and Ansible per team.
The four moves that did most of the work
- One shared CI template, included not copied. A central `ci-templates` repo with a versioned `service.yml`. Services `include` it; they don’t fork it. Updating the template once reaches the whole estate.
- A common Docker base image. JDK / Node / Python base layers built nightly, scanned, and re-tagged. Service builds start from a pre-warmed image with dependencies cached. Cold-start build time dropped 60%.
- Parallel test + scan stages. Unit, integration, contract and SonarQube scans all run in parallel. The pipeline graph went from a long staircase to a wide fan.
- Helm + Argo for deploys. One templated Helm chart per service shape (HTTP, worker, batch). Argo CD reconciles the cluster from Git. `kubectl apply` from a runner is gone.
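The include-not-fork pattern reduces a service’s own pipeline file to a handful of lines. A sketch of what that looks like, assuming the shared repo lives at a path like `platform/ci-templates` and exposes the versioned `service.yml` (the project path and variable names here are illustrative):

```yaml
# .gitlab-ci.yml in a service repo: pull the shared pipeline, pinned to a tag
include:
  - project: platform/ci-templates   # hypothetical group/project path
    ref: v3                          # pinned template version; bumped deliberately
    file: service.yml

# Only service-specific knobs live here; the stages, scans and
# deploy jobs all come from the template.
variables:
  SERVICE_NAME: payments-api         # illustrative
  CHART_SHAPE: http                  # http | worker | batch, per the chart shapes
```

Because the `ref` is explicit, a new template major version ships without touching any service; migration is a one-line merge request per repo.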
What the template enforces
- SAST + dependency scan + secrets scan, all required, all parallel.
- SonarQube quality gate — coverage ≥ 70%, no new criticals, no leaked secrets.
- Image signed (cosign) and pushed only on a green pipeline.
- Trivy scan on the final image; high/critical CVEs block promotion to staging.
- Canary by default in staging; manual gate for prod.
The point of a shared template isn’t to save typing. It’s to make “safe” the default and “unsafe” the deliberate exception you have to argue for in a pull request.
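Inside the template, the required scans are simply sibling jobs in one stage, so GitLab runs them concurrently rather than as a staircase. A minimal sketch of the fan; the job names and scanner invocations are simplified stand-ins, not the client’s exact tooling:

```yaml
stages: [build, verify, publish]

# All verify jobs share a stage, so they run in parallel.
sast:
  stage: verify
  script: [semgrep ci]                  # illustrative SAST scanner
dependency_scan:
  stage: verify
  script: [trivy fs --exit-code 1 --severity HIGH,CRITICAL .]
secrets_scan:
  stage: verify
  script: [gitleaks detect --source .]
sonarqube:
  stage: verify
  script: [sonar-scanner -Dsonar.qualitygate.wait=true]

# The image is pushed and signed only once everything above is green.
publish:
  stage: publish
  script:
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"
    - cosign sign "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"
```

Because the publish job sits in a later stage, a red scan anywhere in `verify` means no image ever reaches the registry, which is what makes “unsafe” the deliberate exception.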
The unsexy details that mattered
- Cache hygiene. Per-branch caches with sane keys; otherwise teams spent half their saved time waiting on cache restore.
- Runner sizing. Three classes of runner (light / heavy / GPU) tagged on jobs, autoscaled independently. One pool for everything is always wrong.
- Pipeline visibility. A small dashboard showing slowest stages and flakiest jobs by service. Optimising what you can’t see is guesswork.
- Versioned templates. Services pin `ref: v3`. We can ship a v4 without breaking anyone, then migrate teams in waves.
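The cache-hygiene and runner-sizing points both reduce to a few lines of template YAML. A sketch, assuming runners are tagged `light` / `heavy` / `gpu` as described; the lockfile and paths are illustrative for a Node service:

```yaml
# Per-branch cache: one branch never waits on (or poisons) another
# branch's dependency cache, and a lockfile change busts it cleanly.
cache:
  key:
    files: [package-lock.json]      # or pom.xml / poetry.lock per stack
    prefix: "$CI_COMMIT_REF_SLUG"   # branch-scoped key
  paths: [node_modules/]
  policy: pull-push

# Jobs declare a weight class; each runner pool autoscales on its own.
unit_tests:
  tags: [light]
  script: [npm test]

integration_tests:
  tags: [heavy]
  script: [npm run test:integration]
```

The `prefix`/`files` combination is what keeps restore times sane: the key changes only when the branch or the lockfile does, not on every commit.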
You can’t patch your way out of 300 forked pipelines. Centralise the template, version it, make safety the default, and put the deploy on rails (Argo or equivalent).
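“Deploy on rails” in practice means the pipeline only commits a new image tag to Git and Argo CD does the rest. A minimal Application manifest for one service, with the repo URL, paths and names as placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api               # illustrative service name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitlab.example.com/platform/deployments.git  # placeholder
    targetRevision: main
    path: services/payments-api    # Helm values for this service live here
    helm:
      valueFiles: [values-staging.yaml]
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true                  # Git is the source of truth; drift is reverted
      selfHeal: true
```

With `automated` sync on, a runner never holds cluster credentials; it can only propose state by pushing to the deployments repo.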
What I’d do differently
- Roll out the SonarQube gate before the speed work, not after. Quality regressions are easier to catch on the slow pipeline you already have than on the fast one nobody’s used to yet.
- Build the per-service dashboard on day one. We built it in week six and immediately wished we’d had it from week one.
If your CI/CD is a copy-paste fleet of half-broken pipelines, come and talk to us. The fix is rarely the tool; it’s almost always the template.