One of the largest CI/CD efforts I’ve led was a set of shared GitLab CI templates that ended up running 300+ microservices for a global telco client. When we started, an average service took 70 minutes from git push to a deployed pod. When we finished, the median was under 10. Here’s exactly what we changed.
The diagnosis
Every team had copied the same 600-line .gitlab-ci.yml with small mutations. Builds re-pulled the same dependencies on every job. Tests ran serially. Security scans were tacked on at the end of pipelines as separate, unparallelised stages. Deploys were a hand-rolled mix of kubectl, Helm and Ansible per team.
The four moves that did most of the work
- One shared CI template, included not copied. A central `ci-templates` repo with a versioned `service.yml`. Services `include` it; they don’t fork it. Updating the template once reaches the whole estate.
- A common Docker base image. JDK / Node / Python base layers built nightly, scanned, and re-tagged. Service builds start from a pre-warmed image with dependencies cached. Cold-start build time dropped 60%.
- Parallel test + scan stages. Unit, integration, contract and SonarQube scans all run in parallel. The pipeline graph went from a long staircase to a wide fan.
- Helm + Argo for deploys. One templated Helm chart per service shape (HTTP, worker, batch). Argo CD reconciles the cluster from Git. `kubectl apply` from a runner is gone.
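The include-not-fork pattern reduces a service’s own pipeline file to a handful of lines. A sketch of what that looks like, assuming the shared repo lives at a path like `platform/ci-templates` and exposes the versioned `service.yml` (the project path and variable names here are illustrative):

```yaml
# .gitlab-ci.yml in a service repo: pull the shared pipeline, pinned to a tag
include:
  - project: platform/ci-templates   # hypothetical group/project path
    ref: v3                          # pinned template version; bumped deliberately
    file: service.yml

# Only service-specific knobs live here; the stages, scans and
# deploy jobs all come from the template.
variables:
  SERVICE_NAME: payments-api         # illustrative
  CHART_SHAPE: http                  # http | worker | batch, per the chart shapes
```

Because the `ref` is explicit, a new template major version ships without touching any service; migration is a one-line merge request per repo.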
What the template enforces
- SAST + dependency scan + secrets scan, all required, all parallel.
- SonarQube quality gate — coverage ≥ 70%, no new criticals, no leaked secrets.
- Image signed (cosign) and pushed only on a green pipeline.
- Trivy scan on the final image; high/critical CVEs block promotion to staging.
- Canary by default in staging; manual gate for prod.
The point of a shared template isn’t to save typing. It’s to make “safe” the default and “unsafe” the deliberate exception you have to argue for in a pull request.
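Inside the template, the required scans are simply sibling jobs in one stage, so GitLab runs them concurrently rather than as a staircase. A minimal sketch of the fan; the job names and scanner invocations are simplified stand-ins, not the client’s exact tooling:

```yaml
stages: [build, verify, publish]

# All verify jobs share a stage, so they run in parallel.
sast:
  stage: verify
  script: [semgrep ci]                  # illustrative SAST scanner
dependency_scan:
  stage: verify
  script: [trivy fs --exit-code 1 --severity HIGH,CRITICAL .]
secrets_scan:
  stage: verify
  script: [gitleaks detect --source .]
sonarqube:
  stage: verify
  script: [sonar-scanner -Dsonar.qualitygate.wait=true]

# The image is pushed and signed only once everything above is green.
publish:
  stage: publish
  script:
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"
    - cosign sign "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"
```

Because the publish job sits in a later stage, a red scan anywhere in `verify` means no image ever reaches the registry, which is what makes “unsafe” the deliberate exception.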
The unsexy details that mattered
- Cache hygiene. Per-branch caches with sane keys; otherwise teams spent half their saved time waiting on cache restore.
- Runner sizing. Three classes of runner (light / heavy / GPU) tagged on jobs, autoscaled independently. One pool for everything is always wrong.
- Pipeline visibility. A small dashboard showing slowest stages and flakiest jobs by service. Optimising what you can’t see is guesswork.
- Versioned templates. Services pin `ref: v3`. We can ship a v4 without breaking anyone, then migrate teams in waves.
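The cache-hygiene and runner-sizing points both reduce to a few lines of template YAML. A sketch, assuming runners are tagged `light` / `heavy` / `gpu` as described; the lockfile and paths are illustrative for a Node service:

```yaml
# Per-branch cache: one branch never waits on (or poisons) another
# branch's dependency cache, and a lockfile change busts it cleanly.
cache:
  key:
    files: [package-lock.json]      # or pom.xml / poetry.lock per stack
    prefix: "$CI_COMMIT_REF_SLUG"   # branch-scoped key
  paths: [node_modules/]
  policy: pull-push

# Jobs declare a weight class; each runner pool autoscales on its own.
unit_tests:
  tags: [light]
  script: [npm test]

integration_tests:
  tags: [heavy]
  script: [npm run test:integration]
```

The `prefix`/`files` combination is what keeps restore times sane: the key changes only when the branch or the lockfile does, not on every commit.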
You can’t patch your way out of 300 forked pipelines. Centralise the template, version it, make safety the default, and put the deploy on rails (Argo or equivalent).
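“Deploy on rails” in practice means the pipeline only commits a new image tag to Git and Argo CD does the rest. A minimal Application manifest for one service, with the repo URL, paths and names as placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api               # illustrative service name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitlab.example.com/platform/deployments.git  # placeholder
    targetRevision: main
    path: services/payments-api    # Helm values for this service live here
    helm:
      valueFiles: [values-staging.yaml]
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true                  # Git is the source of truth; drift is reverted
      selfHeal: true
```

With `automated` sync on, a runner never holds cluster credentials; it can only propose state by pushing to the deployments repo.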
What I’d do differently
- Roll out the SonarQube gate before the speed work, not after. Quality regressions are easier to catch on the slow pipeline you already have than on the fast one nobody’s used to yet.
- Build the per-service dashboard on day one. We built it in week six and immediately wished we’d had it from week one.
If your CI/CD is a copy-paste fleet of half-broken pipelines, come and talk to us. The fix is rarely the tool; it’s almost always the template.