
AI-driven pipeline triage with LLMs + MCP — turning failed builds into fixed PRs

  • 18 min → 90 s median triage
  • 62% auto-resolved
  • 0 writes to main
  • MCP typed tools

The single most expensive thing in a 300-microservice estate is the time engineers spend triaging the same five categories of pipeline failure over and over. Flaky test, expired credential, dependency drift, image pull rate-limit, broken migration. Most failures are recognisable; only a few are genuinely novel. So we built an agent to do the recognising.

I’ve published a working prototype at the AI triage tool repo; this post covers the production-grade version of the same idea, deployed against real GitLab + Kubernetes + Splunk infrastructure.

What it does

  1. Ingests a failed pipeline event from GitLab webhooks.
  2. Pulls context — pipeline logs, job logs, recent commits on the branch, related K8s events, last 15 minutes of Splunk logs scoped to the service.
  3. Classifies the failure against a taxonomy (flaky / infra / config / code / dependency / external).
  4. Drafts an action. Re-run, open an issue, ping a team, or open a draft merge request with a proposed fix.
  5. Hands off to a human with full context for anything irreversible or anything outside the model’s confident range.
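
To make step 1 concrete, here is a minimal sketch of the webhook receiver, assuming FastAPI and GitLab’s pipeline-event payload; the endpoint path, token plumbing and enqueue_triage() are illustrative stand-ins, not the production code:

    from fastapi import FastAPI, Header, HTTPException, Request

    app = FastAPI()

    @app.post("/hooks/gitlab")
    async def gitlab_hook(request: Request, x_gitlab_token: str = Header(default="")):
        # GitLab sends the webhook secret in the X-Gitlab-Token header.
        if x_gitlab_token != expected_token():
            raise HTTPException(status_code=403)
        event = await request.json()
        attrs = event.get("object_attributes", {})
        # Only failed pipeline events enter the triage queue.
        if event.get("object_kind") == "pipeline" and attrs.get("status") == "failed":
            enqueue_triage(project_id=event["project"]["id"], pipeline_id=attrs["id"])
        return {"ok": True}

    def expected_token() -> str:
        ...  # fetched via a short-lived Vault role, never hard-coded

    def enqueue_triage(project_id: int, pipeline_id: int) -> None:
        ...  # hands the failed pipeline to the agent loop (queue omitted here)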

Why MCP changed the build

The Model Context Protocol turned out to be the right abstraction for this. Instead of teaching the model bespoke clients for GitLab, Kubernetes, Splunk and Vault, every system is exposed as an MCP server with a typed tool surface:

  • gitlab.get_pipeline(id), gitlab.get_job_log(id), gitlab.open_mr(branch, title, description)
  • k8s.get_events(namespace, since), k8s.describe_pod(name)
  • splunk.search(query, since, until)
  • vault.read(path) — read-only, scoped to the agent’s own short-lived role

Each tool has a JSON schema, a deny-list for destructive variants, and per-call audit logging. The model never sees raw clients or credentials; it sees tools.
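
As an illustration of that surface, here is roughly what the GitLab server could look like with the MCP Python SDK’s FastMCP helper, which derives each tool’s JSON schema from type hints and docstrings; the GitLabClient stub is a stand-in for a real client, and only the calls we chose to expose exist as tools:

    import logging
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("gitlab")
    audit = logging.getLogger("mcp.audit")

    class GitLabClient:  # stand-in for a real, read-mostly GitLab client
        def job_log(self, job_id: int) -> str: ...
        def create_draft_mr(self, branch: str, title: str, description: str) -> str: ...

    gitlab = GitLabClient()

    @mcp.tool()
    def get_job_log(job_id: int) -> str:
        """Return the raw log of a single CI job (read-only)."""
        audit.info("tool=gitlab.get_job_log job_id=%s", job_id)  # per-call audit line
        return gitlab.job_log(job_id)

    @mcp.tool()
    def open_mr(branch: str, title: str, description: str) -> str:
        """Open a DRAFT merge request; merging stays human-only."""
        audit.info("tool=gitlab.open_mr branch=%s", branch)
        return gitlab.create_draft_mr(branch, title, description)

    # Destructive variants (merge, delete_branch, force-push) are never
    # registered at all: the deny-list is enforced by omission, and the
    # model cannot call what does not exist as a tool.

    if __name__ == "__main__":
        mcp.run()  # stdio transport by default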

MCP isn’t magic, but the typed-tool boundary forces a discipline that’s hard to maintain by hand: every capability the model has is something you reviewed and named.

Guardrails that earned their keep

  • No writes to main. The agent can open a draft MR; only a human can merge.
  • No production cluster writes. Read-only on prod K8s. It can re-run a job; it can’t kill a pod.
  • Bounded retries. If the agent suggests “re-run” three times in a row on the same pipeline, escalate to a human.
  • Confidence thresholds per action. Re-running a flaky test needs 0.7 confidence. Opening an MR with a code change needs 0.9 plus a passing test in CI on the proposed branch.
  • Cost cap per incident. Hard token budget per ticket, hard wall-clock budget. If it can’t solve in budget, escalate.
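
Most of these rules collapse into a single gate in front of every proposed action. A minimal sketch using the thresholds above; the names and budget values are illustrative:

    import time
    from dataclasses import dataclass

    THRESHOLDS = {"rerun": 0.7, "open_issue": 0.7, "open_mr": 0.9}  # per-action confidence

    @dataclass
    class IncidentBudget:
        tokens_left: int = 50_000       # hard token cap per incident (example value)
        deadline: float = 0.0           # wall-clock cutoff, epoch seconds
        rerun_count: int = 0

    def gate(action: str, confidence: float, budget: IncidentBudget,
             ci_green_on_branch: bool = False) -> str:
        """Return 'act' or 'escalate' for one proposed action."""
        if budget.tokens_left <= 0 or time.time() > budget.deadline:
            return "escalate"           # out of budget: hand to a human
        if action == "rerun":
            budget.rerun_count += 1
            if budget.rerun_count > 3:  # bounded retries on the same pipeline
                return "escalate"
        if confidence < THRESHOLDS.get(action, 1.0):
            return "escalate"           # unknown actions default to human-only
        if action == "open_mr" and not ci_green_on_branch:
            return "escalate"           # code-change MRs also need green CI
        return "act"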

The eval harness

  • 800 historical failures with known root cause and known correct action. Run on every prompt change.
  • Per-class precision/recall (flaky vs. real, infra vs. code) tracked over time.
  • “Ghost mode” in production for two weeks before going live — agent runs, drafts, but takes no action; humans review every draft and we measure agreement.
  • After go-live, weekly review of every escalation and every action the agent took unattended.
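
The core of that harness is small: replay the labelled failures through the classifier and track per-class precision and recall. A sketch, with the case shape and the classify callable as assumptions:

    from collections import Counter

    def per_class_prf(cases, classify):
        """cases: iterable of (context, true_label); classify: context -> label."""
        tp, fp, fn = Counter(), Counter(), Counter()
        for context, truth in cases:
            pred = classify(context)
            if pred == truth:
                tp[truth] += 1
            else:
                fp[pred] += 1   # predicted class gets a false positive
                fn[truth] += 1  # true class gets a false negative
        classes = set(tp) | set(fp) | set(fn)
        return {
            c: {
                "precision": tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0,
                "recall":    tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0,
            }
            for c in classes
        }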

What changed

  • Median triage time on recognisable failures: 18 minutes → under 90 seconds.
  • ~62% of pipeline failures resolved without a human (mostly flakes and known transient infra issues).
  • ~22% resolved with a human approving the agent’s proposed MR — the human reviews a one-line diff with a clear explanation, not a 4,000-line log.
  • ~16% genuinely novel and routed to humans with full context.

Takeaway

Triage is the perfect AI use-case: high volume, recognisable patterns, clear ground truth, and a natural human-in-the-loop checkpoint (the MR review). Start with the boring 80% and leave the novel 20% for people.

If pipeline noise is eating your engineers’ mornings, talk to us. The first agent is usually the easiest one to scope.

Have a job we could help with?

30-minute call. Senior engineer on the line. We’ll tell you whether AI, cloud or a custom build is the right tool — even if the answer’s no.