
AI-driven pipeline triage with LLMs + MCP — turning failed builds into fixed PRs

  • 18 min → 90 s median triage
  • 62% auto-resolved
  • 0 writes to main
  • MCP typed tools

The single most expensive thing in a 300-microservice estate is the time engineers spend triaging the same five categories of pipeline failure over and over. Flaky test, expired credential, dependency drift, image pull rate-limit, broken migration. Most failures are recognisable; only a few are genuinely novel. So we built an agent to do the recognising.

I’ve published a working prototype at the AI triage tool repo; this post covers the production-grade version of the same idea, deployed against real GitLab + Kubernetes + Splunk infrastructure.

What it does

  1. Ingests a failed pipeline event from GitLab webhooks.
  2. Pulls context — pipeline logs, job logs, recent commits on the branch, related K8s events, last 15 minutes of Splunk logs scoped to the service.
  3. Classifies the failure against a taxonomy (flaky / infra / config / code / dependency / external).
  4. Drafts an action. Re-run, open an issue, ping a team, or open a draft merge request with a proposed fix.
  5. Hands off to a human with full context for anything irreversible or anything outside the model’s confident range.
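
To make step 1 concrete, here is a minimal sketch of the webhook receiver, assuming FastAPI and GitLab’s pipeline-event payload; the endpoint path, token plumbing and enqueue_triage() are illustrative stand-ins, not the production code:

    from fastapi import FastAPI, Header, HTTPException, Request

    app = FastAPI()

    @app.post("/hooks/gitlab")
    async def gitlab_hook(request: Request, x_gitlab_token: str = Header(default="")):
        # GitLab sends the webhook secret in the X-Gitlab-Token header.
        if x_gitlab_token != expected_token():
            raise HTTPException(status_code=403)
        event = await request.json()
        attrs = event.get("object_attributes", {})
        # Only failed pipeline events enter the triage queue.
        if event.get("object_kind") == "pipeline" and attrs.get("status") == "failed":
            enqueue_triage(project_id=event["project"]["id"], pipeline_id=attrs["id"])
        return {"ok": True}

    def expected_token() -> str:
        ...  # fetched via a short-lived Vault role, never hard-coded

    def enqueue_triage(project_id: int, pipeline_id: int) -> None:
        ...  # hands the failed pipeline to the agent loop (queue omitted here)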

Why MCP changed the build

The Model Context Protocol turned out to be the right abstraction for this. Instead of teaching the model bespoke clients for GitLab, Kubernetes, Splunk and Vault, every system is exposed as an MCP server with a typed tool surface:

  • gitlab.get_pipeline(id), gitlab.get_job_log(id), gitlab.open_mr(branch, title, description)
  • k8s.get_events(namespace, since), k8s.describe_pod(name)
  • splunk.search(query, since, until)
  • vault.read(path) — read-only, scoped to the agent’s own short-lived role

Each tool has a JSON schema, a deny-list for destructive variants, and per-call audit logging. The model never sees raw clients or credentials; it sees tools.
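
As an illustration of that surface, here is roughly what the GitLab server could look like with the MCP Python SDK’s FastMCP helper, which derives each tool’s JSON schema from type hints and docstrings; the GitLabClient stub is a stand-in for a real client, and only the calls we chose to expose exist as tools:

    import logging
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("gitlab")
    audit = logging.getLogger("mcp.audit")

    class GitLabClient:  # stand-in for a real, read-mostly GitLab client
        def job_log(self, job_id: int) -> str: ...
        def create_draft_mr(self, branch: str, title: str, description: str) -> str: ...

    gitlab = GitLabClient()

    @mcp.tool()
    def get_job_log(job_id: int) -> str:
        """Return the raw log of a single CI job (read-only)."""
        audit.info("tool=gitlab.get_job_log job_id=%s", job_id)  # per-call audit line
        return gitlab.job_log(job_id)

    @mcp.tool()
    def open_mr(branch: str, title: str, description: str) -> str:
        """Open a DRAFT merge request; merging stays human-only."""
        audit.info("tool=gitlab.open_mr branch=%s", branch)
        return gitlab.create_draft_mr(branch, title, description)

    # Destructive variants (merge, delete_branch, force-push) are never
    # registered at all: the deny-list is enforced by omission, and the
    # model cannot call what does not exist as a tool.

    if __name__ == "__main__":
        mcp.run()  # stdio transport by default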

MCP isn’t magic, but the typed-tool boundary forces a discipline that’s hard to maintain by hand: every capability the model has is something you reviewed and named.

Guardrails that earned their keep

  • No writes to main. The agent can open a draft MR; only a human can merge.
  • No production cluster writes. Read-only on prod K8s. It can re-run a job; it can’t kill a pod.
  • Bounded retries. If the agent suggests “re-run” three times in a row on the same pipeline, escalate to a human.
  • Confidence thresholds per action. Re-running a flaky test needs 0.7 confidence. Opening an MR with a code change needs 0.9 plus a passing test in CI on the proposed branch.
  • Cost cap per incident. Hard token budget per ticket, hard wall-clock budget. If it can’t solve in budget, escalate.
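
Most of these rules collapse into a single gate in front of every proposed action. A minimal sketch using the thresholds above; the names and budget values are illustrative:

    import time
    from dataclasses import dataclass

    THRESHOLDS = {"rerun": 0.7, "open_issue": 0.7, "open_mr": 0.9}  # per-action confidence

    @dataclass
    class IncidentBudget:
        tokens_left: int = 50_000       # hard token cap per incident (example value)
        deadline: float = 0.0           # wall-clock cutoff, epoch seconds
        rerun_count: int = 0

    def gate(action: str, confidence: float, budget: IncidentBudget,
             ci_green_on_branch: bool = False) -> str:
        """Return 'act' or 'escalate' for one proposed action."""
        if budget.tokens_left <= 0 or time.time() > budget.deadline:
            return "escalate"           # out of budget: hand to a human
        if action == "rerun":
            budget.rerun_count += 1
            if budget.rerun_count > 3:  # bounded retries on the same pipeline
                return "escalate"
        if confidence < THRESHOLDS.get(action, 1.0):
            return "escalate"           # unknown actions default to human-only
        if action == "open_mr" and not ci_green_on_branch:
            return "escalate"           # code-change MRs also need green CI
        return "act"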

The eval harness

  • 800 historical failures with known root cause and known correct action. Run on every prompt change.
  • Per-class precision/recall (flaky vs. real, infra vs. code) tracked over time.
  • “Ghost mode” in production for two weeks before going live — agent runs, drafts, but takes no action; humans review every draft and we measure agreement.
  • After go-live, weekly review of every escalation and every action the agent took unattended.
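
The core of that harness is small: replay the labelled failures through the classifier and track per-class precision and recall. A sketch, with the case shape and the classify callable as assumptions:

    from collections import Counter

    def per_class_prf(cases, classify):
        """cases: iterable of (context, true_label); classify: context -> label."""
        tp, fp, fn = Counter(), Counter(), Counter()
        for context, truth in cases:
            pred = classify(context)
            if pred == truth:
                tp[truth] += 1
            else:
                fp[pred] += 1   # predicted class gets a false positive
                fn[truth] += 1  # true class gets a false negative
        classes = set(tp) | set(fp) | set(fn)
        return {
            c: {
                "precision": tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0,
                "recall":    tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0,
            }
            for c in classes
        }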

What changed

  • Median triage time on recognisable failures: 18 minutes → under 90 seconds.
  • ~62% of pipeline failures resolved without a human (mostly flakes and known transient infra issues).
  • ~22% resolved with a human approving the agent’s proposed MR — the human reviews a one-line diff with a clear explanation, not a 4,000-line log.
  • ~16% genuinely novel and routed to humans with full context.

Takeaway

Triage is the perfect AI use-case: high volume, recognisable patterns, clear ground truth, and a natural human-in-the-loop checkpoint (the MR review). Start with the boring 80% and leave the novel 20% for people.

If pipeline noise is eating your engineers’ mornings, talk to us. The first agent is usually the easiest one to scope.

Have a job we could help with?

30-minute call. Senior engineer on the line. We’ll tell you whether AI, cloud or a custom build is the right tool — even if the answer’s no.