This is a debrief, not a sales pitch. Six months into running production AI work for clients — alongside the cloud, DevOps and platform work we’ve done for the better part of a decade — here’s what survived contact with real users, and the patterns we threw away.
Of the 14 agents we’ve scoped this year, 9 are running in production today. The other 5 either got descoped to a deterministic pipeline (because the LLM wasn’t adding anything a SQL query couldn’t) or got killed at eval time. That ratio is the most important number in this post.
What worked
If you take one thing from this post, take this: narrow scope, real evals, observable everything. Every agent that’s still in production today follows those three rules.
- Small, well-defined tools. One thing each. Easy to test. Easy to reason about. Easy to deny when an agent reaches for the wrong one. (See the sketch just after this list.)
- A real eval harness. Not vibes, not screenshots in Slack — a versioned set of inputs, expected behaviour, and a CI job that fails the build when an agent regresses on accuracy or token cost. (Sketched at the end of this section.)
- Humans in the loop where the cost of being wrong is high. Approvals, soft-deletes and a clean audit trail beat “LLM-only” for anything irreversible. We wrote about this in detail in When to put a human in the loop.
- Boring infrastructure. Postgres, Redis, an HTTP service. The novelty should be in the model and prompt, not the deploy story.
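To make the first bullet concrete, here’s roughly what “one tool, one job” looks like. This is a minimal sketch using pydantic for the schemas; the tool, its fields and the in-memory store are hypothetical, not lifted from client code:

```python
from pydantic import BaseModel, Field

# Stand-in data store so the sketch runs; production reads from Postgres.
ORDERS = {"ord_ab12cd34ef56": {"status": "shipped",
                               "updated_at": "2025-01-10T09:30:00Z"}}

class LookupOrderInput(BaseModel):
    order_id: str = Field(pattern=r"^ord_[a-z0-9]{12}$")

class LookupOrderOutput(BaseModel):
    status: str      # e.g. "shipped", "pending"
    updated_at: str  # ISO 8601 timestamp

def lookup_order(raw_args: dict) -> LookupOrderOutput:
    """One tool, one job: read an order's status. No writes, no side effects."""
    args = LookupOrderInput.model_validate(raw_args)  # malformed calls fail here
    return LookupOrderOutput(**ORDERS[args.order_id])
```

A tool this small is trivially unit-testable, and the strict input schema is also what makes it easy to deny: a call that doesn’t validate never runs.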
The agents that stayed in production are the ones we could explain to a junior engineer in fifteen minutes.
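The eval harness is less exotic than it sounds. Here’s a minimal pytest-style sketch; `run_agent`, the case file and both thresholds are illustrative assumptions, not our actual numbers:

```python
import json

from myagent import run_agent  # hypothetical agent entry point

# Versioned cases live in the repo; CI re-runs them on every prompt change.
with open("evals/cases_v3.json") as f:
    CASES = json.load(f)

ACCURACY_FLOOR = 0.92         # the build fails below this...
TOKEN_BUDGET_PER_CASE = 1500  # ...or above this

def test_agent_has_not_regressed():
    correct, tokens = 0, 0
    for case in CASES:
        result = run_agent(case["input"])
        correct += int(result.answer == case["expected"])
        tokens += result.tokens_used
    assert correct / len(CASES) >= ACCURACY_FLOOR
    assert tokens / len(CASES) <= TOKEN_BUDGET_PER_CASE
```

The point is the CI wiring, not the assertions: a prompt tweak that silently doubles token spend fails the build the same way a unit-test regression would.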
What we threw away
We tried — and abandoned — a handful of patterns that felt clever in a demo and broke in production:
- Self-reflective “agent of agents” topologies that doubled latency and tripled cost without measurable accuracy gain.
- Free-form natural-language tool selection where a typed schema did the job for a tenth of the tokens.
- “Just give the LLM admin access” — this is a category of pattern, not a single mistake. Don’t.
- Long-running “autonomous” loops without checkpoints. Every agent in production today either finishes a single task or yields back to a queue.
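That last bullet, in code: a minimal sketch of the finish-or-yield shape, where `agent_step` is a hypothetical single bounded LLM turn and Python’s in-process queue stands in for the Redis queue we’d use in production:

```python
import queue
from dataclasses import dataclass, field, replace

@dataclass(frozen=True)
class StepResult:
    done: bool
    state: dict

@dataclass(frozen=True)
class Task:
    payload: str
    state: dict = field(default_factory=dict)  # checkpointed progress

tasks: "queue.Queue[Task]" = queue.Queue()  # Redis or SQS in production

def agent_step(task: Task) -> StepResult:
    """Hypothetical: one bounded LLM turn plus any tool calls it makes."""
    return StepResult(done=True, state=task.state)  # placeholder body

def run_one(task: Task) -> None:
    """Finish the task or yield back to the queue. Never an unbounded loop."""
    result = agent_step(task)
    if result.done:
        print(f"done: {task.payload}")  # real code writes the audit log here
    else:
        # Checkpoint and re-enqueue instead of looping in-process.
        tasks.put(replace(task, state=result.state))
```

Because progress lives in the queue rather than in a model’s context window, a crash costs you one step, not the whole run.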
If your agent demos beautifully but you can’t answer “what happens when this misfires at 3am with no human around,” you’re not ready to ship.
The shape of an agent that ships
By the time it reaches our staging environment, every agent we build has the same skeleton:
- A `Tool` interface with strict input/output schemas.
- A `Policy` layer that vetoes unsafe calls before the model can fire them.
- A `Memory` store that’s explicit, scoped and printable.
- An `Eval` suite that runs in CI on every prompt change.
- An `Audit` log that tells a human exactly what happened, in plain English, on a bad day.
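As a taste of the `Policy` layer, here’s a deliberately boring, default-deny sketch. The tool names are hypothetical, and a real implementation would also check arguments, not just names:

```python
# Default-deny policy: the model proposes a call, the policy rules on it
# before anything executes.
READ_ONLY_TOOLS = {"lookup_order", "search_docs"}
NEEDS_APPROVAL = {"refund_order", "delete_record"}  # irreversible: humans decide

def policy_check(tool_name: str) -> str:
    """Return 'allow', 'escalate' (human approval), or 'deny'."""
    if tool_name in READ_ONLY_TOOLS:
        return "allow"
    if tool_name in NEEDS_APPROVAL:
        return "escalate"  # routed to a human, per the approvals pattern above
    return "deny"          # unknown tools never fire
```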
None of this is sexy. All of it is the difference between a demo and a product.
What’s next
Over the next few weeks we’ll write up the eval harness in detail (see RAG done right), the policy layer, and a deep dive on the AI incident-triage agent built on LLMs and the Model Context Protocol — covered in AI-driven pipeline triage.
And if you have a job an agent could actually do — not just chat about — come and talk to us.