This is a debrief, not a sales pitch. Six months into running production AI work for clients — alongside the cloud, DevOps and platform work we’ve done for the better part of a decade — here’s what survived contact with real users, and the patterns we threw away.
Of the 14 agents we’ve scoped this year, 9 are running in production today. The other 5 either got descoped to a deterministic pipeline (because the LLM wasn’t adding anything a SQL query couldn’t) or got killed at eval time. That ratio is the most important number in this post.
What worked
If you take one thing from this post, take this: narrow scope, real evals, observable everything. Every agent that’s still in production today follows those three rules.
- Small, well-defined tools. One thing each. Easy to test. Easy to reason about. Easy to deny when an agent reaches for the wrong one. (See the sketch just after this list.)
- A real eval harness. Not vibes, not screenshots in Slack — a versioned set of inputs, expected behaviour, and a CI job that fails the build when an agent regresses on accuracy or token cost. (Sketched at the end of this section.)
- Humans in the loop where the cost of being wrong is high. Approvals, soft-deletes and a clean audit trail beat “LLM-only” for anything irreversible. We wrote about this in detail in When to put a human in the loop.
- Boring infrastructure. Postgres, Redis, an HTTP service. The novelty should be in the model and prompt, not the deploy story.
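To make the first bullet concrete, here’s roughly what “one tool, one job” looks like. This is a minimal sketch using pydantic for the schemas; the tool, its fields and the in-memory store are hypothetical, not lifted from client code:

```python
from pydantic import BaseModel, Field

# Stand-in data store so the sketch runs; production reads from Postgres.
ORDERS = {"ord_ab12cd34ef56": {"status": "shipped",
                               "updated_at": "2025-01-10T09:30:00Z"}}

class LookupOrderInput(BaseModel):
    order_id: str = Field(pattern=r"^ord_[a-z0-9]{12}$")

class LookupOrderOutput(BaseModel):
    status: str      # e.g. "shipped", "pending"
    updated_at: str  # ISO 8601 timestamp

def lookup_order(raw_args: dict) -> LookupOrderOutput:
    """One tool, one job: read an order's status. No writes, no side effects."""
    args = LookupOrderInput.model_validate(raw_args)  # malformed calls fail here
    return LookupOrderOutput(**ORDERS[args.order_id])
```

A tool this small is trivially unit-testable, and the strict input schema is also what makes it easy to deny: a call that doesn’t validate never runs.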
The agents that stayed in production are the ones we could explain to a junior engineer in fifteen minutes.
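The eval harness is less exotic than it sounds. Here’s a minimal pytest-style sketch; `run_agent`, the case file and both thresholds are illustrative assumptions, not our actual numbers:

```python
import json

from myagent import run_agent  # hypothetical agent entry point

# Versioned cases live in the repo; CI re-runs them on every prompt change.
with open("evals/cases_v3.json") as f:
    CASES = json.load(f)

ACCURACY_FLOOR = 0.92         # the build fails below this...
TOKEN_BUDGET_PER_CASE = 1500  # ...or above this

def test_agent_has_not_regressed():
    correct, tokens = 0, 0
    for case in CASES:
        result = run_agent(case["input"])
        correct += int(result.answer == case["expected"])
        tokens += result.tokens_used
    assert correct / len(CASES) >= ACCURACY_FLOOR
    assert tokens / len(CASES) <= TOKEN_BUDGET_PER_CASE
```

The point is the CI wiring, not the assertions: a prompt tweak that silently doubles token spend fails the build the same way a unit-test regression would.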
What we threw away
We tried — and abandoned — a handful of patterns that felt clever in a demo and broke in production:
- Self-reflective “agent of agents” topologies that doubled latency and tripled cost without measurable accuracy gain.
- Free-form natural-language tool selection where a typed schema did the job for a tenth of the tokens.
- “Just give the LLM admin access” — this is a category of pattern, not a single mistake. Don’t.
- Long-running “autonomous” loops without checkpoints. Every agent in production today either finishes a single task or yields back to a queue.
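That last bullet, in code: a minimal sketch of the finish-or-yield shape, where `agent_step` is a hypothetical single bounded LLM turn and Python’s in-process queue stands in for the Redis queue we’d use in production:

```python
import queue
from dataclasses import dataclass, field, replace

@dataclass(frozen=True)
class StepResult:
    done: bool
    state: dict

@dataclass(frozen=True)
class Task:
    payload: str
    state: dict = field(default_factory=dict)  # checkpointed progress

tasks: "queue.Queue[Task]" = queue.Queue()  # Redis or SQS in production

def agent_step(task: Task) -> StepResult:
    """Hypothetical: one bounded LLM turn plus any tool calls it makes."""
    return StepResult(done=True, state=task.state)  # placeholder body

def run_one(task: Task) -> None:
    """Finish the task or yield back to the queue. Never an unbounded loop."""
    result = agent_step(task)
    if result.done:
        print(f"done: {task.payload}")  # real code writes the audit log here
    else:
        # Checkpoint and re-enqueue instead of looping in-process.
        tasks.put(replace(task, state=result.state))
```

Because progress lives in the queue rather than in a model’s context window, a crash costs you one step, not the whole run.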
If your agent demos beautifully but you can’t answer “what happens when this misfires at 3am with no human around,” you’re not ready to ship.
The shape of an agent that ships
By the time it reaches our staging environment, every agent we build has the same skeleton:
- A `Tool` interface with strict input/output schemas.
- A `Policy` layer that vetoes unsafe calls before the model can fire them.
- A `Memory` store that’s explicit, scoped and printable.
- An `Eval` suite that runs in CI on every prompt change.
- An `Audit` log that tells a human exactly what happened, in plain English, on a bad day.
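As a taste of the `Policy` layer, here’s a deliberately boring, default-deny sketch. The tool names are hypothetical, and a real implementation would also check arguments, not just names:

```python
# Default-deny policy: the model proposes a call, the policy rules on it
# before anything executes.
READ_ONLY_TOOLS = {"lookup_order", "search_docs"}
NEEDS_APPROVAL = {"refund_order", "delete_record"}  # irreversible: humans decide

def policy_check(tool_name: str) -> str:
    """Return 'allow', 'escalate' (human approval), or 'deny'."""
    if tool_name in READ_ONLY_TOOLS:
        return "allow"
    if tool_name in NEEDS_APPROVAL:
        return "escalate"  # routed to a human, per the approvals pattern above
    return "deny"          # unknown tools never fire
```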
None of this is sexy. All of it is the difference between a demo and a product.
What’s next
Over the next few weeks we’ll write up the eval harness in detail (see RAG done right), the policy layer, and a deep dive on the AI incident-triage agent built on LLMs and the Model Context Protocol — covered in AI-driven pipeline triage.
And if you have a job an agent could actually do — not just chat about — come and talk to us.