“Should there be a human in the loop?” is the wrong question. The right questions are where the human goes, on which actions, with what context, and how to keep them from becoming the bottleneck. Here’s the framework we use across every AI engagement.
The four axes
- Reversibility. Can we undo this in under five minutes if it’s wrong? Sending an internal Slack message: yes. Issuing a refund, deleting a customer record, posting on a public account: no.
- Blast radius. Does this affect one record or ten thousand? A bulk action is a different risk class from a single one even if the per-item risk is the same.
- Audit need. Will a regulator, an auditor, or a customer ever ask “who decided this?” If yes, the human needs to be the deciding signal — not a rubber stamp.
- Confidence gradient. How does the model’s self-reported confidence map to actual accuracy on this task? On well-evaluated tasks you can trust a 0.95 threshold. On novel ones you can’t.
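One way to check that mapping is a calibration table over a labeled eval set: bucket predictions by self-reported confidence and compare against actual accuracy. A minimal sketch, assuming a `(confidence, was_correct)` record format (an illustration, not a prescribed schema):

```python
from collections import defaultdict

def calibration_table(records, bucket_width=0.1):
    """Bucket eval records by model confidence; report actual accuracy.

    `records` is a list of (confidence, was_correct) pairs from a
    labeled eval set. Returns {bucket_floor: (accuracy, n)}.
    """
    buckets = defaultdict(lambda: [0, 0])  # bucket -> [n_correct, n_total]
    top = int(1 / bucket_width) - 1        # clamp conf == 1.0 into the last bucket
    for conf, correct in records:
        b = min(int(conf / bucket_width), top)
        buckets[b][0] += int(correct)
        buckets[b][1] += 1
    return {
        round(b * bucket_width, 2): (n_correct / n_total, n_total)
        for b, (n_correct, n_total) in sorted(buckets.items())
    }
```

If the 0.9–1.0 bucket shows 70% actual accuracy, a 0.95 auto-execute threshold is not trustworthy on that task, regardless of what the model reports.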
The decision matrix we actually use
- Reversible + low blast + low audit need: auto-execute. Log everything; sample 1% for human review.
- Reversible + medium blast OR medium audit need: auto-execute with a 30-second “cancel” window the user can interrupt.
- Irreversible OR high blast OR regulated: human approval required, every time, with full context inline.
- Bulk actions: approval required at the action level, not the item level. Show a sample, show the count, show the predicted outcome distribution.
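The matrix above can be sketched as a routing function. This is an illustrative encoding under assumed field names (`Action`, `Route`, and the low/medium/high levels are hypothetical), not our production code:

```python
from dataclasses import dataclass
from enum import Enum

class Route(Enum):
    AUTO = "auto-execute; log everything, sample 1% for review"
    CANCEL_WINDOW = "auto-execute with a 30-second cancel window"
    HUMAN_APPROVAL = "human approval required, full context inline"

@dataclass
class Action:
    reversible: bool   # undoable in under five minutes?
    blast: str         # "low" | "medium" | "high"
    audit_need: str    # "low" | "medium" | "high"
    regulated: bool = False
    bulk: bool = False

def route(action: Action) -> Route:
    # Irreversible, high blast, regulated, or high audit need:
    # always a human decision.
    if (not action.reversible or action.blast == "high"
            or action.regulated or action.audit_need == "high"):
        return Route.HUMAN_APPROVAL
    # Bulk actions get approval at the action level (last row of the matrix).
    if action.bulk:
        return Route.HUMAN_APPROVAL
    if action.blast == "medium" or action.audit_need == "medium":
        return Route.CANCEL_WINDOW
    return Route.AUTO
```

The point of writing it down as code, even a toy version, is that the policy becomes testable and reviewable instead of living in individual engineers’ heads.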
The mistake teams make is putting the human in the loop everywhere — which trains them to click “approve” without reading. A human approver who doesn’t read isn’t a control. They’re a liability with extra steps.
Designing the approval UX
- One screen, full context. The reviewer sees the input, the proposed action, why the model chose it, the confidence, the affected records. No clicking around.
- Highlight the dangerous bits. If the action involves money, a public surface, or PII — surface that prominently in the approval card.
- Easy reject with reason. Reject reasons go straight into the eval set. The bottleneck becomes training data.
- Bulk-approve with sampling. Show 5 random examples from a batch. Approver checks those; system applies the policy to the rest. Massive throughput unlock when used carefully.
- SLA on the queue. If approvers are slower than the inbound rate, the system has to throttle, not silently back-pressure.
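The bulk-approve pattern above (show a sample, the count, and the predicted outcome distribution) can be sketched as a small batch summariser; the dict schema and `predicted_outcome` key are assumptions for illustration:

```python
import random
from collections import Counter

def bulk_approval_card(batch, sample_size=5, seed=None):
    """Summarise a batch for action-level approval.

    `batch` is a list of dicts, each with a "predicted_outcome" key
    (a hypothetical schema; adapt to whatever your pipeline emits).
    Returns the total count, a random sample for the approver to
    inspect, and the distribution of predicted outcomes.
    """
    rng = random.Random(seed)
    sample = rng.sample(batch, min(sample_size, len(batch)))
    return {
        "count": len(batch),
        "sample": sample,
        "outcome_distribution": dict(
            Counter(item["predicted_outcome"] for item in batch)
        ),
    }
```

The approver judges the sample against the distribution: if five out of five look right but the distribution shows an outcome class they didn’t expect, that is the signal to reject and investigate rather than bulk-approve.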
Anti-patterns we’ve seen (and removed)
- Approval queues with no SLA — users gave up and the team disabled the AI.
- “Approve all” buttons with no sampling — turns the human into a checkbox.
- Approval cards missing the model’s reasoning — reviewers have to re-derive the answer to know if it’s right.
- Putting the same human on both the action and the audit — defeats the point of the audit.
Takeaway
Humans in the loop work when the loop is short, the context is complete, and the job is “decide,” not “rubber-stamp.” Anything else is theatre.
Designing an AI workflow and not sure where the humans go? Talk to us — thirty minutes with a senior engineer, no slide decks.