Measure · A concept I propose

Escalation Fit

It is not enough that an AI system escalates to a human. It has to escalate the right things. Send too little and the agent decides hard cases alone; send too much and the human is buried and starts approving on reflex.

Manj Chenna · Founder, Sanctity · Building human judgment infrastructure · Amsterdam

Escalation fit is a proposed way to ask whether the decisions reaching a human are the ones that genuinely needed one. I offer it as a concept, at v0.1, not a finished formula. The point it captures is simple and important: good oversight is not about the volume of escalations but their aim. Both failures, escalating too little and escalating too much, quietly destroy meaningful human oversight, and they look nothing alike, so it helps to name the thing they both miss.

The two failure modes

Too few. The system handles consequential, uncertain calls on its own, and the human only sees the easy ones. Oversight looks calm and is absent where it matters. Too many. Every borderline case lands on a person, who cannot possibly attend to all of them and learns to approve on reflex, recreating the rubber stamp. Good fit is the narrow band between, where the decisions a human sees are the decisions a human was needed for.

Why fit, not volume

Because a high escalation count can mean diligent oversight or a drowning reviewer, and a low one can mean a well-tuned system or an agent quietly acting alone. The raw number cannot tell you which. Fit asks the question the number cannot: of the calls that truly needed a human, how many got one, and of the ones that did not, how many wasted a person's attention. I hold the exact math deliberately; the concept is the contribution here.

Read on

See when an AI agent should escalate and the rubber-stamp problem.