Open standard · Measurement
The Meaningful Override Rate
Human oversight of AI is either real or it is theater. This is the measurement that tells them apart. It is open. Use it, run it, cite it, and hold any system to it, including mine.
A law can require a human in the loop. It cannot, by itself, tell you whether that human ever exercises command. And a power nobody uses looks identical, from the outside, to a power that was never there. So we are left certifying oversight we cannot see. The Meaningful Override Rate, or MOR, is the smallest honest instrument I know of for fixing that. It measures the use, not the clause.
The definition
Of the decisions where a human could have changed the machine's output, the fraction where the human actually changed it and the change stood.
In plain words: of the calls a person could have reversed, how many did they really reverse. Not clicks. Not time on screen. Not a signature at the bottom of a form. A meaningful override is a reversal the human initiated and the organization honored. Authorizing is not overseeing.
Two companions, so the number is honest
MOR on its own can be gamed or misread. It travels with two checks.
Contestation latency. How much time and context the human realistically had to disagree. Three seconds per item is a turnstile, not oversight, at any override rate. Reversal validity. When humans do override, are they right more often than the machine would have been? An override rate that is high but wrong is its own failure, not a success.
What makes a MOR number valid
A figure is only worth citing if it clears all five of these. Anything short of this is an anecdote wearing a percent sign.
- Stated sample. At least roughly one hundred reviewable decisions, not a demo reel.
- Pre-registered meaning. What counts as meaningful is fixed before anyone looks at the results, and published alongside the number.
- One named workflow. A specific task in a specific place, not "enterprises" in general.
- An explicit denominator. Overrides divided by reviewable decisions, with no cherry-picking of the wins.
- An inter-rater check. More than one judge agrees on what counted, so meaningful is not one person's taste.
Report it in exactly this shape, with its own caveat attached:
Presence is not oversight
There is a tempting shortcut: prove oversight by logging that a human was present, or that a human clicked. It is tidy, it is auditable, and it measures the wrong thing. Proof of presence is not proof of command. A confirm button that is always pressed is precisely the theater this metric exists to expose. MOR asks the only question that matters: when it counted, did the human change the outcome.
A human who can technically say no, but never does, and gets blamed when the machine is wrong, is not oversight. It is a crumple zone with a job title.
Status, and why the standard comes before the number
The bet
Here is my falsifiable wager. Across real deployments, measured honestly, the meaningful override rate of most of what is sold today as "human oversight" sits near zero. If you run this and show me a field of systems where humans routinely and validly overrule the machine, I am wrong, and I will say so here. Until then, the burden sits where it belongs: on the claim of oversight to show its number.
Use it. Cite it. Hold me to it.
This standard is free to use. Run it on your systems and on mine. If you want the argument behind it, read Human Oversight Is Mostly Theater. If you want to see what I am building, start here.