Essay · Measurement
How do you measure whether oversight is real?
You cannot manage what you have not named, and human oversight of AI has gone unnamed for too long. Here is the small set of numbers that, taken together, tell command apart from theater.
Almost everyone says a human is in the loop. Almost no one says how they would know. The answer is not one number but a short dashboard, because any single metric can be gamed and the value lies in how the metrics check each other. Each metric links to its own page. Read together, they form an honest picture of whether oversight is exercised or merely present.
The dashboard
Meaningful Override Rate. Of the decisions a human could have changed, the share they actually changed where the change then stood. The core measure of whether anyone is exercising command.
Rubber-Stamp Rate. Its shadow: the share approved with no change at all.
Time-to-Human. How long until a real person stands behind a flagged decision.
Oversight Budget. The human attention actually available per thousand decisions, which is the resource all of the above run on.
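To make the four definitions concrete, here is a minimal Python sketch of how they might be computed from a decision log. Everything in it is an assumption for illustration, not the published method: the `Decision` record and its field names are hypothetical, the Rubber-Stamp Rate is taken over reviewable decisions, Time-to-Human is summarised as a median, and the Oversight Budget is expressed in seconds of reviewer time supplied by the caller.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Decision:
    """One logged decision. All field names are hypothetical."""
    reviewable: bool                    # a human could have changed it
    changed: bool                       # the human did change it
    change_stood: bool                  # the change was not later reversed
    seconds_to_human: Optional[float]   # wait until a person took it on; None if not flagged


def dashboard(decisions: list[Decision], available_attention_s: float) -> dict:
    """Compute the four dashboard numbers under the assumptions stated above."""
    reviewable = [d for d in decisions if d.reviewable]
    if not reviewable:
        raise ValueError("no reviewable decisions: the rates are undefined")

    # Meaningful overrides: changed by a human, and the change stood.
    overrides = sum(1 for d in reviewable if d.changed and d.change_stood)
    # Rubber stamps: approved with no change at all.
    stamps = sum(1 for d in reviewable if not d.changed)
    # Waits exist only for decisions that were flagged to a human.
    waits = [d.seconds_to_human for d in decisions if d.seconds_to_human is not None]

    return {
        "meaningful_override_rate": overrides / len(reviewable),
        "rubber_stamp_rate": stamps / len(reviewable),
        "median_time_to_human_s": median(waits) if waits else None,
        # Human attention available per thousand decisions, in seconds.
        "oversight_budget_s_per_1000": 1000 * available_attention_s / len(decisions),
    }
```

Note that the first two rates need not sum to one: a change that was later reversed is neither a meaningful override nor a rubber stamp, and that gap is worth reporting in its own right.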
Why a dashboard, and not a number
A high override rate looks like vigorous oversight until you learn the overrides were wrong more often than the machine. A fast Time-to-Human looks like responsiveness until you learn the human spends two seconds and waves it through. Each metric has a failure mode that another one catches. That is the point of measuring them together: the set is harder to fake than any member of it.
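The two failure modes just described can be written as cross-checks, where one number is only trusted once a second number has been consulted. The sketch below is illustrative only: the function name, its inputs (including the accuracy and review-time figures, which the dashboard itself does not define), and all three thresholds are hypothetical placeholders, not proposed values.

```python
def cross_check(
    override_rate: float,
    override_accuracy: float,       # share of human overrides that were correct
    machine_accuracy: float,        # share of machine decisions that were correct
    median_time_to_human_s: float,
    median_review_s: float,         # how long the human actually spent reviewing
    *,
    high_override: float = 0.2,     # hypothetical threshold
    fast_response_s: float = 60.0,  # hypothetical threshold
    min_review_s: float = 10.0,     # hypothetical threshold
) -> list[str]:
    """Return warnings where a good-looking metric is contradicted by another."""
    flags = []
    # A high override rate is not vigour if the overrides are worse than the machine.
    if override_rate >= high_override and override_accuracy < machine_accuracy:
        flags.append("high override rate, but overrides are less accurate than the machine")
    # A fast Time-to-Human is not responsiveness if the review itself is a glance.
    if median_time_to_human_s <= fast_response_s and median_review_s < min_review_s:
        flags.append("fast Time-to-Human, but review time is negligible")
    return flags
```

An empty list does not prove oversight is real; it only means these two particular contradictions were not found.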
The rule that keeps it honest
Publish the method before the number. A measurement that is defined after you know how you perform on it is marketing. Every metric here is meant to be defined in the open, run on your own systems and on mine, and reported with its limits attached. A ruler is only fair if it is printed before the race.
Start measuring
Begin with the core: the Meaningful Override Rate, my proposed open standard, v0.1. For why this matters at all, read Human Oversight Is Mostly Theater.