Can AI Grade Its Own Homework? The Self-Assessment Problem in AI Governance
Why self-evaluation is not a safety mechanism — and how to tell when a safety report is a marketing document.
A model that cannot reliably identify its own errors in real deployment cannot be trusted to audit itself — yet self-evaluation has become a standard substitute for independent oversight in AI safety reporting.
When you ask an AI tool to evaluate its own output, the results look useful. It flags factual hedges, rewrites with cleaner structure on a second pass, and catches the kinds of errors it was explicitly trained to avoid. For routine tasks — summaries, drafts, code fixes — the before-and-after delta is real.
But the self-grader fails in the same direction every time: it cannot identify the errors it was most likely to make. It consistently misses gaps in its own training data. It rates confident-sounding but incorrect answers as high quality, because from inside the model, confident and accurate look identical. It does not know what it does not know — and it cannot flag that gap, because flagging it would require knowing it exists.
This is not a bug. It is structural. A model trained to produce fluent, confident text will evaluate fluent, confident text as good output. The self-assessment problem is the same problem as hallucination confidence — applied to the evaluation layer instead of the generation layer.
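The failure mode can be made concrete with a toy sketch. This is not any real evaluation system — the word lists and scoring rules below are invented for illustration — but it captures the structure of the problem: a grader that rewards fluent, confident phrasing has no channel through which accuracy can enter its score.

```python
# Toy illustration of a fluency-biased self-grader. The word lists and
# weights are invented; only the structural point matters: accuracy never
# appears anywhere in the scoring function.

HEDGES = {"maybe", "possibly", "unsure", "might"}
CONFIDENT = {"clearly", "definitely", "certainly"}

def fluency_grader(answer: str) -> float:
    """Score text the way a fluency-trained grader does: reward confident
    phrasing, penalise hedging. Correctness is not an input."""
    words = answer.lower().split()
    score = 0.5
    score += 0.1 * sum(w in CONFIDENT for w in words)
    score -= 0.1 * sum(w in HEDGES for w in words)
    return max(0.0, min(1.0, score))

confident_wrong = "The answer is clearly and definitely 42."
hedged_right = "I am unsure but the answer might be 41."

# The wrong-but-confident answer outscores the right-but-hedged one.
assert fluency_grader(confident_wrong) > fluency_grader(hedged_right)
```

The grader is not broken; it is doing exactly what it was optimised to do. That is the point: the evaluation layer inherits the generation layer's bias.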
Several major AI companies publish safety evaluations that rely substantially on model self-assessment — either directly, by asking the model to evaluate its own outputs, or structurally, by using the same model family to evaluate outputs from the same model family. The limitations of this approach are documented in the academic literature. They are rarely disclosed in the safety cards or system cards published alongside model releases.
This is Framework 04 — Policy vs Enforcement — applied to evaluation methodology. A safety process that uses the tool being evaluated to evaluate the tool is not a safety process. It is a documentation exercise with extra steps.
The companies that do this are not necessarily acting in bad faith. Self-evaluation is faster, cheaper, and produces results the company controls before publication. Independent evaluation is slower, more expensive, and produces results the company cannot edit. Given those incentives, the outcome is predictable.
Independent oversight looks different. At a minimum, it requires:

- External red-teaming by teams without access to the model's training data, reward signals, or internal safety documentation. A red team that knows what the model was trained to refuse will find what the model was trained to handle, not what it wasn't.
- Out-of-distribution evaluation: testing on inputs the model was not optimised for, in deployment contexts that differ from the training environment. Most safety failures happen at distribution edges, not in the scenarios the company tested.
- Published failure rates alongside accuracy figures. A safety report that only reports what the model gets right is a marketing document. The failure distribution — what it gets wrong, under what conditions, at what rate — is the safety-relevant information.
- Post-deployment audit trails that can be reviewed after the model is in production, not only before release. Most harm events happen in deployment, not in controlled evaluation. Without logging and retrospective review, safety evaluation is a pre-release ritual.
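The failure-distribution point above is mechanical enough to sketch. The records and condition labels below are hypothetical — no real deployment data is involved — but the sketch shows why a single aggregate accuracy figure hides exactly the information a safety reader needs.

```python
# Hypothetical sketch of failure reporting broken down by deployment
# condition. The log records and condition names are invented.
from collections import defaultdict

def failure_report(records):
    """records: iterable of (condition, passed) pairs from deployment logs.
    Returns {condition: (failure_rate, n)} so edge conditions stay visible
    instead of being averaged away."""
    tally = defaultdict(lambda: [0, 0])  # condition -> [failures, total]
    for condition, passed in records:
        tally[condition][1] += 1
        if not passed:
            tally[condition][0] += 1
    return {c: (failures / n, n) for c, (failures, n) in tally.items()}

logs = (
    [("in_distribution", True)] * 97 + [("in_distribution", False)] * 3
    + [("out_of_distribution", True)] * 6 + [("out_of_distribution", False)] * 4
)

report = failure_report(logs)
# Aggregate accuracy is 93.6%; the out-of-distribution slice fails 40% of
# the time. Only the per-condition breakdown reveals that.
print(report["in_distribution"])      # (0.03, 100)
print(report["out_of_distribution"])  # (0.4, 10)
```

Note that the per-condition sample size is reported alongside the rate: a 40% failure rate on ten samples is a flag for more testing, not a precise estimate.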
The next time an AI company publishes a safety report, check the methodology section. Who ran the evaluation? Did they have access to the model's training data or RLHF reward signals? Is the failure rate published alongside the accuracy figure? Was the evaluator independent of the developer?
If the evaluation was performed primarily by the same model or model family being evaluated, the report is a marketing document. If failure rates are absent, the accuracy figure is not a safety claim. Label both accordingly.
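The methodology questions above reduce to a short checklist. The field names below are illustrative, not a standard schema — the point is only that the classification rule is simple enough to automate.

```python
# A minimal checklist encoding the methodology questions in the text.
# The disclosure keys are invented for illustration, not a standard schema.

REQUIRED_DISCLOSURES = {
    "evaluator_named": "Who ran the evaluation?",
    "evaluator_independent": "Was the evaluator independent of the developer?",
    "no_insider_access": "Did evaluators lack access to training data and reward signals?",
    "failure_rates_published": "Are failure rates published alongside accuracy?",
}

def classify_report(disclosures: dict) -> str:
    """Label a safety report 'safety claim' only if every required
    disclosure holds; otherwise label it the way the text suggests."""
    missing = [q for key, q in REQUIRED_DISCLOSURES.items()
               if not disclosures.get(key)]
    if not missing:
        return "safety claim"
    return "marketing document (missing: %d checks)" % len(missing)

# A report that only names its evaluator still fails three checks.
print(classify_report({"evaluator_named": True}))
```

The strictness is deliberate: per the argument above, any single missing disclosure is enough to downgrade the report, because each one guards against a distinct way self-assessment can be smuggled back in.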
QUESTIONS
Can AI systems evaluate their own outputs reliably?
For narrow, well-defined tasks where the evaluation criteria are explicit, AI self-evaluation can be a useful tool. For safety evaluation — where the goal is to identify unknown failure modes — it is structurally inadequate. A model cannot reliably identify errors that arise from its own training distribution gaps, because identifying those gaps would require knowledge the model does not have.
What is AI red-teaming?
AI red-teaming is adversarial evaluation — deliberately attempting to make a model produce harmful, incorrect, or policy-violating outputs, in order to identify failure modes before deployment. Effective red-teaming uses external teams without insider knowledge of the model's training, and tests deployment-realistic scenarios rather than abstract worst cases.
How do I identify a credible AI safety evaluation?
A credible evaluation discloses who conducted it and their relationship to the developer, publishes failure rates alongside accuracy figures, describes the test distribution and how it was selected, and covers deployment-realistic scenarios. If any of these are absent, the evaluation cannot be treated as a safety claim — regardless of the accuracy figures it reports.