Whitepaper · Standards

A standard for AI readiness: the methodology behind Prism

How a 0–100 readiness score is built to be proof-based, comparable and resistant to gaming.

Learning Science · May 2026 · 24 min

Abstract

This paper describes the design of Prism, a measure of applied AI and soft-skill readiness expressed as a 0–100 score. We set out the goals the measure must satisfy to be useful — proof-based, comparable, inclusive and resistant to gaming — and explain the design choices that follow from them. Our aim is not to claim a finished science but to publish a methodology open enough to be trusted and challenged.

1. The problem with self-report

Most existing measures of readiness rest on self-report or weak proxies for it: surveys, self-rated skill sliders, or records of completed content. Self-report is cheap to collect but almost worthless for high-stakes decisions, for two reasons: the people least able to judge their own competence are often the most confident, and anyone with an incentive can inflate their answers. A measure intended to gate opportunity cannot be built on assertions.

2. Design goals

We hold the measure to four explicit goals, and reject design choices that violate any of them:

Proof-based: the score must reflect demonstrated performance on real tasks, not claims about ability.
Comparable: the same score must mean the same thing across different people, places and times, so that scores can be aggregated and tracked.
Inclusive: the measure must not encode advantages of language, bandwidth or background that are irrelevant to the skill itself.
Resistant to gaming: it must be substantially harder to fake a high score than to actually earn one.

3. How the score is constructed

Rather than a single exam, the score aggregates performance across a set of task families that probe distinct, transferable capabilities — for example structured reasoning, communication, problem decomposition and effective use of AI tools. Each task is scored against a rubric, with open responses evaluated for the qualities a human assessor would look for rather than for surface keywords.

Performance is combined using confidence-aware aggregation: the score reflects not only how well someone did but how much evidence supports that estimate. A single strong answer moves the score less than a consistent pattern across tasks, which is what we would expect of a defensible measure of capability.
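The behaviour described above can be illustrated with a minimal sketch. It assumes a simple shrinkage estimator, which is one common way to make an aggregate confidence-aware and is not necessarily how Prism computes its score. The names `TaskResult`, `readiness_score` and `evidence_weight`, and the parameters `prior_mean` and `prior_weight`, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    family: str   # hypothetical task-family label, e.g. "structured_reasoning"
    score: float  # rubric score, assumed to be normalised to the range 0..1


def readiness_score(results, prior_mean=0.5, prior_weight=4.0):
    """Shrinkage estimate on a 0-100 scale (illustrative assumption).

    With little evidence the estimate stays close to prior_mean; as
    results accumulate it converges on the observed average.
    """
    total = sum(r.score for r in results)
    estimate = (prior_mean * prior_weight + total) / (prior_weight + len(results))
    return round(100 * estimate, 1)


def evidence_weight(results, prior_weight=4.0):
    """Share of the estimate that comes from observed tasks (0..1)."""
    n = len(results)
    return n / (prior_weight + n)


# One perfect answer versus a consistent pattern of strong answers.
single = [TaskResult("structured_reasoning", 1.0)]
pattern = [TaskResult(f"family_{i % 4}", 0.9) for i in range(8)]

print(readiness_score(single), evidence_weight(single))
print(readiness_score(pattern), round(evidence_weight(pattern), 2))
```

Under these assumed parameters, a single perfect answer lifts the score only from 50 to 60, while eight consistent answers at 0.9 lift it to about 77. The choice of `prior_weight` controls how much evidence is needed before the observed average dominates.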

4. Resistance to gaming

A measure that gates opportunity will be attacked, so anti-gaming is a first-class design concern rather than a patch. Defences include task variation so the same item is rarely seen twice, evaluation of reasoning rather than final answers alone, detection of response patterns inconsistent with genuine problem-solving, and — crucially — designing tasks so that the easiest way to score well is to actually possess the skill.
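As one illustration of task variation, the sketch below derives a fresh but reproducible variant of a task from a template. Everything in it is an assumption made for illustration: the function `task_variant`, the seeding scheme and the invoice-style prompt are hypothetical, and a real task would also ask for reasoning rather than a final number alone.

```python
import hashlib
import random


def task_variant(candidate_id: str, attempt: int, template_id: str) -> dict:
    """Build a task variant that is stable for one attempt and varies across attempts."""
    # Derive a seed from the attempt identity so the variant can be
    # regenerated later for audit without storing it.
    material = f"{candidate_id}:{attempt}:{template_id}".encode()
    seed = int.from_bytes(hashlib.sha256(material).digest()[:8], "big")
    rng = random.Random(seed)

    # Hypothetical template parameters; the surface details change, the skill does not.
    units = rng.randint(20, 200)
    unit_price = rng.choice([3, 7, 12, 19])
    discount_pct = rng.choice([5, 10, 15, 20])

    prompt = (
        f"A supplier quotes {units} units at {unit_price} each with a "
        f"{discount_pct}% discount. Explain how you would check the total."
    )
    return {
        "prompt": prompt,
        "units": units,
        "unit_price": unit_price,
        "discount_pct": discount_pct,
        "expected_total": units * unit_price * (100 - discount_pct) / 100,
    }


variant = task_variant("candidate-001", attempt=1, template_id="invoice-check")
print(variant["prompt"])
```

Seeding from the attempt identity, rather than drawing parameters at random each time, keeps every variant reproducible, which matters if a score is later disputed.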

The strongest anti-gaming defence is a test where studying for it and learning the skill are the same activity.

5. Comparability and drift

For a score to be aggregated into an index, a 72 today must mean what a 72 meant last quarter. We treat comparability as an ongoing calibration problem: we monitor for drift as tasks and cohorts change, and recalibrate so that a given score keeps representing the same level of performance. Without this discipline, any population-level measure built on the score would be measuring its own noise.
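One way to make the calibration idea concrete is sketched below. It assumes a frozen panel of reference responses that was scored last quarter and is re-scored with the current pipeline, so that any difference reflects drift in the measure rather than change in people. Both the panel and the mean-sigma linear adjustment are illustrative assumptions, not a description of Prism's calibration procedure, and the function names `panel_drift` and `to_reference_scale` are hypothetical.

```python
from statistics import mean, pstdev


def panel_drift(reference_scores, current_scores):
    """Shift in the panel's mean score, in units of the reference spread."""
    spread = pstdev(reference_scores)
    if spread == 0:
        raise ValueError("reference panel has no spread")
    return (mean(current_scores) - mean(reference_scores)) / spread


def to_reference_scale(raw, reference_scores, current_scores):
    """Map a current-pipeline score onto the reference scale (mean-sigma linear equating)."""
    current_spread = pstdev(current_scores)
    if current_spread == 0:
        raise ValueError("current panel has no spread")
    slope = pstdev(reference_scores) / current_spread
    return mean(reference_scores) + slope * (raw - mean(current_scores))


# Same frozen responses, scored last quarter and again today (made-up numbers).
last_quarter = [60, 70, 80]
today = [63, 73, 83]

print(panel_drift(last_quarter, today))             # positive: pipeline now scores higher
print(to_reference_scale(75, last_quarter, today))  # a raw 75 today on last quarter's scale
```

In this made-up example the current pipeline scores the same responses three points higher, so a raw 75 today corresponds to a 72 on last quarter's scale. A production system would need a much larger panel and a threshold for when drift is large enough to act on.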

6. Validity, limitations and openness

No single number captures a person, and we do not pretend otherwise. The score is a floor, not a ceiling: a credible signal that gets a capable person into the room, and one that complements human judgement rather than replacing it. We are explicit about limitations: construct coverage, cultural and linguistic fairness, and predictive validity are all areas of continuing work. We publish the methodology precisely so that others can scrutinise and improve it. A standard earns trust by being open to challenge, not by appearing certain.
