How to write rubrics that teach AI to think

When AI handles a hard question well — a treatment plan, a legal argument, a layered explanation — something specific made that happen. It wasn't just the volume of training data. It was a rubric.
What is a rubric?
A rubric is a checklist of criteria attached to a specific task. You take a prompt, imagine what a genuinely good answer looks like, then break that ideal down into individual checkboxes — each one binary, each one independent.
For example, if the prompt is "explain the causes of inflation," a rubric might include:
Did the response identify at least three contributing factors? (yes / no)
Did it distinguish between demand-pull and cost-push inflation? (yes / no)
Is it under 400 words? (yes / no)
Each box can be checked without looking at the others. Each one is either met or it isn't. Together, they define what "good" looks like for that specific prompt — precisely enough that a model can learn from the signal.
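To make the shape concrete, here is a minimal sketch of that rubric as code. It is an illustration, not a real grading pipeline: the `Criterion` class and the keyword checks are hypothetical stand-ins for what a human or model grader would actually assess.

```python
# A minimal sketch of a rubric as data. In real training pipelines the
# checks are applied by human or model graders, not string matching;
# the keyword heuristics below are illustrative stand-ins only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    description: str
    check: Callable[[str], bool]  # binary: met or not met, nothing in between

# Rubric for the prompt "explain the causes of inflation"
rubric = [
    Criterion(
        "Identifies at least three contributing factors",
        lambda r: sum(k in r.lower() for k in
                      ("demand", "supply", "monetary", "wage", "energy")) >= 3,
    ),
    Criterion(
        "Distinguishes demand-pull from cost-push inflation",
        lambda r: "demand-pull" in r.lower() and "cost-push" in r.lower(),
    ),
    Criterion(
        "Is under 400 words",
        lambda r: len(r.split()) < 400,
    ),
]

def grade(response: str) -> dict[str, bool]:
    # Each criterion is checked independently of the others.
    return {c.description: c.check(response) for c in rubric}
```

Note that `grade` returns one independent yes/no per criterion rather than a single blended score; that independence is what the three rules below protect.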
The difficulty comes from a tension at the heart of rubric design: criteria need to be specific enough to grade objectively while remaining flexible enough to accommodate multiple valid responses. Getting that balance right is the craft.
The three rules of good rubric design
Every criterion needs to meet three standards. Fail any one of them, and the rubric won't produce clean, reliable grades.
Rule 1: Atomicity — one thing at a time
Each criterion covers exactly one thing. When criteria bundle two items together — "the response identifies the correct answer and explains the reasoning" — you create a grading problem: what score does a response get if it gets one right and the other wrong?
| ✗ Avoid — bundled | ✓ Use this — atomic |
|---|---|
| The response identifies the correct answer and explains the reasoning | Two separate criteria: one for the answer, one for the reasoning |
Compound criteria can't be graded cleanly. The fix is always to split them.
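In code terms, the split looks like this. This is a hedged sketch: the grader functions are placeholder stubs, not real checks.

```python
# Placeholder graders. Real grading would be done by a human or a
# model judge; these stubs exist only to make the example runnable.
def has_correct_answer(response: str) -> bool:
    return "42" in response  # hypothetical expected answer

def explains_reasoning(response: str) -> bool:
    return "because" in response.lower()  # crude stand-in

# Bundled: one yes/no covers two failure modes. A response with the
# right answer but no reasoning has no clean score.
def bundled(response: str) -> bool:
    return has_correct_answer(response) and explains_reasoning(response)

# Atomic: each failure mode gets its own criterion.
atomic = {
    "Identifies the correct answer": has_correct_answer,
    "Explains the reasoning": explains_reasoning,
}
```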
Rule 2: Specificity — binary, no guesswork
Criteria need to be true or false in a way that any grader could apply consistently. Vague language creates interpretive wiggle room that produces inconsistent scores.
| ✗ Too vague | ✓ Specific and binary |
|---|---|
| The response is concise | The response is 500 words or fewer |
| The response is well-formatted | The response uses bullet points or a numbered list |
The practical test: could two graders, working independently, reach the same score without discussing it? If not, tighten it.
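The word-count criterion from the table passes this test trivially; it can even be checked mechanically. A sketch (the function name is ours, not from any grading tool):

```python
# The two-grader test, mechanized. "Concise" invites judgment calls;
# a word count does not. The 500-word threshold comes from the table.
def is_500_words_or_fewer(response: str) -> bool:
    return len(response.split()) <= 500
```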
Rule 3: Self-containment — no outside research needed
Every criterion should include everything a grader needs to evaluate it. Criteria that require prior expertise to apply aren't rubrics — they're trivia questions.
| ✗ Requires a lookup | ✓ Self-contained |
|---|---|
| The response names a Nobel Prize winner in Physics from 2023 | The response names one of the 2023 Nobel Prize winners: Pierre Agostini, Ferenc Krausz, or Anne L'Huillier |
This matters especially when rubrics are evaluated at scale, by graders with varying levels of domain knowledge.
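As a sketch, the self-contained version of the Nobel criterion bundles its reference answers directly into the check, so nothing has to be looked up (the function is illustrative, not from a real grading system):

```python
# A self-contained criterion: the accepted answers travel with it,
# so a grader with no physics background can still apply it.
NOBEL_PHYSICS_2023 = ("Pierre Agostini", "Ferenc Krausz", "Anne L'Huillier")

def names_a_2023_physics_laureate(response: str) -> bool:
    return any(name in response for name in NOBEL_PHYSICS_2023)
```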
The five dimensions to check
A rubric focused only on accuracy, for a prompt that also specifies a format, is an incomplete rubric. Working through each dimension is the most reliable way to avoid gaps.
Accuracy — are the facts correct?
Completeness — were all parts of the prompt addressed, including implicit ones?
Communication quality — is the explanation clear and well-structured?
Instruction-following — did the model do what was actually asked?
Context awareness — did the model understand the user's situation and role?
Weights: not all criteria are equal
Some criteria are mandatory — a cookie recipe without flour isn't a recipe. Some are valuable but not essential — vanilla extract improves the cookies, but they'd still be cookies without it. And some criteria carry negative weight, flagging things a response must not do: including tree nuts in a recipe for a child with a nut allergy is an active failure, not just a missed inclusion.
Negative-weight criteria are particularly useful in domains where safety or compliance matter. They let you penalize specific failure modes explicitly — which produces more reliable model behavior than simply withholding reward from a response that gets most things right but one thing dangerously wrong.
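One way to encode all three tiers is a signed weight per criterion plus a hard gate for mandatory ones. A minimal sketch, assuming a simple additive scheme; the weight values, the zero-score gate, and every name here are invented for illustration, not a standard scoring formula.

```python
# Hedged sketch of weighted rubric scoring. All numbers and names
# below are illustrative choices, not an established scheme.
from dataclasses import dataclass

@dataclass
class WeightedCriterion:
    description: str
    met: bool              # did the response satisfy (or commit) this?
    weight: float          # negative weight = penalized failure mode
    mandatory: bool = False

def score(criteria: list[WeightedCriterion]) -> float:
    # A missed mandatory criterion zeroes the response outright:
    # a cookie recipe without flour isn't a recipe at all.
    if any(c.mandatory and not c.met for c in criteria):
        return 0.0
    return sum(c.weight for c in criteria if c.met)

recipe_rubric = [
    WeightedCriterion("Includes flour", met=True, weight=3.0, mandatory=True),
    WeightedCriterion("Includes vanilla extract", met=True, weight=1.0),
    # Negative weight: "met" here means the failure occurred.
    WeightedCriterion("Includes tree nuts despite the stated allergy",
                      met=False, weight=-5.0),
]

print(score(recipe_rubric))  # 4.0: both positives met, no penalty incurred
```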
Strong rubrics don't just define what an ideal response includes. They give the model the information it needs to tell the difference between a response that meets the brief and one that exceeds it.
That distinction is how rubrics do more than grade outputs. They teach models to aim higher.
Where rubrics fit in AI training
AI models learn from two main sources of feedback.
For tasks with objectively correct answers — solve this equation, answer this factual question — the model is rewarded when it gets the answer right. The signal is clean and direct.
For highly subjective tasks — write me a poem, suggest a creative direction — human preferences shape the model's outputs over time.
Rubrics address the gap in between. They're designed for tasks that are partly open-ended but still have better and worse responses — and where "better" can be articulated in specific, checkable terms. Complex explanations, medical diagnoses, structured arguments, legal analysis: these are the tasks rubrics are built for.
The goal for any rubric is to be MECE: mutually exclusive and collectively exhaustive. Criteria shouldn't overlap, and taken together they should fully define what an ideal response looks like. The three rules above — atomicity, specificity, self-containment — are how you get there in practice.
Why this skill is worth developing
Rubric writing sits at the intersection of domain expertise and clear thinking. It requires knowing what a good response looks like in a given field — and being able to articulate that judgment precisely enough that someone else could apply it without your background.
It's genuinely learnable. The principles are clear. The practice builds quickly. And the impact is measurable: well-written rubrics produce better-trained models, which produce better outputs that eventually reach real people.
If you want to go deeper, the course below covers the full process — with examples, exercises, and the edge cases that matter most in practice.
Learn to write rubrics that actually work
A structured course covering the full rubric-writing process — MECE principles, weights, dimensions — with worked examples at every stage.
in collaboration with Scale AI