What AI safety evaluation looks like

When you ask an AI how a particular chemical reaction works, the model has to decide how much to share. A thorough answer might help you finish a research project. The same answer with a few more details could help someone cause harm. Someone has to check whether the model made the right call, and that checking process is what safety evaluation is.
How AI models get tested on sensitive questions
Safety policies cover a range of sensitive content, from instructions that could enable violence to medical advice that crosses from general education into specific guidance. Each category has defined boundaries, but the cases that require the most judgment sit in between, where the same content could be helpful to one person and harmful to another.
This is what safety researchers call the dual-use problem: the same information can serve a legitimate purpose or a harmful one, depending on context. It comes up across almost every field where detailed knowledge exists, from cybersecurity to biology to medicine. A model that refuses to explain basic chemistry to a student is overcorrecting. One that provides synthesis instructions for something lethal has shared too much. The people who evaluate these responses are typically subject-matter experts who can tell the difference between a response that educates and one that enables, because that distinction depends on knowing the field.
How the calibration works
The framework classifies questions into three risk levels (benign, borderline, and extreme), and models need to respond differently to each.
For benign questions, the model should engage fully. A question about the history of explosives in mining is legitimate, and a model that hedges or refuses has misjudged the risk. Borderline cases call for partial engagement: sharing general information while stopping short of specifics that could cause harm. For extreme cases, the model should refuse, and even the refusal has two forms: a soft refusal redirects toward a safe alternative ("I can't help with that, but here's a related topic I can address"), while a hard refusal declines and explains why. Getting this calibration right is what the evaluation process is designed to measure.
Why this evaluation depends on people
AI safety evaluation requires human judgment because the boundary between helpful and harmful shifts with context, and automated systems can't make that call reliably. A chemistry question from a student and the same question from someone with dangerous intent look identical to a filter. A person with expertise in the subject can read the model's response and assess whether it shared the right amount for the right situation.
That combination of domain knowledge and structured evaluation criteria is what makes safety evaluation work. The people who do it well tend to have backgrounds in fields where information carries real consequences: medicine, chemistry, law, cybersecurity. If that describes you, Outlier has safety projects where that expertise is put to use.
Share this article on


