- premise: Charlie is blue. Charlie is cold. Dave is quiet. If someone is big and quiet then they are round. Big, rough people are round.
- question: Charlie is cold.
- answer: True
Methodology
We measure reasoning by flip rate, a metric built on paired items that rules out the surface-feature heuristics most models use to game conventional benchmarks. Each test consists of two items: one canonical, one perturbed. The pair is the unit of measurement, not the item. A model that gets one item right and the other wrong has not reasoned; it has matched. Flip rate measures how often that happens.
Conventional accuracy scores each item on its own, so it rewards surface-feature heuristics. Flip rate scores the pair, so it punishes them. The result is a harder and more honest signal, and one that any lab can apply to its own model.
Canonical answer: True
Perturbed answer: False
One fact has been flipped: "Charlie is cold" becomes "Charlie is not cold", and the answer to the identical question must flip with it. A reasoner tracks the change through the rules; a surface-matcher gives the same answer to both, and fails. The pair above is a real record from the RuleTaker rail of the eval kit; flip rate summarizes the outcome of this test across the whole benchmark.
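The pair-level scoring described under Methodology can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the eval kit's own code: the `PairResult` record and the three function names are hypothetical, and flip rate is read here as the share of pairs where exactly one of the two items is answered correctly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    """Outcome for one canonical/perturbed pair (hypothetical record shape)."""
    canonical_correct: bool
    perturbed_correct: bool


def flip_rate(results):
    """Share of pairs where exactly one item is right (assumed definition)."""
    if not results:
        raise ValueError("need at least one pair")
    split = sum(r.canonical_correct != r.perturbed_correct for r in results)
    return split / len(results)


def pair_accuracy(results):
    """Share of pairs where both items are right."""
    if not results:
        raise ValueError("need at least one pair")
    both = sum(r.canonical_correct and r.perturbed_correct for r in results)
    return both / len(results)


def item_accuracy(results):
    """Conventional accuracy: every item counted on its own."""
    if not results:
        raise ValueError("need at least one pair")
    correct = sum(r.canonical_correct + r.perturbed_correct for r in results)
    return correct / (2 * len(results))


# Made-up outcomes: three pairs split (canonical right, perturbed wrong),
# one pair answered correctly on both sides.
example = [PairResult(True, False)] * 3 + [PairResult(True, True)]

print(item_accuracy(example))  # 0.625
print(flip_rate(example))      # 0.75
print(pair_accuracy(example))  # 0.25
```

On these made-up outcomes, item-level accuracy looks respectable at 0.625, while three of the four pairs split. That gap is what scoring the pair rather than the item is meant to expose.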
Surfaces
Each benchmark is taken in its original form and extended with paired perturbation items. Released open-source. Other labs are invited to apply their own models.
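As a rough sketch of what extending an item with a paired perturbation could look like, the snippet below negates one fact in the RuleTaker record shown earlier and flips the expected answer. The `Item` shape and the `negate_fact` helper are hypothetical names introduced for illustration, not part of the released kit. The label flip is only valid when the swapped fact is the one that decides the answer, which is assumed here and not checked.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Item:
    """One benchmark item (hypothetical shape: premise, question, answer)."""
    premise: str
    question: str
    answer: bool


def negate_fact(item, fact, negated):
    """Return the perturbed twin of `item`.

    The single occurrence of `fact` in the premise is swapped for `negated`,
    the question is left untouched, and the expected answer is flipped.
    Assumes the swapped fact alone decides the answer.
    """
    if item.premise.count(fact) != 1:
        raise ValueError("fact must occur exactly once in the premise")
    return replace(
        item,
        premise=item.premise.replace(fact, negated),
        answer=not item.answer,
    )


canonical = Item(
    premise=(
        "Charlie is blue. Charlie is cold. Dave is quiet. "
        "If someone is big and quiet then they are round. "
        "Big, rough people are round."
    ),
    question="Charlie is cold.",
    answer=True,
)

perturbed = negate_fact(canonical, "Charlie is cold.", "Charlie is not cold.")

print(perturbed.premise)
print(perturbed.question, perturbed.answer)
```

The two items differ by a single word, so any heuristic that keys on surface features sees nearly the same input, yet the correct answers are opposite.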
Available at launch