Methodology

The perturbation paradigm.

We measure reasoning by flip rate, a paired-item test that precludes the surface-feature heuristics most models use to game conventional benchmarks. Each test consists of two items: one canonical, one perturbed. The pair is the unit of measurement, not the item. A model that gets one item right and the other wrong has not reasoned; it has matched. Flip rate measures how often a model avoids that failure and answers both items of a pair correctly.

Conventional accuracy rewards surface-feature heuristics. Flip rate punishes them. The result is a far harder, far more honest signal — and one that any lab can apply to its own model.

In the example pair, one fact has been flipped: "Charlie is cold" becomes "Charlie is not cold", and the answer to the identical question must flip with it. A reasoner tracks the change through the rules; a surface-matcher gives the same answer to both, and fails. The pair in fig. i is a real record from the RuleTaker rail of the eval kit; flip rate measures how often models pass this test across the whole benchmark.

fig. i — method, with a verbatim pair from the RuleTaker rail
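
The scoring idea can be sketched in a few lines of Python. This is a minimal illustration, not the eval kit's implementation: the `Item` and `Pair` types, the function names, and the two toy pairs are all hypothetical, and the sketch assumes a pair passes only when both of its items are answered correctly. The toy pairs are invented for the example and are not records from the RuleTaker rail.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Item:
    fact: str       # the single fact that differs between the two items
    rule: str       # shared rule text
    question: str   # identical in both items of a pair
    gold: bool      # expected answer


@dataclass(frozen=True)
class Pair:
    canonical: Item
    perturbed: Item


Model = Callable[[Item], bool]


def item_accuracy(model: Model, pairs: Iterable[Pair]) -> float:
    """Conventional accuracy: every item is scored on its own."""
    items = [it for p in pairs for it in (p.canonical, p.perturbed)]
    if not items:
        raise ValueError("no items to score")
    return sum(model(it) == it.gold for it in items) / len(items)


def flip_rate(model: Model, pairs: Iterable[Pair]) -> float:
    """Pair-level score (assumed definition): a pair passes only if
    both the canonical and the perturbed item are answered correctly."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no pairs to score")
    passed = sum(
        model(p.canonical) == p.canonical.gold
        and model(p.perturbed) == p.perturbed.gold
        for p in pairs
    )
    return passed / len(pairs)


def make_pair(name: str, trait: str, effect: str) -> Pair:
    """Build an invented toy pair in which one fact is negated."""
    rule = (f"{trait.capitalize()} people are {effect}, and people who "
            f"are not {trait} are not {effect}.")
    question = f"{name} is {effect}."
    return Pair(
        canonical=Item(f"{name} is {trait}.", rule, question, True),
        perturbed=Item(f"{name} is not {trait}.", rule, question, False),
    )


PAIRS = [make_pair("Charlie", "cold", "quiet"),
         make_pair("Anne", "kind", "happy")]


def always_true(item: Item) -> bool:
    """A surface-matcher: gives the same answer regardless of the fact."""
    return True


def fact_tracker(item: Item) -> bool:
    """A toy stand-in for a reasoner: follows the negated fact."""
    return " not " not in item.fact


for name, model in [("always_true", always_true),
                    ("fact_tracker", fact_tracker)]:
    print(name, item_accuracy(model, PAIRS), flip_rate(model, PAIRS))
```

On these toy pairs the constant-answer model scores 0.5 on item accuracy but 0.0 on the pair-level score, while the fact-following model scores 1.0 on both. That gap between the two metrics is what the paired design is meant to expose.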

Surfaces

Five public benchmarks, paired.

Each benchmark is taken in its original form and extended with paired perturbation items. Released open-source. Other labs are invited to apply their own models.

Available at launch

fig. ii — surfaces

Colophon

Company

Sophontic, Inc.

Structure

Delaware C-corporation

Founded

2026

Contact

contact@sophontic.ai