Methodology

The perturbation paradigm.

We measure reasoning by flip rate, a paired-item metric designed to defeat the surface-feature heuristics that models can use to game conventional evaluations. Each test consists of two items: one canonical, one perturbed. The pair is the unit of measurement, not the item. A model that gets one item right and the other wrong has not reasoned; it has matched. Flip rate measures how often that happens.

Conventional accuracy rewards surface-feature heuristics. Flip rate punishes them. The result is a harder, more honest signal, and one that any lab can apply to its own models.
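The contrast between per-item accuracy and flip rate can be sketched in a few lines of Python. This is a minimal illustration, not the eval kit's implementation: the `PairResult` record and both function names are hypothetical, and the sketch assumes flip rate is taken over all pairs as the fraction in which exactly one of the two items is answered correctly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    """Outcome of one canonical/perturbed pair (hypothetical record shape)."""
    canonical_correct: bool
    perturbed_correct: bool


def flip_rate(results: list[PairResult]) -> float:
    """Fraction of pairs with exactly one correct item (assumed definition)."""
    if not results:
        raise ValueError("flip rate is undefined for an empty set of pairs")
    mismatched = sum(r.canonical_correct != r.perturbed_correct for r in results)
    return mismatched / len(results)


def item_accuracy(results: list[PairResult]) -> float:
    """Conventional accuracy: every item scored on its own, pairing ignored."""
    if not results:
        raise ValueError("accuracy is undefined for an empty set of pairs")
    correct = sum(r.canonical_correct + r.perturbed_correct for r in results)
    return correct / (2 * len(results))


# Made-up outcomes for four pairs, for illustration only.
results = [
    PairResult(True, True),
    PairResult(True, False),
    PairResult(False, True),
    PairResult(True, True),
]
print(item_accuracy(results))  # 0.75
print(flip_rate(results))      # 0.5
```

In this toy run the model looks strong item by item (0.75), yet half of the pairs are answered inconsistently, which is the gap the paired metric is meant to expose.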

[Interactive figure i. Difficulty slider, shown at 3 / 10 (1 hop, 2 distractors). Panel i.a, canonical: premise and question. Panel i.b, perturbed: premise and question.]

Pairs above are verbatim records from the ConvergeMini rail of the eval kit. The slider scales two axes together: hop count (the number of inference steps from premise to answer) and distractor density (the number of facts irrelevant to the conclusion). The single fact flipped between canonical and perturbed (marked with an indigo bar) is the load-bearing change. The correct answer flips with it: a reasoner tracks the change through the chain; a surface-matcher does not.
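To make the construction concrete, here is a toy pair with one hop and two distractors. It is a sketch under stated assumptions, not a ConvergeMini record: the `Item` shape, the `answer` and `perturb` helpers, and the valve example are all invented for illustration, and each rule is simplified so that the effect takes the same truth value as its cause.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One test item (hypothetical shape, not the eval kit's schema)."""
    facts: dict[str, bool]        # stated premises, distractors included
    rules: list[tuple[str, str]]  # (cause, effect); one rule is one hop
    query: str                    # the fact the question asks about


def answer(item: Item) -> bool:
    """Forward-chain through the rules until nothing new is derived."""
    known = dict(item.facts)
    changed = True
    while changed:
        changed = False
        for cause, effect in item.rules:
            if cause in known and effect not in known:
                # Simplifying assumption: effect mirrors its cause.
                known[effect] = known[cause]
                changed = True
    return known[item.query]


def perturb(item: Item, fact: str) -> Item:
    """Flip a single fact and leave everything else untouched."""
    flipped = dict(item.facts)
    flipped[fact] = not flipped[fact]
    return Item(flipped, list(item.rules), item.query)


canonical = Item(
    facts={
        "valve_open": True,     # load-bearing fact
        "tank_is_blue": True,   # distractor
        "gauge_is_new": False,  # distractor
    },
    rules=[("valve_open", "tank_drains")],  # 1 hop
    query="tank_drains",
)
perturbed = perturb(canonical, "valve_open")

print(answer(canonical), answer(perturbed))  # True False
```

Flipping a distractor instead (for example `perturb(canonical, "tank_is_blue")`) leaves the answer unchanged, which is what separates a load-bearing change from an irrelevant one in this sketch.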

fig. i — method, with live pairs from the ConvergeMini rail

Flip-rate rule

The pair is the unit of measurement.

A model must answer both sides coherently. One correct item is not enough when the load-bearing fact has moved.

Surfaces

Evaluation rails, paired.

The release surface is organized by the measurement grammar: canonical records, load-bearing perturbations, and source-level provenance. Some rails are public transfer tasks; others are calibrated anchors. The rail set can evolve without changing the standard.

Canonical record: The original item, kept as the baseline measurement surface.
Perturbed record: A minimal load-bearing change that should force the answer to move.
Flip rate: The paired metric: both sides must be answered coherently.
Provenance: Each release rail carries source and construction metadata for audit.
Transfer: The same method applies across public tasks and calibrated anchors.

Available at launch

fig. ii — surfaces

Contact

First contact.

For research, capital, and deployment conversations at the frontier of reason.

Colophon

Company

Sophontic, Inc.

Structure

Delaware C-corporation

Founded

2026
