Why DTS — reading the model's own language

01 — The problem

What’s wrong with prompting today

There is an enormous appetite to tinker with prompts — to coax a better answer out of the model. But the changes don’t translate the way we expect. We treat the model as an intelligence that understands and follows instructions, and we’re caught off guard when it doesn’t.

It’s a little like talking to another person. We say something, we expect a response, and we’re sometimes disappointed. But when a human doesn’t do what we asked, we can usually make sense of it — because we’re human too, and because the person tends to tell us why.

Models aren’t like that. They ignore an instruction, or quietly reinterpret it, and we’re left with no idea what happened. We have no mental model for how a model reads what we wrote. So we iterate — tweak, re-run, tweak again — until it feels broadly right. It’s slow, it’s effortful, and the outcome is never guaranteed.

The question we set out to answer: what if you could actually see how a model interprets your instructions — before you ship the prompt, and without a thousand re-runs?

02 — The approach

Reading the model’s own language

When a prompt passes through a model, it doesn’t just produce an answer at the end. It leaves a trail. At every one of the dozens of layers in the network, there are measurable traces of how hard the model is working, where it’s uncertain, and what it’s juggling. Most of that signal goes completely unused.

It’s the model’s own language — and we’ve only recently started learning to read it.

These are mechanistic signals: quantities that can be observed inside the network as a prompt makes its way through it. Reading them is like learning a different language altogether.

We host a model, send your prompt through it, and read these traces. Crucially, what we learn about this prompt-model interaction generalises: it lets us reason about how a broad range of models in use today would likely behave on the same prompt.

03 — What it has shown us

Four things we’ve learned

Prompt difficulty can be measured on four axes

A prompt isn’t just “easy” or “hard.” It’s demanding in specific ways. We decompose that demand along four axes:

Reasoning: depth of thinking
Format: output constraints to satisfy
Parallel: how much to hold in mind at once
Knowledge: how rare the facts it needs

The same axes reveal which model will succeed

Demand is only half the story. We put each model’s ability on the same four rulers — so “what the prompt needs” and “what the model can give” are finally in the same units. A model is a good fit when its ability clears the demand on every axis. Pick the cheapest one that does.

Take the reasoning axis as an example. The prompt’s reasoning demand (the needle) sits above the ability of the small and mid models but below that of the frontier model, so the small and mid picks are a coin-flip, and the cheapest reliable choice is clear. Do this on every axis at once and you get a recommendation, not a guess.
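As a rough illustration of that fit rule, the sketch below scores a prompt’s demand and each model’s ability on the same four axes, then picks the cheapest model whose ability clears the demand on every axis. The model names, prices, and scores are invented for the example, and `Model`, `clears`, and `cheapest_fit` are hypothetical names, not part of any published DTS interface.

```python
from dataclasses import dataclass
from typing import Optional

AXES = ("reasoning", "format", "parallel", "knowledge")


@dataclass
class Model:
    name: str
    cost: float                 # hypothetical relative price per call
    ability: dict[str, float]   # hypothetical score per axis, same scale as demand


def clears(model: Model, demand: dict[str, float]) -> bool:
    """True when the model's ability meets or exceeds the demand on every axis."""
    return all(model.ability[axis] >= demand[axis] for axis in AXES)


def cheapest_fit(models: list[Model], demand: dict[str, float]) -> Optional[Model]:
    """Return the lowest-cost model that clears the demand, or None if none does."""
    fits = [m for m in models if clears(m, demand)]
    return min(fits, key=lambda m: m.cost) if fits else None


# Illustrative numbers only: a prompt that is heavy on reasoning, light elsewhere.
demand = {"reasoning": 0.7, "format": 0.3, "parallel": 0.4, "knowledge": 0.2}

models = [
    Model("small", 1.0, {"reasoning": 0.4, "format": 0.8, "parallel": 0.5, "knowledge": 0.5}),
    Model("mid", 3.0, {"reasoning": 0.6, "format": 0.8, "parallel": 0.6, "knowledge": 0.6}),
    Model("frontier", 10.0, {"reasoning": 0.9, "format": 0.9, "parallel": 0.9, "knowledge": 0.9}),
]

choice = cheapest_fit(models, demand)
print(choice.name if choice else "no model clears the demand")  # frontier
```

With these made-up scores, the small and mid models fall short on reasoning alone, so the frontier model is the cheapest one that clears all four axes. Lower the reasoning demand and the same rule picks the small model instead.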

They also show how to make success more likely

Because we can see what’s stressing the model, we can do something about it. Two levers: change the prompt to relieve the stress, or move to a model that handles it better. The right lever depends on the axis.

Some demands are capability ceilings — deep reasoning, or holding many things in mind at once. A weaker model genuinely can’t do them; the honest move is to escalate. But others are satisfiable — a tight format is a matter of instruction, not horsepower (a cheap model follows a strict JSON schema as well as a frontier one), and a rare fact is a retrieval problem (supply the fact and even a small model gets it right). For those, the fix is a better prompt, not a bigger bill.
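The split between the two levers can be sketched in a few lines. This is a minimal illustration that assumes demand and ability are already scored per axis on a shared scale; the function name `choose_levers`, the axis groupings as Python sets, and all numbers are hypothetical and only mirror the distinction drawn above.

```python
# Axes treated as capability ceilings: a weaker model cannot be prompted past them.
CEILING_AXES = {"reasoning", "parallel"}
# Axes treated as satisfiable: a better prompt can close the gap.
SATISFIABLE_AXES = {"format", "knowledge"}


def choose_levers(demand: dict[str, float], ability: dict[str, float]) -> dict[str, str]:
    """For each axis where ability falls short of demand, name the lever to pull."""
    levers = {}
    for axis, needed in demand.items():
        if ability[axis] >= needed:
            continue  # no stress on this axis, nothing to fix
        if axis in CEILING_AXES:
            levers[axis] = "escalate model"
        else:
            levers[axis] = "revise prompt"
    return levers


# Illustrative numbers only: a small model facing a demanding prompt.
demand = {"reasoning": 0.7, "format": 0.9, "parallel": 0.4, "knowledge": 0.8}
small_model = {"reasoning": 0.4, "format": 0.8, "parallel": 0.5, "knowledge": 0.5}

print(choose_levers(demand, small_model))
# {'reasoning': 'escalate model', 'format': 'revise prompt', 'knowledge': 'revise prompt'}
```

In this made-up case the format and knowledge gaps call for prompt changes (a stricter schema instruction, the missing fact supplied in context), while the reasoning gap is the one that justifies paying for a stronger model.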

And where this method stops

We’d rather tell you the edges than oversell the middle. Mechanistic signals are one lens on prompt-model interaction: they’re blind to things they structurally can’t see from a single forward pass — whether a fact is actually true out in the world, how a multi-step agent will behave once tools and state enter the loop, and the run-to-run variance of the model itself. They are, in our view, the best available way to read these interactions today — but on their own they account for something like 50–60% of the behaviour, not all of it.

~10×: more cost-efficient routing than a strong learned baseline, at matched quality
0.84: how well our reasoning-demand read tracks true difficulty (vs 0.39 for length)
Tied: routing quality is statistically indistinguishable from much pricier baselines; the win is the price

04 — Distribution

How we’re putting this in your hands

We think this insight is valuable to anyone working with prompts and models — which is most builders and enterprises today. We want to keep investing in the research, and at the same time get the value in front of as many people as possible.

So we’re launching two things:

Preliminary research findings behind these signals — shared openly for people to poke at.
A website to test your own prompts — see their demand fingerprint, the likely model fit, and where the stress is. And, just as importantly, to tell us what’s useful and what isn’t.

Further out, we’ll expose this through an MCP interface and start actively suggesting changes that raise a prompt’s likelihood of success — closing the loop from diagnosis to fix.

We’ve done some of the work of explaining how these models actually read what you write. We hope it helps you get a better outcome from yours.