How this works
DTS reads a prompt and rates how hard it is to answer well, on
four independent axes. Each axis catches a different way prompts
go wrong — so a prompt can be easy on three and brutal on the
fourth, and that’s the one that decides which model you need.
Reasoning load
How many logical steps the model must work through to get the
answer right. A lookup question (“what’s the capital of
France?”) is low. A multi-step proof, a nested decision tree, or
“compare these 5 options against 3 criteria” is high.
High-reasoning prompts are where smaller models fail silently:
they answer fluently but skip steps. The fluent surface fools you;
the missing step is the bug.
Format load
How rigid the output structure has to be. “Write a
paragraph” is low. “Return JSON with exactly these 7 fields,
no markdown, ISO-8601 dates, never use the letter e” is
high.
High-format prompts are where models follow the spirit of the
instructions but violate the letter. That kind of failure breaks
downstream code without looking wrong to a human.
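To make "the letter" concrete, here is a minimal sketch of a check that downstream code might run on a model reply. It is not part of DTS; the function name and the three-field spec are hypothetical, chosen only to keep the example short.

```python
import json
from datetime import date

# Hypothetical field set, for illustration only.
EXPECTED_FIELDS = {"id", "name", "created"}

def format_violations(raw: str) -> list[str]:
    """Return the letter-of-the-spec violations found in a model reply."""
    if raw.strip().startswith("```"):
        return ["wrapped in a markdown fence"]
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top level is not an object"]
    problems = []
    if set(data) != EXPECTED_FIELDS:
        problems.append("field set differs from the spec")
    try:
        # Raises ValueError for anything that is not an ISO-8601 date.
        date.fromisoformat(str(data.get("created", "")))
    except ValueError:
        problems.append("created is not an ISO-8601 date")
    return problems

good = '{"id": 1, "name": "Ada", "created": "2024-03-01"}'
close = '{"id": 1, "name": "Ada", "created": "March 1, 2024"}'
```

The `close` reply reads fine to a person, yet it fails the date check. That is the spirit-versus-letter gap in miniature.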
Parallel demands
How much simultaneous bookkeeping the model has to do.
“Summarise this paragraph” is low. “For each of these 12
customers, apply rules A through D and flag conflicts” is high.
The model has to carry every item plus every rule in working
memory at the same time. Past a certain count, things start
dropping silently — usually the items in the middle of the list.
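Because dropped items produce no error, one way to catch them is to compare what went in with what came back. The sketch below assumes each item carries an ID that the reply echoes; the function name and the simulated reply are hypothetical.

```python
def dropped_items(sent_ids: list[str], returned_ids: list[str]) -> list[str]:
    """Return the input IDs missing from the reply, in input order."""
    returned = set(returned_ids)
    return [item for item in sent_ids if item not in returned]

# 12 hypothetical customers, with two from the middle missing in the reply.
customers = [f"c{n}" for n in range(1, 13)]
reply_ids = [c for c in customers if c not in {"c6", "c7"}]
missing = dropped_items(customers, reply_ids)
```

Here `missing` holds the two IDs that never came back, which a quick read of a ten-item reply would be unlikely to reveal.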
Knowledge dependency
How much specialised knowledge the answer relies on. Common-domain
prompts are low. Specialised case law, recent papers, internal
company terminology, or fast-moving fields (regulation, infosec,
current events) are high.
High-knowledge prompts are where retrieval-augmented generation
beats raw model size — feeding the right document in matters more
than picking a bigger model.
How to read the score
Each axis runs from 0 to 100. A score above 50 means that axis
is contributing real difficulty to the prompt. The recommended
model is picked so it can comfortably handle every axis the
prompt scored high on, not just the average.
That’s why a prompt with a moderate average can still get
routed to a stronger model: one axis that’s out of range
decides the pick, even when the others are easy.
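The difference between routing on the hardest axis and routing on the average can be sketched as follows. The tier names and ceilings are assumptions made for illustration, not the actual DTS model list or thresholds.

```python
from statistics import mean

# Hypothetical tiers: (name, highest per-axis score handled comfortably).
TIERS = [("small", 50), ("medium", 75), ("large", 100)]

def pick_model(scores: dict[str, int]) -> str:
    """Pick the first tier whose ceiling covers the hardest axis."""
    hardest = max(scores.values())
    for name, ceiling in TIERS:
        if hardest <= ceiling:
            return name
    return TIERS[-1][0]  # fall back to the strongest tier

# Easy on three axes, brutal on format.
scores = {"reasoning": 20, "format": 90, "parallel": 15, "knowledge": 25}
average = mean(scores.values())   # 37.5, which looks like an easy prompt
model = pick_model(scores)        # "large", because format is at 90
```

Routing on `average` would have sent this prompt to the smallest tier. Routing on the maximum sends it to the tier that can handle the format axis.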