Language model evaluation: avoiding obsolescence

Ever spent weeks building the perfect LLM benchmark only to watch it crumble within a few months?

Clean problems, elegant difficulty curves, proper statistical controls. New model drops. Perfect scores across the board. Your tests got trained on. Weeks of work, completely worthless.

So you pivot. Make the tests harder, more complex, more creative. Models improve with time. Now everyone clusters at 90-95%. 8B models are defeating it. Your benchmark has become a participation trophy. This happened to my previous evaluation, *Can-Ai-Code*, twice.

Fine, you say. Random test generation it is! No more memorization, no more clustering. But congratulations, you’ve just unlocked new nightmares: Did you accidentally make your “hard” tests easier than your “easy” ones? Is your random number generator secretly biased? How do you even validate that hundreds of thousands of randomly generated problems “make sense”?

You solve that with clever statistical rigor, only to discover configuration explosion hell. You’d like to test different prompting templates and sampling parameters, but that’s 5 templates times 5 samplers times 50 million tokens (a conservative estimate) equals 1.25 billion tokens per model. Your GPUs scream in horror.
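To make that arithmetic concrete, here’s the back-of-the-envelope calculation (the per-sweep token count is the same conservative estimate as above):

```python
templates = 5                    # prompting templates to compare
samplers = 5                     # sampling-parameter configurations
tokens_per_sweep = 50_000_000    # conservative estimate for one full test sweep
tokens_per_model = templates * samplers * tokens_per_sweep
print(f"{tokens_per_model:,} tokens per model")  # 1,250,000,000
```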

You’re now burning millions of tokens achieving 0.005 confidence intervals on trivial problems while critical hard points sit at 0.02 intervals begging for attention like abandoned puppies. Dynamic sampling helps – generate more tests for uncertain points, fewer for confident ones – but how do you avoid p-hacking yourself?
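To illustrate the dynamic-sampling idea, here’s a minimal sketch of uncertainty-weighted allocation: estimate a confidence-interval width for each difficulty point (Wilson score below) and spend the next batch of samples where the interval is widest. The function names and the Wilson choice are my own illustration, not ReasonScape’s actual scheduler:

```python
import math

def wilson_halfwidth(successes: int, n: int, z: float = 1.96) -> float:
    """Half-width of the Wilson score interval for a binomial proportion."""
    if n == 0:
        return 1.0  # no data yet: maximum uncertainty
    p = successes / n
    denom = 1 + z**2 / n
    return (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))

def next_batch(points: dict, budget: int) -> dict:
    """Split a sampling budget across difficulty points, weighted by CI width."""
    widths = {k: wilson_halfwidth(*v) for k, v in points.items()}
    total = sum(widths.values())
    return {k: round(budget * w / total) for k, w in widths.items()}

# (successes, samples) observed so far at each difficulty point
points = {"easy": (950, 1000), "medium": (300, 400), "hard": (20, 50)}
print(next_batch(points, budget=500))  # most of the new budget goes to "hard"
```

The p-hacking part is mostly about fixing the stopping rule up front (e.g. a target interval width) instead of sampling until the numbers look good.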

That’s when the guessing realization hits. This binary classifier task scored 60%! Amazing! Wait… after correcting for chance, that’s a true score of just 20%. Your “75% accurate” multiple-choice task is actually 50% accurate once you subtract lucky guesses. Everything is statistical lies. How are you supposed to compare models across boolean, multiple-choice and write-in answer tasks that have fundamentally different “guess rates”?
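The numbers above fall out of the standard correction-for-guessing formula, which rescales observed accuracy so that random guessing maps to 0 and perfect accuracy maps to 1. I’m not claiming this is ReasonScape’s exact adjustment, but it reproduces the examples and makes tasks with different guess rates comparable:

```python
def chance_corrected(observed: float, guess_rate: float) -> float:
    """Rescale accuracy so random guessing scores 0.0 and perfection scores 1.0."""
    return (observed - guess_rate) / (1.0 - guess_rate)

print(chance_corrected(0.60, guess_rate=0.50))  # binary task      -> 0.20
print(chance_corrected(0.75, guess_rate=0.50))  # two-way choice   -> 0.50
print(chance_corrected(0.80, guess_rate=0.25))  # four-way choice  -> ~0.73
print(chance_corrected(0.80, guess_rate=0.00))  # write-in answer  -> 0.80
```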

Finally, truncation waste arrives to complete your suffering: a model given a tough task hits its context limit, burns 8,000 tokens, and returns a loop of gibberish. You sample 10x more to maintain statistical power. That’s 80K tokens spent on a single data point with no useful answers. You’re overflowing your KV caches while the confidence intervals laugh at you.

After drowning in this cascade of pain for months, I did what any reasonable person would do: I built an evaluation system to solve every single practical problem I encountered.

# ReasonScape treats language models as information processing systems, not text completion black boxes.

It generates infinite, parametric, tokenization-aware test variations, applies statistical corrections for guessing, dynamically allocates sampling based on uncertainty, handles truncations intelligently, and visualizes the results as both enhanced leaderboards and explorable 3D cognitive landscapes.
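For intuition, here’s what seeded, parametric test generation looks like in general: a toy arithmetic task where every (difficulty, seed) pair deterministically regenerates the same problem, so nothing has to be stored or memorized. The names and structure are purely illustrative and are not ReasonScape’s API:

```python
import random
from dataclasses import dataclass

@dataclass
class Problem:
    prompt: str
    answer: int

def generate_problem(difficulty: int, seed: int) -> Problem:
    """Produce one reproducible problem; difficulty controls operand count and size."""
    rng = random.Random(f"arithmetic-{difficulty}-{seed}")  # isolated, deterministic RNG
    terms = [rng.randint(1, 10 ** difficulty) for _ in range(difficulty + 1)]
    return Problem(prompt=" + ".join(map(str, terms)) + " =", answer=sum(terms))

# The same parameters always regenerate the same test case:
assert generate_problem(2, seed=42) == generate_problem(2, seed=42)
print(generate_problem(2, seed=42).prompt)
```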

[C2: All Models x All Tasks surface comparison. Green spheres indicate high success; red squares indicate high truncation.](https://preview.redd.it/vsoidu4e4ggf1.png?width=1280&format=png&auto=webp&s=d29809860b081384d998a428bc75faeba16cedc1)

The initial C2 dataset represents ~1 billion tokens across 9 models, revealing exactly where, how and why reasoning breaks down across 4 task domains. The interactive leaderboard shows not just scores but confidence intervals, token usage and failure modes. The explorer (links at the bottom of the post) lets you navigate difficulty manifolds like some kind of LLM reasoning archaeologist, digging into spectral analysis and completion-token patterns. Make sure you’re on a PC – this application has too much going on to be mobile-friendly!

[C2 Explorer](https://preview.redd.it/4ahuh87m4ggf1.png?width=1233&format=png&auto=webp&s=8f6e962cdc029ce01dbca46346ec3fda47a06d7d)

I built the system with progressive evaluation in mind so you can start with rapid exploration then scale to deep precision. Everything caches, everything reproduces, everything scales. ReasonScape isn’t just another benchmark. It’s a complete methodology: toolkit, evaluation framework, and growing dataset family rolled into one.

[C2 Leaderboard (static snapshot – the interactive version is much nicer!)](https://preview.redd.it/rn7r2k3t4ggf1.png?width=1198&format=png&auto=webp&s=52d054e40f6f292b07b9d638d82244e8f302ce1d)

The ReasonScape experiments and the resulting datasets will grow, expand and evolve – when scores get too high, we will shift the difficulty grids to make the tests harder and move on to C3. I have **8 additional tasks** to bring up, and lots more reasoning models I’d like to evaluate, but my 2x RTX 3090s only have so much to give.

Thanks for reading this far! <3

Links:

* [ReasonScape Homepage](https://reasonscape.com/)

* [ReasonScape Leaderboard – C2](https://reasonscape.com/c2/leaderboard)

* [ReasonScape Explorer – C2](https://reasonscape.com/c2/explorer) (note: PC required, not mobile-friendly)

* [ReasonScape GitHub](https://github.com/the-crypt-keeper/reasonscape)

* [ReasonScape System Architecture](https://github.com/the-crypt-keeper/reasonscape?tab=readme-ov-file#system-architecture)
