Fixture reference
Our tests
BLXBench runs a fixed, versioned set of JSON fixtures. Each test has a category (domain), a difficulty level, a prompt, and an automatic scorer. Models are run against this suite; the leaderboard shows aggregate quality and cost from your local `results` data.
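As a rough illustration of the four parts named above, a fixture could be read like this. This is a minimal sketch, not the BLXBench schema: the field names `category`, `level`, `prompt`, and `scorer`, and the helper `parse_fixture`, are assumptions made for the example.

```python
import json
from dataclasses import dataclass

# Hypothetical field names; the real fixture schema may differ.
REQUIRED_FIELDS = ("category", "level", "prompt", "scorer")


@dataclass(frozen=True)
class Fixture:
    category: str  # domain, e.g. "security"
    level: str     # difficulty level
    prompt: str    # text sent to the model
    scorer: str    # identifier of the automatic scorer (assumed to be a string)


def parse_fixture(raw: str) -> Fixture:
    """Parse one JSON document into a Fixture, rejecting incomplete input."""
    data = json.loads(raw)
    missing = [key for key in REQUIRED_FIELDS if key not in data]
    if missing:
        raise ValueError(f"fixture is missing fields: {missing}")
    return Fixture(**{key: data[key] for key in REQUIRED_FIELDS})
```

Failing early on a missing field keeps incomplete fixtures out of a run instead of surfacing them later as scoring errors.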
A path to submit or share new test fixtures is planned; it is not available yet, and we will document it when the workflow is ready.
Every fixture declares a level. Use `easy`, `medium`, or `hard` in JSON; legacy `leicht` is accepted and treated the same as `easy` everywhere (blxbench filters, leaderboard, this site).
easy
Lighter tasks: typically shorter contexts or more constrained outputs. Same scoring pipeline, lower cognitive load for the model.
medium
Default difficulty: representative prompt length and evaluation strictness for the category.
hard
Demanding cases: stricter scorers, longer reasoning paths, or adversarial phrasing where applicable.
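The legacy alias can be handled with a small normalization step before filtering or grouping by level. The following is a sketch under assumptions: `normalize_level` is a hypothetical helper, not a BLXBench function, and matching is assumed to be case-sensitive.

```python
# Canonical levels as described above.
VALID_LEVELS = ("easy", "medium", "hard")

# Legacy spellings mapped to their canonical level.
LEGACY_ALIASES = {"leicht": "easy"}


def normalize_level(level: str) -> str:
    """Return the canonical level, treating legacy 'leicht' as 'easy'."""
    canonical = LEGACY_ALIASES.get(level, level)
    if canonical not in VALID_LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    return canonical


print(normalize_level("leicht"))  # easy
print(normalize_level("hard"))    # hard
```

Normalizing once, at load time, means filters and the leaderboard only ever need to compare against the three canonical values.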
Seven domains of fixtures, each with its own focus and scorers. Counts are from the current tree under `packages/benchmark-core/tests`.
coding_ui
Coding UI
Benchmark tasks from the local fixture set.
6 fixtures
debugging
Debugging
Bug fixes, edge conditions, and minimal patch accuracy.
60 fixtures
hallucination
Hallucination
Grounded answers under adversarial or missing-context prompts.
60 fixtures
reasoning
Reasoning
Arithmetic, symbolic steps, and structured problem solving.
61 fixtures
refactoring
Refactoring
Code transformation while preserving behavior and intent.
60 fixtures
security
Security
Secure code changes, vulnerability recognition, and safe defaults.
60 fixtures
speed
Speed
Latency-sensitive tasks where concise correct output matters.
65 fixtures
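Per-category counts like the ones above could be reproduced by walking the fixture tree. This sketch assumes a layout that the page does not confirm: one subdirectory per category under the tests root, with one fixture per `.json` file. `count_fixtures` is a hypothetical helper.

```python
from collections import Counter
from pathlib import Path


def count_fixtures(tests_root: Path) -> Counter:
    """Count .json files per top-level category directory (assumed layout)."""
    counts: Counter = Counter()
    for path in tests_root.rglob("*.json"):
        relative = path.relative_to(tests_root)
        # Skip JSON files that sit directly in the root, outside any category.
        if len(relative.parts) > 1:
            counts[relative.parts[0]] += 1
    return counts


if __name__ == "__main__":
    root = Path("packages/benchmark-core/tests")
    for category, total in sorted(count_fixtures(root).items()):
        print(f"{category}: {total} fixtures")
```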
Matrix
One example fixture per category and level (where defined).
| Category | easy | medium | hard |
|---|---|---|---|
| Coding UI | Analog Clock | Thunderstorm Over City | Breakout Game |
| Debugging | — | Fix Off By One Average | Bugfix |
| Hallucination | — | Not Stated Data Residency | Not Stated |
| Reasoning | — | Weighted Average | Multi Step |
| Refactoring | — | Extract And Guard | Cleanup |
| Security | — | SSRF URL Fetch | Review |
| Speed | — | Summary Incident Response | Summary |
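
A matrix like the one above can be derived by keeping the first fixture seen for each category and level. This is a sketch only: `build_matrix` and the `(category, level, name)` tuples are assumptions for illustration, and the site may pick its examples differently.

```python
from typing import Dict, Iterable, Tuple


def build_matrix(
    fixtures: Iterable[Tuple[str, str, str]],
) -> Dict[str, Dict[str, str]]:
    """Map category -> level -> name of the first fixture seen for that cell."""
    matrix: Dict[str, Dict[str, str]] = {}
    for category, level, name in fixtures:
        row = matrix.setdefault(category, {})
        # setdefault keeps the first example and ignores later ones.
        row.setdefault(level, name)
    return matrix


sample = [
    ("Debugging", "medium", "Fix Off By One Average"),
    ("Debugging", "hard", "Bugfix"),
    ("Security", "medium", "Ssrf Url Fetch"),
]
matrix = build_matrix(sample)
print(matrix["Debugging"].get("easy"))  # None: no easy example in this sample
print(matrix["Debugging"]["hard"])      # Bugfix
```

A level with no fixture is simply absent from its row, which corresponds to an empty cell in the table.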