Fixture reference

Our tests

BLXBench runs a fixed, versioned set of JSON fixtures. Each fixture defines one test: a category (domain), a difficulty level, a prompt, and an automatic scorer. Models are run against this suite; the leaderboard shows aggregate quality and cost from your local `results` data.
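As a rough illustration of the four parts listed above, the sketch below parses one fixture from JSON text and checks that each part is present. The key names (`category`, `level`, `prompt`, `scorer`), the scorer shape, and the example prompt are assumptions made for this sketch; the actual fixture schema may use different names.

```python
import json

# Hypothetical key names; the real fixture schema may differ.
REQUIRED_KEYS = {"category", "level", "prompt", "scorer"}


def load_fixture(text: str) -> dict:
    """Parse one fixture from JSON text and check that the four parts exist."""
    fixture = json.loads(text)
    missing = REQUIRED_KEYS - fixture.keys()
    if missing:
        raise ValueError(f"fixture is missing keys: {sorted(missing)}")
    return fixture


# Invented example content, for illustration only.
example = load_fixture("""
{
  "category": "debugging",
  "level": "medium",
  "prompt": "Fix the off-by-one error in the average function.",
  "scorer": {"type": "exact_match"}
}
""")
print(example["category"], example["level"])
```

Checking for required keys at load time is one way to fail early on a malformed fixture instead of at scoring time.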

A path to submit or share new test fixtures is planned; it is not available yet, and we will document it when the workflow is ready.

Levels

Every fixture declares a level. Use `easy`, `medium`, or `hard` in JSON; the legacy value `leicht` (German for "easy") is still accepted and is treated the same as `easy` everywhere (blxbench filters, leaderboard, this site).
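The alias rule can be sketched as a small normalization step. This is a minimal sketch, not the BLXBench implementation: the function name `normalize_level` and the choice to raise on unknown values are assumptions.

```python
# Canonical levels and the one legacy alias described above.
LEVELS = ("easy", "medium", "hard")
LEGACY_ALIASES = {"leicht": "easy"}


def normalize_level(raw: str) -> str:
    """Map a declared level to its canonical name (hypothetical helper)."""
    level = LEGACY_ALIASES.get(raw, raw)
    if level not in LEVELS:
        # Assumption: unknown levels are rejected rather than silently kept.
        raise ValueError(f"unknown level: {raw!r}")
    return level


print(normalize_level("leicht"))  # prints "easy"
print(normalize_level("hard"))    # prints "hard"
```

Normalizing once, where fixtures are read, means filters and the leaderboard only ever see the three canonical values.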

easy

Lighter tasks: typically shorter contexts or more constrained outputs. Same scoring pipeline, lower cognitive load for the model.

medium

Default difficulty: representative prompt length and evaluation strictness for the category.

hard

Demanding cases: stricter scorers, longer reasoning paths, or adversarial phrasing where applicable.

Coding UI

Benchmark tasks from the local fixture set.

6 fixtures

debugging

Debugging

Bug fixes, edge conditions, and minimal patch accuracy.

60 fixtures

hallucination

Hallucination

Grounded answers under adversarial or missing-context prompts.

60 fixtures

reasoning

Reasoning

Arithmetic, symbolic steps, and structured problem solving.

61 fixtures

refactoring

Refactoring

Code transformation while preserving behavior and intent.

60 fixtures

security

Security

Secure code changes, vulnerability recognition, and safe defaults.

60 fixtures

speed

Speed

Latency-sensitive tasks where concise correct output matters.

65 fixtures

Matrix

One example fixture per category and level (where defined).

Category	easy	medium	hard
Coding UI	Analog Clock	Thunderstorm Over City	Breakout Game
Debugging	—	Fix Off By One Average	Bugfix
Hallucination	—	Not Stated Data Residency	Not Stated
Reasoning	—	Weighted Average	Multi Step
Refactoring	—	Extract And Guard	Cleanup
Security	—	Ssrf Url Fetch	Review
Speed	—	Summary Incident Response	Summary
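
A matrix like the one above could be derived by keeping the first fixture seen for each category and level. The sketch below assumes each fixture is a dict with hypothetical `name`, `category`, and `level` keys; the two sample entries reuse names from the Debugging row.

```python
def build_matrix(fixtures: list[dict]) -> dict[str, dict[str, str]]:
    """Return {category: {level: example fixture name}}, first match wins."""
    matrix: dict[str, dict[str, str]] = {}
    for fixture in fixtures:
        # Treat the legacy "leicht" value as "easy".
        level = "easy" if fixture["level"] == "leicht" else fixture["level"]
        row = matrix.setdefault(fixture["category"], {})
        row.setdefault(level, fixture["name"])  # keep the first example only
    return matrix


# Hypothetical input records; key names are assumptions for this sketch.
fixtures = [
    {"name": "Fix Off By One Average", "category": "debugging", "level": "medium"},
    {"name": "Bugfix", "category": "debugging", "level": "hard"},
]
matrix = build_matrix(fixtures)
print(matrix["debugging"]["medium"])    # prints "Fix Off By One Average"
print(matrix["debugging"].get("easy"))  # prints "None": no easy example defined
```

A level with no fixture is simply absent from the row, which corresponds to an empty cell in the table.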