Fixture reference

Our tests

BLXBench runs a fixed, versioned set of JSON fixtures. Each test has a category (domain), a difficulty level, a prompt, and an automatic scorer. Models are run against this suite; the leaderboard shows aggregate quality and cost from your local `results` data.
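As a rough illustration of the four parts named above, a fixture could be loaded and validated as follows. This is only a sketch: the field names (`id`, `category`, `level`, `prompt`, `scorer`), the scorer name `unit_tests`, and the sample prompt are assumptions made for the example and are not taken from the actual BLXBench schema.

```python
import json
from dataclasses import dataclass

# Hypothetical field names; the real BLXBench fixture schema may differ.
REQUIRED_FIELDS = ("id", "category", "level", "prompt", "scorer")


@dataclass(frozen=True)
class Fixture:
    id: str
    category: str  # domain, e.g. "debugging"
    level: str     # "easy", "medium", or "hard"
    prompt: str    # text sent to the model
    scorer: str    # name of the automatic scorer (assumed to be a string)


def parse_fixture(text: str) -> Fixture:
    """Parse one JSON fixture and fail loudly if a field is missing."""
    data = json.loads(text)
    missing = [key for key in REQUIRED_FIELDS if key not in data]
    if missing:
        raise ValueError(f"fixture is missing fields: {missing}")
    return Fixture(**{key: data[key] for key in REQUIRED_FIELDS})


# Illustrative fixture; the values are invented for this example.
example = parse_fixture(json.dumps({
    "id": "fix-off-by-one-average",
    "category": "debugging",
    "level": "medium",
    "prompt": "Fix the off-by-one error in average().",
    "scorer": "unit_tests",
}))
```

Rejecting incomplete fixtures at load time keeps a fixed, versioned suite from silently drifting when a file is edited by hand.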

A path to submit or share new test fixtures is planned; it is not available yet, and we will document it when the workflow is ready.

Levels

Every fixture declares a level. Use `easy`, `medium`, or `hard` in JSON. The legacy value `leicht` (German for "easy") is still accepted and is treated as `easy` everywhere: blxbench filters, the leaderboard, and this site.
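The legacy alias could be handled with a small normalization step like the one below. This is a minimal sketch, not the blxbench implementation; the function name `normalize_level` and the decision to raise on unknown values are assumptions.

```python
# Accepted level names, as listed in this section.
VALID_LEVELS = ("easy", "medium", "hard")

# Legacy alias mapped to its canonical level.
LEVEL_ALIASES = {"leicht": "easy"}


def normalize_level(raw: str) -> str:
    """Map a declared level to its canonical name (hypothetical helper)."""
    level = LEVEL_ALIASES.get(raw, raw)
    if level not in VALID_LEVELS:
        # Rejecting unknown values is an assumption of this sketch.
        raise ValueError(f"unknown level: {raw!r}")
    return level
```

Normalizing once, at load time, means filters and the leaderboard only ever need to compare against the three canonical names.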

easy

Lighter tasks: typically shorter contexts or more constrained outputs. Same scoring pipeline, lower cognitive load for the model.

medium

Default difficulty: representative prompt length and evaluation strictness for the category.

hard

Demanding cases: stricter scorers, longer reasoning paths, or adversarial phrasing where applicable.

Categories

Seven domains of fixtures, each with its own focus and scorers. Counts are from the current tree under packages/benchmark-core/tests.

coding_ui

Coding UI

Benchmark tasks from the local fixture set.

6 fixtures

debugging

Debugging

Bug fixes, edge conditions, and minimal patch accuracy.

60 fixtures

hallucination

Hallucination

Grounded answers under adversarial or missing-context prompts.

60 fixtures

reasoning

Reasoning

Arithmetic, symbolic steps, and structured problem solving.

61 fixtures

refactoring

Refactoring

Code transformation while preserving behavior and intent.

60 fixtures

security

Security

Secure code changes, vulnerability recognition, and safe defaults.

60 fixtures

speed

Speed

Latency-sensitive tasks where concise correct output matters.

65 fixtures
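The per-category counts above could be reproduced with a tally like the following. It is a sketch under assumptions: that each fixture is a `.json` file somewhere under the tests directory and that it carries a `category` field. The helper names are hypothetical. The `PUBLISHED_COUNTS` values are copied from the list above.

```python
import json
from collections import Counter
from pathlib import Path


def load_fixtures(root):
    """Yield every JSON fixture under root (directory layout is assumed)."""
    for path in sorted(Path(root).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield json.load(handle)


def count_by_category(fixtures):
    """Tally fixtures by their (assumed) "category" field."""
    return Counter(fixture["category"] for fixture in fixtures)


# Counts as listed on this page.
PUBLISHED_COUNTS = {
    "coding_ui": 6,
    "debugging": 60,
    "hallucination": 60,
    "reasoning": 61,
    "refactoring": 60,
    "security": 60,
    "speed": 65,
}
TOTAL = sum(PUBLISHED_COUNTS.values())

# In-memory sample so the sketch runs without the repository checked out.
sample = [{"category": "debugging"}, {"category": "debugging"}, {"category": "speed"}]
sample_counts = count_by_category(sample)
```

Against a real checkout, `count_by_category(load_fixtures("packages/benchmark-core/tests"))` would be the call to compare with `PUBLISHED_COUNTS`, provided the layout assumption holds.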

Matrix

One example fixture per category and level (where defined).

Category      | easy         | medium                    | hard
Coding UI     | Analog Clock | Thunderstorm Over City    | Breakout Game
Debugging     | -            | Fix Off By One Average    | Bugfix
Hallucination | -            | Not Stated Data Residency | Not Stated
Reasoning     | -            | Weighted Average          | Multi Step
Refactoring   | -            | Extract And Guard         | Cleanup
Security      | -            | Ssrf Url Fetch            | Review
Speed         | -            | Summary Incident Response | Summary
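A matrix of this kind can be derived from the fixture list by keeping one example per category and level. The sketch below assumes each fixture has `category`, `level`, and `title` fields and keeps the first fixture encountered. Both the field names and the "first one wins" rule are assumptions, since the page does not say how examples are chosen.

```python
LEVELS = ("easy", "medium", "hard")


def build_matrix(fixtures):
    """Return {category: {level: title or None}} with one example per cell."""
    matrix = {}
    for fixture in fixtures:
        # Start each category row with an empty cell for every level.
        row = matrix.setdefault(fixture["category"], dict.fromkeys(LEVELS))
        level = fixture["level"]
        # Keep only the first example seen for a given cell.
        if level in row and row[level] is None:
            row[level] = fixture["title"]
    return matrix


# Sample rows taken from the matrix on this page.
sample = [
    {"category": "coding_ui", "level": "easy", "title": "Analog Clock"},
    {"category": "coding_ui", "level": "medium", "title": "Thunderstorm Over City"},
    {"category": "coding_ui", "level": "hard", "title": "Breakout Game"},
    {"category": "debugging", "level": "medium", "title": "Fix Off By One Average"},
    {"category": "debugging", "level": "hard", "title": "Bugfix"},
]
matrix = build_matrix(sample)
```

Cells that stay `None` correspond to the empty entries in the table, that is, levels for which no fixture is defined in that category.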

BLXBench

Community-driven leaderboard

Public benchmark runner: run in your environment, share results with the community.

© 2026 BLXBench by bitslix.com
