BLXBench Docs

Our Tests

Explore the BLXBench test catalog and understand test categories.

Test Catalog

BLXBench evaluates models against a fixed, versioned set of JSON fixtures under packages/benchmark-core/tests/. Each file defines prompts, scoring, and metadata; the web test catalog mirrors that tree.
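Because the catalog mirrors the folder tree, a local checkout can be summarized by walking that tree. The following is a minimal sketch, not part of BLXBench itself: it assumes one level of nesting (tests/<category>/<fixture>.json), and the function name fixtures_by_category is hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def fixtures_by_category(tests_root: Path) -> dict[str, list[str]]:
    """Group fixture file stems by their immediate category folder."""
    grouped: dict[str, list[str]] = defaultdict(list)
    # Assumption: fixtures live exactly one folder deep,
    # i.e. <tests_root>/<category>/<fixture>.json
    for path in sorted(tests_root.glob("*/*.json")):
        grouped[path.parent.name].append(path.stem)
    return dict(grouped)


if __name__ == "__main__":
    # Path taken from the docs; run from the repository root.
    root = Path("packages/benchmark-core/tests")
    for category, ids in fixtures_by_category(root).items():
        print(f"{category}: {len(ids)} fixture(s)")
```

If the real tree nests fixtures more deeply, the glob pattern would need to change accordingly.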

Categories

Fixture folders (categories) in the suite include:

Category       Focus
speed          Latency-sensitive tasks where concise, correct output matters
security       Safer code changes, vulnerability awareness, refusal behavior
reasoning      Arithmetic, structured steps, logical problems
debugging      Minimal patches and edge-heavy bug fixes
refactoring    Rewrites that must preserve behavior
hallucination  Grounded answers when context is missing or adversarial
coding_ui      HTML (and similar) artifacts; optional Playwright render stage

Difficulty Levels

Each fixture has a difficulty level:

Level   Description
Easy    Lighter prompts / scoring
Medium  Representative difficulty
Hard    Stricter scorers or longer tasks

Viewing Tests

Browse cases in the test catalog. Detail pages show id, prompts (where exposed), category, level, scorer, and historical pass rates when data exists.

Test Format

Fixtures are JSON with fields such as:

  • id / file name — Stable identifier
  • prompt — User input
  • category — Folder / domain
  • level — easy | medium | hard (legacy aliases normalized at runtime)
  • Scorer configuration — How passes are judged (exact match, rubric, render+judge for coding_ui, etc.)

Exact schema varies by benchmark type; open a fixture in the repo for the authoritative shape.
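As a rough illustration of the fields listed above, the sketch below parses a fixture and normalizes its level. Several details are assumptions rather than the documented schema: that id is stored as a JSON key (it may instead come from the file name), that the four keys shown are always present, and the contents of the alias table, which is purely hypothetical because the real legacy aliases are not listed here.

```python
import json

# Field names taken from the list above; treating all four as required
# is an assumption, since the exact schema varies by benchmark type.
REQUIRED_FIELDS = ("id", "prompt", "category", "level")
LEVELS = {"easy", "medium", "hard"}

# Hypothetical alias table, for illustration only.
LEGACY_LEVEL_ALIASES = {"simple": "easy", "normal": "medium", "difficult": "hard"}


def normalize_level(raw: str) -> str:
    """Lower-case a level, map hypothetical legacy aliases, and validate it."""
    level = raw.strip().lower()
    level = LEGACY_LEVEL_ALIASES.get(level, level)
    if level not in LEVELS:
        raise ValueError(f"unknown level: {raw!r}")
    return level


def load_fixture(text: str) -> dict:
    """Parse fixture JSON, check the assumed required keys, normalize level."""
    fixture = json.loads(text)
    missing = [name for name in REQUIRED_FIELDS if name not in fixture]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    fixture["level"] = normalize_level(fixture["level"])
    return fixture


if __name__ == "__main__":
    # Made-up example fixture; not taken from the repository.
    sample = '{"id": "speed-001", "prompt": "Reverse a string.", "category": "speed", "level": "Hard"}'
    print(load_fixture(sample)["level"])
```

Scorer configuration is left out on purpose, since its key name and shape are not specified on this page.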

Contributing Tests

A public “submit a fixture” workflow is not available yet. Today, changes go through the main repository: add JSON under packages/benchmark-core/tests/<category>/ and open a pull request. Tests should be deterministic, safe to run unattended, and cheap enough for community runners where possible.
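Before opening a pull request, a contributor might run a quick local check that a new file sits in a category folder and agrees with it. This is a sketch under assumptions, not a project tool: check_fixture_location is a hypothetical helper, the category set is copied from the table on this page (which may not be exhaustive), and it assumes the category field, when present, should equal the folder name.

```python
import json
from pathlib import Path

# Copied from the Categories table; the real suite may contain more folders.
KNOWN_CATEGORIES = {
    "speed", "security", "reasoning", "debugging",
    "refactoring", "hallucination", "coding_ui",
}


def check_fixture_location(path: Path) -> list[str]:
    """Return a list of human-readable problems; empty means no issue found."""
    problems: list[str] = []
    folder = path.parent.name
    if path.suffix != ".json":
        problems.append(f"{path.name}: expected a .json file")
    if folder not in KNOWN_CATEGORIES:
        problems.append(f"{path.name}: folder {folder!r} is not a listed category")
    data = json.loads(path.read_text(encoding="utf-8"))
    declared = data.get("category")
    # Assumption: the category field mirrors the folder name.
    if declared is not None and declared != folder:
        problems.append(
            f"{path.name}: category {declared!r} does not match folder {folder!r}"
        )
    return problems
```

A check like this covers placement only; determinism and run cost still need to be judged by running the fixture.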

