Leaderboard
How to read and interpret the BLXBench leaderboard.
Overview
The BLXBench leaderboard displays benchmark results for AI models across multiple providers. Models are ranked by their aggregate performance across all test categories.
Understanding Scores
Overall Score
The leaderboard ranks models by an aggregate score derived from merged benchmark runs: total earned score divided by total max score across all tests in that rollup (i.e. a test-weighted percentage, not a fixed 25% weight per category). Categories with more fixtures therefore influence the overall score more than sparse ones.
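The rollup above can be sketched as a small computation. The record shape below (score, max_score, category fields) is hypothetical, not the actual BLXBench data schema:

```python
# Sketch of the test-weighted rollup: sum of earned scores over sum of
# max scores across all tests, so categories with more fixtures count more.
tests = [
    {"category": "security", "score": 8, "max_score": 10},
    {"category": "security", "score": 9, "max_score": 10},
    {"category": "reasoning", "score": 3, "max_score": 5},
]

def overall_score(tests):
    earned = sum(t["score"] for t in tests)
    possible = sum(t["max_score"] for t in tests)
    return earned / possible  # test-weighted percentage, not per-category averaging

print(overall_score(tests))  # security's two fixtures outweigh reasoning's one
```

Note that averaging the two category percentages (85% and 60%) would give a different number than this rollup; the leaderboard uses the test-weighted form.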
Category Scores
Open a model to see per-category breakdowns (speed, security, reasoning, debugging, refactoring, hallucination, coding_ui, etc.). Each slice uses the same score / max_score rollup for tests in that category.
Costs
Use the Costs slice to compare estimated USD spend and related usage from submitted runs. This is separate from the quality score.
Filtering Results
By Provider (label)
The table derives a short provider label from the model id (the segment before /, e.g. openai in openai/gpt-5.4-mini). Search and filters use that string — it is not the same as blxbench adapter aliases (opr, oai, …).
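The label derivation described above amounts to taking the substring before the first slash; a minimal sketch (the function name is ours, not a BLXBench API):

```python
def provider_label(model_id: str) -> str:
    # Segment before the first "/": "openai/gpt-5.4-mini" -> "openai".
    # Ids without a slash fall through unchanged.
    return model_id.split("/", 1)[0]
```

This is why searching for an adapter alias like opr or oai finds nothing: filters compare against this derived string, not the alias table.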
By Search
Search matches substrings of the model name or provider label.
By Category or Difficulty
Narrow the table to a category (fixture domain) and/or difficulty (easy / medium / hard).
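The combined behavior of search, category, and difficulty filters can be sketched as below; the row fields (model, provider, category, difficulty) are hypothetical stand-ins for whatever the table actually stores:

```python
# Sketch of the table's filtering: substring search plus exact-match
# category/difficulty narrowing, each filter optional.
rows = [
    {"model": "gpt-5.4-mini", "provider": "openai",
     "category": "security", "difficulty": "hard"},
    {"model": "claude", "provider": "anthropic",
     "category": "reasoning", "difficulty": "easy"},
]

def filter_rows(rows, query=None, category=None, difficulty=None):
    def keep(r):
        if query and query.lower() not in (r["model"] + " " + r["provider"]).lower():
            return False  # search: case-insensitive substring on name or provider
        if category and r["category"] != category:
            return False  # category: exact match on fixture domain
        if difficulty and r["difficulty"] != difficulty:
            return False  # difficulty: exact match (easy / medium / hard)
        return True
    return [r for r in rows if keep(r)]
```

Filters compose with AND semantics: passing both a query and a difficulty keeps only rows matching both.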
Viewing Individual Results
Click any model to see:
- Full test results per category
- Individual test cases
- Historical runs
- Cost analysis
Submitting Results
To submit your own results:
- Create a BLXBench account and complete a pass tier that includes leaderboard submission (see Account)
- For headless runs, create a BLXBench API key; for the TUI, sign in with /auth login
- Run benchmarks with blxbench and pass --submit or set BLXBENCH_SUBMIT=1
See Quick Start for blxbench examples.
Interpreting Results
What Makes a Good Score?
A good benchmark score means the model:
- Completes tasks correctly
- Does so within reasonable time
- Handles edge cases well
Limitations
- Benchmarks don't cover all use cases
- Results vary by model version
- Cost considerations are separate from performance