BLXBench

Updated Jun 24, 03:06 PM · 32 models / 43 · 490 fixtures

AI Model Benchmark Leaderboard

Category-aware model rankings from local BLXBench runs, grouped by task domain, difficulty level, pass rate, and latency.

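| Run | By | When | Tests | Cost |
| --- | --- | --- | --- | --- |
| run_7730ad (v2 — Resilience) | B | Jun 17, 11:21 PM | 459 | $0.00 |
| run_7ce130 (v2 — Resilience) | B | Jun 17, 07:52 PM | 458 | $1.16 |
| run_3d5451 (v2 — Resilience) | B | Jun 13, 09:39 PM | 459 | $1.75 |

Showing 3 of 64 runs.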
Top score: Gpt 5.5, 77.9
Executed tests: 14669 (490 available fixtures)
Est. API spend: $67.89 (sum of per-model costs from overall_ranking)
Top decode: Nemotron 3 Super 120b A12b, 6682.6 tok/s
Categories: 9 (Coding / Ui / Debugging / Hallucination / Reasoning / Refactoring / Security / Speed / Cost)
Levels: 3 (easy / medium / hard)

Benchmark

Overall · All levels · Suite: v2 — Resilience

| Rank | Model | Model ID | Pass | Score | Latency | tok/s | Cost | Infra |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Gpt 5.5 | openai/gpt-5.5 | 284/459 | 77.9 | 6.06s | 102.7 | $6.78 | |
| 2 | Gpt 5.3 Codex | openai/gpt-5.3-codex | 281/459 | 77.7 | 5.24s | 107.2 | $2.63 | |
| 3 | Qwen3.7 Max | qwen/qwen3.7-max | 252/459 | 75.3 | 4.58s | 223.3 | $1.88 | |
| 4 | Kimi K2.6 | moonshotai/kimi-k2.6 | 278/459 | 74.8 | 5.82s | 130.2 | $1.10 | |
| 5 | Minimax M3 | minimax/minimax-m3 | 266/459 | 74.7 | 16.80s | 46.1 | $0.37 | |
| 6 | Deepseek V4 Pro | deepseek/deepseek-v4-pro | 242/459 | 73.8 | 13.50s | 48.9 | $0.82 | |
| 7 | Mimo V2.5 | xiaomi/mimo-v2.5 | 254/459 | 73.8 | 5.57s | 120.7 | $0.45 | |
| 8 | Mimo V2.5 Pro | xiaomi/mimo-v2.5-pro | 248/459 | 73.1 | 7.42s | 77.9 | $0.80 | |
| 9 | Grok Build 0.1 | x-ai/grok-build-0.1 | 234/459 | 72.8 | 12.09s | 190.9 | $2.01 | Mandatory thinking |
| 10 | Nemotron 3 Super 120b A12b | nvidia/nemotron-3-super-120b-a12b:free | 222/459 | 72.3 | 11.80s | 6682.6 | $0.00 | |
| 11 | Glm 5.1 | z-ai/glm-5.1 | 235/459 | 72.1 | 3.38s | 163.7 | $1.03 | |
| 12 | Deepseek V4 Flash | deepseek/deepseek-v4-flash | 226/459 | 71.2 | 9.98s | 52.4 | $0.06 | |
| 13 | Glm 5.2 | z-ai/glm-5.2 | 231/458 | 70.8 | 23.04s | 27.4 | $1.16 | |
| 14 | Mistral Medium 3 5 | mistralai/mistral-medium-3-5 | 221/459 | 70.3 | 2.78s | 166.0 | $1.39 | |
| 15 | Qwen3.7 Plus | qwen/qwen3.7-plus | 216/459 | 69.9 | 9.44s | 55.1 | $0.35 | |
| 16 | Claude Opus 4.8 | anthropic/claude-opus-4.8 | 276/459 | 69.7 | 10.18s | 187.1 | $9.39 | |
| 17 | Qwen3.6 Flash | qwen/qwen3.6-flash | 210/459 | 69.3 | 3.59s | 204.0 | $0.43 | |
| 18 | Cobuddy | baidu/cobuddy:free | 216/459 | 68.7 | 21.58s | 50.6 | $0.00 | Mandatory thinking |
| 19 | Claude Opus 4.7 | anthropic/claude-opus-4.7 | 276/456 | 68.1 | 10.01s | 95.1 | $8.89 | |
| 20 | Nemotron 3 Nano 30b A3b | nvidia/nemotron-3-nano-30b-a3b:free | 187/459 | 68.0 | 1.99s | 250.1 | $0.00 | |
| 21 | Mistral Small 2603 | mistralai/mistral-small-2603 | 207/458 | 67.8 | 2.48s | 180.2 | $0.10 | |
| 22 | Grok 4.3 | x-ai/grok-4.3 | 214/459 | 67.6 | 5.35s | 101.0 | $0.47 | |
| 23 | Granite 4.1 8b | ibm-granite/granite-4.1-8b | 199/459 | 67.1 | 2.89s | 109.8 | $0.01 | |
| 24 | North Mini Code | cohere/north-mini-code:free | 199/459 | 66.3 | 5.59s | 210.9 | $0.00 | |
| 25 | Kimi K2.7 Code | moonshotai/kimi-k2.7-code | 236/459 | 65.7 | 12.63s | 93.4 | $1.75 | Mandatory thinking |
| 26 | Ring 2.6 1t | inclusionai/ring-2.6-1t:free | 199/445 | 65.3 | 9.87s | 105.9 | $0.00 | Mandatory thinking |
| 27 | Claude Fable 5 | anthropic/claude-fable-5 | 259/459 | 64.5 | 11.50s | 187.1 | $18.57 | Mandatory thinking |
| 28 | Gemini 3.1 Flash Lite | google/gemini-3.1-flash-lite | 201/459 | 64.2 | 1.72s | 504.7 | $0.23 | |
| 29 | Nemotron 3 Nano Omni 30b A3b Reasoning | nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free | 167/459 | 57.9 | 5.00s | 227.6 | $0.00 | |
| 30 | Step 3.7 Flash | stepfun/step-3.7-flash | 175/459 | 55.2 | 6.14s | 257.3 | $0.74 | Mandatory thinking |
| 31 | Minimax M2.7 | minimax/minimax-m2.7 | 148/459 | 51.0 | 15.91s | 211.8 | $1.04 | Mandatory thinking |
| 32 | Gemini 3.5 Flash | google/gemini-3.5-flash | 108/459 | 41.5 | 6.66s | 338.3 | $5.43 | Mandatory thinking |
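
The summary figures (executed tests, estimated API spend, top decode) can be cross-checked against the leaderboard rows. The following is a minimal sketch, assuming the rows are transcribed by hand from the table; the names `ROWS` and `summarize` are illustrative and not part of BLXBench. Because the displayed costs are rounded to cents, the recomputed spend may differ from the summary figure by a cent.

```python
# Rows transcribed from the leaderboard: (model id, passed, total, tok/s, cost in USD).
ROWS = [
    ("openai/gpt-5.5", 284, 459, 102.7, 6.78),
    ("openai/gpt-5.3-codex", 281, 459, 107.2, 2.63),
    ("qwen/qwen3.7-max", 252, 459, 223.3, 1.88),
    ("moonshotai/kimi-k2.6", 278, 459, 130.2, 1.10),
    ("minimax/minimax-m3", 266, 459, 46.1, 0.37),
    ("deepseek/deepseek-v4-pro", 242, 459, 48.9, 0.82),
    ("xiaomi/mimo-v2.5", 254, 459, 120.7, 0.45),
    ("xiaomi/mimo-v2.5-pro", 248, 459, 77.9, 0.80),
    ("x-ai/grok-build-0.1", 234, 459, 190.9, 2.01),
    ("nvidia/nemotron-3-super-120b-a12b:free", 222, 459, 6682.6, 0.00),
    ("z-ai/glm-5.1", 235, 459, 163.7, 1.03),
    ("deepseek/deepseek-v4-flash", 226, 459, 52.4, 0.06),
    ("z-ai/glm-5.2", 231, 458, 27.4, 1.16),
    ("mistralai/mistral-medium-3-5", 221, 459, 166.0, 1.39),
    ("qwen/qwen3.7-plus", 216, 459, 55.1, 0.35),
    ("anthropic/claude-opus-4.8", 276, 459, 187.1, 9.39),
    ("qwen/qwen3.6-flash", 210, 459, 204.0, 0.43),
    ("baidu/cobuddy:free", 216, 459, 50.6, 0.00),
    ("anthropic/claude-opus-4.7", 276, 456, 95.1, 8.89),
    ("nvidia/nemotron-3-nano-30b-a3b:free", 187, 459, 250.1, 0.00),
    ("mistralai/mistral-small-2603", 207, 458, 180.2, 0.10),
    ("x-ai/grok-4.3", 214, 459, 101.0, 0.47),
    ("ibm-granite/granite-4.1-8b", 199, 459, 109.8, 0.01),
    ("cohere/north-mini-code:free", 199, 459, 210.9, 0.00),
    ("moonshotai/kimi-k2.7-code", 236, 459, 93.4, 1.75),
    ("inclusionai/ring-2.6-1t:free", 199, 445, 105.9, 0.00),
    ("anthropic/claude-fable-5", 259, 459, 187.1, 18.57),
    ("google/gemini-3.1-flash-lite", 201, 459, 504.7, 0.23),
    ("nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free", 167, 459, 227.6, 0.00),
    ("stepfun/step-3.7-flash", 175, 459, 257.3, 0.74),
    ("minimax/minimax-m2.7", 148, 459, 211.8, 1.04),
    ("google/gemini-3.5-flash", 108, 459, 338.3, 5.43),
]


def summarize(rows):
    """Recompute executed tests, total spend, and the fastest decoder from the rows."""
    executed = sum(total for _, _, total, _, _ in rows)
    spend = round(sum(cost for *_, cost in rows), 2)
    fastest = max(rows, key=lambda row: row[3])  # row[3] is decode speed in tok/s
    return executed, spend, fastest[0], fastest[3]


executed, spend, fastest_id, fastest_tok_s = summarize(ROWS)
print(f"executed tests: {executed}")
print(f"summed cost:    ${spend:.2f}")
print(f"top decode:     {fastest_id} at {fastest_tok_s} tok/s")
```

With these rows the executed-test count matches the 14669 shown in the summary, and the summed cost lands within a cent of $67.89.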

Selected model

Gpt 5.5

openai/gpt-5.5
Score: 77.9
Pass rate: 61.9
Tests: 284/459 (v2 — Resilience)
Avg latency: 6.06s
TTFT: 1091 ms
Decode: 102.7 tok/s
Slice cost: $6.78
Runs: 1

| Category | Score |
| --- | --- |
| Coding | 98.5 |
| Ui | 87.5 |
| Debugging | 77.0 |
| Hallucination | 78.9 |
| Reasoning | 76.2 |
| Refactoring | 65.2 |
| Security | 74.8 |
| Speed | 89.4 |
| Cost | 91.3 |
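
The pass rate shown for the selected model follows directly from its test counts. A minimal sketch, assuming pass rate is the percentage of passed tests rounded to one decimal; the helper names `pass_rate` and `cost_per_pass` are hypothetical and not part of BLXBench.

```python
def pass_rate(passed: int, total: int) -> float:
    """Percentage of passed tests, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * passed / total, 1)


def cost_per_pass(cost_usd: float, passed: int) -> float:
    """Slice cost divided by the number of passed tests, in USD."""
    if passed <= 0:
        raise ValueError("passed must be positive")
    return cost_usd / passed


# Gpt 5.5: 284 of 459 tests passed at a slice cost of $6.78.
print(pass_rate(284, 459))                 # 61.9
print(round(cost_per_pass(6.78, 284), 4))  # 0.0239
```

Pass rate and Score are separate metrics: the page does not state how Score is computed, and models with similar pass counts can rank far apart (for example, 278/459 at rank 4 versus 276/459 at rank 16).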
Run context

Shown metrics are from your best public run (highest score %) for this model.

Snapshot: May 10, 11:46 PM. Best-run suite: v2 — Resilience.

Starred rows in Category profile are optional benchmark slices (opt-in, e.g. Roblox). An n/a there means the best public run did not include that category; it is not a scored zero.
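
The difference between n/a and a scored zero matters when aggregating a category profile. The sketch below is an illustration only: it represents n/a as `None` and skips it when averaging. The function name `mean_scored` and the `Roblox` entry are hypothetical, and this unweighted mean is not the leaderboard Score, whose formula the page does not give.

```python
from typing import Optional


def mean_scored(profile: dict[str, Optional[float]]) -> Optional[float]:
    """Average only the categories that were actually scored (None means n/a)."""
    scored = [value for value in profile.values() if value is not None]
    return sum(scored) / len(scored) if scored else None


# Gpt 5.5 category profile, plus a hypothetical opt-in slice that was not run.
profile = {
    "Coding": 98.5, "Ui": 87.5, "Debugging": 77.0, "Hallucination": 78.9,
    "Reasoning": 76.2, "Refactoring": 65.2, "Security": 74.8, "Speed": 89.4,
    "Cost": 91.3,
    "Roblox": None,  # n/a: the category was not part of the run
}

skipping_na = mean_scored(profile)
as_zero = mean_scored({k: (v if v is not None else 0.0) for k, v in profile.items()})
print(round(skipping_na, 1), round(as_zero, 1))
```

Counting the missing slice as zero pulls the average down by several points, which is why n/a is kept distinct from a score of zero.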


BLXBench

Community-driven leaderboard. Public benchmark runner: run in your environment, share results with the community.

© 2026 BLXBench by bitslix.com

Provenance: Aggregated from user runs
Scope: 43 / 11 / 490
Latest: run_7730ad / 459 / $0.00