BLXBench - Changelog

User-facing changes by release — through v1.3.4

A readable summary of what to expect when you update blxbench. We keep jargon light; the documentation is the best place for step-by-step guides and flags.

v1.3.4 (May 13, 2026)

Arcade audio fixed, live run progress in the arcade & scrollable run log

Packages at 1.3.4 — `@bitslix/blxbench`, optional native CLI packages, and `@bitslix/blxbench-report-browser` (plus its platform packages) are 1.3.4. Upgrade with `npm i -g @bitslix/blxbench`; if you use the desktop report browser, run `/report browser install` again to stay in sync.
Arcade music and sound effects now work in the installed CLI — Background music and SFX were silently missing in every installed version of blxbench due to the audio asset resolver looking in the wrong location. The resolver now correctly locates the bundled `.opus` files next to the installed binary on all platforms.
Live benchmark progress shown in the arcade — When you open the arcade during a run (`a`), the game select screen and the active game header now display the current test progress and estimated time remaining (e.g. `5/20 · ETA ~3m`) so you can keep an eye on the benchmark without leaving your game.
AI run summary: retries, failure visibility, and faster manual upload — Three fixes to the AI-generated run summary flow. (1) If the summary API call fails at the end of a run (e.g. a brief network blip), blxbench now retries once automatically after 8 seconds instead of silently dropping it; this applies to both the run summary and the per-model summaries. (2) When summary generation fails in TUI mode, the error is now forwarded to the run dashboard log as a warning instead of being silently swallowed. (3) Pressing `s` to upload an older report (e.g. hours after the run) no longer blocks for up to 75 seconds waiting for a summary that will never arrive: blxbench now checks once immediately and proceeds instantly if the summary is already there; if it is not, it waits only for whatever remains of the time budget, measured from when the run finished.
Scrollable run log — The test log in the run dashboard can now be scrolled with `↑` / `↓` (one line) and `PgUp` / `PgDn` (one window). Press `Enter` while scrolled to jump straight back to the live tail. The log header shows the current scroll position and a hint when more history is available, and the footer hint strip updates to reflect the scroll keys alongside the existing run controls.
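
The manual-upload wait described in the AI run summary entry above can be pictured with a small calculation. This is only a sketch, not blxbench's actual code: the function name `remainingSummaryWaitMs` and the constant name are hypothetical, and the 75-second budget is simply the figure quoted in that entry.

```typescript
// Hypothetical sketch: how long a manual upload should still wait for an AI summary.
// SUMMARY_BUDGET_MS mirrors the 75-second figure quoted above; all names are illustrative.
const SUMMARY_BUDGET_MS = 75000;

function remainingSummaryWaitMs(
  runFinishedAtMs: number,
  nowMs: number,
  summaryReady: boolean,
): number {
  // If the summary is already there, proceed immediately.
  if (summaryReady) return 0;
  // Otherwise wait only for what is left of the budget, measured from the end of the run.
  const elapsedMs = nowMs - runFinishedAtMs;
  return Math.max(0, SUMMARY_BUDGET_MS - elapsedMs);
}
```

Under these assumptions, a report uploaded hours after the run has no budget left, so the upload starts without waiting.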

v1.3.3 (May 11, 2026)

Multi-model upload fixes & leaderboard improvements

Packages at 1.3.3 — `@bitslix/blxbench`, optional native CLI packages, and `@bitslix/blxbench-report-browser` (plus its platform packages) are 1.3.3. Upgrade with `npm i -g @bitslix/blxbench`; if you use the desktop report browser, run `/report browser install` again to stay in sync.
Multi-model runs now upload all reports together — Previously, each model's report was sent to the server as soon as that model finished — meaning the second model's upload started while the first was still uploading. Reports are now collected and uploaded sequentially once the entire run is complete, keeping the batch coherent and making it easier to compare all models on the same run page.
Pressing `s` now submits all model reports, not just the last one — In a multi-model run with automatic upload disabled, pressing `s` previously only sent the report for the model that finished last. All model reports in the run are now uploaded when you press `s`.
Upload progress is clearly labelled in the log — When uploading multiple model reports, each entry is now tagged with its position (e.g. `(1/2)`, `(2/2)`) so you can follow progress at a glance.
Leaderboard improvements — Suite version filtering, sorting, and tooltip badges for mandatory-reasoning runs have been refined so results line up more accurately across different suite versions.
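
The sequential, labelled upload described in this release can be sketched roughly as follows. The names `uploadAll`, `positionLabel`, and the `uploadReport` callback are hypothetical stand-ins, not blxbench APIs; only the `(1/2)` label format comes from the entry above.

```typescript
// Hypothetical sketch of a sequential batch upload with position labels.
type ModelReport = { model: string };

// Builds the position tag shown in the log, e.g. "(1/2)".
function positionLabel(index: number, total: number): string {
  return `(${index + 1}/${total})`;
}

async function uploadAll(
  reports: ModelReport[],
  uploadReport: (report: ModelReport) => Promise<void>, // stand-in for the real upload call
  log: (line: string) => void,
): Promise<void> {
  for (let i = 0; i < reports.length; i++) {
    const report = reports[i];
    log(`Uploading ${report.model} ${positionLabel(i, reports.length)}`);
    // Await each upload before starting the next, so uploads never overlap.
    await uploadReport(report);
  }
}
```

Awaiting inside the loop is what keeps the batch sequential; starting all uploads at once would bring back the overlap the release removed.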

v1.3.2 (May 10, 2026)

Mandatory-reasoning transparency

Packages at 1.3.2 — `@bitslix/blxbench`, optional native CLI packages, and `@bitslix/blxbench-report-browser` (plus its platform packages) are 1.3.2. Upgrade with `npm i -g @bitslix/blxbench`; if you use the desktop report browser, run `/report browser install` again to stay in sync.
Mandatory-reasoning models are now flagged — Some models (e.g. MiniMax M2.7 via OpenRouter) reject the `thinking_off` parameters and refuse to run without active reasoning. blxbench now detects this HTTP 400 error, retries the request without any thinking parameters, and marks the result as `thinking_forced`. This means the benchmark can still complete instead of aborting.
Run detail page shows forced-thinking indicator — Any test result where thinking could not be disabled now shows a small brain icon (🧠) next to the test name in the run detail view, with a tooltip explaining what happened.
Model detail page shows a fairness warning — When a model had mandatory reasoning active across one or more tests, a violet info banner appears in the stats section explaining how many tests were affected and that non-reasoning category scores may not be directly comparable to other models.
AI-generated model summaries mention mandatory reasoning — The worker-generated model summary (Strengths / Weaknesses section) is now instructed to note mandatory reasoning in the Weaknesses section when `thinking_forced_count` is present in the data, so the limitation is visible without opening the raw JSON.
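
The mandatory-reasoning fallback in this release can be sketched as below. This is a simplified illustration under stated assumptions: `runWithThinkingOff`, `SendRequest`, and the request shape are hypothetical, and the sketch treats any HTTP 400 as the rejection, whereas the real detection is presumably more specific. Only the `thinking_forced` marker comes from the entry above.

```typescript
// Hypothetical sketch: retry without thinking parameters after an HTTP 400.
type RequestBody = Record<string, unknown>;
type SendRequest = (body: RequestBody) => Promise<{ status: number; text: string }>;

async function runWithThinkingOff(
  send: SendRequest,
  baseBody: RequestBody,
  thinkingOffParams: RequestBody,
): Promise<{ text: string; thinking_forced: boolean }> {
  // First attempt: ask the provider to suppress thinking.
  const first = await send({ ...baseBody, ...thinkingOffParams });
  if (first.status !== 400) {
    return { text: first.text, thinking_forced: false };
  }
  // Simplification: any 400 is treated as "thinking cannot be disabled".
  // Retry with no thinking parameters at all and flag the result.
  const second = await send(baseBody);
  return { text: second.text, thinking_forced: true };
}
```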

v1.3.0 (May 10, 2026)

Fair benchmarking across reasoning and non-reasoning models

Packages at 1.3.0 — `@bitslix/blxbench`, optional native CLI packages, and `@bitslix/blxbench-report-browser` (plus its platform packages) are 1.3.0. Upgrade with `npm i -g @bitslix/blxbench`; if you use the desktop report browser, run `/report browser install` again to stay in sync.
Thinking is now only active during reasoning tests — Previously, models with a thinking budget (e.g. DeepSeek-R1, QwQ, Claude Extended Thinking on OpenRouter) spent their reasoning tokens on every test category — inflating response times, distorting cost measurements, and giving reasoning models a silent advantage on coding, debugging, and other quality tests. blxbench now sends provider-specific thinking-enable parameters only for `reasoning` category tests, and explicitly disables thinking for all other categories. This makes scores directly comparable between reasoning and non-reasoning models on the same task.
Provider-agnostic thinking control — The new `thinking_on` / `thinking_off` adapter fields let each provider declare exactly which API parameters to send when thinking should be active or suppressed. OpenRouter (`reasoning.effort`), OpenAI (`reasoning_effort`), Ollama (`think`), LM Studio (`reasoning.effort`), and Google Gemini (`thinking.type`) are all pre-configured. Custom provider adapters can override these fields or leave them empty for models that do not support thinking at all.
Per-model run state in the dashboard — The run dashboard now tracks each model's lifecycle as it moves through pending → active → done, making it easier to follow progress when benchmarking multiple models in a single run.
Smoother CLI state transitions — Several groups of related state variables in the run dashboard, completed-run view, usage overlay, and Playwright installer have been consolidated into single atomic objects. The practical effect: the pause/resume indicator in the run dashboard no longer flickers through a half-applied state; chart mode and model navigation in the completed-run view switch cleanly in one step; the usage quota overlay and the Playwright install progress screen no longer show a momentary blank between loading and ready states. These were silent internal bugs that occasionally produced brief visual glitches — users with faster machines were less likely to notice them, but they were real and have been eliminated.
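
To make the `thinking_on` / `thinking_off` idea from this release concrete, here is a minimal sketch of how an adapter might pick parameters per test category. Only the two field names and the `reasoning` category come from the entries above; the `ThinkingAdapter` shape, the `thinkingParams` helper, and the example values are assumptions for illustration, not the shipped configuration.

```typescript
// Hypothetical adapter shape; only thinking_on / thinking_off are named in the changelog.
type ThinkingParams = Record<string, unknown>;

interface ThinkingAdapter {
  thinking_on?: ThinkingParams;
  thinking_off?: ThinkingParams;
}

// Thinking is enabled only for the "reasoning" category and suppressed everywhere else.
function thinkingParams(adapter: ThinkingAdapter, category: string): ThinkingParams {
  const chosen = category === "reasoning" ? adapter.thinking_on : adapter.thinking_off;
  // Adapters for models without thinking support can leave both fields empty.
  return chosen ?? {};
}

// Illustrative values only; real providers may expect different parameters.
const exampleAdapter: ThinkingAdapter = {
  thinking_on: { reasoning: { effort: "high" } },
  thinking_off: { reasoning: { effort: "none" } },
};
```

With this shape, `thinkingParams(exampleAdapter, "coding")` yields the suppressing parameters, and an adapter with neither field set contributes nothing to the request.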

v1.2.9 (May 10, 2026)

Thinking-model UI benchmark fixes

Packages at 1.2.9 — `@bitslix/blxbench`, optional native CLI packages, and `@bitslix/blxbench-report-browser` (plus its platform packages) are 1.2.9. Upgrade with `npm i -g @bitslix/blxbench`; if you use the desktop report browser, run `/report browser install` again to stay in sync.
Thinking-model token budget no longer consumed by reasoning — OpenRouter thinking models (e.g. GLM-5.1, DeepSeek-R1, QwQ) were spending their entire `max_tokens` budget on internal reasoning, leaving zero tokens to generate the actual HTML output. blxbench now sends a separate `reasoning: { max_tokens: 8000 }` cap so the thinking phase is bounded and `max_tokens` stays reserved for content. This fixes all UI benchmark tests that previously scored 0/1 for these models.
Partial HTML rescued from truncated reasoning output — When a thinking model's response is cut off before `</html>` (e.g. due to a tight token limit), blxbench now extracts whatever valid HTML was generated — stopping at `</body>` if present, otherwise at the end of the response — instead of saving the raw reasoning text as the artifact. This significantly improves scores for responses that contain complete or near-complete HTML but lack the closing tag.
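
The partial-HTML rescue above can be approximated as follows. This is a sketch under assumptions, not the actual extractor: the function name `rescueHtml` is hypothetical, and it assumes the HTML begins at the first doctype or `<html` tag in the response.

```typescript
// Hypothetical sketch: pull usable HTML out of a response that may be truncated
// or prefixed with reasoning text.
function rescueHtml(response: string): string | null {
  // Assumption: the artifact starts at the first doctype or <html tag.
  const start = response.search(/<!doctype html|<html/i);
  if (start === -1) return null;
  const rest = response.slice(start);

  // Complete document: keep everything up to the last </html>.
  const full = rest.match(/^[\s\S]*<\/html>/i);
  if (full) return full[0];

  // Truncated before </html>: stop at </body> if it is present.
  const body = rest.match(/^[\s\S]*<\/body>/i);
  if (body) return body[0];

  // Otherwise keep everything up to the end of the response.
  return rest;
}
```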

v1.2.8 (May 10, 2026)

Reasoning-model scoring & higher token limits

Packages at 1.2.8 — `@bitslix/blxbench`, optional native CLI packages, and `@bitslix/blxbench-report-browser` (plus its platform packages) are 1.2.8. Upgrade with `npm i -g @bitslix/blxbench`; if you use the desktop report browser, run `/report browser install` again to stay in sync.
Thinking display now works for OpenRouter reasoning models — Models like DeepSeek-R1, QwQ, and MiMo routed through OpenRouter send their chain-of-thought in a `reasoning` field (not `reasoning_content` as the direct API does). blxbench now reads both, so the live "Thinking: …" preview appears correctly and the first reasoning token is counted for TTFT even if no regular content token is ever emitted.
Truncated JSON no longer scores as zero — When a model's response is cut off before the closing brace (e.g. due to a tight token limit), blxbench now falls back to keyword-searching the raw text instead of failing the parse and awarding 0 points. Partially completed answers now receive partial credit.
Smarter JSON extraction from reasoning text — The scorer now strips `<think>…</think>` blocks before parsing, searches all fenced code blocks regardless of language tag (` ```CONTEXT `, ` ```answer `, etc.), and picks the last complete JSON object in the output — which is almost always the final answer, not an intermediate example from the model's reasoning.
Higher token limits across all benchmark categories — Token limits for every Suite V2 category have been increased to reflect real-world model output lengths. Refactoring tests (which embed actual code) now allow up to 2 000–2 500 tokens; coding tests up to 2 000–4 400 tokens; analysis categories (debugging, security, speed, reasoning, hallucination, cost) get 700–2 000 tokens — enough headroom for models that reason before answering.
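
The JSON extraction described in this release (strip `<think>` blocks, then prefer the last complete object) might look roughly like this. It is a simplified sketch with hypothetical names (`matchBrace`, `lastJsonObject`): rather than parsing fences, it scans the whole text for balanced objects, which makes fence language tags irrelevant, and it does not include the keyword-search fallback for truncated output.

```typescript
// Returns the index of the brace that closes the object opened at `open`, or -1.
function matchBrace(text: string, open: number): number {
  let depth = 0;
  let inString = false;
  for (let i = open; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (ch === "\\") i++; // skip the escaped character
      else if (ch === '"') inString = false;
    } else if (ch === '"') inString = true;
    else if (ch === "{") depth++;
    else if (ch === "}" && --depth === 0) return i;
  }
  return -1;
}

// Hypothetical sketch: the last complete top-level JSON object outside <think> blocks.
function lastJsonObject(raw: string): unknown {
  const text = raw.replace(/<think>[\s\S]*?<\/think>/g, "");
  let found: unknown = undefined;
  let i = text.indexOf("{");
  while (i !== -1) {
    const end = matchBrace(text, i);
    let next = i + 1;
    if (end !== -1) {
      try {
        found = JSON.parse(text.slice(i, end + 1));
        next = end + 1; // skip past the object we just accepted
      } catch {
        // Not valid JSON; keep scanning from the next brace.
      }
    }
    i = text.indexOf("{", next);
  }
  return found; // undefined when no complete object exists
}
```

Scanning forward and jumping past each accepted object keeps nested objects from being mistaken for the final answer.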

v1.2.7 (May 9, 2026)

Reasoning-model scoring fixed (OpenRouter & all providers)

Packages at 1.2.7 — `@bitslix/blxbench`, optional native CLI packages, and `@bitslix/blxbench-report-browser` (plus its platform packages) are 1.2.7. Upgrade with `npm i -g @bitslix/blxbench`; if you use the desktop report browser, run `/report browser install` again to stay in sync.
Reasoning models now score correctly — Models that emit their answer inside a `reasoning_content` / thinking field (e.g. DeepSeek-R1, QwQ, and similar models routed via OpenRouter) were previously scored as failing with 0 points because blxbench only read the regular `content` field, which was empty. The OpenAI-compatible SSE stream handler and the non-streaming JSON handler now fall back to `reasoning_content` (and `thinking`) when `content` is empty, so the actual generated code reaches the scorer.
Live thinking preview for all streaming providers — While a reasoning model is working through its chain-of-thought before producing a real answer, the run dashboard now shows a "Thinking: …" preview in yellow for every OpenAI-compatible provider (previously this only worked for native Anthropic streams).
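
The fallback described in the first entry of this release can be sketched in a few lines. The field names `content`, `reasoning_content`, and `thinking` come from that entry; the `ProviderMessage` type and the `answerText` helper are hypothetical, and treating whitespace-only content as empty is an assumption.

```typescript
// Hypothetical message shape for an OpenAI-compatible response.
interface ProviderMessage {
  content?: string | null;
  reasoning_content?: string | null;
  thinking?: string | null;
}

// Returns the regular content, falling back to the reasoning fields when it is empty.
function answerText(message: ProviderMessage): string {
  const candidates = [message.content, message.reasoning_content, message.thinking];
  for (let i = 0; i < candidates.length; i++) {
    const value = candidates[i];
    // Assumption: whitespace-only strings count as empty.
    if (typeof value === "string" && value.trim() !== "") return value;
  }
  return "";
}
```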

v1.2.6 (May 9, 2026)

OpenRouter fix & thinking previews

Packages at 1.2.6 — `@bitslix/blxbench`, optional native CLI packages, and `@bitslix/blxbench-report-browser` (plus its platform packages) are 1.2.6. Upgrade with `npm i -g @bitslix/blxbench`; if you use the desktop report browser, run `/report browser install` again to stay in sync.
OpenRouter runs no longer fail with a 400 error — An unrecognised field (`preferredMinThroughput`) was being sent in every OpenRouter request, causing all OpenRouter tests to be skipped with a provider error. The field has been removed; runs now go through as expected.
Live thinking preview — When a reasoning model (e.g. Claude with extended thinking, DeepSeek-R1) is working through its internal reasoning chain before producing its answer, the run dashboard now shows a live "Thinking: …" preview in yellow. Once the first real output token arrives the display switches back to the normal stream preview so you always know what the model is doing.
Playwright UI checks no longer hang indefinitely — A bug caused interaction-check runs (the step that clicks buttons and verifies state changes in the rendered artifact) to sometimes freeze the entire benchmark when Chromium failed to shut down cleanly. The browser-close step now has a 5-second hard deadline; if it is exceeded the process is force-killed and the run continues normally.
More reliable judgment retries — The validation model (used to judge free-form and UI outputs) now retries more robustly on transient API errors, reducing cases where a network hiccup caused a test to incorrectly receive a score of 0.
Broader provider compatibility — The provider adapter now passes additional request headers when configured, improving compatibility with self-hosted and proxy-based API endpoints.
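
The hard deadline on browser shutdown mentioned in the Playwright entry above follows a common pattern: start a timer alongside the close call and force-kill if the timer wins. The sketch below is illustrative only; `closeWithDeadline`, `close`, and `forceKill` are hypothetical stand-ins rather than Playwright or blxbench APIs, and the 5-second default is the figure from that entry.

```typescript
// Hypothetical sketch: give a close operation a hard deadline, then force-kill.
function closeWithDeadline(
  close: () => Promise<void>, // stand-in for the graceful browser close
  forceKill: () => void,      // stand-in for killing the browser process
  deadlineMs: number = 5000,
): Promise<"closed" | "killed"> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      // Deadline exceeded: kill the process so the run can continue.
      forceKill();
      resolve("killed");
    }, deadlineMs);

    close().then(
      () => {
        clearTimeout(timer);
        resolve("closed"); // no effect if the deadline already fired
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

Clearing the timer on both success and failure avoids a stray force-kill after a clean shutdown.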