Abstract: A modern model release reports scores on 40+ benchmarks; behind the release, evaluations were run orders of magnitude more often across checkpoints, hyperparameter sweeps, and design choices. We ask whether scores accumulated across public releases can anticipate a model's performance on benchmarks it has not yet been run on, and decide which evaluations are most worth running next.
We compile a public score matrix of 84 frontier models on 133 benchmarks (2,604 observed cells, 23.3% filled) and find its geometry is approximately rank-2: across complete submatrices, two factors explain more than 90% of the variance. We exploit this structure with BenchPress: logit-space bias-decomposed rank-2 matrix completion, which completes hidden scores within a 4.6 score-point median absolute error. A reliability analysis identifies when these predictions can be trusted — errors fall when the target model has richer observed evidence and behaviorally similar peers — calibrating 90% prediction intervals.
Finally, we stress-test deployment: five probe benchmarks predict the rest of the profile to a median absolute error of 3.93 score points (4.55 on a low-cost allowlist) while preserving 92.1% of pairwise model rankings, and reach 5.0 on brand-new releases.
- [TBD] BenchPress paper and code released.
- Step 1: Set Up Environment
- Step 2: Download the Data
- Step 3: Predict Scores
- Step 4: Reproduce Paper Experiments
- Step 5: Maintain the Living Matrix
- Step 6: Cite Us
To set up the environment for using BenchPress, please follow the steps below.
- Clone this repository and rename it as benchpress

      git clone https://github.com/microsoft/benchpress
      cd benchpress

- Install Packages

  Linux / Mac

      conda create -n benchpress python=3.10
      conda activate benchpress
      pip install -e .  # editable install: makes `from benchpress.*` work everywhere
      python -m benchpress.download_data

  Windows

  TBA
BenchPress uses a citation-backed evaluation matrix:
- 189 frontier LLMs from 28 providers (OpenAI, Anthropic, Google, Meta, DeepSeek, Alibaba, Mistral, xAI, Moonshot AI, Zhipu AI, Microsoft, ByteDance, Amazon, MiniMax, NVIDIA, Cohere, Allen AI, IBM, Liquid AI, LG AI Research, Hugging Face, OpenBMB, TII, Sarvam AI, Shanghai AI Lab, Open Thoughts, Meituan, Mistral AI) — before filtering
- 316 benchmarks across 59 categories — before filtering
- 4,903 observed scores, each citation-backed
- Paper-canonical filter (keep models with $\geq 15$ observed scores and benchmarks with $\geq 8$ observed models), with duplicate/setting-variant exclusions: 84 models × 133 benchmarks, 2,604 observed (23.3% fill rate)
- Smart clip: only percentage-scale benchmarks are clipped to [0, 100]; Elo/rating benchmarks (Codeforces, Chatbot Arena, GDP-Val) are left unclipped
Observed (blue) vs. missing (white) cells in the paper-canonical 84 × 133 score matrix.
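The filter thresholds and the smart clip described above can be sketched in NumPy. This is an illustration only, not the repository's construction code: the function names are hypothetical, missing cells are assumed to be stored as NaN, and the thresholds are assumed to be re-applied until the matrix stops shrinking (the source does not say whether the filter is one-pass or iterative). The duplicate/setting-variant exclusions are not modeled.

```python
import numpy as np

def filter_matrix(M, min_per_model=15, min_per_bench=8):
    """Return the row and column indices that survive the density filter.

    Assumption: NaN marks a missing cell, and the two thresholds are
    re-applied until no further row or column is dropped.
    """
    rows = np.arange(M.shape[0])
    cols = np.arange(M.shape[1])
    while True:
        observed = ~np.isnan(M[np.ix_(rows, cols)])
        keep_r = observed.sum(axis=1) >= min_per_model
        keep_c = observed.sum(axis=0) >= min_per_bench
        if keep_r.all() and keep_c.all():
            return rows, cols
        rows, cols = rows[keep_r], cols[keep_c]

def smart_clip(M, is_percentage):
    """Clip only percentage-scale columns to [0, 100]; other columns are untouched."""
    out = M.copy()
    out[:, is_percentage] = np.clip(out[:, is_percentage], 0.0, 100.0)
    return out
```

Here `is_percentage` is a boolean array with one entry per benchmark column; Elo/rating columns would carry `False` and therefore pass through unclipped.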
The public dataset is published at:
- Hugging Face: https://huggingface.co/datasets/microsoft/benchpress-score-matrix
- Local cache after download: benchpress/data/llm_benchmark_data.json
BenchPress is a living dataset: new model releases, benchmark updates, and corrected citations can be added as the evaluation landscape changes. We welcome pull requests that add citation-backed scores, new models, new benchmarks, or provenance fixes.
After running python -m benchpress.download_data, the package creates a local
JSON cache under benchpress/data/:
benchpress/data/
├── llm_benchmark_data.json # Machine-readable scores
└── _hf_cache/ # Downloaded CSV mirror, when JSON is rebuilt from tables
The Hugging Face release is the public table export. It includes scores_all
(the pre-filter public score table), scores_paper (the paper-canonical
84-model x 133-benchmark matrix), plus model and benchmark metadata. The
downloader rebuilds llm_benchmark_data.json from this public mirror when a
full JSON artifact is not present. That rebuilt JSON is sufficient for package
use and paper-matrix reproduction, but it is not the complete internal audit
artifact: rich fields such as candidates[] and raw cost-evidence traces may be
absent.
llm_benchmark_data.json is the canonical source for code. It contains:
- models[]: model metadata, including provider and release/canonical-setting fields when available
- benchmarks[]: benchmark metadata, including category, scale, canonical-setting fields, and cost evidence when available
- scores[]: observed cells as {model_id, benchmark_id, score, reference_url} records
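As a rough illustration of how the scores[] records map onto a model × benchmark matrix, the sketch below pivots a list of such records into a NumPy array with NaN for unobserved cells. The function name and the toy ids are hypothetical, and the alphabetical index order is an assumption that need not match the package's own MODEL_IDX / BENCH_IDX.

```python
import numpy as np

def scores_to_matrix(scores):
    """Pivot {model_id, benchmark_id, score, ...} records into a dense matrix.

    Unobserved cells are NaN. Indices are assigned alphabetically, which is
    an assumption and may differ from the package's ordering.
    """
    model_ids = sorted({s["model_id"] for s in scores})
    bench_ids = sorted({s["benchmark_id"] for s in scores})
    model_idx = {m: i for i, m in enumerate(model_ids)}
    bench_idx = {b: j for j, b in enumerate(bench_ids)}
    M = np.full((len(model_ids), len(bench_ids)), np.nan)
    for s in scores:
        M[model_idx[s["model_id"]], bench_idx[s["benchmark_id"]]] = s["score"]
    return M, model_idx, bench_idx

# Toy records with made-up ids, in the same shape as scores[]
records = [
    {"model_id": "model-a", "benchmark_id": "bench-x", "score": 50.0},
    {"model_id": "model-a", "benchmark_id": "bench-y", "score": 70.0},
    {"model_id": "model-b", "benchmark_id": "bench-x", "score": 60.0},
]
M, model_idx, bench_idx = scores_to_matrix(records)
fill_rate = 1.0 - np.isnan(M).mean()  # fraction of observed cells
```

The same fill-rate computation applied to the paper-canonical matrix corresponds to 2,604 / (84 × 133) ≈ 23.3%.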
Use it directly from the package:
    from benchpress.evaluation_harness import M_FULL, MODEL_IDX, BENCH_IDX
    score = M_FULL[MODEL_IDX["gpt-5.2"], BENCH_IDX["gpqa_diamond"]]

Or load the JSON yourself:
    import json
    from pathlib import Path
    data = json.loads(Path("benchpress/data/llm_benchmark_data.json").read_text())
    scores = data["scores"]

Or load the public Hugging Face mirror:
    from datasets import load_dataset
    ds = load_dataset("microsoft/benchpress-score-matrix", "scores_paper")

Run predictions from the command line with predict.py:

    # Predict all missing scores for a model
    python predict.py --model gpt-5.2
    # Predict a single score
    python predict.py --model gpt-5.2 --benchmark gpqa_diamond
    # List available models / benchmarks
    python predict.py --list-models
    python predict.py --list-benchmarks

Provide a few known scores; BenchPress predicts the rest.
    python predict.py --add-model my-model \
      --scores "simpleqa=50.0,gpqa_diamond=70.0,aime_2025=55.0"

What happens under the hood
BenchPress uses Logit + Bias ALS:
- Transform percentage-scale scores with logit; leave non-percentage benchmarks in their native score space.
- Z-score each benchmark column, then fit a bias-decomposed low-rank ALS model with per-model bias, per-benchmark bias, and rank-2 latent factors.
Predictions are inverted back to score space and smart-clipped: percentage-scale benchmarks are clamped to [0, 100]; Elo/rating benchmarks (Codeforces, Chatbot Arena, GDP-Val, Swelancer, Vending-Bench) are left unclipped.
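The recipe above (logit transform, per-benchmark z-scoring, bias-decomposed rank-2 ALS, inverse transform) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the repository's implementation: the function names, the ridge penalty `lam`, the iteration count, the random initialization, and the `eps` clipping inside the logit are all choices made here for the example.

```python
import numpy as np

def logit(p, eps=1e-3):
    p = np.clip(p, eps, 1 - eps)  # keep 0% and 100% scores finite (assumed eps)
    return np.log(p / (1 - p))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_bias_als(Z, rank=2, lam=1.0, n_iters=50, seed=0):
    """Fit Z[i, j] ~ a[i] + b[j] + U[i] @ V[j] on the observed (non-NaN) cells."""
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(Z)
    n, m = Z.shape
    a, b = np.zeros(n), np.zeros(m)
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((m, rank))
    ridge = lam * np.eye(rank)
    for _ in range(n_iters):
        # Per-model bias: ridge-shrunk mean of each row's residuals
        R = Z - b[None, :] - U @ V.T
        a = np.array([R[i, obs[i]].sum() / (obs[i].sum() + lam) for i in range(n)])
        # Per-benchmark bias: ridge-shrunk mean of each column's residuals
        R = Z - a[:, None] - U @ V.T
        b = np.array([R[obs[:, j], j].sum() / (obs[:, j].sum() + lam) for j in range(m)])
        # Alternating ridge regressions for the latent factors
        R = Z - a[:, None] - b[None, :]
        for i in range(n):
            Vi = V[obs[i]]
            U[i] = np.linalg.solve(Vi.T @ Vi + ridge, Vi.T @ R[i, obs[i]])
        for j in range(m):
            Uj = U[obs[:, j]]
            V[j] = np.linalg.solve(Uj.T @ Uj + ridge, Uj.T @ R[obs[:, j], j])
    return a[:, None] + b[None, :] + U @ V.T

def complete(scores, is_percentage, rank=2, lam=1.0):
    """Complete a score matrix (NaN = missing) and return it in score space."""
    X = scores.astype(float).copy()
    X[:, is_percentage] = logit(X[:, is_percentage] / 100.0)
    mu = np.nanmean(X, axis=0)
    sd = np.nanstd(X, axis=0)
    sd[sd == 0] = 1.0  # guard against constant columns
    Z_hat = fit_bias_als((X - mu) / sd, rank=rank, lam=lam)
    X_hat = Z_hat * sd + mu
    # Invert the logit only on percentage columns; rating columns stay in native units
    X_hat[:, is_percentage] = 100.0 * sigmoid(X_hat[:, is_percentage])
    return X_hat
```

In this sketch the sigmoid already keeps percentage-scale predictions inside [0, 100], so an explicit clamp is redundant, while Elo/rating columns are returned without any clipping.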
| Method | MedAPE ↓ | MedAE ↓ | Within ±3 pts | Within ±5 pts | Coverage |
|---|---|---|---|---|---|
| BenchPress (Logit Bias ALS) | 7.8% | 4.60 | 36.6% | 52.8% | 100% |
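For readers who want to recompute the table's columns on their own predictions, the sketch below uses the conventional definitions of these metrics. The exact conventions are assumptions: the function name is hypothetical, boundaries are treated as inclusive (an error of exactly 3 counts as within ±3), cells with a true score of zero are skipped for MedAPE, and coverage is taken to be the share of held-out cells that received a prediction.

```python
import numpy as np

def summarize(y_true, y_pred):
    """Summary metrics over held-out cells; NaN in y_pred means no prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    covered = ~np.isnan(y_pred)
    truth = y_true[covered]
    err = np.abs(y_pred[covered] - truth)
    nonzero = truth != 0  # percentage error is undefined when the true score is 0
    return {
        "MedAPE": 100.0 * np.median(err[nonzero] / np.abs(truth[nonzero])),
        "MedAE": np.median(err),
        "within_3": 100.0 * np.mean(err <= 3.0),
        "within_5": 100.0 * np.mean(err <= 5.0),
        "coverage": 100.0 * np.mean(covered),
    }
```

MedAPE is scale-relative while MedAE is in raw score points, which is why the two can rank methods differently when benchmarks with low absolute scores dominate the held-out set.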
Canonical fold-level BenchPress predictions are generated under benchpress/evaluation/default_predictions/benchpress_default/ when needed.
All numbers use per-model 3-fold holdout (10 seeds × 3 folds = 30 folds): each seed partitions every model's observed benchmark scores into three disjoint test folds. Primary metric: MedAPE (median absolute percentage error). Matrix: 84 models × 133 benchmarks, 2,604 observed cells.
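The per-model holdout described above can be sketched as follows. This is an illustrative reconstruction, not the harness's own code: the function name, the NumPy random generator, and the round-robin assignment after shuffling are assumptions.

```python
import numpy as np

def per_model_folds(observed, n_folds=3, seed=0):
    """Split each model's observed cells into n_folds disjoint test masks.

    observed: boolean matrix (models x benchmarks), True where a score exists.
    Returns a list of n_folds boolean masks of the same shape.
    """
    rng = np.random.default_rng(seed)
    fold_of = np.full(observed.shape, -1)
    for i in range(observed.shape[0]):
        cols = np.flatnonzero(observed[i])
        rng.shuffle(cols)
        # Round-robin over the shuffled columns keeps fold sizes within one of each other
        fold_of[i, cols] = np.arange(len(cols)) % n_folds
    return [fold_of == k for k in range(n_folds)]

# 10 seeds x 3 folds = 30 held-out masks, mirroring the protocol's fold count
observed = np.random.default_rng(1).random((6, 12)) < 0.6
all_folds = [mask for seed in range(10) for mask in per_model_folds(observed, seed=seed)]
```

For each mask, the masked cells are hidden from the completer and used only for scoring, so every observed cell is tested exactly once per seed.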
benchpress/
├── benchpress/ # Core library (editable install)
│ ├── data/ # Canonical score/cost data + schema
│ ├── evaluation/ # Folds + generated default prediction artifacts
│ ├── methods/ # Transforms, completers, predictors, confidence
│ ├── build_benchmark_matrix/ # Raw sources → canonical matrix construction
│ ├── plot_helpers/ # Shared plotting + visual identity
│ ├── evaluation_harness.py # Matrix loader, holdout protocols, metrics
│ ├── io_utils.py # Shared JSON / gzip JSON / atomic writes
│ └── shard_utils.py # Shared shard execution/merge helpers
│
├── experiments/ # All experiments, mirroring paper sections
│ ├── sec1_intro/hero_figure/ # §1 — Hero figure
│ ├── sec3_low_rank/ # §3 — Low-rank structure (matrix viz, SVD)
│ ├── sec4_building_benchpress/ # §4 — Building BenchPress (recipe ablations)
│ ├── sec5_findings/ # §5 — Findings (predictability, ranking, robustness)
│ └── appendix_*/ # Appendix experiments
├── predict.py # CLI prediction tool
└── pyproject.toml
Each experiment leaf folder follows a consistent structure:
experiments/sec4_building_benchpress/method_comparison/
├── run.py # Run the experiment
├── plot.py # Generate figures
├── manifest.json # Generated method/transform grid
├── results.json # Generated aggregate results
├── predictions/ # Generated bottleneck fold-level predictions
└── figures/ # Generated figures (PDF + PNG)
Large generated artifacts are not checked into the release repository. This includes results.json, manifest.json, predictions/*.npz, confidence-score caches, generated figures, and generated tables. Experiment scripts follow an artifact-first policy: read an existing artifact if present, generate it from the documented upstream command if it is missing, and fail with the missing path and command if the upstream job cannot complete in the current environment.
This means plotting and table scripts can be run from a clean clone, but some first runs are intentionally expensive because they recreate fold-level prediction shards, confidence-calibration scores, API-model outputs, or other bottleneck artifacts. For example, method_comparison/plot.py will create missing predictions/*.npz, manifest.json, and results.json by running method_comparison/run.py --merge; confidence plots similarly create confidence_scores.npz and results.json via confidence_calibration/run.py --ensure. Once generated, these artifacts stay local and are reused by later scripts.
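The artifact-first policy can be pictured with a small helper like the one below. It is a hypothetical sketch of the pattern, not a function from the repository: the name `ensure_artifact` and its signature are assumptions, and real scripts may cache, shard, or log differently.

```python
import subprocess
from pathlib import Path

def ensure_artifact(path, upstream_cmd):
    """Return path if it exists; otherwise run upstream_cmd to create it.

    Fails with the missing path and the command when the upstream job
    cannot produce the artifact in the current environment.
    """
    path = Path(path)
    if path.exists():
        return path  # reuse the local artifact
    result = subprocess.run(upstream_cmd)
    if result.returncode != 0 or not path.exists():
        raise FileNotFoundError(
            f"Missing artifact {path}; regenerate it with: {' '.join(upstream_cmd)}"
        )
    return path
```

A plotting script following this pattern would call the helper once per required input before reading it, which is why the first run from a clean clone can be slow and later runs are fast.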
    conda activate benchpress
    python experiments/sec4_building_benchpress/method_comparison/run.py --merge
    python experiments/sec4_building_benchpress/method_comparison/plot.py

BenchPress is maintained as a living score matrix. After adding citation-backed models, benchmarks, or scores to benchpress/data/llm_benchmark_data.json, use the maintenance wrappers to inspect and refresh downstream artifacts:

    python maintenance/check_updates.py
    maintenance/run_set.sh   # dry-run preview by default

To execute the matrix refresh and selected downstream steps, set the relevant flags, for example:

    DRY_RUN=0 RUN_MATRIX=1 RUN_GREEDY=1 RUN_PLOTS=1 RUN_WEBSITE=1 maintenance/run_set.sh

See maintenance/README.md for the full checklist.
@misc{zeng2026dontneedruneval,
title={You Don't Need to Run Every Eval},
author={Yuchen Zeng and Dimitris Papailiopoulos},
year={2026},
eprint={2606.24020},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2606.24020}
}