You Don't Need to Run Every Eval

Yuchen Zeng, Dimitris Papailiopoulos

Microsoft Research, AI Frontiers

Abstract: A modern model release reports scores on 40+ benchmarks; behind the release, evaluations were run orders of magnitude more often across checkpoints, hyperparameter sweeps, and design choices. We ask whether scores accumulated across public releases can anticipate a model's performance on benchmarks it has not yet been run on, and decide which evaluations are most worth running next.

We compile a public score matrix of 84 frontier models on 133 benchmarks (2,604 observed cells, 23.3% filled) and find its geometry is approximately rank-2: across complete submatrices, two factors explain more than 90% of the variance. We exploit this structure with BenchPress, a logit-space, bias-decomposed rank-2 matrix completion method that predicts hidden scores with a median absolute error of 4.6 score points. A reliability analysis identifies when these predictions can be trusted: errors fall when the target model has richer observed evidence and behaviorally similar peers, and we use these signals to calibrate 90% prediction intervals.

Finally, we stress-test deployment: five probe benchmarks predict the rest of the profile to a median absolute error of 3.93 score points (4.55 on a low-cost allowlist) while preserving 92.1% of pairwise model rankings, and reach a median absolute error of 5.0 on brand-new releases.

News 🚀

[TBD] BenchPress paper and code released.

Step 1: Set Up Environment

To set up the environment for using BenchPress, please follow the steps below.

Clone this repository into a directory named benchpress

git clone https://github.com/microsoft/benchpress
cd benchpress

Install Packages

Linux / Mac

conda create -n benchpress python=3.10
conda activate benchpress
pip install -e .          # editable install — makes `from benchpress.*` work everywhere
python -m benchpress.download_data

Windows

TBA

Step 2: Download the Data

BenchPress uses a citation-backed evaluation matrix:

189 frontier LLMs from 28 providers (OpenAI, Anthropic, Google, Meta, DeepSeek, Alibaba, Mistral, xAI, Moonshot AI, Zhipu AI, Microsoft, ByteDance, Amazon, MiniMax, NVIDIA, Cohere, Allen AI, IBM, Liquid AI, LG AI Research, Hugging Face, OpenBMB, TII, Sarvam AI, Shanghai AI Lab, Open Thoughts, Meituan, Mistral AI) — before filtering
316 benchmarks across 59 categories — before filtering
4,903 observed scores, each citation-backed
Paper-canonical filter (keep models with $\geq 15$ observed scores and benchmarks with $\geq 8$ observed models), with duplicate/setting-variant exclusions: 84 models × 133 benchmarks, 2,604 observed (23.3% fill rate)
Smart clip: only percentage-scale benchmarks are clipped to [0, 100]; Elo/rating benchmarks (e.g., Codeforces, Chatbot Arena, GDP-Val) are left unclipped
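
The count thresholds in the paper-canonical filter can be illustrated with a small sketch. This is not the repository's filtering code: the function name filter_matrix is hypothetical, the duplicate/setting-variant exclusions are not modeled, and repeating the thresholds until both hold is an assumption (the text above does not say whether they are applied once or iteratively).

```python
import numpy as np

def filter_matrix(scores, min_per_model=15, min_per_benchmark=8):
    """Return (row_idx, col_idx) of models/benchmarks that satisfy both thresholds.

    scores: 2-D float array, NaN marks an unobserved cell.
    Assumption: the thresholds are re-applied until no row or column is dropped.
    """
    observed = ~np.isnan(scores)
    rows = np.arange(scores.shape[0])
    cols = np.arange(scores.shape[1])
    while True:
        sub = observed[np.ix_(rows, cols)]
        keep_rows = sub.sum(axis=1) >= min_per_model      # scores per model
        keep_cols = sub.sum(axis=0) >= min_per_benchmark  # models per benchmark
        if keep_rows.all() and keep_cols.all():
            return rows, cols
        rows, cols = rows[keep_rows], cols[keep_cols]

# Toy 3 x 3 matrix with lowered thresholds so the effect is visible.
nan = np.nan
toy = np.array([
    [90.0, 80.0, 70.0],
    [85.0, nan, 60.0],
    [nan, nan, 55.0],
])
rows, cols = filter_matrix(toy, min_per_model=2, min_per_benchmark=2)
filtered = toy[np.ix_(rows, cols)]
fill_rate = float(np.mean(~np.isnan(filtered)))
```

On the toy matrix the third model and the second benchmark are dropped, leaving a fully observed 2 × 2 block.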

Observed (blue) vs. missing (white) cells in the paper-canonical 84 × 133 score matrix.

The public dataset is published at:

Hugging Face: https://huggingface.co/datasets/microsoft/benchpress-score-matrix
Local cache after download: benchpress/data/llm_benchmark_data.json

BenchPress is a living dataset: new model releases, benchmark updates, and corrected citations can be added as the evaluation landscape changes. We welcome pull requests that add citation-backed scores, new models, new benchmarks, or provenance fixes.

After running python -m benchpress.download_data, the package creates a local JSON cache under benchpress/data/:

benchpress/data/
├── llm_benchmark_data.json        # Machine-readable scores
└── _hf_cache/                      # Downloaded CSV mirror, when JSON is rebuilt from tables

The Hugging Face release is the public table export. It includes scores_all (the pre-filter public score table), scores_paper (the paper-canonical 84-model × 133-benchmark matrix), plus model and benchmark metadata. The downloader rebuilds llm_benchmark_data.json from this public mirror when a full JSON artifact is not present. That rebuilt JSON is sufficient for package use and paper-matrix reproduction, but it is not the complete internal audit artifact: rich fields such as candidates[] and raw cost-evidence traces may be absent.

llm_benchmark_data.json is the canonical source for code. It contains:

models[]: model metadata, including provider and release/canonical-setting fields when available
benchmarks[]: benchmark metadata, including category, scale, canonical-setting fields, and cost evidence when available
scores[]: observed cells as {model_id, benchmark_id, score, reference_url} records

Use it directly from the package:

from benchpress.evaluation_harness import M_FULL, MODEL_IDX, BENCH_IDX

score = M_FULL[MODEL_IDX["gpt-5.2"], BENCH_IDX["gpqa_diamond"]]

or load the JSON yourself:

import json
from pathlib import Path

data = json.loads(Path("benchpress/data/llm_benchmark_data.json").read_text())
scores = data["scores"]

Or load the public Hugging Face mirror:

from datasets import load_dataset

ds = load_dataset("microsoft/benchpress-score-matrix", "scores_paper")
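
If you work from the raw scores[] records rather than M_FULL, you will usually want a dense model × benchmark array with NaN for missing cells. The sketch below shows one way to pivot records of the documented shape {model_id, benchmark_id, score, reference_url}. The helper name records_to_matrix and the toy records (including their example.com URLs) are made up for illustration; how the package itself orders rows and columns may differ.

```python
import numpy as np

def records_to_matrix(records):
    """Pivot score records into a NaN-filled matrix plus index lookups."""
    model_ids = sorted({r["model_id"] for r in records})
    bench_ids = sorted({r["benchmark_id"] for r in records})
    model_index = {m: i for i, m in enumerate(model_ids)}
    bench_index = {b: j for j, b in enumerate(bench_ids)}
    matrix = np.full((len(model_ids), len(bench_ids)), np.nan)
    for r in records:
        matrix[model_index[r["model_id"]], bench_index[r["benchmark_id"]]] = r["score"]
    return matrix, model_index, bench_index

# Hypothetical records in the same shape as data["scores"].
toy_records = [
    {"model_id": "model-a", "benchmark_id": "bench-x", "score": 71.5,
     "reference_url": "https://example.com/a-x"},
    {"model_id": "model-a", "benchmark_id": "bench-y", "score": 40.0,
     "reference_url": "https://example.com/a-y"},
    {"model_id": "model-b", "benchmark_id": "bench-x", "score": 65.0,
     "reference_url": "https://example.com/b-x"},
]
matrix, model_index, bench_index = records_to_matrix(toy_records)
fill_rate = float(np.mean(~np.isnan(matrix)))  # fraction of observed cells
```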

Step 3: Predict Scores

Predict for an Existing Model

# Predict all missing scores for a model
python predict.py --model gpt-5.2

# Predict a single score
python predict.py --model gpt-5.2 --benchmark gpqa_diamond

# List available models / benchmarks
python predict.py --list-models
python predict.py --list-benchmarks

Add Your Own Model

Provide a few known scores; BenchPress predicts the rest.

python predict.py --add-model my-model \
  --scores "simpleqa=50.0,gpqa_diamond=70.0,aime_2025=55.0"
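
When the known scores live in a Python dict, the --scores string can be assembled programmatically. The sketch below only builds the argument list using the --add-model and --scores flags shown above; the helper name build_add_model_command is hypothetical, and the final subprocess call is left commented out because it needs to run from the repository root.

```python
import sys

def build_add_model_command(model_name, known_scores):
    """Build the predict.py invocation for a new model from a {benchmark: score} dict."""
    scores_arg = ",".join(f"{bench}={value}" for bench, value in known_scores.items())
    return [sys.executable, "predict.py", "--add-model", model_name, "--scores", scores_arg]

cmd = build_add_model_command(
    "my-model",
    {"simpleqa": 50.0, "gpqa_diamond": 70.0, "aime_2025": 55.0},
)
# To execute from the repository root:
# import subprocess; subprocess.run(cmd, check=True)
```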

What happens under the hood

BenchPress uses Logit + Bias ALS:

Transform percentage-scale scores with logit; leave non-percentage benchmarks in their native score space.
Z-score each benchmark column, then fit a bias-decomposed low-rank ALS model with per-model bias, per-benchmark bias, and rank-2 latent factors.
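
The two steps above can be sketched in a few lines of NumPy. This is an illustrative reimplementation under stated assumptions, not the code in benchpress/methods: the function names, the clipping constant eps, the ridge strength reg, the iteration count, and the synthetic data are all choices made for this example, and every toy benchmark is treated as percentage-scale. The output stays in z-scored logit space.

```python
import numpy as np

def logit_percent(scores, eps=0.5):
    """Map percentage scores to logit space; eps keeps 0 and 100 finite (assumed value)."""
    p = np.clip(scores, eps, 100.0 - eps) / 100.0
    return np.log(p / (1.0 - p))

def fit_bias_als(Z, rank=2, reg=0.1, n_iters=50, seed=0):
    """Fit Z[i, j] ~ a[i] + b[j] + U[i] @ V[j] on the observed (non-NaN) cells."""
    rng = np.random.default_rng(seed)
    n_models, n_bench = Z.shape
    obs = ~np.isnan(Z)
    a, b = np.zeros(n_models), np.zeros(n_bench)
    U = 0.1 * rng.standard_normal((n_models, rank))
    V = 0.1 * rng.standard_normal((n_bench, rank))
    ridge = reg * np.eye(rank + 1)  # simplification: the bias term is penalized too

    def solve(features, target):
        # Ridge regression of target on [1, features]: returns [bias, factors...]
        X = np.column_stack([np.ones(len(features)), features])
        return np.linalg.solve(X.T @ X + ridge, X.T @ target)

    for _ in range(n_iters):
        for i in range(n_models):  # update each model's bias and factors
            cols = obs[i]
            if cols.any():
                w = solve(V[cols], Z[i, cols] - b[cols])
                a[i], U[i] = w[0], w[1:]
        for j in range(n_bench):  # update each benchmark's bias and factors
            rows = obs[:, j]
            if rows.any():
                w = solve(U[rows], Z[rows, j] - a[rows])
                b[j], V[j] = w[0], w[1:]
    return a[:, None] + b[None, :] + U @ V.T

# Synthetic percentage scores with rank-2 structure in logit space, 30% hidden.
rng = np.random.default_rng(0)
true_logits = (0.5 * rng.standard_normal((30, 2)) @ rng.standard_normal((2, 20))
               + rng.standard_normal(20))
scores = 100.0 / (1.0 + np.exp(-true_logits))
observed = rng.random(scores.shape) < 0.7
scores[~observed] = np.nan

L = logit_percent(scores)                                # step 1: logit transform
col_mean, col_std = np.nanmean(L, axis=0), np.nanstd(L, axis=0)
Z = (L - col_mean) / col_std                             # step 2: z-score each column
pred_z = fit_bias_als(Z)                                 # dense predictions in z-space
train_rmse = float(np.sqrt(np.mean((pred_z[observed] - Z[observed]) ** 2)))
```

Each inner update is a small ridge regression, so one sweep over models and benchmarks is cheap; pred_z fills every cell, including the ones that were hidden.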

Predictions are inverted back to score space and smart-clipped: percentage-scale benchmarks are clamped to [0, 100]; Elo/rating benchmarks (Codeforces, Chatbot Arena, GDP-Val, Swelancer, Vending-Bench) are left unclipped.
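
A minimal sketch of that inversion, assuming predictions arrive as z-scored values with per-benchmark means and standard deviations saved from the forward transform. The function name invert_and_clip, the is_percentage flag array, and the numbers in the example are hypothetical.

```python
import numpy as np

def invert_and_clip(pred_z, col_mean, col_std, is_percentage):
    """Undo z-scoring, undo the logit on percentage columns, and smart-clip."""
    native = np.asarray(pred_z, dtype=float) * col_std + col_mean  # undo z-score
    out = native.copy()
    pct = np.asarray(is_percentage, dtype=bool)
    out[:, pct] = 100.0 / (1.0 + np.exp(-native[:, pct]))  # inverse logit, in percent
    # The sigmoid already lands in [0, 100]; the clamp mirrors the description above.
    out[:, pct] = np.clip(out[:, pct], 0.0, 100.0)
    return out  # rating-style columns are returned unclipped

# Column 0 is percentage-scale, column 1 is a rating-style benchmark.
pred_z = np.array([[0.0, 1.0],
                   [10.0, -9.0]])
scores_out = invert_and_clip(
    pred_z,
    col_mean=np.array([0.0, 1500.0]),
    col_std=np.array([1.0, 200.0]),
    is_percentage=[True, False],
)
```

In the example the rating column keeps a negative prediction unchanged, while the percentage column stays within [0, 100].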

Method	MedAPE ↓	MedAE ↓	Within ±3 pts	Within ±5 pts	Coverage
BenchPress (Logit Bias ALS)	7.8%	4.60	36.6%	52.8%	100%
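
For readers who want to compute the same columns on their own predictions, here is a small sketch. It assumes MedAPE is the median of 100 × |prediction − truth| / |truth| over cells with nonzero truth, and that the "within" columns use an inclusive threshold; the function name summarize_errors is hypothetical, and the Coverage column is not computed.

```python
import numpy as np

def summarize_errors(y_true, y_pred):
    """Median absolute / percentage error and within-threshold rates."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_pred - y_true)
    nonzero = y_true != 0  # assumption: zero-valued targets are skipped for MedAPE
    ape = 100.0 * abs_err[nonzero] / np.abs(y_true[nonzero])
    return {
        "MedAPE": float(np.median(ape)),
        "MedAE": float(np.median(abs_err)),
        "within_3": float(np.mean(abs_err <= 3.0)),
        "within_5": float(np.mean(abs_err <= 5.0)),
    }

summary = summarize_errors([50, 80, 20, 100], [52, 76, 26, 100])
# summary -> {"MedAPE": 4.5, "MedAE": 3.0, "within_3": 0.5, "within_5": 0.75}
```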

Canonical fold-level BenchPress predictions are generated under benchpress/evaluation/default_predictions/benchpress_default/ when needed.

All numbers use per-model 3-fold holdout (10 seeds × 3 folds = 30 folds): each seed partitions every model's observed benchmark scores into three disjoint test folds. Primary metric: MedAPE (median absolute percentage error). Matrix: 84 models × 133 benchmarks, 2,604 observed cells.
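
The per-model holdout can be pictured as assigning every observed cell a fold label within its own row. The sketch below is one plausible construction, not the harness in evaluation_harness.py: the function name per_model_folds, the round-robin assignment after shuffling, and the use of the seed index as the random seed are assumptions.

```python
import numpy as np

def per_model_folds(observed_mask, n_folds=3, seed=0):
    """Label each observed cell with a fold id in [0, n_folds); unobserved cells get -1."""
    rng = np.random.default_rng(seed)
    fold_id = np.full(observed_mask.shape, -1, dtype=int)
    for i in range(observed_mask.shape[0]):
        cols = np.flatnonzero(observed_mask[i])
        shuffled = rng.permutation(cols)
        # Round-robin over the shuffled columns keeps fold sizes within one of each other.
        fold_id[i, shuffled] = np.arange(len(cols)) % n_folds
    return fold_id

toy_mask = np.array([
    [1, 1, 1, 1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 1, 0, 1],
], dtype=bool)

# 10 seeds x 3 folds = 30 (seed, fold, test_mask) triples.
all_folds = [
    (seed, k, per_model_folds(toy_mask, seed=seed) == k)
    for seed in range(10)
    for k in range(3)
]
fold_ids = per_model_folds(toy_mask, seed=0)
```

For a given fold k, the test cells are fold_ids == k and the training cells are the remaining observed cells, so every model keeps roughly two thirds of its observed scores during fitting.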

Step 4: Reproduce Paper Experiments

Repository Structure

benchpress/
├── benchpress/                           # Core library (editable install)
│   ├── data/                             #   Canonical score/cost data + schema
│   ├── evaluation/                       #   Folds + generated default prediction artifacts
│   ├── methods/                          #   Transforms, completers, predictors, confidence
│   ├── build_benchmark_matrix/           #   Raw sources → canonical matrix construction
│   ├── plot_helpers/                     #   Shared plotting + visual identity
│   ├── evaluation_harness.py             #   Matrix loader, holdout protocols, metrics
│   ├── io_utils.py                       #   Shared JSON / gzip JSON / atomic writes
│   └── shard_utils.py                    #   Shared shard execution/merge helpers
│
├── experiments/                          # All experiments, mirroring paper sections
│   ├── sec1_intro/hero_figure/           #   §1 — Hero figure
│   ├── sec3_low_rank/                    #   §3 — Low-rank structure (matrix viz, SVD)
│   ├── sec4_building_benchpress/         #   §4 — Building BenchPress (recipe ablations)
│   ├── sec5_findings/                    #   §5 — Findings (predictability, ranking, robustness)
│   └── appendix_*/                       #   Appendix experiments
├── predict.py                            # CLI prediction tool
└── pyproject.toml

Each experiment leaf folder follows a consistent structure:

experiments/sec4_building_benchpress/method_comparison/
├── run.py              # Run the experiment
├── plot.py             # Generate figures
├── manifest.json       # Generated method/transform grid
├── results.json        # Generated aggregate results
├── predictions/        # Generated bottleneck fold-level predictions
└── figures/            # Generated figures (PDF + PNG)

Artifact Policy

Large generated artifacts are not checked into the release repository. This includes results.json, manifest.json, predictions/*.npz, confidence-score caches, generated figures, and generated tables. Experiment scripts follow an artifact-first policy: read an existing artifact if present, generate it from the documented upstream command if it is missing, and fail with the missing path and command if the upstream job cannot complete in the current environment.

This means plotting and table scripts can be run from a clean clone, but some first runs are intentionally expensive because they recreate fold-level prediction shards, confidence-calibration scores, API-model outputs, or other bottleneck artifacts. For example, method_comparison/plot.py will create missing predictions/*.npz, manifest.json, and results.json by running method_comparison/run.py --merge; confidence plots similarly create confidence_scores.npz and results.json via confidence_calibration/run.py --ensure. Once generated, these artifacts stay local and are reused by later scripts.
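
The artifact-first policy amounts to a small read-or-generate-or-fail helper. The sketch below illustrates the pattern only; ensure_artifact is a hypothetical name, not a function from the repository, and the example commands are stand-ins for the real upstream scripts.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def ensure_artifact(path, command):
    """Return path if it exists; otherwise run command and fail loudly if it is still missing."""
    path = Path(path)
    if path.exists():
        return path  # reuse the existing artifact
    result = subprocess.run(command)
    if result.returncode != 0 or not path.exists():
        raise FileNotFoundError(
            f"Missing artifact {path}; generate it with: {' '.join(map(str, command))}"
        )
    return path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "results.json"
    # Stand-in upstream command: writes an empty JSON object to the target path.
    make_cmd = [sys.executable, "-c",
                "import sys; open(sys.argv[1], 'w').write('{}')", str(target)]
    generated = ensure_artifact(target, make_cmd).exists()
    # Second call: the file exists, so the (failing) command is never run.
    fail_cmd = [sys.executable, "-c", "raise SystemExit(1)"]
    reused = ensure_artifact(target, fail_cmd) == target
```

Including the command in the error message matches the policy's requirement that a failure reports both the missing path and how to regenerate it.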

Run a Single Experiment

conda activate benchpress

python experiments/sec4_building_benchpress/method_comparison/run.py --merge
python experiments/sec4_building_benchpress/method_comparison/plot.py

Step 5: Maintain the Living Matrix

BenchPress is maintained as a living score matrix. After adding citation-backed models, benchmarks, or scores to benchpress/data/llm_benchmark_data.json, use the maintenance wrappers to inspect and refresh downstream artifacts:

python maintenance/check_updates.py
maintenance/run_set.sh   # dry-run preview by default

To execute the matrix refresh and selected downstream steps, set the relevant flags, for example:

DRY_RUN=0 RUN_MATRIX=1 RUN_GREEDY=1 RUN_PLOTS=1 RUN_WEBSITE=1 maintenance/run_set.sh

See maintenance/README.md for the full checklist.

Step 6: Cite Us

@misc{zeng2026dontneedruneval,
  title={You Don't Need to Run Every Eval},
  author={Yuchen Zeng and Dimitris Papailiopoulos},
  year={2026},
  eprint={2606.24020},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2606.24020}
}
