You Don't Need to Run Every Eval

Yuchen Zeng, Dimitris Papailiopoulos

Microsoft Research, AI Frontiers

Project page · Code · Dataset · Paper

Abstract: A modern model release reports scores on 40+ benchmarks; behind the release, evaluations were run orders of magnitude more often across checkpoints, hyperparameter sweeps, and design choices. We ask whether scores accumulated across public releases can anticipate a model's performance on benchmarks it has not yet been run on, and help decide which evaluations are most worth running next.

We compile a public score matrix of 84 frontier models on 133 benchmarks (2,604 observed cells, 23.3% filled) and find its geometry is approximately rank-2: across complete submatrices, two factors explain more than 90% of the variance. We exploit this structure with BenchPress: logit-space bias-decomposed rank-2 matrix completion, which completes hidden scores within a 4.6 score-point median absolute error. A reliability analysis identifies when these predictions can be trusted — errors fall when the target model has richer observed evidence and behaviorally similar peers — calibrating 90% prediction intervals.

Finally, we stress-test deployment: five probe benchmarks predict the rest of the profile to a median absolute error of 3.93 score points (4.55 on a low-cost allowlist) while preserving 92.1% of pairwise model rankings, and reach 5.0 on brand-new releases.

BenchPress hero figure

News 🚀

  • [TBD] BenchPress paper and code released.

Step 1: Set Up Environment

To set up the environment for using BenchPress, please follow the steps below.

  1. Clone this repository and change into the benchpress directory

     git clone https://github.com/microsoft/benchpress
     cd benchpress
  2. Install Packages

    Linux / Mac
    conda create -n benchpress python=3.10
    conda activate benchpress
    pip install -e .          # editable install — makes `from benchpress.*` work everywhere
    python -m benchpress.download_data
    Windows TBA

Step 2: Download the Data

BenchPress uses a citation-backed evaluation matrix:

  • 189 frontier LLMs from 28 providers (OpenAI, Anthropic, Google, Meta, DeepSeek, Alibaba, Mistral, xAI, Moonshot AI, Zhipu AI, Microsoft, ByteDance, Amazon, MiniMax, NVIDIA, Cohere, Allen AI, IBM, Liquid AI, LG AI Research, Hugging Face, OpenBMB, TII, Sarvam AI, Shanghai AI Lab, Open Thoughts, Meituan, Mistral AI) — before filtering
  • 316 benchmarks across 59 categories — before filtering
  • 4,903 observed scores, each citation-backed
  • Paper-canonical filter (keep models with $\geq 15$ observed scores and benchmarks with $\geq 8$ observed models), with duplicate/setting-variant exclusions: 84 models × 133 benchmarks, 2,604 observed (23.3% fill rate)
  • Smart clip: only percentage-scale benchmarks are clipped to [0, 100]; Elo/rating benchmarks (e.g., Codeforces, Chatbot Arena, GDP-Val) are left unclipped
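
The size filter in the list above can be sketched as follows. This is a minimal illustration, not the repository's matrix-construction code: `paper_filter` is a hypothetical helper, the choice to repeat the filter until both thresholds hold is an assumption (the text does not say whether one pass or several are used), and the duplicate/setting-variant exclusions are not modeled.

```python
import numpy as np

def paper_filter(M, min_scores_per_model=15, min_models_per_benchmark=8):
    """Drop sparse rows/columns until both thresholds hold (assumed iterative).

    M is a (models x benchmarks) float array with np.nan for missing cells.
    Returns the indices of the kept rows and the kept columns.
    """
    rows = np.arange(M.shape[0])
    cols = np.arange(M.shape[1])
    while True:
        observed = ~np.isnan(M[np.ix_(rows, cols)])
        keep_rows = observed.sum(axis=1) >= min_scores_per_model
        keep_cols = observed.sum(axis=0) >= min_models_per_benchmark
        if keep_rows.all() and keep_cols.all():
            return rows, cols
        rows, cols = rows[keep_rows], cols[keep_cols]

# Toy matrix with made-up values; thresholds lowered to 2 so the example is small.
toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, np.nan],
    [np.nan, np.nan, 5.0],
])
rows, cols = paper_filter(toy, min_scores_per_model=2, min_models_per_benchmark=2)

# Fill rate of the paper-canonical matrix, from the counts quoted above.
fill_rate_pct = 100 * 2604 / (84 * 133)
```

The last line reproduces the quoted fill rate: 2,604 observed cells out of 84 × 133 = 11,172 gives about 23.3%.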

BenchPress score matrix observation pattern (84 models × 133 benchmarks, 23.3% filled)
Observed (blue) vs. missing (white) cells in the paper-canonical 84 × 133 score matrix.

The public dataset is published at:

BenchPress is a living dataset: new model releases, benchmark updates, and corrected citations can be added as the evaluation landscape changes. We welcome pull requests that add citation-backed scores, new models, new benchmarks, or provenance fixes.

After running python -m benchpress.download_data, the package creates a local JSON cache under benchpress/data/:

benchpress/data/
├── llm_benchmark_data.json        # Machine-readable scores
└── _hf_cache/                      # Downloaded CSV mirror, when JSON is rebuilt from tables

The Hugging Face release is the public table export. It includes scores_all (the pre-filter public score table), scores_paper (the paper-canonical 84-model × 133-benchmark matrix), plus model and benchmark metadata. The downloader rebuilds llm_benchmark_data.json from this public mirror when a full JSON artifact is not present. That rebuilt JSON is sufficient for package use and paper-matrix reproduction, but it is not the complete internal audit artifact: rich fields such as candidates[] and raw cost-evidence traces may be absent.

llm_benchmark_data.json is the canonical source for code. It contains:

  • models[]: model metadata, including provider and release/canonical-setting fields when available
  • benchmarks[]: benchmark metadata, including category, scale, canonical-setting fields, and cost evidence when available
  • scores[]: observed cells as {model_id, benchmark_id, score, reference_url} records

Use it directly from the package:

from benchpress.evaluation_harness import M_FULL, MODEL_IDX, BENCH_IDX

score = M_FULL[MODEL_IDX["gpt-5.2"], BENCH_IDX["gpqa_diamond"]]

Or load the JSON yourself:

import json
from pathlib import Path

data = json.loads(Path("benchpress/data/llm_benchmark_data.json").read_text())
scores = data["scores"]

Or load the public Hugging Face mirror:

from datasets import load_dataset

ds = load_dataset("microsoft/benchpress-score-matrix", "scores_paper")
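
If you load the raw scores[] records yourself, you will usually want them as a models × benchmarks matrix with missing cells marked. The sketch below shows one way to pivot them with NumPy. It is not the package's loader (the package already exposes M_FULL, MODEL_IDX, and BENCH_IDX): `build_matrix` is a hypothetical helper, the alphabetical row/column order is an assumption, and the records are made-up examples that only follow the field names listed above.

```python
import numpy as np

def build_matrix(records):
    """Pivot {model_id, benchmark_id, score, ...} records into a NaN-filled matrix."""
    model_ids = sorted({r["model_id"] for r in records})
    bench_ids = sorted({r["benchmark_id"] for r in records})
    model_idx = {m: i for i, m in enumerate(model_ids)}
    bench_idx = {b: j for j, b in enumerate(bench_ids)}
    M = np.full((len(model_ids), len(bench_ids)), np.nan)  # NaN marks a missing cell
    for r in records:
        M[model_idx[r["model_id"]], bench_idx[r["benchmark_id"]]] = r["score"]
    return M, model_idx, bench_idx

# Made-up records for illustration only; real ones come from data["scores"].
records = [
    {"model_id": "model-a", "benchmark_id": "bench-x", "score": 71.5,
     "reference_url": "https://example.com/a"},
    {"model_id": "model-a", "benchmark_id": "bench-y", "score": 40.0,
     "reference_url": "https://example.com/b"},
    {"model_id": "model-b", "benchmark_id": "bench-x", "score": 65.0,
     "reference_url": "https://example.com/c"},
]
M, model_idx, bench_idx = build_matrix(records)
missing = np.isnan(M)  # True where a (model, benchmark) pair has no observed score
```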

Step 3: Predict Scores

Predict for an Existing Model

# Predict all missing scores for a model
python predict.py --model gpt-5.2

# Predict a single score
python predict.py --model gpt-5.2 --benchmark gpqa_diamond

# List available models / benchmarks
python predict.py --list-models
python predict.py --list-benchmarks

Add Your Own Model

Provide a few known scores; BenchPress predicts the rest.

python predict.py --add-model my-model \
  --scores "simpleqa=50.0,gpqa_diamond=70.0,aime_2025=55.0"

What happens under the hood

BenchPress uses Logit + Bias ALS:

  1. Transform percentage-scale scores with logit; leave non-percentage benchmarks in their native score space.
  2. Z-score each benchmark column, then fit a bias-decomposed low-rank ALS model with per-model bias, per-benchmark bias, and rank-2 latent factors.
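
Step 2 can be sketched as below. This is a minimal illustration of column z-scoring followed by a bias-decomposed rank-2 alternating least squares fit, not the code in benchpress/methods/. The function names, the ridge penalty `lam`, the iteration count, the initialization, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def zscore_columns(X):
    """Z-score each column using only its observed (non-NaN) entries."""
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    std = np.where(std > 0, std, 1.0)  # guard constant columns
    return (X - mean) / std, mean, std

def _ridge(X, y, lam):
    """Closed-form ridge regression: solve (X^T X + lam I) theta = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def bias_als(Z, rank=2, lam=0.5, n_iters=30, seed=0):
    """Fit Z[i, j] ~ mu + a[i] + b[j] + U[i] . V[j] on the observed cells of Z."""
    rng = np.random.default_rng(seed)
    n, m = Z.shape
    mask = ~np.isnan(Z)
    mu = Z[mask].mean()
    a, b = np.zeros(n), np.zeros(m)
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((m, rank))
    for _ in range(n_iters):
        for i in range(n):  # update model bias and factors, benchmark side fixed
            obs = mask[i]
            if obs.any():
                X = np.column_stack([np.ones(obs.sum()), V[obs]])
                theta = _ridge(X, Z[i, obs] - mu - b[obs], lam)
                a[i], U[i] = theta[0], theta[1:]
        for j in range(m):  # update benchmark bias and factors, model side fixed
            obs = mask[:, j]
            if obs.any():
                X = np.column_stack([np.ones(obs.sum()), U[obs]])
                theta = _ridge(X, Z[obs, j] - mu - a[obs], lam)
                b[j], V[j] = theta[0], theta[1:]
    return mu + a[:, None] + b[None, :] + U @ V.T

# Synthetic low-rank data (made up) with 40% of the cells hidden.
rng = np.random.default_rng(0)
n, m = 40, 30
truth = (rng.standard_normal((n, 1)) + rng.standard_normal((1, m))
         + rng.standard_normal((n, 2)) @ rng.standard_normal((2, m)))
observed = rng.random((n, m)) < 0.6
X = np.where(observed, truth, np.nan)

Z, col_mean, col_std = zscore_columns(X)
X_hat = bias_als(Z) * col_std + col_mean  # undo the z-scoring

held_out = ~observed
rmse_fit = np.sqrt(np.mean((X_hat[held_out] - truth[held_out]) ** 2))
baseline = np.broadcast_to(col_mean, truth.shape)  # predict each column's mean
rmse_baseline = np.sqrt(np.mean((baseline[held_out] - truth[held_out]) ** 2))
```

Each inner update is a small ridge regression with an intercept, which is why the bias and the latent factors of a row (or column) can be solved jointly in closed form.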

Predictions are inverted back to score space and smart-clipped: percentage-scale benchmarks are clamped to [0, 100]; Elo/rating benchmarks (Codeforces, Chatbot Arena, GDP-Val, Swelancer, Vending-Bench) are left unclipped.
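
The transform, its inverse, and the smart clip can be sketched as follows. This is an illustration under stated assumptions rather than the package's implementation: the function names are hypothetical, and `EPS` is an assumed guard that keeps scores of exactly 0 or 100 finite in logit space.

```python
import numpy as np

EPS = 1e-3  # assumed guard; the source does not state how 0 and 100 are handled

def to_logit(score_pct):
    """Map a percentage-scale score in [0, 100] to logit space."""
    p = np.clip(np.asarray(score_pct, dtype=float) / 100.0, EPS, 1.0 - EPS)
    return np.log(p / (1.0 - p))

def from_logit(z):
    """Invert the logit transform back to a percentage-scale score."""
    return 100.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def smart_clip(pred, is_percentage):
    """Clamp percentage-scale predictions to [0, 100]; leave the others untouched."""
    pred = np.asarray(pred, dtype=float)
    return np.where(is_percentage, np.clip(pred, 0.0, 100.0), pred)

round_trip = from_logit(to_logit(70.0))
clipped = smart_clip(np.array([104.2, -3.0, 1510.0]),
                     np.array([True, True, False]))
```

A value that passes through `from_logit` already lies strictly between 0 and 100, so in this sketch the clamp only matters for percentage-scale predictions produced some other way; rating-style values such as the 1510.0 above pass through unchanged.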

| Method | MedAPE ↓ | MedAE ↓ | Within ±3 pts | Within ±5 pts | Coverage |
|---|---|---|---|---|---|
| BenchPress (Logit Bias ALS) | 7.8% | 4.60 | 36.6% | 52.8% | 100% |
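
For readers who want to compute the same columns on their own predictions, here is a minimal sketch. The definitions are assumptions based on the column names: MedAPE as the median of |pred − true| / |true|, MedAE as the median absolute error in score points, and coverage as the share of held-out cells that received a prediction. `summarize` is a hypothetical helper and the numbers are made up.

```python
import numpy as np

def summarize(pred, true):
    """Summary metrics over held-out cells; NaN in pred means 'no prediction'."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    covered = ~np.isnan(pred)
    abs_err = np.abs(pred[covered] - true[covered])
    # Note: the percentage error is undefined when a true score is exactly 0.
    return {
        "MedAPE_pct": 100 * np.median(abs_err / np.abs(true[covered])),
        "MedAE": np.median(abs_err),
        "within_3_pct": 100 * np.mean(abs_err <= 3),
        "within_5_pct": 100 * np.mean(abs_err <= 5),
        "coverage_pct": 100 * covered.mean(),
    }

# Made-up held-out scores and predictions.
metrics = summarize(pred=[52.0, 76.0, 26.0, 40.0], true=[50.0, 80.0, 20.0, 40.0])
```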

Canonical fold-level BenchPress predictions are generated under benchpress/evaluation/default_predictions/benchpress_default/ when needed.

All numbers use per-model 3-fold holdout (10 seeds × 3 folds = 30 folds): each seed partitions every model's observed benchmark scores into three disjoint test folds. Primary metric: MedAPE (median absolute percentage error). Matrix: 84 models × 133 benchmarks, 2,604 observed cells.
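
The fold construction described above can be sketched like this. It is an illustration, not the harness in evaluation_harness.py: `per_model_folds` is a hypothetical helper, the round-robin assignment after shuffling is an assumption, and the observation mask is random toy data.

```python
import numpy as np

def per_model_folds(mask, n_folds=3, seed=0):
    """Split each model's observed cells into n_folds disjoint test masks."""
    rng = np.random.default_rng(seed)
    fold_of = np.full(mask.shape, -1)  # -1 marks unobserved cells
    for i in range(mask.shape[0]):
        cols = np.flatnonzero(mask[i])  # benchmarks observed for model i
        rng.shuffle(cols)
        fold_of[i, cols] = np.arange(len(cols)) % n_folds  # round-robin assignment
    return [fold_of == k for k in range(n_folds)]

# Toy observation pattern: 5 models, 12 benchmarks, roughly half observed.
rng = np.random.default_rng(1)
mask = rng.random((5, 12)) < 0.5

# 10 seeds x 3 folds = 30 train/test splits, matching the protocol's count.
all_folds = [fold for seed in range(10) for fold in per_model_folds(mask, seed=seed)]
```

For each test mask, the training set is `mask & ~test_mask`, so every model keeps roughly two thirds of its own observed scores while the remaining third is hidden.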

Step 4: Reproduce Paper Experiments

Repository Structure

benchpress/
├── benchpress/                           # Core library (editable install)
│   ├── data/                             #   Canonical score/cost data + schema
│   ├── evaluation/                       #   Folds + generated default prediction artifacts
│   ├── methods/                          #   Transforms, completers, predictors, confidence
│   ├── build_benchmark_matrix/           #   Raw sources → canonical matrix construction
│   ├── plot_helpers/                     #   Shared plotting + visual identity
│   ├── evaluation_harness.py             #   Matrix loader, holdout protocols, metrics
│   ├── io_utils.py                       #   Shared JSON / gzip JSON / atomic writes
│   └── shard_utils.py                    #   Shared shard execution/merge helpers
│
├── experiments/                          # All experiments, mirroring paper sections
│   ├── sec1_intro/hero_figure/           #   §1 — Hero figure
│   ├── sec3_low_rank/                    #   §3 — Low-rank structure (matrix viz, SVD)
│   ├── sec4_building_benchpress/         #   §4 — Building BenchPress (recipe ablations)
│   ├── sec5_findings/                    #   §5 — Findings (predictability, ranking, robustness)
│   └── appendix_*/                       #   Appendix experiments
├── predict.py                            # CLI prediction tool
└── pyproject.toml

Each experiment leaf folder follows a consistent structure:

experiments/sec4_building_benchpress/method_comparison/
├── run.py              # Run the experiment
├── plot.py             # Generate figures
├── manifest.json       # Generated method/transform grid
├── results.json        # Generated aggregate results
├── predictions/        # Generated bottleneck fold-level predictions
└── figures/            # Generated figures (PDF + PNG)

Artifact Policy

Large generated artifacts are not checked into the release repository. This includes results.json, manifest.json, predictions/*.npz, confidence-score caches, generated figures, and generated tables. Experiment scripts follow an artifact-first policy: read an existing artifact if present, generate it from the documented upstream command if it is missing, and fail with the missing path and command if the upstream job cannot complete in the current environment.

This means plotting and table scripts can be run from a clean clone, but some first runs are intentionally expensive because they recreate fold-level prediction shards, confidence-calibration scores, API-model outputs, or other bottleneck artifacts. For example, method_comparison/plot.py will create missing predictions/*.npz, manifest.json, and results.json by running method_comparison/run.py --merge; confidence plots similarly create confidence_scores.npz and results.json via confidence_calibration/run.py --ensure. Once generated, these artifacts stay local and are reused by later scripts.
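
The artifact-first policy can be pictured with a small helper like the one below. It is a sketch of the pattern only: `ensure_artifact` is a hypothetical name and is not taken from the repository, and the real scripts may structure this differently.

```python
import subprocess
import sys
from pathlib import Path

def ensure_artifact(path, command):
    """Return path if it exists; otherwise run the upstream command to create it.

    Raises FileNotFoundError naming the missing path and the command when the
    upstream job fails or does not produce the artifact.
    """
    path = Path(path)
    if path.exists():
        return path  # reuse the existing local artifact
    result = subprocess.run(command, check=False)
    if result.returncode != 0 or not path.exists():
        raise FileNotFoundError(
            f"Missing artifact: {path}. Upstream command did not produce it: "
            + " ".join(command)
        )
    return path

# Example call shape, using the paths and flag mentioned in this section:
# ensure_artifact(
#     "experiments/sec4_building_benchpress/method_comparison/results.json",
#     [sys.executable,
#      "experiments/sec4_building_benchpress/method_comparison/run.py", "--merge"],
# )
```

Checking for the file before and after the command keeps repeat runs cheap and makes a failed upstream job report exactly which artifact is missing.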

Run a Single Experiment

conda activate benchpress

python experiments/sec4_building_benchpress/method_comparison/run.py --merge
python experiments/sec4_building_benchpress/method_comparison/plot.py

Step 5: Maintain the Living Matrix

BenchPress is maintained as a living score matrix. After adding citation-backed models, benchmarks, or scores to benchpress/data/llm_benchmark_data.json, use the maintenance wrappers to inspect and refresh downstream artifacts:

python maintenance/check_updates.py
maintenance/run_set.sh   # dry-run preview by default

To execute the matrix refresh and selected downstream steps, set the relevant flags, for example:

DRY_RUN=0 RUN_MATRIX=1 RUN_GREEDY=1 RUN_PLOTS=1 RUN_WEBSITE=1 maintenance/run_set.sh

See maintenance/README.md for the full checklist.

Step 6: Cite Us

@misc{zeng2026dontneedruneval,
  title={You Don't Need to Run Every Eval},
  author={Yuchen Zeng and Dimitris Papailiopoulos},
  year={2026},
  eprint={2606.24020},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2606.24020}
}
