Measuring the effectiveness of these agents #94

johnbillion · 2025-10-13T10:23:26Z

johnbillion
Oct 13, 2025

I'd love to see some supporting evidence in this repo that documents the effectiveness of each of the agents compared to using a vanilla sub-agent for the same task.

Discussion #42 raised the concern that the prompts are very generic. I use Claude Code daily and recently started using vanilla sub-agents for tasks that don't need all the context of the main workflow, and I've found them to be effective based on my prompt alone. My gut feeling is that the light-touch agents in this repo won't make much difference to the effectiveness of each sub-agent, but I would love to be proved wrong.

In #42 you requested quantifiable metrics to back up a claim that the sub-agents aren't effective. To flip this on its head, are there quantifiable metrics that can be used to demonstrate that the sub-agents in this repo are more effective than a vanilla prompt? I presume they must be more effective, otherwise there would be no need for them to exist, but there's no mention of this in the repo description.

How can the effectiveness of the sub-agents in this repo be measured, documented, and tracked over time in order to demonstrate that they are actually more effective than a vanilla sub-agent using the same prompt?

wshobson · 2026-05-22T17:06:38Z

wshobson
May 22, 2026
Maintainer

Hey @johnbillion, fair question, and sorry for letting this sit so long. Honest answer:

What exists today. The repo has a quality-evaluation framework in plugins/plugin-eval/ (docs: docs/plugin-eval.md) that scores plugins/skills across 10 quality dimensions using three layers — deterministic static analysis, LLM-based semantic judging, and Monte Carlo simulation with confidence intervals. It produces letter grades and badges. That's the closest thing to "how good is this plugin" we have.

What plugin-eval does not answer is your actual question: do these agents outperform a vanilla subagent with the same prompt on the same task? That's a head-to-head benchmark and I don't have one. I'd love to.

The reason it's hard: a fair comparison needs a task corpus, a reproducible harness, and an evaluator that doesn't bias toward the specialized prompt. The current docs/round-trip-results.md covers real-CLI verification recipes (does the generated artifact actually load and run on each harness?), but that's a correctness check, not an effectiveness benchmark.

What I'd accept as evidence either way. If anyone in this thread (yourself included) wants to wire up a small benchmark, here's the shape I have in mind: pick 10 tasks, run each through (a) a vanilla subagent and (b) a relevant specialized agent from this repo, then score the completions blind. I'd surface the results in the README regardless of which side wins. Your point in #42 stands: if the agents don't materially help, that's worth knowing.
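For anyone picking this up, here is a rough sketch of how the blind scoring step could work. It is not code from this repo. The names `run_vanilla`, `run_specialized`, and `judge` are hypothetical callables you would supply yourself (for example, wrappers around your own harness and a human or LLM grader). The idea is to randomize which completion is shown as "A" and which as "B" so the judge cannot tell them apart, then tally wins and run a simple two-sided sign test.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BlindPair:
    task: str
    first: str                  # completion shown to the judge as "A"
    second: str                 # completion shown to the judge as "B"
    first_is_specialized: bool  # kept hidden from the judge


def make_blind_pairs(
    tasks: List[str],
    run_vanilla: Callable[[str], str],      # hypothetical: runs a vanilla subagent
    run_specialized: Callable[[str], str],  # hypothetical: runs a repo agent
    rng: random.Random,
) -> List[BlindPair]:
    """Run both arms on every task and randomize the A/B presentation order."""
    pairs = []
    for task in tasks:
        vanilla = run_vanilla(task)
        specialized = run_specialized(task)
        if rng.random() < 0.5:
            pairs.append(BlindPair(task, specialized, vanilla, True))
        else:
            pairs.append(BlindPair(task, vanilla, specialized, False))
    return pairs


def tally(
    pairs: List[BlindPair],
    judge: Callable[[str, str, str], str],  # hypothetical: returns "A", "B", or "tie"
) -> Dict[str, int]:
    """Map the judge's blind verdicts back to the arm that produced each winner."""
    wins = {"specialized": 0, "vanilla": 0, "tie": 0}
    for p in pairs:
        verdict = judge(p.task, p.first, p.second)
        if verdict == "tie":
            wins["tie"] += 1
        elif (verdict == "A") == p.first_is_specialized:
            wins["specialized"] += 1
        else:
            wins["vanilla"] += 1
    return wins


def sign_test_p(wins_a: int, wins_b: int) -> float:
    """Two-sided sign test p-value, ignoring ties (null: each arm wins half the time)."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Stub arms and a stub judge, purely to show the call shape.
    tasks = [f"task-{i}" for i in range(10)]
    pairs = make_blind_pairs(
        tasks,
        run_vanilla=lambda t: f"vanilla output for {t}",
        run_specialized=lambda t: f"specialized output for {t}",
        rng=random.Random(0),
    )
    result = tally(pairs, judge=lambda task, a, b: "tie")
    print(result, sign_test_p(result["specialized"], result["vanilla"]))
```

With only 10 tasks the test has little power: an 8 to 2 split still gives p ≈ 0.11, so a small run like this is better read as a first signal than as proof either way.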

No promises on when I'll build this myself, but the door is open for contribution.
