Measuring the effectiveness of these agents #94
Replies: 1 comment
Hey @johnbillion, fair question, and sorry for letting this sit so long. Honest answer: What exists today. The repo has a quality-evaluation framework in What The reason it's hard: a fair comparison needs a task corpus, a reproducible harness, and an evaluator that doesn't bias toward the specialized prompt. The current
What I'd accept as evidence either way. If anyone in this thread, yourself included, wants to wire up a small benchmark (pick 10 tasks, run each through (a) a vanilla subagent and (b) a relevant specialized agent from this repo, and score the completions blind), I'd surface the results in the README regardless of which side wins. Your point in #42 stands: if the agents don't materially help, that's worth knowing. No promises on when I'll build this myself, but the door is open for contributions.
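To make the "score completions blind" part concrete, here is a minimal sketch of how the blinding and tallying could work. It is not code from this repo: the function names `blind_pairs` and `unblind`, the dict keys, and the "A"/"B"/"tie" verdict format are all assumptions for illustration, and it presumes the completions for both arms have already been collected by whatever harness you use.

```python
import random
from collections import Counter


def blind_pairs(completions, seed=0):
    """Hide which arm produced which completion.

    `completions` is a list of dicts with the (hypothetical) keys
    "task", "vanilla", and "specialized". Returns scoring sheets that
    expose only anonymous "A"/"B" outputs, plus the key needed to
    unblind them after scoring.
    """
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    sheets, key = [], []
    for item in completions:
        arms = ["vanilla", "specialized"]
        rng.shuffle(arms)  # randomize which arm is shown as "A" for this task
        sheets.append({"task": item["task"], "A": item[arms[0]], "B": item[arms[1]]})
        key.append({"A": arms[0], "B": arms[1]})
    return sheets, key


def unblind(verdicts, key):
    """Map per-task verdicts ("A", "B", or "tie") back to arms and count wins."""
    tally = Counter()
    for verdict, mapping in zip(verdicts, key):
        # Anything that is not "A" or "B" is counted as a tie.
        tally[mapping.get(verdict, "tie")] += 1
    return tally
```

The scorer only ever sees `sheets`; `key` stays with whoever runs the benchmark until all verdicts are in. With 10 tasks the tally is only a rough signal, so I'd report the raw win/loss/tie counts rather than a single headline number.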
I'd love to see some supporting evidence in this repo that documents the effectiveness of each of the agents compared to using a vanilla sub-agent for the same task.
Discussion #42 raised the concern that the prompts are very generic. I use Claude Code daily and recently started using vanilla sub-agents for tasks that don't need all the context of the main workflow, and I've found them to be effective based on my prompt alone. My gut feeling is that the light-touch agents in this repo won't make a material difference to the effectiveness of each sub-agent, but I would love to be proved wrong.
In #42 you requested quantifiable metrics to back up a claim that the sub-agents aren't effective. To flip this on its head, are there quantifiable metrics that can be used to demonstrate that the sub-agents in this repo are more effective than a vanilla prompt? I presume they must be more effective, otherwise there would be no need for them to exist, but there's no mention of this in the repo description.
How can the effectiveness of the sub-agents in this repo be measured, documented, and tracked over time in order to demonstrate that they are actually more effective than a vanilla sub-agent using the same prompt?