How are you testing skills or ensuring their quality? #333
Replies: 2 comments 1 reply
Great question. This is one of the underexplored problems as the skill ecosystem scales. I've been building production agentic systems for a while now (including Braid, an open-source LangGraph agent builder), and testing agent behaviors is fundamentally different from testing deterministic code. Here's a framework I've landed on:

Three layers of skill quality assessment:

1. Functional Coverage (does it do what it claims?)
2. Output Quality (is the result actually good?)
3. Ecosystem Fit (does it play well with others?)
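
To make the third layer a bit more concrete, here is a minimal sketch of a co-loading check. It is an illustration, not something from this repo: `find_coload_regressions` and the `runner(prompt, skills)` callable are hypothetical names, and the runner stands in for whatever mechanism you use to run a prompt with a given set of skills loaded. The idea is to flag prompts whose behavioral check passes with one skill loaded alone but fails once every skill is loaded together.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical signature: runner(prompt, skills) -> model output text.
Runner = Callable[[str, List[str]], str]
# A behavioral check takes the output text and returns True when it is acceptable.
Check = Callable[[str], bool]


def find_coload_regressions(
    runner: Runner,
    skill: str,
    all_skills: Iterable[str],
    cases: Iterable[Tuple[str, Check]],
) -> List[str]:
    """Return prompts that pass with `skill` alone but fail with all skills loaded."""
    everything = list(all_skills)
    regressions = []
    for prompt, check in cases:
        alone_ok = check(runner(prompt, [skill]))
        together_ok = check(runner(prompt, everything))
        # Only count cases the skill handles on its own, so failures that
        # exist even in isolation are not blamed on co-loading.
        if alone_ok and not together_ok:
            regressions.append(prompt)
    return regressions
```

Passing the runner in as a parameter keeps the check independent of any particular agent framework, and lets you exercise the logic itself with a fake runner.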
For PR review specifically, I'd suggest requiring:
The bigger question this raises is whether there should be a standard skill testing harness in the repo, something like
The framework above is solid. Wanted to share what actually worked in practice when I built a testing setup for my own skills.

The biggest lesson: golden-prompt tests are necessary but not sufficient. LLM outputs shift between model versions, so your assertions need to be behavioral, not textual. I landed on a pattern like this:

```python
# test_my_skill.py
# Assumes run_skill_with_prompt() and run_without_skill() are provided by your test harness.


def test_skill_triggers_correctly():
    result = run_skill_with_prompt("generate a SQL query for user signups")
    assert "SELECT" in result.upper()
    assert "users" in result.lower() or "signups" in result.lower()
    assert not result.startswith("I'm sorry")  # no refusals


def test_skill_does_not_trigger_on_adjacent_input():
    result = run_without_skill("tell me about databases")
    # Should work fine without the skill loaded
    assert len(result) > 50
```

For output quality, LLM-as-judge works surprisingly well if you give the judge a rubric. A simple 1-5 scale with clear anchors beats a vague "is this good?" prompt every time.

The piece that's hardest to test is ecosystem fit: does adding skill X degrade skill Y? I've seen skills that work perfectly in isolation but eat 3k tokens of context when co-loaded, starving other skills. A simple smoke test that loads all skills together and runs a diverse set of prompts catches these regressions.

One practical suggestion: a
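
As a rough illustration of the rubric-based judge described above, here is a minimal sketch. Everything in it is an assumption for the sake of the example: the anchor wording in `RUBRIC`, the `SCORE: <n>` reply format, and the `call_judge` callable, which stands in for whichever model client you use to send a prompt and get text back.

```python
import re
from typing import Callable, Dict

# Hypothetical 1-5 rubric; each score has a concrete anchor the judge can match against.
RUBRIC: Dict[int, str] = {
    1: "Unusable: wrong task, refusal, or invalid output.",
    2: "Addresses the task but has major errors.",
    3: "Mostly correct; needs noticeable manual fixes.",
    4: "Correct, with minor style or completeness issues.",
    5: "Correct, complete, and ready to use as-is.",
}


def build_judge_prompt(task: str, output: str, rubric: Dict[int, str] = RUBRIC) -> str:
    """Render the rubric anchors, the task, and the skill output into one judge prompt."""
    anchors = "\n".join(f"{score}: {text}" for score, text in sorted(rubric.items()))
    return (
        "Score the OUTPUT for the TASK using this rubric.\n"
        f"{anchors}\n\n"
        f"TASK:\n{task}\n\n"
        f"OUTPUT:\n{output}\n\n"
        "Reply with 'SCORE: <n>' on the first line, then one sentence of justification."
    )


def parse_score(reply: str, rubric: Dict[int, str] = RUBRIC) -> int:
    """Extract the numeric score from the judge reply and validate it against the rubric."""
    match = re.search(r"SCORE:\s*(\d+)", reply)
    if match is None:
        raise ValueError("judge reply has no 'SCORE: <n>' line")
    score = int(match.group(1))
    if score not in rubric:
        raise ValueError(f"score {score} is outside the rubric")
    return score


def judge(task: str, output: str, call_judge: Callable[[str], str]) -> int:
    """Ask the judge model (via the injected callable) to score one output."""
    return parse_score(call_judge(build_judge_prompt(task, output)))
```

In a test you might then assert something like `judge(task, result, call_judge) >= 4`, which stays behavioral rather than textual. Rejecting replies that fall outside the rubric keeps a malformed judge response from silently passing.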
For the skills included in this repo, how do you assess the quality of each skill? How do you determine whether they deliver the functionality they claim, and that they deliver quality results? And if someone submits a PR, how do you determine that it enhances the overall quality of a skill?