How are you testing skills or ensuring their quality? #333

ColinEberhardt · 2026-02-04T05:08:00Z

For the skills included in this repo, how do you assess the quality of each skill? How do you determine whether a skill delivers the functionality it claims, and that its results are of good quality? And when someone submits a PR, how do you determine that it enhances the overall quality of a skill?

chasewhughes · 2026-03-28T20:49:36Z

Great question — this is one of the underexplored problems as the skill ecosystem scales.

I've been building production agentic systems for a while now (including Braid, an open-source LangGraph agent builder), and testing agent behaviors is fundamentally different from testing deterministic code. Here's a framework I've landed on:

Three layers of skill quality assessment:

1. Functional Coverage (does it do what it claims?)

Golden-prompt regression tests: a curated set of 5-10 representative inputs per skill with expected behavioral outcomes (not exact string matches — behavioral assertions like "response includes a SQL query targeting the users table")
Edge case prompts that test boundary conditions of the skill's trigger descriptions
Negative tests: inputs that shouldn't trigger the skill but are semantically adjacent
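
A minimal sketch of how golden-prompt cases with behavioral assertions might be represented. The `GoldenCase` structure, the `evaluate` helper, and the `triggered` flag are hypothetical names for illustration; how you detect that a skill triggered depends on your own harness.

```python
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenCase:
    prompt: str
    should_trigger: bool                  # False for semantically adjacent negatives
    checks: list[Callable[[str], bool]]   # behavioral predicates on the output text

def mentions_sql_on_users(output: str) -> bool:
    """True if the output contains a SELECT that reads from a users table."""
    pattern = r"\bselect\b.*\bfrom\s+users\b"
    return re.search(pattern, output, re.I | re.S) is not None

def evaluate(case: GoldenCase, triggered: bool, output: str) -> bool:
    """A case passes when triggering matches and every behavioral check holds."""
    if triggered != case.should_trigger:
        return False
    return all(check(output) for check in case.checks)

CASES = [
    GoldenCase("generate a SQL query for user signups", True, [mentions_sql_on_users]),
    GoldenCase("tell me about databases", False, []),  # adjacent, should not trigger
]
```

Each check is a predicate on behavior rather than an exact string, so rewording between model versions does not fail the case.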

2. Output Quality (is the result actually good?)

This is the hard one. LLM-as-judge evaluation works surprisingly well here — have a second model score outputs on rubrics specific to the skill's domain (e.g., for a "write user stories" skill: Does it follow INVEST? Are acceptance criteria testable?)
A/B comparison: run the same prompt with and without the skill loaded, compare outputs. The skill should demonstrably improve the result
Human spot-checks on a rolling basis, especially after PRs
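
The A/B comparison could be sketched roughly as follows. The `run_with_skill`, `run_without_skill`, and `judge` callables are assumptions standing in for whatever your harness and judge model provide; nothing here calls a real API.

```python
from statistics import mean
from typing import Callable

Judge = Callable[[str, str], int]  # (prompt, output) -> rubric score, e.g. 1-5

def ab_compare(
    prompts: list[str],
    run_with_skill: Callable[[str], str],
    run_without_skill: Callable[[str], str],
    judge: Judge,
) -> dict:
    """Score each prompt with and without the skill and report the mean uplift."""
    with_scores = [judge(p, run_with_skill(p)) for p in prompts]
    base_scores = [judge(p, run_without_skill(p)) for p in prompts]
    return {
        "with_skill": mean(with_scores),
        "baseline": mean(base_scores),
        "uplift": mean(with_scores) - mean(base_scores),
    }
```

A skill that earns its place should show a positive `uplift` across a reasonably diverse prompt set, not just on one favorable example.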

3. Ecosystem Fit (does it play well with others?)

Trigger collision detection: automated check that new/modified skill descriptions don't create ambiguous overlaps with existing skills
Context budget analysis: how much of the context window does this skill consume, and is that proportional to its value?
Composability test: does the skill degrade when other skills are co-loaded?
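
One cheap way to approximate trigger collision detection is token overlap between skill descriptions. This is only a sketch: Jaccard similarity, the stopword list, and the 0.5 threshold are illustrative choices, and an embedding-based comparison would likely catch more paraphrased overlaps.

```python
import re
from itertools import combinations

STOPWORDS = {"a", "an", "the", "for", "to", "of", "and", "or", "when", "use", "this", "skill"}

def tokens(description: str) -> set[str]:
    """Lowercased word set with common filler words removed."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    return {w for w in words if w not in STOPWORDS}

def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def find_collisions(descriptions: dict[str, str], threshold: float = 0.5):
    """Return (skill_a, skill_b, similarity) for description pairs above threshold."""
    collisions = []
    for (n1, d1), (n2, d2) in combinations(sorted(descriptions.items()), 2):
        sim = jaccard(tokens(d1), tokens(d2))
        if sim >= threshold:
            collisions.append((n1, n2, round(sim, 2)))
    return collisions
```

Flagged pairs are candidates for human review, not automatic rejections; two skills can share vocabulary and still have distinct triggers.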

For PR review specifically, I'd suggest requiring:

At least 3 golden-prompt test cases included with every skill submission
A tests/ directory convention in the skill bundle
CI that runs the golden prompts against the current model and flags regressions
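
A CI step enforcing the first two requirements might look something like this. The file name `tests/golden_prompts.json` and its JSON-list format are hypothetical; the thread only proposes a `tests/` directory, not a specific layout.

```python
import json
from pathlib import Path

MIN_CASES = 3  # minimum golden-prompt cases per skill, per the suggestion above

def check_bundle(skill_dir: str) -> list[str]:
    """Return a list of problems; an empty list means the bundle passes."""
    cases_file = Path(skill_dir) / "tests" / "golden_prompts.json"  # hypothetical layout
    if not cases_file.is_file():
        return [f"missing {cases_file}"]
    try:
        cases = json.loads(cases_file.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"invalid JSON in {cases_file}: {exc}"]
    if not isinstance(cases, list) or len(cases) < MIN_CASES:
        return [f"need at least {MIN_CASES} golden-prompt cases in {cases_file}"]
    return []
```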

The bigger question this raises is whether there should be a standard skill testing harness in the repo — something like skill test my-skill/ that runs a conventional test suite. That would make quality enforcement scalable rather than relying on manual review.

1 reply

bbrewington Jun 7, 2026

Did Braid move to the link below? The link in your post returns HTTP 404: https://github.com/hughes7370/braid

https://github.com/braid-ink/braid

mj-deving · 2026-04-17T07:51:36Z

The framework above is solid. Wanted to share what actually worked in practice when I built a testing setup for my own skills.

The biggest lesson: golden-prompt tests are necessary but not sufficient. LLM outputs shift between model versions, so your assertions need to be behavioral, not textual. I landed on a pattern like this:

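# test_my_skill.py
# Assumes run_skill_with_prompt and run_without_skill are harness helpers
# (not shown) that return the model's text output as a string.

def test_skill_triggers_correctly():
    result = run_skill_with_prompt("generate a SQL query for user signups")
    # Behavioral checks: the output looks like a query over the expected data.
    assert "SELECT" in result.upper()
    assert "users" in result.lower() or "signups" in result.lower()
    assert not result.lstrip().startswith("I'm sorry")  # no refusals

def test_skill_does_not_trigger_on_adjacent_input():
    # Baseline only: an adjacent prompt should still get a substantive answer
    # without the skill loaded. This does not by itself prove the skill stays
    # silent when loaded; that needs a trigger signal from the harness.
    result = run_without_skill("tell me about databases")
    assert len(result) > 50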

For output quality, LLM-as-judge works surprisingly well if you give the judge a rubric. A simple 1-5 scale with clear anchors beats a vague "is this good?" prompt every time.
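
As a rough illustration of an anchored rubric, here is a sketch that builds a judge prompt and parses the score from the reply. The anchor wording, the `SCORE: n` reply convention, and the function names are all assumptions; the call to the judge model itself is left out.

```python
import re
from typing import Optional

# Illustrative anchors for a "write user stories" skill; adapt per domain.
RUBRIC = {
    1: "Not a user story; no acceptance criteria.",
    3: "Recognizable user story; acceptance criteria present but vague.",
    5: "Follows INVEST; every acceptance criterion is concretely testable.",
}

def build_judge_prompt(task: str, output: str) -> str:
    """Assemble a judge prompt with explicit score anchors."""
    anchors = "\n".join(f"{score}: {text}" for score, text in sorted(RUBRIC.items()))
    return (
        "Score the response on a 1-5 scale using these anchors:\n"
        f"{anchors}\n\n"
        f"Task: {task}\n\nResponse:\n{output}\n\n"
        "Reply with brief reasoning, then a final line 'SCORE: <1-5>'."
    )

def parse_score(reply: str) -> Optional[int]:
    """Extract the score, or None if the judge did not follow the format."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None
```

Returning `None` on a malformed reply lets the caller retry or discard the sample instead of silently recording a wrong score.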

The piece that's hardest to test is ecosystem fit — does adding skill X degrade skill Y? I've seen skills that work perfectly in isolation but eat 3k tokens of context when co-loaded, starving other skills. A simple smoke test that loads all skills together and runs a diverse set of prompts catches these regressions.
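
Before running a full smoke test, a quick context budget report can show which co-loaded skill is eating the window. This sketch uses a crude 4-characters-per-token estimate, which is an assumption; substitute your model's real tokenizer for anything you gate CI on.

```python
from typing import Optional

def estimate_tokens(text: str) -> int:
    """Very rough heuristic (~4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)

def context_report(skills: dict[str, str], budget: int) -> dict:
    """Summarize estimated context use when all skills are loaded together."""
    per_skill = {name: estimate_tokens(body) for name, body in skills.items()}
    total = sum(per_skill.values())
    largest: Optional[str] = max(per_skill, key=per_skill.get) if per_skill else None
    return {
        "per_skill": per_skill,
        "total": total,
        "over_budget": total > budget,
        "largest": largest,
    }
```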

One practical suggestion: a skill test CLI command that runs a conventional test suite per skill would make quality enforcement much more scalable. Something like skill test my-skill/ --model=sonnet --runs=3 that averages over multiple runs to account for non-determinism.
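
The multi-run idea can be sketched independently of any CLI (no such `skill test` command exists in this thread; it is a proposal). Here `run_once` is a hypothetical callable that executes one golden-prompt case against the model and returns whether it passed, and the 0.2 tolerance is an arbitrary illustrative value.

```python
from typing import Callable

def pass_rate(run_once: Callable[[], bool], runs: int = 3) -> float:
    """Fraction of runs that passed; smooths over non-deterministic outputs."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    passed = sum(1 for _ in range(runs) if run_once())
    return passed / runs

def is_regression(rate: float, baseline_rate: float, tolerance: float = 0.2) -> bool:
    """Flag only drops larger than the tolerance, to avoid noisy failures."""
    return rate < baseline_rate - tolerance
```

Comparing pass rates against a stored baseline, rather than requiring every run to pass, keeps CI from flapping on occasional sampling noise.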
