Merged
295 changes: 295 additions & 0 deletions docs/supervisor-profiles.md
@@ -0,0 +1,295 @@
# Supervisor prompt + agent-profile ideas

Four `AgentProfile` sketches for the research/knowledge domain — a **researcher worker**, a
**verifier driver**, a **research supervisor**, and a thin **dedup-first verifier** — plus a port
plan for the existing `~/code/supervisor-lab`.

This is a *design doc*, not shipped code. Each sketch names the real primitive it composes (in
`agent-knowledge` or `supervisor-lab`) so building it is "assemble", not "invent". Where a primitive
already exists, the sketch says so and points at it — we extend, we do not fork.

## What we borrowed from emilkowalski/skills

[`emilkowalski/skills`](https://github.com/emilkowalski/skills) is a small pack of Claude skills for
design engineering (`emil-design-eng`, `review-animations`). It is not about agents, but its *skill
shape* is worth copying, and one structural choice maps directly onto our verifier:

- **Tight frontmatter, two fields.** `name` + a one-paragraph `description` that says *when* to
invoke. Nothing else. Our profiles' `description` should read the same way: a trigger, not a
feature list.
- **Terse non-negotiable rules over prose.** `review-animations` is "The Ten Non-Negotiable
Standards" — numbered, imperative, absolute ("Animate `transform` and `opacity` only").
A worker system prompt is more legible as a short numbered contract than as paragraphs.
- **"Default to flagging; approval is earned."** The review skill is *adversarial by posture* — it
exists to reject, and its output is a findings table + a verdict. This is exactly the right posture
for a verifier, and it informs the dedup point below.
- **`disable-model-invocation: true` on the reviewer.** The review skill is not auto-invoked by the
model; it is routed in deliberately. Our verifier is the same: a driver *calls* it at a gate, the
worker never invokes it on a whim.
- **Cheap because narrow.** Each skill does one thing. The review skill carries no design *generation*
ability — it only judges. That separation (generate vs. judge) is the whole reason the judge can be
thin.

### The dedup insight (why the verifier is a thin profile, not a heavy one)

`review-animations` is mostly a deduplicated rule set: the craft knowledge lives in `emil-design-eng`,
and the reviewer is a thin lens that *applies the same rules* to a diff. It does not re-derive taste;
it checks against a list.

Our verifier is the same. In this repo the "rules" are already deterministic code:

- `createResearcherValidator` (`src/profiles/researcher.ts:235`) — citation-density floor + namespace
check, **no LLM**.
- the `lint` / `validate --strict` CLI path — citation, link, and schema checks over markdown, **no
LLM** (`docs/architecture.md`, "The CLI ... does not call an LLM").
- `assessAuthoredProfile` / `profileRichnessFinding` in supervisor-lab's bench — a deterministic
"is this profile a stub" gate.

So the verifier profile should be **thin and cheap**: it mostly *calls the deterministic checker and
reports the verdict*, escalating to an LLM judge **only** for the residual a deterministic check can't
cover (does the cited source actually support the claim — a semantic check). A heavy verifier that
re-reasons every claim from scratch is wasted spend: the dedup already happened in the validator.
**Spend the model budget on the worker (generation is hard); keep the verifier a cheap lens (judging
against a list is easy).**
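
The ordering described above (cheap deterministic check first, LLM judge only for the residual) can be sketched as follows. This is a minimal illustration, not the real validator: `Claim`, `Verdict`, `SemanticJudge`, `deterministicCheck`, and `verifyOutput` are hypothetical names, and the "every claim has a source" rule stands in for whatever `createResearcherValidator` actually enforces.

```typescript
// Hypothetical types: none of these names come from agent-knowledge or supervisor-lab.
type Claim = { text: string; sourceUrl?: string };
type Verdict = { pass: boolean; stage: "deterministic" | "semantic"; reason: string };

// Stand-in for the LLM judge. It is injected, so the cheap stage never depends on it.
type SemanticJudge = (claim: Claim) => boolean;

// Cheap stage (assumed rule): every claim must carry a source. No model call.
function deterministicCheck(claims: Claim[]): string | null {
  const uncited = claims.filter((c) => !c.sourceUrl);
  return uncited.length > 0 ? `uncited claims: ${uncited.length}` : null;
}

function verifyOutput(claims: Claim[], judge: SemanticJudge): Verdict {
  const failure = deterministicCheck(claims);
  if (failure !== null) {
    // Hard fail: return before the expensive judge is ever consulted.
    return { pass: false, stage: "deterministic", reason: failure };
  }
  const unsupported = claims.filter((c) => !judge(c));
  return unsupported.length === 0
    ? { pass: true, stage: "semantic", reason: "all cited claims supported" }
    : { pass: false, stage: "semantic", reason: `unsupported claims: ${unsupported.length}` };
}

// Usage: a counting judge shows that the deterministic fail never reaches it.
let judgeCalls = 0;
const countingJudge: SemanticJudge = () => {
  judgeCalls += 1;
  return true;
};
const uncitedVerdict = verifyOutput([{ text: "X is faster than Y" }], countingJudge);
const citedVerdict = verifyOutput(
  [{ text: "X is faster than Y", sourceUrl: "https://example.com/bench" }],
  countingJudge,
);
console.log(uncitedVerdict.stage, citedVerdict.stage, judgeCalls);
```

Injecting the judge as a function keeps the deterministic stage testable at zero model cost, which is the property the thin verifier relies on.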

---

## Sketch 1 — `grounded-researcher` (leaf worker)

**Status: already exists, twice.** `agent-knowledge` ships `researcherProfile()`
(`src/profiles/researcher.ts:130`) — `tools: { web_search, fs, shell }`, propose-don't-apply, with a
matching `createResearcherValidator`. `supervisor-lab` ships
`profiles/research/grounded-researcher.ts` (web-search MCP, search→cite→synthesize). This sketch is
the **canonical shape both converge on**: build a new profile only if you need a variant; otherwise
apply `mergeAgentProfiles` over the existing one.

**When to use:** one self-contained research question, spawned one-per-sub-topic by a driver. This
is the leaf: it does not spawn.

**Tools / resources:** `web_search` (the `tcloud mcp` stdio server, or the `web_search: true` tool),
`fs` for writing the cited dossier, `shell` for `agent-knowledge index/lint`.

**System prompt (the contract):**

```
You are a grounded research WORKER. You answer ONE question, and every load-bearing
claim is traceable to a real source you actually retrieved.

THE LOOP — search → cite → synthesize:
1. SEARCH: issue multiple focused web_search queries; never stop at the first hit.
2. CITE:   attach every load-bearing claim to its source (title + url/identifier).
           A claim you cannot cite is a claim you do NOT make. Never invent a source.
3. SYNTH:  lead with the answer; note where sources agree, conflict, or leave a gap.

NON-NEGOTIABLE:
- Honest "the evidence is thin / conflicting" beats a confident hallucination.
- Citation density floor is enforced downstream — under-citing is a hard fail, not a warning.
- You are the LEAF. You do not spawn sub-workers.

Return your cited synthesis as the settled output (a structured object if a schema was passed).
```

**Why this shape:** the validator (`createResearcherValidator`) checks citation density and namespace
*deterministically*, so the prompt's job is only to make the worker *produce* citable output — the
checking is not the worker's job. That split is what keeps the verifier cheap (Sketch 4).
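
To make the split concrete, here is a minimal sketch of what a deterministic citation-density and namespace check could look like. It is not the code in `src/profiles/researcher.ts`: the `ResearchOutput` shape, the meaning of "density" (fraction of claims with at least one source), and the function name `validateResearchOutput` are all assumptions for illustration.

```typescript
// Hypothetical shapes: the real createResearcherValidator may define
// density and namespace differently.
interface CitedClaim {
  text: string;
  sources: string[];
}
interface ResearchOutput {
  namespace: string;
  claims: CitedClaim[];
}
interface ValidationResult {
  ok: boolean;
  failures: string[];
}

function validateResearchOutput(
  output: ResearchOutput,
  expectedNamespace: string,
  densityFloor: number, // assumed meaning: minimum fraction of claims with >= 1 source
): ValidationResult {
  const failures: string[] = [];
  if (output.namespace !== expectedNamespace) {
    failures.push(`namespace: expected "${expectedNamespace}", got "${output.namespace}"`);
  }
  const cited = output.claims.filter((c) => c.sources.length > 0).length;
  // An output with no claims is treated as density 0, so it fails any positive floor.
  const density = output.claims.length === 0 ? 0 : cited / output.claims.length;
  if (density < densityFloor) {
    failures.push(
      `citation-density: ${density.toFixed(2)} is below floor ${densityFloor.toFixed(2)}`,
    );
  }
  return { ok: failures.length === 0, failures };
}

// Usage: one cited claim and one uncited claim give a density of 0.5.
const dossier: ResearchOutput = {
  namespace: "research/batteries",
  claims: [
    { text: "Claim with a source", sources: ["https://example.com/a"] },
    { text: "Claim without a source", sources: [] },
  ],
};
const strictResult = validateResearchOutput(dossier, "research/batteries", 1.0);
const lenientResult = validateResearchOutput(dossier, "research/batteries", 0.5);
console.log(strictResult.failures, lenientResult.ok);
```

Each failure names the rule that tripped, which is what lets a driver steer the worker with a specific gap instead of a generic "try again".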

---

## Sketch 2 — `verifier-driver` (the gate, NOT the judge)

**Status: composes existing primitives.** This is `agent-runtime`'s `verify({ implement, verifier })`
combinator (`src/runtime/personify/combinators.ts:333`) wired to a research worker + a thin checker.
**Do not write a new verify-loop** — the 2-node implement→gate already exists.

**When to use:** when "did the worker actually deliver" must be proven before the result counts —
i.e. always, on a graded run. A driver that runs ONE worker behind ONE gate.

**Tools:** the coordination verbs (`spawn_agent`, `await_event`, `steer_agent`) over a `Scope`, plus
the ability to call the deterministic checker (`agent-knowledge lint` / `createResearcherValidator`).

**System prompt (the contract):**

```
You are a VERIFIER-DRIVER. You spawn ONE research worker, then you GATE its output —
you never accept the worker's own say-so.

THE GATE — generate → check → decide:
1. SPAWN:  author a grounded-researcher worker for the question and spawn it.
2. AWAIT:  pull its settled output (await_event). Read the REAL output, not a summary.
3. CHECK:  run the DETERMINISTIC checker first (citation density, namespace, lint,
           schema). This is cheap and catches most failures. Only if it PASSES do you
           escalate to the one semantic question a checker can't answer:
           "does each cited source actually support the claim it's attached to?"
4. DECIDE: PASS → settle. FAIL → steer_agent with the SPECIFIC gap named
           ("claim X cites source Y but Y does not say X"), or respawn a sharper worker.

NON-NEGOTIABLE:
- selector != judge. You GATE (pass/fail against a contract); you do not re-rank or
rewrite the worker's content yourself.
- "Settled a file" is not delivery. The deterministic check passing is delivery.
- Default to FAILING. A pass is earned by surviving the checker, not by looking plausible.
```

**Why this shape:** the driver's intelligence is *deciding what to check and how to steer on a
fail*, not re-doing the research. The expensive judgment (does the source support the claim) is gated
behind the cheap deterministic checker, so on most runs the LLM-judge step never fires.
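
The generate → check → decide loop can be sketched as a small bounded retry. This is a synchronous toy under stated assumptions, not the signature of `agent-runtime`'s `verify({ implement, verifier })` combinator: `ResearchWorker`, `Gate`, `runBehindGate`, and the `[source: ...]` marker are hypothetical.

```typescript
// Hypothetical sketch; this is NOT the real verify() combinator's API.
type GateResult = { status: "pass" } | { status: "fail"; gap: string };
type ResearchWorker = (question: string, steer?: string) => string;
type Gate = (output: string) => GateResult;

interface GateOutcome {
  settled: boolean;
  output: string;
  attempts: number;
  gaps: string[];
}

function runBehindGate(
  question: string,
  worker: ResearchWorker,
  gate: Gate,
  maxAttempts: number,
): GateOutcome {
  const gaps: string[] = [];
  let output = "";
  let steer: string | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    output = worker(question, steer);
    const result = gate(output);
    if (result.status === "pass") {
      return { settled: true, output, attempts: attempt, gaps };
    }
    // Steer with the SPECIFIC gap the gate named, not a generic retry.
    gaps.push(result.gap);
    steer = result.gap;
  }
  // Default to failing: running out of attempts is not delivery.
  return { settled: false, output, attempts: maxAttempts, gaps };
}

// Usage: the toy worker only adds a citation once it has been steered.
const demoWorker: ResearchWorker = (q, steer) =>
  steer === undefined ? `Answer to ${q}` : `Answer to ${q} [source: https://example.com/doc]`;
const demoGate: Gate = (out) =>
  out.includes("[source:")
    ? { status: "pass" }
    : { status: "fail", gap: "no [source: ...] marker on the answer" };
const outcome = runBehindGate("What is X?", demoWorker, demoGate, 3);
console.log(outcome.settled, outcome.attempts, outcome.gaps);
```

The driver never edits the output itself; its only moves are settle, steer, or give up.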

---

## Sketch 3 — `research-supervisor` (fan-out → fuse → recurse)

**Status: exists as `research-then-build-driver` + `parallel-fanout-supervisor` in supervisor-lab.**
This sketch is the **research-specialized merge** of those two — build new only as a thin overlay via
`mergeAgentProfiles`; the orchestration body is identical.

**When to use:** a research question too broad for one worker — needs decomposition into disjoint
sub-topics, parallel grounded workers, and a fused cited synthesis. The top of a research tree.

**Tools / resources:** coordination MCP (`spawn_agent` / `await_event` / `steer_agent` /
`answer_question` / `stop`) over a `Scope`; `web_search` to *ground the decomposition itself*; skills
`dynamic-workflows`, `orchestrating-workers`, `authoring-agent-profiles` (all already in
`supervisor-lab/skills/`).

**System prompt (the contract):**

```
You are a research SUPERVISOR. You do NOT research the topic yourself — you ground the
decomposition, fan out grounded-researcher workers, gate them, and FUSE their cited findings.

THE SHAPE — ground → fan-out → gate → fan-in:
1. GROUND:    web_search FIRST to see what actually exists. Let the real landscape, not
              your priors, define the sub-topics. Cite what you find.
2. DECOMPOSE: split into DISJOINT sub-questions with non-overlapping ownership.
3. FAN OUT:   author a grounded-researcher profile per sub-question; spawn ALL of them in
              ONE wave before awaiting any. Vary their angles. Never spawn-await-spawn.
4. GATE:      drain await_event. For each settled worker, run the deterministic checker
              (Sketch 4). REJECT uncited or unsupported claims: a finding without a
              source is not a finding. Steer or respawn on a fail.
5. FAN IN:    author a FINAL synthesis worker that fuses ONLY the gated findings into one
              cited answer. Never paste raw worker dumps. Check each result for failure first.
6. RECURSE:   if a sub-question is itself large, spawn a SUB-supervisor (this profile +
              the coordination MCP) to own it.

NON-NEGOTIABLE:
- Author EACH worker as a FULL profile (name, description, prompt, model, tools, skills).
A 2-sentence prompt with no tools is a stub, not a worker — the richness gate will flag it.
- Never go blind: keep pulling until every spawned worker is terminal.
- Settle only on gated, cited output. Budget is conserved and depth is bounded by the Scope.
```

**Why this shape:** the supervisor's only real lever is *the quality of the profiles it authors* —
that is the capability `supervisor-lab` measures (`thinProfileRatio`,
`bench/supervise-topology.ts`). The gate (step 4) reuses the cheap verifier so the supervisor doesn't
burn its budget re-judging.
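
As a rough illustration of a deterministic "is this profile a stub" gate, the sketch below flags a profile whose prompt is two sentences or fewer and that has no tools, then reports the thin fraction. The heuristic is taken from the prompt's own wording; the names `AuthoredProfile`, `isThinProfile`, and `thinRatio` are hypothetical, and supervisor-lab's `assessAuthoredProfile` and `thinProfileRatio` may apply different rules.

```typescript
// Hypothetical stand-ins; the real richness gate may use different fields and thresholds.
interface AuthoredProfile {
  name: string;
  description: string;
  prompt: string;
  model: string;
  tools: string[];
  skills: string[];
}

// Naive sentence count: split on terminal punctuation and drop empty pieces.
function countSentences(text: string): number {
  return text.split(/[.!?]+/).filter((s) => s.trim().length > 0).length;
}

// Assumed rule from the prompt: "a 2-sentence prompt with no tools is a stub".
function isThinProfile(profile: AuthoredProfile): boolean {
  return countSentences(profile.prompt) <= 2 && profile.tools.length === 0;
}

function thinRatio(profiles: AuthoredProfile[]): number {
  if (profiles.length === 0) return 0;
  return profiles.filter(isThinProfile).length / profiles.length;
}

// Usage: one stub and one fully authored worker.
const stubProfile: AuthoredProfile = {
  name: "stub",
  description: "Does research.",
  prompt: "Research the topic. Be thorough.",
  model: "example-model",
  tools: [],
  skills: [],
};
const richProfile: AuthoredProfile = {
  name: "grounded-researcher",
  description: "Use for one self-contained research question.",
  prompt: "You answer ONE question. Issue several focused searches. Cite every load-bearing claim.",
  model: "example-model",
  tools: ["web_search", "fs"],
  skills: ["orchestrating-workers"],
};
const ratio = thinRatio([stubProfile, richProfile]);
console.log(ratio);
```

A lower ratio means the supervisor authored fewer stubs; the check costs nothing to run, so it can sit in front of every spawn.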

---

## Sketch 4 — `dedup-verifier` (thin, cheap, mostly deterministic)

**This is the emilkowalski dedup lesson made into a profile.** It is deliberately *thin* — the
"taste" lives in the deterministic validator, the profile is just the lens that applies it and
escalates only the residual.

**When to use:** called by a driver/supervisor at a gate (Sketch 2 step 3, Sketch 3 step 4). Never
auto-invoked — the verifier equivalent of `disable-model-invocation: true`.

**Tools:** `shell` (to run `agent-knowledge lint` / `validate --strict`), the
`createResearcherValidator` call, and an LLM *only* for the one semantic check. **No `web_search`, no
`fs` write, no generation tools** — a verifier that can edit content is a verifier that can cheat.

**System prompt (the whole thing — it's short on purpose):**

```
You are a VERIFIER. You judge ONE research output against a fixed contract. You default
to FAILING; a pass is earned. You never edit, research, or rewrite — you only judge.

ORDER (cheapest check first, stop at the first hard fail):
1. DETERMINISTIC: run the validator/lint (citation density >= floor, namespace match,
                  links resolve, schema valid). If any fails → FAIL with the rule name.
                  ~90% of failures die here, at zero LLM cost.
2. SEMANTIC:      ONLY if (1) passes: for each cited claim, does the cited source
                  actually support it? This is the one thing a checker can't do. Be terse.

OUTPUT: a findings table (claim | source | supported? | why) + a verdict (PASS / FAIL),
grouped by severity. No prose. A violation is a finding.

You carry NO generation ability by design. If you find yourself wanting to fix the
output, STOP — that is the worker's job, not yours.
```

**Why thin beats heavy here:** the dedup already happened in `createResearcherValidator` and the lint
path. A heavy verifier that re-reasons every claim with an LLM pays full model cost to re-derive a
check the validator does for free, and (worse) a verifier with generation tools can drift into doing
the work and grading itself. Thin + deterministic-first is both cheaper *and* harder to game. A
claim to measure once this is built: on a real run, the LLM (step 2) should fire on **< ~10–20% of gated outputs**
because the deterministic check (step 1) absorbs the rest. If it fires on most, the validator floor
is mis-set, not the verifier.
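
One way to measure that claim is to log, per gated output, whether the semantic step ran, and compute the fraction. The sketch below assumes a hypothetical `GateRecord` log shape and made-up illustrative records; neither repo is known to emit such a record.

```typescript
// Hypothetical log shape for illustration only.
interface GateRecord {
  outputId: string;
  deterministicPassed: boolean;
  semanticRan: boolean;
}

function semanticEscalationRate(records: GateRecord[]): number {
  // Contract check: step 2 may only run after step 1 passes.
  for (const r of records) {
    if (r.semanticRan && !r.deterministicPassed) {
      throw new Error(`${r.outputId}: semantic step ran after a deterministic fail`);
    }
  }
  if (records.length === 0) return 0;
  return records.filter((r) => r.semanticRan).length / records.length;
}

// Usage with five invented records: only one output reached the LLM judge.
const gateLog: GateRecord[] = [
  { outputId: "w1", deterministicPassed: false, semanticRan: false },
  { outputId: "w2", deterministicPassed: false, semanticRan: false },
  { outputId: "w3", deterministicPassed: true, semanticRan: true },
  { outputId: "w4", deterministicPassed: false, semanticRan: false },
  { outputId: "w5", deterministicPassed: false, semanticRan: false },
];
const escalationRate = semanticEscalationRate(gateLog);
console.log(escalationRate);
```

The contract check doubles as a guard against a verifier that skips the cheap stage, which would silently defeat the deterministic-first ordering.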

---

## supervisor-lab — status and port plan

**It exists.** `~/code/supervisor-lab` is a real, non-trivial repo
(`github.com/tangle-network/supervisor-lab`), not a stub. `ls ~/code | grep -i supervisor` → it's
there.

It is already what one might propose to "create": the product/experiment layer for supervisor agents
over the `agent-runtime` substrate, with dependencies flowing one way (`supervisor-lab → agent-runtime →
agent-eval`). It ships:

- **the two-agent loop, already ported** — `bench/supervise-topology.ts` is exactly the "two-agent
loop": a supervisor mounts the coordination MCP over a live `Scope`/`Supervisor`
(`createSupervisor`, `serveCoordinationMcp`), authors worker `AgentProfile`s as code, spawns them,
drains the bus (`await_event`), steers (`steer_agent`), and grades on a real judge. It has an
**OFFLINE $0 scripted path** and a LIVE cli-bridge path. This is the loop to plug these sketches
into — it does not need to be built.
- **profile archetypes** — `profiles/research/{grounded-researcher,research-then-build-driver}.ts`,
`profiles/engineering/{parallel-fanout-supervisor,long-horizon-plan-driver,hardware-auditor}.ts`.
Sketches 1 and 3 above are *already these files*.
- **a skill layer** — `skills/{authoring-agent-profiles,orchestrating-workers,dynamic-workflows,
spawning-research-loops}/SKILL.md`, each in the exact emilkowalski frontmatter+rules shape.
- **a catalog + ingest pipeline** — `src/catalog/`, `src/ingest/skills.ts` ingests vendored
Claude-Code-style skill packs (the same shape as `emilkowalski/skills`) by convention.
- **the richness gate** — `assessAuthoredProfile` / `profileRichnessFinding` / `thinProfileRatio`,
the deterministic "is this a stub" check that is the dedup-verifier's first line.

### How to port the two-agent loop (it's already there — this is how to extend it)

Because the loop already exists, "porting" means **adding these four profiles + the thin-verifier gate
into the existing harness**, not rebuilding it:

1. **Land the profiles as catalog files.** Sketches 1 and 3 are already
`profiles/research/grounded-researcher.ts` and `research-then-build-driver.ts`. Sketch 2
(`verifier-driver`) and Sketch 4 (`dedup-verifier`) are new files under
`profiles/research/` — author them as `defineAgentProfile` entries, resolve their skills via the
`skills(...)` helper in `profiles/_shared.ts` (fails closed on an unknown skill).

2. **Wire the dedup-verifier into the gate.** In `bench/supervise-topology.ts`, the worker's
`runWorker` already grades on `adapter.judge`. For research tasks, replace/augment that with the
deterministic `createResearcherValidator` + `agent-knowledge lint` path FIRST, and only escalate to
the LLM semantic check on a deterministic pass. This is the dedup point made executable: the
existing `assessAuthoredProfile` richness gate is the *profile* check; the validator is the
*output* check. Both run before any LLM judge.

3. **Add a research bench arm.** `supervise-topology.ts` auto-detects domain (repo vs. answer/text).
A research arm is an "answer/text" task whose `adapter.judge` is the cited-output validator. Add
`BENCH=research` that loads a research question + namespace and grades via
`createResearcherValidator`. The OFFLINE scripted path already proves the harness at $0; extend
`scriptedRun.workerResult` to emit a cited/uncited synthesis so the offline smoke exercises the
verifier without spend.

4. **Run the existing experiment knob.** The whole point of the lab is the one knob: catalog/skills
ON vs. OFF (`CATALOG=1`). With these profiles + the thin verifier in the catalog, run
capability-aware vs. baseline on a hard research task and read `thinProfileRatio` + delivery rate.
The question to answer: does giving the supervisor the research archetypes + the cheap gate produce
better-composed teams and more *cited, gated* deliveries?
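
Steps 3 and 4 can be smoke-tested offline with something like the sketch below. Only the `CATALOG=1` knob and the proposed `BENCH=research` value come from this document; `parseBenchConfig`, `ScriptedWorkerResult`, the `[source: ...]` citation marker, and `deliveryRate` are hypothetical, and the real `scriptedRun.workerResult` shape may differ.

```typescript
// Hypothetical sketch: the result shape, citation marker, and all function
// names are assumptions made for illustration.
interface BenchConfig {
  arm: "research" | "default";
  catalogEnabled: boolean;
}

// Takes the environment as a plain record so it can be exercised without a real process.
function parseBenchConfig(env: Record<string, string | undefined>): BenchConfig {
  return {
    arm: env["BENCH"] === "research" ? "research" : "default",
    catalogEnabled: env["CATALOG"] === "1",
  };
}

interface ScriptedWorkerResult {
  id: string;
  synthesis: string;
}

// Assumed citation marker, e.g. "[source: https://example.com/paper]".
const CITATION_MARKER = /\[source:\s*https?:\/\/[^\]\s]+\]/;

// Delivery here means "passed the deterministic check", not "settled a file".
function isDelivered(result: ScriptedWorkerResult): boolean {
  return CITATION_MARKER.test(result.synthesis);
}

function deliveryRate(results: ScriptedWorkerResult[]): number {
  if (results.length === 0) return 0;
  return results.filter(isDelivered).length / results.length;
}

// Usage: one cited and one uncited scripted synthesis exercise both verdicts at $0.
const scriptedResults: ScriptedWorkerResult[] = [
  { id: "cited", synthesis: "Finding A holds [source: https://example.com/paper]." },
  { id: "uncited", synthesis: "Finding B holds, trust me." },
];
const benchConfig = parseBenchConfig({ BENCH: "research", CATALOG: "1" });
const scriptedDeliveryRate = deliveryRate(scriptedResults);
console.log(benchConfig, scriptedDeliveryRate);
```

Running the same scripted fixtures with the catalog knob on and off would give a baseline pair of numbers before any live spend.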

### Cross-repo note

`agent-knowledge` already owns the deterministic research checker (`createResearcherValidator`, the
`lint`/`validate` CLI) and the researcher profile (`researcherProfile`,
`multiHarnessResearcherFanout`). The thin verifier should **call those**, not reimplement them —
`supervisor-lab` depends on `agent-runtime`, and the research checks live in `agent-knowledge`, so the
research bench arm pulls `@tangle-network/agent-knowledge` for the verifier's deterministic line. That
keeps the dedup in one place: the validator is authored once, the verifier profile is a thin lens over
it.