AgentOps 3.0: the north star

This is the single source of truth for what AgentOps 3.0 is. Decision records: ADR-0002 (hookless) and ADR-0009 (daemon deletion / in-session-only identity). Mechanism: docs/cdlc.md. The loop model: canonical-loop-model.md. Execution discipline: operating-loop.md. Seams: ports-and-adapters.md. Component routing and trim/defer posture: component-map.md. Everything else in the repo should be consistent with this page; when a doc disagrees, this page and the executable behavior win.

AgentOps 3.0 is autonomous code validation for coding agents: the in-session operating loop, verification membrane, and context compiler that answer whether agent-written code is right. It compiles the best of software engineering into compact, verifiable, reusable agent context, proves or rejects the output, then compounds that evidence across every session, model, and harness. The product is the validation-centered in-session loop that produces validated output with proof: it catches the agent declaring done when it isn't. The .agents/ knowledge corpus it also compounds is a demoted, unproven hypothesis (ADR-0004), not the product. AgentOps is not a hook bundle, not a daemon, and not a hosted control plane. For RPI naming disambiguation (/rpi skill vs ao rpi CLI vs operating loop), see the RPI terminology section of codebase-overview.


The thesis in one paragraph

A software factory runs the same loop at every scale: turn declared intent into validated code, then recycle the exhaust as context for the next turn. AgentOps names that loop, runs it in session, and makes the proof trail first-class. Skills are the portable agent runtime. rpi is the inner loop, one research-plan-implement-validate cycle. evolve is the outer loop, N rpi cycles toward a goal. crank and swarm are the in-session agent teams that fan a wave out across worktrees. The loops are fractal: the same shape at every layer, run by a human or by a stand-in agent. Context is the engineering artifact handed off at every loop edge, and it compounds at every level. The proof trail of verdicts, catches, and escapes (CONFIRMED-then-wrong) is where the durable asset accumulates: the escape-corpus (the false-dones the membrane has learned to catch) is the candidate moat, with a deterministic gradient the knowledge corpus never had. The .agents/ knowledge corpus is demoted to an unproven hypothesis (ADR-0004), because its delta did not hold at the frontier-single-task altitude, and it is not the headline. Running that loop out of session (always-on, scheduled, queue-driven) is earned by evidence on the Autonomy Ladder and delegated to an orchestration substrate.


Why a loop and not a pipeline: the navigator model

The reason 3.0 is a loop with re-routing and not a linear pipeline is structural: orchestration determinism runs inverse to worker determinism. A deterministic worker takes rails: a script or DAG is correct, because the route is the track. An agent is stochastic: given a single step, it has a real probability of confidently doing the wrong thing. So you do not script the route; you set the destination (the acceptance behavior, "done"), and the system traverses to it.

That traversal has two layers, and keeping them distinct is the whole discipline:

  • The map is deterministic and trusted. The role-topology — the loop's stages, the legal transitions between them, and the gates — is fixed. BDD intent → tests → bounded build → both-ends validation → ratchet is the map. It does not change per goal.
  • The route is dynamic and recalculated. The path a given goal takes through that map is chosen at each node and re-routed on failure (fresh agent on failure, structured rejection fed back as the next prompt). Linear crank / swarm waves are the rails-shaped special case — valid only when write scopes do not collide; the general shape is goal-directed traversal with live re-routing.

Trust the environment, not the agent. The map plus the gates supply the reliability the stochastic worker cannot. And the gates are not optional polish — they are the windshield. The agent's signature failure is not a wrong turn on a real map; it hallucinates roads that do not exist (an API, a file, a fact) and drives toward them. Re-routing cannot save you from a road that was never there — only deterministic ground-truth (the test that actually runs, the gate that actually checks) catches it. This is why the both-ends validation in the loop below is load-bearing: it is the sensor that keeps a confident hallucination from reaching "accepted."
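The two layers can be sketched in a few lines. This is a minimal illustration, not AgentOps code: the stage names, `traverse`, `spawn_worker`, and `max_reroutes` are all hypothetical. The transition table is the fixed map, the loop is the recalculated route, and the gate is the only thing allowed to say "accepted."

```python
from typing import Callable, List, Optional, Tuple

# The map: fixed stages and legal transitions. Stage names are illustrative
# labels for "BDD intent -> tests -> bounded build -> validation -> ratchet".
LEGAL_NEXT = {
    "intent": ["tests"],
    "tests": ["build"],
    "build": ["validate"],
    "validate": ["ratchet", "build"],  # a rejection re-routes to a new build
    "ratchet": [],
}

Worker = Callable[[str, Optional[str]], str]   # (goal, feedback) -> artifact
Gate = Callable[[str], Tuple[bool, str]]       # artifact -> (passed, reason)


def traverse(goal: str, spawn_worker: Callable[[], Worker], gate: Gate,
             max_reroutes: int = 3) -> List[str]:
    """Walk the fixed map toward the goal, re-routing on gate rejection."""
    route: List[str] = []
    stage, artifact, feedback, reroutes = "intent", "", None, 0
    while True:
        route.append(stage)
        if stage == "ratchet":
            return route
        if stage == "build":
            # Fresh worker per attempt; the rejection is its next prompt.
            artifact = spawn_worker()(goal, feedback)
        if stage == "validate":
            passed, reason = gate(artifact)
            if passed:
                next_stage = "ratchet"
            else:
                reroutes += 1
                if reroutes > max_reroutes:
                    raise RuntimeError(f"no route to goal: {reason}")
                feedback, next_stage = reason, "build"
        else:
            next_stage = LEGAL_NEXT[stage][0]
        if next_stage not in LEGAL_NEXT[stage]:
            raise ValueError(f"illegal transition {stage} -> {next_stage}")
        stage = next_stage


def demo_route() -> List[str]:
    """First attempt invents an API; the gate rejects it; the retry passes."""
    def spawn_worker() -> Worker:
        return lambda goal, feedback: "real_api" if feedback else "invented_api"
    gate: Gate = lambda artifact: (artifact == "real_api",
                                   "invented_api does not exist")
    return traverse("ship feature", spawn_worker, gate)
```

In `demo_route` the first worker drives toward a road that is not there. Only the gate notices, and its structured rejection becomes the next worker's input.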


The 3.0 loop (the whole product, in one picture)

The SDLC is a loop for producing reliable code from non-deterministic humans. The CDLC (Context Development Life Cycle) is the same loop for producing reliable context for non-deterministic agents. 3.0 runs that loop:

SDLC (code)  ─────────────  CDLC (context, the same loop)

intent  ──▶  behavior, written as BDD / Gherkin     ← behavior-driven PLANNING
            (Given / When / Then is the plan)
        ──▶  tests derived from the behavior          ← TDD: the behavior's executable proof
        ──▶  implement inside ONE bounded context     ← DDD vocabulary + boundaries
             through explicit ports                    ← hexagonal seams (domain never imports the shell)
        ──▶  validate BOTH ENDS against behaviors + contracts
               entry: does the planned behavior hold up?   (/pre-mortem, /council)
               exit:  does the implementation satisfy it?   (/vibe, CI gates)
        ──▶  feedback: pass/fail on behavior + contract IS "is it good?"
        ──▶  ratchet the learning back into the corpus  ← compounding, dense
        └───────────────────────── loop ──────────────────────────┘

Behavior is the plan. Tests come from behavior. Contracts and behaviors are what you validate on each end of an implementation, and that is how the system feeds back whether the work is good. Between every step, context density is the rule: every high-value token carries intent, a boundary, evidence, a decision, a constraint, or a next action, and nothing else, so the loop stays cheap.
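As a small illustration of "tests come from behavior," the sketch below pairs one Given / When / Then scenario with the test derived from it. The scenario text, `declare_done`, and the item fields are hypothetical examples, not taken from the AgentOps repo.

```python
# The behavior is the plan. It is kept here as plain text next to its proof.
SCENARIO = """
Scenario: agent declares done without proof
  Given a work item with no CONFIRMED verdict
  When the agent declares the item done
  Then the item stays open
"""


def declare_done(item: dict) -> dict:
    """Hypothetical implementation under test: closing requires a verdict."""
    if item.get("verdict") == "CONFIRMED":
        return {**item, "status": "closed"}
    return {**item, "status": "open"}


def test_done_requires_verdict() -> None:
    """Each line maps to one step of SCENARIO."""
    item = {"id": "example-1", "verdict": None}   # Given
    result = declare_done(item)                   # When
    assert result["status"] == "open"             # Then
```

The exit-side validation is then a matter of running the test: pass or fail on the behavior is the feedback signal.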

This loop runs end-to-end in a plain session today, with zero AgentOps-managed always-on infrastructure. It was proven end-to-end (T4/T5) on 2026-05-24: .agents/discovery/2026-05-24-gvkj6-e2e-proof.md.


Fractal loops: one shape at every scale

The 3.0 loop above is not a single ceremony. It is one tick that nests inside itself.

  • rpi is one tick: research, plan, implement, validate over one bead, one behavior, one acceptance proof.
  • evolve runs N rpi ticks toward a goal, selecting the next-best work, running a post-mortem, and repeating.
  • crank / swarm fan one wave of an rpi tick across an in-session team of agents in isolated worktrees.

Each scale has the same five beats and the same ratchet. The only differences between scales are the driver (a human in a session, or a stand-in agent) and the stop policy (a session ends with its budget; an out-of-session driver stops only on an operator marker). This is why the model collapses cleanly to one loop body, two drivers, one inner tick, one config, documented in canonical-loop-model.md.
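The nesting can be sketched as one function calling another. The function names mirror the loop names, but the signatures, the `budget` parameter, and the bead ids are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Optional

Tick = Callable[[str], Dict]


def rpi_tick(bead: str, implement: Callable[[str], str],
             validate: Callable[[str], bool]) -> Dict:
    """Inner loop: one cycle over one bead, ending in one acceptance proof."""
    output = implement(bead)
    return {"bead": bead, "output": output, "accepted": validate(output)}


def evolve(pick_next: Callable[[List[Dict]], Optional[str]], tick: Tick,
           budget: int) -> List[Dict]:
    """Outer loop: up to `budget` inner ticks, re-selecting work each time."""
    history: List[Dict] = []
    for _ in range(budget):          # stop policy: the session budget
        bead = pick_next(history)    # select next-best work from what is known
        if bead is None:
            break
        history.append(tick(bead))
    return history


def demo() -> List[Dict]:
    backlog = ["bead-a", "bead-b", "bead-c"]   # hypothetical bead ids

    def pick_next(history: List[Dict]) -> Optional[str]:
        done = {h["bead"] for h in history}
        return next((b for b in backlog if b not in done), None)

    tick = lambda bead: rpi_tick(bead, str.upper, lambda out: out.endswith("A"))
    return evolve(pick_next, tick, budget=2)
```

Swapping the driver (who calls `evolve`) or the stop policy (what replaces `budget`) leaves the loop body untouched, which is the point of the one-body, two-drivers model.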

This shape has a lineage. It is the reconciliation loop applied to agent work (Kubernetes-era control plane, 2025), refined into typed loop code with explicit ratchet rules (Mount Olympus, 2026), and crystallized here as named inner and outer loops over a compounding context corpus. The recurring unit was always a loop that turns intent into validated work and feeds its own output back in. 3.0 names it and makes the compounding artifact, context, the point.


What makes the loop compound instead of repeat: the ratchet rules

A loop that runs without rules repeats flat. Three invariants turn repetition into compounding:

| Rule | What it forces |
| --- | --- |
| No self-grade | The agent that did the work never validates its own work. Validator is not implementer. |
| Fresh agent on failure | On failure, spawn a fresh-context agent rather than retrying inside a saturated window. A failed attempt's context is exhaust, not seed. |
| Knowledge becomes constraints | A learning is not durable until it compiles into a gate, a test, or a rule that prevents the next regression. Prose advice decays; a constraint holds. |

These are the difference between a corpus that grows into a landfill and one that grows into a moat. They are enforced by the promotion ratchet in operating-loop.md, not by hooks. The doctrine has one implementation — AgentOps, with the acceptance gate shipped in-repo as the ratchet pawl-gate — and these three rules are what make that implementation honest rather than aspirational.
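A minimal sketch of the three rules as code, assuming a check is just a predicate over an output string. `Attempt`, `validate`, `run`, and `promote` are hypothetical names. The real enforcement lives in the promotion ratchet described in operating-loop.md.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Check = Callable[[str], bool]


@dataclass(frozen=True)
class Attempt:
    context_id: str   # which agent context produced the output
    output: str


def validate(attempt: Attempt, validator_id: str, checks: List[Check]) -> bool:
    """Rule 1, no self-grade: the validator must not be the implementer."""
    if validator_id == attempt.context_id:
        raise PermissionError("validator is the implementer")
    return all(check(attempt.output) for check in checks)


def run(spawn: Callable[[int], Attempt], validator_id: str,
        checks: List[Check], max_attempts: int = 3) -> Optional[Attempt]:
    """Rule 2, fresh agent on failure: every retry is a newly spawned context."""
    for n in range(max_attempts):
        attempt = spawn(n)
        if validate(attempt, validator_id, checks):
            return attempt
    return None


def promote(learning: Check, checks: List[Check]) -> None:
    """Rule 3, knowledge becomes constraints: a learning joins the gate."""
    checks.append(learning)
```

Once `promote` has run, the learning is no longer advice. Every later attempt is held to it whether or not the implementing agent ever read it.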


The four load-bearing practices (the narrow waist)

3.0 introduces no new methodology. It composes four proven practices into a small, executable waist, and everything else attaches to it:

| Practice | What it gives the agent |
| --- | --- |
| BDD / Gherkin | Observable intent as behavior; the densest signal-per-token planning channel there is |
| DDD | Shared vocabulary and bounded contexts humans and agents both use |
| Hexagonal (ports & adapters) | Explicit seams: the domain core never imports the runtime; tools, vendors, and substrates live outside as adapters |
| TDD | A local, executable done-condition for every slice |

Around the waist: CI/CD repeats the proof, SRE/DORA measure fitness, ADRs and provenance preserve why-memory, the wiki and ratchet preserve durable learning, Agile/XP keeps work in vertical slices. The atomic unit is one behavior, one bounded context, one failing test, one write scope, one acceptance proof. A new learning is added only when it would change a future run.
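The hexagonal seam is the easiest of the four to show in code. In the sketch below, `ProofRunner`, `slice_is_done`, and `InMemoryRunner` are hypothetical names. The point is only that the domain function depends on the port and never on a concrete tool.

```python
from typing import Protocol, Set


class ProofRunner(Protocol):
    """Port: the only thing the domain knows about the outside world."""

    def passes(self, test_id: str) -> bool: ...


def slice_is_done(test_ids: Set[str], runner: ProofRunner) -> bool:
    """Domain core: a slice is done when every one of its tests passes."""
    return bool(test_ids) and all(runner.passes(t) for t in test_ids)


class InMemoryRunner:
    """Adapter for illustration; a real one would wrap an actual test tool."""

    def __init__(self, passing: Set[str]) -> None:
        self._passing = passing

    def passes(self, test_id: str) -> bool:
        return test_id in self._passing
```

A slice with no tests is deliberately not done: an empty done-condition would let a claim of completion through without proof.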


Hookless-first: density over noise

Hookless-first is the most misunderstood part, so be precise.

3.0 ships zero hooks. The hooks were deleted, not demoted. Context flows through explicit, bounded channels instead of being pushed in as noise.

  • The 2.x failure mode: hooks were the architecture. Dozens fired on every event and pushed context into the window: standards for files you were not touching, learnings that did not apply, warnings that did not fire. Resident context stacked (a real RPI run hit ~10.35M tokens at 97.6% cache-read), and the A/B eval showed the injected context made no difference (delta = 0). The problem was never hooks; it was noise stacking in the prompt.
  • The 3.0 model: context is pulled through explicit channels. BDD/Gherkin for intent, execution and context packets through ports for the phase-scoped working set, hexagonal boundaries for the seams. Dense and just-in-time, not stacked.
  • The default install works with zero hooks. Workflow is guided by skills plus the ao CLI, and the local cockpit gate is the routine release authority in this repo. GitHub Actions remain an optional/manual backstop and release/tag signal. If you want a bounded gate of your own (block a dangerous operation, bootstrap a session, run a parity check), author it with the hooks-authoring skill. AgentOps does not ship one.

The rule in one line: context is pulled, dense, and just-in-time; nothing sprays noise into a prompt.
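To make "pulled, dense, and just-in-time" concrete, here is a toy selector. It is not how ao inject or ao compile work internally. The item fields (`phase`, `scope`, `value`, `tokens`) and the greedy budget rule are assumptions chosen for illustration.

```python
from typing import Dict, List


def compile_packet(items: List[Dict], phase: str, touched: List[str],
                   budget_tokens: int) -> List[Dict]:
    """Pull only items for this phase and these paths, within a token budget."""
    relevant = [
        item for item in items
        if item["phase"] == phase
        and any(path.startswith(item["scope"]) for path in touched)
    ]
    # Highest-value items first, then fill greedily until the budget is spent.
    relevant.sort(key=lambda item: item["value"], reverse=True)
    packet, used = [], 0
    for item in relevant:
        if used + item["tokens"] <= budget_tokens:
            packet.append(item)
            used += item["tokens"]
    return packet


CORPUS = [  # hypothetical corpus entries
    {"id": "go-errors", "phase": "implement", "scope": "cli/", "value": 3, "tokens": 400},
    {"id": "docs-style", "phase": "implement", "scope": "docs/", "value": 5, "tokens": 300},
    {"id": "cli-flags", "phase": "implement", "scope": "cli/", "value": 2, "tokens": 700},
    {"id": "plan-shape", "phase": "plan", "scope": "cli/", "value": 9, "tokens": 100},
]
```

The contrast with the 2.x model is in what never enters the window: standards for files not being touched and learnings from another phase are filtered out before any budget is spent.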


In session is the product; out of session is the substrate's job

3.0 draws a clean line between what AgentOps owns and what an orchestration substrate owns.

AgentOps owns the in-session loop and the context: rpi and evolve, crank and swarm, the ratchet rules, the skills runtime, the context compiler (ao inject, ao compile), and the .agents/ corpus. All of it runs in a plain session with zero AgentOps-managed daemon, scheduler, or always-on runner.

An orchestration substrate owns out-of-session execution: when work runs unattended, the queue, which agents run it, how they are supervised and coordinated, and where the human sits on the loop. AgentOps ships no driver for any of this. The standalone daemon, scheduler, and overnight runner were deleted in the 3.0 rearchitecture (not deprecated, not retained behind a flag; removed, see ADR-0009). Out-of-session orchestration is delegated to a substrate, and AgentOps names a reference one (NTM + MCP + managed-agents) without owning it.

The governing test: is it about when, where, who supervises, or coordination? That is the substrate. Is it about what the loop does or how context compounds? That is AgentOps.

| Domain | Owner | Primitives |
| --- | --- | --- |
| Orchestration: when / where / who-supervises / coordination | Substrate (reference: NTM + MCP + managed-agents) | a tmux agent swarm (NTM), the MCP tool surface (ao mcp serve), managed/agent-SDK drivers (ao agent), cron/event triggers, the bead queue, human on the loop and merge, runtime providers |
| The in-session loop + the context: what the agent does, how context compounds | AgentOps | the /rpi and /evolve loops (run as skills), crank/swarm, the ratchet rules, skills, ao inject / compile / maturity, the .agents/ corpus |

rpi is never re-expressed in the substrate. Decomposing it into substrate-side steps would duplicate the loop shape (the exact surface-sprawl disease, re-introduced across the seam) and pit the substrate's retry machinery against the ratchet rules. Whoever owns the loop owns its invariants, and AgentOps owns the loop. The substrate dispatches a whole loop as one unit — it spawns an agent that runs the /rpi skill over the bead; it never drives the loop's insides.


Running the loop out of session

AgentOps ships no out-of-session driver of its own — it adopts a substrate. The reference substrate is the trio AgentOps actually runs on, none of it AgentOps-owned:

  • NTM (a local tmux agent swarm) — spawns and supervises worker agents that each run the /rpi skill over a bead; the operator, or a lead agent, runs BEADS_DIR=$PWD/_beads br ready and dispatches the next bead.
  • MCP via ao mcp serve (shipped) — exposes the ao tool surface to any MCP-aware harness, so coordination and tool calls cross the seam without bespoke glue.
  • managed-agents / agent-SDK via ao agent (agent-native) — hosted, unattended drivers (e.g. Anthropic Managed Agents) that run the loop on a schedule.

In every case the substrate dispatches a whole loop as one unit — an agent running the /rpi (or /evolve) skill — while ao inject / ao validate --gate run in the dispatched workdir, and the corpus compounds in .agents/. AgentOps never moves its loop, ratchet, or context into the substrate — it adopts orchestration, it does not own it.


One doctrine, one implementation — and where the acceptance gate lives

The CDLC loop is one doctrine, and it has one implementation: AgentOps. There is no separate Mount Olympus product or daemon in the running factory. AgentOps is the single self-contained factory — it produces work, tests and validates it, and accepts it — all in-repo, fully local, no cloud required.

The acceptance gate ships in-repo as the ratchet pawl-gate (docs/contracts/pawls.md + scripts/pawl-verdict.sh + scripts/reconcile-pr.sh). At the merge pawl, work reaches "accepted" only through a fresh-context reviewer (context_id != author; model-agnostic by default, ≥2 distinct model families opt-in per pawl), evidence-bound and commit-bound (head_sha), enforced fail-closed (no CONFIRMED verdict → HOLD), with auto-redo on REFUTED and circuit-breaker escalation. These are the same three ratchet rules (no self-grade, fresh agent on failure, knowledge becomes constraints) encoded at the gate.
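The decision logic described above can be sketched as a pure function. This is not the contents of scripts/pawl-verdict.sh. The `Verdict` fields follow the terms used in this section, while the return labels other than HOLD (`ACCEPT`, `REDO`, `ESCALATE`) and the `max_redos` threshold are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Verdict:
    status: str                 # "CONFIRMED" or "REFUTED"
    context_id: str             # the reviewer's context
    head_sha: str               # the commit this verdict is bound to
    evidence: Tuple[str, ...]   # references to the proof that was checked


def gate_decision(author_context_id: str, current_head_sha: str,
                  verdict: Optional[Verdict], redo_count: int,
                  max_redos: int = 2) -> str:
    """Fail closed: anything short of a valid CONFIRMED verdict is not accepted."""
    if verdict is None:
        return "HOLD"                       # no verdict at all
    if verdict.context_id == author_context_id:
        return "HOLD"                       # self-grade is never a verdict
    if verdict.head_sha != current_head_sha:
        return "HOLD"                       # verdict is for a different commit
    if not verdict.evidence:
        return "HOLD"                       # a verdict without evidence is a claim
    if verdict.status == "CONFIRMED":
        return "ACCEPT"
    if verdict.status == "REFUTED":
        # Auto-redo until the circuit breaker trips, then hand to a human.
        return "REDO" if redo_count < max_redos else "ESCALATE"
    return "HOLD"                           # unknown status: stay closed
```

Note the ordering: the binding checks run before the status is read, so a CONFIRMED verdict from the author, or one pinned to a stale commit, is treated exactly like no verdict.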

A binding cross-vendor pre-merge gate with a non-author override is table stakes, not a moat — it already ships commercially (CodeRabbit, Qodo, GitHub Copilot Code Review). AgentOps does not position on the gate. The durable differentiator is sovereignty: the in-session loop and the .agents/ corpus run with zero dependencies and stay portable across model and harness, owned by the operator. Any quality edge is an unproven corpus delta, measured on the eval, not asserted here.

"Mount Olympus" survives only as two things, neither a co-equal running product:

  • Lineage. The pawl-gate descends from the typed-loop / explicit-ratchet-rules work of 2026 (Mount Olympus). That is where the typed loop and the explicit ratchet rules were first written down; the gate it became now ships in-repo.
  • An optional, parked high-assurance build. A binding daemon with OS-account / peercred isolation, for adversarial multi-tenant deployments. A single operator running their own factory does not need it; it stays parked until an adversarial-tenancy requirement earns it. See ADR-0009 for why AgentOps ships no daemon of its own.

What "3.0-ready" means

3.0-ready is reconciled, hardened, and grounded: the repo matches reality and the doctrine is encoded end-to-end.

  • This north star is the single referenced source of truth; README, PRODUCT, GOALS, and the docs index point here.
  • Live docs are being reconciled to reality: the north-star and first-read routers are the source of truth, while downstream historical docs are being moved behind explicit archive/generated boundaries. Current-path docs must not teach retired tracker paths, stale counts, or CI-as-routine-authority claims.
  • Hexagon complete: every skill carries hexagonal_role plus accurate consumes/produces plus canonical .feature acceptance, with the deprecated/internal/meta exceptions enumerated in the readiness level-set.
  • Hookless by default: AgentOps 3.0 ships zero hooks. Workflow is guided by skills plus the ao CLI, with the local cockpit gate as this repo's routine authority and CI as optional/manual backstop. Author your own via the hooks-authoring skill if you want them.
  • The loop is executable in session: BDD intent → tests → bounded build → both-ends validation → ratchet, documented in operating-loop.md and exercised end-to-end.
  • The loop model is named: one loop body, two drivers, one inner tick, one config, documented in canonical-loop-model.md. Out-of-session orchestration is the substrate's job.
  • Every AgentOps bead is worked or closed against this north star. (Reconciliation waves closed; remaining 3.0-aligned coherence and ledger beads are open and tracked.)

Status: see the 3.0-readiness level-set for the honest box-by-box state, fitness snapshot, and named remaining work.


Non-goals (settled)

  • No new practices or methodology; 3.0 is composition, not invention.
  • No coupling to a consumer's domain vocabulary; bounded contexts are named by the consuming repo.
  • No always-on daemon, scheduler, or overnight runner shipped by AgentOps. Those surfaces were deleted; out-of-session orchestration is the substrate's job (reference: NTM + MCP + managed-agents). The in-session loop is the zero-dependency sovereignty floor. ao orchestrate is the out-of-session instrument arm (preflight/verify/tools — deterministic windshield, not a driver); ATM, /goal, and cron remain swappable driving adapters per orchestration-ports.md.
  • No new required tooling; the floor is unchanged — git + an agent runtime stay hard-required, br is required for the bead-tracked workflow ("No bead, no PR"), and the ratchet + ao CLI carry the rest.
  • No required cloud, telemetry, or hosted control plane; state lives in .agents/ in your repo and is portable across model and harness.
  • Hooks-as-default-primitive: abandoned (the A/B delta = 0 result); 3.0 ships zero hooks.

Canonical references: ADR-0002 · ADR-0009 · cdlc.md · canonical-loop-model.md · operating-loop.md · ports-and-adapters.md · component-map.md · PRODUCT.md · GOALS.md