| last_reviewed | 2026-06-14 |
|---|---|
AgentOps is autonomous code validation for agentic software development. It answers the questions that matter before you trust an agent's work: is this code right, and is this agent output proven enough to receive more autonomy? The mechanism is an SDLC control-plane shape around a repo-local corpus: markdown in .agents/ next to your code, produced and consumed by the agents that work there. The product is the loop that produces validated output with proof — it catches the agent saying done when it isn't. The asset that compounds is the escape-corpus (the false-dones the membrane has learned to catch); the knowledge corpus is a demoted, unproven hypothesis (ADR-0004). The internal lifecycle is the CDLC: context is developed, tested, delivered, observed, and improved because context is what LLM agents consume.
The canonical definition of what 3.0 is — the hookless-first CDLC loop and the four-practice waist — lives in docs/3.0.md. The component and trim/defer routing map that keeps that thesis from sprawling lives in docs/architecture/component-map.md. This file is consistent with both.
AgentOps is the one self-contained software factory: it researches, plans, validates the plan, implements, and proves work end-to-end, and it carries its own acceptance gate in-repo. The gate is the ratchet pawl-gate — docs/contracts/pawls.md + scripts/pawl-verdict.sh + scripts/reconcile-pr.sh. At the merge pawl (the one-way door) work reaches accepted only through a fresh-context reviewer whose context_id ≠ author (model-agnostic by default; ≥2 distinct model families opt-in per pawl), with a verdict that is evidence-bound and commit-bound (head_sha) and enforced fail-closed (no CONFIRMED verdict → HOLD; green CI alone cannot merge), plus auto-redo on REFUTED and circuit-breaker escalation to a human only when a tunable breaker trips. The trust is the separation of duties: a stochastic worker never approves its own work, and the author cannot choose, brief, or transcribe its own reviewer. The destination (autonomous goal → verified done) lives in GOALS.md.
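The merge-pawl rules above can be sketched as a small decision function. This is an illustrative Python sketch under stated assumptions, not the contents of scripts/pawl-verdict.sh: the `Verdict` record, the `decide` function, the returned labels other than CONFIRMED, REFUTED, and HOLD, and the `max_redos` breaker threshold are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Verdict:
    # Hypothetical record shape; field names are assumptions for this sketch.
    status: str                 # "CONFIRMED" or "REFUTED"
    reviewer_context_id: str    # context that produced the review
    head_sha: str               # commit the verdict was issued against
    evidence: Tuple[str, ...]   # ids or paths of the evidence the verdict cites


def decide(verdict: Optional[Verdict], author_context_id: str,
           current_head_sha: str, redo_count: int, max_redos: int = 3) -> str:
    """Return ACCEPT, HOLD, REDO, or ESCALATE. Anything unproven fails closed to HOLD."""
    if verdict is None:
        return "HOLD"        # no verdict: green CI alone cannot merge
    if verdict.reviewer_context_id == author_context_id:
        return "HOLD"        # the author may not review its own work
    if verdict.head_sha != current_head_sha:
        return "HOLD"        # stale verdict: bound to a different commit
    if not verdict.evidence:
        return "HOLD"        # a verdict must be evidence-bound
    if verdict.status == "CONFIRMED":
        return "ACCEPT"
    if verdict.status == "REFUTED":
        # Auto-redo until the tunable breaker trips, then hand off to a human.
        return "ESCALATE" if redo_count >= max_redos else "REDO"
    return "HOLD"            # unknown status also fails closed
```

Every branch that is not an explicit CONFIRMED from an independent, current, evidence-bound review falls through to HOLD, which is what fail-closed means here.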
There is no separate Mount Olympus product or daemon in the running factory. The gate is in-repo policy at the pawl, not an external `olympusd` service consuming claims. "Mount Olympus" survives in two reduced forms: (1) lineage — the typed-loop, explicit-ratchet-rules work (2026) the pawl-gate descends from; and (2) an optional high-assurance Linux build — a binding daemon with OS-account / peercred isolation, for adversarial multi-tenant deployments where a hostile worker might forge its own approval. That threat model does not apply to a single operator running their own factory, so the daemon is parked until earned at scale; the in-repo pawl-gate is the gate for everyone else.
The gate is table stakes, not the moat. Binding cross-vendor pre-merge review with non-author override is already shipped commercially — CodeRabbit, Qodo, and GitHub Copilot Code Review all gate merges on independent, often cross-model review (adversarial siege, 2026-06-14). So the fail-closed cross-family gate is the correctness floor a serious factory needs, not a differentiator. AgentOps positions on the membrane and its self-improvement: the candidate moat is the escape-corpus — the accumulated false-dones the membrane has learned to catch, a deterministic compounding gradient. The earlier corpus-delta bet was measured and did not hold (ADR-0004); the gate is the correctness floor, the self-improving membrane is the differentiator.
AgentOps is the validation layer for agent teams: software-engineering practice encoded so LLM-written code can be judged, corrected, and trusted under token scarcity.
It gives agents the shared practices humans use to build complex software together: domain context, standards, tests, reviews, issues, handoffs, verdicts, wikis, operating loops, and release discipline.
The highest-leverage input to coding agents is context: what the system knows, what it has tried, what failed, what the codebase decided, and what gates must hold. AgentOps automates the bookkeeping agents do not reliably do for themselves, then turns that record into an engineering artifact: typed, versioned, retrieved, validated, and fed back into the next run.
It is not a packet generator. Packets are one artifact. The public wedge is autonomous code validation; the SDLC control-plane shape and the Context Development Life Cycle (CDLC) are the mechanism that makes validation repeatable. CDLC is the DevSecOps SDLC translated to context, plus the operating practices of multi-agent work: isolated context per worker, stigmergic coordination through a shared corpus, and planner/implementer/validator separation. Software engineering spent decades learning how to get fallible humans to work together in massive codebases. Those practices translate to fallible agents. AgentOps packages Extreme Programming, pragmatic engineering, TDD, DDD, BDD/Gherkin, SRE, DevOps, product discovery, and release discipline into composable skills, gates, standards, artifacts, and CI/substrate adapters. The RPI workflow is the canonical instance — /discovery produces the planner artifact, /crank runs implementer agents in fresh-context waves, /validate runs validator agents that have not seen the code. The four layers — validation membrane, evidence trail, context compilation, and knowledge ratchet — are the public product model. Scheduled or unattended compounding belongs to an orchestration substrate.
One factory, two AgentOps operator surfaces, four compounding layers: an in-harness plugin (skills for Claude Code, Codex, Cursor, OpenCode) and the ao CLI (terminal/CI control plane). Out-of-session orchestration is delegated to the substrate boundary (reference: NTM + MCP + managed-agents). AgentOps owns execution packets, explicit gates, skills, and corpus compounding; it does not ship a daemon, scheduler, or default hook layer.
The bet is sovereignty, not features. Vendors will ship managed memory, councils, and dreaming natively — and lock them to their runtime. Your corpus stays in .agents/ in your repo, runs on whichever harness you already pay for, and is portable across whichever frontier model wins next quarter. The model gets smarter. The corpus stays yours. Humans choose the posture: stay in the loop during discovery and validation, or sit on the loop while an external substrate dispatches AgentOps loops.
Canonical contract: docs/context-lifecycle.md
Lineage: AgentOps positions explicitly against EveryInc's Compound Engineer — see docs/comparisons/vs-compound-engineer.md for the in-depth contrast (operator-driven trunk vs. autonomy overlays, capture/scoring/injection, council validation). Internal lineage: the systems-theory work that preceded AgentOps lives in the Lineage section, but users do not need that vocabulary to understand the product.
The software factory that gets better at proving code with each use. Every session produces code, evidence, decisions, attempts, lessons, and stronger constraints — the next session starts with more knowledge, tighter gates, and less wasted work. The model stays the same. The escape-corpus compounds — the membrane gets better at catching false-"done" with every escape it learns from (ADR-0004).
The aspiration is factory-grade confidence for agent-written code first, then throughput: enough structure that agents can run against a defined process, with the operator setting cadence, rigor, and escalation boundaries. Same shape that turned software delivery into an engineering discipline — applied to coding agents.
The thesis is simple: indeterministic workers need disciplined systems. DevOps proved this for engineers. SRE proved it again with SLOs and error budgets. Kubernetes proved it for declarative infrastructure with control loops that reconcile actual state to desired state. Coding agents are the next indeterministic worker class. Same playbook. New substrate. The asset that survives — yours, not ours — is the corpus the system compounds on your behalf.
AgentOps starts with autonomous code validation, not autonomous orchestration. The first-order question is: how do you know the code is right? The second is: how do you know the agent output is right enough to let it run farther next time? Autonomy is earned by evidence. When the validation path holds for a lane, that lane can climb the Autonomy Ladder: more steps before review, more repeated ticks, and eventually substrate-driven execution. The validation boundary does not loosen as autonomy rises; it is the reason autonomy can rise at all.
The deepest "why" under the corpus and the gates is this: agentic workers are inherently stochastic, so AgentOps is a goal-directed navigator, not a workflow. A deterministic worker — compiled code — takes rails: the route is the track, and a script, DAG, or pipeline is the right shape. An agent has no rails. Told "turn left," it has a real probability of confidently driving into the lake. So the orchestration paradigm inverts:
- Orchestration determinism runs inverse to worker determinism. Deterministic worker → script / DAG / pipeline (rails). Stochastic worker → a navigator that assumes deviation and self-corrects continuously.
- You don't script the route — you set the destination. A goal is a destination — acceptance, "done" — not a turn-by-turn workflow. Linear `crank`/`swarm` waves are rails: the brittle special case for when scopes happen not to collide. The general shape is a goal-directed traversal of a deterministic role-topology with live re-routing: the map (roles, stages, legal transitions, gates) is fixed and trusted; the route (the path a given goal takes) is dynamic — chosen at each node, recalculated on failure.
- Trust the environment, not the agent. Trust the GPS — the deterministic map plus the gates — not the driver. The environment supplies the reliability the agent structurally cannot.
- The windshield is non-negotiable. The agent's signature failure is not a wrong turn on a real map; it hallucinates roads that do not exist — invents an API, a file, a fact, and drives toward it confidently. Re-routing cannot save you from a road that was never there. Only deterministic ground-truth — the gate, the eval, the test that actually runs — is the windshield that catches the lake. This is why the validation-gate / evidence layer is the load-bearing floor, not an optional add-on.
This is the reliability-into-stochastic-systems thesis stated as architecture: you do not make the agent reliable — stochastic is what it is — you build the navigator that makes the work reliable in spite of it.
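The fixed-map, dynamic-route idea can be sketched in a few lines. This is a minimal Python illustration, not AgentOps code: the stage names, the `MAP` table, the `navigate` function, and the `max_steps` breaker are all hypothetical. The `gate` callable stands in for deterministic ground truth (a test that actually runs), which is the only thing that decides whether the route moves forward.

```python
from typing import Callable, Dict, List

# Fixed, trusted map: each stage lists the stages it may legally move to.
# The first entry is the forward edge; the second, if present, is the fallback.
MAP: Dict[str, List[str]] = {
    "plan": ["implement"],
    "implement": ["validate"],
    "validate": ["done", "implement"],  # a failed validation re-routes to implement
    "done": [],
}


def navigate(gate: Callable[[str, int], bool], start: str = "plan",
             goal: str = "done", max_steps: int = 20) -> List[str]:
    """Walk the fixed map toward the goal, choosing the next edge at each node.

    gate(stage, visit) returns True when the stage's deterministic check passes
    on that visit. The route is never scripted in advance.
    """
    route = [start]
    visits: Dict[str, int] = {}
    stage = start
    while stage != goal:
        if len(route) > max_steps:
            raise RuntimeError("breaker tripped: escalate to a human")
        visits[stage] = visits.get(stage, 0) + 1
        options = MAP[stage]
        if gate(stage, visits[stage]):
            stage = options[0]                    # gate passed: take the forward edge
        elif len(options) > 1:
            stage = options[1]                    # gate failed: re-route via fallback
        # else: no fallback edge, so retry the same stage
        route.append(stage)
    return route
```

In this map the only edge into `done` leaves `validate`, and it is taken only when the gate passes, so a worker cannot declare itself finished by any other path.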
They will. Anthropic's Managed Agents is the first move; others will follow. That's fine — the value isn't in this tool. It's in the corpus you build with it.
AgentOps is bridge infrastructure. Your .agents/ directory is plain markdown in your repo. If a frontier vendor ships native equivalents in 12 months, your corpus carries forward. If we get acquired or change direction, your corpus is yours. If you outgrow the tool entirely, fork it, customize it, replace it — the corpus is what matters.
Open source forever. Built so you own the asset, not the tool.
AgentOps 3.0 narrows the release wedge without retreating from the software-factory thesis. The primary user is the agent-heavy maintainer with recurring codebase context debt: someone already using coding agents on real repos, already feeling session amnesia, repeated mistakes, low trust in agent output, and scattered validation evidence.
The 3.0 promise is:
Agent-heavy maintainers can use AgentOps to keep books, compile the right context, validate agent work, curate durable learning, and start the next session smarter without giving the corpus to a hosted control plane.
The hero capability is council inside an agreed engineering domain, not generic multi-model chat. AgentOps lets the operator define the domain through PRODUCT.md, GOALS.md, standards, skills, issues, test expectations, and evidence rules; then Claude, Codex, and other agents judge product, design, and implementation decisions against that shared operating context. The model can focus on the work because AgentOps carries the process and context boundaries around it.
Bring your agent and bring your harness. If it can consume plugins or skills, AgentOps plugs in. The loop, ao, skills, and corpus are the product surfaces operators use to run disciplined agent work. The sharpened posture is that day-one value must not require an unlimited-token cloud factory, hidden runtime hooks, or a fully configured automation lane. Out-of-session compounding is a second-stage accelerator after the user has seen the first evidence trail work, and it runs through an orchestration substrate rather than an AgentOps-shipped daemon.
3.0 public claims are evidence-gated. The release train tracks the PMF scenario in soc-m6v5.8; launch claims that cite PMF or productivity evidence must point to exported artifacts under docs/releases/ or evals/workbench/results/, not only local .agents/ notes.
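The evidence-gating rule lends itself to a mechanical check. The following Python sketch assumes claims are tracked as a mapping from a claim id to the repo-relative paths it cites; that mapping, the claim ids, and the function names are hypothetical. Only the two export roots come from the text above.

```python
from pathlib import PurePosixPath
from typing import Dict, List

# Export roots named in the text; local .agents/ notes do not count on their own.
ALLOWED_ROOTS = (
    PurePosixPath("docs/releases"),
    PurePosixPath("evals/workbench/results"),
)


def is_exported_evidence(path: str) -> bool:
    """True if a repo-relative path sits under one of the allowed export roots."""
    p = PurePosixPath(path)
    if p.is_absolute() or ".." in p.parts:
        return False  # reject absolute paths and traversal out of the root
    return any(root in p.parents for root in ALLOWED_ROOTS)


def ungated_claims(claims: Dict[str, List[str]]) -> List[str]:
    """Return ids of claims that cite no artifact under an allowed root."""
    return sorted(
        claim_id for claim_id, paths in claims.items()
        if not any(is_exported_evidence(p) for p in paths)
    )
```

A claim that cites both a local note and an exported artifact passes; a claim that cites only local notes is reported.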
Anthropic's Managed Agents (and peers like Cursor and Factory) appear to converge on the same architecture AgentOps implements — convergent, not derived-from (see the convergence note below):
| Anthropic Concept | AgentOps Equivalent | Status |
|---|---|---|
| Learning Loop — memory extraction, dream cycle consolidation, future session context | Knowledge Flywheel — /post-mortem --quick → /forge → ao lookup / ao context assemble, tiered promotion (learning → pattern → rule), plus substrate-run compounding | Live with bounds. On-demand capture/promotion works, and adopted substrates provide an operator-started compounding lane. GitHub nightly is the public proof harness for the contracts, not the user's private runtime. |
| Skillify — AI watches patterns, packages them as reusable skills, compound growth | Skills system — generated registry, /heal-skill audit, /converter cross-runtime export, SKILL-TIERS classification | Prototype built. ao flywheel close-loop now drafts review-only skills from repeated patterns; promotion polish is the remaining gap. |
| Verification Agent — adversarial AI auditing AI, VERDICT system for human review | Council architecture — /council, /pre-mortem, /validate, /post-mortem with multi-model consensus, prediction tracking. Stage 4 behavioral validation adds holdout scenarios + satisfaction scoring in STEP 1.8. | Live on demand. STEP 1.8 fires automatically inside /validate when that skill is invoked. |
| Managed Agents Dreaming (May 2026) — scheduled session review, pattern extraction, memory curation between sessions | Substrate-driven compounding + .github/workflows/nightly.yml proof jobs when needed | Live with operator setup. The bounded private compounding lane runs harvest → forge → close-loop → defrag when the operator or substrate starts it. AgentOps itself no longer ships the daemon executor. |
| Managed Agents Outcomes (May 2026) — rubric-driven separate-context grader with iterate-until-pass | Live at three scopes: project — GOALS.md (rubric) + ao goals measure (each gate runs as separate subprocess; cli/internal/goals/measure.go:132-164) + /evolve (can iterate a worst-failing gate under operator limits; skills/evolve/SKILL.md:379-388); plan — /pre-mortem council judges as separate-context graders; code — /vibe council judges. An internal council review (2026-05-06) found these capabilities present across rubric authoring, separate-context grading, iterate-until-pass, and pinpoint-what-changed; this is an internal finding, not an audited external-parity claim. | Live at the capability layer. Empirical workbench A/B (2026-05-06): Δ=+0.0000 across 12 cases at v1 difficulty (both legs 12/12) — task difficulty floor exhausted; v2 substrate (realistic agent tasks where the hook layer differentiates) is roadmap. Counter-stat artifact: evals/workbench/results/2026-05-06-yjzp9-counterstat.json. |
Read the convergence table the right way: AgentOps and every harness like it get absorbed into the model layer over time — Anthropic's 2026-05-06 Managed Agents launch is the textbook example. Memory primitives, learning loops, even validation gates — frontier vendors will ship them natively. What stays yours is the corpus. AgentOps is the bridge tool that helps you build that corpus now, with current models, before the harness layer commoditizes (whether it compounds into a quality moat is the unproven bet; that it stays yours is the durable one).
- Goal: Keep a real codebase moving with coding agents while preserving context, evidence, and release judgment across sessions.
- Pain point: Each agent session starts cold. The maintainer knows there were prior attempts, warnings, decisions, and fixes, but they are scattered across chats, commits, notes, and memory.
- Gap exposure: Bookkeeping, context compilation, judgment validation, and durable learning. This is the 3.0 PMF wedge.
- Goal: Ship fewer but higher-confidence releases while using agents for more of the work.
- Pain point: Agents can produce coherent-looking changes that miss edge cases, violate repo conventions, or leave no reviewable proof trail.
- Gap exposure: Judgment validation, claim governance, release readiness, and evidence export.
- Goal: Run multiple agents or repeated agent loops on a shared codebase without losing coordination or repeating mistakes.
- Pain point: Parallel and repeated agent work creates file conflicts, stale assumptions, duplicated investigations, and context drift.
- Gap exposure: Worktree isolation, planner/implementer/validator separation, loop closure, and substrate-driven compounding.
- One-off prompt users who only need a single answer and do not care whether the repo remembers it.
- Cloud-control-plane buyers who want a hosted autonomous factory more than local corpus ownership.
- Teams that will not inspect artifacts and only want an agent to say "done."
- New agent users with no repeated workflow yet; they may benefit later, but the first 3.0 wedge is users who already feel agent-session context debt.
- Repo-local memory for agent work. Attempts, decisions, citations, verdicts, handoffs, findings, retros, and run packets become artifacts the next session can use.
- Context starts warm. AgentOps compiles prior work into phase-scoped context instead of asking every agent to rediscover the repo from scratch.
- Judgment becomes a gate. `/pre-mortem`, `/vibe`, and `/council` add fresh-context review before risky plans and code ship.
- Engineering practice becomes executable. Pragmatic engineering, XP, TDD, DDD, BDD/Gherkin, SRE, DevOps, and product-management practices become reusable skills, standards, gates, issue flows, and acceptance proofs the operator can jump into, automate, or review.
- Learning compounds under operator control. `/forge`, `ao flywheel`, and substrate-run maintenance turn completed work into reusable constraints without requiring an AgentOps cloud.
- The corpus stays portable. The same local knowledge base can outlive a model, chat session, harness, or vendor.
For the 3.0 target audience, a 10-star experience is not "configure every automation lane." It is the first hour proving the factory is worth trusting.
- Install fits their runtime. They can use Claude Code, Codex, OpenCode, or another skills-compatible agent without changing how they work.
- The domain packet is visible. They can see the product, goals, standards, issue context, and test expectations the agents will operate inside.
- Council shows the taste layer. Claude and Codex judge the same product/design/engineering decision against the same domain context and produce a verdict artifact.
- A first real repo task leaves evidence. `/rpi "small goal"`, `/vibe recent`, or `/council validate this PR` produces a verdict and artifact trail the maintainer can inspect.
- The next session starts smarter. The prior evidence, decision, or learning is retrievable and changes what the next agent sees.
- The user owns the substrate. Artifacts are local, grep-able, diff-able, and removable. No AgentOps-hosted control plane is required.
- Automation is introduced after trust. Dream and substrate-driven scheduling are framed as second-stage compounding lanes, not prerequisites for first value.
AgentOps is autonomous code validation backed by a wiki for your agents. .agents/ is markdown in your repo, version-controlled with your code, that agents read, traverse, and contribute to — the kind of wiki your team should already have, except agents do the maintenance. AgentOps automates the discipline of proving work and preserving the proof: capture, retrieval, validation, and compounding all happen mechanically so the wiki stays current instead of bitrotting.
That wiki is the substrate underneath a software factory with three surfaces and four user-facing layers: Validation Membrane, Evidence Trail, Context Compiler, and Knowledge Ratchet. Dream is a bounded compounding skill, not a separate peer product or AgentOps-owned scheduler. The SDLC control plane executes the CDLC — the Context Development Life Cycle — so agent work can move through small, bounded, evidence-bearing vertical slices. The deeper proof-contract framing — identity, reproducibility, evaluation, evidence, recovery — lives in docs/trust-factory.md.
The same substrate, reached three ways:
- In-harness plugin — skills for Claude Code, Codex, Gemini/Antigravity, Cursor, OpenCode. Context moves through explicit packets and skill workflows first. Runtime hooks are not an AgentOps default; teams that want custom hooks author them separately with the `hooks-authoring` skill. Install via `install-claude.sh`, `install-codex.sh`, `install-agy.sh`, or the skills.sh package.
- `ao` CLI — the terminal and CI control plane. `ao context assemble`, `ao lookup`, `ao compile`, `ao goals measure`, `ao flywheel close-loop` — the same compiler, scriptable. Repo-native, with no required AgentOps cloud control plane.
- Out-of-session substrate — optional, off-API, off-vendor orchestration outside the AgentOps product core. The reference boundary is NTM + MCP + managed-agents: it dispatches agents that run whole AgentOps loops (`ao rpi`, compile/maturity jobs) while AgentOps stays daemonless and schedulerless.
Before the four layers run, AgentOps helps the operator define the domain the agents operate inside. This is the pragmatic-engineering, DDD, TDD, and BDD-shaped part of the product: product docs define intent, goals define fitness, issues define current work, standards define style and constraints, tests or Gherkin-like scenarios define expected behavior, and skills encode the process.
The point is to take process burden off the model. The LLM still researches, plans, implements, reviews, and explains, but AgentOps curates the context in and out of each phase so the model does not have to invent the development methodology while doing the work.
The narrow waist is BDD/Gherkin + DDD + Hexagonal + TDD:
| Practice | Product role |
|---|---|
| BDD / Gherkin | Express intent as observable behavior and acceptance examples |
| DDD | Keep human and agent inside the same bounded vocabulary |
| Hexagonal architecture | Keep adapters, tools, runtimes, and vendors outside the core loop |
| TDD | Give every slice a first failing test and executable done condition |
All other practices attach there: CI/CD reruns the proof, SRE/DORA measures fitness, ADRs and provenance preserve decision memory, wikis and ratchets preserve learning, and Agile/XP keeps work atomic instead of waterfall-shaped.
The same model used in the README: bookkeeping records the work, the context compiler feeds the next run, validation gates enforce judgment, and the flywheel compounds the corpus.
Problem: Agents do not keep their own operational memory. They forget what they tried, why they changed course, which warnings mattered, what passed validation, and what should be reused next time.
What the compiler emits: File-backed traces of agent work: attempts, decisions, citations, handoffs, findings, retros, post-mortems, council verdicts, and run packets. This is the raw material for context compilation and compounding.
- `.agents/` — repo-local bookkeeping substrate
- `/post-mortem --quick` and `/post-mortem` — capture decisions, lessons, and follow-up work
- `/provenance` and `/trace` — connect artifacts back to their source
- `ao metrics cite` and citation logs — record what knowledge was used
- RPI packets and council verdicts — preserve plan/build/validation evidence
Problem: Every session starts from zero. Agents need the right slice of prior work, policy, constraints, and decisions before they can act well.
What the compiler emits: Decay-ranked, phase-scoped context packets built from the bookkeeping trail and curated corpus.
- `ao context assemble` — phase-scoped context packet assembly with token budgeting
- `ao lookup` — decay-ranked retrieval for on-demand knowledge
- `ao compile` — rebuild the knowledge wiki (mine, grow, defrag, lint)
- Reusable skills — generated from `skills/**/SKILL.md` and packaged across Claude Code, Codex, Gemini/Antigravity, and OpenCode
- Runtime install one-liners plus Day-2 operations — install, update, backup, permission repair, recovery, and escalation are product surfaces, not afterthoughts
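To make "decay-ranked, phase-scoped, token-budgeted" concrete, here is a minimal Python sketch. It is not how `ao context assemble` is implemented: the `Item` record, the exponential half-life decay, the 30-day default, and the whitespace token estimate are all assumptions picked for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Item:
    # Hypothetical corpus record; real artifacts are markdown files in .agents/.
    text: str
    phase: str        # e.g. "plan", "implement", "validate"
    utility: float    # assumed prior usefulness score
    age_days: float


def decayed_score(item: Item, half_life_days: float = 30.0) -> float:
    """Exponential decay: the score halves every half_life_days (an assumed policy)."""
    return item.utility * 0.5 ** (item.age_days / half_life_days)


def assemble(items: List[Item], phase: str, token_budget: int) -> List[Item]:
    """Keep items for this phase, rank by decayed score, pack greedily under budget."""
    ranked = sorted((i for i in items if i.phase == phase),
                    key=decayed_score, reverse=True)
    packet: List[Item] = []
    used = 0
    for item in ranked:
        cost = len(item.text.split())   # crude whitespace token estimate
        if used + cost <= token_budget:
            packet.append(item)
            used += cost
    return packet
```

The greedy pack skips an item that does not fit and keeps going, so one oversized note cannot crowd out smaller, lower-ranked ones.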
Problem: Agents ship confident garbage. No review, no second opinion, no gate between "agent thinks this is good" and "this goes into production."
What the compiler emits: Multi-model consensus validates plans before build and code before commit. Gates block, not advise. Independent judges debate against the same product/domain context and return one auditable verdict.
- `/pre-mortem` — validate plans before implementation
- `/vibe` — validate code after implementation
- `/council` — multi-model adversarial review (Claude + Codex judges) over agreed product, goals, standards, and issue context
- 63 eval suites + 12-task workbench — deterministic context quality testing
- Baseline A/B — skill-on vs skill-off delta measurement
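A skill-on vs skill-off delta can be computed as a difference in pass rates. The sketch below assumes that definition; the document does not state the exact formula the workbench uses, and the function names are hypothetical. The sample input mirrors the 12-case result reported in this document, where both legs passed 12/12.

```python
from typing import Sequence


def pass_rate(results: Sequence[bool]) -> float:
    """Fraction of cases that passed."""
    if not results:
        raise ValueError("no cases to measure")
    return sum(results) / len(results)


def ab_delta(skill_on: Sequence[bool], skill_off: Sequence[bool]) -> float:
    """Pass-rate difference between the two legs over the same cases."""
    if len(skill_on) != len(skill_off):
        raise ValueError("both legs must run the same cases")
    return pass_rate(skill_on) - pass_rate(skill_off)


# Both legs pass every case, so the measurement saturates and the delta is zero.
delta = ab_delta([True] * 12, [True] * 12)
print(f"{delta:+.4f}")  # prints +0.0000
```

When both legs pass every case, the cases cannot separate the two legs whatever the skill contributes, which is why a zero delta on a saturated suite is evidence about the suite's difficulty rather than about the skill.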
Problem: Each session ends and the lessons disappear. Same mistakes get made. Same solutions get rediscovered. Nothing compounds.
What the compiler emits: Bookkeeping becomes reusable knowledge. Learnings get scored on specificity, actionability, novelty. High-scoring learnings promote to permanent patterns. Patterns become planning rules. Scheduled dream cycles defrag and compound the corpus without competing with foreground engineering. Next session starts loaded.
- `/forge` — extract structured learnings from completed sessions
- `ao flywheel close-loop` — score, promote, curate automatically
- `/evolve` — bounded reconciliation: reads goals, targets the worst gap, validates, and can repeat under operator limits
- Substrate-run compounding — bounded private compounding lane
- Out-of-session substrate (reference: NTM + MCP + managed-agents) — operator-owned cadence for unattended rpi, compile, maturity, and related jobs
- `.github/workflows/nightly.yml` — public proof harness for the contracts (not your private runtime)
- MemRL feedback — cited artifacts receive session reward, utility scores update
- Corpus health snapshot — the flywheel reports whether current citation flow is compounding or decaying
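The scoring and tiered promotion described above (learning, then pattern, then rule) might look like the following Python sketch. The three axes and the tier order come from the text; the 0-to-1 scale, the unweighted mean, the 0.75 threshold, and every name in the code are assumptions, not the `ao flywheel close-loop` implementation.

```python
from dataclasses import dataclass

TIERS = ("learning", "pattern", "rule")  # promotion order named in the text


@dataclass
class Learning:
    text: str
    specificity: float     # each axis scored 0.0 to 1.0 (assumed scale)
    actionability: float
    novelty: float
    tier: str = "learning"


def score(item: Learning) -> float:
    """Unweighted mean of the three axes; the real weighting is not specified."""
    return (item.specificity + item.actionability + item.novelty) / 3


def promote(item: Learning, threshold: float = 0.75) -> Learning:
    """Move one tier up when the score clears the hypothetical threshold."""
    idx = TIERS.index(item.tier)
    if score(item) >= threshold and idx < len(TIERS) - 1:
        item.tier = TIERS[idx + 1]
    return item
```

Promotion moves one tier per call, so a learning has to clear the bar on separate passes before it becomes a rule.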
Session starts → compiler delivers context → Agent works →
bookkeeping records attempts, decisions, citations, and evidence →
validation gates emit verdicts → Session ends → flywheel promotes learnings →
Dream or substrate-driven maintenance defrags the corpus → Next session starts with better context
That's the CDLC inside the SDLC control plane. Generate, compile, test, distribute, deliver, observe, adapt. Same shape as the DevOps SDLC. Different substrate. The model stays the same. The corpus compounds.
The CDLC is not a packet pipeline. It is the operating discipline that ensures every high-value context token carries intent, boundary, evidence, decision, constraint, or next action.
- Coordination plane — `/swarm`, `/crank`, waves, worktree isolation for parallel agents. Scale by adding workers, not overloading context.
- Temporal compounding — dream cycles (hours), session forge (minutes), pattern promotion (weeks). Multiple clocks, one flywheel.
- Multi-runtime — same skills, same corpus across Claude Code, Codex CLI, and OpenCode. `/converter` exports to Cursor rules. The discipline lives in the system, not the model.
- Local-first operation — all AgentOps state lives in local `.agents/` directories. No required AgentOps product telemetry or hosted control plane; operators choose model runtimes, networks, installers, and remotes.
The bet is sovereignty, not features. Every harness — ours included — gets absorbed into the model. Memory primitives, learning loops, validation gates: frontier vendors will ship them natively. What they won't ship is your corpus — what your repo learned, what your team scarred, what your codebase decided.
Position on the ruler, not the gate. The fail-closed cross-family gate is table stakes — binding cross-vendor pre-merge review with non-author override already ships (CodeRabbit, Qodo, Copilot). It is a necessary correctness floor, not a wedge, and the messaging does not lean on it. The only candidate moat is the corpus delta: does the compounding .agents/ corpus measurably improve agent output versus the same models with no corpus? That is unproven — the one A/B on record (ADR-0002, Δ=0) was a hook-layer test, not a corpus A/B, and the corpus delta itself (ag-8p8o) is still being measured. So the moat is a hypothesis gated on a ruler, not a claim; the load-bearing work is building that ruler. Fungible dispatch — identical generalist workers, any-agent-any-bead, mixed families coordinated through the durable board rather than a shared context window — is the mechanism that makes the cheap cross-family gate, disposable-context management, and vendor-cost arbitrage possible; it is not itself the moat. The durable, non-quality differentiator is sovereignty: the corpus stays in your repo and outlives any model, harness, or vendor.
When Anthropic ships native scheduling, councils, Skillify, and Dreaming inside Claude Code in the next 6 months, what specifically does AgentOps still do? The honest answer:
- Cross-runtime corpus. The corpus stays in `.agents/` in your repo. It runs the same way on Claude Code, Codex CLI, Cursor, and OpenCode. When the next frontier model wins next quarter — or when you switch teams or platforms — the corpus comes with you. Vendor-managed memory follows the chat session. AgentOps' corpus follows the team.
- Local sovereignty. The corpus and in-session loops run off-API, off-vendor, against the runtimes you choose. Dream cycles, evolve loops, compile/defrag, and substrate-dispatched rpi jobs do not require an AgentOps cloud and do not ship your codebase to a hosted AgentOps control plane.
- Operator workflow encoding. RPI, planner/implementer/validator separation, fresh-context worker waves, council validation — these are operating practices for multi-agent work, encoded as skills, execution packets, gates, and CI/substrate adapters. Vendors will ship managed agents; they won't ship your operator workflow.
- Curated engineering practice. The operator's preferred practices — pragmatic engineering, XP, TDD, DDD, BDD/Gherkin, SRE, DevOps, product discovery, release gates — become composable context units. You decide when to run them interactively, automate them, or inspect the output.
The model gets smarter. The corpus stays yours. See the Lineage section for how this position was arrived at.
AgentOps is engineered from constrained-environment habits, not consumer-app assumptions. This is not a certification claim; it is the operating posture the system is built toward.
Full profile: docs/assurance-profile.md.
- Local-first control. The corpus lives in `.agents/` beside the code. AgentOps requires no product telemetry, and operators choose which model runtimes, networks, and subscriptions touch the repo.
- Context as a boundary. Research, planning, implementation, and validation receive different context. Workers get fresh windows. Validators get evidence packets instead of the implementer's accumulated chat.
- Bookkeeping by default. RPI packets, council verdicts, citations, ratchet records, post-mortems, handoffs, and substrate/job outputs leave file-backed traces that can be inspected, diffed, archived, or excluded from source control.
- Policy gates over advice. Pre-push checks, CI gates, security scans, goal fitness gates, and pre-mortems encode process as executable constraints instead of relying on the agent to remember a runbook.
- Variable autonomy. The same factory can run interactive, supervised, or substrate-scheduled loops. High-risk environments can keep humans in the loop for planning, validation, release, and promotion while still using agents for bounded work.
- Constrained-network fit. The design favors repo-local state, explicit artifacts, no required cloud control plane, and operator-owned or substrate-owned scheduling. Formal deployment into classified, export-controlled, or safety-critical environments still requires the local authority's security controls, model approvals, supply-chain process, and accreditation work.
As of 2026-05-10:
Traction:
- GitHub repo: 341 stars, 33 forks, 2 open issues, last pushed 2026-05-10T03:24:01Z
- Public surface: GitHub Pages mkdocs site live at boshu2.github.io/agentops/; doctrine site live at 12factoragentops.com
- Distribution/runtime reach: 75 shared skills, 75 checked-in Codex artifacts, and 19 Codex overrides. /validate and /curate are additive in this train; legacy validation and mining skills remain until their shim/retirement gates are resolved.
Measured operational proof:
- ao doctor --json: hook coverage and structural gates passing
- Competitive freshness gate: comparison docs maintained within the 45-day target
- Internal assessment: a 2026-05-06 council review found rubric authoring, separate-context grading, and iterate-until-pass present at the capability layer (internal finding; not an external audit or a verified parity claim against Anthropic Managed Agents)
- Empirical workbench A/B (2026-05-06, 12 cases): Δ=+0.0000 — both skill-on and skill-off legs scored 12/12. Honest read: at workbench v1 difficulty (off-by-one bugs, simple validators, basic SQLi) AgentOps's context layer is non-differentiating because the tasks don't require it. Substrate v2 (realistic agent task difficulty) is roadmap. Source: evals/workbench/results/2026-05-06-yjzp9-counterstat.json
- 3.0 PMF scenario evidence is planned but not yet claimed. soc-m6v5.8 owns the scenario spec, control path, exported evidence, and claim-ledger posture before launch copy uses PMF/productivity claims.
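The Δ in the workbench A/B bullet can be read as a difference of pass rates between the two legs. The sketch below assumes that reading; the function names are hypothetical and the scoring code behind evals/workbench/ may compute the figure differently.

```python
from fractions import Fraction


def pass_rate(passed: int, total: int) -> Fraction:
    """Fraction of workbench cases that one leg passed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return Fraction(passed, total)


def skill_delta(on_passed: int, off_passed: int, total: int) -> float:
    """Assumed definition: skill-on pass rate minus skill-off pass rate."""
    return float(pass_rate(on_passed, total) - pass_rate(off_passed, total))


# Both legs scored 12/12 in the 2026-05-06 run, so the delta is zero.
delta = skill_delta(on_passed=12, off_passed=12, total=12)
print(f"{delta:+.4f}")  # prints +0.0000
```

A zero delta under this definition says only that both legs saturated the task set; it cannot distinguish "the context layer adds nothing" from "the tasks are too easy to show a difference", which is the caveat stated above.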
Maintainer corpus stats (this repo's .agents/, derived by scripts/corpus-stats.sh — re-runnable, no fabricated numbers; exported snapshots such as docs/evidence/2026-04-02-flywheel-case-study.md when promoted for public citation):
- ~1,842 learnings · ~186 patterns · ~80 planning rules
- ~68 finding markdown files · ~24 registry entries
- ~3,867 citations recorded in .agents/ao/citations.jsonl
These are this repo's corpus stats; your own AgentOps install will produce its own. Run scripts/corpus-stats.sh --table (or --json / --markdown) against $AO_CORPUS_ROOT to derive yours. The previously-cited "4,940 learnings, 1,195 patterns, 40 planning rules" line was removed because no on-disk source reconciled it; the numbers above are what the tracked source actually returns at the time of writing.
.agents/ runtime state was wiped by routine cleanup on 2026-05-07; receipts use the 2026-05-04 stable snapshot. The durability fix is tracked in soc-rv5p.
Your corpus grows every session — learnings, patterns, and constraints accumulate in your repo, not ours. The system writes the substrate; you decide whether dream/evolution/compile loops run manually, in session, or through an external orchestration substrate. Scale and compounding follow from the cadence you set, not from a claim we make.
PRODUCT.md and GOALS.md are allowed to outpace the current repo. That is the point of goals: they define the desired state, not a frozen claim that every mechanism is already complete. In the Kubernetes/control-loop lineage, GOALS.md is the setpoint, the repo is actual state, ao goals measure is the sensor, and /evolve, dream, validation gates, and follow-up issues are the reconcile loop. Gaps are not embarrassing when they are named, measured, and queued; they are the worklist that keeps the factory moving toward closure.
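The setpoint / actual state / sensor / reconcile framing can be sketched in a few lines. This is a minimal illustration under stated assumptions: the goal names, the numeric targets, and the "observed below target" gap test are hypothetical, and it does not describe how ao goals measure or /evolve are implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Goal:
    name: str
    target: float  # desired value (the setpoint)


def measure(goal: Goal, actual_state: dict[str, float]) -> float:
    """Sensor: read the current value for a goal; a missing metric reads as 0.0."""
    return actual_state.get(goal.name, 0.0)


def reconcile(goals: list[Goal], actual_state: dict[str, float]) -> list[str]:
    """Compare actual state to desired state and return the worklist of named gaps."""
    worklist = []
    for goal in goals:
        observed = measure(goal, actual_state)
        if observed < goal.target:
            worklist.append(f"{goal.name}: {observed} < {goal.target}")
    return worklist


# Hypothetical goals and measurements, for illustration only.
goals = [Goal("test_coverage", 0.80), Goal("docs_freshness", 1.0)]
actual = {"test_coverage": 0.72, "docs_freshness": 1.0}
print(reconcile(goals, actual))  # ['test_coverage: 0.72 < 0.8']
```

The loop returns gaps as data instead of raising, which mirrors the point that a named, measured gap is a work item and not a failure.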
| Gap | Impact | Status |
|---|---|---|
| First-value path is still too diffuse | The current product surface can ask users to understand the whole factory before they feel the first benefit. This most affects the 3.0 PMF wedge: maintainers who need context continuity and trust quickly. | in-progress |
| 3.0 PMF scenario evidence is pending | The release thesis is clear, but the exact scenario with repo/task/control/measures has not yet produced exported proof. Public PMF/productivity claims stay gated on soc-m6v5.8. | open |
| Canonical /validate and /curate consolidation is not release-ready | Additive skills exist in this worktree, but skill-count, registry, and Codex artifact gates are expected to fail until the release train resolves ship/defer and artifact sync. | in-progress |
| Public launch claims need exported proof | Local .agents/ artifacts are useful operating evidence but are not enough for public claims. 3.0 needs tracked evidence under docs/releases/ or evals/workbench/results/ when launch copy cites PMF proof. | planned |
| Compounding autonomy is still maturing | The private compounding lane is substrate-owned after the 3.0 daemon deletion, with bounded compounding, reports, setup guidance, and a separate GitHub nightly proof harness. Remaining work is deeper full-loop autonomy, calibration, and onboarding polish. | in-progress |
| Pattern-to-skill pipeline (synthesis layer) | Detection layer ships in v1 (.agents/plans/2026-04-23-pattern-to-skill-pipeline-detection.md); synthesis (LLM-authored draft skill bodies, tier heuristics, on-disk drafts in .agents/skill-drafts/) is deferred to v2 after a design council found 8+ blockers. The "self-programming compounding" framing is aspirational, not currently producing on-disk output. | deferred |
| Multi-runtime proof is tiered, not complete | Tier S structural proof is active for all four runtimes. Tier I live inventory proof is partial. Tier E live execution proof remains opt-in / nightly, not a default gate. | in-progress |
| Retrieval and worker knowledge propagation still limit compounding | The flywheel architecture is in place. Retrieval quality and passing prevention/finding context to implement workers remain weaker than the core thesis requires. | open |
| Behavioral eval system needs live agent runtime at scale | Eval workbench shipped: 3 fixture components (Go CLI, Python FastAPI, DevOps), 12 tasks with golden solutions and scoring scripts, behavioral eval suite, agent harness script, eval-skill-delta CI gate, and --two-pass head gate. Scoring infrastructure verified (golden 12/12, broken detection 12/12). A/B DeltaScorecard works for deterministic cases. Remaining gap: live agent runtime execution at scale — the harness and gates exist but full skill-on vs skill-off delta across the workbench is not yet a default gate. | in-progress |
| High-assurance profile needs deeper control mapping | The initial assurance profile now documents local-first state, evidence packets, policy gates, telemetry boundaries, autonomy modes, and out-of-scope claims. Remaining work is redaction, evidence export, supply-chain inputs, and program-specific control mapping. | in-progress |
| Public messaging is still converging | README, PRODUCT.md, GOALS.md, CDLC, the docs landing page, and the one-page brief are being sharpened around autonomous code validation, with SDLC control-plane/CDLC language as the implementation mechanism. Remaining gap: downstream comparison docs and skill-page intros still need a sweep to match. | in-progress |
| The moat is unproven — corpus delta not yet measured | The product's only candidate moat (does the compounding corpus measurably improve agent output?) has no A/B proof. ADR-0002's Δ=0 was a hook-layer test, not a corpus A/B. Until ag-8p8o produces a measured delta on realistic tasks, the moat is a hypothesis — public quality/moat claims stay gated on it. This is the single highest-leverage gap: the ruler is the product. | open (ag-8p8o) |
| The cross-family gate is table stakes, not a differentiator | Binding cross-vendor pre-merge review with non-author override already ships commercially (CodeRabbit, Qodo, Copilot — adversarial siege 2026-06-14). The in-repo pawl-gate is a necessary correctness floor; messaging must not position on it as a wedge. | acknowledged |
The internal lineage that produced this product, and the parallels we are not derived from. Users do not need this vocabulary; it records where the shape came from.
Knowledge OS → Olympus → AgentOps → Mt. Olympus.
- Knowledge OS is the systems-theoretic substrate. The dK/dt equation, stigmergy as the multi-agent coordination primitive, Meadows' leverage-point hierarchy as the design discipline. This is the body of theory the rest descends from.
- Olympus was the predecessor runtime. Power-user daemon, run ledger, context compilation, constraint injection. Archived as a live system; its patterns survived as skills inside AgentOps.
- AgentOps (this repository) is the coding-agent implementation. Skills + execution packets + ao CLI + explicit validation gates. It applies the context-compounding model to software work; always-on scheduling is delegated to a substrate.
- Mt. Olympus is not a separate running product. It is (1) the lineage of the in-repo ratchet pawl-gate — the typed-loop / explicit-ratchet-rules work (2026) the gate descends from — and (2) a parked, optional high-assurance Linux build (binding daemon, OS-account isolation) for adversarial multi-tenant settings. The acceptance gate it pioneered now ships inside AgentOps as the in-repo pawl-gate; there is no olympusd service in the running factory.
Donella Meadows' Twelve Leverage Points ranks intervention points in complex systems from weakest (#12, parameters) to strongest (#1, transcending paradigms). Changing the loop beats tuning the output. AgentOps targets the high-leverage end — #4 (self-organization) and #3 (goals) — through the knowledge flywheel and GOALS.md reconciliation, rather than #12 (a better prompt). This is the primary organizing principle, not a citation: the entire CDLC is built around moving leverage up Meadows' hierarchy.
The thread-based development pattern (multiple agents working in a compounding loop, validation gates between phases, learnings extracted into reusable skills) is applied via the software-factory operator pattern. The lineage runs through Greenfield and Short's Software Factories: Assembling Applications with Patterns, Models, Frameworks, and Tools (2003): a factory configures and composes domain-specific assets. AgentOps configures and composes context, skills, and validation gates around an operator's codebase. A direct comparison against EveryInc's Compound Engineer is at docs/comparisons/vs-compound-engineer.md.
- Heroku's Twelve-Factor App. Parallel to, not derived from. The 12-factor app describes stateless web processes managed by a control plane; AgentOps applies the same shape — environment-carried continuity, replaceable workers, explicit control plane — to coding agents. Same operating-style insight, different substrate.
- Anthropic's Managed Agents (May 2026), Cursor agents, Factory's Missions. Convergent, not derived-from. Multiple teams arriving at planner/implementer/validator separation, dreaming/memory loops, and rubric-graded outcomes is evidence the architecture is correct — not lineage. AgentOps' position is the cross-runtime, repo-native, operator-sovereign substrate.
Theoretical foundation — six pillars:
- Systems theory (Meadows) — The primary organizing principle, not a citation. Changing the loop beats tuning the output — Meadows leverage point #4 (self-organization) and #3 (goals) vs. #12 (parameters). AgentOps is built as a Meadows compounding system around the user's codebase: information flows captured (#6), rules encoded (#5), self-organization through the flywheel (#4), goals declared (#3). Most agent tooling lives at #12; AgentOps lives at #4–#3.
- DevOps Three Ways — Flow, feedback, continual learning. The discipline lineage. Applied to the agent loop instead of the deploy pipeline.
- SRE (SLOs + error budgets) — Reliability is a measurable condition, not a vibe. GOALS.md carries SLO-shaped fitness gates; ao goals measure is the burn-rate equivalent. The reliability lineage. Source: Site Reliability Engineering (Beyer, Jones, Petoff, Murphy).
- Kubernetes control loops — Declared state + reconcile loop. GOALS.md declares; /evolve reconciles. Errors don't crash the loop; they enter the work queue. The self-correction lineage.
- Brownian Ratchet — Embrace agent variance, filter aggressively, ratchet successes. Chaos + filter + one-way gate = net forward progress. The forward-only-progress lineage.
- Knowledge Flywheel (escape velocity) — If retrieval rate × usage rate exceeds decay rate, knowledge compounds. If not, it decays to zero. The compounding-context lineage. This is the one infrastructure never needed — software workers persist; agents don't. The corpus is the asset that stays yours, and the candidate moat — unproven until the delta is measured (see Strategic Bet).
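The escape-velocity condition in the Knowledge Flywheel pillar is a simple inequality. The sketch below states it directly and adds a toy projection to show the two regimes. The projection's update rule (each step scales the corpus by 1 + retrieval × usage − decay) is an assumption made for illustration; it is not the dK/dt equation from Knowledge OS, and all names are hypothetical.

```python
def compounds(retrieval_rate: float, usage_rate: float, decay_rate: float) -> bool:
    """Escape-velocity condition: retrieval rate x usage rate exceeds decay rate."""
    return retrieval_rate * usage_rate > decay_rate


def project(k0: float, retrieval_rate: float, usage_rate: float,
            decay_rate: float, steps: int) -> float:
    """Toy model (assumption): each step multiplies K by 1 + r*u - d."""
    growth = 1.0 + retrieval_rate * usage_rate - decay_rate
    k = k0
    for _ in range(steps):
        k = max(0.0, k * growth)  # knowledge cannot go negative
    return k


# Above the threshold the toy corpus grows; below it, it shrinks toward zero.
print(compounds(0.5, 0.5, 0.2), project(100.0, 0.5, 0.5, 0.2, steps=10) > 100.0)
print(compounds(0.3, 0.3, 0.2), project(100.0, 0.3, 0.3, 0.2, steps=10) < 100.0)
```

The inequality is the part taken from the text; the projection only illustrates why crossing it separates growth from decay.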
Operational principles:
- Agents are ephemeral; the system carries the state. Every skill, hook, and flywheel component exists because the agent itself can't remember. Build for amnesia.
- The corpus is the user's. The harness is ours. AgentOps' own commoditization is on the timeline. The user's accumulated knowledge isn't. Optimize the product for what the user keeps.
- Context quality determines output quality. Right context, right window, right time. Phase-specific. Role-scoped. Freshness-weighted.
- The cycle is the product. No single skill is the value. The compounding loop — research, plan, validate, build, validate, learn, repeat — is what makes the system improve.
- Two-tier execution. Orchestrators (/evolve, /rpi, /crank) stay in the main session. Workers fork into subagents where results merge back via the filesystem — never accumulated chat context.
- Atomic changes compose. Every primitive is cheap to undo. The Brownian Ratchet only works if the ratchet step is small.
- Reconcile, don't push. Kubernetes-shaped control loops compare actual state to desired state and fix the gap. They don't fire-and-forget. AgentOps loops do the same.
- Dormancy is last resort. When goals pass and backlog is empty, the system generates productive work from validation gaps, bug hunts, drift detection, and feature suggestions before going dormant.
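The Brownian Ratchet principle referenced in the list above (small random steps, an aggressive filter, a one-way gate) can be illustrated with a toy random walk. This is a sketch under stated assumptions: the integer score and the unit step size are hypothetical stand-ins for agent output and validation, not a description of the pawl-gate.

```python
import random


def ratchet(start: int, steps: int, seed: int = 0) -> list[int]:
    """Random walk where only improvements are accepted (one-way gate)."""
    rng = random.Random(seed)
    accepted = start
    history = [accepted]
    for _ in range(steps):
        candidate = accepted + rng.choice([-1, 1])  # small random step (variance)
        if candidate > accepted:  # filter: keep improvements only
            accepted = candidate  # ratchet: accepted state never moves backward
        history.append(accepted)
    return history


history = ratchet(start=0, steps=20)
# The accepted state is non-decreasing regardless of the seed.
print(all(b >= a for a, b in zip(history, history[1:])))  # prints True
```

Because each step changes the state by at most one unit, a rejected candidate costs nothing to discard, which is the sense in which the ratchet depends on small, cheaply undone changes.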
This file enables product-aware council reviews:
- /pre-mortem — Automatically loads product context when this file exists. Default --quick mode includes the context inline; deeper modes add a dedicated product perspective alongside plan-review judges.
- /vibe — Automatically loads developer-experience context when this file exists. Default --quick mode includes the context inline; deeper modes add a dedicated developer-experience perspective alongside code-review judges.
- /council --preset=product — Run product review on demand.
- /council --preset=developer-experience — Run DX review on demand.
Explicit --preset overrides from the user skip auto-include (user intent takes precedence).
- PRACTICE-REGISTRY.md — practice lineage and canonical practices: [slug] registry used by skills, hooks, evals, schemas, scripts, and CLI code.
- Context Lifecycle Contract — canonical definition of judgment validation, durable learning, and loop closure, with evidence map and mechanism inventory.
- Trust Factory — how AgentOps maps to the five-step proof contract (identity, reproducibility, evaluation, evidence, recovery).
- Wiki for your agents — the wiki framing as a standalone document.
- Scale Without Swarms — why 3-5 focused agents with fresh context and regression gates outperform massive uncoordinated swarms; the AgentOps model of waves, isolation, and gates explained.
- Brownian Ratchet — the forward-only-progress lineage in detail.
- The Science — DevOps Three Ways, the escape velocity condition, and the leverage-points map.
- vs. Compound Engineer — direct comparison against EveryInc's compound-engineering-plugin, including where AgentOps is in-scope (capture, scoring, injection, council validation, repo-native ao workflows) and where it explicitly is not.