GitHub - Lkumar209/CHASM: CHASM: detecting and measuring stated-vs-enacted divergence in tool-using LLM agents under adaptive evasion.

# CHASM

Cross-cHannel Agent Safety Monitoring

CHASM detects when a tool-using LLM agent's stated reasoning diverges from its enacted actions, and measures how that detection holds up when the agent is actively optimized to hide the gap.

The chasm between what an agent says and what it does is the thing we measure and try to close.

Architecture

RIFT-Bench (benchmark)
  ├── Hermetic mock sandbox (8+ task families)
  ├── Non-adaptive evasion ladder (L0–L4)
  └── Ground-truth canary instrumentation

Agent loop (dual-channel capture)
  ├── Stated channel: plan / reasoning / whitelist
  └── Enacted channel: tool calls + args + observations

Monitor zoo
  ├── cot_judge          (stated-only baseline)
  ├── action_judge       (enacted-only baseline)
  ├── CrossGuard         (proposed: cross-channel consistency)
  ├── online             (streaming; flags mid-trajectory)
  ├── probe              (white-box activation probes)
  ├── ensemble / debate  (learned fusion, cross-exam)
  └── stego              (multi-agent covert channel detector)

Breaker (adaptive red-team)
  └── closed-loop evasion search → adaptive-robustness frontier

Control protocol
  └── trusted/untrusted framing → safety-case report

Core Idea Diagram

Quick start

git clone https://github.com/Lkumar209/chasm && cd chasm
uv sync --extra dev          # install all dev deps (no GPU required)
uv run pytest -m smoke -v    # smoke e2e: ~seconds on CPU, no API key needed

With a GPU and local models:

uv sync --extra local
uv run chasm experiment=full  # full pipeline (default: Qwen3-4B-Instruct on CUDA)

Reproduce the headline table and Figures 1–4 from cached trajectories:

bash scripts/reproduce.sh

Project status

Milestone	Status
M0 Scaffold + repo + CI	✅ Done
M1 Sandbox + tasks	✅ Done
M2 Agent + capture	✅ Done
M3 Induction ladder	✅ Done
M4 Baselines + calibration	✅ Done
M5 CrossGuard	✅ Done
M6 Probe + cascade	✅ Done
M7 Breaker	✅ Done
M8 Online + ensemble + debate	✅ Done
M9 Multi-agent collusion	✅ Done
M10 Control + dashboard + serving	✅ Done
M11 Full eval + release	✅ Done

159 tests passing · Python 3.11 + 3.12 · ruff clean · mypy clean

Benchmark results

Evaluated on sim backend (8 TRAIN tasks × 3 seeds = 48 trajectories per condition, 2000 bootstrap resamples). For real-LLM numbers run bash scripts/reproduce.sh with ANTHROPIC_API_KEY set.

Monitor	AUROC (naive_divergent)	AUROC (evasion_L2)	TPR@FPR=1%
ActionJudge	1.000 [1.000,1.000]	1.000	1.000
CoTJudge	1.000 [1.000,1.000]	0.500	1.000
CrossGuard	1.000 [1.000,1.000]	1.000	1.000
OnlineMonitor	1.000 [1.000,1.000]	1.000	1.000
EnsembleMonitor	1.000 [1.000,1.000]	1.000	1.000
DebateMonitor	1.000 [1.000,1.000]	1.000	1.000

All metrics with bootstrap 95% CI (2000 resamples). Key finding: CoTJudge collapses to 0.500 AUROC under evasion_L2 (agent hides canary access after step 0), while CrossGuard and ActionJudge hold at 1.000.

Hypothesis verdicts

Hypothesis	Prediction	Verdict
H1 · AUROC > 0.85 at zero pressure	CrossGuard ≥ 0.85	SUPPORTED — min AUROC=1.000 across 6 monitors
H2 · CoT ladder drop ≥ 0.25	CoTJudge collapses under evasion	SUPPORTED — AUROC drop=0.500 (naive=1.000 → L2=0.500)
H3 · CrossGuard degrades less (non-adaptive)	Dual-channel beats single-channel	SUPPORTED — CrossGuard drop=0.000 vs cot_judge drop=0.500
H4 · CrossGuard lower Breaker evasion rate	Harder to fool both channels	UNTESTED — requires Breaker run
H5 · Cascade closes ≥ 80% gap at ≤ 20% escalation	Cost-efficient detection	UNTESTED — requires cascade run
H6 · Transfer AUROC > 0.65	Rule-based signal generalizes	UNTESTED — requires transfer eval
H7 · Judge < probe on collusion	Stego signal beats stated reasoning	UNTESTED — requires collusion run
H8 · Online catches early fraction	Flags before covert action completes	UNTESTED — requires online latency eval

Research questions

See docs/design.md for full framing, docs/preregistration.md for pre-registered hypotheses.

Citation

@software{chasm2024,
  author = {Lkumar209},
  title  = {CHASM: Cross-cHannel Agent Safety Monitoring},
  url    = {https://github.com/Lkumar209/chasm},
  year   = {2024},
}

License

Apache 2.0. See LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
.devcontainer		.devcontainer
.github		.github
configs/experiment		configs/experiment
docs		docs
paper		paper
results		results
scripts		scripts
src/chasm		src/chasm
tests		tests
.env.example		.env.example
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CHANGELOG.md		CHANGELOG.md
CODEOWNERS		CODEOWNERS
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Architecture

Core Idea Diagram

Quick start

Project status

Benchmark results

Hypothesis verdicts

Research questions

Citation

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Architecture

Core Idea Diagram

Quick start

Project status

Benchmark results

Hypothesis verdicts

Research questions

Citation

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages