Self-hosted AI agent gateway: run BitNet-b1.58-2B-4T (or any local LLM) with tool calling, RAG, web search, and a full agentic loop. No API keys, no cloud, no data leaks.
Small open-source models like BitNet-b1.58-2B-4T are fast and cheap to run, but they lack the reasoning depth, tool-use, and long-context handling needed for real agentic tasks. Frontier APIs (OpenAI, Anthropic) close that gap — but they require internet access, send your data to third-party servers, and cost money per token.
CAW bridges that gap without giving up control.
It is a stateless Go service that sits in front of any local model and adds:
| Problem | CAW solution |
|---|---|
| Model can't reliably call tools | Virtual tool calling — parses bash/code blocks and returns proper tool_use JSON |
| Multi-step tasks fail mid-way | Server-side agentic loop — CAW executes every step, feeds output back, loops until done |
| No memory across requests | Redis session store + sliding context window |
| Can't search or retrieve docs | RAG pipeline — Qdrant ANN + PostgreSQL FTS, RRF merge |
| Answers go stale | Web augmentation — DDG Instant Answer injected as context for live queries |
| Only one model supported | Pluggable inference adapters — Ollama, llama.cpp, BitNet, vLLM |
| Hard to scale | KEDA autoscaling — scale-to-zero on Kubernetes |
CAW is designed to be used as a drop-in ANTHROPIC_BASE_URL for Claude Code CLI — your local 2B-parameter model gets a production-grade execution environment without touching the model weights.
North Star metric: Close ≥ 60% of the capability gap between the gemma:2b published baseline and GPT-3.5 on MMLU and HumanEval, running fully offline on a $24 Droplet (4 GB RAM). Tested with BitNet-b1.58-2B-4T.
Latest benchmark (2026-04-21): MMLU 90% · HumanEval 100% · Overall gap closed: 200.8% ✅
| Use case | How CAW helps |
|---|---|
| Offline coding assistant | Use Claude Code CLI or Cursor with ANTHROPIC_BASE_URL=http://localhost:8080 — get tool calling, multi-step reasoning, and RAG over your codebase without sending a single byte to the cloud |
| Private document Q&A | Ingest internal PDFs/wikis into Qdrant; ask questions against them with full RAG retrieval — nothing leaves your machine |
| Air-gapped / regulated environments | Hospitals, defence, finance — run the full stack on an isolated server; no external API dependencies at runtime |
| Zero-cost CI/CD code review | Wire CAW into a GitHub Actions job as a local LLM reviewer; no per-token API cost, no data egress |
| Agentic workflow prototyping | Iterate on multi-step agent designs (tool calls, self-critique loops, structured JSON output) before paying for frontier API credits |
| Edge / IoT inference | Deploy to a $24 Droplet or Raspberry Pi 5; KEDA scales to zero when idle so you pay only for active inference |
| Multi-tenant SaaS backend | JWT-based tenant isolation, per-domain Qdrant collections, and Redis rate limiting — ship a white-label AI feature without a third-party LLM dependency |
When a request arrives with a tools array (e.g. from Claude Code CLI):
sequenceDiagram
participant CLI as Claude Code CLI
participant CAW as CAW (Go)
participant Model as BitNet-b1.58-2B-4T
participant Exec as Bash Executor
CLI->>CAW: POST /v1/chat/completions (tools array)
CAW->>Model: Rewritten prompt — "emit ```bash blocks, say DONE when done"
Model-->>CAW: ```bash <command>```
CAW->>Exec: Execute command (server-side sandbox)
Exec-->>CAW: stdout / stderr
CAW->>Model: Feed output back as next turn context
Model-->>CAW: DONE: <final answer>
CAW-->>CLI: Completed response (no tool round-trips)
- CAW rewrites the system prompt to instruct the model to emit `` ```bash `` blocks
- The model responds with a bash command (or Python script)
- CAW executes it inside the container (Alpine + python3)
- Output is fed back as context for the next turn
- Loop continues until the model signals `DONE:` or produces a plain-text final answer
- Claude Code CLI receives the finished result as a normal response, so no tool round-trips are needed
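The steps above can be sketched in Go roughly as follows. This is an illustrative sketch, not CAW's actual code: `extractCommand`, `runLoop`, and the `model` and `run` callbacks are hypothetical names standing in for the inference adapter and the sandboxed executor, and the step limit is an assumption added to keep the loop bounded.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// fence is the three-backtick marker that opens and closes a code block.
var fence = strings.Repeat("`", 3)

// bashBlock matches the first bash-tagged fenced block in a model reply.
var bashBlock = regexp.MustCompile("(?s)" + fence + "bash\\s*(.*?)" + fence)

// extractCommand returns the command inside the first bash block, if any.
func extractCommand(reply string) (string, bool) {
	m := bashBlock.FindStringSubmatch(reply)
	if m == nil {
		return "", false
	}
	return strings.TrimSpace(m[1]), true
}

// runLoop asks the model for the next step, runs any bash block it emits,
// feeds the output back, and stops on "DONE:" or a plain-text answer.
// model and run are hypothetical callbacks, not CAW's real interfaces.
func runLoop(prompt string, maxSteps int, model func(history []string) string, run func(cmd string) string) (string, error) {
	history := []string{prompt}
	for step := 0; step < maxSteps; step++ {
		reply := strings.TrimSpace(model(history))
		if strings.HasPrefix(reply, "DONE:") {
			return strings.TrimSpace(strings.TrimPrefix(reply, "DONE:")), nil
		}
		cmd, ok := extractCommand(reply)
		if !ok {
			return reply, nil // plain-text final answer
		}
		history = append(history, reply, "OUTPUT:\n"+run(cmd))
	}
	return "", fmt.Errorf("no final answer after %d steps", maxSteps)
}

func main() {
	// Scripted stand-in for a model: one command, then a final answer.
	replies := []string{fence + "bash\necho hello\n" + fence, "DONE: the command printed hello"}
	turn := 0
	model := func([]string) string {
		r := replies[turn]
		turn++
		return r
	}
	run := func(cmd string) string { return "ran: " + cmd } // nothing is executed here
	answer, err := runLoop("say hello", 5, model, run)
	fmt.Println(answer, err)
}
```

A real executor would run the command inside the sandbox and return its stdout and stderr; the stub here only echoes the command so the sketch stays safe to run.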
graph TD
GW["API Gateway (Fiber)\nOpenAI-compat HTTP · SSE streaming · Worker pool\nJWT multi-tenant auth · Bearer token · /healthz"]
GW --> ORC
GW --> MEM
GW --> TOOL
subgraph ORC["Orchestration"]
CM["ContextManager"]
TP["TaskPlanner"]
OF["OutputFormatter"]
SC["Self-Critique"]
end
subgraph MEM["Memory Layer"]
RD["Redis\nsession store"]
QD["Qdrant\nvector collections"]
PG["PostgreSQL\nmetadata + FTS"]
end
subgraph TOOL["Tool Registry"]
TD["Dispatcher"]
CE["CodeExecutor\n(gVisor / native)"]
PS["Plugin System"]
end
ORC --> IA
MEM --> IA
TOOL --> IA
subgraph IA["Inference Adapter"]
OA["OllamaAdapter"]
LA["LlamaCppAdapter"]
BA["BitNetAdapter"]
VA["vLLMAdapter (Phase 2)"]
end
IA --> LM["Local Model\nBitNet-b1.58-2B-4T · gemma:2b · llama3\n(Ollama / llama.cpp / BitNet)"]
| Layer | Description |
|---|---|
| API Gateway | OpenAI-compatible HTTP surface (Fiber v2), SSE streaming, worker-pool backpressure, JWT + Bearer auth |
| Orchestration | ContextManager, TaskPlanner, OutputFormatter, Self-Critique loop |
| Memory Layer | Redis session store, Qdrant vector collections (per-domain), PostgreSQL document metadata + FTS |
| Async Ingest | Redis Streams job queue, IngestWorker, DLQ, daily reconciliation CronJob |
| Embedding Service | Dedicated all-MiniLM-L6-v2 pod (384-dim) via gRPC with circuit breaker + LRU cache |
| RAG Pipeline | Parallel Qdrant ANN + PG FTS, RRF merge, cross-encoder reranker (agent mode) |
| Tool Registry | Tool dispatcher, CodeExecutor sandbox (seccomp + cgroup v2, optional gVisor), community plugins |
| Inference Adapter | Pluggable InferenceBackend interface — OllamaAdapter, LlamaCppAdapter, BitNetAdapter, vLLMAdapter |
| IaC / Scaling | Docker Alpine image (~30 MB), Helm charts, KEDA ScaledObjects, serverless manifests |
| Observability | OTel traces, 6 canonical caw_* Prometheus metrics, Grafana dashboards, k6 load tests |
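As an illustration of the RRF merge step in the RAG pipeline row, here is a minimal Go sketch of Reciprocal Rank Fusion over two ranked lists of document IDs. The function name `rrfMerge`, the document IDs, and the constant `k = 60` (a commonly used value) are assumptions for this example; the source does not say which constant or tie-breaking rule CAW uses.

```go
package main

import (
	"fmt"
	"sort"
)

// rrfMerge fuses ranked ID lists with Reciprocal Rank Fusion:
// score(d) = sum over lists of 1 / (k + rank), with rank starting at 1.
// Each input list is assumed to contain no duplicate IDs.
func rrfMerge(k float64, lists ...[]string) []string {
	scores := map[string]float64{}
	for _, list := range lists {
		for i, id := range list {
			scores[id] += 1.0 / (k + float64(i+1))
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	// Highest fused score first; ties are broken by ID for stable output.
	sort.Slice(ids, func(a, b int) bool {
		if scores[ids[a]] != scores[ids[b]] {
			return scores[ids[a]] > scores[ids[b]]
		}
		return ids[a] < ids[b]
	})
	return ids
}

func main() {
	ann := []string{"doc-a", "doc-b", "doc-c"} // hypothetical vector-search ranking
	fts := []string{"doc-b", "doc-d", "doc-a"} // hypothetical full-text ranking
	fmt.Println(rrfMerge(60, ann, fts)) // prints [doc-b doc-a doc-d doc-c]
}
```

Documents that appear in both lists accumulate two reciprocal-rank terms, which is why `doc-b` and `doc-a` outrank the documents found by only one retriever.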
ollama pull gemma:2b
docker compose up -d
This starts: CAW wrapper, Ollama, Redis, PostgreSQL, Qdrant, and the embedding service.
Run the full CAW stack locally with BitNet-b1.58-2B-4T — a ternary-quantized model that runs on CPU at ~3 tokens/sec on 4 GB RAM.
- macOS with Homebrew (PostgreSQL + Redis auto-installed if missing)
- BitNet built from source
- Model file: `BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf`
# scripts/start-bitnet-stack.sh — set these three variables at the top:
GGUF=/path/to/ggml-model-i2_s.gguf
BITNET_DIR=/path/to/BitNet
GATEWAY_DIR=/path/to/local-llm-gateway
./scripts/start-bitnet-stack.sh
The script will:
- Install and start PostgreSQL 16 via Homebrew if not present, create the `caw` role + database
- Install and start Redis via Homebrew if not present
- Start the BitNet llama-server on `:8082`
- Start the CAW gateway on `:8080` (foreground)
# BitNet (model name is the path/ID reported by llama-server)
curl http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer dev-key" \
-H "Content-Type: application/json" \
-d '{
"model": "bitnet",
"messages": [{"role": "user", "content": "Explain transformers in one paragraph."}],
"stream": false
}'
# Ollama (Option A)
curl http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer dev-key" \
-H "Content-Type: application/json" \
-d '{
"model": "gemma:2b",
"messages": [{"role": "user", "content": "Explain transformers in one paragraph."}],
"stream": false
}'
Point Claude Code at CAW as its backend so that your local model gets full agentic capability:
export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_API_KEY=dev-key
claude # or: claude "write a prime factorization function and test it"
CAW intercepts every tool call, executes it server-side, and returns finished results.
Hermes Agent is a self-improving AI agent with a built-in skills system, persistent memory, and 40+ tools. The hermes-caw.sh script starts CAW and Hermes together, with Hermes pre-configured to use CAW as its inference backend and tool registry.
# Start CAW + Hermes in one command
CAW_API_KEY=dev-key ./scripts/hermes-caw.sh
# Start only CAW (no Hermes)
./scripts/hermes-caw.sh --caw-only
# Attach Hermes to an already-running CAW
CAW_API_KEY=dev-key ./scripts/hermes-caw.sh --hermes-only
# Override port or Ollama endpoint
CAW_API_KEY=dev-key PORT=9090 OLLAMA_BASE_URL=http://localhost:11434 ./scripts/hermes-caw.sh
Once Hermes is running, it routes all inference through CAW (`localhost:8080/v1`) and discovers CAW's tool registry via MCP (`localhost:8080/mcp`). Use it for autonomous background tasks:
# Inside Hermes:
> implement the failing tests in tests/gateway/ and commit
> run the MMLU benchmark and summarise the results
For headless execution (no interactive prompt):
CAW_API_KEY=dev-key hermes --no-interactive "implement US-14 from docs/reference/agile-backlog.md, run tests, commit"
curl -N http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer dev-key" \
-H "Content-Type: application/json" \
-d '{"model":"bitnet","messages":[{"role":"user","content":"Write a Go HTTP server"}],"stream":true}'
All configuration is via environment variables. See .env.example for the full list.
| Variable | Default | Description |
|---|---|---|
| `CAW_API_KEY` | (required) | Bearer token for API auth |
| `CAW_JWT_SECRET` | (optional) | If set, enables JWT multi-tenant auth |
| `INFERENCE_BACKEND` | `ollama` | Adapter: `ollama`, `llamacpp`, `bitnet`, `vllm` |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama inference endpoint |
| `BITNET_BASE_URL` | `http://localhost:8080` | BitNet llama-server endpoint |
| `BITNET_MODEL_QUANT` | (optional) | Must be `i2_s` if set; any other value is rejected |
| `REDIS_ADDR` | `localhost:6379` | Redis address |
| `DATABASE_URL` | (required) | PostgreSQL DSN |
| `QDRANT_BASE_URL` | `http://localhost:6333` | Qdrant endpoint |
| `EMBED_BASE_URL` | `http://localhost:5000` | Embedding service endpoint |
| `WORKER_POOL_SIZE` | `10` | Max concurrent inference requests |
| `CODEEXEC_RUNTIME` | `native` | Code sandbox: `native` or `gvisor` |
| `CAW_PLUGIN_DIR` | (optional) | Directory of community plugin binaries |
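To make the defaults and validation rules in the table concrete, here is a hedged Go sketch of loading a subset of these variables. The `Config` struct and `loadConfig` function are hypothetical names for illustration; CAW's real configuration code may be organised differently. The lookup function is passed in so the logic can be exercised without touching the process environment.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
)

// Config holds a subset of the settings from the table (illustrative only).
type Config struct {
	APIKey, Backend, RedisAddr, DatabaseURL string
	WorkerPoolSize                          int
}

// loadConfig applies the documented defaults and rejects invalid values.
// getenv is usually os.Getenv; a map-backed function works for tests.
func loadConfig(getenv func(string) string) (Config, error) {
	orDefault := func(key, def string) string {
		if v := getenv(key); v != "" {
			return v
		}
		return def
	}
	cfg := Config{
		APIKey:      getenv("CAW_API_KEY"),
		Backend:     orDefault("INFERENCE_BACKEND", "ollama"),
		RedisAddr:   orDefault("REDIS_ADDR", "localhost:6379"),
		DatabaseURL: getenv("DATABASE_URL"),
	}
	if cfg.APIKey == "" || cfg.DatabaseURL == "" {
		return cfg, errors.New("CAW_API_KEY and DATABASE_URL are required")
	}
	switch cfg.Backend {
	case "ollama", "llamacpp", "bitnet", "vllm":
	default:
		return cfg, fmt.Errorf("unknown INFERENCE_BACKEND %q", cfg.Backend)
	}
	// The table states that any quantisation other than i2_s is rejected.
	if q := getenv("BITNET_MODEL_QUANT"); q != "" && q != "i2_s" {
		return cfg, fmt.Errorf("BITNET_MODEL_QUANT must be i2_s, got %q", q)
	}
	n, err := strconv.Atoi(orDefault("WORKER_POOL_SIZE", "10"))
	if err != nil || n < 1 {
		return cfg, errors.New("WORKER_POOL_SIZE must be a positive integer")
	}
	cfg.WorkerPoolSize = n
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("backend=%s workers=%d\n", cfg.Backend, cfg.WorkerPoolSize)
}
```

Failing fast on a missing required variable or an unsupported backend keeps misconfiguration visible at startup instead of at the first request.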
When CAW_JWT_SECRET is set, requests must include a signed JWT with a domains claim:
# Generate a token (example with jwt-cli)
jwt encode --secret "$CAW_JWT_SECRET" '{"sub":"tenant-1","domains":["finance","legal"]}'
# Use it
curl http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer <token>" \
-H "X-Domain: finance" \
...
Without CAW_JWT_SECRET, the service falls back to API key auth (backward compatible).
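The domain check described above can be sketched with the Go standard library alone. This is an assumption-laden illustration, not CAW's middleware: it assumes HS256-signed tokens (the default for `jwt encode --secret`), the names `sign` and `authorize` are hypothetical, and expiry (`exp`) validation is deliberately left out to keep the example short.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// claims mirrors the payload from the jwt-cli example.
type claims struct {
	Sub     string   `json:"sub"`
	Domains []string `json:"domains"`
}

var enc = base64.RawURLEncoding

// mac computes the HS256 signature over "header.payload".
func mac(secret []byte, msg string) []byte {
	h := hmac.New(sha256.New, secret)
	h.Write([]byte(msg))
	return h.Sum(nil)
}

// sign builds an HS256 token; it exists only to make the demo self-contained.
func sign(secret []byte, c claims) string {
	header := enc.EncodeToString([]byte(`{"alg":"HS256","typ":"JWT"}`))
	body, _ := json.Marshal(c)
	msg := header + "." + enc.EncodeToString(body)
	return msg + "." + enc.EncodeToString(mac(secret, msg))
}

// authorize verifies the signature, then checks that the requested domain
// (the X-Domain header value) appears in the token's domains claim.
func authorize(secret []byte, token, domain string) (claims, error) {
	var c claims
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return c, errors.New("malformed token")
	}
	var hdr struct {
		Alg string `json:"alg"`
	}
	rawHdr, err := enc.DecodeString(parts[0])
	if err != nil || json.Unmarshal(rawHdr, &hdr) != nil || hdr.Alg != "HS256" {
		return c, errors.New("unsupported or unreadable header")
	}
	sig, err := enc.DecodeString(parts[2])
	if err != nil || !hmac.Equal(sig, mac(secret, parts[0]+"."+parts[1])) {
		return c, errors.New("bad signature")
	}
	payload, err := enc.DecodeString(parts[1])
	if err != nil || json.Unmarshal(payload, &c) != nil {
		return c, errors.New("unreadable payload")
	}
	for _, d := range c.Domains {
		if d == domain {
			return c, nil
		}
	}
	return c, fmt.Errorf("domain %q not allowed for %s", domain, c.Sub)
}

func main() {
	secret := []byte("example-secret") // stands in for CAW_JWT_SECRET
	tok := sign(secret, claims{Sub: "tenant-1", Domains: []string{"finance", "legal"}})
	_, err := authorize(secret, tok, "finance")
	fmt.Println("finance:", err)
	_, err = authorize(secret, tok, "hr")
	fmt.Println("hr:", err)
}
```

Pinning the accepted algorithm to HS256 and comparing signatures with `hmac.Equal` avoids two common JWT pitfalls: algorithm confusion and timing leaks.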
.
├── cmd/wrapper/ # main.go — entrypoint, DI wiring
├── internal/
│ ├── adapter/ # InferenceBackend interface + Ollama/LlamaCpp/BitNet/vLLM adapters
│ ├── embed/ # gRPC client for embedding service + LRU cache + circuit breaker
│ ├── gateway/ # Fiber handlers, SSE streaming, worker pool, auth middleware
│ ├── ingest/ # IngestWorker, Redis Streams consumer, DLQ, reconciler
│ ├── memory/ # Redis session store, context window manager
│ ├── observability/ # OTel tracer/meter bootstrap, Prometheus metrics
│ ├── orchestration/ # ContextManager, TaskPlanner, OutputFormatter, self-critique
│ ├── rag/ # Retriever (ANN + FTS + RRF), reranker, chunk store
│ ├── security/ # JWT middleware, API key auth, constant-time compare
│ └── tools/ # Dispatcher, CodeExecutor (sandbox), PluginExecutor, loader
├── embed_service/ # Python gRPC server — all-MiniLM-L6-v2 (sentence-transformers)
├── proto/embed/ # Protobuf contract for embedding gRPC
├── scripts/
│ ├── benchmark/ # MMLU + HumanEval benchmark harness (Go package)
│ ├── start-bitnet-stack.sh # One-shot macOS launcher: auto-provisions PG + Redis, starts BitNet + CAW
│ ├── hermes-caw.sh # Start CAW + Hermes Agent together (or independently via flags)
│ └── migrate_qdrant.go # Qdrant collection migration CLI
├── deploy/
│ ├── helm/ # Helm charts (caw-wrapper, embed-service, qdrant-distributed,
│ │ # pgbouncer, ingest-worker, caw-reconciler, inference-backend)
│ ├── keda/ # KEDA ScaledObjects (wrapper + ingest-worker)
│ ├── grafana/ # Dashboard JSON + datasource config
│ ├── prometheus/ # Alert rules
│ ├── redis/ # Redis config
│ ├── docker/ # Docker-specific overrides
│ └── serverless/ # Knative + AWS Lambda manifests
├── .github/workflows/ # helm-publish.yml — OCI publish on helm-v* tags
├── tests/ # Go test packages mirroring internal/
│ ├── adapter/
│ ├── benchmark/ # Requires -tags benchmark
│ ├── embed/
│ ├── gateway/
│ ├── iac/ # Helm/K8s manifest tests
│ ├── ingest/
│ ├── k6/ # Load tests
│ ├── memory/
│ ├── observability/
│ ├── orchestration/
│ ├── rag/
│ ├── security/
│ └── tools/
├── docs/reference/ # Architecture spec, agile backlog
├── Dockerfile # Multi-stage build → scratch image (<15 MB)
├── docker-compose.yml # Full local stack
├── go.mod / go.sum
└── progress.txt # Sprint completion log
# All tests (standard)
go test ./tests/... -count=1
# All tests including benchmark harness
go test ./tests/... -tags benchmark -count=1
# Single package
go test ./tests/security/... -v
# With race detector
go test -race ./tests/... -count=1
12 test packages, 0 failures on the current main branch.
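The benchmark harness measures how much of the gap between the raw model's published baseline and the GPT-3.5 baseline CAW closes. The latest run used a small sample (10 MMLU questions and 5 HumanEval problems), so the percentages are indicative rather than statistically robust.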
| Benchmark | CAW score | gemma:2b baseline | GPT-3.5 baseline | Gap closed |
|---|---|---|---|---|
| MMLU | 90% (9/10) | 35% | 70% | 157.1% |
| HumanEval | 100% (5/5) | 12% | 48% | 244.4% |
| Overall | — | — | — | 200.8% ✅ (target ≥ 60%) |
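The "Gap closed" column is consistent with the formula (CAW score − baseline) / (GPT-3.5 score − baseline), and the overall figure matches the unweighted mean of the two rows. The Go sketch below reproduces the table's numbers under that assumption; `gapClosed` is a hypothetical helper, not a function from the benchmark harness.

```go
package main

import "fmt"

// gapClosed returns the percentage of the baseline-to-target gap that
// score covers. Values above 100 mean the score exceeds the target.
func gapClosed(score, baseline, target float64) float64 {
	return (score - baseline) / (target - baseline) * 100
}

func main() {
	mmlu := gapClosed(90, 35, 70)       // (90-35)/(70-35)
	humanEval := gapClosed(100, 12, 48) // (100-12)/(48-12)
	overall := (mmlu + humanEval) / 2   // assumed: unweighted mean
	fmt.Printf("MMLU %.1f%%  HumanEval %.1f%%  overall %.1f%%\n", mmlu, humanEval, overall)
	// prints: MMLU 157.1%  HumanEval 244.4%  overall 200.8%
}
```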
The benchmark is a Go e2e test suite — start the CAW stack first, then run:
# Full North Star gate (MMLU + HumanEval, asserts ≥ 60% gap closed)
go test -v -tags e2e -run TestE2E_BenchmarkNorthStar ./tests/e2e/ -timeout 15m
# Individual category tests with per-question logging
go test -v -tags e2e -run "TestE2E_MMLU|TestE2E_HumanEval" ./tests/e2e/ -timeout 15m
# Override endpoint / API key
CAW_ENDPOINT=http://my-host:8080/v1/chat/completions \
CAW_API_KEY=my-key \
go test -v -tags e2e -run TestE2E_BenchmarkNorthStar ./tests/e2e/ -timeout 15m
The test writes tests/e2e/e2e_benchmark_results.json after each run. It auto-skips if the stack is not reachable, so it's safe to keep in CI without a live model.
# Dry-run (mock responder — no live model needed, validates harness logic)
go test ./tests/benchmark/... -tags benchmark -v
All charts live in deploy/helm/. Published to oci://ghcr.io/caw/charts on helm-v* tags.
| Chart | Description |
|---|---|
| `caw-wrapper` | Main API gateway + orchestration service |
| `embed-service` | Python embedding pod (all-MiniLM-L6-v2) |
| `ingest-worker` | Async ingest consumer (Redis Streams) |
| `caw-reconciler` | Daily reconciliation CronJob |
| `inference-backend` | Ollama / llama.cpp sidecar |
| `qdrant-distributed` | 3-replica Qdrant StatefulSet (Raft consensus) |
| `pgbouncer` | PgBouncer connection pooler (transaction mode) |
# Install the full stack into a cluster
helm install caw oci://ghcr.io/caw/charts/caw-wrapper --version 1.0.0
# Trigger Helm publish (creates the GitHub Actions workflow run)
git tag helm-v1.0.0 && git push origin helm-v1.0.0
| Component | Trigger | Scale range |
|---|---|---|
| `caw-wrapper` | `caw_requests_in_flight` (Prometheus) | 0 → 10 |
| `ingest-worker` | Redis Streams `pendingEntriesCount` > 50 | 0 → 5 |
| `inference-backend` | keep-warm (`minReplicaCount: 1`) | 1 → 3 |
Fallback: if Prometheus is unavailable, wrapper holds at 3 replicas.
| Metric | Type | Description |
|---|---|---|
| `caw_requests_total` | Counter | Total requests by status |
| `caw_requests_in_flight` | Gauge | Active worker slots in use |
| `caw_inference_latency_seconds` | Histogram | End-to-end inference time |
| `caw_redis_latency_seconds` | Histogram | Redis command latency |
| `caw_rag_retrieval_latency_seconds` | Histogram | RAG retrieval time |
| `caw_ingest_dlq_depth` | Gauge | Dead-letter queue depth |
Grafana dashboard: deploy/grafana/dashboard.json
| Path | Description |
|---|---|
| `GET /healthz` | Liveness probe (always 200 if process alive) |
| `GET /readyz` | Readiness probe (checks Redis + Postgres + Qdrant) |
| `GET /metrics` | Prometheus scrape endpoint |
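The difference between the liveness and readiness probes can be sketched as follows. CAW's gateway is built on Fiber; this example uses the standard `net/http` package instead so it stays self-contained, and the `check` type, the handler names, and the 2-second timeout are assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// check is a hypothetical dependency probe, e.g. a Redis PING or SELECT 1.
type check func(ctx context.Context) error

func alwaysOK(context.Context) error { return nil }

func failing(msg string) check {
	return func(context.Context) error { return errors.New(msg) }
}

// healthz reports liveness: 200 whenever the process can serve HTTP.
func healthz() http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) { fmt.Fprintln(w, "ok") }
}

// readyz returns 200 only when every dependency check passes in time.
func readyz(checks map[string]check) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		for name, c := range checks {
			if err := c(ctx); err != nil {
				http.Error(w, name+": "+err.Error(), http.StatusServiceUnavailable)
				return
			}
		}
		fmt.Fprintln(w, "ready")
	}
}

// status sends an in-memory GET request and returns the response code.
func status(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := http.NewServeMux()
	mux.Handle("/healthz", healthz())
	mux.Handle("/readyz", readyz(map[string]check{
		"redis":    alwaysOK,
		"postgres": failing("connection refused"),
	}))
	fmt.Println("/healthz", status(mux, "/healthz")) // prints /healthz 200
	fmt.Println("/readyz", status(mux, "/readyz"))   // prints /readyz 503
}
```

Keeping liveness free of dependency checks means an outage in Redis or Postgres takes the pod out of rotation without triggering a restart loop.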
See deploy/serverless/README.md for full instructions.
- Knative: `deploy/serverless/knative-service.yaml` (scale-to-zero, maxScale 10)
- AWS Lambda: `deploy/serverless/lambda-function.yaml` (CloudFormation SAM, container image)
Community plugins are subprocess binaries that speak JSON on stdin/stdout.
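A minimal plugin might look like the Go sketch below, which uses the request and response shapes from the plugin contract in this section. It assumes one JSON request per process invocation and a string-valued `query` input; the upper-casing logic and the `handle` and `run` helpers are hypothetical and only demonstrate the stdin/stdout exchange.

```go
package main

import (
	"encoding/json"
	"io"
	"os"
	"strings"
)

// request and response follow the documented contract.
type request struct {
	Tool  string         `json:"tool"`
	Input map[string]any `json:"input"`
}

type response struct {
	Output string  `json:"output"`
	Error  *string `json:"error"` // nil encodes as JSON null
}

// handle reads one JSON request from in and writes one JSON response to out.
func handle(in io.Reader, out io.Writer) error {
	var req request
	var resp response
	if err := json.NewDecoder(in).Decode(&req); err != nil {
		msg := "invalid request: " + err.Error()
		resp.Error = &msg
	} else {
		// Hypothetical tool logic: upper-case the query string.
		query, _ := req.Input["query"].(string)
		resp.Output = strings.ToUpper(query)
	}
	return json.NewEncoder(out).Encode(resp)
}

// run is a small helper for exercising handle with in-memory strings.
func run(input string) string {
	var out strings.Builder
	_ = handle(strings.NewReader(input), &out)
	return out.String()
}

func main() {
	if err := handle(os.Stdin, os.Stdout); err != nil {
		os.Exit(1)
	}
}
```

Built with `go build -o my-tool`, the binary can be tried by piping a request into it, for example `echo '{"tool":"my-tool","input":{"query":"hi"}}' | ./my-tool`.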
# Install a plugin
cp my-tool /plugins/my-tool && chmod +x /plugins/my-tool
# Configure
export CAW_PLUGIN_DIR=/plugins
Plugin contract (stdin → stdout):
// Request
{"tool": "my-tool", "input": {"query": "..."}}
// Response
{"output": "...", "error": null}
# Build
go build -o caw ./cmd/wrapper
# Run locally with Ollama (requires Redis, Postgres, Qdrant, Ollama)
CAW_API_KEY=dev-key \
REDIS_ADDR=localhost:6379 \
DATABASE_URL=postgres://caw:caw@localhost:5432/caw?sslmode=disable \
QDRANT_BASE_URL=http://localhost:6333 \
EMBED_BASE_URL=http://localhost:5000 \
OLLAMA_BASE_URL=http://localhost:11434 \
./caw
# Run locally with BitNet (requires Redis, Postgres, and a running llama-server on :8082)
CAW_API_KEY=dev-key \
INFERENCE_BACKEND=bitnet \
BITNET_BASE_URL=http://localhost:8082 \
DATABASE_URL=postgres://caw:caw@localhost:5432/caw?sslmode=disable \
REDIS_ADDR=localhost:6379 \
./caw
# Or use the one-shot launcher (auto-provisions everything on macOS)
./scripts/start-bitnet-stack.sh
# Start CAW + Hermes Agent (full integration)
CAW_API_KEY=dev-key ./scripts/hermes-caw.sh
# Lint
go vet ./...
# Format
gofmt -w .
cd embed_service
pip install -r requirements.txt
python server.py # gRPC on :50051
# or
python http_server.py # HTTP on :5000
See CONTRIBUTING.md for the full guide: setup, TDD workflow, commit format, plugin development, and code style.
Quick checklist:
- Fork and create a feature branch: `git checkout -b feat/my-feature`
- Write failing tests first, then implement (TDD)
- Ensure `go test ./tests/... -count=1` passes with 0 failures
- Commit with the canonical format: `feat(US-XX): <title>`
- Open a pull request against `main`
- Added `BitNetAdapter` (`internal/adapter/bitnet.go`), which calls the BitNet llama-server's `/v1/chat/completions` (OpenAI-compat) endpoint instead of the raw `/completion` endpoint, enabling proper `max_tokens` handling and SSE streaming.
- `INFERENCE_BACKEND=bitnet` activates the adapter; `BITNET_BASE_URL` configures the server URL (default `http://localhost:8080`).
- `BitNetAdapter`, `LlamaCppAdapter`, and `OllamaAdapter` all sent `n_predict: 0` / `num_predict: 0` when the client omitted `max_tokens`, causing llama.cpp to generate zero tokens and return a single garbage token.
- Fixed: default to `512` tokens (`-1` for Ollama) when `MaxTokens` is unset.
- `scripts/start-bitnet-stack.sh` now automatically installs and starts PostgreSQL 16 and Redis via Homebrew if they are not already running, creates the `caw` role and database, and waits for readiness before starting CAW, so no manual setup is required on a fresh Mac.
- New `scripts/hermes-caw.sh` starts CAW and Hermes Agent together with a single command.
- Hermes is pointed at CAW's OpenAI-compatible `/v1` endpoint and MCP tool registry at `/mcp`.
- Flags: `--caw-only` (start only CAW), `--hermes-only` (attach Hermes to a running CAW).
- All env vars (`CAW_API_KEY`, `PORT`, `OLLAMA_BASE_URL`, etc.) are forwarded automatically.
- `~/.hermes/config.yaml` is pre-configured during `hermes install` to use CAW as the default provider.
- `TestE2E_BenchmarkNorthStar` now passes: MMLU 90%, HumanEval 100%, gap closed 200.8%.
- Fixed `buildMMLUPrompt` leaking the correct answer into live model prompts (was designed for `MockResponder` only).
- Fixed `MockResponder` to look up answers from `SampleMMLUQuestions` instead of parsing the prompt.
- Fixed `extractLetter` to handle `(C) Paris` format responses.
- Fixed `ContainsPythonFunctionDef` to match `def` inside code fences and inline after explanatory text.
Hermes Agent integrates with CAW at two levels:
| Integration | How | Result |
|---|---|---|
| Inference | Hermes routes all LLM calls through CAW's `/v1/chat/completions` | Local model gets Hermes' full agentic loop on top of CAW's orchestration |
| Tool registry | Hermes connects to CAW's MCP server at `/mcp` | Hermes can invoke any CAW tool (code executor, RAG retrieval, web search) natively |
# Install Hermes (one-time)
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
# Start both
CAW_API_KEY=dev-key ./scripts/hermes-caw.sh
Hermes is pre-configured via ~/.hermes/config.yaml (written during setup):
- Provider: `custom` → `http://localhost:8080/v1`
- MCP server: `caw` → `http://localhost:8080/mcp` with `Authorization: Bearer ${CAW_API_KEY}`
| Pattern | Command |
|---|---|
| Interactive agent | ./scripts/hermes-caw.sh then chat inside Hermes |
| Headless task | hermes --no-interactive "<task>" |
| Attach to running CAW | ./scripts/hermes-caw.sh --hermes-only |
| CAW only (no Hermes) | ./scripts/hermes-caw.sh --caw-only |
MIT — see LICENSE for details.
| Feature | local-llm-gateway (CAW) | LiteLLM | Ollama | LocalAI |
|---|---|---|---|---|
| OpenAI-compatible API | ✅ | ✅ | ✅ | ✅ |
| Anthropic-compatible API | ✅ | ✅ | ❌ | ❌ |
| Server-side agentic loop | ✅ | ❌ | ❌ | ❌ |
| Tool calling for any model | ✅ | partial | ❌ | ❌ |
| RAG (vector + FTS) | ✅ | ❌ | ❌ | partial |
| Web augmentation (DDG) | ✅ | ❌ | ❌ | ❌ |
| Redis session memory | ✅ | ❌ | ❌ | ❌ |
| KEDA autoscaling | ✅ | ❌ | ❌ | ❌ |
| 100% offline | ✅ | ✅ | ✅ | ✅ |
Keywords: local-llm · self-hosted-ai · llm-gateway · llm-agent · offline-ai · openai-compatible · anthropic-compatible · bitnet · gemma · ollama · rag · tool-calling · agentic-loop · go · fiber · redis · qdrant · postgresql · keda · helm · docker