feat(mcp): SaaS tenant-identity spine — authenticated tenant, no header override, default-deny #66

Open
Basheirkh wants to merge 5 commits into main from feat/saas-identity-spine

@Basheirkh

Summary

Phase 1 of multi-tenant SaaS (PLAN-saas-multitenant): close the isolation hole where a request's tenant came from the client-supplied, unauthenticated x-nil-workspace header. Anyone could set that header and read any workspace's data via registry routing.

resolve_tenant gains a SaaS mode:

  • tenant = the authenticated identity via an injected claim_resolver (verified JWT workspace claim)
  • the x-nil-workspace header cannot override it (mismatch → refused)
  • a BYO x-nil-adapter-url is rejected (identity routes to the tenant's registered active adapter)
  • a missing claim is default-deny

Header-trust remains only for self-hosted/dev (multi_tenant without saas). build_remote_app/TenantToolsProvider thread saas + claim_resolver (env NIL_MCP_SAAS); SaaS fails closed without a resolver — it can never silently fall back to header-trust.
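
The rules above can be read as a small decision function. The following is a minimal sketch, not the code in this PR: the `Tenant` and `TenantRefused` types, the keyword-only signature, and the lowercase header dict are all assumptions made for illustration. Only the header names, the `saas` flag, and the idea of an injected `claim_resolver` come from the description.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional

# Hypothetical shape: the resolver receives request headers and returns the
# verified workspace claim, or None when the token is missing or invalid.
ClaimResolver = Callable[[Mapping[str, str]], Optional[str]]


class TenantRefused(Exception):
    """Hypothetical error for a request that cannot be bound to a tenant."""


@dataclass(frozen=True)
class Tenant:
    workspace: str
    adapter_url: Optional[str]  # None: route to the registered active adapter


def resolve_tenant(
    headers: Mapping[str, str],
    *,
    saas: bool = False,
    claim_resolver: Optional[ClaimResolver] = None,
) -> Tenant:
    if not saas:
        # Self-hosted/dev only: the workspace header is trusted as-is.
        workspace = headers.get("x-nil-workspace")
        if not workspace:
            raise TenantRefused("missing workspace header")
        return Tenant(workspace, headers.get("x-nil-adapter-url"))

    # SaaS fails closed: no resolver means no fallback to header-trust.
    if claim_resolver is None:
        raise TenantRefused("SaaS mode requires a claim_resolver")

    claimed = claim_resolver(headers)
    if not claimed:
        raise TenantRefused("no verified workspace claim")  # default-deny

    header_ws = headers.get("x-nil-workspace")
    if header_ws is not None and header_ws != claimed:
        raise TenantRefused("workspace header does not match authenticated tenant")

    if headers.get("x-nil-adapter-url"):
        raise TenantRefused("BYO adapter URL is not allowed in SaaS mode")

    return Tenant(claimed, None)
```

In this sketch a header that agrees with the claim is tolerated; whether the real implementation accepts or ignores a matching header is not stated in the description.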

Isolation conformance (the point)

  • tenant A and B route to different adapters
  • a header naming B under an A token → refused
  • BYO adapter-url in SaaS → refused
  • missing claim → default-denied
  • unknown workspace → no adapter

25 mcp/tenant tests green; 105 green across mcp/tenant/intent/router.

Scope / honesty

This is the identity spine only (as agreed, it is built first). The production claim_resolver (Keycloak JWKS JWT verifier) is deployment wiring and is intentionally not hand-rolled here. Remaining for full SaaS: tenant-scope the newer surfaces (intent router providers, export, automation, Hermes isolation), per-tenant quotas/rate-limits, and the encrypted per-tenant secret vault (decided: control-plane vault). Not merged = no deploy.

AI Bot added 5 commits June 27, 2026 11:52
…er override, default-deny

Phase 1 of multi-tenant SaaS (PLAN-saas-multitenant): close the isolation hole where the tenant came
from a FREE x-nil-workspace header (anyone could read any workspace).

resolve_tenant gains a saas mode: the tenant is the AUTHENTICATED identity via an injected
claim_resolver (verified JWT workspace claim); the workspace header may NOT override it; a BYO
adapter-url is rejected (identity routes to the tenant's registered active adapter); a missing claim
is default-deny. Header-trust remains only for self-hosted/dev (multi_tenant without saas).

build_remote_app/TenantToolsProvider thread saas + claim_resolver (env NIL_MCP_SAAS); SaaS FAILS
CLOSED without a claim_resolver — it can never silently fall back to header-trust.

Isolation conformance tests: tenant A and B route to different adapters; a header naming B under an
A token is refused; BYO adapter-url refused; missing claim default-denied; unknown workspace has no
adapter. 25 mcp/tenant tests green.

Remaining (follow-up): wire the production claim_resolver = keycloak JWKS JWT verifier; then
tenant-scope the newer surfaces (intent router providers, export, automation, Hermes) + per-tenant
quotas (Phases 2-3), and the encrypted per-tenant secret vault.
…quotas

Builds on the identity spine (#66). Three self-contained, TDD'd kernel keystones for tenant management:

- SecretVault (nilscript/secrets/vault.py): per-tenant secrets (adapter creds + LLM key) encrypted at
  rest (Fernet); read BY TENANT only; storage-agnostic (injectable store); wrong key can't decrypt;
  tenants isolated. This is "save your secrets once" done securely.
- JWT claim verifier (nilscript/mcp/auth.py): the production claim_resolver for SaaS mode — verifies
  the bearer JWT (sig/exp/iss/aud) and reads the workspace claim; forged/expired/missing → None →
  default-deny. Completes the spine's prod wiring (keycloak JWKS layers on top).
- Per-tenant quotas + rate limit (nilscript/governance_quota.py): token-bucket + daily volume caps per
  (tenant, kind); a noisy tenant is throttled without starving others (the 429 fairness lesson). Pure,
  injected clock, resume-safe.
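
To illustrate the quota keystone, here is a minimal sketch of a token bucket plus a daily cap keyed by (tenant, kind), with an injected clock. The class name `TenantQuota`, its constructor parameters, and the `allow` method are hypothetical and are not taken from nilscript/governance_quota.py; resume-safety (persisting bucket state) is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

SECONDS_PER_DAY = 86400


@dataclass
class _Bucket:
    tokens: float     # tokens currently available
    updated: float    # clock reading at the last refill
    day: int          # day index the daily counter belongs to
    used_today: int   # admitted requests on that day


class TenantQuota:
    """Hypothetical per-(tenant, kind) rate limit and daily volume cap."""

    def __init__(self, rate_per_sec: float, burst: int, daily_cap: int,
                 clock: Callable[[], float]) -> None:
        self._rate = rate_per_sec
        self._burst = float(burst)
        self._daily_cap = daily_cap
        self._clock = clock  # injected, so tests control time
        self._buckets: Dict[Tuple[str, str], _Bucket] = {}

    def allow(self, tenant: str, kind: str) -> bool:
        now = self._clock()
        day = int(now // SECONDS_PER_DAY)
        key = (tenant, kind)
        bucket = self._buckets.get(key)
        if bucket is None:
            bucket = _Bucket(self._burst, now, day, 0)
            self._buckets[key] = bucket

        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - bucket.updated
        bucket.tokens = min(self._burst, bucket.tokens + elapsed * self._rate)
        bucket.updated = now

        # Reset the daily counter when the day rolls over.
        if day != bucket.day:
            bucket.day, bucket.used_today = day, 0

        if bucket.used_today >= self._daily_cap or bucket.tokens < 1.0:
            return False  # this tenant is throttled; other keys are unaffected
        bucket.tokens -= 1.0
        bucket.used_today += 1
        return True
```

Because each (tenant, kind) pair owns its bucket, a tenant that exhausts its tokens does not consume capacity belonging to another tenant.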

32 tests green (vault roundtrip/encryption/isolation/wrong-key; JWT verify/forge/expiry/no-claim;
rate-limit fairness; quota caps).
…ault

Onboarding for SaaS: a company is stood up in a single privileged call.

- store: tenant_secrets table + put/get/delete_secrets using the SecretVault (encrypted at rest,
  Fernet, NIL_VAULT_KEY); vault disabled (fail-closed) when no key. Secrets keyed by workspace.
- app: POST /tenants/provision (workspace + secrets + adapter → store secrets encrypted, register +
  activate adapter); GET /tenants/{ws}/secret/{name} (registry-token-gated server-to-server fetch of a
  tenant's decrypted secret for the platform — never the browser, never logged).

Tests: one-call provision activates the adapter + stores secrets; secret read is token-gated; secrets
are ciphertext at rest (no plaintext key on disk); tenants isolated; auth + workspace required. 164
cp/registry/tenant/saas tests green.
…oped durable layer

Closes the remaining SaaS items (kernel side):

- JWKS claim resolver (mcp/auth.py): production keycloak path — PyJWKClient fetches + caches signing
  keys, selects by kid (rotation-safe); fail-closed on any verify error. jwt_claim_resolver_from_env
  precedence: NIL_JWT_JWKS_URL > NIL_JWT_PUBLIC_KEY > NIL_JWT_HS_SECRET.
- Surface tenant-scoping: store.recent(workspace=) and store.pending(workspace=) (pending joins to its
  events' workspace, since approvals carries none); /api/events and /api/pending take ?workspace=.
  Conformance: tenant A's events/pending never include B's; operator view (no ws) sees all.
- Tenant-scoped durable layer (durable.py, Temporal-ready): tenant-prefixed deterministic workflow ids
  (idempotent + no cross-tenant collision), per-tenant Temporal namespace, TenantDurablePolicy
  (per-tenant rate + concurrency admission — the 429 fairness, durable edition). Worker integration
  is the separate Temporal build; this is the isolation layer it plugs into.
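
A minimal sketch of what tenant-prefixed deterministic workflow ids can look like. The function names, the `nil-` namespace prefix, and the id layout are assumptions for illustration; durable.py may use a different scheme.

```python
import hashlib


def tenant_namespace(workspace: str) -> str:
    # Hypothetical naming: one Temporal namespace per tenant.
    return f"nil-{workspace}"


def workflow_id(workspace: str, kind: str, idempotency_key: str) -> str:
    """Derive a stable id so a repeated request maps to the same workflow.

    The workspace is part of both the readable prefix and the hashed
    payload, so two tenants never share an id even with identical keys.
    """
    payload = "\x00".join((workspace, kind, idempotency_key)).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:16]
    return f"{workspace}:{kind}:{digest}"
```

Submitting the same (workspace, kind, key) twice yields the same id, which is what lets a workflow engine reject or deduplicate the second start.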

42 SaaS tests green (JWKS verify/forge, durable id/namespace isolation + per-tenant throttle, events/
pending workspace scoping); 161 across cp/registry/tenant/mcp.
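
The environment precedence stated for jwt_claim_resolver_from_env (NIL_JWT_JWKS_URL > NIL_JWT_PUBLIC_KEY > NIL_JWT_HS_SECRET) can be sketched as a simple ordered lookup. The helper name `select_jwt_key_source` and its return shape are hypothetical; only the three variable names and their order come from the commit message.

```python
import os
from typing import Mapping, Optional, Tuple

# Highest precedence first, as stated in the commit message.
_PRECEDENCE = ("NIL_JWT_JWKS_URL", "NIL_JWT_PUBLIC_KEY", "NIL_JWT_HS_SECRET")


def select_jwt_key_source(
    env: Mapping[str, str] = os.environ,
) -> Optional[Tuple[str, str]]:
    """Return (variable name, value) for the first configured key source.

    Returns None when nothing is configured; a caller in SaaS mode would
    then have no resolver and should fail closed.
    """
    for name in _PRECEDENCE:
        value = env.get(name, "").strip()
        if value:
            return name, value
    return None
```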
…layer (Phase 6)

durable_temporal.py (optional — temporalio imported lazily): heavy/bulk governed writes run as DURABLE
workflows, tenant-isolated and crash-safe:
- per-tenant Temporal namespace + deterministic tenant-scoped workflow id (idempotent, no cross-tenant
  collision);
- the NIL gate runs in an activity with a RetryPolicy → a throttled (429)/transient backend is retried
  durably, not dropped (the 429 fairness lesson, durable edition);
- register_executor injects the NIL propose→commit (real SDK in prod, a fake in tests);
- run_worker starts a worker on the tenant's namespace + task queue.

Verified END-TO-END with temporalio's in-process time-skipping server (no external infra): a backend
throttled twice is retried and commits on attempt 3; workflow id is tenant-scoped + idempotent.
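
The retried-then-committed behavior can be illustrated without Temporal. The sketch below is a plain-Python stand-in for an activity retry loop with exponential backoff; it does not use temporalio, and `Throttled`, `run_with_retry`, and the parameter names are hypothetical.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


class Throttled(Exception):
    """Hypothetical marker for a 429 or other transient backend failure."""


def run_with_retry(
    activity: Callable[[], T],
    *,
    max_attempts: int = 5,
    initial_interval: float = 1.0,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = lambda seconds: None,
) -> Tuple[int, T]:
    """Call `activity` until it succeeds or attempts run out.

    Returns (attempt number that succeeded, result). The default `sleep`
    is a no-op so the sketch runs instantly; pass time.sleep for real waits.
    """
    delay = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return attempt, activity()
        except Throttled:
            if attempt == max_attempts:
                raise  # give up: surface the failure instead of dropping it
            sleep(delay)
            delay *= backoff
    raise RuntimeError("unreachable: max_attempts must be at least 1")


def make_backend(failures: int) -> Callable[[], str]:
    """Build a fake backend that throttles `failures` times, then commits."""
    remaining = [failures]

    def call() -> str:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise Throttled("429")
        return "committed"

    return call
```

With `make_backend(2)`, the loop mirrors the scenario in the commit message: two throttled calls, then a commit on the third attempt.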

pyproject: [saas] (pyjwt[crypto] + cryptography) and [temporal] (temporalio) extras, wired into [dev].
148 durable/saas/tenant/cp/mcp tests green.