Skip to content

fix(auth): don't emit expired access tokens from print-access-token#408

Draft
thomhug wants to merge 1 commit into
ninech:mainfrom
thomhug:fix/print-access-token-expiry
Draft

fix(auth): don't emit expired access tokens from print-access-token#408
thomhug wants to merge 1 commit into
ninech:mainfrom
thomhug:fix/print-access-token-expiry

Conversation

@thomhug

@thomhug thomhug commented Jun 1, 2026

Copy link
Copy Markdown
Collaborator

Problem

nctl auth print-access-token could hand back an expired access token (or, on the error path, an empty string) instead of refreshing or failing clearly. A client (nbill) sent an hours-expired token to a resource server with no verifier leeway (billy / zitadel) and got an opaque 401. A fresh token only appeared once kubelogin's own cache TTL flipped.

Two root causes:

  1. Client.Token swallowed every error to "" — an "expired" error became an empty string on stdout, and callers couldn't distinguish expired from broken.
  2. GetTokenString only trusted kubelogin's reported ExpirationTimestamp — it never checked the JWT exp itself, so a token kubelogin believed valid could slip through unverified.

Changes

  • api/client.go — add TokenE(ctx) (string, error); Token now delegates to it and logs the error on stderr instead of silently returning "".
  • auth/print_access_token.go — route through TokenE, so the command exits non-zero with a clear message instead of printing an empty/stale token.
  • api/login.goGetTokenString now decodes the JWT exp claim itself (golang-jwt, unverified — only the expiry is needed). If the token is expired (with a small 10s leeway), it forces a kubelogin refresh via the credential plugin's ForceRefresh flag. If even the forced refresh yields an expired token, it returns an actionable error (access token expired on <RFC3339>, run "nctl auth login") rather than writing a stale token to stdout. Falls back to kubelogin's ExpirationTimestamp for non-JWT (opaque) tokens.
  • auth/oidc.go — adapted to the new GetToken(..., forceRefresh, out) signature (false for the client-go exec-plugin path, which keeps its own expiry handling).

Testing

  • Live e2e against a real Keycloak cache with an expired (−65h) ID token: the new binary refreshed it and emitted a fresh valid JWT (exit 0, cache updated) — happy path and refresh confirmed intact.
  • Deterministic table-driven tests (api/login_test.go) cover the paths the live run can't reach: valid / expired / missing exp, missing ExpirationTimestamp, opaque tokens, and the orchestration (force-refresh recovery, "still expired → clear error, no stale token", error propagation). The refresh orchestration was extracted into validToken(now, fetch) to make this testable without driving the real login flow.
  • go test -race, go vet, staticcheck, golangci-lint (0 issues), modernize all clean.

Known limitation

A pure machine-vs-issuer clock skew (local clock running behind) is not caught: kubelogin and this check both use the local time.Now(), so a token that is expired by the issuer's clock but not the local one is still considered valid. That case needs clock sync (NTP), not this change. The evidence here (kubelogin recovering once its cache TTL flipped) points to a stale-return path rather than skew, which this PR does address.

The client-go API exec path (nctl auth oidc) is intentionally left unhardened — there, client-go performs its own expiry handling.

print-access-token could return a stale access token (or, on error, an
empty string) instead of refreshing or failing clearly. A client sent an
hours-expired token to a resource server without leeway and got an opaque
401; a fresh token only appeared once kubelogin's cache TTL flipped on its
own.

Two root causes are addressed:

- Client.Token swallowed every error to "", so an expired-token error
  became an empty string on stdout and callers couldn't tell "expired"
  from "broken". Add TokenE(ctx) (string, error) and route
  print-access-token through it so it exits non-zero with a clear message.

- GetTokenString only trusted kubelogin's reported ExpirationTimestamp.
  It now decodes the JWT exp claim itself, and if the token is expired
  (with a small leeway) forces a kubelogin refresh via the credential
  plugin's ForceRefresh flag. If even the forced refresh yields an expired
  token, it returns an actionable error
  ("access token expired on <RFC3339>, run \"nctl auth login\"") instead
  of writing a stale token to stdout.

The refresh orchestration is extracted into validToken(now, fetch) so the
force-refresh and no-stale-token paths are covered by table-driven tests
(valid/expired/missing exp, missing ExpirationTimestamp, opaque tokens).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@thomhug thomhug marked this pull request as draft June 5, 2026 23:27
@thomhug

thomhug commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator Author

Marking this draft — it's not yet clear this change is needed.

Investigating an intermittent billy HTTP 401 from nbill, the billy server log showed the rejection reason was an audience/nonce mismatch on a still-valid (non-expired) token, not an expired access token:

auth.error="audience is not valid: Audience must contain client_id \"billy-production\"
            nonce does not match: expected \"\" but was \"vtnfH86…\""

That failure mode is being addressed elsewhere:

  • billy will stop enforcing nonce on its Bearer verifiers (a resource server shouldn't validate the nonce) — fixes the observed flap.
  • nbill gets retry-on-401 + a per-request token source, so a token that expires while nbill is held open (e.g. a long wizard session) is refreshed transparently instead of being sent stale.

The expiry scenario this PR hardens against (print-access-token emitting a JWT-expired token while kubelogin's cache still considers it valid) is plausible given the ~5-minute token lifetime, but we have no captured log of an actual expiry-reason 401 — the "self-heals once kubelogin's cache TTL flips" symptom that motivated it is equally explained by the nonce mechanism above.

Holding as draft until we either capture a real expiry-reason 401 (then this is the right source-level fix), or confirm the client-side retry + billy nonce fix fully resolve it (then this may be unnecessary).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

1 participant