Skip to content

fix(replication): close on inbound message-handler errors instead of swallowing them (#440)#511

Open
kriszyp wants to merge 3 commits into
mainfrom
kris/decode-loop-error-close-440
Open

fix(replication): close on inbound message-handler errors instead of swallowing them (#440)#511
kriszyp wants to merge 3 commits into
mainfrom
kris/decode-loop-error-close-440

Conversation

@kriszyp

@kriszyp kriszyp commented Jul 1, 2026

Copy link
Copy Markdown
Member

Summary

Lands the first safety item of workstream #440 (epic #430 Theme B/F): an error escaping onWSMessage — the inbound handler in replicateOverWS — was logged and swallowed, silently dropping the rest of the failed frame while later frames kept applying and confirming higher sequence ids: a permanent, undetected gap on the receiver. The catch now calls closeOnInboundMessageError, which latches inbound processing off before closing (frames already queued on the messageProcessing chain must not commit past the hole; the peer can still deliver frames until the close handshake completes) and closes with 1011 as a transient protocol close, so the normal retry path reconnects and resumes from the last durable cursor. A deterministic poison frame becomes a visible, backed-off reconnect loop instead of silent loss (escalation budget = W2 #432).

Follow-up commit from cross-model review: the inner per-record decode catch's log line dereferenced tableDecoder.decoder unguarded, so an unknown tableId threw a secondary TypeError from the logger itself — masking the original error and escaping to the outer catch by accident. Now guarded, restoring the intended skip-and-log semantics with truthful diagnostics.

Where to look

  • Behavior change: anything that throws/rejects out of the inbound handler now cycles the connection instead of being swallowed. The review traced the awaited paths under the outer try (isSkippableLeadingDuplicate store read, waitForDrain, frame decode) — routine/transient cases either have inner catches or only reject on substantive failure/teardown (where the latch is already set and close is a no-op). If you know of an error class in this path that is routine under normal operation, that's the thing to veto.
  • Deliberately deferred residual (please confirm you agree): a per-record value decode failure is still skip-and-logged by the inner decodeBlobsWithWrites catch, and maxBatchVersion still advances past the skipped record — the same silent-gap class, now explicitly tracked in Replication W11: Code organization, protocol versioning & decode-loop safety #440 with W2 Replication W2: Cursor correctness & divergence detection #432 (it's the #1163/#1453 structure-fork lineage, so closing there has real blast radius and deserves its own change).
  • closeOnInboundMessageError is extracted + unit-tested per the file's pure-decision-helper idiom (latch-before-close ordering pinned by test).

Cross-model review (Codex + Gemini + domain adjudication) found no blockers; both significant findings are addressed or explicitly deferred as above. Not user-facing; no docs change needed.

Generated by an LLM (Claude Fable 5), reviewed per the Harper DLC.

kriszyp and others added 2 commits July 1, 2026 16:35
…swallowing them (#440)

An error escaping onWSMessage was logged and swallowed: the rest of the
failed frame was silently dropped while later frames kept applying and
confirming higher sequence ids — a permanent, undetected gap on the
receiver (epic #430 Theme B / workstream #440 land-first safety item).

Now the catch latches inbound processing off (queued frames on the
messageProcessing chain must not commit past the hole) and closes with
1011 as a transient protocol close, so the normal retry path reconnects
and resumes from the last durable cursor, re-streaming everything past
it. Extracted as closeOnInboundMessageError with unit coverage,
following the file's pure-decision-helper idiom.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rame-vs-record error boundary (#440)

Cross-model review findings on the previous commit: an unknown tableId
made the inner decode-catch's own log line throw (unguarded
tableDecoder.decoder deref), masking the original error and escaping to
the outer close path by accident. Guard the derefs so the intended
skip-and-log semantics and truthful diagnostics are restored. Also
scope the closeOnInboundMessageError docstring: per-record value-decode
failures are still skip-and-logged with the cursor advancing past them
(residual tracked in #440 / #432).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@kriszyp kriszyp requested review from kylebernhardy and ldt1996 July 1, 2026 22:51

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces robust error handling for replication connections over WebSockets by implementing closeOnInboundMessageError. This function ensures that inbound processing is halted and the connection is closed with a transient error code (1011) when an unhandled error occurs, preventing silent data loss. Additionally, it guards against TypeErrors when logging decoding errors by using optional chaining on tableDecoder. The feedback suggests further strengthening this error handling by making the logger dependency optional to prevent secondary crashes, and improving the log output formatting when tableDecoder is undefined to avoid string-concatenating undefined values.

Comment thread replication/replicationConnection.ts
Comment thread replication/replicationConnection.ts
@claude

claude Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Reviewed; no blockers found.

…review)

Gemini findings: guard the logger access fully (the log must never
prevent the close) and make the decode-error log readable when the
table decoder is unknown.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@kriszyp kriszyp marked this pull request as ready for review July 1, 2026 23:09
@kriszyp kriszyp requested a review from a team as a code owner July 1, 2026 23:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant