
feat: format off a stream of bytes instead of parsing (#30) #56

Open
todor-a wants to merge 2 commits into dprint:main from todor-a:streaming-formatter-issue-30

todor-a commented Jun 30, 2026

Contributor

Closes #30. Also closes #5 and is a step toward #19.

What

Rewrites the formatter to work directly off a stream of bytes — a single-pass tokenize + recursive emit, with no AST and no dprint-core IR/print-engine. format_text now calls a new format_streaming(&[u8], config, is_jsonc) directly.

Why (measured first)

I benchmarked the old pipeline before touching anything: parsing is only ~7% of format_text's time. The other ~93% is IR generation + the constraint-solving print engine. So "skip the parser" alone would save ~7% — the real win is replacing the whole pipeline with a direct tokenize-and-emit formatter.
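As a rough check on that reasoning, Amdahl's law bounds what removing a single stage can buy. The helper below is a hypothetical illustration (`speedup_if_removed` is not part of this PR) and only uses the ~7% share quoted above.

```rust
/// Upper bound on the overall speedup when a stage that accounts for
/// `fraction` of total runtime is removed entirely (Amdahl's law with an
/// infinitely fast stage). Hypothetical helper, for illustration only.
fn speedup_if_removed(fraction: f64) -> f64 {
    // The remaining work is (1 - fraction) of the original runtime.
    1.0 / (1.0 - fraction)
}
```

With a 7% parsing share this gives roughly 1.075x, far short of the ~7x measured for replacing the whole pipeline.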

Results

  • Speed (2.4 MB sample): ~230 ms → ~32 ms (~7x).
  • Spec suite: passes 48/48. test_specs now runs the streaming formatter against the committed [expect] files, including the format_twice idempotency check.
  • Syntax errors: reported with the same Line/column diagnostics as before; all existing error tests pass unchanged.
  • Default dependencies: the default build no longer pulls in jsonc-parser, text_lines, or dprint-core-macros.
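For readers unfamiliar with the idempotency check mentioned in the results, the idea is that formatting already-formatted output must change nothing. A minimal sketch, assuming a formatter shaped like a function from bytes to bytes (`check_format_twice` and its signature are hypothetical, not the spec runner's actual code):

```rust
/// Runs `format` on `input`, then again on its own output, and reports
/// whether the second pass left the text unchanged.
/// Hypothetical helper; the real spec runner may be structured differently.
fn check_format_twice<E>(
    format: impl Fn(&[u8]) -> Result<Vec<u8>, E>,
    input: &[u8],
) -> Result<bool, E> {
    let once = format(input)?;
    let twice = format(&once)?;
    // A stable formatter reaches a fixed point after a single pass.
    Ok(once == twice)
}
```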

How it works

  • Tokenizer over &[u8]: punctuation, strings (incl. single-quoted for JSONC), words (numbers / true/false/null / bare keys), line + block comments, tracking newlines-before each token.
  • Validation: a recursive-descent pass over the token stream checks the grammar and reports StreamError { start, end, message }, which format_text renders via dprint-core's format_diagnostic. Word tokens are checked by their first character — enough to reject genuine garbage (&, a zero-width space) at the right column while staying lenient about the rest.
  • Layout: comments are placed positionally from each token's newline count (trailing vs own-line, dangling between name/colon/value/comma, blank-line preservation, slash-spacing); dprint-ignore emits the node verbatim; widths use Unicode display width (UAX#11) with an ASCII fast-path. Arrays have three layouts — flat, full multi-line, and the "inline multi-line object" form ([…, {}]).
  • The AST/IR generator and trace_file are kept behind the tracing feature only (a pre-existing stale generate() call there is fixed so the feature compiles again).
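The tokenizer bullet above can be sketched roughly as follows. This is a simplified illustration, not the code in src/streaming.rs: the `Kind` and `Token` types, their field names, and the exact word-boundary rules are assumptions made for the example.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind { Punct, Str, Word, LineComment, BlockComment }

/// Hypothetical token shape: a byte range plus the number of newlines
/// seen in the whitespace immediately before the token.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Token { kind: Kind, start: usize, end: usize, newlines_before: u32 }

fn tokenize(src: &[u8]) -> Vec<Token> {
    let mut out = Vec::new();
    let mut i = 0;
    let mut newlines = 0u32;
    while i < src.len() {
        let b = src[i];
        // Whitespace produces no token; only its newlines are remembered.
        if b.is_ascii_whitespace() {
            if b == b'\n' { newlines += 1; }
            i += 1;
            continue;
        }
        let start = i;
        let kind = match b {
            b'{' | b'}' | b'[' | b']' | b':' | b',' => { i += 1; Kind::Punct }
            // Double-quoted, or single-quoted for JSONC.
            b'"' | b'\'' => {
                i += 1;
                while i < src.len() && src[i] != b {
                    // A backslash skips the escaped byte as well.
                    i += if src[i] == b'\\' { 2 } else { 1 };
                }
                i = (i + 1).min(src.len()); // consume the closing quote if present
                Kind::Str
            }
            b'/' if src.get(i + 1) == Some(&b'/') => {
                while i < src.len() && src[i] != b'\n' { i += 1; }
                Kind::LineComment
            }
            b'/' if src.get(i + 1) == Some(&b'*') => {
                i += 2;
                while i < src.len() && !src[i..].starts_with(b"*/") { i += 1; }
                i = (i + 2).min(src.len());
                Kind::BlockComment
            }
            // Anything else is a "word": numbers, true/false/null, bare keys.
            _ => {
                i += 1;
                while i < src.len()
                    && !src[i].is_ascii_whitespace()
                    && !b"{}[]:,\"'/".contains(&src[i])
                {
                    i += 1;
                }
                Kind::Word
            }
        };
        out.push(Token { kind, start, end: i, newlines_before: newlines });
        newlines = 0;
    }
    out
}
```

Recording `newlines_before` on each token is what lets a later layout step tell a trailing comment (zero newlines before it) from an own-line comment, and detect blank lines worth preserving, without building a tree.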

Invalid UTF-8 (#5)

format_streaming operates on &[u8], so bytes that aren't valid UTF-8 inside a string pass through untouched — something a UTF-8 parser can't do. Covered by a unit test.
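To illustrate why byte-level scanning tolerates such input, here is a minimal sketch. `scan_string` and `emit_string` are hypothetical names for this example and are not taken from the PR.

```rust
/// Returns the index just past the closing quote of the string literal
/// whose opening quote is at `start`, or `None` if it is unterminated.
/// Hypothetical helper, for illustration only.
fn scan_string(src: &[u8], start: usize) -> Option<usize> {
    let quote = *src.get(start)?;
    let mut i = start + 1;
    while i < src.len() {
        match src[i] {
            b'\\' => i += 2, // skip the escaped byte
            b if b == quote => return Some(i + 1),
            _ => i += 1, // any other byte, valid UTF-8 or not
        }
    }
    None
}

/// Copies the literal to `out` byte-for-byte without decoding it.
fn emit_string(src: &[u8], start: usize, out: &mut Vec<u8>) -> Option<usize> {
    let end = scan_string(src, start)?;
    out.extend_from_slice(&src[start..end]);
    Some(end)
}
```

The scan only compares against ASCII bytes (the quote and the backslash). Bytes of 0x80 and above never match them, so a stray 0xFF is simply copied through, whereas `std::str::from_utf8` on the same input returns an error.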

Heads-up on behavior

Two notes for review:

  1. Leniency: the streaming validator does check the grammar, but it is more lenient than jsonc-parser about odd word tokens (e.g. it won't reject every malformed number). It still errors on structural problems. This nudges toward #19 ("format invalid code ?") but is not full "format-anything" leniency.
  2. Layout on a rare case: dprint-core's print engine makes the inline-vs-full-break decision for arrays-of-objects via a width measurement that runs into the broken object's first line. I reproduce this for the common cases, but at narrow lineWidth a small set of inputs can lay out differently than the old formatter. It's layout-only (never changes values) and doesn't affect any spec. Happy to add the remaining width gate if you'd prefer exact parity there.
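To make the leniency note concrete, a first-character check on word tokens might look like the sketch below. The `StreamError` field names come from the description above; the field types, the accepted character set, and the message text are assumptions made for this example.

```rust
/// Field names follow the PR description; the types are assumed.
#[derive(Debug, PartialEq)]
struct StreamError { start: usize, end: usize, message: String }

/// Lenient validation: inspect only the first byte of a word token.
/// The accepted set here is an assumption, not the PR's actual rule.
fn check_word(src: &[u8], start: usize, end: usize) -> Result<(), StreamError> {
    let first = src[start];
    let ok = first.is_ascii_alphanumeric()
        || matches!(first, b'-' | b'+' | b'.' | b'_' | b'$');
    if ok {
        Ok(())
    } else {
        // Placeholder message; the real diagnostic text may differ.
        Err(StreamError { start, end, message: "Unexpected token".to_string() })
    }
}
```

Under these assumptions a malformed number such as `1.2.3` is accepted because it starts with a digit, while `&` or a zero-width space (whose first UTF-8 byte is 0xE2) is rejected at the token's start offset.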

Notes

  • New: src/streaming.rs, examples/bench.rs (perf harness). Added dep unicode-width.
  • No fallback to the old formatter — streaming is the formatter.

todor-a added 2 commits June 30, 2026 16:52
Rewrites the formatter to work directly off a token stream with no AST and
no dprint-core IR/print-engine. A single recursive pass tokenizes, validates
the grammar, and emits the formatted bytes.

Why
- Parsing was only ~7% of the old pipeline; the rest was IR generation plus
  the constraint-solving print engine. Replacing the whole pipeline (not just
  the parser) is where the win is.
- Working off bytes lets strings carry invalid UTF-8 unchanged (closes dprint#5).

Behavior
- format_text now calls format_streaming directly; no parser in the format
  path. The default build no longer depends on jsonc-parser, text_lines, or
  dprint-core-macros (now optional, tracing-only).
- Streaming validates the grammar itself and reports syntax errors with the
  same Line/column diagnostics as before (all existing error tests pass
  unchanged).
- Passes the full spec suite (test_specs runs streaming against the committed
  expectations, including the format-twice idempotency check).

Perf: ~7x faster (2.4 MB sample: ~230 ms -> ~32 ms).

The AST/IR generator and trace_file are kept behind the `tracing` feature
(and a pre-existing stale generate() call there is fixed so it compiles).

- Split the 1200-line module into streaming/{mod,printer}.rs: mod.rs holds the
  lexer, the grammar validator, and the entry point; printer.rs holds the emit
  engine. Shared token types are pub(crate).
- Extract object_renders_multiline() (was duplicated between is_breaker and the
  array arm of structurally_multiline) and emit_own_line_comment() (was repeated
  at four comment-placement sites).
- Drop dead code: identity wrapper src_start, unused params, and a hand-rolled
  trim_ascii_end now that std's [u8]::trim_ascii_end is stable.
- Render the flat form once in the object path instead of twice.
- Fix a stale module doc comment (widths are unicode display width, not chars).

No behavior change: full spec suite + error-diagnostic tests still pass.

Development

Successfully merging this pull request may close these issues.

  • Rewrite formatter to format a stream of bytes
  • Invalid UTF-8 chars prevent formatting