feat: format off a stream of bytes instead of parsing (#30) by todor-a · Pull Request #56 · dprint/dprint-plugin-json

todor-a · 2026-06-30T13:53:56Z

Closes #30. Also closes #5 and is a step toward #19.

What

Rewrites the formatter to work directly off a stream of bytes — a single-pass tokenize + recursive emit, with no AST and no dprint-core IR/print-engine. format_text now calls a new format_streaming(&[u8], config, is_jsonc) directly.

Why (measured first)

I benchmarked the old pipeline before touching anything: parsing is only ~7% of format_text's time, and the other ~93% is IR generation plus the constraint-solving print engine. So "skip the parser" alone would save at most ~7%. The real win is replacing the whole pipeline with a direct tokenize-and-emit formatter.
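To make the arithmetic behind that claim concrete, here is a small sketch using Amdahl's law. The function names are hypothetical and not part of this PR; only the ~7% share and the ~230 ms / ~32 ms timings come from the measurements above.

```rust
/// Overall speedup when a fraction `p` of the runtime is made `s` times
/// faster (Amdahl's law). Illustrative helper, not code from this PR.
fn amdahl_speedup(p: f64, s: f64) -> f64 {
    1.0 / ((1.0 - p) + p / s)
}

/// Best case: the fraction `p` of the runtime is removed entirely.
fn speedup_if_removed(p: f64) -> f64 {
    1.0 / (1.0 - p)
}

/// Ratio of two measured wall-clock times.
fn observed_speedup(before_ms: f64, after_ms: f64) -> f64 {
    before_ms / after_ms
}
```

With p = 0.07, `speedup_if_removed` gives roughly 1.075x, so dropping the parser alone could not get near the ~7x that `observed_speedup(230.0, 32.0)` reports for the full rewrite.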

Results


| Aspect | Result |
| --- | --- |
| Speed (2.4 MB sample) | ~230 ms → ~32 ms (~7x) |
| Spec suite | passes 48/48; `test_specs` now runs the streaming formatter against the committed `[expect]` files, including the `format_twice` idempotency check |
| Syntax errors | reported with the same `Line/column` diagnostics as before; all existing error tests pass unchanged |
| Default dependencies | no longer pulls `jsonc-parser`, `text_lines`, or `dprint-core-macros` |

How it works

Tokenizer over &[u8]: punctuation, strings (incl. single-quoted for JSONC), words (numbers / true/false/null / bare keys), line + block comments, tracking newlines-before each token.
Validation: a recursive-descent pass over the token stream checks the grammar and reports StreamError { start, end, message }, which format_text renders via dprint-core's format_diagnostic. Word tokens are checked by their first character — enough to reject genuine garbage (&, a zero-width space) at the right column while staying lenient about the rest.
Layout: comments are placed positionally from each token's newline count (trailing vs own-line, dangling between name/colon/value/comma, blank-line preservation, slash-spacing); dprint-ignore emits the node verbatim; widths use Unicode display width (UAX#11) with an ASCII fast-path. Arrays have three layouts — flat, full multi-line, and the "inline multi-line object" form ([…, { … }]).
The AST/IR generator and trace_file are kept behind the tracing feature only (a pre-existing stale generate() call there is fixed so the feature compiles again).
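The tokenizer step described above can be sketched roughly as follows. This is a simplified illustration, not the code in src/streaming.rs: the `Kind`, `Token`, `tokenize`, and `is_delimiter` names are assumptions, error handling is omitted, and newlines inside block comments are not counted.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Kind {
    Punct(u8),    // one of { } [ ] : ,
    Str,          // double- or single-quoted string, quotes included
    Word,         // number, true/false/null, or a bare key
    LineComment,  // `// ...` up to the end of the line
    BlockComment, // `/* ... */`
}

#[derive(Debug, Clone, PartialEq)]
struct Token {
    kind: Kind,
    start: usize,
    end: usize,             // exclusive byte offset
    newlines_before: usize, // newlines seen since the previous token
}

fn is_delimiter(b: u8) -> bool {
    matches!(
        b,
        b' ' | b'\t' | b'\r' | b'\n' | b'{' | b'}' | b'[' | b']' | b':' | b',' | b'"' | b'\'' | b'/'
    )
}

fn tokenize(src: &[u8]) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut i = 0;
    let mut newlines = 0;
    while i < src.len() {
        let b = src[i];
        if b == b'\n' {
            newlines += 1;
            i += 1;
            continue;
        }
        if b == b' ' || b == b'\t' || b == b'\r' {
            i += 1;
            continue;
        }
        let start = i;
        let kind = match b {
            b'{' | b'}' | b'[' | b']' | b':' | b',' => {
                i += 1;
                Kind::Punct(b)
            }
            b'"' | b'\'' => {
                i += 1;
                // Scan to the matching quote, skipping escaped bytes.
                while i < src.len() && src[i] != b {
                    i += if src[i] == b'\\' { 2 } else { 1 };
                }
                i = (i + 1).min(src.len()); // consume the closing quote if present
                Kind::Str
            }
            b'/' if src.get(i + 1) == Some(&b'/') => {
                while i < src.len() && src[i] != b'\n' {
                    i += 1;
                }
                Kind::LineComment
            }
            b'/' if src.get(i + 1) == Some(&b'*') => {
                i += 2;
                while i < src.len() && !src[i..].starts_with(b"*/") {
                    i += 1;
                }
                i = (i + 2).min(src.len());
                Kind::BlockComment
            }
            _ => {
                // Anything else is a word; it runs until the next delimiter.
                i += 1;
                while i < src.len() && !is_delimiter(src[i]) {
                    i += 1;
                }
                Kind::Word
            }
        };
        tokens.push(Token { kind, start, end: i, newlines_before: newlines });
        newlines = 0;
    }
    tokens
}
```

Recording `newlines_before` on every token is what lets a later layout pass tell a trailing comment (0 newlines) from an own-line comment (1) or a preserved blank line (2 or more) without re-reading the source.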

Invalid UTF-8 (#5)

format_streaming operates on &[u8], so bytes that aren't valid UTF-8 inside a string pass through untouched — something a UTF-8 parser can't do. Covered by a unit test.
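As a minimal illustration of why working on bytes makes this possible, the sketch below strips whitespace outside of strings and copies string tokens byte-for-byte. The `compact` function is hypothetical and far simpler than format_streaming; it only demonstrates that no step needs to decode the input as UTF-8.

```rust
/// Copies `src` to the output, dropping whitespace outside of strings and
/// copying string tokens verbatim. Nothing here decodes UTF-8, so invalid
/// bytes inside a string survive unchanged. Illustrative only.
fn compact(src: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(src.len());
    let mut i = 0;
    while i < src.len() {
        match src[i] {
            b'"' => {
                let start = i;
                i += 1;
                // Scan to the closing quote, skipping escaped bytes.
                while i < src.len() && src[i] != b'"' {
                    i += if src[i] == b'\\' { 2 } else { 1 };
                }
                i = (i + 1).min(src.len());
                out.extend_from_slice(&src[start..i]);
            }
            b' ' | b'\t' | b'\r' | b'\n' => i += 1,
            other => {
                out.push(other);
                i += 1;
            }
        }
    }
    out
}
```

An input such as `b"{ \"k\" : \"a\xFFb\" }"` is rejected by `std::str::from_utf8`, yet `compact` still returns it with the `0xFF` byte intact inside the string.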

Heads-up on behavior

Two notes for review:

Leniency: the streaming validator is more lenient than jsonc-parser about odd word tokens (e.g. it won't reject every malformed number), but it still errors on structural problems. This nudges toward #19 ("format invalid code?") but is not full "format-anything" leniency.
Layout on a rare case: dprint-core's print engine makes the inline-vs-full-break decision for arrays-of-objects via a width measurement that runs into the broken object's first line. I reproduce this for the common cases, but at narrow lineWidth a small set of inputs can lay out differently than the old formatter. It's layout-only (never changes values) and doesn't affect any spec. Happy to add the remaining width gate if you'd prefer exact parity there.
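For reviewers who want a mental model of the kind of width gate discussed in the second note, here is a deliberately simplified sketch. It is an assumption about the general shape of such a decision, not the logic in this PR or in dprint-core: the `ArrayLayout` enum, the `choose_layout` function, and its parameters are all hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum ArrayLayout {
    Flat,                  // [1, 2, 3]
    InlineMultiLineObject, // [..., { <newline> ... <newline> }]
    FullBreak,             // one element per line
}

/// Hypothetical width gate.
/// - `flat_width`: width of the array rendered on one line, or `None` if
///   some element has to break.
/// - `first_line_width`: width of the array's first line when it ends with
///   a broken object (up to and including that object's `{`), or `None` if
///   the inline form does not apply.
fn choose_layout(
    indent: usize,
    flat_width: Option<usize>,
    first_line_width: Option<usize>,
    line_width: usize,
) -> ArrayLayout {
    if let Some(w) = flat_width {
        if indent + w <= line_width {
            return ArrayLayout::Flat;
        }
    }
    if let Some(w) = first_line_width {
        if indent + w <= line_width {
            return ArrayLayout::InlineMultiLineObject;
        }
    }
    ArrayLayout::FullBreak
}
```

In this model the two formatters can only disagree when the measured first-line width sits close to the limit, which matches the note that differences show up at narrow lineWidth.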

Notes

New: src/streaming.rs, examples/bench.rs (perf harness). Added dep unicode-width.
No fallback to the old formatter — streaming is the formatter.

Rewrites the formatter to work directly off a token stream with no AST and no dprint-core IR/print-engine. A single recursive pass tokenizes, validates the grammar, and emits the formatted bytes.

Why
- Parsing was only ~7% of the old pipeline; the rest was IR generation plus the constraint-solving print engine. Replacing the whole pipeline (not just the parser) is where the win is.
- Working off bytes lets strings carry invalid UTF-8 unchanged (closes dprint#5).

Behavior
- format_text now calls format_streaming directly; no parser in the format path. The default build no longer depends on jsonc-parser, text_lines, or dprint-core-macros (now optional, tracing-only).
- Streaming validates the grammar itself and reports syntax errors with the same Line/column diagnostics as before (all existing error tests pass unchanged).
- Passes the full spec suite (test_specs runs streaming against the committed expectations, including the format-twice idempotency check).

Perf: ~7x faster (2.4 MB sample: ~230 ms -> ~32 ms).

The AST/IR generator and trace_file are kept behind the `tracing` feature (and a pre-existing stale generate() call there is fixed so it compiles).

- Split the 1200-line module into streaming/{mod,printer}.rs: mod.rs holds the lexer, the grammar validator, and the entry point; printer.rs holds the emit engine. Shared token types are pub(crate).
- Extract object_renders_multiline() (was duplicated between is_breaker and the array arm of structurally_multiline) and emit_own_line_comment() (was repeated at four comment-placement sites).
- Drop dead code: identity wrapper src_start, unused params, and a hand-rolled trim_ascii_end now that std's [u8]::trim_ascii_end is stable.
- Render the flat form once in the object path instead of twice.
- Fix a stale module doc comment (widths are unicode display width, not chars).

No behavior change: full spec suite + error-diagnostic tests still pass.

todor-a added 2 commits June 30, 2026 16:52

todor-a wants to merge 2 commits into dprint:main from todor-a:streaming-formatter-issue-30
