feat: format off a stream of bytes instead of parsing (#30) #56
Open
todor-a wants to merge 2 commits into
Rewrites the formatter to work directly off a token stream with no AST and no dprint-core IR/print-engine. A single recursive pass tokenizes, validates the grammar, and emits the formatted bytes.

Why
- Parsing was only ~7% of the old pipeline; the rest was IR generation plus the constraint-solving print engine. Replacing the whole pipeline (not just the parser) is where the win is.
- Working off bytes lets strings carry invalid UTF-8 unchanged (closes dprint#5).

Behavior
- format_text now calls format_streaming directly; no parser in the format path. The default build no longer depends on jsonc-parser, text_lines, or dprint-core-macros (now optional, tracing-only).
- Streaming validates the grammar itself and reports syntax errors with the same Line/column diagnostics as before (all existing error tests pass unchanged).
- Passes the full spec suite (test_specs runs streaming against the committed expectations, including the format-twice idempotency check).

Perf: ~7x faster (2.4 MB sample: ~230 ms -> ~32 ms).

The AST/IR generator and trace_file are kept behind the `tracing` feature (and a pre-existing stale generate() call there is fixed so it compiles).
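For reviewers who want to sanity-check the "replace the whole pipeline" argument, here is a small back-of-the-envelope sketch. It only uses the figures quoted above (~7% parse share, ~230 ms -> ~32 ms); the helper name `max_speedup_from_removing` is made up for illustration and is not part of this change.

```rust
/// Upper bound on the overall speedup when a stage that takes `fraction`
/// of the total runtime is removed entirely (Amdahl's law in the limit).
fn max_speedup_from_removing(fraction: f64) -> f64 {
    1.0 / (1.0 - fraction)
}

fn main() {
    // Parsing was measured at roughly 7% of the old pipeline.
    let parser_only = max_speedup_from_removing(0.07);
    // Quoted end-to-end timings for the 2.4 MB sample.
    let measured = 230.0 / 32.0;

    println!("parser removed only: at most {:.3}x", parser_only); // 1.075x
    println!("whole pipeline replaced: about {:.1}x", measured); // 7.2x
}
```

Even a zero-cost parser would cap the gain at about 1.075x, which is why the measured ~7x has to come from dropping IR generation and the print engine as well.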
- Split the 1200-line module into streaming/{mod,printer}.rs: mod.rs holds the
lexer, the grammar validator, and the entry point; printer.rs holds the emit
engine. Shared token types are pub(crate).
- Extract object_renders_multiline() (was duplicated between is_breaker and the
array arm of structurally_multiline) and emit_own_line_comment() (was repeated
at four comment-placement sites).
- Drop dead code: identity wrapper src_start, unused params, and a hand-rolled
trim_ascii_end now that std's [u8]::trim_ascii_end is stable.
- Render the flat form once in the object path instead of twice.
- Fix a stale module doc comment (widths are unicode display width, not chars).
No behavior change: full spec suite + error-diagnostic tests still pass.
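As a rough illustration of the `trim_ascii_end` cleanup, the sketch below compares a hand-rolled helper with the std method (stable since Rust 1.80). The helper `trim_ascii_end_manual` is a hypothetical stand-in for the removed code, not a copy of it.

```rust
/// Hypothetical hand-rolled helper of the kind the refactor drops:
/// strips trailing ASCII whitespace from a byte slice.
fn trim_ascii_end_manual(mut bytes: &[u8]) -> &[u8] {
    while let [rest @ .., last] = bytes {
        if last.is_ascii_whitespace() {
            bytes = rest;
        } else {
            break;
        }
    }
    bytes
}

fn main() {
    let line: &[u8] = b"\"key\": 1 \t\r\n";
    // std's `<[u8]>::trim_ascii_end` gives the same result with no helper.
    assert_eq!(line.trim_ascii_end(), trim_ascii_end_manual(line));
    assert_eq!(line.trim_ascii_end(), &b"\"key\": 1"[..]);
}
```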
Closes #30. Also closes #5 and is a step toward #19.
What
Rewrites the formatter to work directly off a stream of bytes — a single-pass tokenize + recursive emit, with no AST and no dprint-core IR/print-engine.
`format_text` now calls a new `format_streaming(&[u8], config, is_jsonc)` directly.
Why (measured first)
I benchmarked the old pipeline before touching anything: parsing is only ~7% of `format_text`'s time. The other ~93% is IR generation + the constraint-solving print engine. So "skip the parser" alone would save at most ~7%; the real win is replacing the whole pipeline with a direct tokenize-and-emit formatter.
Results
- `test_specs` now runs the streaming formatter against the committed `[expect]` files, including the `format_twice` idempotency check
- Syntax errors are reported with the same `Line/column` diagnostics as before; all existing error tests pass unchanged
- The default build no longer depends on `jsonc-parser`, `text_lines`, or `dprint-core-macros`
How it works
- Lexer over `&[u8]`: punctuation, strings (incl. single-quoted for JSONC), words (numbers / `true` / `false` / `null` / bare keys), line + block comments, tracking newlines-before each token.
- Grammar validation reports errors as `StreamError { start, end, message }`, which `format_text` renders via dprint-core's `format_diagnostic`. Word tokens are checked by their first character, which is enough to reject genuine garbage (`&`, a zero-width space) at the right column while staying lenient about the rest.
- Printer: `dprint-ignore` emits the node verbatim; widths use Unicode display width (UAX #11) with an ASCII fast-path. Arrays have three layouts: flat, full multi-line, and the "inline multi-line object" form (`[…, {…}]`).
- The AST/IR generator and `trace_file` are kept behind the `tracing` feature only (a pre-existing stale `generate()` call there is fixed so the feature compiles again).
Invalid UTF-8 (#5)
`format_streaming` operates on `&[u8]`, so bytes that aren't valid UTF-8 inside a string pass through untouched, something a UTF-8 parser can't do. Covered by a unit test.
Heads-up on behavior
Two notes for review:
- Validation is more lenient than `jsonc-parser` about odd word tokens (e.g. it won't reject every malformed number). It still errors on structural problems. This nudges toward #19 ("format invalid code ?") but is not full "format-anything" leniency.
- `lineWidth`: a small set of inputs can lay out differently than the old formatter. It's layout-only (never changes values) and doesn't affect any spec. Happy to add the remaining width gate if you'd prefer exact parity there.
Notes
`src/streaming.rs`, `examples/bench.rs` (perf harness). Added dep `unicode-width`.
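To make the "How it works" and "Invalid UTF-8 (#5)" sections easier to review, here is a heavily simplified sketch of a byte-level lexer that tracks newlines before each token and hands string bytes through unchanged. It is not the code in this PR: `Kind`, `Token`, `lex`, and `is_delim` are hypothetical names, errors are plain strings rather than `StreamError`, and there is no validation or printing.

```rust
#[derive(Debug, PartialEq)]
enum Kind { Punct, Str, Word, LineComment, BlockComment }

#[derive(Debug)]
struct Token<'a> {
    kind: Kind,
    bytes: &'a [u8],        // raw source bytes, never decoded as UTF-8
    newlines_before: usize, // newlines seen since the previous token
}

fn is_delim(b: u8) -> bool {
    b.is_ascii_whitespace() || b"{}[]:,\"'/".contains(&b)
}

fn lex(src: &[u8]) -> Result<Vec<Token<'_>>, String> {
    let (mut tokens, mut i, mut newlines) = (Vec::new(), 0, 0);
    while i < src.len() {
        let start = i;
        let b = src[i];
        let kind = match b {
            b'\n' => { newlines += 1; i += 1; continue; }
            b' ' | b'\t' | b'\r' => { i += 1; continue; }
            b'{' | b'}' | b'[' | b']' | b':' | b',' => { i += 1; Kind::Punct }
            b'"' | b'\'' => {
                i += 1;
                loop {
                    match src.get(i) {
                        None => return Err(format!("unterminated string at byte {start}")),
                        Some(b'\\') => i += 2, // skip the escaped byte
                        Some(&c) if c == b => { i += 1; break; }
                        Some(_) => i += 1, // any other byte, valid UTF-8 or not
                    }
                }
                Kind::Str
            }
            b'/' if src.get(i + 1) == Some(&b'/') => {
                while i < src.len() && src[i] != b'\n' { i += 1; }
                Kind::LineComment
            }
            b'/' if src.get(i + 1) == Some(&b'*') => {
                i += 2;
                loop {
                    if i + 1 >= src.len() {
                        return Err(format!("unterminated comment at byte {start}"));
                    }
                    if src[i] == b'*' && src[i + 1] == b'/' { i += 2; break; }
                    i += 1;
                }
                Kind::BlockComment
            }
            _ => {
                i += 1; // always consume at least one byte
                while i < src.len() && !is_delim(src[i]) { i += 1; }
                Kind::Word
            }
        };
        tokens.push(Token { kind, bytes: &src[start..i], newlines_before: newlines });
        newlines = 0;
    }
    Ok(tokens)
}

fn main() {
    // 0xFF can never appear in valid UTF-8; here it sits inside a string.
    let src: &[u8] = b"{\n  // note\n  \"k\": \"a\xFFb\"\n}";
    let tokens = lex(src).unwrap();
    let value = &tokens[4];
    assert_eq!(value.kind, Kind::Str);
    assert_eq!(value.bytes, &b"\"a\xFFb\""[..]);
    assert!(std::str::from_utf8(value.bytes).is_err());
    assert_eq!(tokens[1].newlines_before, 1); // the `// note` comment
}
```

Because tokens borrow `&[u8]` slices of the input, an emitter can copy string tokens to the output byte-for-byte, which is the property the invalid UTF-8 unit test relies on.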