Perl: switch to the more accurate ts-parser-perl grammar (~92% fewer parse failures) #126

Open
rabbiveesh wants to merge 1 commit into cortexkit:main from rabbiveesh:perl-use-ts-parser-perl
rabbiveesh commented Jun 18, 2026

What & why

aft's Perl support depends on the crates.io tree-sitter-perl crate, which is an independent, older Perl grammar by a different author. This PR switches to ts-parser-perl — the crate for tree-sitter-perl/tree-sitter-perl — which parses real-world Perl far more accurately.

Benchmark on 8,342 real-world Perl files (perl5 core + CPAN: DBIx-Class, Mojolicious, SpamAssassin, Bugzilla, …), files parsing with no ERROR/MISSING nodes:

| grammar | clean parse | failures |
|---|---|---|
| ts-parser-perl (this PR) | 95.4% | 382 |
| tree-sitter-perl (current dep) | 40.2% | 4,983 |
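
The "~92% fewer parse failures" figure in the title follows from the two failure counts in the table. A minimal check of that arithmetic (the function name is illustrative and not part of aft):

```rust
/// Percentage reduction in failing files when going from `old` to `new`.
fn failure_reduction_pct(old: u32, new: u32) -> f64 {
    (old as f64 - new as f64) / old as f64 * 100.0
}

fn main() {
    // Failure counts taken from the benchmark table: 4,983 (current dep) vs 382 (this PR).
    let pct = failure_reduction_pct(4_983, 382);
    println!("{pct:.1}% fewer failing files"); // prints "92.3% fewer failing files"
}
```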

(Curated 3,386-module gold corpus: 99.6% vs 73.9%.) For aft this is coverage, not just quality: an ERROR node makes that whole subtree untraversable, so on the ~60% of files the current grammar can't parse cleanly, the symbol/import/call extractors silently walk past everything underneath.
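
To illustrate why an ERROR node costs coverage, here is a toy model of an extractor walk. The `Node` struct, `collect_subs`, and `leaf` are stand-ins invented for this sketch; they are not tree-sitter's `Node` type or aft's actual extractor code, and the explicit early return only models the "untraversable subtree" effect described above.

```rust
/// Toy stand-in for a syntax-tree node (not the tree-sitter `Node` type).
struct Node {
    kind: &'static str,
    name: Option<&'static str>,
    children: Vec<Node>,
}

/// Collect subroutine names without descending into ERROR subtrees.
fn collect_subs(node: &Node, out: &mut Vec<&'static str>) {
    if node.kind == "ERROR" {
        return; // everything underneath is silently skipped
    }
    if node.kind == "subroutine_declaration_statement" {
        if let Some(name) = node.name {
            out.push(name);
        }
    }
    for child in &node.children {
        collect_subs(child, out);
    }
}

/// Build a named node with no children.
fn leaf(kind: &'static str, name: &'static str) -> Node {
    Node { kind, name: Some(name), children: Vec::new() }
}

fn main() {
    let tree = Node {
        kind: "source_file",
        name: None,
        children: vec![
            leaf("subroutine_declaration_statement", "parsed_ok"),
            Node {
                kind: "ERROR",
                name: None,
                children: vec![leaf("subroutine_declaration_statement", "lost")],
            },
        ],
    };
    let mut found = Vec::new();
    collect_subs(&tree, &mut found);
    println!("{found:?}"); // prints ["parsed_ok"]; the sub under ERROR is never seen
}
```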

Not a drop-in — the two grammars use different node names

The dependency itself is a Cargo package-rename, so the tree_sitter_perl::LANGUAGE path is unchanged:

tree-sitter-perl = { package = "ts-parser-perl", version = "1.1" }

But aft matches Perl node kinds directly, and those differ, so the extractors needed updating:

  • parser.rs PERL_QUERY: now matches package_statement / subroutine_declaration_statement / method_declaration_statement. use constant constants are captured and gated on the pragma text in Rust (the tree-sitter binding doesn't evaluate #eq? predicates, and use parent -norequire, … shares the same shape). The variable capture now keeps the sigil, so the symbol is $counter, not counter.
  • perl_package_name reads the package name node (was package_name), so subs inside a package Foo; are still classified as methods.
  • imports/perl.rs: this grammar represents every use/no pragma (including use parent / use constant) as a single use_statement with a use/no keyword token and a module field, and a runtime require as an expression_statement wrapping a require_expression. The parser was restructured accordingly. One subtlety: a use_statement spans its own ;, but an expression_statement does not (the ; is a sibling), so the statement range is extended through the trailing ; and organize/remove leave no stray semicolon.
  • calls.rs: now matches function_call_expression / method_call_expression; the method-call callee resolves via the method field.
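
The trailing-semicolon handling for require can be sketched roughly as follows. This is a simplified illustration on byte offsets rather than on tree-sitter sibling nodes; `extend_through_semicolon` is a hypothetical name, and skipping spaces and tabs before the `;` is an assumption of the sketch, not something the PR states.

```rust
/// Hypothetical helper: given the end offset of a statement node, return an end
/// offset that also covers a `;` following it (optionally after spaces/tabs).
/// If no `;` follows, the original end is returned unchanged.
fn extend_through_semicolon(src: &str, end: usize) -> usize {
    let bytes = src.as_bytes();
    let mut i = end;
    while i < bytes.len() && (bytes[i] == b' ' || bytes[i] == b'\t') {
        i += 1;
    }
    if i < bytes.len() && bytes[i] == b';' {
        i + 1
    } else {
        end
    }
}

fn main() {
    let src = "require Foo::Bar;\nprint 1;\n";
    // Assumed node range: the expression_statement covers `require Foo::Bar` only.
    let node_end = "require Foo::Bar".len();
    let end = extend_through_semicolon(src, node_end);
    // Removing bytes 0..end now takes the `;` with it, so the remainder
    // no longer starts with a stray semicolon.
    println!("{:?}", &src[end..]);
}
```

Without the extension, removing only the node's own range would leave `;` behind on the line where the require used to be.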

Testing

Built and ran the full agent-file-tools test suite locally:

  • All Perl tests pass: imports::perl::* unit tests (including the rewritten grammar-node-kind stability fixture and parse_perl_supported_forms / round-trip), outline_perl_symbols_include_packages_subroutines_constants_and_variables, and the import_golden_corpus Perl scenarios (organize/add/remove) — the golden snapshots match unchanged, confirming behavior parity.
  • No regressions elsewhere. (One pre-existing failure, watcher_filter_tests::gitignore_write_rebuilds_before_filtering_same_batch_paths, fails identically on a clean checkout of main in my environment — it's an inotify/gitignore timing test unrelated to this change.)

(Disclosure: I maintain tree-sitter-perl/tree-sitter-perl. The benchmark is reproducible — compile each grammar to a separate .so and parse the same corpus.)



Summary by cubic

Switch Perl parsing to ts-parser-perl for much higher accuracy (95.4% vs 40.2% clean parses across 8,342 files), recovering symbols/imports/calls previously lost to parse errors. This improves coverage on real-world Perl and reduces untraversable subtrees.

  • Refactors
    • Dependencies: replace tree-sitter-perl with ts-parser-perl via Cargo package-rename; tree_sitter_perl path remains unchanged.
    • Symbols: update queries to package_statement (package), subroutine_declaration_statement, and method_declaration_statement; capture variables with the sigil; detect constants from use constant NAME => ... gated by pragma text.
    • Imports: unify all use/no as a single use_statement with a module field; parse require Foo; as an expression_statement wrapping require_expression; extend the statement range through a trailing ; to avoid stray semicolons.
    • Calls: switch to function_call_expression and method_call_expression; resolve method callees via the method field.
    • Tests: all Perl tests pass; golden snapshots unchanged.

Written for commit 684dff3. Summary will update on new commits.
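
The "gated by pragma text" step for constants might look something like the sketch below. It assumes the query captures every use_statement and hands the module text and first argument to Rust as plain strings; `is_use_constant`, `constant_names`, and the tuple representation are all hypothetical and do not reflect aft's real data structures.

```rust
/// Hypothetical gate: does this use_statement's module text name the `constant` pragma?
/// The comparison runs in Rust instead of relying on an `#eq?` query predicate.
fn is_use_constant(module_text: &str) -> bool {
    module_text == "constant"
}

/// Toy captures: (module text, first argument text) for each use_statement.
/// Returns the first argument of every `use constant ...` statement.
fn constant_names<'a>(captures: &[(&'a str, &'a str)]) -> Vec<&'a str> {
    let mut names = Vec::new();
    for &(module, first_arg) in captures {
        if is_use_constant(module) {
            names.push(first_arg);
        }
    }
    names
}

fn main() {
    // `use parent -norequire, ...` has the same shape but must not yield a constant.
    let captures = [("constant", "PI"), ("parent", "-norequire"), ("constant", "DEBUG")];
    println!("{:?}", constant_names(&captures)); // prints ["PI", "DEBUG"]
}
```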

The crates.io `tree-sitter-perl` is an independent, older Perl grammar by a
different author; `ts-parser-perl` is the crate for
github.com/tree-sitter-perl/tree-sitter-perl, which parses real-world Perl
substantially more accurately (95.4% vs 40.2% clean parse across 8,342
real-world files; ~92% fewer ERROR/MISSING failures). For a symbol/import/call
indexer, an ERROR node makes that subtree untraversable, so the more accurate
grammar recovers symbols that are silently missed today.

The two grammars use different node names, so this is not a drop-in dependency
bump. Changes:

- Cargo: package-rename so the `tree_sitter_perl` path is unchanged.
- parser.rs PERL_QUERY: package_statement/subroutine_declaration_statement/
  method_declaration_statement; `use constant` constants captured and gated on
  pragma text in Rust (the binding does not evaluate #eq? predicates); variable
  capture keeps the sigil ($counter).
- perl_package_name: read the `package` name node.
- imports/perl.rs: every use/no pragma is one `use_statement` (use/no keyword +
  `module` field); require is an `expression_statement` wrapping
  `require_expression`. Extend the statement range through the trailing `;`
  sibling so organize/remove leave no stray semicolon.
- calls.rs: function_call_expression/method_call_expression; method-call callee
  resolves via the `method` field.

All Perl tests pass (imports, outline, round-trip, organize golden corpus).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@socket-security

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
|---|---|---|---|---|---|---|
| Added | cargo/ts-parser-perl@1.1.3 | 90 | 100 | 100 | 100 | 100 |

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 5 files

@rabbiveesh

Author

let me know if you'd like the comments cleaned up - you know how claude can be sometimes
