feat(resources): add zip file support, enhanced PDF parsing, intermediary conversions, and structured data extraction by Copilot · Pull Request #179 · ClickerMonkey/aeye

Copilot · 2026-06-28T15:52:33Z

Adds hierarchical resource support (parent/children) to ParsedResource, a new zip file parser, enhanced PDF parsing that mirrors the cletus package's render-then-transcribe approach, intermediary document-to-PDF conversion, and automatic structured data extraction for tabular file types.

Changes

Parent/Children on `ParsedResource`

children?: ParsedResource[] — zip entries, rendered PDF pages
parentLocation?: ResourceLocation — back-reference to parent
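
To illustrate how the two new fields might be consumed, here is a minimal sketch of a depth-first walk over a resource tree. The trimmed-down `ParsedResource` shape, the `location` field, the `string` alias for `ResourceLocation`, and the `walkResources` helper are assumptions for illustration only; the real types in the package carry more fields.

```typescript
// Hypothetical, trimmed-down shapes; only children/parentLocation come from this PR.
type ResourceLocation = string;

interface ParsedResource {
  location: ResourceLocation;
  children?: ParsedResource[];
  parentLocation?: ResourceLocation;
}

// Depth-first walk that yields a resource and then each of its descendants.
function* walkResources(root: ParsedResource): Generator<ParsedResource> {
  yield root;
  for (const child of root.children ?? []) {
    yield* walkResources(child);
  }
}

// A zip archive with two entries, each pointing back at its parent.
const archive: ParsedResource = {
  location: "/tmp/bundle.zip",
  children: [
    { location: "/tmp/bundle.zip#a.txt", parentLocation: "/tmp/bundle.zip" },
    { location: "/tmp/bundle.zip#b.pdf", parentLocation: "/tmp/bundle.zip" },
  ],
};

const locations = Array.from(walkResources(archive)).map((r) => r.location);
```

Keeping only a location in `parentLocation`, rather than an object reference, avoids cycles when a resource tree is serialized.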

Enhanced PDF parser

New PdfParseOptions: renderPages, renderDpi, transcribePages, extractImages
renderPdfPages callback batch-renders all pages to a temp directory (no per-page round-trips, no full PDF in JS heap)
Falls back to filesystem-based pdf-parse text extraction if rendering errors or isn't configured
The parser reads the PDF from its filesystem path directly, so the file is never loaded into the JS heap in full
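
The render-first, fall-back-to-text flow can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's implementation: `parsePdfSketch`, the `PdfCallbacks` shape, the `string[]` return type of `renderPdfPages`, and the default DPI of 150 are all hypothetical.

```typescript
// Option names come from the PR description; everything else is assumed.
interface PdfParseOptions {
  renderPages?: boolean;
  renderDpi?: number;
  transcribePages?: boolean;
  extractImages?: boolean;
}

interface PdfCallbacks {
  // Assumed to resolve to the paths of the rendered page images.
  renderPdfPages?: (pdfPath: string, outDir: string, dpi: number) => Promise<string[]>;
  // Stand-in for filesystem-based text extraction (pdf-parse in the PR).
  extractText: (pdfPath: string) => Promise<string>;
}

type PdfOutcome =
  | { mode: "rendered"; pages: string[] }
  | { mode: "text"; text: string };

async function parsePdfSketch(
  pdfPath: string,
  outDir: string,
  options: PdfParseOptions,
  callbacks: PdfCallbacks,
): Promise<PdfOutcome> {
  if (options.renderPages && callbacks.renderPdfPages) {
    try {
      // One batch call renders every page; only file paths come back.
      const pages = await callbacks.renderPdfPages(pdfPath, outDir, options.renderDpi ?? 150);
      return { mode: "rendered", pages };
    } catch {
      // Rendering failed: fall through to text extraction below.
    }
  }
  return { mode: "text", text: await callbacks.extractText(pdfPath) };
}
```

Only paths cross the callback boundary in this sketch, which is what keeps page images and the PDF itself out of the JS heap.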

Intermediary document-to-PDF conversion (`convertToPdf`)

New convertToPdf option on ParseOptions: receives a source file path, returns the path to a converted PDF
DOCX and Excel parsers check for convertToPdf and delegate to the PDF parser when available, enabling richer page rendering and transcription for any document type
Falls back to native extraction (mammoth/xlsx) if conversion fails or isn't configured

Structured data extraction for tabular types

New ExtractedTable interface: headers, rows, name, sheetName, sheetIndex
Excel parser auto-detects table regions within sheets (separated by empty rows), extracts headers and data rows
Tables rendered as markdown tables in text parts; raw structured data available in resource.metadata.structuredTables
Excel parser now also handles csv and tsv types natively
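
As a rough illustration of the region detection described above, the sketch below splits a sheet into blocks of consecutive non-empty rows, treats the first row of each block as its header, and renders a block as a markdown table. The field names follow `ExtractedTable`; the `Cell` type, the table naming scheme, and the helpers `extractTables` and `toMarkdown` are assumptions, and multi-level headers are not handled here.

```typescript
// Field names follow the ExtractedTable interface in this PR; the cell type is assumed.
type Cell = string | number | null;

interface ExtractedTable {
  name: string;
  sheetName: string;
  sheetIndex: number;
  headers: string[];
  rows: Cell[][];
}

const isEmptyRow = (row: Cell[]): boolean =>
  row.every((cell) => cell === null || String(cell).trim() === "");

// Split a sheet into blocks separated by empty rows; first row of each block is the header.
function extractTables(sheet: Cell[][], sheetName: string, sheetIndex: number): ExtractedTable[] {
  const tables: ExtractedTable[] = [];
  let block: Cell[][] = [];

  const flush = (): void => {
    if (block.length > 0) {
      const [headerRow, ...rows] = block;
      tables.push({
        name: `${sheetName} table ${tables.length + 1}`,
        sheetName,
        sheetIndex,
        headers: headerRow.map((cell) => String(cell ?? "")),
        rows,
      });
    }
    block = [];
  };

  for (const row of sheet) {
    if (isEmptyRow(row)) flush();
    else block.push(row);
  }
  flush();
  return tables;
}

// Render one table as a markdown table (cell contents are not escaped in this sketch).
function toMarkdown(table: ExtractedTable): string {
  const line = (cells: Cell[]): string => `| ${cells.map((c) => String(c ?? "")).join(" | ")} |`;
  return [line(table.headers), line(table.headers.map(() => "---")), ...table.rows.map(line)].join("\n");
}

const sheet: Cell[][] = [
  ["name", "qty"],
  ["apple", 3],
  [null, null],
  ["region", "total"],
  ["west", 10],
];
const tables = extractTables(sheet, "Sheet1", 0);
const markdown = toMarkdown(tables[0]);
```

With the sample sheet, the empty third row separates two tables, so `tables` holds one entry for the `name`/`qty` block and one for the `region`/`total` block.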

Zip parser (`parsers/zip.ts`)

Reads from filesystem, processes entries sequentially to bound memory
Zip bomb protection (1000 file cap, 100MB total, 50MB per entry)
Each entry → child ParsedResource; parent gets a text listing part
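
The three caps could be enforced with a check along these lines. The numbers come from the list above; the `ZipLimits` and `ZipEntryInfo` shapes, the choice of 1 MB = 1024 × 1024 bytes, and rejecting the whole archive on the first violation are assumptions. Sizes declared in zip headers can be forged, so a real parser would presumably also count bytes while decompressing.

```typescript
// Caps from the PR description; type and function names are hypothetical.
interface ZipLimits {
  maxFiles: number;
  maxTotalBytes: number;
  maxEntryBytes: number;
}

const DEFAULT_ZIP_LIMITS: ZipLimits = {
  maxFiles: 1000,
  maxTotalBytes: 100 * 1024 * 1024,
  maxEntryBytes: 50 * 1024 * 1024,
};

interface ZipEntryInfo {
  name: string;
  uncompressedSize: number;
}

type LimitResult = { ok: true } | { ok: false; reason: string };

// Walk entries in order, keep a running total, and stop at the first violated cap.
function checkZipLimits(entries: ZipEntryInfo[], limits: ZipLimits = DEFAULT_ZIP_LIMITS): LimitResult {
  if (entries.length > limits.maxFiles) {
    return { ok: false, reason: `too many entries: ${entries.length}` };
  }
  let total = 0;
  for (const entry of entries) {
    if (entry.uncompressedSize > limits.maxEntryBytes) {
      return { ok: false, reason: `entry too large: ${entry.name}` };
    }
    total += entry.uncompressedSize;
    if (total > limits.maxTotalBytes) {
      return { ok: false, reason: `archive too large after: ${entry.name}` };
    }
  }
  return { ok: true };
}
```

Because the total is accumulated entry by entry, the check fits the sequential processing model: it can stop before any oversized entry is read.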

Type mappings

.zip, .tar.gz, .tgz, .jar extensions → "zip" type
application/zip, application/x-zip-compressed, application/x-tar, application/gzip MIME types → "zip" type
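
A lookup over these mappings might look like the sketch below. The map contents mirror the two lines above; the `detectType` helper, the longest-suffix-first matching (so that `.tar.gz` is recognized as a unit), and the choice to prefer the extension over the MIME type are assumptions.

```typescript
const EXTENSION_TYPES: Record<string, string> = {
  ".zip": "zip",
  ".tar.gz": "zip",
  ".tgz": "zip",
  ".jar": "zip",
};

const MIME_TYPES: Record<string, string> = {
  "application/zip": "zip",
  "application/x-zip-compressed": "zip",
  "application/x-tar": "zip",
  "application/gzip": "zip",
};

// Resolve a resource type from a file name, then from a MIME type if no extension matched.
function detectType(fileName: string, mimeType?: string): string | undefined {
  const lower = fileName.toLowerCase();
  // Try longer extensions first so multi-part suffixes are matched whole.
  const extensions = Object.keys(EXTENSION_TYPES).sort((a, b) => b.length - a.length);
  const extension = extensions.find((candidate) => lower.endsWith(candidate));
  if (extension) return EXTENSION_TYPES[extension];
  if (mimeType) {
    // Drop parameters such as "; charset=binary" before the lookup.
    const base = (mimeType.split(";")[0] ?? "").trim().toLowerCase();
    return MIME_TYPES[base];
  }
  return undefined;
}
```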

Code quality

All inline await import(...) calls removed; all imports are top-level
No full file data loaded into JS heap for PDF — uses filesystem path directly

Usage

// poppler, vision, and libreoffice below stand for whatever tools the consumer wires in.
const pdfResult = await registry.parse("/path/to/doc.pdf", {
  pdf: { renderPages: true, transcribePages: true, renderDpi: 150 },
  renderPdfPages: async (pdfPath, outDir, dpi, signal) => {
    // Batch-render every page into outDir in one call.
    await poppler.pdfToPpm(pdfPath, `${outDir}/page`, { pngFile: true, resolutionXYAxis: dpi });
    // `pages` stands for the rendered page files in outDir; collecting them is elided here.
    return pages;
  },
  transcribeImage: async (imageData, part, source) => {
    return await vision.transcribe(imageData);
  },
});

// Convert DOCX to PDF for richer parsing
const docxResult = await registry.parse("/path/to/doc.docx", {
  pdf: { renderPages: true, transcribePages: true },
  convertToPdf: async (sourcePath, signal) => {
    // Must resolve to the path of the converted PDF.
    return await libreoffice.convert(sourcePath, "pdf");
  },
  renderPdfPages: async (pdfPath, outDir, dpi, signal) => { /* ... */ },
  transcribeImage: async (imageData, part, source) => { /* ... */ },
});

// Structured data from Excel
const xlsxResult = await registry.parse("/path/to/data.xlsx", {});
// xlsxResult.resource.metadata.structuredTables → ExtractedTable[]
// Each table has .headers and .rows for programmatic access

ClickerMonkey · 2026-06-28T18:23:27Z

+import { readFile } from "node:fs/promises";
+import path from "node:path";
+import { Buffer } from "node:buffer";
+import { ResourceRegistry, createParsedResource, createUrlResourceSource } from "./registry.js";


Change all imports to exclude .js like the other packages in aeye

Done — removed all .js extensions from imports to match the other packages (using moduleResolution: "bundler").

ClickerMonkey · 2026-06-28T18:24:37Z

+  return "text";
+}
+
+const textParser: ResourceParser = {


Let's separate html and markdown and code parsers instead of sticking everything in text

Done — split into separate parsers: textParser, markdownParser, htmlParser, codeParser, each in their own file under src/parsers/.

ClickerMonkey · 2026-06-28T18:25:55Z

+    const slices: ResourceSlice[] = [];
+    const maxChars = context.options.maxChars ?? 2000;
+    const minChars = context.options.minChars ?? 400;
+    const declarationPattern = /^(export\s+)?(async\s+)?(function|class|interface|type|enum|const|let|var)\s+([^=(<{]+)/;


This sort of stuff should be configurable instead of fixed

Done — the declaration pattern is now configurable via options.code.declarationPattern in the CodeParserOptions interface. Defaults to the same regex as before.

ClickerMonkey · 2026-06-28T18:26:04Z

+
+      lines.forEach((line) => {
+        const trimmed = line.trim();
+        if (/^(import|export\s+.*from\s+|const\s+.*=\s*require\()/.test(trimmed)) {


Needs to be configurable

Done — the import pattern is now configurable via options.code.importPattern in the CodeParserOptions interface.
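
A minimal sketch of how the two options might be resolved and applied follows. The option names come from the replies above, and the default regexes are copied from the quoted diff hunks (later commits in this PR touched regexes for ReDoS, so the shipped defaults may differ). The `classifyLines` helper and the Python-style override are hypothetical.

```typescript
interface CodeParserOptions {
  declarationPattern?: RegExp;
  importPattern?: RegExp;
}

// Defaults copied from the diff hunks quoted in this thread.
const DEFAULT_DECLARATION_PATTERN =
  /^(export\s+)?(async\s+)?(function|class|interface|type|enum|const|let|var)\s+([^=(<{]+)/;
const DEFAULT_IMPORT_PATTERN = /^(import|export\s+.*from\s+|const\s+.*=\s*require\()/;

// Sort trimmed lines into imports and declarations using the configured patterns.
// Patterns are expected to be non-global, since RegExp.test is stateful with the g flag.
function classifyLines(
  source: string,
  options: CodeParserOptions = {},
): { imports: string[]; declarations: string[] } {
  const declarationPattern = options.declarationPattern ?? DEFAULT_DECLARATION_PATTERN;
  const importPattern = options.importPattern ?? DEFAULT_IMPORT_PATTERN;
  const imports: string[] = [];
  const declarations: string[] = [];
  for (const line of source.split("\n")) {
    const trimmed = line.trim();
    if (importPattern.test(trimmed)) imports.push(trimmed);
    else if (declarationPattern.test(trimmed)) declarations.push(trimmed);
  }
  return { imports, declarations };
}

// Defaults target TypeScript/JavaScript source.
const tsLines = classifyLines('import fs from "node:fs";\nexport function run() {}\nconst y = 1;');

// A caller could swap in patterns for another language, e.g. Python.
const pyLines = classifyLines("import os\ndef main():\n    pass", {
  importPattern: /^(from\s+\S+\s+import|import)\s/,
  declarationPattern: /^(def|class)\s+\w+/,
});
```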

ClickerMonkey · 2026-06-28T18:27:52Z

+    return { source, resource };
+  }
+
+  async slice(resource: ParsedResource, options: SliceOptions = {}): Promise<import("./types.js").ResourceSlice[]> {


No inline importing anywhere in the repo

Done — removed the import("./types.js") inline type import and used the proper top-level ResourceSlice import instead.

ClickerMonkey · 2026-06-28T18:29:22Z

+      throw new Error(`Failed to resolve ${link}: ${response.status} ${response.statusText}`);
+    }
+
+    const input = new Uint8Array(await response.arrayBuffer());


We need to support streaming as the default to be passed around, assume any resource could cause an OOM error and we should avoid at all costs just loading all the data in mem at once.

Done — added streaming support throughout:

ResourceInput type now supports ReadableStream, factory functions () => AsyncIterable<Uint8Array>, and all existing iterable types

File resolver uses createReadStream with a lazy factory to avoid eagerly reading entire files into memory

URL resolver passes response.body as a ReadableStream when available

collectInput handles all streaming input types, collecting incrementally
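
For illustration, normalizing these input kinds into a single chunk stream and collecting them incrementally could look like the sketch below. The `ResourceInput` union here is a reduced version of what the reply lists, and the `toChunks` helper and the `maxBytes` guard are assumptions that are not confirmed by this PR.

```typescript
// Reduced, assumed version of the ResourceInput union.
type ResourceInput =
  | Uint8Array
  | ReadableStream<Uint8Array>
  | AsyncIterable<Uint8Array>
  | (() => AsyncIterable<Uint8Array>);

// Normalize every supported input kind into an async stream of chunks.
async function* toChunks(input: ResourceInput): AsyncGenerator<Uint8Array> {
  if (input instanceof Uint8Array) {
    yield input;
  } else if (typeof input === "function") {
    // Lazy factory: nothing is opened or read until iteration starts.
    yield* input();
  } else if (input instanceof ReadableStream) {
    const reader = input.getReader();
    try {
      for (;;) {
        const result = await reader.read();
        if (result.done) break;
        yield result.value;
      }
    } finally {
      reader.releaseLock();
    }
  } else {
    yield* input;
  }
}

// Collect chunks one at a time, failing fast once the (hypothetical) byte cap is exceeded.
async function collectInput(
  input: ResourceInput,
  maxBytes: number = Number.POSITIVE_INFINITY,
): Promise<Uint8Array> {
  const chunks: Uint8Array[] = [];
  let total = 0;
  for await (const chunk of toChunks(input)) {
    total += chunk.byteLength;
    if (total > maxBytes) throw new Error(`input exceeds ${maxBytes} bytes`);
    chunks.push(chunk);
  }
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return out;
}
```

Collecting still materializes the whole payload, so a cap like `maxBytes` is one way to keep a parser that needs the full buffer from turning an oversized resource into an OOM.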

ClickerMonkey · 2026-06-28T18:30:53Z

+    }
+
+    const input = new Uint8Array(await response.arrayBuffer());
+    const mimeType = response.headers.get("content-type") ?? undefined;


I mentioned if html mimeType it should instead use puppeteer if they have it enabled and configured. Maybe an option renderUrl or something that's given a url and needs to return html once the page is settled (see ginny). Then provide a puppeteer implementation

Done — added renderUrl option to ParseOptions. The HTML parser checks for renderUrl and if the source location is an HTTP URL, it calls the provided function to render the page first (e.g. via puppeteer). The type signature is (url: string, signal?: AbortSignal) => Promise<string> so consumers can plug in any rendering implementation.
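
The decision the HTML parser makes can be sketched as below. The `RenderUrl` signature is the one stated in the reply; the `resolveHtml` helper, the `rawHtml` parameter, and the HTTP check via a regex are assumptions.

```typescript
// Signature as stated in the reply above.
type RenderUrl = (url: string, signal?: AbortSignal) => Promise<string>;

const isHttpUrl = (location: string): boolean => /^https?:\/\//i.test(location);

// Prefer the rendered page when a renderer is configured and the source is an HTTP(S) URL;
// otherwise keep the HTML that was already fetched or read.
async function resolveHtml(
  location: string,
  rawHtml: string,
  renderUrl?: RenderUrl,
  signal?: AbortSignal,
): Promise<string> {
  if (renderUrl && isHttpUrl(location)) {
    return renderUrl(location, signal);
  }
  return rawHtml;
}
```

A Puppeteer-backed `renderUrl` would typically launch a browser, open a page, navigate with `page.goto(url, { waitUntil: "networkidle0" })`, and return `page.content()`, closing the browser afterwards. That wiring is left out here so the sketch stays dependency-free.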

ClickerMonkey · 2026-06-28T18:33:48Z

+  }
+};
+
+const codeSlicer: ResourceSlicer = {


Each slicer, parser, resolver should get its own file.

Also add Excel and doc and PDF as seen in ginny and cletus package code if you didn't add it yet.

Done — each parser, slicer, and resolver now has its own file:

src/parsers/ — text, markdown, html, code, image, pdf, excel, docx

src/slicers/ — text, markdown, code

src/resolvers/ — file, url

Also added PDF (using pdf-parse), Excel (using xlsx), and DOCX (using mammoth) parsers as optional peer dependencies with runtime availability checks.

ClickerMonkey · 2026-06-28T18:34:28Z

@copilot continue!


…d PDF options
- Add children/parentLocation fields to ParsedResource for hierarchical resources
- Add PdfParseOptions with renderPages, transcribePages, extractImages, renderDpi
- Add renderPdfPages callback (batch renders all pages to disk, avoids memory overload)
- PDF parser: tries render+transcribe first, falls back to text extraction on error
- Add zip parser using jszip (reads from filesystem, processes entries one at a time)
- Add zip bomb protection (max files, max size, max per-file size)
- Register zip type in extension/mime maps and default registry

- Use createPartId for consistent ID formatting in zip parser
- Add comprehensive image extension-to-MIME mapping
- Rename variable to canTranscribePages for clarity
- Pass page-specific source to transcribeImage callback

ClickerMonkey

@copilot

ClickerMonkey · 2026-06-28T20:22:10Z

I saw inline importing again, don't do any of that. Also rendering one page at a time for PDF could be slow, and I also saw loading all PDF data - that could crash the server. Please stop doing that. Use the file system or something to avoid that. Maybe you addressed some of that, make sure you did.

Also support a way to have intermediary conversions. So if I wanted I could use a tool to convert doc files into PDF - since PDF parsing is richer and can render pages. There are tools that can take multiple formats and pdfify them

Also for Excel or any tabular data, support trying to parse out structured data. For CSV that's obvious; for Excel there could be multiple sections in one tab and potentially multiple little tables with layers of headers, etc. Let's make auto structured data extraction native for supported file types.

ClickerMonkey

@copilot

The renderPdfPages and convertToPdf options are not very forward thinking. Maybe support parser stacking, so you can add a PDF parser (like a rendering one) that's attempted first before falling back to the other. Let's add a way for per-type config options to be specified on the registry.

Initial plan

b665c25

Copilot AI assigned Copilot and ClickerMonkey Jun 28, 2026

Copilot started work on behalf of ClickerMonkey June 28, 2026 15:53

Copilot AI added 3 commits June 28, 2026 16:14

feat: add extensible resources package

c97722d

fix: harden resources parsing helpers

fb70524

chore: finalize resources validation fixes

f93c569

Copilot AI changed the title ~~[WIP] Add aeye/resources package for input stream processing~~ Add extensible @aeye/resources package for parsing, slicing, and resolving crawlable resources Jun 28, 2026

Copilot AI requested a review from ClickerMonkey June 28, 2026 16:19

Copilot finished work on behalf of ClickerMonkey June 28, 2026 16:19

ClickerMonkey requested changes Jun 28, 2026

Copilot started work on behalf of ClickerMonkey June 28, 2026 18:34

Copilot stopped work on behalf of ClickerMonkey due to an error June 28, 2026 18:35
The session was cancelled by the user.

Copilot AI requested a review from ClickerMonkey June 28, 2026 18:35

Copilot started work on behalf of ClickerMonkey June 28, 2026 18:35

Copilot stopped work on behalf of ClickerMonkey due to an error June 28, 2026 18:37
The session was cancelled by the user.

Copilot started work on behalf of ClickerMonkey June 28, 2026 18:38

Copilot AI added 3 commits June 28, 2026 18:52

refactor: address PR review - split parsers/slicers/resolvers into files, remove .js imports, add streaming, configurable patterns, renderUrl, PDF/Excel/DOCX parsers

a58f747

fix: address code review and CodeQL feedback - fix regex redos, remove type casts, add code opts to SliceOptions

ad7ee32

fix: resolve remaining CodeQL regex redos and rename module variables for clarity

ecd2abd

Copilot AI changed the title ~~Add extensible @aeye/resources package for parsing, slicing, and resolving crawlable resources~~ Refactor @aeye/resources: modular parsers, streaming, configurable patterns, PDF/Excel/DOCX support Jun 28, 2026

Copilot finished work on behalf of ClickerMonkey June 28, 2026 18:58

Copilot started work on behalf of ClickerMonkey June 28, 2026 19:03

Copilot AI added 2 commits June 28, 2026 19:25

Copilot AI changed the title ~~Refactor @aeye/resources: modular parsers, streaming, configurable patterns, PDF/Excel/DOCX support~~ feat(resources): add zip file support and enhanced PDF parsing with render/transcribe options Jun 28, 2026

Copilot finished work on behalf of ClickerMonkey June 28, 2026 19:28

Copilot started work on behalf of ClickerMonkey June 28, 2026 20:11

Copilot stopped work on behalf of ClickerMonkey due to an error June 28, 2026 20:17
The session was cancelled by the user.

ClickerMonkey requested changes Jun 28, 2026

Copilot started work on behalf of ClickerMonkey June 28, 2026 20:22

Apply remaining changes

d42deba

Copilot AI changed the title ~~feat(resources): add zip file support and enhanced PDF parsing with render/transcribe options~~ feat(resources): add zip file support, enhanced PDF parsing, intermediary conversions, and structured data extraction Jun 28, 2026

Copilot AI requested a review from ClickerMonkey June 28, 2026 20:27

Copilot finished work on behalf of ClickerMonkey June 28, 2026 20:27

ClickerMonkey requested changes Jun 28, 2026

ClickerMonkey merged commit d42deba into main Jun 29, 2026


2 participants
