
feat(resources): add zip file support, enhanced PDF parsing, intermediary conversions, and structured data extraction #179

Merged
ClickerMonkey merged 10 commits into main from copilot/add-aeye-resources-package
Jun 29, 2026

Conversation

Copilot AI commented Jun 28, 2026

Contributor

Adds hierarchical resource support (parent/children) to ParsedResource, a new zip file parser, enhanced PDF parsing that mirrors the cletus package's render-then-transcribe approach, intermediary document-to-PDF conversion, and automatic structured data extraction for tabular file types.

Changes

Parent/Children on ParsedResource

  • children?: ParsedResource[] — zip entries, rendered PDF pages
  • parentLocation?: ResourceLocation — back-reference to parent
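
As an illustration of how a consumer might use the new hierarchy, here is a minimal sketch of a depth-first walk over `children`. The trimmed-down `ParsedResource` and `ResourceLocation` shapes (in particular the `uri` field) are assumptions for the example, not the package's actual definitions.

```typescript
// Hypothetical, trimmed-down shapes; the real types carry more fields.
interface ResourceLocation {
  uri: string; // assumed field, for illustration only
}

interface ParsedResource {
  location: ResourceLocation;
  children?: ParsedResource[];
  parentLocation?: ResourceLocation;
}

// Depth-first walk that yields the parent before its children.
function* walk(resource: ParsedResource): Generator<ParsedResource> {
  yield resource;
  for (const child of resource.children ?? []) {
    yield* walk(child);
  }
}

const zipLocation: ResourceLocation = { uri: "file:///tmp/bundle.zip" };
const archive: ParsedResource = {
  location: zipLocation,
  children: [
    { location: { uri: "file:///tmp/bundle.zip#a.txt" }, parentLocation: zipLocation },
    { location: { uri: "file:///tmp/bundle.zip#b.pdf" }, parentLocation: zipLocation },
  ],
};

// Flatten the tree into a list of locations.
const uris = Array.from(walk(archive)).map((r) => r.location.uri);
```

The `parentLocation` back-reference lets a child found on its own (for example, a single rendered PDF page) be traced to the resource it came from without holding a pointer to the parent object.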

Enhanced PDF parser

  • New PdfParseOptions: renderPages, renderDpi, transcribePages, extractImages
  • renderPdfPages callback batch-renders all pages to a temp directory (no per-page round-trips, no full PDF in JS heap)
  • Falls back to filesystem-based pdf-parse text extraction if rendering errors or isn't configured
  • The PDF is never loaded into the JS heap; the parser reads it directly from its filesystem path
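
The batch-render-with-fallback flow described above could look roughly like the sketch below. The `RenderPdfPages` type is a simplified assumption modeled on the usage example in this PR, and `renderAllPages` and `fakeRender` are hypothetical names; returning `null` stands in for "fall back to text extraction".

```typescript
import { mkdtemp, readdir, rm, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Assumed, simplified callback shape: render every page of pdfPath into outDir.
type RenderPdfPages = (pdfPath: string, outDir: string, dpi: number) => Promise<void>;

interface RenderedPages {
  dir: string;     // temp directory holding the page images
  pages: string[]; // sorted image paths
}

// Renders all pages in one batch. Returns null when rendering is not configured
// or fails, so the caller can fall back to plain text extraction.
async function renderAllPages(
  pdfPath: string,
  render: RenderPdfPages | undefined,
  dpi = 150,
): Promise<RenderedPages | null> {
  if (!render) return null;
  const dir = await mkdtemp(join(tmpdir(), "pdf-pages-"));
  try {
    await render(pdfPath, dir, dpi);
    // Lexicographic sort; assumes the renderer zero-pads page numbers.
    const names = (await readdir(dir)).sort();
    return { dir, pages: names.map((name) => join(dir, name)) };
  } catch {
    await rm(dir, { recursive: true, force: true });
    return null;
  }
}

// Stand-in renderer for demonstration; a real one would invoke a PDF renderer.
const fakeRender: RenderPdfPages = async (_pdfPath, outDir) => {
  await writeFile(join(outDir, "page-2.png"), "");
  await writeFile(join(outDir, "page-1.png"), "");
};
```

Only file paths cross the callback boundary here, which is what keeps the PDF bytes and the page images out of the JS heap until a page is actually transcribed.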

Intermediary document-to-PDF conversion (convertToPdf)

  • New convertToPdf option on ParseOptions: receives a source file path, returns the path to a converted PDF
  • DOCX and Excel parsers check for convertToPdf and delegate to the PDF parser when available, enabling richer page rendering and transcription for any document type
  • Falls back to native extraction (mammoth/xlsx) if conversion fails or isn't configured

Structured data extraction for tabular types

  • New ExtractedTable interface: headers, rows, name, sheetName, sheetIndex
  • Excel parser auto-detects table regions within sheets (separated by empty rows), extracts headers and data rows
  • Tables rendered as markdown tables in text parts; raw structured data available in resource.metadata.structuredTables
  • Excel parser now also handles csv and tsv types natively
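
To make the region detection concrete, here is a minimal sketch of splitting a sheet into tables at empty rows and rendering one as a markdown table. It is an illustration under simplifying assumptions (the first row of each region is the header, cells contain no `|` characters), not the parser's actual algorithm; `extractTables` and `toMarkdown` are hypothetical names, and only the `headers` and `rows` fields of `ExtractedTable` are modeled.

```typescript
// Only the two fields used in this sketch.
interface ExtractedTable {
  headers: string[];
  rows: string[][];
}

type Cell = string | number | null | undefined;

// A row is empty when it has no cells or every cell is blank.
const isEmptyRow = (row: Cell[]): boolean =>
  row.every((cell) => cell === null || cell === undefined || String(cell).trim() === "");

// Splits a sheet into table regions separated by one or more empty rows.
function extractTables(sheet: Cell[][]): ExtractedTable[] {
  const tables: ExtractedTable[] = [];
  let region: Cell[][] = [];

  const flush = (): void => {
    if (region.length > 0) {
      const [headers, ...rows] = region.map((row) =>
        row.map((cell) => (cell === null || cell === undefined ? "" : String(cell))),
      );
      tables.push({ headers, rows });
    }
    region = [];
  };

  for (const row of sheet) {
    if (isEmptyRow(row)) flush();
    else region.push(row);
  }
  flush(); // close the last region
  return tables;
}

// Renders a table in pipe-delimited markdown form.
function toMarkdown(table: ExtractedTable): string {
  const line = (cells: string[]): string => `| ${cells.join(" | ")} |`;
  return [line(table.headers), line(table.headers.map(() => "---")), ...table.rows.map(line)].join("\n");
}

const tables = extractTables([
  ["name", "qty"],
  ["apple", 3],
  [],
  ["city"],
  ["Oslo"],
]);
const markdown = toMarkdown(tables[0]);
```

Multi-level headers and tables placed side by side on the same rows, both raised later in the review, would need more than an empty-row split.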

Zip parser (parsers/zip.ts)

  • Reads from filesystem, processes entries sequentially to bound memory
  • Zip bomb protection (1000-file cap, 100 MB total, 50 MB per entry)
  • Each entry → child ParsedResource; parent gets a text listing part
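
The three limits can be enforced with a small running budget, sketched below. The numbers come from the description above; the names (`ZipBudget`, `EntryInfo`), the choice of 1 MB = 1024 × 1024 bytes, and throwing rather than skipping an offending entry are all assumptions.

```typescript
// Limits from the PR description; the binary-megabyte interpretation is assumed.
const MAX_FILES = 1000;
const MAX_TOTAL_BYTES = 100 * 1024 * 1024;
const MAX_ENTRY_BYTES = 50 * 1024 * 1024;

interface EntryInfo {
  name: string;
  uncompressedSize: number;
}

// Tracks how much of the archive has been admitted so far.
class ZipBudget {
  private files = 0;
  private totalBytes = 0;

  // Throws if this entry would exceed any limit; otherwise records it.
  admit(entry: EntryInfo): void {
    if (this.files + 1 > MAX_FILES) {
      throw new Error(`archive exceeds ${MAX_FILES} entries`);
    }
    if (entry.uncompressedSize > MAX_ENTRY_BYTES) {
      throw new Error(`${entry.name} exceeds the per-entry size limit`);
    }
    if (this.totalBytes + entry.uncompressedSize > MAX_TOTAL_BYTES) {
      throw new Error("archive exceeds the total size limit");
    }
    this.files += 1;
    this.totalBytes += entry.uncompressedSize;
  }
}
```

Sizes declared in zip headers can be forged, so a hardened version would presumably also count bytes while decompressing and abort once the budget is exceeded.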

Type mappings

  • .zip, .tar.gz, .tgz, .jar extensions → "zip" type
  • application/zip, application/x-zip-compressed, application/x-tar, application/gzip MIME types
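
One detail worth illustrating is the compound `.tar.gz` extension, which a naive "text after the last dot" lookup would see as `.gz`. A minimal sketch of a suffix-based lookup follows; `EXTENSION_TYPES` and `typeForPath` are hypothetical names, and the table only contains the extensions listed above.

```typescript
// Extension-to-type table limited to the mappings named in this PR.
const EXTENSION_TYPES: Record<string, string> = {
  ".zip": "zip",
  ".tar.gz": "zip",
  ".tgz": "zip",
  ".jar": "zip",
};

// Longest suffixes first, so compound extensions match as a whole.
const SUFFIXES = Object.keys(EXTENSION_TYPES).sort((a, b) => b.length - a.length);

function typeForPath(filePath: string): string | undefined {
  const lower = filePath.toLowerCase();
  const hit = SUFFIXES.find((suffix) => lower.endsWith(suffix));
  return hit ? EXTENSION_TYPES[hit] : undefined;
}
```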

Code quality

  • All inline await import(...) calls removed; all imports are top-level
  • No full file data loaded into JS heap for PDF — uses filesystem path directly

Usage

// Render and transcribe PDF pages
const pdfResult = await registry.parse("/path/to/doc.pdf", {
  pdf: { renderPages: true, transcribePages: true, renderDpi: 150 },
  renderPdfPages: async (pdfPath, outDir, dpi, signal) => {
    // Batch-render every page to PNG files under outDir, using the "page" prefix
    await poppler.pdfToPpm(pdfPath, `${outDir}/page`, { pngFile: true, resolutionXYAxis: dpi });
    return pages; // placeholder: the rendered pages; collecting them is not shown in this example
  },
  transcribeImage: async (imageData, part, source) => {
    return await vision.transcribe(imageData);
  },
});

// Convert DOCX to PDF for richer parsing
const docxResult = await registry.parse("/path/to/doc.docx", {
  pdf: { renderPages: true, transcribePages: true },
  convertToPdf: async (sourcePath, signal) => {
    // Must resolve to the path of the converted PDF
    return await libreoffice.convert(sourcePath, "pdf");
  },
  renderPdfPages: async (pdfPath, outDir, dpi, signal) => { /* ... */ },
  transcribeImage: async (imageData, part, source) => { /* ... */ },
});

// Structured data from Excel
const tableResult = await registry.parse("/path/to/data.xlsx", {});
// tableResult.resource.metadata.structuredTables → ExtractedTable[]
// Each table has .headers and .rows for programmatic access

Copilot AI changed the title from "[WIP] Add aeye/resources package for input stream processing" to "Add extensible @aeye/resources package for parsing, slicing, and resolving crawlable resources" Jun 28, 2026
Copilot AI requested a review from ClickerMonkey June 28, 2026 16:19
Comment thread packages/resources/src/default.ts Outdated
import { readFile } from "node:fs/promises";
import path from "node:path";
import { Buffer } from "node:buffer";
import { ResourceRegistry, createParsedResource, createUrlResourceSource } from "./registry.js";

Owner

Change all imports to exclude .js like the other packages in aeye

Contributor Author

Done — removed all .js extensions from imports to match the other packages (using moduleResolution: "bundler").

Comment thread packages/resources/src/default.ts Outdated
return "text";
}

const textParser: ResourceParser = {

Owner

Let's separate html and markdown and code parsers instead of sticking everything in text

Contributor Author

Done — split into separate parsers: textParser, markdownParser, htmlParser, codeParser, each in their own file under src/parsers/.

Comment thread packages/resources/src/default.ts Outdated
const slices: ResourceSlice[] = [];
const maxChars = context.options.maxChars ?? 2000;
const minChars = context.options.minChars ?? 400;
const declarationPattern = /^(export\s+)?(async\s+)?(function|class|interface|type|enum|const|let|var)\s+([^=(<{]+)/;

Owner

This sort of stuff should be configurable instead of fixed

Contributor Author

Done — the declaration pattern is now configurable via options.code.declarationPattern in the CodeParserOptions interface. Defaults to the same regex as before.

Comment thread packages/resources/src/default.ts Outdated

lines.forEach((line) => {
const trimmed = line.trim();
if (/^(import|export\s+.*from\s+|const\s+.*=\s*require\()/.test(trimmed)) {

Owner

Needs to be configurable

Contributor Author

Done — the import pattern is now configurable via options.code.importPattern in the CodeParserOptions interface.
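
For readers following the thread, a minimal sketch of how such overridable patterns might be consumed is shown below. The two defaults are the regexes quoted in the review threads; `classifyLine` is a hypothetical helper, and the `CodeParserOptions` shape is assumed from the replies rather than copied from the package.

```typescript
// Assumed shape, based on the option names mentioned in the replies.
interface CodeParserOptions {
  declarationPattern?: RegExp;
  importPattern?: RegExp;
}

// Defaults copied from the snippets under review.
const DEFAULT_DECLARATION =
  /^(export\s+)?(async\s+)?(function|class|interface|type|enum|const|let|var)\s+([^=(<{]+)/;
const DEFAULT_IMPORT = /^(import|export\s+.*from\s+|const\s+.*=\s*require\()/;

type LineKind = "import" | "declaration" | "other";

// Classifies one source line, preferring caller-supplied patterns over the defaults.
function classifyLine(line: string, options: CodeParserOptions = {}): LineKind {
  const trimmed = line.trim();
  if ((options.importPattern ?? DEFAULT_IMPORT).test(trimmed)) return "import";
  if ((options.declarationPattern ?? DEFAULT_DECLARATION).test(trimmed)) return "declaration";
  return "other";
}

// Example override for a language the defaults do not cover.
const pythonOptions: CodeParserOptions = {
  declarationPattern: /^(def|class)\s+\w+/,
  importPattern: /^(import|from)\s+/,
};
```

Patterns passed with the `g` or `y` flag would make `test` stateful through `lastIndex`, so a sketch like this assumes callers supply plain, flag-free expressions.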

Comment thread packages/resources/src/registry.ts Outdated
return { source, resource };
}

async slice(resource: ParsedResource, options: SliceOptions = {}): Promise<import("./types.js").ResourceSlice[]> {

Owner

No inline importing anywhere in the repo

Contributor Author

Done — removed the import("./types.js") inline type import and used the proper top-level ResourceSlice import instead.

Comment thread packages/resources/src/default.ts Outdated
throw new Error(`Failed to resolve ${link}: ${response.status} ${response.statusText}`);
}

const input = new Uint8Array(await response.arrayBuffer());

Owner

We need to support streaming as the default to be passed around, assume any resource could cause an OOM error and we should avoid at all costs just loading all the data in mem at once.

Contributor Author

Done — added streaming support throughout:

  • ResourceInput type now supports ReadableStream, factory functions () => AsyncIterable<Uint8Array>, and all existing iterable types
  • File resolver uses createReadStream with a lazy factory to avoid eagerly reading entire files into memory
  • URL resolver passes response.body as a ReadableStream when available
  • collectInput handles all streaming input types, collecting incrementally

Comment thread packages/resources/src/default.ts Outdated
}

const input = new Uint8Array(await response.arrayBuffer());
const mimeType = response.headers.get("content-type") ?? undefined;

Owner

I mentioned if html mimeType it should instead use puppeteer if they have it enabled and configured. Maybe an option renderUrl or something that's given a url and needs to return html once the page is settled (see ginny). Then provide a puppeteer implementation

Contributor Author

Done — added renderUrl option to ParseOptions. The HTML parser checks for renderUrl and if the source location is an HTTP URL, it calls the provided function to render the page first (e.g. via puppeteer). The type signature is (url: string, signal?: AbortSignal) => Promise<string> so consumers can plug in any rendering implementation.
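
The reply leaves the puppeteer implementation to the consumer. A rough sketch of one is shown below, written against small structural interfaces so it does not import puppeteer itself; `createRenderUrl`, `BrowserLike`, and `PageLike` are hypothetical names, and using `networkidle0` as the definition of "settled" is an assumption.

```typescript
// Minimal structural types covering only the browser calls used below.
interface PageLike {
  goto(url: string, options?: { waitUntil?: "load" | "networkidle0" }): Promise<unknown>;
  content(): Promise<string>;
}

interface BrowserLike {
  newPage(): Promise<PageLike>;
  close(): Promise<void>;
}

// Matches the signature quoted in the reply above.
type RenderUrl = (url: string, signal?: AbortSignal) => Promise<string>;

// Builds a renderUrl implementation from any function that launches a browser.
function createRenderUrl(launch: () => Promise<BrowserLike>): RenderUrl {
  return async (url, signal) => {
    if (signal?.aborted) throw new Error("render aborted");
    const browser = await launch();
    try {
      const page = await browser.newPage();
      // Wait until the network has gone quiet before reading the DOM.
      await page.goto(url, { waitUntil: "networkidle0" });
      return await page.content();
    } finally {
      await browser.close(); // always release the browser, even on failure
    }
  };
}

// With puppeteer installed, wiring it up would look something like:
//   const renderUrl = createRenderUrl(() => puppeteer.launch());
```

Launching a browser per URL is simple but slow; a crawler that renders many pages would more likely share one browser and open a page per request.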

Comment thread packages/resources/src/default.ts Outdated
}
};

const codeSlicer: ResourceSlicer = {

Owner

Each slicer, parser, resolver should get its own file.

Also add Excel and doc and PDF as seen in ginny and cletus package code if you didn't add it yet.

Contributor Author

Done — each parser, slicer, and resolver now has its own file:

  • src/parsers/ — text, markdown, html, code, image, pdf, excel, docx
  • src/slicers/ — text, markdown, code
  • src/resolvers/ — file, url

Also added PDF (using pdf-parse), Excel (using xlsx), and DOCX (using mammoth) parsers as optional peer dependencies with runtime availability checks.

@ClickerMonkey

Owner

@copilot continue!

Copilot AI added 3 commits June 28, 2026 18:52
…les, remove .js imports, add streaming, configurable patterns, renderUrl, PDF/Excel/DOCX parsers
Copilot AI changed the title from "Add extensible @aeye/resources package for parsing, slicing, and resolving crawlable resources" to "Refactor @aeye/resources: modular parsers, streaming, configurable patterns, PDF/Excel/DOCX support" Jun 28, 2026
Copilot AI added 2 commits June 28, 2026 19:25
…d PDF options

- Add children/parentLocation fields to ParsedResource for hierarchical resources
- Add PdfParseOptions with renderPages, transcribePages, extractImages, renderDpi
- Add renderPdfPages callback (batch renders all pages to disk, avoids memory overload)
- PDF parser: tries render+transcribe first, falls back to text extraction on error
- Add zip parser using jszip (reads from filesystem, processes entries one at a time)
- Add zip bomb protection (max files, max size, max per-file size)
- Register zip type in extension/mime maps and default registry
- Use createPartId for consistent ID formatting in zip parser
- Add comprehensive image extension-to-MIME mapping
- Rename variable to canTranscribePages for clarity
- Pass page-specific source to transcribeImage callback
Copilot AI changed the title from "Refactor @aeye/resources: modular parsers, streaming, configurable patterns, PDF/Excel/DOCX support" to "feat(resources): add zip file support and enhanced PDF parsing with render/transcribe options" Jun 28, 2026
Copilot stopped work on behalf of ClickerMonkey due to an error June 28, 2026 20:17

@ClickerMonkey ClickerMonkey left a comment

Owner

I saw inline importing again, don't do any of that. Also rendering one page at a time for PDF could be slow, and I also saw loading all PDF data - that could crash the server. Please stop doing that. Use the file system or something to avoid that. Maybe you addressed some of that, make sure you did.

Also support a way to have intermediary conversions. So if I wanted I could use a tool to convert doc files into PDF - since PDF parsing is richer and can render pages. There are tools that can take multiple formats and pdfify them

Also for Excel or any tabular data, support trying to parse out structured data. For CSV that's obvious; for Excel there could be multiple sections in one tab and potentially multiple little tables with layers of headers, etc. Let's make auto structured data extraction native for supported file types.

Copilot AI changed the title from "feat(resources): add zip file support and enhanced PDF parsing with render/transcribe options" to "feat(resources): add zip file support, enhanced PDF parsing, intermediary conversions, and structured data extraction" Jun 28, 2026
Copilot AI requested a review from ClickerMonkey June 28, 2026 20:27

@ClickerMonkey ClickerMonkey left a comment

Owner

@copilot

The renderPdfPages and convertToPdf options are not very forward-thinking. Maybe support parser stacking, so you can add a PDF parser (a rendering one, say) that's attempted first before falling back to the other. Let's also add a way to specify per-type config options on the registry.
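
One possible reading of the stacking idea is sketched below: each resource type owns an ordered stack of parsers plus its own options, and the most recently added parser is tried first. Nothing here exists in the package; `ParserStack`, `StackedParser`, and `PdfTypeOptions` are hypothetical names for discussion.

```typescript
interface ParseResult {
  text: string;
  parser: string; // which parser produced the result
}

interface StackedParser<Options> {
  name: string;
  parse(path: string, options: Options): Promise<ParseResult>;
}

// One stack per resource type, holding that type's config options.
class ParserStack<Options> {
  private parsers: StackedParser<Options>[] = [];
  private options: Options;

  constructor(options: Options) {
    this.options = options;
  }

  // Later registrations are attempted before earlier ones.
  push(parser: StackedParser<Options>): this {
    this.parsers.unshift(parser);
    return this;
  }

  // Tries each parser in order and returns the first success.
  async parse(path: string): Promise<ParseResult> {
    const errors: string[] = [];
    for (const parser of this.parsers) {
      try {
        return await parser.parse(path, this.options);
      } catch (err) {
        errors.push(`${parser.name}: ${String(err)}`);
      }
    }
    throw new Error(`all parsers failed for ${path}: ${errors.join("; ")}`);
  }
}

// Hypothetical per-type options for PDFs.
interface PdfTypeOptions {
  renderDpi: number;
}

const pdfStack = new ParserStack<PdfTypeOptions>({ renderDpi: 150 })
  .push({ name: "text", parse: async () => ({ text: "plain text", parser: "text" }) })
  .push({
    name: "render",
    parse: async () => {
      throw new Error("renderer not installed");
    },
  });
```

Under this shape, `renderPdfPages` and `convertToPdf` would become ordinary parsers pushed onto the relevant stack rather than special-purpose callbacks on `ParseOptions`.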
