feat(resources): add zip file support, enhanced PDF parsing, intermediary conversions, and structured data extraction #179
@aeye/resources package for parsing, slicing, and resolving crawlable resources
import { readFile } from "node:fs/promises";
import path from "node:path";
import { Buffer } from "node:buffer";
import { ResourceRegistry, createParsedResource, createUrlResourceSource } from "./registry.js";
Change all imports to exclude .js like the other packages in aeye
Done — removed all .js extensions from imports to match the other packages (using moduleResolution: "bundler").
  return "text";
}

const textParser: ResourceParser = {
Let's separate html and markdown and code parsers instead of sticking everything in text
Done — split into separate parsers: textParser, markdownParser, htmlParser, codeParser, each in their own file under src/parsers/.
const slices: ResourceSlice[] = [];
const maxChars = context.options.maxChars ?? 2000;
const minChars = context.options.minChars ?? 400;
const declarationPattern = /^(export\s+)?(async\s+)?(function|class|interface|type|enum|const|let|var)\s+([^=(<{]+)/;
This sort of stuff should be configurable instead of fixed
Done — the declaration pattern is now configurable via options.code.declarationPattern in the CodeParserOptions interface. Defaults to the same regex as before.
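A minimal sketch of how a configurable declaration pattern could be consumed. The `CodeParserOptions` shape and the `findDeclarationLines` helper are assumptions for illustration; only the option names `declarationPattern` and `importPattern` and the default regex come from this thread.

```typescript
// Hypothetical option shape; only the two option names appear in the thread.
interface CodeParserOptions {
  declarationPattern?: RegExp;
  importPattern?: RegExp;
}

// Default copied from the reviewed snippet.
const DEFAULT_DECLARATION_PATTERN =
  /^(export\s+)?(async\s+)?(function|class|interface|type|enum|const|let|var)\s+([^=(<{]+)/;

// Returns the zero-based indexes of lines that look like declarations.
function findDeclarationLines(source: string, options: CodeParserOptions = {}): number[] {
  const pattern = options.declarationPattern ?? DEFAULT_DECLARATION_PATTERN;
  const hits: number[] = [];
  source.split("\n").forEach((line, index) => {
    if (pattern.test(line.trim())) {
      hits.push(index);
    }
  });
  return hits;
}

const tsSource = "import x from 'y';\nexport function a() {}\nconst b = 1;";
const pySource = "import os\ndef main():\n    pass\nclass Foo:\n    pass";

// Default pattern for TypeScript-like code, custom pattern for Python.
const tsHits = findDeclarationLines(tsSource);
const pyHits = findDeclarationLines(pySource, { declarationPattern: /^(def|class)\s+\w+/ });
```

Passing the pattern as an option lets callers slice languages the default regex was never written for, without touching the parser itself.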
lines.forEach((line) => {
  const trimmed = line.trim();
  if (/^(import|export\s+.*from\s+|const\s+.*=\s*require\()/.test(trimmed)) {
Done — the import pattern is now configurable via options.code.importPattern in the CodeParserOptions interface.
  return { source, resource };
}

async slice(resource: ParsedResource, options: SliceOptions = {}): Promise<import("./types.js").ResourceSlice[]> {
No inline importing anywhere in the repo
Done — removed the import("./types.js") inline type import and used the proper top-level ResourceSlice import instead.
  throw new Error(`Failed to resolve ${link}: ${response.status} ${response.statusText}`);
}

const input = new Uint8Array(await response.arrayBuffer());
Streaming needs to be the default for how resources are passed around. Assume any resource could cause an OOM error; we should avoid at all costs loading all the data into memory at once.
Done — added streaming support throughout:
- ResourceInput type now supports ReadableStream, factory functions () => AsyncIterable<Uint8Array>, and all existing iterable types
- File resolver uses createReadStream with a lazy factory to avoid eagerly reading entire files into memory
- URL resolver passes response.body as a ReadableStream when available
- collectInput handles all streaming input types, collecting incrementally
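The list above can be sketched roughly as follows. This is a simplified illustration, not the package's code: the `ResourceInput` union here omits ReadableStream for brevity, and the `toChunks` helper and the `maxBytes` cap are assumptions that do not appear in the thread.

```typescript
// Simplified, hypothetical version of the input union (ReadableStream omitted).
type ResourceInput =
  | Uint8Array
  | AsyncIterable<Uint8Array>
  | (() => AsyncIterable<Uint8Array>);

// Normalizes any input into a chunk iterator. Factories are only invoked
// here, so nothing is opened or read until a consumer starts iterating.
async function* toChunks(input: ResourceInput): AsyncGenerator<Uint8Array> {
  if (input instanceof Uint8Array) {
    yield input;
    return;
  }
  const iterable = typeof input === "function" ? input() : input;
  yield* iterable;
}

// Collects chunks incrementally and fails fast once maxBytes is exceeded,
// so an oversized resource is rejected before it is fully buffered.
async function collectInput(input: ResourceInput, maxBytes: number): Promise<Uint8Array> {
  const chunks: Uint8Array[] = [];
  let total = 0;
  for await (const chunk of toChunks(input)) {
    total += chunk.byteLength;
    if (total > maxBytes) {
      throw new Error(`Input exceeds ${maxBytes} bytes`);
    }
    chunks.push(chunk);
  }
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return out;
}
```

Parsers that can work chunk by chunk would iterate `toChunks` directly; `collectInput` is only for parsers that genuinely need the whole buffer.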
}

const input = new Uint8Array(await response.arrayBuffer());
const mimeType = response.headers.get("content-type") ?? undefined;
I mentioned that for an HTML mimeType it should instead use puppeteer if they have it enabled and configured. Maybe an option, renderUrl or something, that's given a URL and returns HTML once the page has settled (see ginny). Then provide a puppeteer implementation.
Done — added renderUrl option to ParseOptions. The HTML parser checks for renderUrl and if the source location is an HTTP URL, it calls the provided function to render the page first (e.g. via puppeteer). The type signature is (url: string, signal?: AbortSignal) => Promise<string> so consumers can plug in any rendering implementation.
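A rough sketch of the renderUrl hand-off described above. The `RenderUrl` signature is the one quoted in the reply; the `ParseOptions` fields shown, the `isHttpUrl` helper, and `loadHtml` are hypothetical names used only for illustration.

```typescript
// Signature quoted in the reply above.
type RenderUrl = (url: string, signal?: AbortSignal) => Promise<string>;

// Hypothetical subset of the options; `signal` is an assumption.
interface ParseOptions {
  renderUrl?: RenderUrl;
  signal?: AbortSignal;
}

// True only for http: and https: locations; file paths and other schemes return false.
function isHttpUrl(location: string): boolean {
  try {
    const { protocol } = new URL(location);
    return protocol === "http:" || protocol === "https:";
  } catch {
    return false;
  }
}

// Prefers the rendered page when a renderer is configured and the source is
// an HTTP URL; otherwise keeps the HTML that was already fetched.
async function loadHtml(
  location: string,
  fetchedHtml: string,
  options: ParseOptions = {},
): Promise<string> {
  if (options.renderUrl && isHttpUrl(location)) {
    return options.renderUrl(location, options.signal);
  }
  return fetchedHtml;
}
```

A puppeteer-backed renderUrl would typically launch a browser, call page.goto with a waitUntil setting such as "networkidle0", return page.content(), and close the browser in a finally block. Keeping that behind a callback means the package itself does not need to depend on puppeteer.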
}
};

const codeSlicer: ResourceSlicer = {
Each slicer, parser, resolver should get its own file.
Also add Excel and doc and PDF as seen in ginny and cletus package code if you didn't add it yet.
Done — each parser, slicer, and resolver now has its own file:
- src/parsers/: text, markdown, html, code, image, pdf, excel, docx
- src/slicers/: text, markdown, code
- src/resolvers/: file, url
Also added PDF (using pdf-parse), Excel (using xlsx), and DOCX (using mammoth) parsers as optional peer dependencies with runtime availability checks.
@copilot continue!
…les, remove .js imports, add streaming, configurable patterns, renderUrl, PDF/Excel/DOCX parsers
…e type casts, add code opts to SliceOptions
@aeye/resources package for parsing, slicing, and resolving crawlable resources
…d PDF options
- Add children/parentLocation fields to ParsedResource for hierarchical resources
- Add PdfParseOptions with renderPages, transcribePages, extractImages, renderDpi
- Add renderPdfPages callback (batch renders all pages to disk, avoids memory overload)
- PDF parser: tries render+transcribe first, falls back to text extraction on error
- Add zip parser using jszip (reads from filesystem, processes entries one at a time)
- Add zip bomb protection (max files, max size, max per-file size)
- Register zip type in extension/mime maps and default registry

- Use createPartId for consistent ID formatting in zip parser
- Add comprehensive image extension-to-MIME mapping
- Rename variable to canTranscribePages for clarity
- Pass page-specific source to transcribeImage callback
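The zip bomb protection mentioned in the commit notes (max files, max size, max per-file size) could look roughly like the check below. The `ZipLimits` and `ZipEntryInfo` shapes and the `checkZipLimits` name are assumptions for illustration; the actual option names in the package are not shown in this thread.

```typescript
// Hypothetical limit names mirroring the three limits in the commit notes.
interface ZipLimits {
  maxFiles: number;
  maxTotalBytes: number;
  maxFileBytes: number;
}

// Hypothetical entry summary; a real zip library exposes similar metadata.
interface ZipEntryInfo {
  name: string;
  uncompressedSize: number;
}

// Throws as soon as any limit is exceeded, before any entry is extracted.
function checkZipLimits(entries: ZipEntryInfo[], limits: ZipLimits): void {
  if (entries.length > limits.maxFiles) {
    throw new Error(`Archive has ${entries.length} entries (limit ${limits.maxFiles})`);
  }
  let total = 0;
  for (const entry of entries) {
    if (entry.uncompressedSize > limits.maxFileBytes) {
      throw new Error(`${entry.name} exceeds the per-file limit of ${limits.maxFileBytes} bytes`);
    }
    total += entry.uncompressedSize;
    if (total > limits.maxTotalBytes) {
      throw new Error(`Archive exceeds the total limit of ${limits.maxTotalBytes} bytes`);
    }
  }
}
```

Sizes declared in archive headers can be wrong or forged, so a check like this is only a first gate; counting bytes while each entry is actually decompressed is the safer complement.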
I saw inline importing again; don't do any of that. Also, rendering one page at a time for PDF could be slow, and I saw all the PDF data being loaded at once, which could crash the server. Please stop doing that. Use the file system or something to avoid it. Maybe you addressed some of this already; make sure you did.
Also support intermediary conversions. For example, I could use a tool to convert doc files into PDF, since PDF parsing is richer and can render pages. There are tools that can take multiple formats and pdfify them.
Also, for Excel or any tabular data, support parsing out structured data. For CSV that's obvious; for Excel there could be multiple sections in one tab, potentially with multiple small tables, layers of headers, etc. Let's make automatic structured data extraction native for supported file types.
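For the CSV case, structured extraction can be sketched as below. The field names on `ExtractedTable` follow the PR description (headers, rows, name, sheetName, sheetIndex), but their types, and the `splitCsvLine` and `extractCsvTable` helpers, are assumptions. The sketch handles quoted fields and doubled quotes, not quoted newlines.

```typescript
// Field names from the PR description; the types are assumed.
interface ExtractedTable {
  headers: string[];
  rows: string[][];
  name?: string;
  sheetName?: string;
  sheetIndex?: number;
}

// Splits one line into cells, honoring double-quoted fields and "" escapes.
function splitCsvLine(line: string, delimiter = ","): string[] {
  const cells: string[] = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"'; // escaped quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === delimiter) {
      cells.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  cells.push(current);
  return cells;
}

// Treats the first non-empty line as the header row and the rest as data.
function extractCsvTable(text: string, name: string, delimiter = ","): ExtractedTable {
  const lines = text.split(/\r?\n/).filter((line) => line.trim() !== "");
  const [headerLine, ...rest] = lines;
  return {
    headers: headerLine ? splitCsvLine(headerLine, delimiter) : [],
    rows: rest.map((line) => splitCsvLine(line, delimiter)),
    name,
  };
}
```

TSV falls out of the same code by passing "\t" as the delimiter. The multi-table Excel case raised above needs region detection on top of this and is not covered by the sketch.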
ClickerMonkey
left a comment
The renderPdfPages and convertToPdf options are not very forward-thinking. Maybe support parser stacking, so you can add a PDF parser such as a rendering one that is attempted first before falling back to the other. Let's also add a way to specify per-type config options on the registry.
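One way to read the parser stacking suggestion is the sketch below: each resource type owns an ordered stack of parsers plus its own config, and the most recently added parser is tried first. All names here (`StackedParser`, `ParserStack`, `PdfConfig`, the two sample parsers) are hypothetical and not part of the package.

```typescript
interface ParseResult {
  text: string;
  parsedBy: string;
}

// A parser that receives the per-type config held by its stack.
interface StackedParser<Config> {
  name: string;
  parse(path: string, config: Config): Promise<ParseResult>;
}

class ParserStack<Config> {
  private readonly parsers: StackedParser<Config>[] = [];
  private readonly config: Config;

  constructor(config: Config) {
    this.config = config;
  }

  // Later registrations go to the front, so they are attempted first.
  use(parser: StackedParser<Config>): this {
    this.parsers.unshift(parser);
    return this;
  }

  // Tries each parser in order and falls back to the next one on error.
  async parse(path: string): Promise<ParseResult> {
    const failures: string[] = [];
    for (const parser of this.parsers) {
      try {
        return await parser.parse(path, this.config);
      } catch (error) {
        failures.push(`${parser.name}: ${String(error)}`);
      }
    }
    throw new Error(`No parser succeeded for ${path} (${failures.join("; ")})`);
  }
}

interface PdfConfig {
  renderDpi: number;
}

const textPdfParser: StackedParser<PdfConfig> = {
  name: "pdf-text",
  parse: async (path) => ({ text: `text of ${path}`, parsedBy: "pdf-text" }),
};

// Stand-in for a rendering parser whose tooling is not installed.
const renderingPdfParser: StackedParser<PdfConfig> = {
  name: "pdf-render",
  parse: async (_path, config) => {
    throw new Error(`renderer unavailable at ${config.renderDpi} dpi`);
  },
};

// Per-type registry: the "pdf" type gets its own config and parser order.
const stacks = new Map<string, ParserStack<PdfConfig>>([
  ["pdf", new ParserStack<PdfConfig>({ renderDpi: 150 }).use(textPdfParser).use(renderingPdfParser)],
]);
```

With this shape, a document-to-PDF converter or a page renderer becomes just another parser pushed onto a stack, rather than a dedicated option on ParseOptions.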
Adds hierarchical resource support (parent/children) to ParsedResource, a new zip file parser, enhanced PDF parsing that mirrors the cletus package's render-then-transcribe approach, intermediary document-to-PDF conversion, and automatic structured data extraction for tabular file types.

Changes

Parent/Children on ParsedResource
- children?: ParsedResource[] (zip entries, rendered PDF pages)
- parentLocation?: ResourceLocation (back-reference to parent)

Enhanced PDF parser
- PdfParseOptions: renderPages, renderDpi, transcribePages, extractImages
- renderPdfPages callback batch-renders all pages to a temp directory (no per-page round-trips, no full PDF in JS heap)
- Falls back to pdf-parse text extraction if rendering errors or isn't configured

Intermediary document-to-PDF conversion (convertToPdf)
- convertToPdf option on ParseOptions: receives a source file path, returns the path to a converted PDF
- … convertToPdf and delegate to the PDF parser when available, enabling richer page rendering and transcription for any document type

Structured data extraction for tabular types
- ExtractedTable interface: headers, rows, name, sheetName, sheetIndex
- … resource.metadata.structuredTables
- … csv and tsv types natively

Zip parser (parsers/zip.ts)
- … ParsedResource; parent gets a text listing part

Type mappings
- .zip, .tar.gz, .tgz, .jar extensions → "zip" type
- application/zip, application/x-zip-compressed, application/x-tar, application/gzip MIME types

Code quality
- … await import(...) calls removed; all imports are top-level

Usage