feat(semantic-search): add cloudflare ai search backend#68

Merged
aryasaatvik merged 4 commits into dev from feat/semantic-search-ai-search-backend on Jun 26, 2026

@aryasaatvik (Owner) commented Jun 26, 2026

Summary

Add the Cloudflare AI Search backend for the semantic search backend contract.

Changes

  • Add the AI Search backend and plugin-storage tracking for uploaded items.
  • Use AI Search as the Cloudflare host semantic backend when the binding is present.
  • Truncate uploaded documents by bytes before sending them to AI Search.
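
A minimal sketch of what byte-based truncation can look like, assuming the goal is to cap the UTF-8 size of a document without cutting a multi-byte character in half. The function name `truncateUtf8` and the `MAX_BYTES` constant are hypothetical and are not taken from this PR.

```typescript
// Hypothetical limit for illustration; the real constant in the PR may differ.
const MAX_BYTES = Math.floor(3.5 * 1024 * 1024);

// Truncate `text` so its UTF-8 encoding is at most `maxBytes` long,
// never splitting a multi-byte character.
function truncateUtf8(text: string, maxBytes: number = MAX_BYTES): string {
  const bytes = new TextEncoder().encode(text); // encode once
  if (bytes.length <= maxBytes) return text;

  let end = maxBytes;
  // `bytes[end]` is the first byte that would be dropped. If it is a
  // continuation byte (10xxxxxx), the cut is inside a character, so step
  // back until the cut sits on a character boundary.
  while (end > 0 && ((bytes[end] ?? 0) & 0xc0) === 0x80) end--;

  return new TextDecoder().decode(bytes.subarray(0, end));
}
```

Measuring bytes rather than `text.length` matters because a JavaScript string's length counts UTF-16 code units, which can be much smaller than the encoded size for non-ASCII content.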

Tests

  • bun run --cwd packages/plugins/semantic-search typecheck
  • bun run --cwd packages/plugins/semantic-search test src/sdk/plugin.test.ts src/sdk/ai-search.test.ts
  • bun run --cwd apps/host-cloudflare typecheck
  • oxfmt --check apps/host-cloudflare/src/app.ts apps/host-cloudflare/src/config.ts apps/host-cloudflare/src/execution.ts apps/host-cloudflare/src/plugins.ts apps/host-cloudflare/wrangler.jsonc packages/plugins/semantic-search/src/api/group.ts packages/plugins/semantic-search/src/sdk/ai-search.ts packages/plugins/semantic-search/src/sdk/ai-search.test.ts packages/plugins/semantic-search/src/sdk/collections.ts packages/plugins/semantic-search/src/sdk/documents.ts packages/plugins/semantic-search/src/sdk/index.ts packages/plugins/semantic-search/src/sdk/plugin.ts packages/plugins/semantic-search/src/sdk/tool-search-backend.ts
  • oxlint -c .oxlintrc.jsonc apps/host-cloudflare/src/app.ts apps/host-cloudflare/src/config.ts apps/host-cloudflare/src/execution.ts apps/host-cloudflare/src/plugins.ts packages/plugins/semantic-search/src/api/group.ts packages/plugins/semantic-search/src/sdk/ai-search.ts packages/plugins/semantic-search/src/sdk/ai-search.test.ts packages/plugins/semantic-search/src/sdk/collections.ts packages/plugins/semantic-search/src/sdk/documents.ts packages/plugins/semantic-search/src/sdk/index.ts packages/plugins/semantic-search/src/sdk/plugin.ts packages/plugins/semantic-search/src/sdk/tool-search-backend.ts --deny-warnings

Stack

  1. refactor(semantic-search): introduce backend contract #67
  2. feat(semantic-search): add cloudflare ai search backend #68 👈 current

@greptile-apps (Bot) commented Jun 26, 2026

Greptile Summary

This PR replaces the Vectorize + Gemini embedding backend for semantic tool search with Cloudflare AI Search, which handles chunking, embedding, and retrieval natively. The switch eliminates the Gemini API key dependency and simplifies the plugin wiring to a single AI_SEARCH binding.

  • New ai-search.ts backend: implements reindexAiSearch (upload tool documents, fingerprint-diff skip, stale cleanup), statusAiSearch (live status from AI Search API + local storage), and makeAiSearchToolDiscoveryProvider (hybrid-retrieval search with chunk deduplication and best-score-per-path selection).
  • documents.ts additions: collectToolSearchDocument builds a markdown document per tool (with schema terms, UTF-8-safe byte truncation at 3.5 MB), and toolItemKey produces the fingerprint used to skip unchanged tools on reindex.
  • Host wiring: plugins.ts, config.ts, execution.ts, and wrangler.jsonc updated to forward the AI_SEARCH binding; the backend is opt-in and inert when the binding is absent.
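
The fingerprint-diff skip and stale cleanup described above can be sketched as a pure planning step. This is an illustration under assumptions, not the PR's code: the `Candidate`, `StoredItem`, and `planReindex` names are hypothetical, and the stored rows are assumed to carry a key, an item id, and the fingerprint from the last upload.

```typescript
// Hypothetical shapes for illustration only.
interface Candidate { key: string; fingerprint: string }
interface StoredItem { key: string; itemId: string; fingerprint: string }

interface ReindexPlan {
  toUpload: Candidate[]; // new tools or tools whose fingerprint changed
  skipped: string[];     // keys whose fingerprint is unchanged
  stale: StoredItem[];   // stored rows with no matching tool anymore
}

function planReindex(candidates: Candidate[], stored: StoredItem[]): ReindexPlan {
  const byKey = new Map<string, StoredItem>();
  for (const row of stored) byKey.set(row.key, row);

  const seen = new Set<string>();
  const plan: ReindexPlan = { toUpload: [], skipped: [], stale: [] };

  for (const candidate of candidates) {
    seen.add(candidate.key);
    const previous = byKey.get(candidate.key);
    if (previous !== undefined && previous.fingerprint === candidate.fingerprint) {
      plan.skipped.push(candidate.key);
    } else {
      plan.toUpload.push(candidate);
    }
  }

  // Anything stored but not seen in this run is stale.
  for (const row of stored) {
    if (!seen.has(row.key)) plan.stale.push(row);
  }
  return plan;
}
```

Keeping the planning step free of I/O makes the three outcomes (upload, skip, remove) easy to test without a live AI Search binding.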

Confidence Score: 4/5

Safe to merge after fixing the self-deletion in reindexAiSearch for re-indexed tools.

Cloudflare AI Search's upload() is an upsert-by-name: uploading an item with the same filename updates the existing record and returns the same id. When a tool's fingerprint changes and its document is re-uploaded, uploaded.id equals previous.itemId, so the immediately following deleteItemBestEffort(aiSearch, previous.itemId) deletes the document that was just updated. The tool then disappears from the search index until the next full reindex. No test covers the path that re-indexes a tool whose fingerprint changed, so the regression is not caught automatically.
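
One possible guard, sketched under the assumption that upload() really does return the same id for the same name as described above. The `ItemsClient` interface, `shouldDeletePrevious`, and `uploadAndCleanUp` are hypothetical names for illustration; the actual binding surface and the fix applied in the PR may differ.

```typescript
// Hypothetical minimal client shape; the real AI Search binding may differ.
interface ItemsClient {
  upload(name: string, content: string): Promise<{ id: string }>;
  delete(id: string): Promise<void>;
}

// Delete the previous item only when the upload produced a different id.
// If the ids match, the upload was an in-place update and deleting would
// remove the document that was just written.
function shouldDeletePrevious(
  previousItemId: string | undefined,
  uploadedId: string,
): boolean {
  return previousItemId !== undefined && previousItemId !== uploadedId;
}

async function uploadAndCleanUp(
  items: ItemsClient,
  name: string,
  content: string,
  previousItemId?: string,
): Promise<string> {
  const uploaded = await items.upload(name, content);
  if (previousItemId !== undefined && shouldDeletePrevious(previousItemId, uploaded.id)) {
    await items.delete(previousItemId);
  }
  return uploaded.id;
}
```

A regression test for this path would re-index a tool with a changed fingerprint against a fake client that upserts by name, then assert the document still exists.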

Files needing attention: packages/plugins/semantic-search/src/sdk/ai-search.ts (reindexAiSearch stale-delete block) and packages/plugins/semantic-search/src/sdk/ai-search.test.ts (missing coverage for the fingerprint-changed update path).

Important Files Changed

| Filename | Overview |
|---|---|
| packages/plugins/semantic-search/src/sdk/ai-search.ts | New AI Search backend implementing reindex, search, and status. Contains a bug in reindexAiSearch where re-indexing a changed tool immediately deletes the freshly uploaded item, because AI Search upload is upsert-by-name (same name returns same ID) and the subsequent deleteItemBestEffort targets that same ID. |
| packages/plugins/semantic-search/src/sdk/ai-search.test.ts | New test file covering schema-fetch fallback, stale-row cleanup, chunk deduplication, and namespace filtering. Missing a test for the critical "re-index a tool whose fingerprint changed" path, which is where the upsert self-deletion bug manifests. |
| packages/plugins/semantic-search/src/sdk/documents.ts | Adds truncateToAiSearchLimit (encode-once, subarray at UTF-8 boundary), toolItemKey fingerprint helper, ToolSearchDocument interface, and collectToolSearchDocument. Logic is clean and the byte-truncation is correct. |
| packages/plugins/semantic-search/src/sdk/collections.ts | Adds AiSearchItemRow schema and aiSearchItems plugin-storage collection with appropriate indexes. Schema is well-defined and consistent with how rows are written in ai-search.ts. |
| apps/host-cloudflare/src/plugins.ts | Replaces Vectorize + Gemini backend wiring with makeAiSearchToolSearchBackend. The opt-in guard (backend only activated when AI_SEARCH binding is present) is preserved correctly. |
| apps/host-cloudflare/src/config.ts | Replaces VECTORIZE and GEMINI_API_KEY bindings with AI_SEARCH. CloudflareEnv and CloudflareConfig updated consistently; loadConfig correctly passes the binding through unchanged. |
| apps/host-cloudflare/src/execution.ts | makeCloudflarePluginsProvider now forwards aiSearch and organizationId to makeCloudflarePlugins, making in-execution plugin construction consistent with the app-level path. |
| packages/plugins/semantic-search/src/sdk/tool-search-backend.ts | Extends SemanticSearchStatus with optional AI-Search-specific status counters (queued, running, completed, error, skipped, outdated, lastActivity). Additions are additive and backward-compatible. |
| packages/plugins/semantic-search/src/api/group.ts | StatusResponse schema extended with the same optional fields added to SemanticSearchStatus. Schema and TypeScript interface are in sync. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Host as host-cloudflare
    participant Plugin as semantic-search plugin
    participant Storage as Plugin Storage (aiSearchItems)
    participant AISearch as Cloudflare AI Search

    Note over Host,AISearch: Reindex flow
    Host->>Plugin: reindex(executor)
    Plugin->>Plugin: listToolManifests()
    Plugin->>Storage: items.list() — fetch existing rows
    loop For each manifest
        Plugin->>Plugin: toolItemKey(manifest) — compute fingerprint
        alt Fingerprint unchanged
            Plugin-->>Plugin: skip
        else Fingerprint changed or new
            Plugin->>Plugin: collectToolSearchDocument() — build markdown doc
            Plugin->>AISearch: items.upload(name, content, metadata)
            Note right of AISearch: Upsert-by-name: same name = same ID
            AISearch-->>Plugin: "{ id, key }"
            Plugin->>Storage: "items.put(key, { itemId, fingerprint, status:queued })"
            opt "previous exists (BUG: previous.itemId === uploaded.id)"
                Plugin->>AISearch: items.delete(previous.itemId) deletes just-uploaded doc
            end
        end
    end
    loop Stale entries
        Plugin->>Storage: items.remove(key)
        Plugin->>AISearch: items.delete(itemId) best-effort
    end
    Plugin-->>Host: "{ indexed, skipped, removed }"

    Note over Host,AISearch: Search flow
    Host->>Plugin: search(query, namespace, limit)
    Plugin->>AISearch: "search({ messages, ai_search_options })"
    AISearch-->>Plugin: "{ chunks[] }"
    Plugin->>Storage: items.list() — build key to row map
    Plugin->>Plugin: deduplicate chunks by path (best score)
    Plugin->>Plugin: filter by namespace, paginate
    Plugin-->>Host: "{ items[], total }"
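
The "deduplicate chunks by path (best score)" step in the search flow can be sketched as follows. This is an illustration only: the `Chunk` shape and the `bestChunkPerPath` name are assumptions, and the real response fields from AI Search may be named differently.

```typescript
// Hypothetical chunk shape for illustration.
interface Chunk { path: string; score: number; text: string }

// Keep only the highest-scoring chunk for each path, then order the
// survivors from best to worst score.
function bestChunkPerPath(chunks: Chunk[]): Chunk[] {
  const best = new Map<string, Chunk>();
  for (const chunk of chunks) {
    const current = best.get(chunk.path);
    if (current === undefined || chunk.score > current.score) {
      best.set(chunk.path, chunk);
    }
  }
  return [...best.values()].sort((a, b) => b.score - a.score);
}
```

Because one tool document can be split into several chunks, collapsing to one result per path keeps a single tool from occupying multiple slots in the result list.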

Reviews (7): Last reviewed commit: "fix(semantic-search): tolerate ai search..."

Comment thread packages/plugins/semantic-search/src/sdk/documents.ts
Comment thread packages/plugins/semantic-search/src/sdk/ai-search.ts
@aryasaatvik force-pushed the feat/semantic-search-backend-contract branch from 6063ba3 to 02ad2ce on June 26, 2026 15:40
@aryasaatvik force-pushed the feat/semantic-search-ai-search-backend branch from 874dfe2 to 5b79372 on June 26, 2026 15:43
@aryasaatvik changed the base branch from feat/semantic-search-backend-contract to dev on June 26, 2026 16:03
aryasaatvik added a commit that referenced this pull request Jun 26, 2026
## Summary

Introduce the semantic search backend contract and move the current
vector implementation behind it.

## Changes

- Add `ToolSearchBackend` as the plugin-level backend boundary.
- Keep the existing vector/Gemini implementation available through
`ToolSearchBackend.vector`.
- Wire the Cloudflare host through the backend option.

## Tests

- `bun run --cwd packages/plugins/semantic-search typecheck`
- `bun run --cwd packages/plugins/semantic-search test
src/sdk/plugin.test.ts`
- `bun run --cwd apps/host-cloudflare typecheck`
- `oxfmt --check apps/host-cloudflare/src/plugins.ts
packages/plugins/semantic-search/src/sdk/index.ts
packages/plugins/semantic-search/src/sdk/plugin.ts
packages/plugins/semantic-search/src/sdk/plugin.test.ts
packages/plugins/semantic-search/src/sdk/tool-search-backend.ts`
- `oxlint -c .oxlintrc.jsonc apps/host-cloudflare/src/plugins.ts
packages/plugins/semantic-search/src/sdk/index.ts
packages/plugins/semantic-search/src/sdk/plugin.ts
packages/plugins/semantic-search/src/sdk/plugin.test.ts
packages/plugins/semantic-search/src/sdk/tool-search-backend.ts
--deny-warnings`

<!-- stack:links:start -->
### [Stack](https://github.com/aryasaatvik/stack)

1. **#67** 👈 current
2. #68
<!-- stack:links:end -->
@aryasaatvik force-pushed the feat/semantic-search-ai-search-backend branch from 2050a5a to 0fe5412 on June 26, 2026 16:03
@aryasaatvik merged commit 3b44948 into dev on Jun 26, 2026
8 checks passed
@aryasaatvik deleted the feat/semantic-search-ai-search-backend branch on June 26, 2026 16:05