fix: index post title so title-only questions match in search #177

Open
HardeepAsrani wants to merge 1 commit into development from fix/index-title-in-embeddings

HardeepAsrani (Member) commented Jun 30, 2026

Summary

Knowledge base entries are embedded from post_content alone — the title was never included in the vector. For Q&A-style custom data, where the question lives in the title and the body holds only the answer, this meant the vector had nothing resembling the visitor's question, so semantic search silently failed to match it.
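
To make the failure mode concrete, here is a minimal sketch that uses a toy bag-of-words similarity as a stand-in for real embeddings. It is not how Hyve or the OpenAI embedding models score text; the helper names `bag_of_words()` and `cosine_similarity()` are hypothetical. It only shows that body-only text gives the visitor's question nothing to match against.

```php
<?php
// Toy stand-in for an embedding: a bag-of-words term-frequency vector.
// Real embeddings capture meaning, not just shared words; this only
// illustrates why text without the question has little to match on.
function bag_of_words( string $text ): array {
    $words = preg_split( '/[^a-z0-9]+/', strtolower( $text ), -1, PREG_SPLIT_NO_EMPTY );
    return array_count_values( $words );
}

// Cosine similarity between two term-frequency vectors.
function cosine_similarity( array $a, array $b ): float {
    $dot = 0.0;
    foreach ( $a as $word => $count ) {
        $dot += $count * ( $b[ $word ] ?? 0 );
    }
    $norm_a = sqrt( array_sum( array_map( fn( $c ) => $c * $c, $a ) ) );
    $norm_b = sqrt( array_sum( array_map( fn( $c ) => $c * $c, $b ) ) );
    return ( $norm_a > 0 && $norm_b > 0 ) ? $dot / ( $norm_a * $norm_b ) : 0.0;
}

$query      = bag_of_words( 'How much does it cost?' );
$body_only  = bag_of_words( '$100.' );
$with_title = bag_of_words( 'How much does it cost? $100.' );

$score_body_only  = cosine_similarity( $query, $body_only );  // no shared words
$score_with_title = cosine_similarity( $query, $with_title ); // shares every query word
```

In this toy model the body-only score is zero, while the title-prefixed text scores close to one.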

This was a long-standing wiring slip: Tokenizer::tokenize() already prepends the title when computing each chunk's token count (encode( $title . ' ' . $chunk )), but that title-prefixed string only ever fed the token counter — the value stored in post_content (and later embedded) was always the bare chunk. So the title counted against the token budget but never made it into the embedding.
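
The mismatch can be sketched as follows. This is an illustration of the behaviour described above, not the plugin's actual code: `count_tokens()` is a hypothetical word counter standing in for the real encoder, and `tokenize_chunk_before_fix()` is an assumed name.

```php
<?php
// Hypothetical stand-in for the plugin's token encoder: counts
// whitespace-separated words instead of real model tokens.
function count_tokens( string $text ): int {
    return count( preg_split( '/\s+/', trim( $text ), -1, PREG_SPLIT_NO_EMPTY ) );
}

// Mirrors the pre-fix behaviour described above: the title is included
// when counting tokens, but only the bare chunk is kept for embedding.
function tokenize_chunk_before_fix( string $title, string $chunk ): array {
    return array(
        'post_content' => $chunk,                                // what gets embedded
        'token_count'  => count_tokens( $title . ' ' . $chunk ), // what gets counted
    );
}

$entry = tokenize_chunk_before_fix( 'How much does it cost?', '$100.' );
// The count covers title and chunk, yet the stored text is the chunk alone.
```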

Fix

Prepend the title to the text sent for embedding in process_post(), so the embedded text includes the title the tokenizer already counts against the token budget:

if ( ! empty( $post->post_title ) ) {
    $stripped = $post->post_title . ' ' . $stripped;
}

Embed-only by design: stored post_content is left untouched, so the existing answer-time context (post_title - post_content) stays correct and the title isn't duplicated there.
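
A minimal sketch of the embed-only split, assuming a hypothetical helper named `build_embedding_input()`. Only the title-prefixing rule and the `post_title - post_content` context format come from this PR; the function name, signature and variable names are illustrative.

```php
<?php
// Hypothetical helper: builds the text sent to the embedding API.
// Only the title-prefixing rule comes from the PR; the function name
// and signature are assumptions.
function build_embedding_input( string $title, string $stripped ): string {
    if ( ! empty( $title ) ) {
        return $title . ' ' . $stripped;
    }
    return $stripped;
}

$post_title   = 'How much does it cost?';
$post_content = '$100.'; // stored value, left untouched

// Title-prefixed text is used for the vector only.
$embedding_input = build_embedding_input( $post_title, $post_content );

// Answer-time context is still built from the stored fields,
// so the title appears exactly once.
$answer_context = $post_title . ' - ' . $post_content;
```

Had the title been written into the stored content instead, the answer-time context would carry it twice.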

QA

Reproduce the original bug on development, then confirm it's fixed on this branch.

Setup

  1. Have a working OpenAI API key configured in Hyve settings.
  2. Go to Hyve → Custom Data and add an entry where the question is the title and the answer is the content, with the question not repeated in the body:
    • Title: How much does it cost?
    • Content: $100.
  3. Save and let it finish indexing (status becomes included/processed).

Verify the fix

  4. Open the chat widget on the frontend and ask the question verbatim: "How much does it cost?"
  5. Expected (this branch): the bot answers with $100. — the title-only question is matched.
  6. Before (on development): the same entry is not matched and the bot falls back to "no answer" / the default message.

Regression check (make sure normal entries still work)

  7. Add a regular post/page to the knowledge base where the answer lives in the body, and confirm it's still matched and answered as before.

Re-indexing note

  8. Entries indexed before this change keep their old (title-less) vectors — re-save / re-process such an entry and confirm it then matches. Newly added or edited entries are fixed automatically.

Closes Codeinwp/hyve#228

The embedding vector was built from post_content alone, so a knowledge
base entry whose question lived in the title — the Q&A custom data
pattern — was never matched semantically. Prepend the title to the text
sent for embedding, matching the token budget Tokenizer::tokenize()
already reserves for it.

Existing entries benefit after a re-index (the embedding is regenerated
on add/update).

Closes Codeinwp/hyve#228
@github-actions


Plugin build for b4e83fb is ready 🛎️!
