fix: index post title so title-only questions match in search #177
Open
HardeepAsrani wants to merge 1 commit into
The embedding vector was built from post_content alone, so a knowledge base entry whose question lived in the title — the Q&A custom data pattern — was never matched semantically. Prepend the title to the text sent for embedding, matching the token budget Tokenizer::tokenize() already reserves for it. Existing entries benefit after a re-index (the embedding is regenerated on add/update). Closes Codeinwp/hyve#228
Soare-Robert-Daniel
approved these changes
Jun 30, 2026
Summary
Knowledge base entries are embedded from `post_content` alone; the title was never included in the vector. For Q&A-style custom data, where the question lives in the title and the body holds only the answer, this meant the vector had nothing resembling the visitor's question, so semantic search silently failed to match it.

This was a long-standing wiring slip: `Tokenizer::tokenize()` already prepends the title when computing each chunk's token count (`encode( $title . ' ' . $chunk )`), but that title-prefixed string only ever fed the token counter. The value stored in `post_content` (and later embedded) was always the bare chunk, so the title counted against the token budget but never made it into the embedding.

Fix
Prepend the title to the text sent for embedding in `process_post()`, so the vector matches the budget the tokenizer already reserves.

Embed-only by design: stored `post_content` is left untouched, so the existing answer-time context (`post_title - post_content`) stays correct and the title isn't duplicated there.

QA
Reproduce the original bug on `development`, then confirm it's fixed on this branch.

Setup
Knowledge base entry used for the test: title `How much does it cost?`, content `$100.`

Verify the fix
4. Open the chat widget on the frontend and ask the question verbatim: "How much does it cost?"
5. Expected (this branch): the bot answers with `$100.` and the title-only question is matched.
6. Before (on `development`): the same entry is not matched and the bot falls back to "no answer" / the default message.

Regression check (make sure normal entries still work)
7. Add a regular post/page to the knowledge base where the answer lives in the body, and confirm it's still matched and answered as before.
Re-indexing note
8. Entries indexed before this change keep their old (title-less) vectors — re-save / re-process such an entry and confirm it then matches. Newly added or edited entries are fixed automatically.
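Illustrative sketch

For reviewers who want to see the mismatch in isolation, here is a minimal Python model of the behaviour described in the Summary and of the embed-only fix. It is not the plugin's PHP code. `count_tokens`, `Chunk`, `tokenize`, `embedding_input_before`, and `embedding_input_after` are hypothetical names, the whitespace token counter is a stand-in for the real encoder, and the single-space separator in the fixed embedding input is assumed to match the one the tokenizer uses when counting.

```python
from dataclasses import dataclass


def count_tokens(text: str) -> int:
    """Stand-in token counter (whitespace split), NOT the plugin's real encoder."""
    return len(text.split())


@dataclass
class Chunk:
    post_title: str
    post_content: str  # stored text: always the bare chunk
    token_count: int   # budget computed from title + chunk


def tokenize(title: str, chunk: str) -> Chunk:
    # Models the described behaviour: the title counts toward the token
    # budget, but only the bare chunk is stored as post_content.
    return Chunk(title, chunk, count_tokens(f"{title} {chunk}"))


def embedding_input_before(chunk: Chunk) -> str:
    # Old behaviour: the vector is built from post_content alone.
    return chunk.post_content


def embedding_input_after(chunk: Chunk) -> str:
    # Fixed behaviour (assumed separator: one space): the title is
    # prepended only to the text sent for embedding.
    return f"{chunk.post_title} {chunk.post_content}"


entry = tokenize("How much does it cost?", "$100.")
print(embedding_input_before(entry))  # text embedded before the fix
print(embedding_input_after(entry))   # text embedded after the fix
print(entry.post_content)             # stored content stays unchanged
```

In this model the fixed embedding input has exactly the token count that `tokenize` already budgeted, while the stored `post_content` remains the bare answer, which is why the answer-time context does not end up with a duplicated title.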
Closes Codeinwp/hyve#228