feat: add Pinecone skill for scientific RAG persistence layer #173

Open
immuhammadfurqan wants to merge 3 commits into K-Dense-AI:main from immuhammadfurqan:feature/pinecone-skill

@immuhammadfurqan

Summary

Adds a Pinecone skill positioned as the retrieval persistence layer, complementing the existing database-lookup, paper-lookup, and biopython skills. Those skills handle direct API access to public databases; Pinecone adds persistent semantic retrieval over embedded scientific data. This matters when repeated API queries are too slow, when working with proprietary research data, or when building multimodal RAG pipelines.

What's Included

  • SKILL.md (403 lines) covering index management, batch upserting, namespaces, metadata filtering, hybrid search, and multimodal retrieval
  • references/index_types.md — serverless vs pod tradeoffs, dimension selection table for 12 scientific embedding models
  • references/scientific_embedding_models.md — domain to model mapping with usage patterns
  • references/hybrid_search.md — dense + sparse BM25 setup with alpha-blending tuning guide
  • scripts/index_pubmed.py — working PubMed abstract ingestion pipeline
  • scripts/multimodal_radiology.py — working multimodal (text + image) ingestion pipeline
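
To illustrate the alpha-blending mentioned for hybrid search, here is a minimal sketch. It assumes the common convex-combination approach, where the dense query vector is scaled by `alpha` and the sparse values by `1 - alpha` before querying. The contents of references/hybrid_search.md are not shown in this description, so the function name `hybrid_scale` and the sparse-vector layout (`indices` / `values`) are assumptions for illustration, not the skill's actual code.

```python
from typing import Dict, List, Tuple

SparseVector = Dict[str, list]  # assumed layout: {"indices": [...], "values": [...]}


def hybrid_scale(
    dense: List[float], sparse: SparseVector, alpha: float
) -> Tuple[List[float], SparseVector]:
    """Weight a dense/sparse query pair by a convex combination.

    alpha = 1.0 keeps only the dense (semantic) signal,
    alpha = 0.0 keeps only the sparse (keyword) signal.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": list(sparse["indices"]),
        "values": [v * (1.0 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse
```

Under this scheme, a higher `alpha` favors semantic similarity (for example, paraphrased clinical questions), while a lower `alpha` favors exact term matches such as gene symbols or drug names.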

Scientific Workflows Covered

  • PubMed literature RAG with voyage-large-2
  • Multimodal radiology retrieval with voyage-multimodal-3
  • Clinical note similarity with Bio_ClinicalBERT
  • Molecular similarity search (Morgan FP via Pinecone for fast recall, RDKit Tanimoto for precise rerank)
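
The two-stage molecular search above (approximate recall, then exact rerank) can be sketched roughly as follows. This is a simplified illustration, not the skill's implementation: fingerprints are modeled as plain sets of on-bit indices so the example has no dependencies, the candidate IDs are hypothetical, and the first-stage Pinecone query is assumed to have already returned the candidate set. In a real pipeline, RDKit would generate the Morgan fingerprints and compute the Tanimoto similarity.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rerank_by_tanimoto(
    query_bits: Set[int], candidates: Dict[str, Set[int]], top_k: int = 5
) -> List[Tuple[str, float]]:
    """Rerank first-stage candidates by exact Tanimoto similarity.

    `candidates` maps a (hypothetical) compound ID to its fingerprint bits,
    e.g. as recovered for the IDs returned by the vector-index query.
    """
    scored = [(cid, tanimoto(query_bits, bits)) for cid, bits in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

The split keeps the expensive exact comparison confined to a small candidate set, while the vector index handles the broad first pass.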

Integration Points

Explicitly cross-references and integrates with: paper-lookup, database-lookup, biopython, pysam, scanpy, scientific-writing, pyhealth.

Scanner Results

  • 0 CRITICAL, 0 HIGH findings
  • 3 MEDIUM findings — all network-related (expected for any network-dependent skill; Pinecone is a cloud API by definition)

Validation

  • Frontmatter validates as YAML matching the anndata/SKILL.md schema
  • Both scripts pass python -m py_compile
  • Cisco AI skill scanner: OK SAFE
  • Tested with pinecone>=6.0.0
  • Production-validated patterns from a multimodal medical chatbot (voyage-multimodal-3 + Pinecone)

Checklist

  • Adheres to Agent Skills Specification
  • SKILL.md with overview, when-to-use, capabilities, workflows, troubleshooting
  • All cross-referenced files exist
  • No hardcoded API keys (uses os.environ)
  • References to official documentation included
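
As an illustration of the "no hardcoded API keys" item, a minimal sketch of reading the key from the environment might look like the following. The helper name `load_api_key` is hypothetical, and `PINECONE_API_KEY` is assumed as the variable name; the scripts in this PR may structure this differently.

```python
import os


def load_api_key(var_name: str = "PINECONE_API_KEY") -> str:
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it before running the script"
        )
    return key
```

Failing early with a clear message avoids a less obvious authentication error later in the ingestion pipeline.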

Adds a Pinecone skill positioned as the retrieval persistence layer
complementing existing database-lookup, paper-lookup, and biopython skills.

Includes:
- Serverless and pod-based index management
- Namespace strategies for organism/study isolation
- MongoDB-style metadata filtering
- Hybrid dense + sparse BM25 search
- Multimodal retrieval with voyage-multimodal-3
- Working scripts for PubMed and radiology indexing
- Three reference docs (index types, embedding models, hybrid search)

Tested with pinecone>=6.0.0.
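
To make the MongoDB-style metadata filtering concrete, the sketch below shows a filter dictionary and a small local matcher that mimics how such a filter is evaluated. The matcher is only an illustration of the semantics for a few operators (`$eq`, `$ne`, `$in`, `$gte`, `$lte`); in practice the filter would be passed to the index query and evaluated server-side. The field names (`organism`, `year`, `study_type`) are hypothetical.

```python
from typing import Any, Dict

# Hypothetical filter: human studies from 2020 onward of selected types.
EXAMPLE_FILTER: Dict[str, Dict[str, Any]] = {
    "organism": {"$eq": "homo_sapiens"},
    "year": {"$gte": 2020},
    "study_type": {"$in": ["rct", "cohort"]},
}


def matches(metadata: Dict[str, Any], flt: Dict[str, Dict[str, Any]]) -> bool:
    """Return True if `metadata` satisfies every condition in `flt` (implicit AND)."""
    for field, conditions in flt.items():
        value = metadata.get(field)
        for op, target in conditions.items():
            if op == "$eq":
                ok = value == target
            elif op == "$ne":
                ok = value != target
            elif op == "$in":
                ok = value in target
            elif op == "$gte":
                ok = value is not None and value >= target
            elif op == "$lte":
                ok = value is not None and value <= target
            else:
                raise ValueError(f"unsupported operator in this sketch: {op}")
            if not ok:
                return False
    return True
```

Combining a namespace per organism or study with filters like this keeps unrelated records out of the candidate set before similarity scoring.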