GiorgosPanagopoulos/docqna

📄 DocQnA — Document Q&A with RAG

Python LangChain ChromaDB Anthropic Gradio

Upload any PDF or TXT document. Ask questions in English or Greek. Get answers grounded exclusively in the document.

Showcase document: EU NIS2 Directive — but the bot is fully domain-agnostic.


Architecture

graph LR
    A[📄 Upload] --> B[Loader<br/>PyPDF / TXT]
    B --> C[Chunker<br/>RecursiveChar]
    C --> D[Embeddings<br/>multilingual-e5-base]
    D --> E[(ChromaDB<br/>Persistent)]
    F[❓ Query] --> G[E5 Query Embed]
    G --> E
    E --> H[Retriever<br/>top-k=4]
    H --> I[Claude<br/>claude-sonnet-4]
    I --> J[💬 Streaming Response]
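The retrieval step in the diagram (query embedding in, top-k=4 chunks out) can be sketched in plain Python. This is only an illustration: `cosine_similarity` and `top_k` are hypothetical helper names, and cosine similarity is an assumption, since the distance metric actually used depends on how the ChromaDB collection is configured.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 4,
) -> List[Tuple[int, float]]:
    """Return (chunk index, score) pairs for the k most similar chunks, best first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

In the real pipeline the vector store performs this ranking internally; the sketch only shows what "top-k=4" means for a query vector.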

Features

  • Domain-agnostic — works with any PDF or TXT document
  • Bilingual — query in English or Greek over English documents via intfloat/multilingual-e5-base
  • Grounded answers — Claude is constrained to answer only from the retrieved context, which reduces the risk of hallucinated answers
  • Source citations — every response references the originating page and chunk
  • Persistent vector store — ChromaDB persists embeddings to disk; re-uploading the same document replaces the collection cleanly
  • Conversation memory — last 5 turns of chat history are included with each query (ConversationBufferWindowMemory k=5)
  • Streaming UI — token-by-token response streaming via Gradio queue
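Three of the features above (grounded answers, source citations, and the 5-turn memory window) meet in the prompt sent to the model. The sketch below shows one way that could look. It is not the repository's code: `Chunk`, `WindowMemory`, `build_prompt`, the metadata field names, and the prompt wording are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple


@dataclass
class Chunk:
    text: str
    page: int      # hypothetical metadata field
    chunk_id: int  # hypothetical metadata field


class WindowMemory:
    """Keeps only the most recent k (user, assistant) turns."""

    def __init__(self, k: int = 5) -> None:
        # deque(maxlen=k) silently drops the oldest turn once k is exceeded.
        self._turns: Deque[Tuple[str, str]] = deque(maxlen=k)

    def add(self, user: str, assistant: str) -> None:
        self._turns.append((user, assistant))

    def render(self) -> str:
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self._turns)


def build_prompt(question: str, chunks: List[Chunk], memory: WindowMemory) -> str:
    """Assemble a grounded prompt: labeled context, recent history, then the question."""
    context = "\n\n".join(
        f"[page {c.page}, chunk {c.chunk_id}] {c.text}" for c in chunks
    )
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Chat history:\n{memory.render()}\n\n"
        f"Question: {question}"
    )
```

Labeling each chunk with its page and chunk number inside the context is what lets the model cite sources in its answer.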

Quick Start

1. Clone & install

git clone https://github.com/GiorgosPanagopoulos/docqna.git
cd docqna
pip install -r requirements.txt

Note on PyTorch: requirements.txt requires torch>=2.0.0 (a minimum version, not an exact pin). For a CPU-only install you may prefer the smaller CPU wheel:

pip install torch --index-url https://download.pytorch.org/whl/cpu

2. Set up your API key

cp .env.example .env
# Open .env and set ANTHROPIC_API_KEY=sk-ant-...
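A missing key is easier to diagnose at startup than on the first API call. The following is a minimal sketch of such a check, assuming the contents of .env have already been loaded into the process environment (for example with python-dotenv, which is an assumption; the README does not say how .env is read). `get_api_key` is a hypothetical helper name.

```python
import os
from typing import Mapping, Optional


def get_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return ANTHROPIC_API_KEY from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ANTHROPIC_API_KEY is not set; copy .env.example to .env and fill it in."
        )
    return key
```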

3. Run

python app.py

Open http://localhost:7860 in your browser.


Usage

  1. Upload a PDF or TXT file using the left panel and click Process Document
  2. Wait for embeddings to generate (first run downloads ~560 MB for multilingual-e5-base)
  3. Ask any question in the chat box — in English or Greek
  4. Use the example question buttons for NIS2-specific prompts
  5. Click Clear Collection to wipe the vector store and start fresh

Tech Stack

| Component | Technology |
| --- | --- |
| LLM | Anthropic Claude (claude-sonnet-4) |
| Embeddings | intfloat/multilingual-e5-base via HuggingFace |
| Vector Store | ChromaDB (persistent local storage) |
| Orchestration | LangChain LCEL |
| UI | Gradio 4.x |
| PDF Parsing | PyPDF |
| Language | Python 3.11+ |
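The E5 model family expects inputs to carry a "query: " or "passage: " prefix, which is presumably why the embeddings need a wrapper. Below is a hedged sketch of such a wrapper, not the repository's implementation. `E5PrefixEmbeddings` and the injected `encode` callable are hypothetical. The method names follow LangChain's embeddings interface.

```python
from typing import Callable, List

# Any function mapping a batch of strings to a batch of vectors.
# With sentence-transformers this could wrap
# SentenceTransformer("intfloat/multilingual-e5-base").encode(...).tolist()
# (an assumption, not taken from the repository).
Encoder = Callable[[List[str]], List[List[float]]]


class E5PrefixEmbeddings:
    """Adds the E5 'passage: ' / 'query: ' prefixes before encoding."""

    def __init__(self, encode: Encoder) -> None:
        self._encode = encode

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Stored chunks are embedded as passages.
        return self._encode([f"passage: {t}" for t in texts])

    def embed_query(self, text: str) -> List[float]:
        # User questions are embedded as queries.
        return self._encode([f"query: {text}"])[0]
```

Injecting the encoder keeps the prefix logic testable without downloading the model.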

Project Structure

docqna/
├── app.py                  # Gradio UI entry point
├── core/
│   ├── config.py           # Constants and env vars
│   ├── document_loader.py  # PDF/TXT → LangChain Documents
│   ├── chunker.py          # RecursiveCharacterTextSplitter
│   ├── vectorstore.py      # ChromaDB + E5Embeddings wrapper
│   └── chain.py            # LCEL retrieval chain with Claude
├── data/                   # Drop sample documents here (gitignored)
├── chroma_db/              # ChromaDB persistent storage (gitignored)
├── .env.example
├── requirements.txt
└── README.md
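To illustrate what chunker.py delegates to RecursiveCharacterTextSplitter, here is a simplified pure-Python stand-in for the idea: split on the coarsest separator first and fall back to finer ones only for pieces that are still too long. It is not LangChain's implementation. It omits chunk overlap, and `recursive_split`, the default chunk size, and the separator list are assumptions.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    chunk_size: int = 500,
    separators: Sequence[str] = ("\n\n", "\n", " "),
) -> List[str]:
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Keep merging adjacent pieces while they still fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

Preferring paragraph and line breaks over arbitrary cuts tends to keep related sentences in the same chunk, which helps retrieval quality.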

Screenshots

Screenshots will be added after deployment.


"I build things I'd trust with something that matters."

Built by Georgios Panagopoulos

GitHub LinkedIn

About

RAG-powered Document Q&A - upload any PDF or TXT, ask questions in English or Greek. Built with LangChain · ChromaDB · Claude · Gradio.
