Upload any PDF or TXT document. Ask questions in English or Greek. Get answers grounded exclusively in the document.
Showcase document: EU NIS2 Directive — but the bot is fully domain-agnostic.
graph LR
A[📄 Upload] --> B[Loader<br/>PyPDF / TXT]
B --> C[Chunker<br/>RecursiveChar]
C --> D[Embeddings<br/>multilingual-e5-base]
D --> E[(ChromaDB<br/>Persistent)]
F[❓ Query] --> G[E5 Query Embed]
G --> E
E --> H[Retriever<br/>top-k=4]
H --> I[Claude<br/>claude-sonnet-4]
I --> J[💬 Streaming Response]
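
The embedding and retrieval stages in the diagram can be sketched in plain Python. E5 models expect a "passage: " prefix on indexed text and a "query: " prefix on questions, which is presumably what the E5 Query Embed step handles. The names `E5Embeddings`, `cosine`, and `top_k`, the `encode` callable, and the in-memory cosine search are assumptions for illustration only; the project itself delegates storage and search to ChromaDB.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


class E5Embeddings:
    """Adds the E5 'passage: ' / 'query: ' prefixes before encoding.

    `encode` is a stand-in for the real model call (an assumption of this
    sketch); it maps a list of strings to a list of vectors.
    """

    def __init__(self, encode: Callable[[List[str]], List[Vector]]) -> None:
        self._encode = encode

    def embed_documents(self, texts: Sequence[str]) -> List[Vector]:
        # Indexed chunks are embedded as passages.
        return self._encode([f"passage: {t}" for t in texts])

    def embed_query(self, text: str) -> Vector:
        # Questions are embedded as queries.
        return self._encode([f"query: {text}"])[0]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Vector,
    chunk_vecs: Sequence[Vector],
    chunks: Sequence[str],
    k: int = 4,
) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    scored = sorted(
        ((cosine(query_vec, v), c) for v, c in zip(chunk_vecs, chunks)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return scored[:k]
```

With a real model, `encode` would wrap the multilingual-e5-base encoder; the default `k=4` mirrors the top-k value shown in the diagram.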
- Domain-agnostic: works with any PDF or TXT document
- Bilingual: query in English or Greek over English documents via intfloat/multilingual-e5-base
- Grounded answers: Claude is constrained to answer only from the retrieved context, which limits hallucinations
- Source citations: every response references the originating page and chunk
- Persistent vector store: ChromaDB persists embeddings to disk; re-uploading the same document replaces the collection cleanly
- Conversation memory: the last 5 turns of chat history are included with each query (ConversationBufferWindowMemory, k=5)
- Streaming UI: token-by-token response streaming via Gradio queue
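
To make the grounding, citation, and memory features above more concrete, here is a minimal sketch of how a prompt could be assembled. It is not the project's actual chain: `Chunk`, `WindowMemory`, `build_prompt`, and the prompt wording are hypothetical, and the real implementation uses LangChain's ConversationBufferWindowMemory rather than a hand-rolled deque.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple


@dataclass
class Chunk:
    text: str
    page: int      # page number the chunk came from
    chunk_id: int  # position of the chunk within the document


class WindowMemory:
    """Keeps only the most recent k (question, answer) turns."""

    def __init__(self, k: int = 5) -> None:
        # A bounded deque silently drops the oldest turn once k is exceeded.
        self._turns: Deque[Tuple[str, str]] = deque(maxlen=k)

    def add(self, question: str, answer: str) -> None:
        self._turns.append((question, answer))

    def render(self) -> str:
        return "\n".join(f"User: {q}\nAssistant: {a}" for q, a in self._turns)


def build_prompt(question: str, chunks: List[Chunk], memory: WindowMemory) -> str:
    """Combine instructions, labelled context, history, and the question."""
    context = "\n\n".join(
        f"[page {c.page}, chunk {c.chunk_id}]\n{c.text}" for c in chunks
    )
    return (
        "Answer using only the context below. If the context does not "
        "contain the answer, say so. Cite the page and chunk you relied on.\n\n"
        f"Context:\n{context}\n\n"
        f"Chat history:\n{memory.render()}\n\n"
        f"Question: {question}"
    )
```

Labelling each chunk with its page and chunk number inside the context is one simple way to let the model cite its sources; the project may format citations differently.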
git clone https://github.com/GiorgosPanagopoulos/docqna.git
cd docqna
pip install -r requirements.txt

Note on PyTorch: requirements.txt requires torch>=2.0.0. For a CPU-only install you may prefer the smaller wheel:

pip install torch --index-url https://download.pytorch.org/whl/cpu

cp .env.example .env
# Open .env and set ANTHROPIC_API_KEY=sk-ant-...
python app.py

Open http://localhost:7860 in your browser.
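
If the app fails at startup, a missing or empty ANTHROPIC_API_KEY is a likely cause. The sketch below shows one way such a check could look. `parse_env_text` and `require_api_key` are hypothetical helpers, not functions from this repository, and the project may well load .env through a library such as python-dotenv instead.

```python
import os
from typing import Dict, Mapping, Optional


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key, or fail early with an actionable message."""
    env = os.environ if env is None else env
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ANTHROPIC_API_KEY is not set; copy .env.example to .env and fill it in."
        )
    return key
```

Calling `require_api_key()` with no arguments checks the process environment; passing the dictionary returned by `parse_env_text` checks the contents of a .env file instead.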
- Upload a PDF or TXT file using the left panel and click Process Document
- Wait for embeddings to generate (first run downloads ~560 MB for multilingual-e5-base)
- Ask any question in the chat box, in English or Greek
- Use the example question buttons for NIS2-specific prompts
- Click Clear Collection to wipe the vector store and start fresh
| Component | Technology |
|---|---|
| LLM | Anthropic Claude (claude-sonnet-4) |
| Embeddings | intfloat/multilingual-e5-base via HuggingFace |
| Vector Store | ChromaDB (persistent local storage) |
| Orchestration | LangChain LCEL |
| UI | Gradio 4.x |
| PDF Parsing | PyPDF |
| Language | Python 3.11+ |
docqna/
├── app.py # Gradio UI entry point
├── core/
│ ├── config.py # Constants and env vars
│ ├── document_loader.py # PDF/TXT → LangChain Documents
│ ├── chunker.py # RecursiveCharacterTextSplitter
│ ├── vectorstore.py # ChromaDB + E5Embeddings wrapper
│ └── chain.py # LCEL retrieval chain with Claude
├── data/ # Drop sample documents here (gitignored)
├── chroma_db/ # ChromaDB persistent storage (gitignored)
├── .env.example
├── requirements.txt
└── README.md
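
As a rough illustration of what document_loader.py and chunker.py are responsible for, the sketch below loads a file into per-page records and splits text on progressively finer separators. It is a simplified stand-in, not the repository's code: `load_pages`, `split_text`, the separator list, and the default `chunk_size` are assumptions, the PDF branch assumes the `pypdf` package, and chunk overlap (which LangChain's RecursiveCharacterTextSplitter supports) is omitted.

```python
from pathlib import Path
from typing import Dict, List, Sequence, Union

SEPARATORS = ("\n\n", "\n", " ")  # coarsest first: paragraphs, lines, words


def load_pages(path: Union[str, Path]) -> List[Dict[str, object]]:
    """Return one {'page', 'text'} record per page (TXT counts as one page)."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".txt":
        return [{"page": 1, "text": path.read_text(encoding="utf-8")}]
    if suffix == ".pdf":
        from pypdf import PdfReader  # imported lazily; only needed for PDFs

        reader = PdfReader(str(path))
        return [
            {"page": n, "text": page.extract_text() or ""}
            for n, page in enumerate(reader.pages, start=1)
        ]
    raise ValueError(f"Unsupported file type: {suffix}")


def split_text(
    text: str, chunk_size: int = 500, separators: Sequence[str] = SEPARATORS
) -> List[str]:
    """Split text into chunks of at most chunk_size characters."""
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks: List[str] = []
        current = ""
        for piece in text.split(sep):
            candidate = f"{current}{sep}{piece}" if current else piece
            if len(candidate) <= chunk_size:
                current = candidate  # keep packing pieces into this chunk
                continue
            if current.strip():
                chunks.append(current.strip())
            if len(piece) <= chunk_size:
                current = piece
            else:
                # The piece alone is too long: retry with finer separators.
                current = ""
                chunks.extend(split_text(piece, chunk_size, separators[i + 1:]))
        if current.strip():
            chunks.append(current.strip())
        return chunks
    # No separator applies: fall back to fixed-width slices.
    return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]
```

Splitting on paragraph breaks before lines and words tends to keep related sentences in the same chunk, which is the main idea behind recursive character splitting.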
Screenshots will be added after deployment.