📄 DocQnA — Document Q&A with RAG

Upload any PDF or TXT document. Ask questions in English or Greek. Get answers grounded exclusively in the document.

Showcase document: EU NIS2 Directive — but the bot is fully domain-agnostic.

Architecture

graph LR
    A[📄 Upload] --> B[Loader<br/>PyPDF / TXT]
    B --> C[Chunker<br/>RecursiveChar]
    C --> D[Embeddings<br/>multilingual-e5-base]
    D --> E[(ChromaDB<br/>Persistent)]
    F[❓ Query] --> G[E5 Query Embed]
    G --> E
    E --> H[Retriever<br/>top-k=4]
    H --> I[Claude<br/>claude-sonnet-4]
    I --> J[💬 Streaming Response]
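The embed and retrieve steps in the diagram can be sketched without any model or database dependency. E5 models expect a "query: " prefix on questions and a "passage: " prefix on indexed text. Everything else here is an assumption for illustration: `embed` is a hypothetical stand-in for the real multilingual-e5-base encoder, `toy_embed` is a keyword counter used only so the sketch runs, and `top_k` stands in for the ChromaDB similarity search rather than reproducing it.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]
Embedder = Callable[[str], Vector]  # hypothetical stand-in for the E5 encoder


def embed_passage(embed: Embedder, text: str) -> Vector:
    # Indexed text carries the "passage: " prefix.
    return embed("passage: " + text)


def embed_query(embed: Embedder, text: str) -> Vector:
    # Questions carry the "query: " prefix.
    return embed("query: " + text)


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(embed: Embedder, question: str, chunks: list[str], k: int = 4) -> list[str]:
    """Return the k chunks most similar to the question (k=4 as in the diagram)."""
    q = embed_query(embed, question)
    scored = [(cosine(q, embed_passage(embed, c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


def toy_embed(text: str) -> list[float]:
    # Demonstration only: counts three keywords after stripping the prefix.
    body = text.partition(": ")[2].lower()
    return [float(body.count(w)) for w in ("incident", "penalty", "scope")]
```

In the real pipeline the passage vectors are computed once at upload time and stored, so only the query is embedded per question; the sketch re-embeds passages on every call purely to stay short.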

Features

Domain-agnostic: works with any PDF or TXT document
Bilingual: ask in English or Greek over English documents, using intfloat/multilingual-e5-base embeddings
Grounded answers: Claude is instructed to answer only from the retrieved context and to say so when the context does not contain the answer, which reduces hallucinations
Source citations: every response references the originating page and chunk
Persistent vector store: ChromaDB persists embeddings to disk; re-uploading the same document replaces the collection cleanly
Conversation memory: the last 5 turns of chat history are included with each query (ConversationBufferWindowMemory k=5)
Streaming UI: token-by-token response streaming via the Gradio queue
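As a rough illustration of how grounding, citations, and the 5-turn window fit together, here is a minimal sketch of prompt assembly. It is not the project's actual chain: `Chunk`, `SYSTEM_RULES`, and `build_prompt` are hypothetical names, the prompt wording is invented for the example, and a plain `deque(maxlen=5)` stands in for ConversationBufferWindowMemory.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    page: int      # page number in the source document
    chunk_id: int  # position of the chunk within the document


# Hypothetical instruction text; the real system prompt may differ.
SYSTEM_RULES = (
    "Answer only from the context below. "
    "If the context does not contain the answer, say so. "
    "Cite sources as [page N, chunk M]."
)


def build_prompt(question: str, chunks: list[Chunk], history: deque) -> str:
    # Each retrieved chunk is labelled so the model can cite it.
    context = "\n".join(
        f"[page {c.page}, chunk {c.chunk_id}] {c.text}" for c in chunks
    )
    # history holds (user, assistant) pairs, oldest first.
    turns = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in history)
    return (
        f"{SYSTEM_RULES}\n\nContext:\n{context}\n\n"
        f"History:\n{turns}\n\nQuestion: {question}"
    )


# A deque with maxlen=5 drops the oldest turn automatically,
# mirroring a k=5 window memory.
history = deque(maxlen=5)
```

After each answer, appending `(question, answer)` to `history` keeps the window current without any manual trimming.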

Quick Start

1. Clone & install

git clone https://github.com/GiorgosPanagopoulos/docqna.git
cd docqna
pip install -r requirements.txt

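Note on PyTorch: requirements.txt requires torch>=2.0.0 (a lower bound, not an exact pin). For a CPU-only install you may prefer the smaller wheel; install it before running pip install -r requirements.txt so that pip can reuse the CPU build that is already present: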
pip install torch --index-url https://download.pytorch.org/whl/cpu

2. Set up your API key

cp .env.example .env
# Open .env and set ANTHROPIC_API_KEY=sk-ant-...
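A missing key is easier to diagnose at startup than on the first question. The check below is a minimal sketch, assuming the key has already been loaded into the process environment (for example by a dotenv loader, which is an assumption about how the app reads .env); `require_api_key` is a hypothetical helper, not part of the repository.

```python
import os
from typing import Mapping


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the Anthropic API key or raise a clear error."""
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ANTHROPIC_API_KEY is not set; copy .env.example to .env and fill it in."
        )
    return key
```

Passing the environment as a parameter keeps the helper easy to test with a plain dictionary.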

3. Run

python app.py

Open http://localhost:7860 in your browser.

Usage

Upload a PDF or TXT file using the left panel and click Process Document
Wait for embeddings to generate (first run downloads ~560 MB for multilingual-e5-base)
Ask any question in the chat box — in English or Greek
Use the example question buttons for NIS2-specific prompts
Click Clear Collection to wipe the vector store and start fresh
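Once a question is submitted, the answer appears incrementally in the chat box. Gradio streams from generator functions, replacing the displayed value on each yield, so the handler yields the text accumulated so far rather than single tokens. The sketch below shows only that accumulation pattern; `stream_answer` is a hypothetical name and the token source is a plain iterable standing in for Claude's streamed output.

```python
from typing import Iterable, Iterator


def stream_answer(tokens: Iterable[str]) -> Iterator[str]:
    """Yield the answer accumulated so far after each incoming token."""
    partial = ""
    for token in tokens:
        partial += token
        yield partial


for shown in stream_answer(["The ", "answer", "."]):
    print(shown)
```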

Tech Stack

| Component | Technology |
|---|---|
| LLM | Anthropic Claude (`claude-sonnet-4`) |
| Embeddings | `intfloat/multilingual-e5-base` via HuggingFace |
| Vector Store | ChromaDB (persistent local storage) |
| Orchestration | LangChain LCEL |
| UI | Gradio 4.x |
| PDF Parsing | PyPDF |
| Language | Python 3.11+ |

Project Structure

docqna/
├── app.py                  # Gradio UI entry point
├── core/
│   ├── config.py           # Constants and env vars
│   ├── document_loader.py  # PDF/TXT → LangChain Documents
│   ├── chunker.py          # RecursiveCharacterTextSplitter
│   ├── vectorstore.py      # ChromaDB + E5Embeddings wrapper
│   └── chain.py            # LCEL retrieval chain with Claude
├── data/                   # Drop sample documents here (gitignored)
├── chroma_db/              # ChromaDB persistent storage (gitignored)
├── .env.example
├── requirements.txt
└── README.md
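chunker.py is listed as using RecursiveCharacterTextSplitter. The idea behind recursive splitting is to try coarse separators first (paragraphs, then lines, then words) and fall back to finer ones only when a piece is still too long. The function below is a simplified, dependency-free sketch of that idea, not LangChain's implementation: it has no chunk overlap, and `split_recursive` with its default `chunk_size` is a hypothetical name and value chosen for the example.

```python
def split_recursive(text, chunk_size=500, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        # Greedily merge pieces while the result still fits.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: recurse with finer separators.
            chunks.extend(split_recursive(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

Preferring paragraph and line boundaries keeps related sentences in the same chunk, which tends to give the retriever more coherent passages than fixed-width slicing.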

Screenshots

Screenshots will be added after deployment.

"I build things I'd trust with something that matters."

Built by Georgios Panagopoulos
