A tool for analyzing PDF documents and answering questions about them using Retrieval-Augmented Generation (RAG) with your local Ollama installation. Answers are based only on the content of the provided PDF; no external knowledge sources are used.
- PDF Text Extraction: Uses pdfplumber for robust text extraction, including Arabic text
- Semantic Search: Creates embeddings and uses FAISS for fast similarity search
- Local AI: Uses your local Ollama installation for answering questions
- Multi-language Support: Works with Arabic, English, and other languages
- Two Interfaces: Both web UI (Streamlit) and command-line interface
- Source Citation: Shows which parts of the PDF were used to generate answers
- Install Python dependencies:
  pip install -r requirements.txt
- Install and set up Ollama:
  - Download Ollama from https://ollama.ai
  - Install and start the Ollama service
  - Pull a model (e.g., ollama pull llama3.2)
- Verify Ollama is running:
  ollama list

- Start the Streamlit app:
  streamlit run pdf_analyzer.py
- Open your browser to http://localhost:8501
- Upload your PDF and click "Process PDF"
- Ask questions about the PDF content
- Run the CLI:
  python cli.py "path/to/your/pdf/file.pdf"
- Ask questions interactively
- Pass options on the command line, for example:
  python cli.py "document.pdf" --model llama3.2 --chunk-size 500 --overlap 100 --top-k 5

Edit config.py to customize:
- Ollama model name
- Chunk size and overlap
- Embedding model
- Number of relevant chunks to retrieve
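
As a rough illustration of how these settings and the CLI flags shown earlier could fit together, here is a minimal sketch. The `Settings` class, its field names, and `load_settings` are hypothetical and are not taken from the project's config.py or cli.py; the embedding model name is only a placeholder, since this README does not say which one is used.

```python
import argparse
from dataclasses import dataclass


@dataclass
class Settings:
    """Hypothetical settings container; real names in config.py may differ."""
    ollama_model: str = "llama3.2"
    chunk_size: int = 500
    chunk_overlap: int = 100
    embedding_model: str = "all-MiniLM-L6-v2"  # placeholder, not from the README
    top_k: int = 5


def build_parser(defaults: Settings) -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the CLI example."""
    parser = argparse.ArgumentParser(description="Ask questions about a PDF")
    parser.add_argument("pdf_path", help="Path to the PDF file")
    parser.add_argument("--model", default=defaults.ollama_model)
    parser.add_argument("--chunk-size", type=int, default=defaults.chunk_size)
    parser.add_argument("--overlap", type=int, default=defaults.chunk_overlap)
    parser.add_argument("--top-k", type=int, default=defaults.top_k)
    return parser


def load_settings(argv):
    """Return (pdf_path, Settings), with CLI flags overriding the defaults."""
    defaults = Settings()
    args = build_parser(defaults).parse_args(argv)
    if args.overlap >= args.chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    settings = Settings(
        ollama_model=args.model,
        chunk_size=args.chunk_size,
        chunk_overlap=args.overlap,
        embedding_model=defaults.embedding_model,
        top_k=args.top_k,
    )
    return args.pdf_path, settings
```

Calling `load_settings(["document.pdf", "--chunk-size", "300", "--overlap", "50"])` would keep the default model and top-k while overriding only the chunking values.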
- Text Extraction: Extracts text from PDF using pdfplumber
- Text Chunking: Splits text into overlapping chunks for better context
- Embedding Creation: Creates vector embeddings using SentenceTransformers
- Vector Storage: Stores embeddings in FAISS index for fast similarity search
- Question Processing:
  - Converts question to embedding
  - Finds most similar text chunks
  - Sends relevant context to Ollama
  - Returns AI-generated answer based only on PDF content
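
The chunking and retrieval steps above can be sketched in plain Python. This is not the project's code: it assumes chunk size and overlap are measured in characters, uses a simple cosine-similarity ranking as a stand-in for the FAISS index, and expects the embedding vectors to be supplied by the caller (in the real tool they would come from SentenceTransformers). The function names and the prompt wording are hypothetical.

```python
import math


def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into chunks of chunk_size characters sharing `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec, chunk_vecs, chunks, k=5):
    """Return the k chunks whose vectors are most similar to the query vector."""
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]


def build_prompt(question, context_chunks):
    """Assemble a prompt that restricts the model to the retrieved context."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which is why overlapping chunks tend to give the model better context than a hard split.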
Any Ollama model can be used. Popular choices:
- llama3.2 (recommended for general use)
- mistral
- codellama (for code-related documents)
- qwen2.5 (good for multilingual content)
Make sure to pull the model first: ollama pull model-name
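
If you would rather check from Python whether a model has been pulled, something like the following may work. It is a sketch that assumes Ollama is listening on its default local address (http://localhost:11434) and that its `/api/tags` endpoint returns a JSON object with a `models` list whose entries have a `name` field; the helper names are hypothetical and are not part of this project.

```python
import json
import urllib.request


def model_names(tags_payload):
    """Collect model names from an /api/tags response, with and without the tag suffix."""
    names = set()
    for entry in tags_payload.get("models", []):
        name = entry.get("name", "")
        if name:
            names.add(name)                    # e.g. "llama3.2:latest"
            names.add(name.split(":", 1)[0])   # e.g. "llama3.2"
    return names


def is_model_available(model, base_url="http://localhost:11434"):
    """Ask the local Ollama server which models are installed (assumed default URL)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        payload = json.load(resp)
    return model in model_names(payload)
```

`is_model_available("llama3.2")` raises a `urllib.error.URLError` if the server is not reachable, which can double as a quick check that Ollama is running.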
- Ensure Ollama is running:
  ollama serve
- Check if the model is available:
  ollama list
- Verify the model name in configuration
- Ensure the PDF contains extractable text (not just images)
- Try with a different PDF to isolate the issue
- Check file permissions
- Reduce chunk size in configuration
- Use a smaller embedding model
- Process smaller PDFs
For an Arabic legal document:
- "ما هو موضوع هذا النظام؟" (What is the subject of this system?)
- "ما هي المواد المتعلقة بالحكم؟" (What are the articles related to governance?)
For English documents:
- "What is the main topic of this document?"
- "Summarize the key points"
- "What are the requirements mentioned?"
pdf_analyzer/
├── pdf_analyzer.py # Main Streamlit application
├── cli.py # Command-line interface
├── config.py # Configuration settings
├── requirements.txt # Python dependencies
├── README.md # This file
└── cache/ # Cached indexes (created automatically)
This project is open source and available under the MIT License.