This project is a Python-based Retrieval-Augmented Generation (RAG) microservice designed to ingest documents, answer queries based on their content, and provide deep observability into the entire process using Langfuse.
The solution is built on a modern, containerized, and fully open-source stack, enabling a 100% local and private RAG pipeline from document ingestion to answer generation.
| Component | Technology | Purpose |
|---|---|---|
| API Framework | FastAPI | For building a high-performance, asynchronous, and robust web service. |
| LLM Serving | Ollama with gemma:2b | To serve an open-source Large Language Model locally. |
| Embeddings | all-MiniLM-L6-v2 (via Hugging Face) | To generate powerful text embeddings locally, ensuring privacy and zero cost. |
| Vector Database | ChromaDB | As an in-memory vector store for fast prototyping and development. |
| AI Orchestration | LangChain | To structure the RAG pipeline (load, split, retrieve, generate). |
| Observability | Langfuse | For deep, nested tracing and debugging of the RAG pipeline's logic. |
| Containerization | Docker & Docker Compose | To package and deploy the entire multi-service application with a single command. |
Follow these steps to set up and run the entire environment on your local machine.
- Git installed.
- Docker and Docker Compose installed and running.
1. Clone the repository:

   ```bash
   git clone https://github.com/jorgeston/RAG_LLM.git
   cd RAG_LLM
   ```

2. Create the environment file: This project requires API keys to connect to the Langfuse dashboard. Create a file named `.env` in the project root. You can copy the provided template:

   ```bash
   cp .env.template .env
   ```

   Next, edit the `.env` file and add your keys obtained from Langfuse Cloud. (You will need to create a `.env.template` file with the content below for this step to work.)

   `.env.template`:

   ```bash
   # Langfuse Credentials
   # Get these from your project settings in Langfuse Cloud
   LANGFUSE_PUBLIC_KEY="pk-lf-..."
   LANGFUSE_SECRET_KEY="sk-lf-..."
   LANGFUSE_HOST="https://cloud.langfuse.com"
   ```

3. Launch all services with Docker Compose: This single command will build the API's Docker image, pull the Ollama image, and start both containers on a connected network.

   ```bash
   docker-compose up --build
   ```

4. Download the LLM (One-Time Step): The first time you run the service, Ollama needs to download the `gemma:2b` model. Open a new terminal window and run:

   ```bash
   docker-compose exec ollama ollama run gemma:2b
   ```

   Wait for the download to complete. Once finished, the model is stored in a persistent Docker volume and will not need to be downloaded again.
The service is now running and accessible at http://localhost:8000.
You can interact with the API via the auto-generated interactive documentation at http://localhost:8000/docs.
Checks the service's health status.
- Success Response (200 OK):

  ```json
  { "status": "ok" }
  ```
Ingests and processes a document. Uses multipart/form-data to handle file uploads.
- Parameters:
  - `file`: The document to be processed (PDF, TXT, HTML, etc.).
  - `document_type`: A string identifying the file type (e.g., `pdf`).
- `curl` Example:

  ```bash
  curl -X POST -F "file=@/path/to/your/document.pdf" -F "document_type=pdf" http://localhost:8000/ingest
  ```

- Success Response (200 OK):

  ```json
  {
    "status": "success",
    "message": "Successfully ingested and processed document.pdf",
    "chunks_created": 123
  }
  ```
Sends a query to the RAG system based on the last ingested document.
- Request Body:
{ "question": "Your question here" } curlExample:curl -X POST -H "Content-Type: application/json" -d '{"question": "What is the maximum line length in Python?"}' http://localhost:8000/query
- Success Response (200 OK):
{ "answer": "The generated answer from the LLM.", "sources": [ { "page": 0, "text": "A relevant snippet from the source document..." } ] }
This project includes a suite of unit tests to ensure the API layer is functioning correctly. The tests use pytest and pytest-mock to test endpoint logic in isolation.
To run the tests, follow these steps on your local machine (outside of Docker):
1. Create and activate a local virtual environment:

   ```
   python -m venv venv
   .\venv\Scripts\activate   # Windows (on Linux/macOS: source venv/bin/activate)
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Run pytest: From the root directory of the project, simply run:

   ```bash
   pytest
   ```
All tests should pass, confirming the API endpoints are correctly set up.
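As an illustration, a test in this suite could look like the sketch below. It assumes the FastAPI instance is importable as `main.app`, that the health check lives at `/health`, and it patches a hypothetical `process_document` helper; the actual module layout and helper names in the repository may differ.

```python
# tests/test_api.py -- minimal sketch; the module path, /health route, and the
# patched helper name are assumptions about the repository layout.
from fastapi.testclient import TestClient

from main import app  # assumes the FastAPI instance is exposed as main.app

client = TestClient(app)


def test_health_returns_ok():
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json() == {"status": "ok"}


def test_ingest_returns_chunk_count(mocker):
    # pytest-mock's `mocker` fixture replaces the heavy ingestion pipeline so the
    # endpoint logic is tested in isolation.
    mocker.patch("main.process_document", return_value=123)  # hypothetical helper name

    response = client.post(
        "/ingest",
        files={"file": ("doc.pdf", b"%PDF-1.4 fake bytes", "application/pdf")},
        data={"document_type": "pdf"},
    )
    assert response.status_code == 200
    assert response.json()["chunks_created"] == 123
```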
- Local-First & Open-Source Stack: The decision was made to use a fully local stack (Ollama, Hugging Face Embeddings) instead of relying on paid, proprietary APIs. This approach prioritizes data privacy (documents never leave the local environment), zero cost, and full control over every component. The trade-off is that inference performance depends on local hardware.
- Improved `/ingest` Endpoint: The ingestion endpoint was intentionally designed to accept `multipart/form-data` (file uploads) rather than a JSON payload containing the document's content. While a deviation from the initial schema, this is a significant practical improvement: file uploads are more efficient and robust for documents of any size and format, and they avoid JSON payload size limits.
- Deep vs. Shallow Observability: Instead of merely tracing the top-level API endpoints, granular instrumentation of the RAG pipeline was implemented using Langfuse's context managers. This creates nested spans for the `retrieval` and `synthesis-generation` steps, providing the deep visibility that is crucial for debugging RAG quality issues such as hallucinations or irrelevant context (see the sketch after this list).
- Robust Container Environment: The `Dockerfile` installs system-level dependencies (`libgl1`, `poppler`, `tesseract-ocr`). This makes the ingestion process more resilient by enabling the `unstructured` library to correctly process complex file formats such as PDFs and scanned documents, preventing common runtime errors inside a minimal container environment.
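To make the observability point concrete, the sketch below shows the nested-span idea using the Langfuse Python SDK's v2-style client API (`trace` / `span` / `end`). The repository is described as using Langfuse's context managers for the same purpose, so the exact calls may differ; `retriever` and `llm` here are placeholders for the objects created at startup and during ingestion.

```python
# Nested-trace sketch using the Langfuse Python SDK's v2-style client API.
# `retriever` and `llm` are passed in as placeholders for objects built at startup.
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from the environment


def answer_question(question: str, retriever, llm) -> str:
    trace = langfuse.trace(name="rag-query", input={"question": question})

    # Nested span 1: retrieval of relevant chunks from the vector store.
    retrieval_span = trace.span(name="retrieval", input={"question": question})
    docs = retriever.get_relevant_documents(question)
    retrieval_span.end(output={"num_documents": len(docs)})

    # Nested span 2: answer synthesis with the local LLM.
    context = "\n\n".join(doc.page_content for doc in docs)
    generation_span = trace.span(name="synthesis-generation", input={"context_chars": len(context)})
    answer = llm.invoke(f"Context:\n{context}\n\nQuestion: {question}")
    generation_span.end(output=answer)

    trace.update(output=answer)
    return answer
```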
- Persistent Vector Store: Configure ChromaDB to persist its data to a Docker volume, allowing the knowledge base to survive container restarts (a possible configuration is sketched after this list).
- Multi-Document Management: Evolve the API to handle a persistent library of multiple documents, likely by associating chunks with a unique `document_id`.
- Asynchronous Tracing: For very high-throughput scenarios, the Langfuse export process could be made fully asynchronous to avoid any potential blocking of the main application thread.
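As a rough illustration of the first two items, the snippet below shows one way a persistent, multi-document ChromaDB store could be configured via LangChain's Chroma integration. The path, collection name, and metadata keys are placeholders, not the project's current configuration.

```python
# Possible persistent, multi-document ChromaDB setup; the path, collection name,
# and metadata keys are placeholders, not the project's current configuration.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# /data/chroma would be mounted as a Docker volume in docker-compose.yml so the
# index survives container restarts.
vector_store = Chroma(
    collection_name="documents",
    embedding_function=embeddings,
    persist_directory="/data/chroma",
)

# Tagging each chunk with a document_id in its metadata is one way to support a
# library of multiple documents and filtered retrieval later on.
vector_store.add_texts(
    texts=["Example chunk text..."],
    metadatas=[{"document_id": "doc-123", "page": 0}],
)
```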