Building a Production-Ready Private RAG System

Every powerful software product is backed by extensive knowledge. For a product like SalesWorx, this knowledge is codified in detailed user guides, technical manuals, and implementation documents covering everything from commission structures to complex trade deal rules. This information is invaluable, but it’s often trapped in static PDF and Markdown files, making it difficult for customer support, implementation specialists, and even sales teams to find precise, contextual answers quickly.

Standard keyword search is no longer enough. It lacks the semantic understanding to answer complex questions like, “How do the multi-transaction bonus rules in SalesWorx impact the commission payout slabs for sales reps?”

This is where a Retrieval-Augmented Generation (RAG) system comes in. But while many tutorials show how to build a basic RAG prototype, they often fall short in a real-world production environment. They are brittle, struggle with complex queries, and lack the robustness required for business-critical applications.

This guide is different. We will walk you through the complete, end-to-end process of building a production-ready RAG system. We’ll cover the architecture, the detailed workflow, the full setup and implementation, and the final output, using the official SalesWorx product documentation as our real-world knowledge base.

Features

  • Privacy-Focused: Local processing ensures data security without cloud dependency.
  • Library-First Ingestion Pipeline: LlamaIndex IngestionPipeline orchestrates Unstructured parsing, deterministic hashing, DuckDB caching, and AES-GCM page image handling with OpenTelemetry spans for each run.
  • Versatile Document Handling: Supports multiple file formats:
    • 📄 PDF
    • 📑 DOCX
    • 📝 TXT
    • 📊 XLSX
    • 🌐 MD (Markdown)
    • 🗃️ JSON
    • 🗂️ XML
    • 🔤 RTF
    • 📇 CSV
    • 📧 MSG (Email)
    • 🖥️ PPTX (PowerPoint)
    • 📘 ODT (OpenDocument Text)
    • 📚 EPUB (E-book)
    • 💻 Code files (PY, JS, JAVA, TS, TSX, C, CPP, H, and more)
  • Multi-Agent Coordination: LangGraph supervisor coordinating 5 specialized agents: query router, query planner, retrieval expert, result synthesizer, and response validator.
  • Retrieval/Router: RouterQueryEngine composed via router_factory with tools semantic_search, hybrid_search (Qdrant server‑side fusion), and optional knowledge_graph; uses async/batching where appropriate.
  • Hybrid Retrieval: Qdrant Query API server‑side fusion (RRF default, DBSF optional) over named vectors text-dense (BGE‑M3; COSINE) and text-sparse (FastEmbed BM42/BM25 with IDF). Dense via LlamaIndex; sparse via FastEmbed.
  • Knowledge Graph (optional): Adds a knowledge_graph router tool when a PropertyGraphIndex is present and healthy; uses spaCy entity extraction; selector prefers PydanticSingleSelector then LLMSingleSelector; falls back to vector/hybrid when absent.
  • Multimodal Processing: Unstructured hi‑res parsing for PDFs with text, tables, and images; visual features scored with SigLIP by default (CLIP optional).
  • Always-on Reranking: Text via BGE Cross-Encoder and visual via SigLIP; optional ColPali on capable GPUs. Deterministic, batch‑wise cancellation; fail‑open; SigLIP loader cached.
  • Offline-First Design: 100% local processing with no external API dependencies.
  • GPU Acceleration: CUDA support with mixed precision and FP8 quantization via vLLM FlashInfer backend for optimized performance.
  • Session Persistence: SQLite WAL with local multi-process support for concurrent access.
  • Docker Support: Easy deployment with Docker and Docker Compose.
  • Intelligent Caching: High-performance document processing cache for rapid re-analysis.
  • Robust Error Handling: Reliable retry strategies with exponential backoff.
  • Structured Logging: Contextual logging with automatic rotation and JSON output.
  • Encrypted Page Images (AES-GCM): Optional at-rest encryption for rendered PDF page images using AES-GCM with KID as AAD; .enc files are decrypted just-in-time for visual scoring and immediately cleaned up.
  • Simple Configuration: Environment variables and Streamlit native config for easy setup.

The Core Problem: Why Basic RAG Fails in Production

A simple RAG system typically follows a two-step process: retrieve relevant text chunks and feed them to a Large Language Model (LLM) to generate an answer. This works for simple questions but breaks down when faced with real-world complexity:

  • Poor Retrieval Accuracy: A simple vector search might miss documents that use different terminology for the same concept or fail to find documents that require both keyword and semantic matching.
  • Context-Blindness: It can’t handle multi-part questions that require synthesizing information from different documents.
  • Lack of Scalability: The architecture isn’t designed for a growing knowledge base or concurrent users.
  • No Robustness: It lacks proper error handling, monitoring, and the ability to trace how an answer was generated.

A production system must overcome all these challenges.
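To make "robustness" concrete, here is a minimal sketch of the retry-with-exponential-backoff pattern the feature list refers to. The `retry` helper is illustrative, not the project's actual API; the `sleep` parameter is injectable so tests run instantly:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry(fn: Callable[[], T], attempts: int = 4, base_delay: float = 0.5,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on any exception with exponentially growing
    delays (0.5s, 1s, 2s, ...); re-raise after the final attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

In production you would narrow the caught exception types to transient failures (timeouts, connection resets) rather than retrying everything.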

Architectural Deep Dive: The Anatomy of a Production RAG System

Our system is built for robustness and intelligence. It’s divided into two main workflows: an offline Ingestion Workflow to process and index knowledge, and a real-time Query Workflow to intelligently answer user questions.

This system combines hybrid search (dense + sparse embeddings), knowledge graph extraction, and a 5-agent coordination system to extract and analyze information from your PDFs, Office docs, and multimedia content. Built on LlamaIndex pipelines with LangGraph supervisor orchestration and Qwen3-4B-Instruct-2507’s full 262K-token context window (enabled by INT8 KV cache optimization), it provides document intelligence that runs entirely on your hardware, with GPU acceleration and agent coordination.

Phase 1: The Ingestion Workflow (Building the Knowledge Foundation)

This is the crucial first step where we convert our raw product manuals into a structured, queryable knowledge base.

  1. Data Sourcing & Parsing: We start with the official SalesWorx product documentation. Instead of just reading text, we use the unstructured library to parse these files. This intelligently extracts text, tables, and titles, preserving the document’s original structure.
  2. Strategic Chunking: We employ a title-based chunking strategy. This is superior to fixed-size chunks because it keeps related paragraphs grouped under their original headings, maintaining vital semantic context.
  3. Hybrid Embeddings (BGE-M3): Each chunk is transformed into numerical vectors. We use the BGE-M3 model, which is exceptional because it generates both dense vectors (capturing semantic meaning, e.g., “incentive” is similar to “bonus”) and sparse vectors (capturing keyword relevance) in a single pass. This is the foundation of our advanced hybrid search.
  4. Vector Storage (Qdrant): The chunks and their vectors are stored in Qdrant, a production-grade vector database. We chose Qdrant for its key features:
    • Named Vectors: It can store both our dense and sparse vectors for every single data point.
    • Server-Side Fusion: It can combine search results from both vector types on the server using Reciprocal Rank Fusion (RRF), which is highly efficient and improves retrieval accuracy.
    • Scalability & Performance: Built in Rust, it’s incredibly fast and can be deployed in a distributed cluster.
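Qdrant performs this fusion server-side, but the idea behind Reciprocal Rank Fusion fits in a few lines. A pure-Python sketch (illustrative only; in this system the real fusion happens inside Qdrant's Query API):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists. Each document scores
    sum(1 / (k + rank)) across the lists it appears in; k=60 is the
    constant from the original RRF paper and a common default."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear near the top of both the dense and the sparse ranking rise above documents that only one ranking favors, which is exactly why hybrid retrieval beats either signal alone.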

Phase 2: The Query Workflow (The Multi-Agent “Brain”)

When a user asks a question, a sophisticated, real-time process is initiated, managed by a team of five specialized AI agents orchestrated by LangGraph.

Let’s trace a complex query: “Summarize the ‘Simple Bonus by Item Quantity’ feature in SalesWorx and explain how it differs from the ‘Assortment Bonus by Overall Quantity’.”

  1. Query Router: The query first hits this agent. It analyzes the structure and identifies two distinct parts: one about “Simple Bonus” and another about “Assortment Bonus.” It flags the query as complex and requiring a multi-step plan.
  2. Query Planner: This agent creates a logical plan:
    • Task 1: Retrieve information specifically about the “Simple Bonus by Item Quantity” feature, including its types (“Point” and “Recurring”).
    • Task 2: Retrieve information about the “Assortment Bonus by Overall Quantity” feature.
    • Task 3: Synthesize the retrieved information, focusing on the key differences between the two features.
  3. Retrieval Expert: This agent executes the plan. For each task, it performs a hybrid search on Qdrant, looking for chunks that are both semantically similar and contain relevant keywords. The results are then passed to a reranker model to push the most accurate chunks to the very top.
  4. Result Synthesizer: This agent takes the curated context from both retrieval tasks and combines them into a single, coherent block of information.
  5. Response Validator: Before the final answer is generated, this agent performs a quality check. Does the synthesized context accurately address both parts of the original query? Only after this validation is the context passed to the LLM to generate the final, human-readable answer.
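The production system wires these five agents together with a LangGraph supervisor; stripped of LLM calls, the dataflow can be sketched as plain functions over a shared state. All names below are illustrative stand-ins, with a trivial heuristic in place of each agent's LLM reasoning:

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    query: str
    plan: list[str] = field(default_factory=list)
    retrieved: dict[str, str] = field(default_factory=dict)
    synthesis: str = ""
    valid: bool = False

def route(state: AgentState) -> str:
    # Query Router: crude complexity heuristic in place of an LLM call.
    return "complex" if " and " in state.query or "differ" in state.query else "simple"

def plan(state: AgentState) -> AgentState:
    # Query Planner: one retrieval task per sub-question.
    state.plan = [part.strip() for part in state.query.split(" and ")]
    return state

def retrieve(state: AgentState, search) -> AgentState:
    # Retrieval Expert: delegate each task to the injected search backend.
    state.retrieved = {task: search(task) for task in state.plan}
    return state

def synthesize(state: AgentState) -> AgentState:
    # Result Synthesizer: merge the per-task context into one block.
    state.synthesis = "\n".join(state.retrieved.values())
    return state

def validate(state: AgentState) -> AgentState:
    # Response Validator: every task must have produced some context.
    state.valid = all(state.retrieved.values())
    return state
```

In the real system, each function is an agent node in a LangGraph graph and the supervisor decides which node runs next; the shared-state shape is the part that carries over.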

This structured, agent-based workflow allows our system to deconstruct and answer complex questions about the SalesWorx product with a level of precision that a simple RAG system cannot match.

Step-by-Step Implementation: Let’s Build It

Here’s how you can set up and run this entire system on your local machine.

Step 0: Prerequisites

  • Docker & Docker Compose: To run our vector database.
  • Python 3.11+ and the uv package manager.
  • Ollama: For running a local LLM. Install it and pull a model: `ollama pull qwen3-4b-instruct-2507`

Step 1: Project Setup

First, get the project code and set up the Python environment.

```bash
# Clone the repository (replace with your actual repo URL)
git clone https://github.com/your-repo/production-rag-salesworx.git
cd production-rag-salesworx

# Install all dependencies from pyproject.toml
uv sync

# Create your environment configuration file
cp .env.example .env
```

Make sure your `.env` file points to your local Ollama instance and sets the correct model name.
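For orientation, a `.env` along these lines is typical for this stack. The variable names here are placeholders, since the project's `.env.example` defines the real ones:

```bash
# .env — illustrative values only; use the names from .env.example
OLLAMA_BASE_URL=http://localhost:11434
LLM_MODEL=qwen3-4b-instruct-2507
QDRANT_URL=http://localhost:6333
```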

Step 2: Launching Core Services with Docker

We use Docker Compose to manage our Qdrant instance. The `docker-compose.yml` file defines the service.

```yaml
# docker-compose.yml
version: '3.8'
services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - ./qdrant_storage:/qdrant/storage
```

Launch the service:

```bash
docker-compose up -d
```

You can verify that Qdrant is running by visiting http://localhost:6333/dashboard in your browser.
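If you prefer to check readiness from code (for example, before kicking off ingestion), recent Qdrant versions expose health probes such as `/readyz`; fall back to the dashboard if your version differs. A small stdlib-only helper:

```python
import urllib.request
import urllib.error

def qdrant_ready(base_url: str = "http://localhost:6333",
                 timeout: float = 2.0) -> bool:
    """Return True when the Qdrant HTTP API answers its readiness probe."""
    try:
        with urllib.request.urlopen(f"{base_url}/readyz", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Server not up yet, wrong port, or probe unsupported.
        return False
```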

Step 3: The Ingestion Process (Code and Execution)

Create a folder named data and place your SalesWorx product documentation files inside it. Now, create an ingest.py script to process these files and load them into Qdrant.

```python
# ingest.py
import asyncio
from pathlib import Path
from src.config import settings
from src.utils.document import load_documents_unstructured
from src.utils.embedding import create_index_async

async def main():
    """
    Main function to run the ingestion pipeline.
    It processes all supported product documents in the 'data' folder and indexes them.
    """
    print("Starting the ingestion process...")
    data_folder = Path("./data")

    supported_extensions = {'.md', '.pdf'}
    documents_paths = [
        f for f in data_folder.rglob("*")
        if f.suffix.lower() in supported_extensions
    ]

    if not documents_paths:
        print("No supported documents found in the './data' folder.")
        return

    print(f"Found {len(documents_paths)} documents to process...")

    # 1. Load and parse documents using the Unstructured library
    documents = await load_documents_unstructured(documents_paths, settings)

    # 2. Create embeddings and store them in Qdrant
    # This function handles chunking, embedding, and indexing
    await create_index_async(documents, settings)

    print("✅ Ingestion complete. The SalesWorx knowledge base is ready.")

if __name__ == "__main__":
    asyncio.run(main())
```

Run the script from your terminal:

```bash
uv run python ingest.py
```

You will see log messages indicating the progress, and upon completion, your knowledge base will be indexed and ready in Qdrant.

Step 4: The Query Process (Code and Execution)

Now for the exciting part: asking questions. Create a `query.py` script to interact with our multi-agent system.

```python
# query.py
import asyncio
import sys
from src.agents.coordinator import MultiAgentCoordinator

async def main(user_query: str):
    """
    Initializes the multi-agent coordinator and processes a user query.
    """
    print(f"Processing query: '{user_query}'")

    # Initialize the 5-agent system
    coordinator = MultiAgentCoordinator()

    # Process the query and get the response
    response = coordinator.process_query(user_query, context=None)

    print("\n--- 🤖 Answer ---")
    print(response)
    print("--------------------")

if __name__ == "__main__":
    if len(sys.argv) > 1:
        query = " ".join(sys.argv[1:])
        asyncio.run(main(query))
    else:
        print("Usage: uv run python query.py <your question here>")
```

Now, you can ask questions about the SalesWorx product directly from your command line.

Putting It All Together: The Final Output

Let’s run our query.py script with the same questions we used as examples.

Query 1: A direct, comparative question

```bash
uv run python query.py "Summarize the 'Simple Bonus by Item Quantity' feature in SalesWorx and explain how it differs from the 'Assortment Bonus by Overall Quantity'."
```

Expected Output:

--- 🤖 Answer ---
The key difference between the ‘Simple Bonus by Item Quantity’ and ‘Assortment Bonus by Overall Quantity’ features in SalesWorx lies in the items a customer must purchase to qualify for the bonus.

  1. Simple Bonus by Item Quantity: This bonus is tied to the purchase of a single, specific item. A customer must buy a certain quantity of ‘Product A’ to receive a Free of Charge (FOC) item. The bonus can be ‘Point’ (a one-time reward for buying within a quantity range) or ‘Recurring’ (a reward that multiplies as the purchase quantity increases).
  2. Assortment Bonus by Overall Quantity: This bonus is more flexible. A customer can qualify by purchasing a combination of different items from a predefined ‘Order Items List’. The system looks at the total quantity of all eligible items in the cart. For example, a customer could buy 3 units of Product A, 4 units of Product B, and 3 units of Product C to reach a total of 10 units and qualify for the bonus associated with that quantity slab.

In short, the Simple Bonus feature is for promoting a single product, while the Assortment Bonus feature is for encouraging the purchase of a range of different products.

Query 2: A step-by-step process question

```bash
uv run python query.py "How do I set up a multi-transaction bonus rule in SalesWorx?"
```

Expected Output:

--- 🤖 Answer ---
To set up a multi-transaction bonus rule in the SalesWorx product, follow these steps:

  1. Navigate to the Menu: From the main menu, go to Product Management -> Bonus/Discount Management.
  2. Select the Rule Type: Click on Multi Transaction Bonus Rules.
  3. Choose Organization: Select the organization (e.g., Site – 2100) for which you are creating the deal.
  4. Add a New Rule: Click the Add button.
  5. Define the Rule:
    • Sales Item: Select the product the customer needs to purchase over time (e.g., ‘200082-Al Ain 5 Gallon’).
    • Sales Qty: Enter the total quantity the customer must achieve across multiple invoices (e.g., 100).
    • Promo Item: Select the item that will be given as a bonus.
    • Promo Qty: Enter the quantity of the bonus item to be awarded.
    • Validity: Set the ‘Valid From’ and ‘Valid To’ dates for the promotion period.
  6. Save: Click OK to save and activate the trade deal. The system will then track customer purchases of the sales item between the valid dates.
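To make the rule concrete, here is a sketch of how such a rule could be represented and evaluated in code. The data model is hypothetical, purely to illustrate the cumulative, multi-invoice tracking described in step 6:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class MultiTransactionBonusRule:
    sales_item: str
    sales_qty: int          # cumulative quantity that triggers the bonus
    promo_item: str
    promo_qty: int
    valid_from: date
    valid_to: date

def earned_bonus(rule: MultiTransactionBonusRule,
                 invoices: list[tuple[date, str, int]]) -> int:
    """Sum quantities of the sales item across invoices inside the
    validity window; return how many promo items were earned."""
    total = sum(qty for day, item, qty in invoices
                if item == rule.sales_item
                and rule.valid_from <= day <= rule.valid_to)
    # Each full multiple of sales_qty earns another batch of promo items.
    return (total // rule.sales_qty) * rule.promo_qty
```

The key property this captures is that qualification is cumulative across invoices, not per transaction, and purchases outside the validity dates do not count.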

Conclusion

By moving beyond basic prototypes and embracing a production-ready architecture, we have transformed the static SalesWorx product documentation into a dynamic and intelligent knowledge base. This system, built with a robust ingestion pipeline, a scalable Qdrant vector database, and a sophisticated multi-agent workflow, can understand and answer complex, real-world questions about the product’s features with remarkable accuracy.

This approach doesn’t just make information accessible; it makes it actionable, empowering your teams to make faster, more informed decisions.
