🤖 RAG2GraphRAG - Production-Ready RAG System


A scalable, modular Retrieval-Augmented Generation (RAG) system with clean architecture

Features · Quick Start · Architecture · Documentation · Contributing


📋 Table of Contents

  • Overview
  • Features
  • Architecture
  • Quick Start
  • Usage Guide
  • Configuration
  • Docker Deployment
  • Development
  • API Reference
  • Troubleshooting
  • Extending the System
  • Contributing
  • Dependencies
  • License

🎯 Overview

RAG2GraphRAG is a production-ready RAG (Retrieval-Augmented Generation) system that demonstrates best practices for building scalable AI applications. Built with LangChain, Streamlit, and Docker, it provides a complete solution for document-based question-answering with a beautiful, interactive web interface.

What is RAG?

RAG (Retrieval-Augmented Generation) combines the power of Large Language Models (LLMs) with external knowledge retrieval. Instead of relying solely on pre-trained knowledge, RAG:

  1. Retrieves relevant information from your documents
  2. Augments the LLM's context with retrieved information
  3. Generates accurate, source-backed answers
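
The three steps above need only a few lines of code. The sketch below is illustrative rather than this project's exact implementation; it assumes the langchain-openai and langchain-community packages and an already-populated Chroma store.

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Open an existing vector store (built from your documents)
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

question = "What is RAG?"

# 1. Retrieve: find the chunks most similar to the question
docs = vectorstore.similarity_search(question, k=3)

# 2. Augment: place the retrieved text into the prompt
context = "\n\n".join(doc.page_content for doc in docs)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# 3. Generate: ask the LLM with the augmented prompt
answer = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.0).invoke(prompt)
print(answer.content)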

Why This Project?

  • Production-Ready: Built with scalability and maintainability in mind
  • Modular Design: Clean architecture with separated concerns
  • Easy to Extend: Simple to add new features and integrations
  • Well Documented: Comprehensive documentation and examples
  • Docker Support: One-command deployment

✨ Features

Core Capabilities

  • 📄 Multi-Format Document Support: TXT, PDF, CSV, and Markdown files
  • 🔍 Intelligent Chunking: Configurable text splitting with overlap
  • 🧠 Embedding Generation: OpenAI embeddings with extensible support
  • 💾 Vector Storage: ChromaDB integration (FAISS support available)
  • 🔎 Semantic Search: Fast similarity search over document embeddings
  • 💬 Interactive Q&A: Beautiful Streamlit interface with chat history
  • 📊 LLM vs RAG Comparison: Side-by-side comparison of responses
  • 🎛️ Configurable: Environment-based configuration with sensible defaults

Technical Features

  • 🏗️ Modular Architecture: Service-oriented design with clear separation
  • 🔧 Type Safety: Pydantic models for data validation
  • 📝 Comprehensive Logging: Structured logging throughout
  • 🐳 Docker Support: Containerized deployment with health checks
  • 🔒 Security: Environment-based secrets management
  • ⚡ Performance: Optimized for production workloads

🏗️ Architecture

System Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Streamlit Web Interface                    │
└───────────────────────────┬─────────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────────┐
│                      Application Layer                       │
│                    (app/main.py)                             │
└───────────────────────────┬─────────────────────────────────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
┌───────▼────────┐  ┌───────▼────────┐  ┌───────▼────────┐
│   Ingestion    │  │    Chunking    │  │   Embedding    │
│    Service     │  │    Service     │  │    Service     │
└───────┬────────┘  └───────┬────────┘  └───────┬────────┘
        │                   │                   │
        └───────────────────┼───────────────────┘
                            │
                    ┌───────▼────────┐
                    │  Vector Store  │
                    │    Service     │
                    └───────┬────────┘
                            │
                    ┌───────▼────────┐
                    │   Retrieval    │
                    │    Service     │
                    └───────┬────────┘
                            │
                    ┌───────▼────────┐
                    │   ChromaDB     │
                    │  Vector Store  │
                    └────────────────┘

Component Overview

Component            | Responsibility                         | Technology
---------------------|----------------------------------------|--------------------------
Ingestion Service    | Document loading from various sources  | LangChain Loaders
Chunking Service     | Text splitting and chunking            | LangChain Text Splitters
Embedding Service    | Vector generation                      | OpenAI Embeddings
Vector Store Service | Vector database management             | ChromaDB
Retrieval Service    | RAG pipeline orchestration             | LangChain Chains

🚀 Quick Start

Prerequisites

  • Python 3.11+ (for local development)
  • Docker & Docker Compose (for containerized deployment)
  • OpenAI API Key (create one in your OpenAI account dashboard)

Option 1: Docker Deployment (Recommended) ⭐

The fastest way to get started: three commands!

# 1. Clone the repository
git clone <repository-url>
cd rag2graphrag

# 2. Set up environment variables
cp .env.example .env
# Edit .env and add: OPENAI_API_KEY=your_key_here

# 3. Start the application
docker-compose up --build

That's it! Open your browser to http://localhost:8501 🎉

Option 2: Local Development

# 1. Clone and navigate
git clone <repository-url>
cd rag2graphrag

# 2. Create virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Set environment variable
export OPENAI_API_KEY=your_key_here
# Or create .env file from .env.example

# 5. Run the application
streamlit run app/main.py

Option 3: Using the Run Script

# Make executable (if needed)
chmod +x run.sh

# Run the script
./run.sh

The script will:

  • ✅ Check/create virtual environment
  • ✅ Install dependencies
  • ✅ Create necessary directories
  • ✅ Start the Streamlit app

📖 Usage Guide

Step 1: Upload Documents

  1. Navigate to the "Upload Documents" tab
  2. Click "Upload documents" and select your files
    • Supported formats: .txt, .pdf, .csv, .md
  3. Click "🔄 Process Documents"
  4. Wait for processing to complete (you'll see progress indicators)

What happens during processing:

  • 📄 Documents are loaded and parsed
  • ✂️ Text is split into manageable chunks
  • 🔢 Embeddings are generated for each chunk
  • 💾 Chunks are stored in the vector database
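
In code, the pipeline above corresponds roughly to the following sketch, composed from the services documented in the API Reference section (exact signatures may differ in your checkout):

from app.services import (
    DocumentIngestionService,
    ChunkingService,
    EmbeddingService,
    VectorStoreService,
)

# Load, chunk, embed, and store: the same four steps listed above
documents = DocumentIngestionService().load_from_directory("./documents")
chunks = ChunkingService(chunk_size=1000, chunk_overlap=200).chunk_documents(documents)
embedding = EmbeddingService(embedding_type="openai", api_key="your_key")
vs = VectorStoreService(
    vectorstore_type="chroma",
    embeddings=embedding.get_embeddings(),
).create_from_documents(chunks)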

Step 2: Query Your Documents

  1. Navigate to the "Query System" tab
  2. Type your question in the chat input
  3. View the answer with source citations
  4. Explore example questions using the quick buttons

Example Questions:

  • "What is RAG and how does it work?"
  • "What are the key components of LangChain?"
  • "How does machine learning relate to AI?"

Step 3: Compare LLM vs RAG

  1. Navigate to the "LLM vs RAG Comparison" tab
  2. Enter a question
  3. See side-by-side comparison:
    • Left: Plain LLM response (training data only)
    • Right: RAG-enhanced response (your documents + LLM)
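
Under the hood, the comparison amounts to sending the same question down both paths. A minimal sketch, reusing the retrieval object shown in the API Reference section:

from langchain_openai import ChatOpenAI

question = "What are the key components of LangChain?"
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.0)

plain_answer = llm.invoke(question).content        # training data only
rag_answer = retrieval.invoke(question)["result"]  # your documents + LLM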

Step 4: Monitor System

  1. Navigate to the "System Status" tab
  2. View:
    • Processing statistics
    • Configuration settings
    • System health
    • RAG pipeline explanation

⚙️ Configuration

Environment Variables

Create a .env file in the project root:

# OpenAI Configuration
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_MODEL=gpt-3.5-turbo
OPENAI_TEMPERATURE=0.0

# Embedding Configuration
EMBEDDING_MODEL=text-embedding-ada-002

# Chunking Configuration
CHUNK_SIZE=1000
CHUNK_OVERLAP=200

# Retrieval Configuration
TOP_K=3
SEARCH_TYPE=similarity

# Vector Store Configuration
VECTORSTORE_TYPE=chroma
VECTORSTORE_PERSIST_DIRECTORY=./chroma_db

# Application Configuration
DOCUMENTS_DIRECTORY=./documents
LOG_LEVEL=INFO

Streamlit Sidebar Configuration

You can also configure settings directly in the Streamlit UI:

  • Model Settings: Choose LLM model and temperature
  • Chunking Settings: Adjust chunk size and overlap
  • Retrieval Settings: Set number of documents to retrieve

Configuration Priority

  1. Streamlit UI settings (highest priority)
  2. Environment variables (.env file)
  3. Default values in app/config/settings.py
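
As an illustration of this resolution order, here is a minimal sketch of what app/config/settings.py might look like with pydantic-settings (the actual file may differ): defaults live in the class, the .env file and environment variables override them, and UI values are applied as runtime overrides.

from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # The .env file and environment variables override these defaults
    model_config = SettingsConfigDict(env_file=".env")

    openai_api_key: str = ""
    openai_model: str = "gpt-3.5-turbo"
    chunk_size: int = 1000
    chunk_overlap: int = 200
    top_k: int = 3

settings = Settings()

# Streamlit UI values (highest priority) applied as a runtime override
settings = settings.model_copy(update={"chunk_size": 500})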

🐳 Docker Deployment

Docker Compose

The docker-compose.yml file provides a complete setup:

services:
  rag-app:
    build: .
    ports:
      - "8501:8501"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    volumes:
      - ./documents:/app/documents
      - ./chroma_db:/app/chroma_db

Docker Commands

# Build and start
docker-compose up --build

# Start in background
docker-compose up -d

# View logs
docker-compose logs -f

# Stop containers
docker-compose down

# Rebuild without cache
docker-compose build --no-cache

Dockerfile Features

  • ✅ Python 3.11 slim base image
  • ✅ Optimized layer caching
  • ✅ Health checks included
  • ✅ Non-root user support
  • ✅ Minimal image size

💻 Development

Project Structure

rag2graphrag/
├── app/                          # Main application package
│   ├── __init__.py
│   ├── main.py                   # Streamlit application
│   ├── config/                   # Configuration management
│   │   ├── __init__.py
│   │   └── settings.py           # Settings with Pydantic
│   ├── models/                   # Data models
│   │   ├── __init__.py
│   │   └── schemas.py            # Pydantic schemas
│   ├── services/                 # Business logic services
│   │   ├── __init__.py
│   │   ├── ingestion.py         # Document loading
│   │   ├── chunking.py          # Text chunking
│   │   ├── embedding.py         # Embedding generation
│   │   ├── vectorstore.py       # Vector database
│   │   └── retrieval.py         # RAG retrieval
│   └── utils/                    # Utilities
│       ├── __init__.py
│       └── helpers.py           # Helper functions
├── documents/                    # Document storage
├── chroma_db/                    # Vector database storage
├── Dockerfile                    # Docker configuration
├── docker-compose.yml            # Docker Compose config
├── requirements.txt              # Python dependencies
├── .env.example                 # Environment template
├── run.sh                       # Quick start script
├── README.md                    # This file
├── QUICKSTART.md                # Quick start guide
└── ARCHITECTURE.md              # Architecture details

Running Tests

# Test application structure
python3 test_app.py

# Test with API key
python3 test_with_api.py

Code Quality

The project follows these best practices:

  • ✅ Type hints throughout
  • ✅ Pydantic for data validation
  • ✅ Comprehensive error handling
  • ✅ Structured logging
  • ✅ Docstrings for all functions
  • ✅ Modular, testable code

📚 API Reference

Services

DocumentIngestionService

from app.services import DocumentIngestionService

# Initialize service
ingestion = DocumentIngestionService()

# Load from files
documents = ingestion.load_from_file_path("path/to/file.txt")

# Load from uploaded files (Streamlit)
documents = ingestion.load_from_uploaded_files(uploaded_files)

# Load from directory
documents = ingestion.load_from_directory("./documents")

ChunkingService

from app.services import ChunkingService

# Initialize with custom settings
chunking = ChunkingService(
    chunk_size=1000,
    chunk_overlap=200,
    splitter_type="recursive"
)

# Chunk documents
chunks = chunking.chunk_documents(documents)

EmbeddingService

from app.services import EmbeddingService

# Initialize with OpenAI
embedding = EmbeddingService(
    embedding_type="openai",
    api_key="your_key"
)

# Generate embeddings
embeddings = embedding.embed_documents(["text1", "text2"])

VectorStoreService

from app.services import VectorStoreService

# Initialize
vectorstore = VectorStoreService(
    vectorstore_type="chroma",
    embeddings=embedding.get_embeddings()
)

# Create from documents
vs = vectorstore.create_from_documents(chunks)

RetrievalService

from app.services import RetrievalService

# Initialize
retrieval = RetrievalService(
    vectorstore=vs,
    llm=llm,
    k=3
)

# Query
result = retrieval.invoke("What is RAG?")
print(result["result"])  # Answer
print(result["source_documents"])  # Sources

🔧 Troubleshooting

Common Issues

Issue: ModuleNotFoundError: No module named 'app'

Solution: Make sure you're running from the project root and PYTHONPATH is set correctly. In Docker, this is handled automatically.

Issue: OpenAI API key not found

Solution:

  • Check that .env file exists and contains OPENAI_API_KEY
  • Or export it: export OPENAI_API_KEY=your_key
  • Verify in Streamlit sidebar that API key is set

Issue: Port 8501 already in use

Solution:

# Find process using port
lsof -ti:8501

# Kill process (replace PID)
kill -9 <PID>

# Or use different port
streamlit run app/main.py --server.port=8502

Issue: Docker build fails

Solution:

# Clean Docker cache
docker system prune -a

# Rebuild without cache
docker-compose build --no-cache

# Check Docker logs
docker-compose logs

Issue: Documents not processing

Solution:

  • Check file format is supported (TXT, PDF, CSV, MD)
  • Verify file is not corrupted
  • Check logs for specific error messages
  • Ensure OpenAI API key is valid and has credits

Getting Help

  1. Check the Troubleshooting section
  2. Review QUICKSTART.md for common issues
  3. Check ARCHITECTURE.md for system details
  4. Open an issue on GitHub with:
    • Error message
    • Steps to reproduce
    • Environment details

🚧 Extending the System

Adding New Document Types

  1. Add loader to app/services/ingestion.py:
from langchain_community.document_loaders import YourLoader

SUPPORTED_EXTENSIONS = {
    ".your_ext": YourLoader,
    # ... existing loaders
}
  2. Add the dependency to requirements.txt
  3. Test with your file type (a concrete example is sketched below)
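
For example, a hypothetical registration of LangChain's Docx2txtLoader for .docx files (the loader exists in langchain-community; it needs the docx2txt package added to requirements.txt):

from langchain_community.document_loaders import Docx2txtLoader

# Map the new extension to its loader (assumes the mapping shown above)
SUPPORTED_EXTENSIONS[".docx"] = Docx2txtLoader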

Adding New Embedding Models

  1. Add to app/services/embedding.py:
EMBEDDING_TYPES = {
    "your_model": YourEmbeddingClass,
    # ... existing models
}
  2. Update the configuration in app/config/settings.py (see the sketch below)
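
For example, a hedged sketch registering LangChain's local HuggingFace embeddings (requires the sentence-transformers package):

from langchain_community.embeddings import HuggingFaceEmbeddings

# Register a local embedding backend (assumes the mapping shown above)
EMBEDDING_TYPES["huggingface"] = HuggingFaceEmbeddings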

Adding New Vector Stores

  1. Add to app/services/vectorstore.py:
VECTORSTORE_TYPES = {
    "your_store": YourVectorStoreClass,
    # ... existing stores
}
  2. Update the configuration (see the sketch below)
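
Since FAISS support is already mentioned under Features, its registration would plausibly look like this (requires the faiss-cpu package):

from langchain_community.vectorstores import FAISS

# Register FAISS alongside Chroma (assumes the mapping shown above)
VECTORSTORE_TYPES["faiss"] = FAISS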

Customizing Prompts

Edit the prompt template in app/services/retrieval.py:

DEFAULT_PROMPT_TEMPLATE = """Your custom prompt here...
Context: {context}
Question: {question}
Answer:"""
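
If you rebuild the chain yourself, here is a minimal sketch of wiring this template into a LangChain PromptTemplate (the project may already do this internally in retrieval.py):

from langchain_core.prompts import PromptTemplate

prompt = PromptTemplate(
    template=DEFAULT_PROMPT_TEMPLATE,
    input_variables=["context", "question"],
)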

🤝 Contributing

Contributions are welcome! Here's how you can help:

How to Contribute

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes
  4. Add tests (if applicable)
  5. Commit your changes: git commit -m 'Add amazing feature'
  6. Push to the branch: git push origin feature/amazing-feature
  7. Open a Pull Request

Contribution Guidelines

  • ✅ Follow existing code style
  • ✅ Add docstrings to new functions
  • ✅ Update documentation as needed
  • ✅ Test your changes thoroughly
  • ✅ Keep commits atomic and well-described

Areas for Contribution

  • 🐛 Bug fixes
  • ✨ New features
  • 📚 Documentation improvements
  • 🧪 Test coverage
  • 🎨 UI/UX enhancements
  • ⚡ Performance optimizations

📦 Dependencies

Core Dependencies

Package             | Version  | Purpose
--------------------|----------|------------------------
langchain           | >=0.1.0  | LLM framework
langchain-openai    | >=0.0.5  | OpenAI integration
langchain-community | >=0.0.20 | Community integrations
chromadb            | >=0.4.0  | Vector database
streamlit           | >=1.28.0 | Web framework
pydantic            | >=2.0.0  | Data validation
openai              | >=1.0.0  | OpenAI API client

Optional Dependencies

  • pypdf - PDF processing
  • unstructured - Advanced document parsing

See requirements.txt for complete list.


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • LangChain - Amazing framework for LLM applications
  • Streamlit - Beautiful web framework
  • Chroma - Fast vector database
  • OpenAI - Powerful LLM and embedding models

📞 Support & Contact

For questions and support, open an issue on GitHub.

Made with ❤️ for the AI community

Star this repo if you find it useful!
