# RAG2GraphRAG

A scalable, modular Retrieval-Augmented Generation (RAG) system with clean architecture
Features • Quick Start • Architecture • Documentation • Contributing
## Table of Contents

- Overview
- Features
- Architecture
- Quick Start
- Usage Guide
- Configuration
- Docker Deployment
- Development
- Project Structure
- API Reference
- Troubleshooting
- Contributing
- License
## Overview

RAG2GraphRAG is a production-ready Retrieval-Augmented Generation (RAG) system that demonstrates best practices for building scalable AI applications. Built with LangChain, Streamlit, and Docker, it provides a complete solution for document-based question answering with a beautiful, interactive web interface.
RAG (Retrieval-Augmented Generation) combines the power of Large Language Models (LLMs) with external knowledge retrieval. Instead of relying solely on pre-trained knowledge, RAG:
- Retrieves relevant information from your documents
- Augments the LLM's context with retrieved information
- Generates accurate, source-backed answers
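To make these three steps concrete, here is a minimal, self-contained sketch of the loop. The toy word-overlap retriever and the stubbed `call_llm` are illustrative stand-ins, not this repo's actual code (which uses LangChain and real embeddings):

```python
from typing import List

def retrieve(question: str, chunks: List[str], k: int = 3) -> List[str]:
    """Toy retriever: rank stored chunks by word overlap with the question."""
    words = set(question.lower().split())
    return sorted(chunks, key=lambda c: -len(words & set(c.lower().split())))[:k]

def call_llm(prompt: str) -> str:
    """Stand-in for a real chat-model call (e.g., via langchain-openai)."""
    return f"[answer grounded in a prompt of {len(prompt)} chars]"

def rag_answer(question: str, chunks: List[str]) -> str:
    context = "\n".join(retrieve(question, chunks))            # 1. Retrieve
    prompt = f"Context:\n{context}\n\nQuestion: {question}"    # 2. Augment
    return call_llm(prompt)                                    # 3. Generate

print(rag_answer("What is RAG?", ["RAG retrieves documents.", "Cats purr."]))
```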
## Features

- ✅ Production-Ready: Built with scalability and maintainability in mind
- ✅ Modular Design: Clean architecture with separated concerns
- ✅ Easy to Extend: Simple to add new features and integrations
- ✅ Well Documented: Comprehensive documentation and examples
- ✅ Docker Support: One-command deployment
- 📄 Multi-Format Document Support: TXT, PDF, CSV, and Markdown files
- 🔍 Intelligent Chunking: Configurable text splitting with overlap
- 🧠 Embedding Generation: OpenAI embeddings with extensible support
- 💾 Vector Storage: ChromaDB integration (FAISS support available)
- 🔎 Semantic Search: Fast similarity search over document embeddings (see the similarity sketch after this list)
- 💬 Interactive Q&A: Beautiful Streamlit interface with chat history
- 📊 LLM vs RAG Comparison: Side-by-side comparison of responses
- 🎛️ Configurable: Environment-based configuration with sensible defaults
- 🏗️ Modular Architecture: Service-oriented design with clear separation
- 🔧 Type Safety: Pydantic models for data validation
- 📝 Comprehensive Logging: Structured logging throughout
- 🐳 Docker Support: Containerized deployment with health checks
- 🔒 Security: Environment-based secrets management
- ⚡ Performance: Optimized for production workloads
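The semantic-search bullet above boils down to embedding the query and ranking stored chunk vectors by similarity. A minimal illustration using cosine similarity (hypothetical 3-d vectors; real embedding models return vectors with roughly 1,536 dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical chunk embeddings (illustrative values only).
chunks = {"intro to RAG": [0.9, 0.1, 0.0], "cooking tips": [0.0, 0.2, 0.9]}
query = [0.8, 0.2, 0.1]

best = max(chunks, key=lambda name: cosine(query, chunks[name]))
print(best)  # -> "intro to RAG"
```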
## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                   Streamlit Web Interface                   │
└───────────────────────────┬─────────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────────┐
│                      Application Layer                      │
│                        (app/main.py)                        │
└───────────────────────────┬─────────────────────────────────┘
                            │
         ┌──────────────────┼──────────────────┐
         │                  │                  │
 ┌───────▼────────┐ ┌───────▼────────┐ ┌───────▼────────┐
 │   Ingestion    │ │    Chunking    │ │   Embedding    │
 │    Service     │ │    Service     │ │    Service     │
 └───────┬────────┘ └───────┬────────┘ └───────┬────────┘
         │                  │                  │
         └──────────────────┼──────────────────┘
                            │
                    ┌───────▼────────┐
                    │  Vector Store  │
                    │    Service     │
                    └───────┬────────┘
                            │
                    ┌───────▼────────┐
                    │   Retrieval    │
                    │    Service     │
                    └───────┬────────┘
                            │
                    ┌───────▼────────┐
                    │    ChromaDB    │
                    │  Vector Store  │
                    └────────────────┘
```
| Component | Responsibility | Technology |
|---|---|---|
| Ingestion Service | Document loading from various sources | LangChain Loaders |
| Chunking Service | Text splitting and chunking | LangChain Text Splitters |
| Embedding Service | Vector generation | OpenAI Embeddings |
| Vector Store Service | Vector database management | ChromaDB |
| Retrieval Service | RAG pipeline orchestration | LangChain Chains |
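Putting the table together, a hypothetical end-to-end wiring of the five services might look like this. The class and method names follow the API Reference below; the `ChatOpenAI` LLM is an assumption for illustration, so treat this as a sketch rather than the exact contents of `app/main.py`:

```python
from langchain_openai import ChatOpenAI  # assumed LLM choice for this sketch
from app.services import (
    DocumentIngestionService, ChunkingService,
    EmbeddingService, VectorStoreService, RetrievalService,
)

# 1. Load and chunk documents
docs = DocumentIngestionService().load_from_directory("./documents")
chunks = ChunkingService(chunk_size=1000, chunk_overlap=200).chunk_documents(docs)

# 2. Embed chunks and store them in ChromaDB
embedding = EmbeddingService(embedding_type="openai", api_key="your_key")
vectorstore = VectorStoreService(
    vectorstore_type="chroma",
    embeddings=embedding.get_embeddings(),
)
vs = vectorstore.create_from_documents(chunks)

# 3. Retrieve and answer
llm = ChatOpenAI(model="gpt-3.5-turbo")
retrieval = RetrievalService(vectorstore=vs, llm=llm, k=3)
result = retrieval.invoke("What is RAG?")
print(result["result"])
```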
## Quick Start

### Prerequisites

- Python 3.11+ (for local development)
- Docker & Docker Compose (for containerized deployment)
- OpenAI API key (create one at https://platform.openai.com/api-keys)
### Option 1: Docker (Recommended)

The fastest way to get started: just three commands!
```bash
# 1. Clone the repository
git clone <repository-url>
cd rag2graphrag

# 2. Set up environment variables
cp .env.example .env
# Edit .env and add: OPENAI_API_KEY=your_key_here

# 3. Start the application
docker-compose up --build
```

That's it! Open your browser to http://localhost:8501 🎉
### Option 2: Local Development

```bash
# 1. Clone and navigate
git clone <repository-url>
cd rag2graphrag

# 2. Create virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Set environment variable
export OPENAI_API_KEY=your_key_here
# Or create .env file from .env.example

# 5. Run the application
streamlit run app/main.py
```

### Option 3: Quick Start Script

```bash
# Make executable (if needed)
chmod +x run.sh

# Run the script
./run.sh
```

The script will:
- ✅ Check/create virtual environment
- ✅ Install dependencies
- ✅ Create necessary directories
- ✅ Start the Streamlit app
## Usage Guide

### 1. Upload Documents

- Navigate to the "Upload Documents" tab
- Click "Upload documents" and select your files
- Supported formats: `.txt`, `.pdf`, `.csv`, `.md`
- Click "🔄 Process Documents"
- Wait for processing to complete (you'll see progress indicators)
What happens during processing:
- 📄 Documents are loaded and parsed
- ✂️ Text is split into manageable chunks
- 🔢 Embeddings are generated for each chunk
- 💾 Chunks are stored in the vector database
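To make the chunking step above concrete, here is a toy character-based splitter (illustrative only; the app actually uses LangChain's recursive text splitter via the Chunking Service):

```python
def split_with_overlap(text: str, chunk_size: int = 10, overlap: int = 4) -> list[str]:
    """Slide a window of chunk_size characters, stepping by chunk_size - overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

print(split_with_overlap("abcdefghijklmnop"))
# ['abcdefghij', 'ghijklmnop'] -- consecutive chunks share 4 characters ('ghij')
```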
### 2. Ask Questions

- Navigate to the "Query System" tab
- Type your question in the chat input
- View the answer with source citations
- Explore example questions using the quick buttons
Example Questions:
- "What is RAG and how does it work?"
- "What are the key components of LangChain?"
- "How does machine learning relate to AI?"
### 3. Compare LLM vs RAG

- Navigate to the "LLM vs RAG Comparison" tab
- Enter a question
- See a side-by-side comparison:
  - Left: plain LLM response (training data only)
  - Right: RAG-enhanced response (your documents + LLM)
### 4. Check System Status

- Navigate to the "System Status" tab
- View:
  - Processing statistics
  - Configuration settings
  - System health
  - RAG pipeline explanation
## Configuration

Create a `.env` file in the project root:

```env
# OpenAI Configuration
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_MODEL=gpt-3.5-turbo
OPENAI_TEMPERATURE=0.0
# Embedding Configuration
EMBEDDING_MODEL=text-embedding-ada-002
# Chunking Configuration
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
# Retrieval Configuration
TOP_K=3
SEARCH_TYPE=similarity
# Vector Store Configuration
VECTORSTORE_TYPE=chroma
VECTORSTORE_PERSIST_DIRECTORY=./chroma_db
# Application Configuration
DOCUMENTS_DIRECTORY=./documents
LOG_LEVEL=INFO
```

You can also configure settings directly in the Streamlit UI:
- Model Settings: Choose LLM model and temperature
- Chunking Settings: Adjust chunk size and overlap
- Retrieval Settings: Set number of documents to retrieve
Settings are resolved in this order of precedence:

1. Streamlit UI settings (highest priority)
2. Environment variables (`.env` file)
3. Default values in `app/config/settings.py`
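For reference, a settings class along these lines would implement the environment-variable layer of that precedence. This is a hypothetical sketch using `pydantic-settings`, with field names mirroring the `.env` keys above; the real `app/config/settings.py` may differ:

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # Values are read from the environment / .env file; these are the defaults.
    model_config = SettingsConfigDict(env_file=".env")

    openai_api_key: str = ""
    openai_model: str = "gpt-3.5-turbo"
    openai_temperature: float = 0.0
    embedding_model: str = "text-embedding-ada-002"
    chunk_size: int = 1000
    chunk_overlap: int = 200
    top_k: int = 3

settings = Settings()
print(settings.chunk_size)  # 1000 unless CHUNK_SIZE is set in the environment
```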
## Docker Deployment

The `docker-compose.yml` file provides a complete setup:

```yaml
services:
  rag-app:
    build: .
    ports:
      - "8501:8501"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    volumes:
      - ./documents:/app/documents
      - ./chroma_db:/app/chroma_db
```

Common commands:

```bash
# Build and start
docker-compose up --build
# Start in background
docker-compose up -d
# View logs
docker-compose logs -f
# Stop containers
docker-compose down
# Rebuild without cache
docker-compose build --no-cache
```

The Dockerfile includes:

- ✅ Python 3.11 slim base image
- ✅ Optimized layer caching
- ✅ Health checks included
- ✅ Non-root user support
- ✅ Minimal image size
## Project Structure

```
rag2graphrag/
├── app/ # Main application package
│ ├── __init__.py
│ ├── main.py # Streamlit application
│ ├── config/ # Configuration management
│ │ ├── __init__.py
│ │ └── settings.py # Settings with Pydantic
│ ├── models/ # Data models
│ │ ├── __init__.py
│ │ └── schemas.py # Pydantic schemas
│ ├── services/ # Business logic services
│ │ ├── __init__.py
│ │ ├── ingestion.py # Document loading
│ │ ├── chunking.py # Text chunking
│ │ ├── embedding.py # Embedding generation
│ │ ├── vectorstore.py # Vector database
│ │ └── retrieval.py # RAG retrieval
│ └── utils/ # Utilities
│ ├── __init__.py
│ └── helpers.py # Helper functions
├── documents/ # Document storage
├── chroma_db/ # Vector database storage
├── Dockerfile # Docker configuration
├── docker-compose.yml # Docker Compose config
├── requirements.txt # Python dependencies
├── .env.example # Environment template
├── run.sh # Quick start script
├── README.md # This file
├── QUICKSTART.md # Quick start guide
└── ARCHITECTURE.md # Architecture details
```
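Since `app/models/schemas.py` holds the Pydantic schemas, a chunk model might take a shape like the following. This is a hypothetical sketch for illustration; the real field names may differ:

```python
from typing import Optional
from pydantic import BaseModel, Field

class ChunkMetadata(BaseModel):
    source: str                      # originating file, e.g. "intro.txt"
    page: Optional[int] = None       # page number, for PDFs

class DocumentChunk(BaseModel):
    text: str = Field(min_length=1)  # validation: reject empty chunks
    metadata: ChunkMetadata

chunk = DocumentChunk(text="RAG retrieves...", metadata=ChunkMetadata(source="intro.txt"))
print(chunk.model_dump())
```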
## Development

### Running Tests

```bash
# Test application structure
python3 test_app.py

# Test with API key
python3 test_with_api.py
```

The project follows these best practices:
- ✅ Type hints throughout
- ✅ Pydantic for data validation
- ✅ Comprehensive error handling
- ✅ Structured logging
- ✅ Docstrings for all functions
- ✅ Modular, testable code
## API Reference

### Document Ingestion

```python
from app.services import DocumentIngestionService
# Initialize service
ingestion = DocumentIngestionService()
# Load from files
documents = ingestion.load_from_file_path("path/to/file.txt")
# Load from uploaded files (Streamlit)
documents = ingestion.load_from_uploaded_files(uploaded_files)
# Load from directory
documents = ingestion.load_from_directory("./documents")
```

### Chunking

```python
from app.services import ChunkingService
# Initialize with custom settings
chunking = ChunkingService(
chunk_size=1000,
chunk_overlap=200,
splitter_type="recursive"
)
# Chunk documents
chunks = chunking.chunk_documents(documents)
```

### Embeddings

```python
from app.services import EmbeddingService
# Initialize with OpenAI
embedding = EmbeddingService(
embedding_type="openai",
api_key="your_key"
)
# Generate embeddings
embeddings = embedding.embed_documents(["text1", "text2"])
```

### Vector Store

```python
from app.services import VectorStoreService
# Initialize
vectorstore = VectorStoreService(
vectorstore_type="chroma",
embeddings=embedding.get_embeddings()
)
# Create from documents
vs = vectorstore.create_from_documents(chunks)
```

### Retrieval

```python
from app.services import RetrievalService
# Initialize
retrieval = RetrievalService(
vectorstore=vs,
llm=llm,
k=3
)
# Query
result = retrieval.invoke("What is RAG?")
print(result["result"]) # Answer
print(result["source_documents"]) # SourcesSolution: Make sure you're running from the project root and PYTHONPATH is set correctly. In Docker, this is handled automatically.
### OpenAI API key not found

Solution:

- Check that the `.env` file exists and contains `OPENAI_API_KEY`
- Or export it: `export OPENAI_API_KEY=your_key`
- Verify in the Streamlit sidebar that the API key is set
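A quick way to confirm the key is actually visible to Python:

```python
import os

# Prints True if the variable is set in this process's environment.
print(bool(os.environ.get("OPENAI_API_KEY")))
```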
### Port 8501 already in use

Solution:

```bash
# Find the process using the port
lsof -ti:8501

# Kill the process (replace PID)
kill -9 <PID>

# Or use a different port
streamlit run app/main.py --server.port=8502
```

### Docker build issues

Solution:
```bash
# Clean Docker cache
docker system prune -a

# Rebuild without cache
docker-compose build --no-cache

# Check Docker logs
docker-compose logs
```

### Documents not processing

Solution:
- Check file format is supported (TXT, PDF, CSV, MD)
- Verify file is not corrupted
- Check logs for specific error messages
- Ensure OpenAI API key is valid and has credits
### Getting Help

- Check the Troubleshooting section above
- Review QUICKSTART.md for common issues
- Check ARCHITECTURE.md for system details
- Open an issue on GitHub with:
  - Error message
  - Steps to reproduce
  - Environment details
## Extending the System

### Adding a New Document Loader

1. Add the loader to `app/services/ingestion.py`:

```python
from langchain_community.document_loaders import YourLoader

SUPPORTED_EXTENSIONS = {
    ".your_ext": YourLoader,
    # ... existing loaders
}
```

2. Add the dependency to `requirements.txt`
3. Test with your file type
### Adding a New Embedding Model

1. Add it to `app/services/embedding.py`:

```python
EMBEDDING_TYPES = {
    "your_model": YourEmbeddingClass,
    # ... existing models
}
```

2. Update the configuration in `app/config/settings.py`
### Adding a New Vector Store

1. Add it to `app/services/vectorstore.py`:

```python
VECTORSTORE_TYPES = {
    "your_store": YourVectorStoreClass,
    # ... existing stores
}
```

2. Update the configuration
### Customizing the Prompt

Edit the prompt template in `app/services/retrieval.py`:

```python
DEFAULT_PROMPT_TEMPLATE = """Your custom prompt here...

Context: {context}

Question: {question}

Answer:"""
```

## Contributing

Contributions are welcome! Here's how you can help:
1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Make your changes
4. Add tests (if applicable)
5. Commit your changes: `git commit -m 'Add amazing feature'`
6. Push to the branch: `git push origin feature/amazing-feature`
7. Open a Pull Request
- ✅ Follow existing code style
- ✅ Add docstrings to new functions
- ✅ Update documentation as needed
- ✅ Test your changes thoroughly
- ✅ Keep commits atomic and well-described
Areas where contributions are especially welcome:

- 🐛 Bug fixes
- ✨ New features
- 📚 Documentation improvements
- 🧪 Test coverage
- 🎨 UI/UX enhancements
- ⚡ Performance optimizations
## Dependencies

| Package | Version | Purpose |
|---|---|---|
| `langchain` | >=0.1.0 | LLM framework |
| `langchain-openai` | >=0.0.5 | OpenAI integration |
| `langchain-community` | >=0.0.20 | Community integrations |
| `chromadb` | >=0.4.0 | Vector database |
| `streamlit` | >=1.28.0 | Web framework |
| `pydantic` | >=2.0.0 | Data validation |
| `openai` | >=1.0.0 | OpenAI API client |

Optional dependencies:

- `pypdf` - PDF processing
- `unstructured` - Advanced document parsing
See `requirements.txt` for the complete list.
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- LangChain - Amazing framework for LLM applications
- Streamlit - Beautiful web framework
- Chroma - Fast vector database
- OpenAI - Powerful LLM and embedding models
## Support

- 📧 Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions
- 📖 Documentation: See QUICKSTART.md and ARCHITECTURE.md
Made with ❤️ for the AI community
⭐ Star this repo if you find it useful! ⭐