# RAG2GraphRAG

A scalable, modular Retrieval-Augmented Generation (RAG) system with clean architecture
Features • Quick Start • Architecture • Documentation • Contributing
## Table of Contents

- Overview
- Features
- Architecture
- Quick Start
- Usage Guide
- Configuration
- Docker Deployment
- Development
- Project Structure
- API Reference
- Troubleshooting
- Contributing
- License
## Overview

RAG2GraphRAG is a production-ready Retrieval-Augmented Generation (RAG) system that demonstrates best practices for building scalable AI applications. Built with LangChain, Streamlit, and Docker, it provides a complete solution for document-based question answering with a beautiful, interactive web interface.
RAG (Retrieval-Augmented Generation) combines the power of Large Language Models (LLMs) with external knowledge retrieval. Instead of relying solely on pre-trained knowledge, RAG:
- Retrieves relevant information from your documents
- Augments the LLM's context with retrieved information
- Generates accurate, source-backed answers
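To make these three steps concrete, here is a minimal, self-contained sketch of the loop. The toy word-overlap retriever and the stubbed `call_llm` are illustrative stand-ins, not this repo's actual code (which uses LangChain and real embeddings):

```python
from typing import List

def retrieve(question: str, chunks: List[str], k: int = 3) -> List[str]:
    """Toy retriever: rank stored chunks by word overlap with the question."""
    words = set(question.lower().split())
    return sorted(chunks, key=lambda c: -len(words & set(c.lower().split())))[:k]

def call_llm(prompt: str) -> str:
    """Stand-in for a real chat-model call (e.g., via langchain-openai)."""
    return f"[answer grounded in a prompt of {len(prompt)} chars]"

def rag_answer(question: str, chunks: List[str]) -> str:
    context = "\n".join(retrieve(question, chunks))            # 1. Retrieve
    prompt = f"Context:\n{context}\n\nQuestion: {question}"    # 2. Augment
    return call_llm(prompt)                                    # 3. Generate

print(rag_answer("What is RAG?", ["RAG retrieves documents.", "Cats purr."]))
```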
## Features

- ✅ Production-Ready: Built with scalability and maintainability in mind
- ✅ Modular Design: Clean architecture with separated concerns
- ✅ Easy to Extend: Simple to add new features and integrations
- ✅ Well Documented: Comprehensive documentation and examples
- ✅ Docker Support: One-command deployment
- 📄 Multi-Format Document Support: TXT, PDF, CSV, and Markdown files
- 🔍 Intelligent Chunking: Configurable text splitting with overlap
- 🧠 Embedding Generation: OpenAI embeddings with extensible support
- 💾 Vector Storage: ChromaDB integration (FAISS support available)
- 🔎 Semantic Search: Fast similarity search over document embeddings (see the similarity sketch after this list)
- 💬 Interactive Q&A: Beautiful Streamlit interface with chat history
- 📊 LLM vs RAG Comparison: Side-by-side comparison of responses
- 🎛️ Configurable: Environment-based configuration with sensible defaults
- 🏗️ Modular Architecture: Service-oriented design with clear separation
- 🔧 Type Safety: Pydantic models for data validation
- 📝 Comprehensive Logging: Structured logging throughout
- 🐳 Docker Support: Containerized deployment with health checks
- 🔒 Security: Environment-based secrets management
- ⚡ Performance: Optimized for production workloads
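The semantic-search bullet above boils down to embedding the query and ranking stored chunk vectors by similarity. A minimal illustration using cosine similarity (hypothetical 3-d vectors; real embedding models return vectors with roughly 1,536 dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical chunk embeddings (illustrative values only).
chunks = {"intro to RAG": [0.9, 0.1, 0.0], "cooking tips": [0.0, 0.2, 0.9]}
query = [0.8, 0.2, 0.1]

best = max(chunks, key=lambda name: cosine(query, chunks[name]))
print(best)  # -> "intro to RAG"
```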
## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                   Streamlit Web Interface                   │
└───────────────────────────┬─────────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────────┐
│                      Application Layer                      │
│                        (app/main.py)                        │
└───────────────────────────┬─────────────────────────────────┘
                            │
         ┌──────────────────┼──────────────────┐
         │                  │                  │
 ┌───────▼────────┐ ┌───────▼────────┐ ┌───────▼────────┐
 │   Ingestion    │ │    Chunking    │ │   Embedding    │
 │    Service     │ │    Service     │ │    Service     │
 └───────┬────────┘ └───────┬────────┘ └───────┬────────┘
         │                  │                  │
         └──────────────────┼──────────────────┘
                            │
                    ┌───────▼────────┐
                    │  Vector Store  │
                    │    Service     │
                    └───────┬────────┘
                            │
                    ┌───────▼────────┐
                    │   Retrieval    │
                    │    Service     │
                    └───────┬────────┘
                            │
                    ┌───────▼────────┐
                    │    ChromaDB    │
                    │  Vector Store  │
                    └────────────────┘
```
| Component | Responsibility | Technology |
|---|---|---|
| Ingestion Service | Document loading from various sources | LangChain Loaders |
| Chunking Service | Text splitting and chunking | LangChain Text Splitters |
| Embedding Service | Vector generation | OpenAI Embeddings |
| Vector Store Service | Vector database management | ChromaDB |
| Retrieval Service | RAG pipeline orchestration | LangChain Chains |
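Putting the table together, a hypothetical end-to-end wiring of the five services might look like this. The class and method names follow the API Reference below; the `ChatOpenAI` LLM is an assumption for illustration, so treat this as a sketch rather than the exact contents of `app/main.py`:

```python
from langchain_openai import ChatOpenAI  # assumed LLM choice for this sketch
from app.services import (
    DocumentIngestionService, ChunkingService,
    EmbeddingService, VectorStoreService, RetrievalService,
)

# 1. Load and chunk documents
docs = DocumentIngestionService().load_from_directory("./documents")
chunks = ChunkingService(chunk_size=1000, chunk_overlap=200).chunk_documents(docs)

# 2. Embed chunks and store them in ChromaDB
embedding = EmbeddingService(embedding_type="openai", api_key="your_key")
vectorstore = VectorStoreService(
    vectorstore_type="chroma",
    embeddings=embedding.get_embeddings(),
)
vs = vectorstore.create_from_documents(chunks)

# 3. Retrieve and answer
llm = ChatOpenAI(model="gpt-3.5-turbo")
retrieval = RetrievalService(vectorstore=vs, llm=llm, k=3)
result = retrieval.invoke("What is RAG?")
print(result["result"])
```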
## Quick Start

### Prerequisites

- Python 3.11+ (for local development)
- Docker & Docker Compose (for containerized deployment)
- OpenAI API key (create one at https://platform.openai.com/api-keys)
### Option 1: Docker (Recommended)

The fastest way to get started: just three commands!
```bash
# 1. Clone the repository
git clone <repository-url>
cd rag2graphrag

# 2. Set up environment variables
cp .env.example .env
# Edit .env and add: OPENAI_API_KEY=your_key_here

# 3. Start the application
docker-compose up --build
```

That's it! Open your browser to http://localhost:8501 🎉
### Option 2: Local Development

```bash
# 1. Clone and navigate
git clone <repository-url>
cd rag2graphrag

# 2. Create virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Set environment variable
export OPENAI_API_KEY=your_key_here
# Or create .env file from .env.example

# 5. Run the application
streamlit run app/main.py
```

### Option 3: Quick Start Script

```bash
# Make executable (if needed)
chmod +x run.sh

# Run the script
./run.sh
```

The script will:
- ✅ Check/create virtual environment
- ✅ Install dependencies
- ✅ Create necessary directories
- ✅ Start the Streamlit app
## Usage Guide

### 1. Upload Documents

- Navigate to the "Upload Documents" tab
- Click "Upload documents" and select your files
- Supported formats: `.txt`, `.pdf`, `.csv`, `.md`
- Click "🔄 Process Documents"
- Wait for processing to complete (you'll see progress indicators)
What happens during processing:
- 📄 Documents are loaded and parsed
- ✂️ Text is split into manageable chunks
- 🔢 Embeddings are generated for each chunk
- 💾 Chunks are stored in the vector database
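To make the chunking step above concrete, here is a toy character-based splitter (illustrative only; the app actually uses LangChain's recursive text splitter via the Chunking Service):

```python
def split_with_overlap(text: str, chunk_size: int = 10, overlap: int = 4) -> list[str]:
    """Slide a window of chunk_size characters, stepping by chunk_size - overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

print(split_with_overlap("abcdefghijklmnop"))
# ['abcdefghij', 'ghijklmnop'] -- consecutive chunks share 4 characters ('ghij')
```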
### 2. Ask Questions

- Navigate to the "Query System" tab
- Type your question in the chat input
- View the answer with source citations
- Explore example questions using the quick buttons
Example Questions:
- "What is RAG and how does it work?"
- "What are the key components of LangChain?"
- "How does machine learning relate to AI?"
### 3. Compare LLM vs RAG

- Navigate to the "LLM vs RAG Comparison" tab
- Enter a question
- See a side-by-side comparison:
  - Left: plain LLM response (training data only)
  - Right: RAG-enhanced response (your documents + LLM)
### 4. Check System Status

- Navigate to the "System Status" tab
- View:
  - Processing statistics
  - Configuration settings
  - System health
  - RAG pipeline explanation
## Configuration

Create a `.env` file in the project root:

```env
# OpenAI Configuration
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_MODEL=gpt-3.5-turbo
OPENAI_TEMPERATURE=0.0
# Embedding Configuration
EMBEDDING_MODEL=text-embedding-ada-002
# Chunking Configuration
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
# Retrieval Configuration
TOP_K=3
SEARCH_TYPE=similarity
# Vector Store Configuration
VECTORSTORE_TYPE=chroma
VECTORSTORE_PERSIST_DIRECTORY=./chroma_db
# Application Configuration
DOCUMENTS_DIRECTORY=./documents
LOG_LEVEL=INFO
```

You can also configure settings directly in the Streamlit UI:
- Model Settings: Choose LLM model and temperature
- Chunking Settings: Adjust chunk size and overlap
- Retrieval Settings: Set number of documents to retrieve
Settings are resolved in this order of precedence:

1. Streamlit UI settings (highest priority)
2. Environment variables (`.env` file)
3. Default values in `app/config/settings.py`
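For reference, a settings class along these lines would implement the environment-variable layer of that precedence. This is a hypothetical sketch using `pydantic-settings`, with field names mirroring the `.env` keys above; the real `app/config/settings.py` may differ:

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # Values are read from the environment / .env file; these are the defaults.
    model_config = SettingsConfigDict(env_file=".env")

    openai_api_key: str = ""
    openai_model: str = "gpt-3.5-turbo"
    openai_temperature: float = 0.0
    embedding_model: str = "text-embedding-ada-002"
    chunk_size: int = 1000
    chunk_overlap: int = 200
    top_k: int = 3

settings = Settings()
print(settings.chunk_size)  # 1000 unless CHUNK_SIZE is set in the environment
```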
## Docker Deployment

The `docker-compose.yml` file provides a complete setup:

```yaml
services:
  rag-app:
    build: .
    ports:
      - "8501:8501"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    volumes:
      - ./documents:/app/documents
      - ./chroma_db:/app/chroma_db
```

Common commands:

```bash
# Build and start
docker-compose up --build
# Start in background
docker-compose up -d
# View logs
docker-compose logs -f
# Stop containers
docker-compose down
# Rebuild without cache
docker-compose build --no-cache
```

The Dockerfile includes:

- ✅ Python 3.11 slim base image
- ✅ Optimized layer caching
- ✅ Health checks included
- ✅ Non-root user support
- ✅ Minimal image size
## Project Structure

```
rag2graphrag/
├── app/ # Main application package
│ ├── __init__.py
│ ├── main.py # Streamlit application
│ ├── config/ # Configuration management
│ │ ├── __init__.py
│ │ └── settings.py # Settings with Pydantic
│ ├── models/ # Data models
│ │ ├── __init__.py
│ │ └── schemas.py # Pydantic schemas
│ ├── services/ # Business logic services
│ │ ├── __init__.py
│ │ ├── ingestion.py # Document loading
│ │ ├── chunking.py # Text chunking
│ │ ├── embedding.py # Embedding generation
│ │ ├── vectorstore.py # Vector database
│ │ └── retrieval.py # RAG retrieval
│ └── utils/ # Utilities
│ ├── __init__.py
│ └── helpers.py # Helper functions
├── documents/ # Document storage
├── chroma_db/ # Vector database storage
├── Dockerfile # Docker configuration
├── docker-compose.yml # Docker Compose config
├── requirements.txt # Python dependencies
├── .env.example # Environment template
├── run.sh # Quick start script
├── README.md # This file
├── QUICKSTART.md # Quick start guide
└── ARCHITECTURE.md # Architecture details
```
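Since `app/models/schemas.py` holds the Pydantic schemas, a chunk model might take a shape like the following. This is a hypothetical sketch for illustration; the real field names may differ:

```python
from typing import Optional
from pydantic import BaseModel, Field

class ChunkMetadata(BaseModel):
    source: str                      # originating file, e.g. "intro.txt"
    page: Optional[int] = None       # page number, for PDFs

class DocumentChunk(BaseModel):
    text: str = Field(min_length=1)  # validation: reject empty chunks
    metadata: ChunkMetadata

chunk = DocumentChunk(text="RAG retrieves...", metadata=ChunkMetadata(source="intro.txt"))
print(chunk.model_dump())
```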
## Development

### Running Tests

```bash
# Test application structure
python3 test_app.py

# Test with API key
python3 test_with_api.py
```

The project follows these best practices:
- ✅ Type hints throughout
- ✅ Pydantic for data validation
- ✅ Comprehensive error handling
- ✅ Structured logging
- ✅ Docstrings for all functions
- ✅ Modular, testable code
## API Reference

### Document Ingestion

```python
from app.services import DocumentIngestionService
# Initialize service
ingestion = DocumentIngestionService()
# Load from files
documents = ingestion.load_from_file_path("path/to/file.txt")
# Load from uploaded files (Streamlit)
documents = ingestion.load_from_uploaded_files(uploaded_files)
# Load from directory
documents = ingestion.load_from_directory("./documents")
```

### Chunking

```python
from app.services import ChunkingService
# Initialize with custom settings
chunking = ChunkingService(
chunk_size=1000,
chunk_overlap=200,
splitter_type="recursive"
)
# Chunk documents
chunks = chunking.chunk_documents(documents)
```

### Embeddings

```python
from app.services import EmbeddingService
# Initialize with OpenAI
embedding = EmbeddingService(
embedding_type="openai",
api_key="your_key"
)
# Generate embeddings
embeddings = embedding.embed_documents(["text1", "text2"])
```

### Vector Store

```python
from app.services import VectorStoreService
# Initialize
vectorstore = VectorStoreService(
vectorstore_type="chroma",
embeddings=embedding.get_embeddings()
)
# Create from documents
vs = vectorstore.create_from_documents(chunks)
```

### Retrieval

```python
from app.services import RetrievalService
# Initialize
retrieval = RetrievalService(
vectorstore=vs,
llm=llm,
k=3
)
# Query
result = retrieval.invoke("What is RAG?")
print(result["result"]) # Answer
print(result["source_documents"]) # SourcesSolution: Make sure you're running from the project root and PYTHONPATH is set correctly. In Docker, this is handled automatically.
### OpenAI API key not found

Solution:

- Check that the `.env` file exists and contains `OPENAI_API_KEY`
- Or export it: `export OPENAI_API_KEY=your_key`
- Verify in the Streamlit sidebar that the API key is set
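A quick way to confirm the key is actually visible to Python:

```python
import os

# Prints True if the variable is set in this process's environment.
print(bool(os.environ.get("OPENAI_API_KEY")))
```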
### Port 8501 already in use

Solution:

```bash
# Find the process using the port
lsof -ti:8501

# Kill the process (replace PID)
kill -9 <PID>

# Or use a different port
streamlit run app/main.py --server.port=8502
```

### Docker build issues

Solution:
```bash
# Clean Docker cache
docker system prune -a

# Rebuild without cache
docker-compose build --no-cache

# Check Docker logs
docker-compose logs
```

### Documents not processing

Solution:
- Check file format is supported (TXT, PDF, CSV, MD)
- Verify file is not corrupted
- Check logs for specific error messages
- Ensure OpenAI API key is valid and has credits
### Getting Help

- Check the Troubleshooting section above
- Review QUICKSTART.md for common issues
- Check ARCHITECTURE.md for system details
- Open an issue on GitHub with:
  - Error message
  - Steps to reproduce
  - Environment details
## Extending the System

### Adding a New Document Loader

1. Add the loader to `app/services/ingestion.py`:

```python
from langchain_community.document_loaders import YourLoader

SUPPORTED_EXTENSIONS = {
    ".your_ext": YourLoader,
    # ... existing loaders
}
```

2. Add the dependency to `requirements.txt`
3. Test with your file type
### Adding a New Embedding Model

1. Add it to `app/services/embedding.py`:

```python
EMBEDDING_TYPES = {
    "your_model": YourEmbeddingClass,
    # ... existing models
}
```

2. Update the configuration in `app/config/settings.py`
### Adding a New Vector Store

1. Add it to `app/services/vectorstore.py`:

```python
VECTORSTORE_TYPES = {
    "your_store": YourVectorStoreClass,
    # ... existing stores
}
```

2. Update the configuration
### Customizing the Prompt

Edit the prompt template in `app/services/retrieval.py`:

```python
DEFAULT_PROMPT_TEMPLATE = """Your custom prompt here...

Context: {context}

Question: {question}

Answer:"""
```

## Contributing

Contributions are welcome! Here's how you can help:
1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Make your changes
4. Add tests (if applicable)
5. Commit your changes: `git commit -m 'Add amazing feature'`
6. Push to the branch: `git push origin feature/amazing-feature`
7. Open a Pull Request
- ✅ Follow existing code style
- ✅ Add docstrings to new functions
- ✅ Update documentation as needed
- ✅ Test your changes thoroughly
- ✅ Keep commits atomic and well-described
Areas where contributions are especially welcome:

- 🐛 Bug fixes
- ✨ New features
- 📚 Documentation improvements
- 🧪 Test coverage
- 🎨 UI/UX enhancements
- ⚡ Performance optimizations
## Dependencies

| Package | Version | Purpose |
|---|---|---|
| `langchain` | >=0.1.0 | LLM framework |
| `langchain-openai` | >=0.0.5 | OpenAI integration |
| `langchain-community` | >=0.0.20 | Community integrations |
| `chromadb` | >=0.4.0 | Vector database |
| `streamlit` | >=1.28.0 | Web framework |
| `pydantic` | >=2.0.0 | Data validation |
| `openai` | >=1.0.0 | OpenAI API client |

Optional dependencies:

- `pypdf` - PDF processing
- `unstructured` - Advanced document parsing
See `requirements.txt` for the complete list.
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- LangChain - Amazing framework for LLM applications
- Streamlit - Beautiful web framework
- Chroma - Fast vector database
- OpenAI - Powerful LLM and embedding models
## Support

- 📧 Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions
- 📖 Documentation: See QUICKSTART.md and ARCHITECTURE.md
Made with ❤️ for the AI community
⭐ Star this repo if you find it useful! ⭐