# RAG-Qoder

A comprehensive document ingestion and retrieval system that processes various document formats (PDFs, images, office documents) through multiple stages: OCR, text extraction, chunking, embedding generation, and indexing. The system supports multi-language processing with specialized Arabic language handling.
## Features

- Multi-format Document Processing: Support for PDFs, images, and office documents
- Intelligent OCR Routing: Automatic language detection with specialized Arabic OCR processing
- Multi-stage Processing Pipeline: Preprocessing → OCR → Table Extraction → Chunking → Embedding → Indexing
- Duplicate Detection: Hash-based duplicate prevention (see the sketch after this list)
- Hybrid Search: Combined text and vector search capabilities
- Administrative Interface: Dataset management, monitoring, and document reprocessing
- Robust Error Handling: Automatic retry logic with exponential backoff
- Scalable Architecture: Microservices-based design with message queues
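As a rough illustration of the hash-based duplicate detection, here is a minimal sketch using Node's built-in `crypto` module. The actual logic lives in `src/utils/hash.ts` and may differ; `existsByHash` is a hypothetical stand-in for a metadata lookup.

```typescript
// Hypothetical sketch of hash-based duplicate detection; src/utils/hash.ts may differ.
import { createHash } from "node:crypto";

// Compute a stable content fingerprint for an uploaded file.
export function contentHash(fileBytes: Buffer): string {
  return createHash("sha256").update(fileBytes).digest("hex");
}

// Reject an upload if a document with the same hash was already ingested.
// `existsByHash` stands in for a lookup against the document metadata store.
export async function isDuplicate(
  fileBytes: Buffer,
  existsByHash: (hash: string) => Promise<boolean>,
): Promise<boolean> {
  return existsByHash(contentHash(fileBytes));
}
```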
## Architecture

The system follows a microservices architecture with the following components:

### Core Services

- Upload Service: Handles document uploads, validation, and duplicate detection
- Processing Orchestrator: Coordinates multi-stage document processing
- Search Service: Provides hybrid search capabilities
- Admin Service: Manages datasets and system monitoring

### Processing Components

- Language Detector: Identifies document language (Arabic, English, mixed)
- OCR Router: Routes documents to the appropriate OCR engine based on detected language
- Text Normalizer: Unicode normalization and diacritic handling
- Chunking Service: Segments text for retrieval optimization
- Embedding Service: Generates semantic embeddings
- Table Extraction: Detects and extracts table structures

### Data Stores

- PostgreSQL: Document metadata, processing status, chunks, tables
- Vector Database: Embeddings for semantic search (Qdrant/Weaviate)
- Blob Storage: Original files and processing artifacts
- Search Index: Full-text search (Elasticsearch)
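These components are declared as TypeScript interfaces in `src/interfaces/services.ts`. As a hypothetical sketch of the shape such interfaces might take (the names and signatures below are illustrative, not the actual definitions):

```typescript
// Illustrative shapes only; the real definitions live in src/interfaces/services.ts.
export interface LanguageDetector {
  // Returns the dominant language of the extracted text.
  detect(text: string): Promise<"arabic" | "english" | "mixed">;
}

export interface ChunkingService {
  // Splits normalized text into retrieval-sized chunks.
  chunk(documentId: string, text: string): Promise<string[]>;
}

export interface EmbeddingService {
  // Produces one embedding vector per chunk, using a language-appropriate model.
  embed(chunks: string[], model: string): Promise<number[][]>;
}
```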
## Prerequisites

- Node.js >= 18.x
- PostgreSQL >= 14.x
- Redis >= 6.x
- Docker (optional, for containerized services)
## Installation

1. Clone the repository

   ```bash
   git clone <repository-url>
   cd RAG-Qoder
   ```

2. Install dependencies

   ```bash
   npm install
   ```

3. Configure the environment

   ```bash
   cp .env.example .env
   # Edit .env with your configuration
   ```

4. Set up the database

   ```bash
   # Create the PostgreSQL database
   createdb rag_qoder

   # Run migrations
   psql -d rag_qoder -f src/database/schema.sql
   ```

5. Start Redis

   ```bash
   redis-server
   ```
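Once the database exists, the application talks to it through `src/database/connection.ts`. A minimal sketch of what such a module could look like, assuming the node-postgres (`pg`) driver (the actual implementation may differ):

```typescript
// Hypothetical sketch of a pooled PostgreSQL connection; see src/database/connection.ts.
import { Pool } from "pg";

// Pool settings come from the same .env variables described below.
export const pool = new Pool({
  host: process.env.DB_HOST ?? "localhost",
  port: Number(process.env.DB_PORT ?? 5432),
  database: process.env.DB_NAME ?? "rag_qoder",
  user: process.env.DB_USER ?? "postgres",
  password: process.env.DB_PASSWORD,
});

// Simple helper for parameterized queries.
export async function query<T>(sql: string, params: unknown[] = []): Promise<T[]> {
  const result = await pool.query(sql, params);
  return result.rows as T[];
}
```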
## Configuration

Edit the `.env` file with your settings:

```env
DB_HOST=localhost
DB_PORT=5432
DB_NAME=rag_qoder
DB_USER=postgres
DB_PASSWORD=your_password

DEFAULT_OCR_ENDPOINT=http://localhost:5000/ocr
ARABIC_OCR_ENDPOINT=http://localhost:5001/ocr

EMBEDDING_SERVICE_ENDPOINT=http://localhost:8000/embed
EMBEDDING_MODEL_DEFAULT=sentence-transformers/all-MiniLM-L6-v2
EMBEDDING_MODEL_ARABIC=sentence-transformers/paraphrase-multilingual-mpnet-base-v2
```

## Running

```bash
npm run dev
npm run build
npm start
```

## Testing

```bash
npm test            # Run all tests
npm run test:watch  # Run tests in watch mode
```

Current test coverage: configuration, validation, and hashing utilities (33 tests passing).
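For reference, a minimal sketch of how `src/config.ts` might read these variables, assuming the `dotenv` package (both the field names and the dotenv dependency are assumptions, not confirmed by this README):

```typescript
// Hypothetical sketch of configuration loading; the actual config.ts may differ.
import "dotenv/config"; // loads variables from .env into process.env (assumes dotenv)

// Read a required variable, failing fast if it is missing.
function required(name: string): string {
  const value = process.env[name];
  if (!value) throw new Error(`Missing required environment variable: ${name}`);
  return value;
}

export const config = {
  db: {
    host: process.env.DB_HOST ?? "localhost",
    port: Number(process.env.DB_PORT ?? 5432),
    name: required("DB_NAME"),
    user: required("DB_USER"),
    password: required("DB_PASSWORD"),
  },
  ocr: {
    defaultEndpoint: required("DEFAULT_OCR_ENDPOINT"),
    arabicEndpoint: required("ARABIC_OCR_ENDPOINT"),
  },
  embedding: {
    endpoint: required("EMBEDDING_SERVICE_ENDPOINT"),
    defaultModel: process.env.EMBEDDING_MODEL_DEFAULT ?? "sentence-transformers/all-MiniLM-L6-v2",
    arabicModel: process.env.EMBEDDING_MODEL_ARABIC ?? "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
  },
};
```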
## API Endpoints

### Upload a document

```
POST /api/v1/upload
Content-Type: multipart/form-data

{
  "file": <binary>,
  "dataset": "my-dataset",
  "metadata": {}
}
```

### Search

```
POST /api/v1/search
Content-Type: application/json

{
  "text": "search query",
  "datasets": ["dataset1"],
  "limit": 10
}
```

### Check document status

```
GET /api/v1/documents/:id/status
```

### List datasets

```
GET /api/v1/admin/datasets
```
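As an illustrative client, here is a sketch that uploads a file and then searches, using Node 18's built-in `fetch`. The base URL and response shapes are assumptions; this README does not specify where the service listens.

```typescript
// Illustrative client sketch; the base URL and response shapes are assumptions.
import { readFile } from "node:fs/promises";

const BASE_URL = "http://localhost:3000"; // hypothetical; check your deployment

async function uploadAndSearch() {
  // Upload a document into a dataset.
  const form = new FormData();
  form.append("file", new Blob([await readFile("report.pdf")]), "report.pdf");
  form.append("dataset", "my-dataset");
  form.append("metadata", JSON.stringify({}));
  const upload = await fetch(`${BASE_URL}/api/v1/upload`, { method: "POST", body: form });
  console.log("upload:", await upload.json());

  // Run a hybrid search against the same dataset.
  const search = await fetch(`${BASE_URL}/api/v1/search`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: "search query", datasets: ["my-dataset"], limit: 10 }),
  });
  console.log("search:", await search.json());
}

uploadAndSearch().catch(console.error);
```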
## Project Structure

```
RAG-Qoder/
├── src/
│   ├── index.ts              # Application entry point
│   ├── config.ts             # Configuration management
│   ├── types/                # TypeScript type definitions
│   │   └── index.ts
│   ├── interfaces/           # Service interfaces
│   │   └── services.ts
│   ├── database/             # Database layer
│   │   ├── schema.sql
│   │   ├── connection.ts
│   │   └── migrations/
│   │       └── run-migrations.ts
│   └── utils/                # Utility functions
│       ├── logger.ts
│       ├── hash.ts
│       └── validation.ts
├── logs/                     # Application logs
├── uploads/                  # Uploaded files
├── temp/                     # Temporary processing files
├── storage/                  # Blob storage
├── dist/                     # Compiled JavaScript
├── package.json
├── tsconfig.json
└── .env.example

# To be implemented:
# - src/services/             # Service implementations
# - src/routes/               # API routes
# - src/workers/              # Background workers
# - tests/                    # Test files
```
## Implementation Plan

The project follows a structured, task-based implementation plan outlined in `tasks.md`:
- ✅ Set up project structure and core interfaces
- ⏳ Implement document upload and metadata management
- ⏳ Create processing orchestration system
- ⏳ Develop language detection and OCR routing
- ⏳ Implement text chunking and segmentation
- ⏳ Develop embedding and vector search system
- ⏳ Create table extraction system
- ⏳ Develop search and retrieval system
- ⏳ Implement administrative interface
- ⏳ Add monitoring and observability
- ⏳ Testing and validation
## Processing Pipeline

```
Document Upload
↓
Validation & Duplicate Check
↓
Queue for Processing
↓
Preprocessing (PDF → Images)
↓
Language Detection
↓
OCR Routing (Arabic/Default)
↓
Text Normalization
↓
Table Extraction (Parallel)
↓
Text Chunking
↓
Embedding Generation
↓
Vector & Text Indexing
↓
Processing Complete
```
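A rough sketch of how the Processing Orchestrator might chain these stages. This is illustrative only; the stage names and signatures are assumptions, not the actual orchestrator API (which is not yet implemented).

```typescript
// Hypothetical orchestration sketch; the real orchestrator may differ.
type Stage = (doc: { id: string; payload: unknown }) => Promise<void>;

// Stages run in pipeline order; table extraction is described as parallel.
async function processDocument(
  doc: { id: string; payload: unknown },
  stages: {
    preprocess: Stage; detectLanguage: Stage; ocr: Stage; normalize: Stage;
    extractTables: Stage; chunk: Stage; embed: Stage; index: Stage;
  },
): Promise<void> {
  await stages.preprocess(doc);      // PDF -> images
  await stages.detectLanguage(doc);  // Arabic / English / mixed
  await stages.ocr(doc);             // routed by detected language
  await stages.normalize(doc);       // Unicode normalization + diacritics
  // Run table extraction alongside the chunk -> embed text path.
  await Promise.all([
    stages.extractTables(doc),
    (async () => {
      await stages.chunk(doc);
      await stages.embed(doc);
    })(),
  ]);
  await stages.index(doc);           // vector + full-text indexing
}
```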
## Language Handling

### Arabic Content

- Automatic detection of Arabic content
- Specialized Arabic OCR engine routing
- Optional diacritic removal
- Arabic-aware text segmentation
- Multilingual embedding models
### Non-Arabic Content

- Default OCR engine for non-Arabic content
- Standard text processing pipeline
- Language-specific embedding models
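As a minimal sketch of what routing to the OCR endpoints configured above could look like (the detection heuristic and thresholds here are assumptions, not the actual Language Detector):

```typescript
// Hypothetical language detection and OCR routing sketch; thresholds are assumptions.
type DetectedLanguage = "arabic" | "english" | "mixed";

// Rough heuristic: fraction of letters falling in the Arabic Unicode block.
function detectLanguage(sampleText: string): DetectedLanguage {
  const arabicChars = (sampleText.match(/[\u0600-\u06FF]/g) ?? []).length;
  const letters = (sampleText.match(/\p{L}/gu) ?? []).length || 1;
  const ratio = arabicChars / letters;
  if (ratio > 0.8) return "arabic";
  if (ratio < 0.2) return "english";
  return "mixed";
}

// Route Arabic-dominant (and mixed) content to the specialized Arabic engine.
function ocrEndpointFor(language: DetectedLanguage): string {
  return language === "arabic" || language === "mixed"
    ? process.env.ARABIC_OCR_ENDPOINT ?? "http://localhost:5001/ocr"
    : process.env.DEFAULT_OCR_ENDPOINT ?? "http://localhost:5000/ocr";
}
```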
## Error Handling

The system implements robust error handling with:
- Automatic Retry: Transient errors trigger retry with exponential backoff
- Error Classification: Distinguishes between transient and permanent errors
- State Preservation: Processing state persisted for recovery
- Detailed Logging: Comprehensive error information for debugging
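A minimal sketch of retry with exponential backoff, assuming transient errors are distinguished by type (the error classification and delay parameters here are illustrative, not the shipped implementation):

```typescript
// Hypothetical retry-with-backoff sketch; parameters and error types are assumptions.
class TransientError extends Error {}

async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      // Permanent errors (or exhausted attempts) propagate immediately.
      if (!(err instanceof TransientError) || attempt >= maxAttempts) throw err;
      // Exponential backoff with multiplicative jitter: base delay doubles each attempt.
      const delay = baseDelayMs * 2 ** (attempt - 1) * (0.5 + Math.random());
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```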
## Monitoring

- Health check endpoint: `GET /health`
- Processing status tracking per document
- Job history and failure tracking
- Performance metrics collection
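Assuming an Express-style HTTP layer (not confirmed by this README), the health endpoint might be as simple as:

```typescript
// Hypothetical /health handler; assumes Express, which this README does not confirm.
import express from "express";

const app = express();

// Report process liveness with basic runtime details.
app.get("/health", (_req, res) => {
  res.json({
    status: "ok",
    uptimeSeconds: Math.round(process.uptime()),
    timestamp: new Date().toISOString(),
  });
});

app.listen(3000); // hypothetical port
```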
## Contributing

1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request
## License

MIT
## Support

For issues and questions, please open an issue on GitHub.