# RAG-Qoder

A comprehensive document ingestion and retrieval system that processes various document formats (PDFs, images, office documents) through multiple stages: OCR, text extraction, chunking, embedding generation, and indexing. The system supports multi-language processing with specialized Arabic language handling.
## Features

- Multi-format Document Processing: Support for PDFs, images, and office documents
- Intelligent OCR Routing: Automatic language detection with specialized Arabic OCR processing
- Multi-stage Processing Pipeline: Preprocessing → OCR → Table Extraction → Chunking → Embedding → Indexing
- Duplicate Detection: Hash-based duplicate prevention (see the sketch after this list)
- Hybrid Search: Combined text and vector search capabilities
- Administrative Interface: Dataset management, monitoring, and document reprocessing
- Robust Error Handling: Automatic retry logic with exponential backoff
- Scalable Architecture: Microservices-based design with message queues
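As a rough illustration of the hash-based duplicate detection, here is a minimal sketch using Node's built-in `crypto` module. The actual logic lives in `src/utils/hash.ts` and may differ; `existsByHash` is a hypothetical stand-in for a metadata lookup.

```typescript
// Hypothetical sketch of hash-based duplicate detection; src/utils/hash.ts may differ.
import { createHash } from "node:crypto";

// Compute a stable content fingerprint for an uploaded file.
export function contentHash(fileBytes: Buffer): string {
  return createHash("sha256").update(fileBytes).digest("hex");
}

// Reject an upload if a document with the same hash was already ingested.
// `existsByHash` stands in for a lookup against the document metadata store.
export async function isDuplicate(
  fileBytes: Buffer,
  existsByHash: (hash: string) => Promise<boolean>,
): Promise<boolean> {
  return existsByHash(contentHash(fileBytes));
}
```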
## Architecture

The system follows a microservices architecture with the following components:

### Core Services

- Upload Service: Handles document uploads, validation, and duplicate detection
- Processing Orchestrator: Coordinates multi-stage document processing
- Search Service: Provides hybrid search capabilities
- Admin Service: Manages datasets and system monitoring

### Processing Components

- Language Detector: Identifies document language (Arabic, English, mixed)
- OCR Router: Routes documents to the appropriate OCR engine based on detected language
- Text Normalizer: Unicode normalization and diacritic handling
- Chunking Service: Segments text for retrieval optimization
- Embedding Service: Generates semantic embeddings
- Table Extraction: Detects and extracts table structures

### Data Stores

- PostgreSQL: Document metadata, processing status, chunks, tables
- Vector Database: Embeddings for semantic search (Qdrant/Weaviate)
- Blob Storage: Original files and processing artifacts
- Search Index: Full-text search (Elasticsearch)
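These components are declared as TypeScript interfaces in `src/interfaces/services.ts`. As a hypothetical sketch of the shape such interfaces might take (the names and signatures below are illustrative, not the actual definitions):

```typescript
// Illustrative shapes only; the real definitions live in src/interfaces/services.ts.
export interface LanguageDetector {
  // Returns the dominant language of the extracted text.
  detect(text: string): Promise<"arabic" | "english" | "mixed">;
}

export interface ChunkingService {
  // Splits normalized text into retrieval-sized chunks.
  chunk(documentId: string, text: string): Promise<string[]>;
}

export interface EmbeddingService {
  // Produces one embedding vector per chunk, using a language-appropriate model.
  embed(chunks: string[], model: string): Promise<number[][]>;
}
```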
## Prerequisites

- Node.js >= 18.x
- PostgreSQL >= 14.x
- Redis >= 6.x
- Docker (optional, for containerized services)
## Installation

1. Clone the repository

   ```bash
   git clone <repository-url>
   cd RAG-Qoder
   ```

2. Install dependencies

   ```bash
   npm install
   ```

3. Configure the environment

   ```bash
   cp .env.example .env
   # Edit .env with your configuration
   ```

4. Set up the database

   ```bash
   # Create the PostgreSQL database
   createdb rag_qoder

   # Run migrations
   psql -d rag_qoder -f src/database/schema.sql
   ```

5. Start Redis

   ```bash
   redis-server
   ```
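Once the database exists, the application talks to it through `src/database/connection.ts`. A minimal sketch of what such a module could look like, assuming the node-postgres (`pg`) driver (the actual implementation may differ):

```typescript
// Hypothetical sketch of a pooled PostgreSQL connection; see src/database/connection.ts.
import { Pool } from "pg";

// Pool settings come from the same .env variables described below.
export const pool = new Pool({
  host: process.env.DB_HOST ?? "localhost",
  port: Number(process.env.DB_PORT ?? 5432),
  database: process.env.DB_NAME ?? "rag_qoder",
  user: process.env.DB_USER ?? "postgres",
  password: process.env.DB_PASSWORD,
});

// Simple helper for parameterized queries.
export async function query<T>(sql: string, params: unknown[] = []): Promise<T[]> {
  const result = await pool.query(sql, params);
  return result.rows as T[];
}
```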
## Configuration

Edit the `.env` file with your settings:

```env
DB_HOST=localhost
DB_PORT=5432
DB_NAME=rag_qoder
DB_USER=postgres
DB_PASSWORD=your_password

DEFAULT_OCR_ENDPOINT=http://localhost:5000/ocr
ARABIC_OCR_ENDPOINT=http://localhost:5001/ocr

EMBEDDING_SERVICE_ENDPOINT=http://localhost:8000/embed
EMBEDDING_MODEL_DEFAULT=sentence-transformers/all-MiniLM-L6-v2
EMBEDDING_MODEL_ARABIC=sentence-transformers/paraphrase-multilingual-mpnet-base-v2
```

## Running

```bash
npm run dev
npm run build
npm start
```

## Testing

```bash
npm test            # Run all tests
npm run test:watch  # Run tests in watch mode
```

Current test coverage: configuration, validation, and hashing utilities (33 tests passing).
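For reference, a minimal sketch of how `src/config.ts` might read these variables, assuming the `dotenv` package (both the field names and the dotenv dependency are assumptions, not confirmed by this README):

```typescript
// Hypothetical sketch of configuration loading; the actual config.ts may differ.
import "dotenv/config"; // loads variables from .env into process.env (assumes dotenv)

// Read a required variable, failing fast if it is missing.
function required(name: string): string {
  const value = process.env[name];
  if (!value) throw new Error(`Missing required environment variable: ${name}`);
  return value;
}

export const config = {
  db: {
    host: process.env.DB_HOST ?? "localhost",
    port: Number(process.env.DB_PORT ?? 5432),
    name: required("DB_NAME"),
    user: required("DB_USER"),
    password: required("DB_PASSWORD"),
  },
  ocr: {
    defaultEndpoint: required("DEFAULT_OCR_ENDPOINT"),
    arabicEndpoint: required("ARABIC_OCR_ENDPOINT"),
  },
  embedding: {
    endpoint: required("EMBEDDING_SERVICE_ENDPOINT"),
    defaultModel: process.env.EMBEDDING_MODEL_DEFAULT ?? "sentence-transformers/all-MiniLM-L6-v2",
    arabicModel: process.env.EMBEDDING_MODEL_ARABIC ?? "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
  },
};
```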
## API Endpoints

### Upload a document

```
POST /api/v1/upload
Content-Type: multipart/form-data

{
  "file": <binary>,
  "dataset": "my-dataset",
  "metadata": {}
}
```

### Search

```
POST /api/v1/search
Content-Type: application/json

{
  "text": "search query",
  "datasets": ["dataset1"],
  "limit": 10
}
```

### Check document status

```
GET /api/v1/documents/:id/status
```

### List datasets

```
GET /api/v1/admin/datasets
```
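As an illustrative client, here is a sketch that uploads a file and then searches, using Node 18's built-in `fetch`. The base URL and response shapes are assumptions; this README does not specify where the service listens.

```typescript
// Illustrative client sketch; the base URL and response shapes are assumptions.
import { readFile } from "node:fs/promises";

const BASE_URL = "http://localhost:3000"; // hypothetical; check your deployment

async function uploadAndSearch() {
  // Upload a document into a dataset.
  const form = new FormData();
  form.append("file", new Blob([await readFile("report.pdf")]), "report.pdf");
  form.append("dataset", "my-dataset");
  form.append("metadata", JSON.stringify({}));
  const upload = await fetch(`${BASE_URL}/api/v1/upload`, { method: "POST", body: form });
  console.log("upload:", await upload.json());

  // Run a hybrid search against the same dataset.
  const search = await fetch(`${BASE_URL}/api/v1/search`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: "search query", datasets: ["my-dataset"], limit: 10 }),
  });
  console.log("search:", await search.json());
}

uploadAndSearch().catch(console.error);
```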
## Project Structure

```
RAG-Qoder/
├── src/
│   ├── index.ts              # Application entry point
│   ├── config.ts             # Configuration management
│   ├── types/                # TypeScript type definitions
│   │   └── index.ts
│   ├── interfaces/           # Service interfaces
│   │   └── services.ts
│   ├── database/             # Database layer
│   │   ├── schema.sql
│   │   ├── connection.ts
│   │   └── migrations/
│   │       └── run-migrations.ts
│   └── utils/                # Utility functions
│       ├── logger.ts
│       ├── hash.ts
│       └── validation.ts
├── logs/                     # Application logs
├── uploads/                  # Uploaded files
├── temp/                     # Temporary processing files
├── storage/                  # Blob storage
├── dist/                     # Compiled JavaScript
├── package.json
├── tsconfig.json
└── .env.example

# To be implemented:
# - src/services/             # Service implementations
# - src/routes/               # API routes
# - src/workers/              # Background workers
# - tests/                    # Test files
```
## Implementation Plan

The project follows a structured, task-based implementation plan outlined in `tasks.md`:
- ✅ Set up project structure and core interfaces
- ⏳ Implement document upload and metadata management
- ⏳ Create processing orchestration system
- ⏳ Develop language detection and OCR routing
- ⏳ Implement text chunking and segmentation
- ⏳ Develop embedding and vector search system
- ⏳ Create table extraction system
- ⏳ Develop search and retrieval system
- ⏳ Implement administrative interface
- ⏳ Add monitoring and observability
- ⏳ Testing and validation
## Processing Pipeline

```
Document Upload
↓
Validation & Duplicate Check
↓
Queue for Processing
↓
Preprocessing (PDF → Images)
↓
Language Detection
↓
OCR Routing (Arabic/Default)
↓
Text Normalization
↓
Table Extraction (Parallel)
↓
Text Chunking
↓
Embedding Generation
↓
Vector & Text Indexing
↓
Processing Complete
```
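A rough sketch of how the Processing Orchestrator might chain these stages. This is illustrative only; the stage names and signatures are assumptions, not the actual orchestrator API (which is not yet implemented).

```typescript
// Hypothetical orchestration sketch; the real orchestrator may differ.
type Stage = (doc: { id: string; payload: unknown }) => Promise<void>;

// Stages run in pipeline order; table extraction is described as parallel.
async function processDocument(
  doc: { id: string; payload: unknown },
  stages: {
    preprocess: Stage; detectLanguage: Stage; ocr: Stage; normalize: Stage;
    extractTables: Stage; chunk: Stage; embed: Stage; index: Stage;
  },
): Promise<void> {
  await stages.preprocess(doc);      // PDF -> images
  await stages.detectLanguage(doc);  // Arabic / English / mixed
  await stages.ocr(doc);             // routed by detected language
  await stages.normalize(doc);       // Unicode normalization + diacritics
  // Run table extraction alongside the chunk -> embed text path.
  await Promise.all([
    stages.extractTables(doc),
    (async () => {
      await stages.chunk(doc);
      await stages.embed(doc);
    })(),
  ]);
  await stages.index(doc);           // vector + full-text indexing
}
```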
## Language Handling

### Arabic Content

- Automatic detection of Arabic content
- Specialized Arabic OCR engine routing
- Optional diacritic removal
- Arabic-aware text segmentation
- Multilingual embedding models
### Non-Arabic Content

- Default OCR engine for non-Arabic content
- Standard text processing pipeline
- Language-specific embedding models
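As a minimal sketch of what routing to the OCR endpoints configured above could look like (the detection heuristic and thresholds here are assumptions, not the actual Language Detector):

```typescript
// Hypothetical language detection and OCR routing sketch; thresholds are assumptions.
type DetectedLanguage = "arabic" | "english" | "mixed";

// Rough heuristic: fraction of letters falling in the Arabic Unicode block.
function detectLanguage(sampleText: string): DetectedLanguage {
  const arabicChars = (sampleText.match(/[\u0600-\u06FF]/g) ?? []).length;
  const letters = (sampleText.match(/\p{L}/gu) ?? []).length || 1;
  const ratio = arabicChars / letters;
  if (ratio > 0.8) return "arabic";
  if (ratio < 0.2) return "english";
  return "mixed";
}

// Route Arabic-dominant (and mixed) content to the specialized Arabic engine.
function ocrEndpointFor(language: DetectedLanguage): string {
  return language === "arabic" || language === "mixed"
    ? process.env.ARABIC_OCR_ENDPOINT ?? "http://localhost:5001/ocr"
    : process.env.DEFAULT_OCR_ENDPOINT ?? "http://localhost:5000/ocr";
}
```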
## Error Handling

The system implements robust error handling with:
- Automatic Retry: Transient errors trigger retry with exponential backoff
- Error Classification: Distinguishes between transient and permanent errors
- State Preservation: Processing state persisted for recovery
- Detailed Logging: Comprehensive error information for debugging
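A minimal sketch of retry with exponential backoff, assuming transient errors are distinguished by type (the error classification and delay parameters here are illustrative, not the shipped implementation):

```typescript
// Hypothetical retry-with-backoff sketch; parameters and error types are assumptions.
class TransientError extends Error {}

async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      // Permanent errors (or exhausted attempts) propagate immediately.
      if (!(err instanceof TransientError) || attempt >= maxAttempts) throw err;
      // Exponential backoff with multiplicative jitter: base delay doubles each attempt.
      const delay = baseDelayMs * 2 ** (attempt - 1) * (0.5 + Math.random());
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```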
## Monitoring

- Health check endpoint: `GET /health`
- Processing status tracking per document
- Job history and failure tracking
- Performance metrics collection
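Assuming an Express-style HTTP layer (not confirmed by this README), the health endpoint might be as simple as:

```typescript
// Hypothetical /health handler; assumes Express, which this README does not confirm.
import express from "express";

const app = express();

// Report process liveness with basic runtime details.
app.get("/health", (_req, res) => {
  res.json({
    status: "ok",
    uptimeSeconds: Math.round(process.uptime()),
    timestamp: new Date().toISOString(),
  });
});

app.listen(3000); // hypothetical port
```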
## Contributing

1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request
## License

MIT
## Support

For issues and questions, please open an issue on GitHub.