A comprehensive AI-powered application that combines document understanding with web search. Built on a microservices architecture, the system answers questions from uploaded documents and falls back to web search when no relevant documents are available.
- LlamaIndex Integration: Advanced document processing and indexing
- BGE Embeddings: High-quality text embeddings using BAAI/bge-large-en-v1.5
- Milvus Vector Database: Efficient vector storage and similarity search
- Binary Quantization: 32x storage reduction with Hamming distance search (see the sketch after this list)
- Multiple Formats: Support for PDF, DOC, DOCX, TXT, and Markdown files
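
Binary quantization keeps only the sign bit of each embedding dimension, so a 32-bit float shrinks to a single bit (the 32x figure above) and similarity search reduces to Hamming distance over packed bits. A minimal NumPy sketch of the idea (illustrative only; in this stack Milvus handles the quantized storage and Hamming search internally):

```python
import numpy as np

def binarize(embeddings: np.ndarray) -> np.ndarray:
    """Quantize float32 embeddings to 1 bit per dimension (the sign bit).

    float32 (32 bits/dim) -> 1 bit/dim, hence the 32x storage reduction.
    """
    bits = (embeddings > 0).astype(np.uint8)  # shape: (n, dim)
    return np.packbits(bits, axis=1)          # shape: (n, dim // 8)

def hamming_distance(query: np.ndarray, codes: np.ndarray) -> np.ndarray:
    """Hamming distance between one packed query and a matrix of packed codes."""
    # XOR the packed bytes, then count the differing bits per row.
    return np.unpackbits(np.bitwise_xor(query, codes), axis=1).sum(axis=1)

# Example: search 10k 1024-dim vectors (BGE-large produces 1024-dim embeddings)
docs = np.random.randn(10_000, 1024).astype(np.float32)
codes = binarize(docs)  # 128 bytes per vector instead of 4096
query = binarize(np.random.randn(1, 1024).astype(np.float32))
top5 = np.argsort(hamming_distance(query, codes))[:5]
```
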
- Bing Search API: Professional web search integration
- MCP Protocol: Model Context Protocol for AI assistant compatibility
- Intelligent Routing: Automatically switches between document and web search (sketched after this list)
- Real-time Results: Fast and accurate web search results
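
The intelligent-routing behavior amounts to a confidence gate on retrieval: answer from documents when vector search returns strong matches, otherwise fall back to the web. A hypothetical sketch (the helper signatures and the 0.75 threshold are assumptions, not the project's actual code):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Hit:
    text: str
    score: float  # similarity score of the matched chunk

def route(question: str,
          search_docs: Callable[[str], list[Hit]],
          search_web: Callable[[str], list[str]],
          threshold: float = 0.75) -> tuple[str, list]:
    """Return ('documents', hits) when retrieval is confident, else ('web', results)."""
    hits = search_docs(question)
    if hits and max(h.score for h in hits) >= threshold:
        return "documents", hits
    return "web", search_web(question)

# Toy usage with stubbed search functions
method, context = route(
    "What is binary quantization?",
    search_docs=lambda q: [Hit("Binary quantization keeps sign bits...", 0.82)],
    search_web=lambda q: ["https://example.com/result"],
)
print(method)  # "documents"
```
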
- VLLM Backend: High-performance inference engine
- Qwen-3 30B Model: State-of-the-art multilingual language model
- GPU Acceleration: Optimized for NVIDIA GPUs
- Configurable Parameters: Adjustable temperature, max tokens, etc.
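
Since vLLM exposes an OpenAI-compatible HTTP API, the generation parameters above map directly onto a chat-completion request. A sketch, assuming the LLM service forwards that API on port 8002 (an assumption; the internal route may differ):

```python
import requests

resp = requests.post(
    "http://localhost:8002/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
        "messages": [{"role": "user", "content": "Summarize binary quantization."}],
        "temperature": 0.7,  # creativity level
        "max_tokens": 512,   # response length cap
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```
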
- React 18: Modern UI with hooks and functional components
- Material-UI: Professional design system
- Real-time Updates: Live status and progress indicators
- Responsive Design: Works on desktop and mobile devices
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Frontend     │─────│   API Gateway   │─────│ Document Service│
│     (React)     │     │    (FastAPI)    │     │    (FastAPI)    │
│   Port: 3000    │     │   Port: 8000    │     │   Port: 8001    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                 │                       │
                                 │                       ▼
                        ┌─────────────────┐     ┌─────────────────┐
                        │   Web Search    │     │     Milvus      │
                        │     Service     │     │ Vector Database │
                        │   Port: 8003    │     │   Port: 19530   │
                        └─────────────────┘     └─────────────────┘
                                 │                       │
                                 ▼                       ▼
                        ┌─────────────────┐     ┌─────────────────┐
                        │    Bing MCP     │     │  etcd + minio   │
                        │     Server      │     │ (Dependencies)  │
                        │   Port: 8080    │     └─────────────────┘
                        └─────────────────┘
                                 │
                                 ▼
                        ┌─────────────────┐
                        │   LLM Service   │
                        │     (VLLM)      │
                        │   Port: 8002    │
                        └─────────────────┘
```
- Docker & Docker Compose: Latest versions
- NVIDIA GPU: For LLM inference (optional, can run on CPU)
- Python 3.10+: For validation scripts
- API Keys: Bing Search API and HuggingFace token
```bash
# Clone the repository
git clone [email protected]:Srjnnnn/doc-search-app.git
cd doc-search-app

# Copy the environment template
cp .env.example .env

# Edit it with your API keys
nano .env
```
Required Environment Variables:
```env
# API Keys (Required)
BING_API_KEY=your_bing_search_api_key_here
HUGGINGFACE_TOKEN=hf_your_huggingface_token_here

# GPU Configuration (Optional)
CUDA_VISIBLE_DEVICES=0
GPU_MEMORY_UTILIZATION=0.8
TENSOR_PARALLEL_SIZE=1
```
- Visit Azure Portal
- Create or sign in to your Azure account
- Create a new "Bing Search" resource
- Navigate to "Keys and Endpoint" section
- Copy your API key
- Visit HuggingFace
- Sign up or log in and go to Settings → Access Tokens
- Create a new token with read permissions
- Copy the token (it starts with `hf_`)
```bash
# Validate the environment configuration
python scripts/validate-env.py
```
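
The repo ships the actual checks; conceptually, validation reduces to confirming the required keys exist and look plausible. A hypothetical sketch, not the script's real contents:

```python
import os
import sys

REQUIRED = ["BING_API_KEY", "HUGGINGFACE_TOKEN"]  # per the variable table below

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    sys.exit(f"Missing required environment variables: {', '.join(missing)}")

if not os.environ["HUGGINGFACE_TOKEN"].startswith("hf_"):
    print("Warning: HUGGINGFACE_TOKEN does not look like a HuggingFace token")
print("Environment OK")
```
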
```bash
# Start all services
make start
# or
docker compose up --build
```
- Frontend: http://localhost:3000
- API Gateway: http://localhost:8000
- Health Check: http://localhost:8000/health
- Navigate to "Upload Documents" tab
- Drag and drop files or click to select (PDF, DOC, DOCX, TXT, MD)
- Click "Upload & Process" to index documents
- Wait for processing - documents are chunked and embedded
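
Uploads can also be scripted against the API gateway. The route and response shape below are assumptions for illustration; the gateway is FastAPI, so the real routes are listed at http://localhost:8000/docs:

```python
import requests

# Hypothetical upload endpoint on the API gateway (verify the actual route).
with open("report.pdf", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/documents/upload",
        files={"file": ("report.pdf", f, "application/pdf")},
    )
resp.raise_for_status()
print(resp.json())  # e.g. a document id and chunk count
```
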
- Go to "Ask Questions" tab
- Type your question in the text area
- Configure options:
- ✅ Search uploaded documents
- ✅ Search the web (fallback)
- 🌡️ Temperature (creativity level)
- 📏 Max tokens (response length)
- Click "Ask Question" and wait for response
- Answer: AI-generated response based on context
- Sources: Relevant document chunks or web results
- Confidence: System confidence in the answer
- Method: Whether answer came from documents or web search
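
Programmatically, the whole flow is one POST to the gateway; the route and field names below are assumptions that mirror the UI options and response fields above:

```python
import requests

resp = requests.post(
    "http://localhost:8000/ask",  # hypothetical route
    json={
        "question": "What does the uploaded report conclude?",
        "use_documents": True,
        "use_web": True,          # web fallback
        "temperature": 0.7,
        "max_tokens": 512,
    },
    timeout=120,
)
data = resp.json()
print(data["answer"])      # AI-generated response
print(data["sources"])     # document chunks or web results
print(data["confidence"], data["method"])
```
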
| Variable | Description | Default | Required |
|---|---|---|---|
| `BING_API_KEY` | Bing Search API key | - | ✅ |
| `HUGGINGFACE_TOKEN` | HuggingFace access token | - | ✅ |
| `CUDA_VISIBLE_DEVICES` | GPU devices to use | `0` | ❌ |
| `GPU_MEMORY_UTILIZATION` | VRAM usage ratio | `0.8` | ❌ |
| `TENSOR_PARALLEL_SIZE` | Multi-GPU parallelism | `1` | ❌ |
| `MAX_MODEL_LEN` | Context window size | `4096` | ❌ |
| `EMBEDDING_MODEL_NAME` | Embedding model | `BAAI/bge-large-en-v1.5` | ❌ |
| `LLM_MODEL_NAME` | Language model | `Qwen/Qwen3-30B-A3B-Instruct-2507` | ❌ |
| `CHUNK_SIZE` | Document chunk size | `1000` | ❌ |
| `DEFAULT_TOP_K` | Search result count | `5` | ❌ |
```env
# Single-GPU tuning
GPU_MEMORY_UTILIZATION=0.9
TENSOR_PARALLEL_SIZE=1
MAX_MODEL_LEN=8192
BATCH_SIZE=64
```

```env
# Multi-GPU tuning
CUDA_VISIBLE_DEVICES=0,1
TENSOR_PARALLEL_SIZE=2
GPU_MEMORY_UTILIZATION=0.8
```

For CPU-only operation, remove the GPU requirements from docker-compose.yml and use smaller models, or disable the LLM service.
```bash
make help            # Show all available commands
make setup           # Initial project setup
make validate        # Validate environment
make start           # Start all services
make start-detached  # Start in background
make stop            # Stop all services
make clean           # Clean up containers and volumes
make logs            # Show logs from all services
make build           # Build all services
make test            # Run health checks
make restart         # Restart all services
```
```bash
# Start with development overrides
docker compose -f docker-compose.yml -f docker-compose.dev.yml up
```
```bash
# Test the document service
curl -f http://localhost:8001/health

# Test the LLM service
curl -f http://localhost:8002/health

# Test the web search service
curl -f http://localhost:8003/health

# Test the Bing MCP server
curl -f http://localhost:8080/health
```
```bash
# Follow logs for a specific service
docker compose logs -f document-service

# Check service status
docker compose ps

# Open a shell inside a service
docker compose exec document-service bash
```
- Configure the deployment environment:

```bash
cp .env.production .env
# Edit with production values
```

- Deploy to Beam Cloud:

```bash
cd beam-deploy
export BEAM_REGISTRY_URL=your-registry.com
python deploy.py
```

- Build and push images:

```bash
# Tag images for your registry
docker compose build
docker tag doc-search-app_frontend your-registry/frontend:latest
# ... tag other services

# Push to the registry
docker push your-registry/frontend:latest
# ... push other services
```

- Deploy with Kubernetes:

```bash
kubectl apply -f beam-deploy/beam.yaml
```
```bash
# Check the Docker installation
docker --version
docker compose version

# Validate the environment
python scripts/validate-env.py

# Check logs
docker compose logs
```
```bash
# Check the NVIDIA Docker runtime
docker run --rm --gpus all nvidia/cuda:11.8.0-runtime-ubuntu20.04 nvidia-smi
# Update docker-compose.yml if needed
```
```env
# Reduce GPU memory utilization
GPU_MEMORY_UTILIZATION=0.6

# Use a smaller model
LLM_MODEL_NAME=Qwen/Qwen2-7B-Instruct

# Reduce the context window
MAX_MODEL_LEN=2048
```
```env
# Enable GPU if available
CUDA_VISIBLE_DEVICES=0

# Increase batch sizes
BATCH_SIZE=64

# Use multiple GPUs
TENSOR_PARALLEL_SIZE=2
```
- Milvus connection issues: Check if Milvus is healthy
- Embedding model download: Ensure HuggingFace token is valid
- File upload failures: Check file format and size limits
- Model loading failures: Verify HuggingFace token and model name
- CUDA errors: Check GPU availability and memory
- Slow inference: Adjust batch size and memory utilization
- API key errors: Verify Bing API key is valid and has quota
- Connection timeouts: Check internet connectivity
- Rate limiting: Implement backoff strategies
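
A standard backoff pattern for rate-limited search calls, as a hedged sketch (the web-search service's actual retry logic may differ):

```python
import random
import time

import requests

def search_with_backoff(url: str, params: dict, max_retries: int = 5) -> dict:
    """Retry a search request with exponential backoff and jitter on HTTP 429."""
    for attempt in range(max_retries):
        resp = requests.get(url, params=params, timeout=10)
        if resp.status_code != 429:  # not rate limited
            resp.raise_for_status()
            return resp.json()
        # Honor Retry-After when present, otherwise back off exponentially.
        delay = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay + random.uniform(0, 0.5))
    raise RuntimeError("Rate limited after retries")
```
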
```bash
# Overall system health
curl http://localhost:8000/health

# Individual service health
curl http://localhost:8001/health  # Document service
curl http://localhost:8002/health  # LLM service
curl http://localhost:8003/health  # Web search service
curl http://localhost:8080/health  # Bing MCP server
```
- Response times: Monitor API response latencies (a minimal probe is sketched after this list)
- GPU utilization: Track VRAM and compute usage
- Search accuracy: Monitor confidence scores
- Error rates: Track failed requests and retries
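
As a starting point for the response-time metric, a minimal latency check against the health endpoints above:

```python
import time

import requests

SERVICES = {
    "gateway": "http://localhost:8000/health",
    "documents": "http://localhost:8001/health",
    "llm": "http://localhost:8002/health",
    "web-search": "http://localhost:8003/health",
}

for name, url in SERVICES.items():
    start = time.perf_counter()
    ok = requests.get(url, timeout=5).ok
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{name:12s} {'up' if ok else 'DOWN':4s} {elapsed_ms:7.1f} ms")
```
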
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
- Python: Follow PEP 8, use Black formatter
- JavaScript: Follow ESLint rules, use Prettier
- Docker: Use multi-stage builds, minimize layers
- Documentation: Update README for any new features
This project is licensed under the MIT License - see the LICENSE file for details.
- LlamaIndex: Document processing and indexing
- Milvus: Vector database technology
- VLLM: High-performance LLM inference
- Qwen: Advanced language model from Alibaba
- BGE: Embedding model from BAAI
- Bing Search MCP: Based on leehanchung/bing-search-mcp
- Issues: GitHub Issues
Made with ❤️ for the AI community