SIFT-Video is a multimodal semantic video search engine that enables natural-language search inside videos and returns the most relevant timestamps based on audio and visual content.
The system processes videos offline, converts audio and visual information into semantic embeddings using pre-trained inference models, and indexes them in a vector database for similarity search at query time.
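As a rough illustration of the query-time step, the sketch below ranks stored segment embeddings by cosine similarity to a query embedding and returns their timestamps. It is a toy in plain Python, not the project's implementation: the `index` layout, the field names (`video`, `start`, `vector`), and the 3-dimensional vectors are assumptions made for the example. In SIFT-Video the embeddings come from the models and the search is handled by the vector database.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vec, index, top_k=3):
    """Return up to top_k (score, video, start_seconds) tuples, best match first."""
    scored = [
        (cosine_similarity(query_vec, entry["vector"]), entry["video"], entry["start"])
        for entry in index
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Hypothetical index: one entry per indexed segment, with toy 3-d vectors.
index = [
    {"video": "talk.mp4", "start": 12.0, "vector": [0.9, 0.1, 0.0]},
    {"video": "talk.mp4", "start": 95.5, "vector": [0.0, 1.0, 0.0]},
    {"video": "demo.mp4", "start": 3.0, "vector": [0.1, 0.0, 0.9]},
]

for score, video, start in search([1.0, 0.0, 0.0], index, top_k=2):
    print(f"{video} @ {start:.1f}s  score={score:.3f}")
```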
WIP
Clone the repository and run the setup script:
git clone https://github.com/sourav4243/sift-video.git
cd sift-video
chmod +x setup.sh
./setup.sh
Prerequisites
- docker and docker-compose
- git
- curl
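The prerequisites above can be checked before starting. The following is a small optional helper, not part of the repository; the function name `missing_tools` is made up for this example. It only reports which of the listed commands are not found on `PATH`.

```python
import shutil

# Command names taken from the prerequisites list above.
REQUIRED_TOOLS = ["docker", "docker-compose", "git", "curl"]


def missing_tools(tools):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Missing prerequisites:", ", ".join(missing))
else:
    print("All prerequisites found.")
```

Note that recent Docker installations may provide Compose as the `docker compose` subcommand rather than a separate `docker-compose` binary, in which case this check reports `docker-compose` as missing even though Compose is available.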
1. Clone the repository and set up the environment
Use --recurse-submodules to fetch external dependencies (the ingestion service depends on whisper.cpp):
git clone --recurse-submodules https://github.com/sourav4243/sift-video.git
cd sift-video
mkdir -p videos output
Note: If you cloned without submodules, you can fix it by running:
git submodule update --init --recursive
2. Download Whisper model
Download the model for the ingestion service to work (ggml-small.en.bin):
mkdir -p services/ingestion/external/whisper/models
curl -L -o services/ingestion/external/whisper/models/ggml-small.en.bin \
https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.en.bin
The following environment variables can be changed in docker-compose.yml:
| Variable | Default | Description |
|---|---|---|
| QDRANT_URL | http://localhost:6334 | URL of the Qdrant gRPC interface. Use http://qdrant:6334 inside Docker. |
| RUST_LOG | info | Logging level. |
Note: You can access the Qdrant web dashboard at http://localhost:6333/dashboard
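To make the default-with-override behaviour of these variables concrete, here is a minimal Python sketch. The actual services are written in Rust and C++, so this is only an illustration of the pattern: `load_settings` and `DEFAULTS` are hypothetical names, and the URL check is an extra added for the example.

```python
import os
from urllib.parse import urlparse

# Defaults copied from the table above.
DEFAULTS = {
    "QDRANT_URL": "http://localhost:6334",
    "RUST_LOG": "info",
}


def load_settings(env=None):
    """Read each variable from the environment, falling back to its default."""
    env = os.environ if env is None else env
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}

    # Reject values that are clearly not an http(s) URL with a host.
    parsed = urlparse(settings["QDRANT_URL"])
    if parsed.scheme not in ("http", "https") or parsed.hostname is None:
        raise ValueError(f"QDRANT_URL is not a valid URL: {settings['QDRANT_URL']!r}")
    return settings


# Inside Docker the table recommends the service name instead of localhost.
print(load_settings({"QDRANT_URL": "http://qdrant:6334"}))
```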
- Prepare your videos: Place the video files you want to index into the videos/ directory at project root.
- Start the services: Run the following command to build and start the indexing pipeline and search API:
docker-compose up --build
This spins up three containers:
  - sift_qdrant: The vector database (Ports: 6333, 6334).
  - sift_ingestion: Processes videos from the videos/ folder and saves transcripts to output/.
  - sift_query_engine: The search API (Port: 8080).
- Search via API: Once the system is running, you can search your indexed videos using the HTTP API:
curl -X POST http://localhost:8080/search \
-H "Content-Type: application/json" \
-d '{"query": "what is the meaning of life"}'
Note: A dedicated CLI tool for easier searching is planned.
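The same request can be sent from a script. The sketch below mirrors the curl command using only the Python standard library. The endpoint, header, and request body come from the command above; the assumption that the response body is JSON is not confirmed by this README, and the function names are made up for the example.

```python
import json
import urllib.request


def build_search_request(query, base_url="http://localhost:8080"):
    """Build the same POST /search request that the curl command sends."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, base_url="http://localhost:8080", timeout=10.0):
    """Send the request and decode the response, assuming it is JSON."""
    request = build_search_request(query, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the containers running, `search("what is the meaning of life")` returns whatever the API responds with; inspect the result before relying on any particular field, since the response schema is not documented here.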
- Offline video ingestion and indexing pipeline
- Audio extraction from video files
- Speech-to-text transcription with timestamps
- Periodic video frame extraction
- Text and image embedding generation
- Multimodal semantic search (audio + visual)
- Timestamp resolution and retrieval
- Fully containerized services
- CLI tool for natural language search
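For readers curious what the audio extraction and periodic frame extraction steps can look like, the sketch below builds FFmpeg command lines for both. It is an assumption-laden illustration: this README does not say which FFmpeg options the ingestion service uses, or whether it calls the `ffmpeg` binary at all rather than the FFmpeg libraries. The 16 kHz mono 16-bit PCM WAV target is the input format whisper.cpp documents; the 5-second frame interval and the function names are arbitrary choices for the example.

```python
def audio_extract_cmd(video_path, wav_path):
    """ffmpeg arguments that write 16 kHz mono 16-bit PCM WAV audio."""
    return [
        "ffmpeg", "-y", "-i", video_path,
        "-vn",                # ignore the video stream
        "-ar", "16000",       # resample to 16 kHz
        "-ac", "1",           # downmix to mono
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        wav_path,
    ]


def frame_extract_cmd(video_path, out_pattern, every_seconds=5):
    """ffmpeg arguments that keep roughly one frame every `every_seconds` seconds."""
    if not isinstance(every_seconds, int) or every_seconds <= 0:
        raise ValueError("every_seconds must be a positive integer")
    return [
        "ffmpeg", "-y", "-i", video_path,
        "-vf", f"fps=1/{every_seconds}",  # output frame rate of 1/N frames per second
        out_pattern,                      # e.g. frames/frame_%05d.jpg
    ]


print(" ".join(audio_extract_cmd("videos/talk.mp4", "output/talk.wav")))
print(" ".join(frame_extract_cmd("videos/talk.mp4", "output/frame_%05d.jpg")))
```

The functions return argument lists rather than shell strings so they can be passed directly to `subprocess.run` without quoting problems in file names.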
- C++ - multimedia processing and model inference
- Rust - query engine, API layer, and vector database interaction
- Whisper (whisper.cpp) - speech-to-text with timestamps
- CLIP (via ONNX Runtime) - text and image embeddings
- FFmpeg
- Vector database (Qdrant)
- Docker
Suggestions, fixes and improvements are welcome. Feel free to open an issue or a PR.
This project is licensed under the GNU GPLv3.