Skip to content

Semantic video search system that indexes audio and visual content to enable timestamp-accurate retrieval inside videos using natural language queries.

License

Notifications You must be signed in to change notification settings

sourav4243/sift-video

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SIFT-Video

Semantic Indexing for Timelines in Video

SIFT-Video is a multimodal semantic video search engine that enables natural-language search inside videos and returns the most relevant timestamps based on audio and visual content.

The system processes videos offline, converts audio and visual information into semantic embeddings using pre-trained inference models, and indexes them in a vector database for similarity search at query time.

Quick Start

WIP

Clone the repository and run the setup script:

git clone https://github.com/sourav4243/sift-video.git
cd sift-video

chmod +x setup.sh
./setup.sh

Installation (Manual Setup)

Prerequisites

  • docker and docker-compose
  • git
  • curl

1. Clone the repository and setup environment

Use --recurse-submodules for external dependencies (ingestion service depends on whisper.cpp)

git clone --recurse-submodules https://github.com/sourav4243/sift-video.git
cd sift-video
mkdir -p videos output

Note: If you cloned without submodules, you can fix it by running:

git submodule update --init --recursive

2. Download Whisper model

Download the model for the ingestion service to work (ggml-small.en.bin):

mkdir -p services/ingestion/external/whisper/models

curl -L -o services/ingestion/external/whisper/models/ggml-small.en.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.en.bin

Configuration

The following environment variables can be changed in docker-compose.yml

Variable Default Description
QDRANT_URL http://localhost:6334 URL of the Qdrant gRPC interface. Use http://qdrant:6334 inside Docker.
RUST_LOG info Logging level.

Note: You can access Qdrant web dashboard at http://localhost:6333/dashboard

Usage

  1. Prepare your videos: Place the video files you want to index into the videos/ directory at project root.

  2. Start the services: Run the following command to build and start the indexing pipeline and search API:

docker-compose up --build

This spins up three containers:

  • sift_qdrant: The vector database (Ports: 6333, 6334).
  • sift_ingestion: Processes videos from the videos/ folder and saves transcripts to output/.
  • sift_query_engine: The search API (Port: 8080).
  1. Search via API: Once the system is running, you can search your indexed videos using HTTP API:
curl -X POST http://localhost:8080/search \
    -H "Content-Type: application/json" \
    -d '{"query": "what is the meaning of life"}'

Note: A dedicated CLI tool for easier searching is planned.

Planned Features

  • Offline video ingestion and indexing pipeline
  • Audio extraction from video files
  • Speech-to-text transcription with timestamps
  • Periodic video frame extraction
  • Text and image embedding generation
  • Multimodal semantic search (audio + visual)
  • Timestamp resolution and retrieval
  • Fully containerized services
  • CLI tool for natural language search

Technology Stack

Languages

  • C++ - multimedia processing and model inference
  • Rust - query engine, API layer, and vector database interaction

Models

  • Whisper (whisper.cpp) - speech-to-text with timestamps
  • CLIP (via ONNX Runtime) - text and image embeddings

Infrastructure & Tooling

  • FFmpeg
  • Vector database (Qdrant)
  • Docker

Contributing

Suggestions, fixes and improvements are welcome. Feel free to open an issue or a PR.

License

This project is licensed under GNU GPLv3

About

Semantic video search system that indexes audio and visual content to enable timestamp-accurate retrieval inside videos using natural language queries.

Topics

Resources

License

Stars

Watchers

Forks

Contributors 3

  •  
  •  
  •