This repository is a sandbox for experimenting with the Hyperspace search engine. It includes multiple datasets and corresponding notebooks, designed for classic, vector and hybrid search.
Hyperspace is a cloud-based hybrid search engine, powered by cloud FPGA hardware. Hyperspace sets new standards in query performance by allowing high-throughput searches with extremely low latency, typically measuring x10-x100 faster than industry benchmarks, and at reduced costs. Hyperspace allows vector search, similarity search, or a combination of the two. The Hyperspace engine query syntax is native Python with supported functionality for candidate generation and scoring for similarity and vector searches.
- Hybrid Search: The HyperSearch engine combines vector and similarity search within a single framework, providing the best of both worlds.
- Simplicity and Ease of Use: Hyperspace native Python syntax allows a seamless and natural migration of existing codebases.
- Unparalleled Latency: Hyperspace offers x10-x100 lower latency than industry benchmarks, leaving room for more complex query logic within the same latency budget.
- Cost Efficiency: By leveraging Hyperspace, users can significantly reduce machine time requirements and associated costs.
- Advanced AI Possibilities: Hyperspace separates candidate generation from scoring. Combined with the extremely low latency, this allows the use of complex AI techniques that are commonly impractical.
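
The separation of candidate generation from scoring mentioned above can be illustrated with a minimal, plain-Python sketch. This is not the Hyperspace client API or its query syntax: the function names, the document fields (`category`, `city`, `vector`) and the `0.5` boost are all hypothetical and chosen only to show the two-stage idea.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def generate_candidates(docs, query):
    """Stage 1: a cheap metadata filter that narrows the collection."""
    return [d for d in docs if d["category"] == query["category"]]


def score(doc, query):
    """Stage 2: hybrid score = vector similarity plus a metadata boost."""
    s = cosine_similarity(doc["vector"], query["vector"])
    if doc["city"] == query["city"]:
        s += 0.5  # hypothetical boost for a matching metadata field
    return s


def search(docs, query, top_k=2):
    """Score only the surviving candidates and return the ids of the best ones."""
    candidates = generate_candidates(docs, query)
    ranked = sorted(candidates, key=lambda d: score(d, query), reverse=True)
    return [d["id"] for d in ranked[:top_k]]


# Made-up documents, for illustration only.
docs = [
    {"id": "a", "category": "grocery", "city": "Paris", "vector": [1.0, 0.0]},
    {"id": "b", "category": "grocery", "city": "Rome", "vector": [1.0, 0.0]},
    {"id": "c", "category": "grocery", "city": "Paris", "vector": [0.0, 1.0]},
    {"id": "d", "category": "books", "city": "Paris", "vector": [1.0, 0.0]},
]
query = {"category": "grocery", "city": "Paris", "vector": [1.0, 0.0]}
results = search(docs, query)
```

Because scoring runs only on the filtered candidates, the scoring function can be made more elaborate without touching the filter.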
- Download and install the client API
- Create data config file
- Connect to a server
- Create collection
- Ingest data
- Run query
This repository includes various datasets and notebooks that demonstrate the use of the Hyperspace engine. Currently, the following datasets are included:
- arXiv Papers Dataset - Taken from Kaggle, this dataset includes a list of academic papers from arXiv and their metadata. It can be used for vector, classic or hybrid search.
- Crimes In Chicago Dataset - Taken from Kaggle, this dataset includes metadata and can be used to demonstrate classic search.
- Stores Dataset - Randomly generated vectors of dimension 800, with corresponding metadata that describes stores. The data can be used for vector, classic or hybrid search.
- Movies Dataset - The data is taken from MovieLens Latest Datasets and includes 40954 valid movies. The data is in SQL format (table) and will be converted to NoSQL (documents) format. The data preprocessing is given in the notebook titled "MovieRecommendationDataPrep", available in this repository. The data can be used for vector, classic or hybrid search.
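
As a rough illustration of the table-to-document conversion mentioned for the Movies Dataset, the sketch below joins flat SQL rows into one nested document per movie. It is not the code from the "MovieRecommendationDataPrep" notebook: the table names (`movies`, `movie_genres`), the columns and the sample rows are hypothetical, and an in-memory SQLite database stands in for the real data.

```python
import sqlite3


def rows_to_documents(conn):
    """Join flat movie and genre rows into one nested document per movie."""
    docs = {}
    query = """
        SELECT m.movie_id, m.title, g.genre
        FROM movies AS m
        LEFT JOIN movie_genres AS g ON g.movie_id = m.movie_id
        ORDER BY m.movie_id, g.genre
    """
    for movie_id, title, genre in conn.execute(query):
        # Create the document the first time a movie id is seen.
        doc = docs.setdefault(
            movie_id, {"id": movie_id, "title": title, "genres": []}
        )
        if genre is not None:  # movies without genre rows keep an empty list
            doc["genres"].append(genre)
    return list(docs.values())


def build_demo_db():
    """Create a tiny in-memory database with made-up rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE movies (movie_id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE movie_genres (movie_id INTEGER, genre TEXT);
        INSERT INTO movies VALUES
            (1, 'Example Movie A'), (2, 'Example Movie B'), (3, 'Example Movie C');
        INSERT INTO movie_genres VALUES
            (1, 'Animation'), (1, 'Comedy'), (2, 'Crime');
        """
    )
    return conn


documents = rows_to_documents(build_demo_db())
```

Each resulting dictionary holds a movie's fields together with its list of genres, which is the shape a document-oriented collection typically expects.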
We have added two example datasets. The data for these use cases can be found here. To run the code, add a data folder to each code example, containing the data from the link above.
- Advec - Dataset of applications.
- Image-search - Dataset of Amazon items.