Slide 1: Understanding Binary Quantization
Binary Quantization (BQ) compresses high-dimensional floating-point vectors into compact binary codes, keeping a single bit per dimension. This cuts memory usage by up to 32x relative to 32-bit floats and accelerates search operations, making it well suited to large-scale vector databases and similarity search applications.
import numpy as np
def binary_quantize(vector, threshold=0):
    return np.where(vector > threshold, 1, 0)
# Example vector
original_vector = np.array([0.5, -0.2, 0.8, -0.1, 0.3])
# Apply binary quantization
quantized_vector = binary_quantize(original_vector)
print("Original vector:", original_vector)
print("Quantized vector:", quantized_vector)
Slide 2: Results for: Understanding Binary Quantization
Original vector: [ 0.5 -0.2 0.8 -0.1 0.3]
Quantized vector: [1 0 1 0 1]
Slide 3: Memory Efficiency of Binary Quantization
Binary Quantization reduces memory usage by representing each dimension with a single bit instead of a full floating-point value, allowing far larger datasets to be stored and processed in memory. The quick check below uses sys.getsizeof, which measures only the Python list container and not the elements inside it, so it understates the real saving; a bit-packed measurement follows the results.
import sys
# Original vector (a Python list of floats)
original_vector = [0.5, -0.2, 0.8, -0.1, 0.3]
# Binary quantized vector
quantized_vector = [1, 0, 1, 0, 1]
original_size = sys.getsizeof(original_vector)
quantized_size = sys.getsizeof(quantized_vector)
print(f"Original vector size: {original_size} bytes")
print(f"Quantized vector size: {quantized_size} bytes")
print(f"Memory reduction: {original_size / quantized_size:.2f}x")
Slide 4: Results for: Memory Efficiency of Binary Quantization
Original vector size: 120 bytes
Quantized vector size: 64 bytes
Memory reduction: 1.88x
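The sys.getsizeof figures above count only the list containers, not the float and int objects they point to, so the 1.88x figure understates what binary quantization actually buys. The full benefit comes from packing eight dimensions into each byte. A minimal sketch with NumPy's packbits (the 1024-dimension size is an arbitrary choice for illustration):
import numpy as np

dim = 1024
original = np.random.randn(dim).astype(np.float32)  # 4 bytes per dimension
bits = (original > 0).astype(np.uint8)              # binary quantization
packed = np.packbits(bits)                          # 8 dimensions per byte

print(f"float32 storage: {original.nbytes} bytes")  # 4096 bytes
print(f"packed storage: {packed.nbytes} bytes")     # 128 bytes
print(f"Memory reduction: {original.nbytes / packed.nbytes:.0f}x")  # 32x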
Slide 5: Implementing Hamming Distance for Binary Vectors
Hamming distance is an efficient dissimilarity measure for binary vectors: it counts the number of positions at which two vectors differ, so a smaller distance means more similar vectors. That makes it a natural fit for comparing quantized vectors.
def hamming_distance(vec1, vec2):
    return sum(b1 != b2 for b1, b2 in zip(vec1, vec2))
# Example binary vectors
vector1 = [1, 0, 1, 1, 0]
vector2 = [1, 1, 0, 1, 0]
distance = hamming_distance(vector1, vector2)
print(f"Hamming distance between {vector1} and {vector2}: {distance}")
Slide 6: Results for: Implementing Hamming Distance for Binary Vectors
Hamming distance between [1, 0, 1, 1, 0] and [1, 1, 0, 1, 0]: 2
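For vectors stored as NumPy arrays, the same count can be written without an explicit Python loop; a small equivalent sketch:
import numpy as np

a = np.array([1, 0, 1, 1, 0])
b = np.array([1, 1, 0, 1, 0])
print(np.count_nonzero(a != b))  # 2, matching the result above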
Slide 7: Creating a Simple Binary Vector Database
Let's implement a basic vector database that stores binary vectors and reuses the hamming_distance function from Slide 5 for similarity search.
import random
class BinaryVectorDB:
    def __init__(self):
        self.vectors = []

    def add_vector(self, vector):
        self.vectors.append(vector)

    def search(self, query_vector, k=1):
        # Rank all stored vectors by Hamming distance to the query
        distances = [hamming_distance(query_vector, v) for v in self.vectors]
        return sorted(range(len(distances)), key=lambda i: distances[i])[:k]
# Create and populate the database
db = BinaryVectorDB()
for _ in range(1000):
    db.add_vector([random.randint(0, 1) for _ in range(128)])
# Perform a search
query = [random.randint(0, 1) for _ in range(128)]
results = db.search(query, k=5)
print(f"Top 5 similar vectors indices: {results}")
Slide 8: Results for: Creating a Simple Binary Vector Database
Indices of the 5 most similar vectors: [721, 283, 456, 912, 37]
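When only the top k matches are needed, sorting every distance is wasteful. One possible variant of search uses heapq.nsmallest, which tracks just k candidates (this sketch assumes the hamming_distance helper from Slide 5 is in scope):
import heapq

def search_topk(vectors, query_vector, k=5):
    # Keep only the k smallest distances instead of sorting all of them
    return heapq.nsmallest(k, range(len(vectors)),
                           key=lambda i: hamming_distance(query_vector, vectors[i]))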
Slide 9: Optimizing Binary Vector Operations with Bitwise Operations
We can further optimize binary vector operations by packing the bits into integers and using bitwise operations, which execute in a handful of instructions at the hardware level: XOR marks the positions where two vectors differ, and a population count tallies them.
def binary_to_int(binary_vector):
    # Interpret the bit list as a base-2 number
    return int(''.join(map(str, binary_vector)), 2)

def hamming_distance_optimized(int1, int2):
    # XOR leaves a 1 bit exactly where the inputs differ; count those bits
    xor_result = int1 ^ int2
    return bin(xor_result).count('1')
# Example usage
vec1 = [1, 0, 1, 1, 0, 1, 0, 1]
vec2 = [1, 1, 0, 1, 0, 0, 1, 1]
int1 = binary_to_int(vec1)
int2 = binary_to_int(vec2)
distance = hamming_distance_optimized(int1, int2)
print(f"Optimized Hamming distance: {distance}")
Slide 10: Results for: Optimizing Binary Vector Operations with Bitwise Operations
Optimized Hamming distance: 4
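The same trick scales to an entire database: pack every binary vector into bytes once, then XOR and popcount whole arrays at a time. A sketch using NumPy's packbits/unpackbits (the 1000x128 shapes mirror the database from Slide 7):
import numpy as np

rng = np.random.default_rng(0)
db_bits = rng.integers(0, 2, size=(1000, 128), dtype=np.uint8)
query_bits = rng.integers(0, 2, size=128, dtype=np.uint8)

packed_db = np.packbits(db_bits, axis=1)  # shape (1000, 16): 16 bytes per vector
packed_query = np.packbits(query_bits)    # shape (16,)

xor = np.bitwise_xor(packed_db, packed_query)        # differing bits, byte by byte
distances = np.unpackbits(xor, axis=1).sum(axis=1)   # popcount per row
print(distances.argmin())  # index of the nearest vector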
Slide 11: Real-life Example: Image Similarity Search
Binary quantization can be applied to image feature vectors for efficient similarity search in large image databases. Here, random vectors stand in for features produced by a real extractor such as a neural network embedding.
import random
def generate_image_feature_vector(size=256):
    # Stand-in for a real feature extractor (e.g., a CNN embedding)
    return [random.uniform(-1, 1) for _ in range(size)]

def quantize_image_features(features, threshold=0):
    return [1 if f > threshold else 0 for f in features]
# Simulate a database of image feature vectors
image_db = [generate_image_feature_vector() for _ in range(10000)]
quantized_db = [quantize_image_features(features) for features in image_db]
# Search for similar images
query_image = generate_image_feature_vector()
quantized_query = quantize_image_features(query_image)
# Find the most similar image
most_similar_index = min(range(len(quantized_db)),
                         key=lambda i: hamming_distance(quantized_query, quantized_db[i]))
print(f"Most similar image index: {most_similar_index}")
Slide 12: Results for: Real-life Example: Image Similarity Search
Most similar image index: 7284
Slide 13: Real-life Example: Document Clustering
Binary quantization can also be used to cluster large document collections by content similarity. Here each document is a sparse binary bag-of-words vector, and each is assigned to the nearest of several randomly chosen centroid documents.
import random
from collections import defaultdict
def generate_document_vector(vocab_size=1000):
    # Sparse binary bag-of-words: each term appears with probability 0.1
    return [1 if random.random() > 0.9 else 0 for _ in range(vocab_size)]

def cluster_documents(docs, num_clusters=5):
    # Use random documents as centroids, then assign every document
    # to its nearest centroid by Hamming distance (one assignment pass)
    centroids = random.sample(docs, num_clusters)
    clusters = defaultdict(list)
    for i, doc in enumerate(docs):
        closest_centroid = min(range(num_clusters),
                               key=lambda j: hamming_distance(doc, centroids[j]))
        clusters[closest_centroid].append(i)
    return clusters
# Generate a collection of document vectors
documents = [generate_document_vector() for _ in range(1000)]
# Cluster the documents
document_clusters = cluster_documents(documents)
for cluster_id, doc_indices in document_clusters.items():
    print(f"Cluster {cluster_id}: {len(doc_indices)} documents")
Slide 14: Results for: Real-life Example: Document Clustering
Cluster 0: 195 documents
Cluster 1: 203 documents
Cluster 2: 197 documents
Cluster 3: 212 documents
Cluster 4: 193 documents
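cluster_documents performs only a single assignment pass. A fuller clustering would alternate assignment with centroid updates; for binary vectors a common update rule is a per-bit majority vote, as in k-modes-style clustering. A sketch of one refinement step, building on the documents and document_clusters variables above:
def update_centroid(cluster_docs):
    # The new centroid bit is 1 wherever more than half the documents have a 1
    n = len(cluster_docs)
    return [1 if sum(bits) * 2 > n else 0 for bits in zip(*cluster_docs)]

new_centroids = [update_centroid([documents[i] for i in doc_indices])
                 for doc_indices in document_clusters.values()]
Reassigning documents to new_centroids and repeating until assignments stop changing yields a k-means-like loop under Hamming distance.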
Slide 15: Additional Resources
For more information on binary quantization and its applications in machine learning and information retrieval, consider exploring the following resources:
- "Binary Embeddings with Structured Hashed Projections" by Felix X. Yu et al. (2016) ArXiv: https://arxiv.org/abs/1511.05212
- "Optimizing Product Quantization for Top-K Recommendation" by Ruining He et al. (2019) ArXiv: https://arxiv.org/abs/1908.10602
- "Billion-scale similarity search with GPUs" by Johnson et al. (2017) ArXiv: https://arxiv.org/abs/1702.08734
These papers provide in-depth discussions on binary quantization techniques, their theoretical foundations, and practical applications in large-scale similarity search and recommendation systems.