Assignment 08: Building a Semantic Search Engine with Text Embeddings for Clothing Products

Objective

This assignment demonstrates how to:

  • Generate text embeddings using Azure OpenAI's "text-embedding-3-small" model
  • Build a semantic search engine that finds the most similar clothes based on product descriptions
  • Use cosine distance to measure similarity between embeddings
  • Practice handling embeddings and performing vector similarity search in Python

Problem Statement

Online stores require effective search and recommendation systems to help customers find relevant products. Traditional keyword matching often fails to capture semantic similarities. This project builds a semantic search engine by embedding product descriptions and user queries as vectors and finding the closest matches.

How It Works

1. Text Embeddings Creation

What are Text Embeddings? Text embeddings are dense vector representations of text that capture semantic meaning. Unlike traditional keyword-based search, embeddings understand that "warm cotton sweatshirt" and "cozy cotton hoodie" are semantically similar, even though they don't share exact words.

Implementation Process:

  • Each product description is converted into a high-dimensional vector (embedding) using Azure OpenAI's text-embedding-3-small model
  • The model transforms text into numerical vectors where semantically similar texts have similar vector representations
  • These embeddings capture contextual relationships between words and phrases

Code Implementation:

def get_embedding(text):
    # 'client' is the configured Azure OpenAI client (see Setup and Installation below)
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

2. Cosine Similarity for Ranking

What is Cosine Similarity? Cosine similarity measures the cosine of the angle between two vectors in high-dimensional space. It's particularly effective for text embeddings because:

  • It ranges from -1 to 1 (or 0 to 1 for positive vectors)
  • It measures orientation rather than magnitude
  • Similar texts will have vectors pointing in similar directions

Why Cosine Distance? SciPy's cosine() function returns cosine distance, so the script computes 1 - cosine_distance to recover a similarity:

  • Cosine distance: 0 = identical, larger values = less similar
  • Cosine similarity: 1 = identical, lower values = less similar

Mathematical Formula:

cosine_similarity = (A · B) / (||A|| * ||B||)

Where A and B are the embedding vectors.
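As a quick sanity check of the formula, cosine similarity can be computed directly with NumPy. The 3-dimensional vectors below are made-up toy examples, not real embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: (A . B) / (||A|| * ||B||)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1, 0, 0], [1, 0, 0]))  # identical direction -> 1.0
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # orthogonal -> 0.0
```

Note that scaling either vector leaves the result unchanged, which is exactly the "orientation rather than magnitude" property described above.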

Implementation:

from scipy.spatial.distance import cosine  # returns cosine *distance*

def similarity_score(vec1, vec2):
    return 1 - cosine(vec1, vec2)  # convert distance to similarity

3. Search Process

  1. Query Processing: User query is converted to embedding
  2. Similarity Calculation: Cosine similarity computed between query embedding and all product embeddings
  3. Ranking: Products sorted by similarity score (highest first)
  4. Results: Top N most similar products returned
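The four steps above can be sketched end to end. The toy 2-D "embeddings" and the `cos_sim`/`search` helpers below are illustrative stand-ins for the real API output and the script's own functions:

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_embedding, products, top_n=3):
    """Score every product against the query, then rank highest first."""
    scored = [(cos_sim(query_embedding, p["embedding"]), p["title"]) for p in products]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _, title in scored[:top_n]]

# Toy 2-D vectors standing in for real 1536-dimensional embeddings
catalog = [
    {"title": "Red Hoodie",   "embedding": [0.9, 0.1]},
    {"title": "Blue Jeans",   "embedding": [0.1, 0.9]},
    {"title": "Wool Sweater", "embedding": [0.8, 0.3]},
]
print(search([1.0, 0.0], catalog, top_n=2))  # hoodie and sweater rank closest
```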

Expected Outcome

The Python script (Assignment_08.py) does the following:

  • Creates embeddings for all products' descriptions using Azure OpenAI's "text-embedding-3-small" model
  • Accepts user input queries (product description or search text) - implemented as dummy input
  • Generates embeddings for input queries
  • Computes cosine distances between query embedding and product embeddings
  • Returns and prints the top N most similar products ranked by similarity

Concepts Covered

  • Using Azure OpenAI's text embedding models for semantic understanding
  • Computing cosine similarity to find nearest neighbors in vector space
  • Basic data structuring and search logic for recommendation engines
  • Integration with Azure OpenAI API via official Python client
  • Error handling and retry mechanisms for API reliability

Dataset

The script includes a curated dataset of 12 clothing products with diverse categories:

Category            | Products                                                 | Example
Jeans               | Classic Blue Jeans                                       | "Comfortable blue denim jeans with a relaxed fit"
Hoodies/Sweatshirts | Red Hoodie, Pink Cotton Sweatshirt                       | "Cozy red hoodie made from organic cotton"
Jackets             | Black Leather Jacket, Denim Jacket                       | "Stylish black leather jacket with a slim fit design"
T-Shirts/Shirts     | White Cotton T-Shirt, Striped Long Sleeve Shirt          | "Soft white cotton t-shirt perfect for everyday wear"
Shoes/Boots         | Black Running Shoes, Brown Leather Boots                 | "Lightweight black running shoes with excellent cushioning"
Others              | Navy Blue Chinos, Gray Wool Sweater, Green Cargo Shorts  | Various descriptions focusing on materials and fit

Input Data

Sample Product Dataset

The script includes embedded sample data as required:

products = [
    {
        "title": "Classic Blue Jeans",
        "short_description": "Comfortable blue denim jeans with a relaxed fit.",
        "price": 49.99,
        "category": "Jeans"
    },
    {
        "title": "Red Hoodie", 
        "short_description": "Cozy red hoodie made from organic cotton.",
        "price": 39.99,
        "category": "Hoodies"
    },
    # ... 12 total products
]

Dummy Input Queries (Auto Input)

As per requirements, the script uses automatic input instead of manual input():

  • "warm cotton sweatshirt"
  • "comfortable running shoes"
  • "leather jacket for winter"
  • "casual denim pants"
  • "cozy wool sweater"

Setup and Installation

Prerequisites

Install required Python packages:

pip install openai scipy

Azure OpenAI Configuration

Get credentials for text-embedding-3-small from STU and configure:

os.environ["AZURE_OPENAI_ENDPOINT"] = ""
os.environ["AZURE_OPENAI_API_KEY"] = ""
os.environ["AZURE_DEPLOYMENT_NAME"] = "text-embedding-3-small"

Running the Assignment

python Assignment_08.py

Important:

  • Code is contained in single .py file as required
  • Uses a dummy input list (auto input); no calls to the built-in input()
  • Sample product data is embedded in the script

Challenges and Limitations Encountered

1. API Rate Limits and Reliability

Challenge: Azure OpenAI API calls can fail due to rate limits or network issues.

Solution: Implemented retry mechanism with exponential backoff:

def get_embedding(text, max_retries=3, delay=1):
    for attempt in range(max_retries):
        try:
            response = client.embeddings.create(
                model="text-embedding-3-small",
                input=text
            )
            return response.data[0].embedding
        except Exception:
            if attempt < max_retries - 1:
                time.sleep(delay * (2 ** attempt))  # exponential backoff
            else:
                raise  # out of retries: surface the error

2. Embedding Quality and Semantic Understanding

Challenge: Text embeddings may not capture all nuanced relationships between clothing items.

Observations:

  • Works well for material-based searches ("cotton", "leather", "wool")
  • Effective for functional descriptions ("comfortable", "warm", "stylish")
  • May struggle with very specific style preferences or brand-related queries

3. Dataset Size and Diversity

Limitation: Small dataset (12 products) limits the complexity of search scenarios.

Impact:

  • Good for demonstration purposes
  • Real-world applications would require larger, more diverse datasets
  • Current dataset covers basic categories but lacks style variations

4. Cosine Similarity Threshold

Challenge: No clear threshold for determining "good" vs "poor" matches.

Consideration:

  • Similarity scores are relative within the dataset
  • Larger datasets would provide better score distributions
  • Could implement dynamic thresholding based on score distribution
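One illustrative heuristic for dynamic thresholding (not part of the assignment script) is to keep only results scoring above the mean plus some fraction of the standard deviation of the score distribution:

```python
import statistics

def filter_by_dynamic_threshold(scores, k=0.5):
    """Keep scores above mean + k * stdev of the score distribution."""
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    threshold = mean + k * stdev
    return [s for s in scores if s > threshold]

# Hypothetical similarity scores for one query
scores = [0.91, 0.88, 0.55, 0.40, 0.38]
print(filter_by_dynamic_threshold(scores))  # keeps only the clear matches
```

With larger datasets the score distribution becomes smoother, which is why this kind of relative cutoff tends to behave better at scale than a fixed constant.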

5. Cold Start Problem

Limitation: System requires pre-computed embeddings for all products.

Implications:

  • New products need embedding generation before being searchable
  • Batch processing recommended for efficiency
  • Real-time embedding generation could impact search latency

Technical Architecture

Components

  1. Embedding Generator: Interfaces with Azure OpenAI API
  2. Similarity Calculator: Computes cosine similarity between vectors
  3. Search Engine: Orchestrates query processing and ranking
  4. Result Formatter: Presents search results in readable format

Performance Considerations

  • Embedding Cache: Pre-computed embeddings stored in memory
  • Batch Processing: Multiple queries processed efficiently
  • Error Recovery: Graceful handling of API failures
  • Memory Usage: Embeddings stored as Python lists (could optimize with numpy arrays)
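The numpy optimization noted above can be sketched as a single matrix-vector product over pre-normalized embeddings, so every product's similarity is computed at once. Toy 2-D vectors again stand in for real embeddings:

```python
import numpy as np

# Rows = product embeddings, normalized to unit length up front
E = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]], dtype=float)
E = E / np.linalg.norm(E, axis=1, keepdims=True)

q = np.array([1.0, 0.0])
q = q / np.linalg.norm(q)

sims = E @ q                      # cosine similarities for all products at once
ranking = np.argsort(sims)[::-1]  # product indices, best match first
print(ranking.tolist())
```

For unit vectors the dot product equals the cosine similarity, so normalizing once at load time avoids recomputing norms on every query.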

Future Enhancements

Scalability Improvements

  • Implement vector database (e.g., Pinecone, Weaviate) for large-scale deployment
  • Add embedding caching to reduce API calls
  • Implement approximate nearest neighbor search for faster queries

Search Features

  • Add filters for category, price range, brand
  • Implement hybrid search (semantic + keyword matching)
  • Support for image-based similarity search

User Experience

  • Add query suggestions and auto-complete
  • Implement search result explanations
  • Support for multi-language queries

Submission Checklist

  • Complete Python script with clear comments (Assignment_08.py)
  • Code in single .py file
  • Uses a dummy input list (auto input); no calls to the built-in input()
  • Sample product data embedded in the script
  • README report explaining embeddings and cosine similarity
  • Documentation of challenges and limitations encountered

Conclusion

This assignment successfully demonstrates:

  • How embeddings are created using Azure OpenAI's text-embedding-3-small model
  • Implementation of cosine similarity for ranking similar products
  • Building a functional semantic search engine for e-commerce applications
  • Handling API integration, error recovery, and batch processing

The semantic search approach proves more effective than keyword matching for understanding product relationships, providing a foundation for modern recommendation systems.
