Assignment 08: Building a Semantic Search Engine with Text Embeddings for Clothing Products

Objective

This assignment demonstrates how to:

  • Generate text embeddings using Azure OpenAI's "text-embedding-3-small" model
  • Build a semantic search engine that finds the most similar clothes based on product descriptions
  • Use cosine distance to measure similarity between embeddings
  • Practice handling embeddings and performing vector similarity search in Python

Problem Statement

Online stores require effective search and recommendation systems to help customers find relevant products. Traditional keyword matching often fails to capture semantic similarities. This project builds a semantic search engine by embedding product descriptions and user queries as vectors and finding the closest matches.

How It Works

1. Text Embeddings Creation

What are Text Embeddings? Text embeddings are dense vector representations of text that capture semantic meaning. Unlike traditional keyword-based search, embeddings understand that "warm cotton sweatshirt" and "cozy cotton hoodie" are semantically similar, even though they don't share exact words.

Implementation Process:

  • Each product description is converted into a high-dimensional vector (embedding) using Azure OpenAI's text-embedding-3-small model
  • The model transforms text into numerical vectors where semantically similar texts have similar vector representations
  • These embeddings capture contextual relationships between words and phrases

Code Implementation:

def get_embedding(text):
    # 'client' is the configured Azure OpenAI client (see Setup and Installation below)
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

2. Cosine Similarity for Ranking

What is Cosine Similarity? Cosine similarity measures the cosine of the angle between two vectors in high-dimensional space. It's particularly effective for text embeddings because:

  • It ranges from -1 to 1 (or 0 to 1 for positive vectors)
  • It measures orientation rather than magnitude
  • Similar texts will have vectors pointing in similar directions

Why Cosine Distance? SciPy's cosine() function returns cosine distance, so the script computes 1 - cosine_distance to recover a similarity:

  • Cosine distance: 0 = identical, larger values = less similar
  • Cosine similarity: 1 = identical, lower values = less similar

Mathematical Formula:

cosine_similarity = (A · B) / (||A|| * ||B||)

Where A and B are the embedding vectors.
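As a quick sanity check of the formula, cosine similarity can be computed directly with NumPy. The 3-dimensional vectors below are made-up toy examples, not real embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: (A . B) / (||A|| * ||B||)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1, 0, 0], [1, 0, 0]))  # identical direction -> 1.0
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # orthogonal -> 0.0
```

Note that scaling either vector leaves the result unchanged, which is exactly the "orientation rather than magnitude" property described above.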

Implementation:

from scipy.spatial.distance import cosine  # returns cosine *distance*

def similarity_score(vec1, vec2):
    return 1 - cosine(vec1, vec2)  # convert distance to similarity

3. Search Process

  1. Query Processing: User query is converted to embedding
  2. Similarity Calculation: Cosine similarity computed between query embedding and all product embeddings
  3. Ranking: Products sorted by similarity score (highest first)
  4. Results: Top N most similar products returned
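The four steps above can be sketched end to end. The toy 2-D "embeddings" and the `cos_sim`/`search` helpers below are illustrative stand-ins for the real API output and the script's own functions:

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_embedding, products, top_n=3):
    """Score every product against the query, then rank highest first."""
    scored = [(cos_sim(query_embedding, p["embedding"]), p["title"]) for p in products]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _, title in scored[:top_n]]

# Toy 2-D vectors standing in for real 1536-dimensional embeddings
catalog = [
    {"title": "Red Hoodie",   "embedding": [0.9, 0.1]},
    {"title": "Blue Jeans",   "embedding": [0.1, 0.9]},
    {"title": "Wool Sweater", "embedding": [0.8, 0.3]},
]
print(search([1.0, 0.0], catalog, top_n=2))  # hoodie and sweater rank closest
```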

Expected Outcome

The Python script (Assignment_08.py) does the following:

  • Creates embeddings for all products' descriptions using Azure OpenAI's "text-embedding-3-small" model
  • Accepts user input queries (product description or search text) - implemented as dummy input
  • Generates embeddings for input queries
  • Computes cosine distances between query embedding and product embeddings
  • Returns and prints the top N most similar products ranked by similarity

Concepts Covered

  • Using Azure OpenAI's text embedding models for semantic understanding
  • Computing cosine similarity to find nearest neighbors in vector space
  • Basic data structuring and search logic for recommendation engines
  • Integration with Azure OpenAI API via official Python client
  • Error handling and retry mechanisms for API reliability

Dataset

The script includes a curated dataset of 12 clothing products with diverse categories:

Category            | Products                                                 | Example
Jeans               | Classic Blue Jeans                                       | "Comfortable blue denim jeans with a relaxed fit"
Hoodies/Sweatshirts | Red Hoodie, Pink Cotton Sweatshirt                       | "Cozy red hoodie made from organic cotton"
Jackets             | Black Leather Jacket, Denim Jacket                       | "Stylish black leather jacket with a slim fit design"
T-Shirts/Shirts     | White Cotton T-Shirt, Striped Long Sleeve Shirt          | "Soft white cotton t-shirt perfect for everyday wear"
Shoes/Boots         | Black Running Shoes, Brown Leather Boots                 | "Lightweight black running shoes with excellent cushioning"
Others              | Navy Blue Chinos, Gray Wool Sweater, Green Cargo Shorts  | Various descriptions focusing on materials and fit

Input Data

Sample Product Dataset

The script includes embedded sample data as required:

products = [
    {
        "title": "Classic Blue Jeans",
        "short_description": "Comfortable blue denim jeans with a relaxed fit.",
        "price": 49.99,
        "category": "Jeans"
    },
    {
        "title": "Red Hoodie", 
        "short_description": "Cozy red hoodie made from organic cotton.",
        "price": 39.99,
        "category": "Hoodies"
    },
    # ... 12 total products
]

Dummy Input Queries (Auto Input)

As per requirements, the script uses automatic input instead of manual input():

  • "warm cotton sweatshirt"
  • "comfortable running shoes"
  • "leather jacket for winter"
  • "casual denim pants"
  • "cozy wool sweater"

Setup and Installation

Prerequisites

Install required Python packages:

pip install openai scipy

Azure OpenAI Configuration

Get credentials for text-embedding-3-small from STU and configure:

os.environ["AZURE_OPENAI_ENDPOINT"] = ""
os.environ["AZURE_OPENAI_API_KEY"] = ""
os.environ["AZURE_DEPLOYMENT_NAME"] = "text-embedding-3-small"

Running the Assignment

python Assignment_08.py

Important:

  • Code is contained in single .py file as required
  • Uses a dummy input list (auto input); no calls to the built-in input()
  • Sample product data is embedded in the script

Challenges and Limitations Encountered

1. API Rate Limits and Reliability

Challenge: Azure OpenAI API calls can fail due to rate limits or network issues.

Solution: Implemented retry mechanism with exponential backoff:

def get_embedding(text, max_retries=3, delay=1):
    for attempt in range(max_retries):
        try:
            response = client.embeddings.create(
                model="text-embedding-3-small",
                input=text
            )
            return response.data[0].embedding
        except Exception:
            if attempt < max_retries - 1:
                time.sleep(delay * (2 ** attempt))  # exponential backoff
            else:
                raise  # out of retries: surface the error

2. Embedding Quality and Semantic Understanding

Challenge: Text embeddings may not capture all nuanced relationships between clothing items.

Observations:

  • Works well for material-based searches ("cotton", "leather", "wool")
  • Effective for functional descriptions ("comfortable", "warm", "stylish")
  • May struggle with very specific style preferences or brand-related queries

3. Dataset Size and Diversity

Limitation: Small dataset (12 products) limits the complexity of search scenarios.

Impact:

  • Good for demonstration purposes
  • Real-world applications would require larger, more diverse datasets
  • Current dataset covers basic categories but lacks style variations

4. Cosine Similarity Threshold

Challenge: No clear threshold for determining "good" vs "poor" matches.

Consideration:

  • Similarity scores are relative within the dataset
  • Larger datasets would provide better score distributions
  • Could implement dynamic thresholding based on score distribution
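One illustrative heuristic for dynamic thresholding (not part of the assignment script) is to keep only results scoring above the mean plus some fraction of the standard deviation of the score distribution:

```python
import statistics

def filter_by_dynamic_threshold(scores, k=0.5):
    """Keep scores above mean + k * stdev of the score distribution."""
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    threshold = mean + k * stdev
    return [s for s in scores if s > threshold]

# Hypothetical similarity scores for one query
scores = [0.91, 0.88, 0.55, 0.40, 0.38]
print(filter_by_dynamic_threshold(scores))  # keeps only the clear matches
```

With larger datasets the score distribution becomes smoother, which is why this kind of relative cutoff tends to behave better at scale than a fixed constant.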

5. Cold Start Problem

Limitation: System requires pre-computed embeddings for all products.

Implications:

  • New products need embedding generation before being searchable
  • Batch processing recommended for efficiency
  • Real-time embedding generation could impact search latency

Technical Architecture

Components

  1. Embedding Generator: Interfaces with Azure OpenAI API
  2. Similarity Calculator: Computes cosine similarity between vectors
  3. Search Engine: Orchestrates query processing and ranking
  4. Result Formatter: Presents search results in readable format

Performance Considerations

  • Embedding Cache: Pre-computed embeddings stored in memory
  • Batch Processing: Multiple queries processed efficiently
  • Error Recovery: Graceful handling of API failures
  • Memory Usage: Embeddings stored as Python lists (could optimize with numpy arrays)
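The numpy optimization noted above can be sketched as a single matrix-vector product over pre-normalized embeddings, so every product's similarity is computed at once. Toy 2-D vectors again stand in for real embeddings:

```python
import numpy as np

# Rows = product embeddings, normalized to unit length up front
E = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]], dtype=float)
E = E / np.linalg.norm(E, axis=1, keepdims=True)

q = np.array([1.0, 0.0])
q = q / np.linalg.norm(q)

sims = E @ q                      # cosine similarities for all products at once
ranking = np.argsort(sims)[::-1]  # product indices, best match first
print(ranking.tolist())
```

For unit vectors the dot product equals the cosine similarity, so normalizing once at load time avoids recomputing norms on every query.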

Future Enhancements

Scalability Improvements

  • Implement vector database (e.g., Pinecone, Weaviate) for large-scale deployment
  • Add embedding caching to reduce API calls
  • Implement approximate nearest neighbor search for faster queries

Search Features

  • Add filters for category, price range, brand
  • Implement hybrid search (semantic + keyword matching)
  • Support for image-based similarity search

User Experience

  • Add query suggestions and auto-complete
  • Implement search result explanations
  • Support for multi-language queries

Submission Checklist

  • Complete Python script with clear comments (Assignment_08.py)
  • Code in single .py file
  • Uses a dummy input list (auto input); no calls to the built-in input()
  • Sample product data embedded in the script
  • README report explaining embeddings and cosine similarity
  • Documentation of challenges and limitations encountered

Conclusion

This assignment successfully demonstrates:

  • How embeddings are created using Azure OpenAI's text-embedding-3-small model
  • Implementation of cosine similarity for ranking similar products
  • Building a functional semantic search engine for e-commerce applications
  • Handling API integration, error recovery, and batch processing

The semantic search approach proves more effective than keyword matching for understanding product relationships, providing a foundation for modern recommendation systems.
