
Experiment with PLUM: Evaluate Semantic-ID-based recommendations #25


Semantic ID Recommendation System: From Baseline to PLUM

Phase 0: Data Foundation

Principle: Apples-to-apples comparison requires a shared ground truth.

Before branching into specific architectures, establish a unified data pipeline.

  1. Select Dataset
    • Use a dataset with rich textual metadata and strong user signals (e.g., Amazon Reviews or MovieLens).
  2. Temporal Split
    • Freeze data into Train, Validation, and Test sets based on timestamps (e.g., Train: 2023, Test: Jan 2024) to prevent data leakage.
  3. Signal Extraction
    • Text Corpus: Extract Title + Description for every item.
    • Interaction Graph: Extract user sessions User_ID $\rightarrow$ [Item_A, Item_B, ...]. A minimal pipeline sketch follows this list.
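A minimal pandas sketch of the split and extraction, assuming an interaction table with hypothetical columns user_id, item_id, title, description, and a datetime timestamp (the file path is also an assumption):

```python
import pandas as pd

# Hypothetical input: one row per (user, item, timestamp) interaction.
df = pd.read_parquet("interactions.parquet").sort_values("timestamp")

# Temporal split: train on 2023, test on Jan 2024 (no leakage).
train = df[df.timestamp < "2024-01-01"]
test = df[(df.timestamp >= "2024-01-01") & (df.timestamp < "2024-02-01")]

# Text corpus: Title + Description for every item.
corpus = (
    df.drop_duplicates("item_id")
      .assign(text=lambda d: d.title + ". " + d.description)
      .set_index("item_id")["text"]
)

# Interaction graph: user_id -> ordered list of item_ids (train only).
sessions = train.groupby("user_id")["item_id"].apply(list)
```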

Phase 1: The Baseline (Content-Centric)

Goal: Replicate the blog post's approach. Prove that discrete tokens can represent items based on "what they are."

Step 1.1: The Content Indexer (Standard RQ-VAE)

This step "teaches" the system the vocabulary of the items based only on their text.

  • Generate Content Embeddings:
    • Pass item metadata through a frozen encoder (e.g., Sentence-BERT or E5).
    • Output: A dense vector $x_{content}$ (e.g., 768-dim) for every item.
  • Train RQ-VAE (a minimal sketch follows this list):
    • Objective: Minimize the reconstruction plus quantization loss: $$L = \|x - \hat{x}\|^2 + \|\mathrm{sg}[z] - e\|^2 + \beta\,\|z - \mathrm{sg}[e]\|^2$$ where $z$ is the encoder output, $e$ the selected codeword, and $\mathrm{sg}[\cdot]$ the stop-gradient operator.
    • Config: 3 codebooks, size $K=256$.
    • Output: Item_ID $\rightarrow$ (12, 55, 108).
  • Uniqueness Check:
    • Calculate the collision rate. If multiple items map to (12, 55, 108), append a deterministic suffix (0, 1, ...) to ensure a 1-to-1 mapping.
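A hedged PyTorch sketch of the residual quantizer at the heart of the RQ-VAE (the encoder/decoder MLPs and training loop are elided; dims, depth, and $\beta$ follow common VQ-VAE conventions and the config above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualQuantizer(nn.Module):
    """Quantizes a vector into `depth` codes by repeatedly quantizing the residual."""

    def __init__(self, dim=768, depth=3, K=256, beta=0.25):
        super().__init__()
        self.codebooks = nn.ModuleList(nn.Embedding(K, dim) for _ in range(depth))
        self.beta = beta

    def forward(self, z):
        residual, quantized, codes, loss = z, torch.zeros_like(z), [], 0.0
        for cb in self.codebooks:
            # Nearest codeword for the current residual.
            idx = torch.cdist(residual, cb.weight).argmin(dim=-1)   # (B,)
            e = cb(idx)                                             # (B, dim)
            # Codebook term ||sg[z] - e||^2 and commitment term ||z - sg[e]||^2.
            loss = loss + F.mse_loss(e, residual.detach()) \
                        + self.beta * F.mse_loss(residual, e.detach())
            e = residual + (e - residual).detach()  # straight-through estimator
            quantized, residual = quantized + e, residual - e
            codes.append(idx)
        # A decoder reconstructs x from `quantized`; add ||x - x_hat||^2 outside.
        return quantized, torch.stack(codes, dim=-1), loss
```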

Step 1.2: The Sequential Model (Tabula Rasa)

This step trains a model from scratch to predict the next content-based ID.

  • Sequence Tokenization:
    • Convert user history into flat token streams (tokenization sketch after this list):
      [BOS] <12> <55> <108> [SEP] <45> <12> <99> ...
  • Model Initialization:
    • Initialize a standard Transformer Decoder (e.g., GPT-2 config, ~100M params) with random weights.
    • Vocabulary size: codebook size $\times$ depth + special tokens ($256 \times 3$ plus specials here).
  • Training:
    • Task: Next-token prediction (Cross Entropy).
    • Constraint: No natural language text is used here, only ID tokens.
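A small sketch of the token layout, assuming the (level, code) tuples from Step 1.1 are stored in a dict `sid`; the per-level offset scheme and special-token ids are assumptions:

```python
K, DEPTH = 256, 3
BOS, SEP = 0, 1       # special tokens
NUM_SPECIALS = 2      # vocab size = NUM_SPECIALS + K * DEPTH

def encode_history(items, sid):
    """Flatten an ordered item list into ID tokens: [BOS] <c1> <c2> <c3> [SEP] ..."""
    tokens = [BOS]
    for item in items:
        for level, code in enumerate(sid[item]):
            # Give each (level, code) pair its own token id via a per-level offset.
            tokens.append(NUM_SPECIALS + level * K + code)
        tokens.append(SEP)
    return tokens

# e.g. encode_history(["A"], {"A": (12, 55, 108)}) -> [0, 14, 313, 622, 1]
```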

Phase 2: The PLUM Upgrade (Collaborative-Centric)

Goal: Integrate the PLUM paper's findings. Prove that IDs should represent "how items are used" and that LLMs can leverage this structure.

Step 2.1: The Collaborative Indexer (Contrastive RQ-VAE)

Refinement: Items bought together should have similar IDs, even if their descriptions differ.

  • Construct Co-occurrence Pairs:
    • Mine the interaction graph to find positive pairs $(x, x^+)$ (items appearing in the same session).
  • Train PLUM RQ-VAE:
    • Input: Same $x_{content}$ as Phase 1.
    • New Objective: Add a contrastive loss (a sketch follows this list).
      $$L_{total} = L_{recon} + \lambda \cdot L_{contrastive}(x, x^+, x^-)$$
      This pulls the quantized representation of $x$ closer to $x^+$ than to a random negative $x^-$.
    • Progressive Masking: Randomly drop the deepest quantization layer during training. This forces the top-level tokens (prefixes) to learn robust, broad categories.
    • Result: A new mapping Item_ID $\rightarrow$ (99, 12, 44) that reflects behavioral clusters.
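One plausible form of the contrastive term, using in-batch negatives (InfoNCE) over the quantized representations; the exact loss in the PLUM paper may differ, and the temperature value is an assumption:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, q_pos, temperature=0.07):
    """InfoNCE: row i of `q` should match row i of `q_pos`; other rows are negatives."""
    q = F.normalize(q, dim=-1)
    q_pos = F.normalize(q_pos, dim=-1)
    logits = q @ q_pos.T / temperature                   # (B, B) similarities
    labels = torch.arange(q.size(0), device=q.device)    # positives on the diagonal
    return F.cross_entropy(logits, labels)

# total_loss = recon_loss + lam * contrastive_loss(q, q_pos)
```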

Step 2.2: LLM Integration (Continued Pre-training)

Refinement: Don't learn language from scratch; teach a pre-trained LLM the new domain-specific vocabulary.

  • Vocabulary Surgery:
    • Load a pre-trained LLM (e.g., Llama-3-8B or Qwen).
    • Resize Embeddings: Expand the LLM's embedding matrix to include the RQ-VAE codebook tokens. Avoid random initialization where possible; initialize each new token embedding near the centroid of its cluster (see the sketch after this list).
  • Construct "Bilingual" Corpus:
    • Identity Data: "The item [Title] is represented by ID [SID]." (Aligns world knowledge with SIDs).
    • Sequence Data: "User history: [SID_A] [SID_B]." (Aligns collaborative patterns).
  • Run Continued Pre-training (CPT):
    • Train the LLM on this mixed corpus.
    • Goal: The model learns to "read" Semantic IDs as fluently as English words.
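A hedged sketch of the vocabulary surgery with HuggingFace transformers; the model name, the token format, and the precomputed centroid file are all assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"   # illustrative; Qwen works the same way
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# One new token per (level, code) pair, e.g. "<sid_0_12>" (format is an assumption).
new_tokens = [f"<sid_{lvl}_{code}>" for lvl in range(3) for code in range(256)]
tok.add_tokens(new_tokens)
model.resize_token_embeddings(len(tok))

# Initialize each new embedding near its cluster centroid rather than randomly.
# `sid_centroids.pt` is a hypothetical (num_new_tokens, hidden_dim) tensor.
centroids = torch.load("sid_centroids.pt")
with torch.no_grad():
    emb = model.get_input_embeddings().weight
    emb[-len(new_tokens):] = centroids + 0.01 * torch.randn_like(centroids)
```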

Step 2.3: Instruction Tuning (Steerability)

Refinement: Enable the model to follow natural language constraints.

  • Create SFT Dataset:
    • Format prompts: User History: [SID sequence]. Constraint: "Suggest a sci-fi movie." Recommendation: [Target SID].
  • Fine-Tune (LoRA):
    • Freeze most LLM weights; train low-rank adapters plus the embedding layer (see the sketch below).
    • Why: This preserves the LLM's reasoning ability while adapting it to the strict formatting of RecSys.
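A minimal LoRA setup sketch with the peft library; the target module names follow common Llama conventions, and the rank/alpha values are assumptions:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    modules_to_save=["embed_tokens", "lm_head"],  # keep (resized) embeddings trainable
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # `model` from the Step 2.2 sketch
model.print_trainable_parameters()
```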

Phase 3: Evaluation & Comparison

Evaluate both systems on the held-out test set using three specific lenses.

3.1 Strict Retrieval Metrics (Accuracy)

  • Metrics: NDCG@10 and Recall@10.
  • Method: Use beam search (beam=10) to generate the next Semantic ID tuple (evaluation sketch below).
  • Hypothesis: PLUM should outperform Baseline because the IDs contain behavioral signals, making the "grammar" of the sequence easier to predict.
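A small sketch of both metrics for the single-target next-item setting, assuming `predictions` holds each user's ranked beam outputs (SID tuples) and `targets` the held-out next item:

```python
import math

def recall_and_ndcg_at_k(predictions, targets, k=10):
    recall = ndcg = 0.0
    for preds, target in zip(predictions, targets):
        top_k = preds[:k]
        if target in top_k:
            recall += 1.0
            ndcg += 1.0 / math.log2(top_k.index(target) + 2)  # rank 0 -> 1/log2(2) = 1
        # One relevant item per user, so IDCG = 1 and no further normalization.
    return recall / len(targets), ndcg / len(targets)
```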

3.2 Cold-Start Generalization

  • Test Subset: Items in the test set that had zero interactions during training (pure cold-start).
  • Hypothesis:
    • Baseline: Will struggle or hallucinate (emit arbitrary or invalid IDs).
    • PLUM: Should succeed. The LLM can "read" the item's content via the Semantic ID and relate it conceptually to the user's history, even without behavioral logs.

3.3 Steerability & Coherence (Qualitative)

  • Task: "Counterfactual" recommendation.
    • Input: A user history full of Children's Movies.
    • Prompt: "Recommend a Horror movie."
  • Check:
    • Baseline: Cannot handle the prompt (it ignores the text constraint).
    • PLUM: Should output valid Horror SIDs. Validate if the recommended horror movie is "semantically close" (e.g., "scary but simple") or generic.
