Semantic ID Recommendation System: From Baseline to PLUM
Phase 0: Data Foundation
Principle: Apples-to-apples comparison requires a shared ground truth.
Before branching into specific architectures, establish a unified data pipeline.
- Select Dataset
  - Use a dataset with rich textual metadata and strong user signals (e.g., Amazon Reviews or MovieLens).
- Temporal Split
  - Freeze data into `Train`, `Validation`, and `Test` sets based on timestamps (e.g., Train: 2023, Test: Jan 2024) to prevent data leakage.
- Signal Extraction (a pipeline sketch follows this list)
  - Text Corpus: Extract `Title` + `Description` for every item.
  - Interaction Graph: Extract user sessions `User_ID` $\rightarrow$ `[Item_A, Item_B, ...]`.
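A minimal sketch of this pipeline in pandas, assuming an `interactions` table with `user_id`, `item_id`, and `timestamp` columns and an `items` table with `item_id`, `title`, and `description`; the column names and cutoff dates are illustrative, not fixed by the plan.

```python
import pandas as pd

def temporal_split(interactions: pd.DataFrame,
                   train_end: str = "2023-12-31", val_end: str = "2024-01-15"):
    """Freeze Train/Validation/Test by timestamp so no future data leaks into training."""
    ts = pd.to_datetime(interactions["timestamp"])
    train = interactions[ts <= train_end]
    val = interactions[(ts > train_end) & (ts <= val_end)]
    test = interactions[ts > val_end]
    return train, val, test

def build_signals(train: pd.DataFrame, items: pd.DataFrame):
    """Extract the text corpus and the user -> [item, ...] interaction graph."""
    text_corpus = {
        row.item_id: f"{row.title}. {row.description}"   # Title + Description per item
        for row in items.itertuples(index=False)
    }
    sessions = (train.sort_values("timestamp")
                     .groupby("user_id")["item_id"]
                     .apply(list)
                     .to_dict())                          # User_ID -> [Item_A, Item_B, ...]
    return text_corpus, sessions
```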
Phase 1: The Baseline (Content-Centric)
Goal: Replicate the blog post approach. Prove that discrete tokens can represent items based on "what they are."
Step 1.1: The Content Indexer (Standard RQ-VAE)
This step "teaches" the system the vocabulary of the items based only on their text.
- Generate Content Embeddings:
  - Pass item metadata through a frozen encoder (e.g., `Sentence-BERT` or `E5`).
  - Output: A dense vector $x_{content}$ (e.g., 768-dim) for every item.
- Train RQ-VAE (a minimal sketch follows this list):
  - Objective: Minimize the reconstruction loss plus the commitment term: $$L = ||x - \hat{x}||^2 + ||sg[e] - z||^2$$
  - Config: 3 codebooks, each of size $K=256$.
  - Output: `Item_ID` $\rightarrow$ `(12, 55, 108)`.
- Uniqueness Check:
  - Calculate the collision rate. If multiple items map to `(12, 55, 108)`, append a deterministic suffix (0, 1, ...) to ensure a 1-to-1 mapping.
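A compact PyTorch sketch of the residual quantizer described above (3 codebooks, $K=256$). The encoder/decoder widths and the straight-through handling are assumptions; this is not a full training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RQVAE(nn.Module):
    def __init__(self, input_dim=768, latent_dim=128, depth=3, codebook_size=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 512), nn.ReLU(),
                                     nn.Linear(512, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                     nn.Linear(512, input_dim))
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, latent_dim) for _ in range(depth)])

    def quantize(self, z):
        """Greedy residual quantization: each level quantizes what earlier levels missed."""
        residual, quantized, codes = z, torch.zeros_like(z), []
        for cb in self.codebooks:
            idx = torch.cdist(residual, cb.weight).argmin(dim=-1)  # nearest code per item
            codes.append(idx)
            e = cb(idx)
            quantized = quantized + e
            residual = residual - e
        return quantized, torch.stack(codes, dim=-1)               # (batch, depth) Semantic IDs

    def forward(self, x):
        z = self.encoder(x)
        e, codes = self.quantize(z)
        e_st = z + (e - z).detach()                # straight-through estimator
        x_hat = self.decoder(e_st)
        recon = F.mse_loss(x_hat, x)               # ||x - x_hat||^2
        commit = F.mse_loss(z, e.detach())         # ||sg[e] - z||^2 (the term in the formula above)
        codebook = F.mse_loss(e, z.detach())       # extra term so the codebooks also receive gradients
        return recon + commit + codebook, codes
```

Here `codes[i]` is the (pre-deduplication) Semantic ID tuple for item `i`, e.g. `(12, 55, 108)`.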
Step 1.2: The Sequential Model (Tabula Rasa)
This step trains a model from scratch to predict the next content-based ID.
- Sequence Tokenization (see the sketch after this list):
  - Convert user history into flat token streams: `[BOS] <12> <55> <108> [SEP] <45> <12> <99> ...`
- Model Initialization:
  - Initialize a standard Transformer Decoder (e.g., GPT-2 config, ~100M params) with random weights.
  - Vocabulary: Codebook size $\times$ depth + special tokens.
- Training:
  - Task: Next-token prediction (cross-entropy loss).
  - Constraint: No natural language text is used here, only ID tokens.
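A sketch of the flattening and vocabulary arithmetic, assuming the `sessions` and item-to-SID mappings from the previous steps; using one token per (level, code) pair is an assumption that matches the codebook size $\times$ depth vocabulary count.

```python
SPECIAL = ["[PAD]", "[BOS]", "[EOS]", "[SEP]"]
DEPTH, K = 3, 256                                  # 3 codebooks of size 256

def build_vocab():
    # One token per (level, code) pair, plus specials: K * DEPTH + len(SPECIAL) entries.
    tokens = list(SPECIAL)
    for level in range(DEPTH):
        tokens += [f"<l{level}_{code}>" for code in range(K)]
    return {tok: i for i, tok in enumerate(tokens)}

def tokenize_session(item_ids, item_to_sid, vocab):
    """Flatten one user history: [BOS] <l0_12> <l1_55> <l2_108> [SEP] ..."""
    out = [vocab["[BOS]"]]
    for item in item_ids:
        out += [vocab[f"<l{level}_{code}>"] for level, code in enumerate(item_to_sid[item])]
        out.append(vocab["[SEP]"])
    return out
```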
Phase 2: The PLUM Upgrade (Collaborative-Centric)
Goal: Integrate the PLUM paper findings. Prove that IDs should represent "how items are used" and that LLMs can leverage this structure.
Step 2.1: The Collaborative Indexer (Contrastive RQ-VAE)
Refinement: Items bought together should have similar IDs, even if their descriptions differ.
- Construct Co-occurrence Pairs:
  - Mine the interaction graph for positive pairs $(x, x^+)$ (items appearing in the same session).
- Train PLUM RQ-VAE (a loss sketch follows this list):
  - Input: Same $x_{content}$ as Phase 1.
  - New Objective: Add a contrastive loss: $$L_{total} = L_{recon} + \lambda \cdot L_{contrastive}(x, x^+, x^-)$$ It ensures the quantized ID of $x$ is closer to $x^+$ than to a random negative $x^-$.
  - Progressive Masking: Randomly drop the deepest quantization layer during training. This forces the top-level tokens (prefixes) to learn robust, broad categories.
- Result: A new mapping where `Item_ID` $\rightarrow$ `(99, 12, 44)` (reflecting behavioral clusters).
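One way the contrastive term could be wired into the Phase 1 `RQVAE` sketch, using a simple triplet margin on the quantized representations; the exact PLUM loss and the progressive-masking schedule are not reproduced here.

```python
import torch.nn.functional as F

def plum_loss(model, x, x_pos, x_neg, lam=0.5, margin=0.2):
    """L_total = L_recon + lambda * L_contrastive(x, x+, x-)."""
    base_loss, _ = model(x)                          # Phase 1 reconstruction + commitment terms
    e, _ = model.quantize(model.encoder(x))          # quantized ID embedding of x
    e_pos, _ = model.quantize(model.encoder(x_pos))  # co-occurring item
    e_neg, _ = model.quantize(model.encoder(x_neg))  # random negative
    d_pos = 1 - F.cosine_similarity(e, e_pos, dim=-1)
    d_neg = 1 - F.cosine_similarity(e, e_neg, dim=-1)
    contrastive = F.relu(d_pos - d_neg + margin).mean()   # pull x toward x+, push away from x-
    # Gradients here reach the codebook vectors; a straight-through term could be
    # added to also shape the encoder. Progressive masking (randomly skipping the
    # deepest codebook inside quantize()) is omitted in this sketch.
    return base_loss + lam * contrastive
```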
Step 2.2: LLM Integration (Continued Pre-training)
Refinement: Don't learn language from scratch; teach a pre-trained brain (LLM) your new specific vocabulary.
- Vocabulary Surgery:
  - Load a pre-trained LLM (e.g., Llama-3-8B or Qwen).
  - Resize Embeddings: Expand the LLM's embedding matrix to include the RQ-VAE codebook tokens. Avoid random initialization where possible; initialize each new token's embedding near the centroid of its cluster (see the sketch after this list).
- Construct "Bilingual" Corpus:
  - Identity Data: "The item [Title] is represented by ID [SID]." (Aligns world knowledge with SIDs.)
  - Sequence Data: "User history: [SID_A] [SID_B]." (Aligns collaborative patterns.)
- Run Continued Pre-training (CPT):
  - Train the LLM on this mixed corpus.
  - Goal: The model learns to "read" Semantic IDs as fluently as English words.
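A sketch of the vocabulary surgery and the bilingual corpus format using Hugging Face `transformers`. The model name is just the example from the list above, and `code_to_example_title` (a representative title per SID token, used for centroid-style initialization) is a hypothetical helper, not something the paper specifies.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"               # example; any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# 1) Add one token per (level, code) pair from the RQ-VAE codebooks, then resize.
sid_tokens = [f"<l{level}_{code}>" for level in range(3) for code in range(256)]
tokenizer.add_tokens(sid_tokens)
model.resize_token_embeddings(len(tokenizer))

# 2) Initialize each new row near something meaningful instead of leaving it random,
#    e.g. the mean embedding of a representative item title for that code.
code_to_example_title = {"<l0_12>": "The Matrix"}       # hypothetical mapping, for illustration
emb = model.get_input_embeddings().weight
with torch.no_grad():
    for tok, title in code_to_example_title.items():
        row = tokenizer.convert_tokens_to_ids(tok)
        title_ids = tokenizer(title, add_special_tokens=False).input_ids
        emb[row] = emb[title_ids].mean(dim=0)

# 3) "Bilingual" CPT lines: identity data and sequence data.
identity_line = "The item 'The Matrix' is represented by ID <l0_12> <l1_55> <l2_108>."
sequence_line = "User history: <l0_12> <l1_55> <l2_108> <l0_45> <l1_12> <l2_99>"
```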
Step 2.3: Instruction Tuning (Steerability)
Refinement: Enable the model to follow natural language constraints.
- Create SFT Dataset:
  - Format prompts: `User History: [SID sequence]. Constraint: "Suggest a sci-fi movie." Recommendation: [Target SID]`
- Fine-Tune (LoRA, sketched below):
  - Freeze most LLM weights. Train Low-Rank Adapters + the Embedding Layer.
  - Why: This preserves the LLM's reasoning ability while adapting it to the strict formatting of RecSys.
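A sketch of the LoRA configuration with `peft`, assuming the model from the Step 2.2 sketch; the rank, alpha, and (Llama-style) target module names are assumptions.

```python
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # adapters on attention projections
    modules_to_save=["embed_tokens", "lm_head"],              # keep training the expanded embeddings
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_cfg)

# One SFT example in the prompt format described above.
example = (
    "User History: <l0_12> <l1_55> <l2_108>. "
    'Constraint: "Suggest a sci-fi movie." '
    "Recommendation: <l0_7> <l1_201> <l2_33>"
)
```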
Phase 3: Evaluation & Comparison
Evaluate both systems on the held-out test set using three specific lenses.
3.1 Strict Retrieval Metrics (Accuracy)
- Metrics: NDCG@10 and Recall@10 (a computation sketch follows this list).
- Method: Use Beam Search (beam=10) to generate the next SID tuple.
- Hypothesis: PLUM should outperform Baseline because the IDs contain behavioral signals, making the "grammar" of the sequence easier to predict.
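A sketch of the two metrics for the single-target next-item setting, assuming beam search returns a ranked list of 10 candidate SIDs per test case.

```python
import math

def recall_at_k(ranked, target, k=10):
    """1 if the ground-truth SID appears in the top-k beams, else 0."""
    return 1.0 if target in ranked[:k] else 0.0

def ndcg_at_k(ranked, target, k=10):
    """With a single relevant item, DCG = 1/log2(rank+1) and the ideal DCG is 1."""
    for rank, sid in enumerate(ranked[:k], start=1):
        if sid == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0

# Report the mean of each metric over all test users.
```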
3.2 Cold-Start Generalization
- Test Subset: Items in the test set that had zero interactions during training (pure cold-start); a filter sketch follows this list.
- Hypothesis:
  - Baseline: Will struggle or hallucinate (random IDs).
  - PLUM: Should succeed. The LLM can "read" the item's content via the Semantic ID and relate it conceptually to the user's history, even without behavioral logs.
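A minimal filter for that subset, assuming the `train`/`test` DataFrames from the Phase 0 split.

```python
seen_in_train = set(train["item_id"])
cold_start_test = test[~test["item_id"].isin(seen_in_train)]   # zero training interactions
```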
3.3 Steerability & Coherence (Qualitative)
- Task: "Counterfactual" Recommendation.
  - Input: A user history full of Children's Movies.
  - Prompt: "Recommend a Horror movie."
- Check:
  - Baseline: Cannot handle the prompt (it ignores the text constraint).
  - PLUM: Should output valid Horror SIDs. Validate whether the recommended horror movie is "semantically close" to the history (e.g., "scary but simple") or merely generic.