
Experiment with PLUM: Evaluate Semantic-ID-based recommendations #25


Semantic ID Recommendation System: From Baseline to PLUM

Phase 0: Data Foundation

Principle: Apples-to-apples comparison requires a shared ground truth.

Before branching into specific architectures, establish a unified data pipeline.

  1. Select Dataset
    • Use a dataset with rich textual metadata and strong user signals (e.g., Amazon Reviews or MovieLens).
  2. Temporal Split
    • Freeze data into Train, Validation, and Test sets based on timestamps (e.g., Train: 2023, Test: Jan 2024) to prevent data leakage.
  3. Signal Extraction
    • Text Corpus: Extract Title + Description for every item.
    • Interaction Graph: Extract user sessions User_ID $\rightarrow$ [Item_A, Item_B, ...]. A minimal pipeline sketch follows this list.
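A minimal pandas sketch of the split and extraction, assuming an interaction table with hypothetical columns user_id, item_id, title, description, and a datetime timestamp (the file path is also an assumption):

```python
import pandas as pd

# Hypothetical input: one row per (user, item, timestamp) interaction.
df = pd.read_parquet("interactions.parquet").sort_values("timestamp")

# Temporal split: train on 2023, test on Jan 2024 (no leakage).
train = df[df.timestamp < "2024-01-01"]
test = df[(df.timestamp >= "2024-01-01") & (df.timestamp < "2024-02-01")]

# Text corpus: Title + Description for every item.
corpus = (
    df.drop_duplicates("item_id")
      .assign(text=lambda d: d.title + ". " + d.description)
      .set_index("item_id")["text"]
)

# Interaction graph: user_id -> ordered list of item_ids (train only).
sessions = train.groupby("user_id")["item_id"].apply(list)
```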

Phase 1: The Baseline (Content-Centric)

Goal: Replicate the blog post's approach. Prove that discrete tokens can represent items based on "what they are."

Step 1.1: The Content Indexer (Standard RQ-VAE)

This step "teaches" the system the vocabulary of the items based only on their text.

  • Generate Content Embeddings:
    • Pass item metadata through a frozen encoder (e.g., Sentence-BERT or E5).
    • Output: A dense vector $x_{content}$ (e.g., 768-dim) for every item.
  • Train RQ-VAE (a minimal sketch follows this list):
    • Objective: Minimize the reconstruction plus quantization loss: $$L = \|x - \hat{x}\|^2 + \|\mathrm{sg}[z] - e\|^2 + \beta\,\|z - \mathrm{sg}[e]\|^2$$ where $z$ is the encoder output, $e$ the selected codeword, and $\mathrm{sg}[\cdot]$ the stop-gradient operator.
    • Config: 3 codebooks, size $K=256$.
    • Output: Item_ID $\rightarrow$ (12, 55, 108).
  • Uniqueness Check:
    • Calculate the collision rate. If multiple items map to (12, 55, 108), append a deterministic suffix (0, 1, ...) to ensure a 1-to-1 mapping.
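A hedged PyTorch sketch of the residual quantizer at the heart of the RQ-VAE (the encoder/decoder MLPs and training loop are elided; dims, depth, and $\beta$ follow common VQ-VAE conventions and the config above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualQuantizer(nn.Module):
    """Quantizes a vector into `depth` codes by repeatedly quantizing the residual."""

    def __init__(self, dim=768, depth=3, K=256, beta=0.25):
        super().__init__()
        self.codebooks = nn.ModuleList(nn.Embedding(K, dim) for _ in range(depth))
        self.beta = beta

    def forward(self, z):
        residual, quantized, codes, loss = z, torch.zeros_like(z), [], 0.0
        for cb in self.codebooks:
            # Nearest codeword for the current residual.
            idx = torch.cdist(residual, cb.weight).argmin(dim=-1)   # (B,)
            e = cb(idx)                                             # (B, dim)
            # Codebook term ||sg[z] - e||^2 and commitment term ||z - sg[e]||^2.
            loss = loss + F.mse_loss(e, residual.detach()) \
                        + self.beta * F.mse_loss(residual, e.detach())
            e = residual + (e - residual).detach()  # straight-through estimator
            quantized, residual = quantized + e, residual - e
            codes.append(idx)
        # A decoder reconstructs x from `quantized`; add ||x - x_hat||^2 outside.
        return quantized, torch.stack(codes, dim=-1), loss
```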

Step 1.2: The Sequential Model (Tabula Rasa)

This step trains a model from scratch to predict the next content-based ID.

  • Sequence Tokenization:
    • Convert user history into flat token streams (tokenization sketch after this list):
      [BOS] <12> <55> <108> [SEP] <45> <12> <99> ...
  • Model Initialization:
    • Initialize a standard Transformer Decoder (e.g., GPT-2 config, ~100M params) with random weights.
    • Vocabulary size: codebook size $\times$ depth + special tokens ($256 \times 3$ plus specials here).
  • Training:
    • Task: Next-token prediction (Cross Entropy).
    • Constraint: No natural language text is used here, only ID tokens.
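A small sketch of the token layout, assuming the (level, code) tuples from Step 1.1 are stored in a dict `sid`; the per-level offset scheme and special-token ids are assumptions:

```python
K, DEPTH = 256, 3
BOS, SEP = 0, 1       # special tokens
NUM_SPECIALS = 2      # vocab size = NUM_SPECIALS + K * DEPTH

def encode_history(items, sid):
    """Flatten an ordered item list into ID tokens: [BOS] <c1> <c2> <c3> [SEP] ..."""
    tokens = [BOS]
    for item in items:
        for level, code in enumerate(sid[item]):
            # Give each (level, code) pair its own token id via a per-level offset.
            tokens.append(NUM_SPECIALS + level * K + code)
        tokens.append(SEP)
    return tokens

# e.g. encode_history(["A"], {"A": (12, 55, 108)}) -> [0, 14, 313, 622, 1]
```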

Phase 2: The PLUM Upgrade (Collaborative-Centric)

Goal: Integrate the PLUM paper's findings. Prove that IDs should represent "how items are used" and that LLMs can leverage this structure.

Step 2.1: The Collaborative Indexer (Contrastive RQ-VAE)

Refinement: Items bought together should have similar IDs, even if their descriptions differ.

  • Construct Co-occurrence Pairs:
    • Mine the interaction graph to find positive pairs $(x, x^+)$ (items appearing in the same session).
  • Train PLUM RQ-VAE:
    • Input: Same $x_{content}$ as Phase 1.
    • New Objective: Add a contrastive loss (a sketch follows this list).
      $$L_{total} = L_{recon} + \lambda \cdot L_{contrastive}(x, x^+, x^-)$$
      This pulls the quantized representation of $x$ closer to $x^+$ than to a random negative $x^-$.
    • Progressive Masking: Randomly drop the deepest quantization layer during training. This forces the top-level tokens (prefixes) to learn robust, broad categories.
    • Result: A new mapping Item_ID $\rightarrow$ (99, 12, 44) that reflects behavioral clusters.
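One plausible form of the contrastive term, using in-batch negatives (InfoNCE) over the quantized representations; the exact loss in the PLUM paper may differ, and the temperature value is an assumption:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, q_pos, temperature=0.07):
    """InfoNCE: row i of `q` should match row i of `q_pos`; other rows are negatives."""
    q = F.normalize(q, dim=-1)
    q_pos = F.normalize(q_pos, dim=-1)
    logits = q @ q_pos.T / temperature                   # (B, B) similarities
    labels = torch.arange(q.size(0), device=q.device)    # positives on the diagonal
    return F.cross_entropy(logits, labels)

# total_loss = recon_loss + lam * contrastive_loss(q, q_pos)
```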

Step 2.2: LLM Integration (Continued Pre-training)

Refinement: Don't learn language from scratch; teach a pre-trained LLM the new domain-specific vocabulary.

  • Vocabulary Surgery:
    • Load a pre-trained LLM (e.g., Llama-3-8B or Qwen).
    • Resize Embeddings: Expand the LLM's embedding matrix to include the RQ-VAE codebook tokens. Avoid random initialization where possible; initialize each new token embedding near the centroid of its cluster (see the sketch after this list).
  • Construct "Bilingual" Corpus:
    • Identity Data: "The item [Title] is represented by ID [SID]." (Aligns world knowledge with SIDs).
    • Sequence Data: "User history: [SID_A] [SID_B]." (Aligns collaborative patterns).
  • Run Continued Pre-training (CPT):
    • Train the LLM on this mixed corpus.
    • Goal: The model learns to "read" Semantic IDs as fluently as English words.
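A hedged sketch of the vocabulary surgery with HuggingFace transformers; the model name, the token format, and the precomputed centroid file are all assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"   # illustrative; Qwen works the same way
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# One new token per (level, code) pair, e.g. "<sid_0_12>" (format is an assumption).
new_tokens = [f"<sid_{lvl}_{code}>" for lvl in range(3) for code in range(256)]
tok.add_tokens(new_tokens)
model.resize_token_embeddings(len(tok))

# Initialize each new embedding near its cluster centroid rather than randomly.
# `sid_centroids.pt` is a hypothetical (num_new_tokens, hidden_dim) tensor.
centroids = torch.load("sid_centroids.pt")
with torch.no_grad():
    emb = model.get_input_embeddings().weight
    emb[-len(new_tokens):] = centroids + 0.01 * torch.randn_like(centroids)
```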

Step 2.3: Instruction Tuning (Steerability)

Refinement: Enable the model to follow natural language constraints.

  • Create SFT Dataset:
    • Format prompts: User History: [SID sequence]. Constraint: "Suggest a sci-fi movie." Recommendation: [Target SID].
  • Fine-Tune (LoRA):
    • Freeze most LLM weights; train low-rank adapters plus the embedding layer (see the sketch below).
    • Why: This preserves the LLM's reasoning ability while adapting it to the strict formatting of RecSys.
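A minimal LoRA setup sketch with the peft library; the target module names follow common Llama conventions, and the rank/alpha values are assumptions:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    modules_to_save=["embed_tokens", "lm_head"],  # keep (resized) embeddings trainable
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # `model` from the Step 2.2 sketch
model.print_trainable_parameters()
```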

Phase 3: Evaluation & Comparison

Evaluate both systems on the held-out test set using three specific lenses.

3.1 Strict Retrieval Metrics (Accuracy)

  • Metrics: NDCG@10 and Recall@10.
  • Method: Use beam search (beam=10) to generate the next Semantic ID tuple (evaluation sketch below).
  • Hypothesis: PLUM should outperform Baseline because the IDs contain behavioral signals, making the "grammar" of the sequence easier to predict.
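A small sketch of both metrics for the single-target next-item setting, assuming `predictions` holds each user's ranked beam outputs (SID tuples) and `targets` the held-out next item:

```python
import math

def recall_and_ndcg_at_k(predictions, targets, k=10):
    recall = ndcg = 0.0
    for preds, target in zip(predictions, targets):
        top_k = preds[:k]
        if target in top_k:
            recall += 1.0
            ndcg += 1.0 / math.log2(top_k.index(target) + 2)  # rank 0 -> 1/log2(2) = 1
        # One relevant item per user, so IDCG = 1 and no further normalization.
    return recall / len(targets), ndcg / len(targets)
```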

3.2 Cold-Start Generalization

  • Test Subset: Items in the test set that had zero interactions during training (pure cold-start).
  • Hypothesis:
    • Baseline: Will struggle or hallucinate (emit arbitrary or invalid IDs).
    • PLUM: Should succeed. The LLM can "read" the item's content via the Semantic ID and relate it conceptually to the user's history, even without behavioral logs.

3.3 Steerability & Coherence (Qualitative)

  • Task: "Counterfactual" recommendation.
    • Input: A user history full of Children's Movies.
    • Prompt: "Recommend a Horror movie."
  • Check:
    • Baseline: Cannot handle the prompt (it ignores the text constraint).
    • PLUM: Should output valid Horror SIDs. Validate if the recommended horror movie is "semantically close" (e.g., "scary but simple") or generic.
