Phase 1: Experimenting with an Initial Attack Using the BigCodeBench Dataset
This phase aims to simulate a retrieval poisoning and up-ranking attack against LLM-based code retrieval or RAG (Retrieval-Augmented Generation) systems. The pipeline leverages the BigCodeBench dataset and combines embedding-based similarity scoring, up-ranking optimization, and content-based injection methods.
1. Dataset Preparation
Source:
- Collect the BigCodeBench dataset from Hugging Face and convert it into .csv format.
- The dataset contains:
- Queries (Q): Natural language programming tasks or prompts.
- Code Snippets (D): Ground-truth code solutions.
Splitting:
- Separate dataset into:
- Queries Pool
- Code Pool
- From the queries pool:
- Split into Uprank Queries (80%) and Test Queries (20%).
- The Code Pool will serve as the document base (D₁…Dₖ).
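A minimal sketch of this preparation step, assuming the `bigcode/bigcodebench` dataset id and the `instruct_prompt` / `canonical_solution` columns as the query and code fields (these identifiers and the split name are assumptions; adjust them to the actual dataset card):

```python
# Hedged sketch: load BigCodeBench, export to CSV, and split query/code pools.
# Dataset id, split name, and column names are assumptions.
from datasets import load_dataset
from sklearn.model_selection import train_test_split

ds = load_dataset("bigcode/bigcodebench", split="train")   # assumed id/split
df = ds.to_pandas()
df.to_csv("bigcodebench.csv", index=False)

queries_pool = df["instruct_prompt"].tolist()        # assumed query column (Q)
code_pool = df["canonical_solution"].tolist()        # assumed code column (D)

# 80% of the queries drive the up-ranking optimization, 20% are held out for testing.
uprank_queries, test_queries = train_test_split(
    queries_pool, test_size=0.2, random_state=42
)
```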
2. Code Document Preparation
- Each document (D₁, D₂, ..., Dₖ) corresponds to a code snippet.
- Merge approximately 50 code snippets per text file to simulate realistic code documents.
- These merged files will be used as the base corpus for embedding and similarity computation.
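A short sketch of the merge, assuming the `code_pool` list from the previous step; the chunk size of 50 comes from the plan, while the file naming is purely illustrative:

```python
# Hedged sketch: merge roughly 50 code snippets per text file to build the
# document corpus (D1...Dk). Output layout is illustrative.
from pathlib import Path

def build_corpus(code_pool, out_dir="corpus", snippets_per_doc=50):
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    docs = []
    for i in range(0, len(code_pool), snippets_per_doc):
        doc_text = "\n\n".join(code_pool[i:i + snippets_per_doc])
        path = out / f"doc_{i // snippets_per_doc:04d}.txt"
        path.write_text(doc_text, encoding="utf-8")
        docs.append(doc_text)
    return docs

docs = build_corpus(code_pool)
```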
3. Embedding Generation
Embedding Models:
- Use two models:
- BGE-M3
- mGTE
- For each query and document:
- Generate embedding vectors using both models.
- The final embedding vector = average of the two embeddings.
Purpose:
- Create a unified vector space for both queries and documents to compute semantic similarity.
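A hedged sketch of the dual-model embedding using sentence-transformers. The two model identifiers are assumptions, and element-wise averaging presumes both encoders emit vectors of the same dimension; if they differ, a projection or concatenation would be needed instead:

```python
# Hedged sketch: embed texts with two encoders and average the L2-normalized
# vectors. Model ids are assumptions; averaging assumes equal output dimensions.
from sentence_transformers import SentenceTransformer

bge = SentenceTransformer("BAAI/bge-m3")                          # assumed id
gte = SentenceTransformer("Alibaba-NLP/gte-multilingual-base",    # assumed id
                          trust_remote_code=True)

def embed(texts):
    a = bge.encode(texts, normalize_embeddings=True)
    b = gte.encode(texts, normalize_embeddings=True)
    return (a + b) / 2.0          # unified vector per text

query_vecs = embed(uprank_queries)
doc_vecs = embed(docs)
```

Normalizing each model's output before averaging keeps either encoder from dominating the combined vector.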
4. Query-Document Matching
Step 1 — Similarity Calculation:
- Compute cosine similarity between each query and every document.
- For each query Qᵢ, select the Top-5 most similar documents (D₁…D₅).
Step 2 — Target Document Set:
- Merge all Top-5 documents across all queries to form the Targeted Document Set (TDS).
Step 3 — Proxy Query Mapping:
- For each document in TDS:
- Identify all queries for which the document appears in the Top-5.
- These are considered the document’s upranking proxy queries.
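A minimal sketch of all three matching steps on the averaged vectors from the previous section (variable names carry over from the earlier sketches):

```python
# Hedged sketch: cosine similarity, Top-5 retrieval, TDS, and proxy-query mapping.
# Cosine similarity is recomputed explicitly because averaged vectors are only
# approximately unit-length.
import numpy as np
from collections import defaultdict

def cosine_matrix(q, d):
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    return q @ d.T                                    # (num_queries, num_docs)

sims = cosine_matrix(query_vecs, doc_vecs)
top5 = np.argsort(-sims, axis=1)[:, :5]               # Top-5 doc indices per query

targeted_docs = set(top5.flatten().tolist())          # Targeted Document Set (TDS)

proxy_queries = defaultdict(list)                     # doc index -> proxy query indices
for q_idx, doc_idxs in enumerate(top5):
    for d_idx in doc_idxs:
        proxy_queries[int(d_idx)].append(q_idx)
```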
5. HotFlip Optimization
Objective:
To up-rank the targeted documents during retrieval.
Algorithm:
- Apply the HotFlip method as described in:
HotFlip: White-Box Adversarial Examples for Text Classification
https://arxiv.org/pdf/1712.06751
Steps:
- For each targeted document and its proxy queries, perform token-level optimization using HotFlip.
- Iteratively modify text tokens (code comments, descriptions) to maximize cosine similarity with corresponding queries.
- Evaluate using Precision@k after each optimization round to ensure ranking improvement.
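A heavily hedged sketch of a single HotFlip-style substitution step. The retriever checkpoint, CLS pooling, and the restriction of edits to comment/docstring token positions are all assumptions; the full attack would loop this step over every targeted document and each of its proxy queries, keep only swaps that raise the measured similarity, and re-check Precision@k after each round:

```python
# Hedged sketch of one first-order, gradient-guided HotFlip substitution.
# Checkpoint id and CLS pooling are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "BAAI/bge-m3"                       # assumed retriever checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
enc = AutoModel.from_pretrained(MODEL_NAME)
enc.eval()

def hotflip_step(doc_ids, attention_mask, query_vec, editable_positions):
    """Return (position, new_token_id) with the best first-order similarity gain.

    doc_ids:            LongTensor (L,) with the document's token ids.
    attention_mask:     LongTensor (1, L).
    query_vec:          FloatTensor (H,) for the target proxy query.
    editable_positions: token indices inside comments/docstrings allowed to change.
    """
    emb_matrix = enc.get_input_embeddings().weight             # (V, H)
    doc_embeds = emb_matrix[doc_ids].detach().clone().requires_grad_(True)

    out = enc(inputs_embeds=doc_embeds.unsqueeze(0), attention_mask=attention_mask)
    doc_vec = out.last_hidden_state[:, 0]                      # CLS pooling (assumed)
    sim = torch.nn.functional.cosine_similarity(doc_vec, query_vec.unsqueeze(0))
    sim.sum().backward()
    grad = doc_embeds.grad                                      # (L, H)

    best_pos, best_tok, best_gain = None, None, 0.0
    with torch.no_grad():
        for pos in editable_positions:
            # First-order estimate of the similarity change for every swap at `pos`.
            gain = (emb_matrix - doc_embeds[pos]) @ grad[pos]   # (V,)
            tok_id = int(torch.argmax(gain))
            if gain[tok_id].item() > best_gain:
                best_pos, best_tok, best_gain = pos, tok_id, gain[tok_id].item()
    return best_pos, best_tok
```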
6. Inducing Generator (Web Cloaking Integration)
Objective:
Generate adversarial textual content to induce LLMs to follow malicious links.
Reference:
Footprint-Based Web Cloaking Induction for LLMs
https://arxiv.org/html/2509.00124v1
Process:
- For each document in the targeted set:
- Generate a contextual paragraph designed to entice LLMs or retrieval systems to click or reference external malicious links.
- The content mimics an overview, context, or description of the code.
- Insert this generated paragraph:
- At the beginning of the document (as an introduction).
- Within the code itself as comments or docstrings.
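A small sketch of the insertion itself, assuming the enticing paragraph has already been produced by an LLM (`lure_paragraph` is a placeholder for that output):

```python
# Hedged sketch: splice an LLM-generated inducing paragraph into a code document,
# both as a plain-text introduction and as comment lines inside the code.
def inject_inducing_content(doc_text: str, lure_paragraph: str) -> str:
    # 1) Prepend the paragraph as an introduction to the document.
    intro = lure_paragraph + "\n\n"
    # 2) Repeat it as comment lines so it survives code-only parsing.
    commented = "\n".join("# " + line for line in lure_paragraph.splitlines())
    return intro + commented + "\n" + doc_text
```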
7. Document Synthesis and Injection
Using a LaTeX Generator:
- Employ LLMs and PDF APIs to produce polished PDF files from the modified code using LaTeX templates.
- Combine all injected documents into a Poisoned RAG Database.
Injection Ratio:
- Inject poisoned documents at a ratio of 0.01% to 3% of the original clean corpus.
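A minimal sketch of the mixing step; interpreting the ratio as a percentage of the clean corpus size is an assumption:

```python
# Hedged sketch: mix poisoned documents into the clean corpus at a small ratio.
# The ratio is given in percent and measured against the clean corpus size.
import random

def build_poisoned_corpus(clean_docs, poisoned_docs, ratio_percent=0.01):
    n_inject = max(1, int(len(clean_docs) * ratio_percent / 100))
    injected = random.sample(poisoned_docs, min(n_inject, len(poisoned_docs)))
    corpus = clean_docs + injected
    random.shuffle(corpus)
    return corpus
```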
8. Evaluation
Objectives:
- Upranking Evaluation:
- Recalculate similarity between poisoned documents and queries.
- Measure Precision@k to determine the improvement in ranking positions.
- ASR (Attack Success Rate):
- Evaluate whether induced content successfully triggers LLMs to follow or quote malicious links.
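Minimal sketches of both metrics. Precision@k here measures how much of each query's top-k is occupied by poisoned documents, and the ASR helper simply counts responses that quote a planted URL; the response format and the notion of a planted-URL list are assumptions:

```python
# Hedged sketch: Precision@k over poisoned documents, plus a simple ASR count.
import numpy as np

def precision_at_k(sims, poisoned_idx, k=5):
    """Fraction of top-k slots occupied by poisoned documents, averaged over queries."""
    poisoned = set(poisoned_idx)
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = [len(poisoned.intersection(row.tolist())) / k for row in topk]
    return float(np.mean(hits))

def attack_success_rate(llm_responses, planted_urls):
    """Share of LLM responses that quote or follow at least one planted URL."""
    hits = sum(any(url in resp for url in planted_urls) for resp in llm_responses)
    return hits / max(1, len(llm_responses))
```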
9. Summary of Phase 1 Workflow
- Collect and preprocess BigCodeBench dataset.
- Create merged code documents.
- Generate dual-model embeddings using BGE-M3 and mGTE.
- Match queries and compute cosine similarity.
- Identify targeted documents and their proxy queries.
- Apply HotFlip optimization to up-rank these documents.
- Generate inducing content using web cloaking methodology.
- Convert to LaTeX/PDF and inject a small percentage into the clean corpus.
- Measure Precision@k and ASR to evaluate attack success.
10. References
- HotFlip: White-Box Adversarial Examples for Text Classification
  https://arxiv.org/pdf/1712.06751
- Web Cloaking and Inducing Footprint Techniques for LLMs
  https://arxiv.org/html/2509.00124v1
- BigCodeBench Dataset (Hugging Face)
  https://tinyurl.com/t4u2jrrp
Notes
- The embeddings are averaged to reduce model-specific biases and stabilize similarity space.
- HotFlip is adapted from NLP to handle structured code and documentation content.
- The poisoned corpus simulates realistic RAG environments where a small number of malicious entries can manipulate retrieval outcomes.
