
First Phase Attack Experiment #3

Phase 1: Initial Attack Experiment Using the BigCodeBench Dataset

This phase aims to simulate a retrieval poisoning and up-ranking attack against LLM-based code retrieval or RAG (Retrieval-Augmented Generation) systems. The pipeline leverages the BigCodeBench dataset and combines embedding-based similarity scoring, up-ranking optimization, and content-based injection methods.


1. Dataset Preparation

Source:

  • Collect the BigCodeBench dataset from Hugging Face and convert it into .csv format.
  • The dataset contains:
    • Queries (Q): Natural language programming tasks or prompts.
    • Code Snippets (D): Ground-truth code solutions.

Splitting:

  1. Separate dataset into:
    • Queries Pool
    • Code Pool
  2. From the Queries Pool:
    • Split into Up-rank Queries (80%) and Test Queries (20%).
  3. The Code Pool will serve as the document base (D₁…Dₖ); a loading-and-splitting sketch follows this list.
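
The sketch below illustrates this preparation step. It assumes the Hugging Face dataset ID bigcode/bigcodebench and the column names task_id, instruct_prompt, and canonical_solution; the actual split and field names should be checked against the downloaded release.

```python
import pandas as pd
from datasets import load_dataset
from sklearn.model_selection import train_test_split

# Load BigCodeBench and flatten it to CSV. The dataset ID and column names
# (instruct_prompt, canonical_solution) are assumptions; adjust them to the
# schema of the release actually downloaded.
raw = load_dataset("bigcode/bigcodebench")
df = raw[list(raw.keys())[0]].to_pandas()
df.to_csv("bigcodebench.csv", index=False)

# Queries Pool (Q) and Code Pool (D).
queries = df[["task_id", "instruct_prompt"]].rename(columns={"instruct_prompt": "query"})
code_pool = df[["task_id", "canonical_solution"]].rename(columns={"canonical_solution": "code"})

# 80/20 split of the Queries Pool into up-rank and test queries.
uprank_queries, test_queries = train_test_split(queries, test_size=0.2, random_state=42)

queries.to_csv("queries_pool.csv", index=False)
code_pool.to_csv("code_pool.csv", index=False)
uprank_queries.to_csv("uprank_queries.csv", index=False)
test_queries.to_csv("test_queries.csv", index=False)
```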

2. Code Document Preparation

  • Each document (D₁, D₂, ..., Dₖ) corresponds to a code snippet.
  • Merge approximately 50 code snippets into a single text file to simulate realistically sized code documents.
  • These merged files will be used as the base corpus for embedding and similarity computation (a merging sketch follows this list).
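
A minimal merging sketch, assuming code_pool.csv comes from the previous step; the 50-snippets-per-file grouping and the separator format are illustrative choices.

```python
from pathlib import Path
import pandas as pd

SNIPPETS_PER_DOC = 50  # roughly 50 snippets per merged document

code_pool = pd.read_csv("code_pool.csv")
out_dir = Path("merged_docs")
out_dir.mkdir(exist_ok=True)

for doc_idx, start in enumerate(range(0, len(code_pool), SNIPPETS_PER_DOC)):
    chunk = code_pool.iloc[start:start + SNIPPETS_PER_DOC]
    # Mark each snippet with its task_id so provenance is preserved inside
    # the merged document.
    parts = [f"# --- {row.task_id} ---\n{row.code}" for row in chunk.itertuples()]
    (out_dir / f"doc_{doc_idx:04d}.txt").write_text("\n\n".join(parts), encoding="utf-8")
```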

3. Embedding Generation

Embedding Models:

  • Use two models:
    • BGE-M3
    • mGTE
  • For each query and document:
    • Generate embedding vectors using both models.
    • The final embedding vector is the element-wise average of the two models’ embeddings (see the sketch at the end of this section).

Purpose:

  • Create a unified vector space for both queries and documents to compute semantic similarity.
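
Below is a sketch of the dual-model embedding step, assuming the Hugging Face model IDs BAAI/bge-m3 and Alibaba-NLP/gte-multilingual-base, both loaded through sentence-transformers. The element-wise average only applies when the two models emit vectors of the same dimensionality; the fallback concatenates the normalized vectors, whose dot product equals the mean of the two per-model cosine similarities.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Model IDs are assumptions for BGE-M3 and mGTE.
bge = SentenceTransformer("BAAI/bge-m3")
mgte = SentenceTransformer("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True)

def dual_embed(texts):
    """Fuse the two models' embeddings into one vector per input text."""
    a = np.asarray(bge.encode(texts, normalize_embeddings=True))
    b = np.asarray(mgte.encode(texts, normalize_embeddings=True))
    if a.shape[1] == b.shape[1]:
        fused = (a + b) / 2.0  # element-wise average, as specified above
    else:
        # Dimensions differ: concatenating the normalized vectors gives a
        # cosine similarity equal to the mean of the two per-model cosines.
        fused = np.concatenate([a, b], axis=1)
    # Re-normalize so cosine similarity reduces to a dot product downstream.
    return fused / np.linalg.norm(fused, axis=1, keepdims=True)
```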

4. Query-Document Matching

Step 1 — Similarity Calculation:

  • Compute cosine similarity between each query and every document.
  • For each query Qᵢ, select the Top-5 most similar documents.

Step 2 — Target Document Set:

  • Merge all Top-5 documents across all queries to form the Targeted Document Set (TDS).

Step 3 — Proxy Query Mapping:

  • For each document in TDS:
    • Identify all queries for which the document appears in the Top-5.
    • These are considered the document’s up-ranking proxy queries (a matching sketch follows this list).
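
The three matching steps can be expressed compactly as follows, reusing dual_embed() from Section 3; query_texts and doc_texts are placeholder names for the prepared query strings and merged documents.

```python
from collections import defaultdict
import numpy as np

TOP_K = 5

query_vecs = dual_embed(query_texts)   # (num_queries, dim)
doc_vecs = dual_embed(doc_texts)       # (num_docs, dim)

# Step 1: vectors are L2-normalized, so cosine similarity is a dot product.
sims = query_vecs @ doc_vecs.T         # (num_queries, num_docs)
top_k = np.argsort(-sims, axis=1)[:, :TOP_K]   # Top-5 documents per query

# Steps 2 and 3: Targeted Document Set (union of all Top-5 documents) and
# the proxy-query map (document index -> queries that retrieve it).
proxy_queries = defaultdict(list)
for q_idx, doc_indices in enumerate(top_k):
    for d_idx in doc_indices:
        proxy_queries[int(d_idx)].append(q_idx)
targeted_document_set = sorted(proxy_queries.keys())
```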

5. HotFlip Optimization

Objective:

To up-rank the targeted documents during retrieval.

Steps:

  1. For each targeted document and its proxy queries, perform token-level optimization using HotFlip.
  2. Iteratively modify text tokens (code comments, descriptions) to maximize cosine similarity with corresponding queries.
  3. Evaluate using Precision@k after each optimization round to ensure ranking improvement.

6. Inducing Generator (Web Cloaking Integration)

Objective:

Generate adversarial textual content to induce LLMs to follow malicious links.

Reference:

Footprint-Based Web Cloaking Induction for LLMs
https://arxiv.org/html/2509.00124v1

Process:

  1. For each document in the targeted set:
    • Generate a contextual paragraph designed to entice LLMs or retrieval systems to follow or reference external malicious links.
    • The content mimics an overview, context, or description of the code.
  2. Insert this generated paragraph:
    • At the beginning of the document (as an introduction).
    • Within the code itself as comments or docstrings.

7. Document Synthesis and Injection

Using LaTeX Generator:

  • Employ LLMs and PDF APIs to produce polished PDF files from the modified code using LaTeX templates.
  • Combine all injected documents into a Poisoned RAG Database.

Injection Ratio:

  • Inject poisoned documents into the original clean corpus at a ratio of 0.01% to 3% of the corpus size.

8. Evaluation

Objectives:

  1. Up-ranking Evaluation:
    • Recalculate similarity between the poisoned documents and the queries.
    • Measure Precision@k to determine the improvement in ranking positions (a metric sketch follows this list).
  2. ASR (Attack Success Rate):
    • Evaluate whether induced content successfully triggers LLMs to follow or quote malicious links.
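
A minimal Precision@k helper for the up-ranking evaluation; what counts as a relevant document per query (the ground-truth snippet on the clean corpus, or a targeted document when measuring up-ranking) is left to the evaluator and passed in explicitly.

```python
import numpy as np

def precision_at_k(sims, relevant, k=5):
    """Mean Precision@k over all queries.

    sims:     (num_queries, num_docs) similarity matrix.
    relevant: dict or list mapping each query index to the set of document
              indices counted as hits for that query.
    """
    top_k = np.argsort(-sims, axis=1)[:, :k]
    per_query = [
        len({int(d) for d in row} & set(relevant[q])) / k
        for q, row in enumerate(top_k)
    ]
    return float(np.mean(per_query))
```

Running it once on the clean corpus and again after the corpus has been modified gives the before/after ranking comparison described above.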

9. Summary of Phase 1 Workflow

  1. Collect and preprocess BigCodeBench dataset.
  2. Create merged code documents.
  3. Generate dual-model embeddings using BGE-M3 and mGTE.
  4. Match queries and compute cosine similarity.
  5. Identify targeted documents and their proxy queries.
  6. Apply HotFlip optimization to up-rank these documents.
  7. Generate inducing content using web cloaking methodology.
  8. Convert the documents to LaTeX/PDF and inject a small percentage into the clean corpus.
  9. Measure Precision@k and ASR to evaluate attack success.

10. References

  1. HotFlip: White-Box Adversarial Examples for Text Classification
    https://arxiv.org/pdf/1712.06751

  2. Web Cloaking and Inducing Footprint Techniques for LLMs
    https://arxiv.org/html/2509.00124v1

  3. BigCodeBench Dataset (Hugging Face)
    https://tinyurl.com/t4u2jrrp


Notes

  • The embeddings are averaged to reduce model-specific biases and stabilize the similarity space.
  • HotFlip is adapted from NLP to handle structured code and documentation content.
  • The poisoned corpus simulates realistic RAG environments where a small number of malicious entries can manipulate retrieval outcomes.