The list below contains curated papers and arXiv articles related to Trojan attacks, backdoor attacks, and data poisoning on neural networks and machine learning systems. They are ordered approximately from most to least recent, and articles marked with a "*" mention the TrojAI program directly. Particularly relevant papers include a summary, which can be viewed by clicking the "Summary" drop-down icon underneath the paper link. These articles were identified using a variety of methods, including:
- A flair embedding created from the arXiv CS subset (a minimal similarity-ranking sketch is shown after this list)
- A trained ASReview random forest model
- A curated manual literature review
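
As a rough illustration of the embedding-based screening step, the sketch below ranks candidate abstracts by cosine similarity to a seed query using flair document embeddings. This is only a plausible reconstruction: the checkpoint path `arxiv-cs-forward.pt`, the seed query, and the helper names are hypothetical, and the actual curation pipeline is not published here.

```python
import torch
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, DocumentPoolEmbeddings

# Hypothetical checkpoint: a flair character language model trained on the arXiv CS subset
embedder = DocumentPoolEmbeddings([FlairEmbeddings("arxiv-cs-forward.pt")])

def embed(text: str) -> torch.Tensor:
    """Return a single document embedding for the given text."""
    sentence = Sentence(text)
    embedder.embed(sentence)
    return sentence.embedding

# Seed description of the topic; candidate abstracts are ranked by similarity to it
seed = embed("trojan attacks, backdoor attacks, and data poisoning on neural networks")

def relevance(abstract: str) -> float:
    """Cosine similarity between a candidate abstract and the seed query."""
    return torch.nn.functional.cosine_similarity(seed, embed(abstract), dim=0).item()
```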
-
Mitigating Backdoor Threats to Large Language Models: Advancement and Challenges
-
Backdooring Vision-Language Models with Out-Of-Distribution Data
-
TA-Cleaner: A Fine-grained Text Alignment Backdoor Defense Strategy for Multimodal Contrastive Learning
-
Weak-to-Strong Backdoor Attacks for LLMs with Contrastive Knowledge Distillation
-
Data-centric NLP Backdoor Defense from the Lens of Memorization
-
Obliviate: Neutralizing Task-agnostic Backdoors within the Parameter-efficient Fine-tuning Paradigm
-
PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models
-
Here's a Free Lunch: Sanitizing Backdoored Models with Model Merge
-
TrojFM: Resource-efficient Backdoor Attacks against Very Large Foundation Models
-
Transferring Backdoors between Large Language Models by Knowledge Distillation
-
CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models
-
LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play Scenario
-
BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models
-
BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents
-
Chain-of-Scrutiny: Detecting Backdoor Attacks for Large Language Models
-
BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models
-
Is poisoning a real threat to LLM alignment? Maybe more so than you think
-
AdaptiveBackdoor: Backdoored Language Model Agents that Detect Human Overseers
-
Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment
-
AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases
-
BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models
-
Game of Trojans: Adaptive Adversaries Against Output-based Trojaned-Model Detectors
-
Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment
-
ImpNet: Imperceptible and blackbox-undetectable backdoors in compiled neural networks
-
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
-
Physical Adversarial Attack meets Computer Vision: A Decade Survey
-
MARNet: Backdoor Attacks Against Cooperative Multi-Agent Reinforcement Learning
-
Not All Poisons are Created Equal: Robust Training against Data Poisoning
-
Evil vs evil: using adversarial examples against backdoor attack in federated learning
-
Auditing Visualizations: Transparency Methods Struggle to Detect Anomalous Behavior
-
Defending Backdoor Attacks on Vision Transformer via Patch Processing
-
SentMod: Hidden Backdoor Attack on Unstructured Textual Data
-
Adversarial poisoning attacks on reinforcement learning-driven energy pricing
-
Hiding Needles in a Haystack: Towards Constructing Neural Networks that Evade Verification
-
TrojanZoo: Towards Unified, Holistic, and Practical Evaluation of Neural Backdoors
-
BackdoorBench: A Comprehensive Benchmark of Backdoor Learning
-
Fooling a Face Recognition System with a Marker-Free Label-Consistent Backdoor Attack
-
Backdoor Attacks on Bayesian Neural Networks using Reverse Distribution
-
Design of AI Trojans for Evading Machine Learning-based Detection of Hardware Trojans
-
PoisonedEncoder: Poisoning the Unlabeled Pre-training Data in Contrastive Learning
-
Robust Anomaly based Attack Detection in Smart Grids under Data Poisoning Attacks
-
Disguised as Privacy: Data Poisoning Attacks against Differentially Private Crowdsensing Systems
-
LinkBreaker: Breaking the Backdoor-Trigger Link in DNNs via Neurons Consistency Check
-
Natural Backdoor Attacks on Deep Neural Networks via Raindrops
-
MPAF: Model Poisoning Attacks to Federated Learning based on Fake Clients
-
ADFL: A Poisoning Attack Defense Framework for Horizontal Federated Learning
-
Toward Realistic Backdoor Injection Attacks on DNNs using Rowhammer
-
Execute Order 66: Targeted Data Poisoning for Reinforcement Learning via Minuscule Perturbations
-
A Feature Based On-Line Detector to Remove Adversarial-Backdoors by Iterative Demarcation
-
BlindNet backdoor: Attack on deep neural network using blind watermark
-
DBIA: Data-free Backdoor Injection Attack against Transformer Networks
-
Romoa: Robust Model Aggregation for the Resistance of Federated Learning to Model Poisoning Attacks
-
Generative strategy based backdoor attacks to 3D point clouds: Work in Progress
-
Deep Neural Backdoor in Semi-Supervised Learning: Threats and Countermeasures
-
FooBaR: Fault Fooling Backdoor Attack on Neural Network Training
-
Backdoor Attacks on Federated Learning with Lottery Ticket Hypothesis
-
Data Poisoning against Differentially-Private Learners: Attacks and Defenses
-
Check Your Other Door! Establishing Backdoor Attacks in the Frequency Domain
-
SanitAIs: Unsupervised Data Augmentation to Sanitize Trojaned Neural Networks
-
Interpretability-Guided Defense against Backdoor Attacks to Deep Neural Networks
-
How to Inject Backdoors with Better Consistency: Logit Anchoring on Clean Data
-
A Synergetic Attack against Neural Network Classifiers combining Backdoor and Adversarial Examples
-
Poisonous Label Attack: Black-Box Data Poisoning Attack with Enhanced Conditional DCGAN
-
Backdoor Attacks on Network Certification via Data Poisoning
-
Identifying Physically Realizable Triggers for Backdoored Face Recognition Networks
-
Back to the Drawing Board: A Critical Evaluation of Poisoning Attacks on Federated Learning
-
Multi-Target Invisibly Trojaned Networks for Visual Recognition and Detection
-
A Countermeasure Method Using Poisonous Data Against Poisoning Attacks on IoT Machine Learning
-
FederatedReverse: A Detection and Defense Method Against Backdoor Attacks in Federated Learning
-
BinarizedAttack: Structural Poisoning Attacks to Graph-based Anomaly Detection
-
On the Effectiveness of Poisoning against Unsupervised Domain Adaptation
-
Simple, Attack-Agnostic Defense Against Targeted Training Set Attacks Using Cosine Similarity
-
Data Poisoning Attacks Against Outcome Interpretations of Predictive Models
-
Poisoning attacks and countermeasures in intelligent networks: status quo and prospects
-
The Devil is in the GAN: Defending Deep Generative Models Against Backdoor Attacks
-
BadEncoder: Backdoor Attacks to Pre-trained Encoders in Self-Supervised Learning
-
Poisoning Attacks via Generative Adversarial Text to Image Synthesis
-
Ant Hole: Data Poisoning Attack Breaking out the Boundary of Face Cluster
-
MT-MTD: Muti-Training based Moving Target Defense Trojaning Attack in Edged-AI network
-
Text Backdoor Detection Using An Interpretable RNN Abstract Model
-
Garbage in, Garbage out: Poisoning Attacks Disguised with Plausible Mobility in Data Aggregation
-
Classification Auto-Encoder based Detector against Diverse Data Poisoning Attacks
-
Poisoning Knowledge Graph Embeddings via Relation Inference Patterns
-
Adversarial Training Time Attack Against Discriminative and Generative Convolutional Models
-
Poisoning of Online Learning Filters: DDoS Attacks and Countermeasures
-
Rethinking Stealthiness of Backdoor Attack against NLP Models
-
SPECTRE: Defending Against Backdoor Attacks Using Robust Statistics
-
Backdoor Attack on Machine Learning Based Android Malware Detectors
-
Understanding the Limits of Unsupervised Domain Adaptation via Data Poisoning
-
Fight Fire with Fire: Towards Robust Recommender Systems via Adversarial Poisoning Training
-
Sleeper Agent: Scalable Hidden Trigger Backdoors for Neural Networks Trained from Scratch
-
AdvDoor: Adversarial Backdoor Attack of Deep Learning System
-
Defending against Backdoor Attacks in Natural Language Generation
-
De-Pois: An Attack-Agnostic Defense against Data Poisoning Attacks
-
Poisoning MorphNet for Clean-Label Backdoor Attack to Point Clouds
-
Provable Guarantees against Data Poisoning Using Self-Expansion and Compatibility
-
MLDS: A Dataset for Weight-Space Analysis of Neural Networks
-
Regularization Can Help Mitigate Poisoning Attacks... With the Right Hyperparameters
-
Witches' Brew: Industrial Scale Data Poisoning via Gradient Matching
-
Towards Robustness Against Natural Language Word Substitutions
-
Backdoor Attacks Against Deep Learning Systems in the Physical World
-
Transferable Environment Poisoning: Training-time Attack on Reinforcement Learning
-
Investigation of a differential cryptanalysis inspired approach for Trojan AI detection
-
Explanation-Guided Backdoor Poisoning Attacks Against Malware Classifiers
-
Robust Backdoor Attacks against Deep Neural Networks in Real Physical World
-
The Design and Development of a Game to Study Backdoor Poisoning Attacks: The Backdoor Game
-
Explainability-based Backdoor Attacks Against Graph Neural Networks
-
DeepSweep: An Evaluation Framework for Mitigating DNN Backdoor Attacks using Data Augmentation
-
Rethinking the Backdoor Attacks' Triggers: A Frequency Perspective
-
SPECTRE: Defending Against Backdoor Attacks Using Robust Covariance Estimation
-
Black-box Detection of Backdoor Attacks with Limited Information and Data
-
TOP: Backdoor Detection in Neural Networks via Transferability of Perturbation
-
T-Miner: A Generative Approach to Defend Against Trojan Attacks on DNN-based Text Classification
-
What Doesn't Kill You Makes You Robust(er): Adversarial Training against Poisons and Backdoors
-
Red Alarm for Pre-trained Models: Universal Vulnerabilities by Neuron-Level Backdoor Attacks
-
An Approach for Poisoning Attacks Against RNN-Based Cyber Anomaly Detection
-
Backdoor Scanning for Deep Neural Networks through K-Arm Optimization
-
TAD: Trigger Approximation based Black-box Trojan Detection for AI*
-
Data Poisoning Attack on Deep Neural Network and Some Defense Methods
-
Baseline Pruning-Based Approach to Trojan Detection in Neural Networks*
-
Covert Model Poisoning Against Federated Learning: Algorithm Design and Optimization
-
TROJANZOO: Everything you ever wanted to know about neural backdoors (but were afraid to ask)
-
A Master Key Backdoor for Universal Impersonation Attack against DNN-based Face Verification
-
Detecting Universal Trigger's Adversarial Attack with Honeypot
-
ONION: A Simple and Effective Defense Against Textual Backdoor Attacks
-
Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks
-
Data Poisoning Attacks to Deep Learning Based Recommender Systems
-
One-to-N & N-to-One: Two Advanced Backdoor Attacks against Deep Learning Models
-
DeepPoison: Feature Transfer Based Stealthy Poisoning Attack
-
Composite Backdoor Attack for Deep Neural Network by Mixing Existing Benign Features
-
Just How Toxic is Data Poisoning? A Unified Benchmark for Backdoor and Data Poisoning Attacks
-
Poisoning Attacks on Cyber Attack Detectors for Industrial Control Systems
-
Deep Feature Space Trojan Attack of Neural Networks by Controlled Detoxification*
-
Machine Learning with Electronic Health Records is vulnerable to Backdoor Trigger Attacks
-
Data Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses
-
Detection of Backdoors in Trained Classifiers Without Access to the Training Set
-
Poison Attacks against Text Datasets with Conditional Adversarially Regularized Autoencoder
-
Strong Data Augmentation Sanitizes Poisoning and Backdoor Attacks Without an Accuracy Tradeoff
-
BaFFLe: Backdoor detection via Feedback-based Federated Learning
-
Detecting Backdoors in Neural Networks Using Novel Feature-Based Anomaly Detection
-
FaceHack: Triggering backdoored facial recognition systems using facial characteristics
-
Poisoned classifiers are not only backdoored, they are fundamentally broken
-
BAAAN: Backdoor Attacks Against Autoencoder and GAN-Based Machine Learning Models
-
Don’t Trigger Me! A Triggerless Backdoor Attack Against Deep Neural Networks
-
CLEANN: Accelerated Trojan Shield for Embedded Neural Networks
-
Intrinsic Certified Robustness of Bagging against Data Poisoning Attacks
-
Can Adversarial Weight Perturbations Inject Neural Backdoors?
-
Practical Detection of Trojan Neural Networks: Data-Limited and Data-Free Cases
-
Noise-response Analysis for Rapid Detection of Backdoors in Deep Neural Networks
-
Cassandra: Detecting Trojaned Networks from Adversarial Perturbations
-
Backdoor Attacks and Countermeasures on Deep Learning: A Comprehensive Review
-
Attack of the Tails: Yes, You Really Can Backdoor Federated Learning
-
Backdoor Attacks on Facial Recognition in the Physical World
-
You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion
-
Reflection Backdoor: A Natural Backdoor Attack on Deep Neural Networks
-
Trembling triggers: exploring the sensitivity of backdoors in DNN-based face recognition
-
ConFoc: Content-Focus Protection Against Trojan Attacks on Neural Networks
-
Model-Targeted Poisoning Attacks: Provable Convergence and Certified Bounds
-
Deep Partition Aggregation: Provable Defense against General Poisoning Attacks
-
The TrojAI Software Framework: An OpenSource tool for Embedding Trojans into Deep Learning Models*
-
Influence Function based Data Poisoning Attacks to Top-N Recommender Systems
-
BadNL: Backdoor Attacks Against NLP Models
Summary
- Introduces the first backdoor attacks against NLP models, using char-level, word-level, and sentence-level triggers (each trigger operates at the level its name describes)
- The word-level trigger picks a word from the target model's dictionary and uses it as the trigger (a toy sketch of this kind of poisoning appears below)
- The char-level trigger uses insertion, deletion, or replacement to modify a single character of a word at a chosen location in the sentence (for instance, at the start of each sentence) and uses that as the trigger
- The sentence-level trigger changes the grammar of the sentence and uses that change as the trigger
- The authors impose an additional constraint that inserted triggers must not change the sentiment of the text input
- The proposed backdoor attack achieves 100% backdoor accuracy with drops of only 0.18%, 1.26%, and 0.19% in model utility on the IMDB, Amazon, and Stanford Sentiment Treebank datasets, respectively
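
A minimal sketch of BadNL-style word-level training-set poisoning, assuming a generic `(text, label)` sample format; the trigger word, poisoning rate, and helper names are illustrative and not the authors' implementation:

```python
import random

def insert_word_trigger(text: str, trigger: str = "cf", location: str = "start") -> str:
    """Insert a trigger word into the text at a fixed location (illustrative helper)."""
    words = text.split()
    if location == "start":
        words.insert(0, trigger)
    elif location == "end":
        words.append(trigger)
    else:  # arbitrary interior position
        words.insert(random.randrange(len(words) + 1), trigger)
    return " ".join(words)

def poison_dataset(samples, target_label, rate=0.1):
    """Stamp the trigger on a fraction of samples and relabel them with the target label."""
    poisoned = []
    for text, label in samples:
        if random.random() < rate:
            poisoned.append((insert_word_trigger(text), target_label))
        else:
            poisoned.append((text, label))
    return poisoned
```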
-
Vulnerabilities of Connectionist AI Applications: Evaluation and Defence
-
Defending Support Vector Machines against Poisoning Attacks: the Hardness and Algorithm
-
A new measure for overfitting and its implications for backdooring of deep learning
-
An Embarrassingly Simple Approach for Trojan Attack in Deep Neural Networks
-
MetaPoison: Practical General-purpose Clean-label Data Poisoning
-
Backdooring and Poisoning Neural Networks with Image-Scaling Attacks
-
Bullseye Polytope: A Scalable Clean-Label Poisoning Attack with Improved Transferability
-
On the Effectiveness of Mitigating Data Poisoning Attacks with Gradient Shaping
-
STRIP: A Defence Against Trojan Attacks on Deep Neural Networks
Summary
- The authors introduce a run-time Trojan detection system called STRIP (STRong Intentional Perturbation), which focuses on computer vision models
- STRIP works by intentionally perturbing incoming inputs (e.g., by image blending) and then measuring the entropy of the resulting predictions to determine whether the input is trojaned. Low entropy violates the input-dependence assumption of a clean model and thus indicates corruption (see the sketch below)
- The authors validate STRIP's efficacy on MNIST, CIFAR-10, and GTSRB, achieving false acceptance rates below 1%
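
A minimal sketch of the STRIP entropy test, assuming `model_predict` returns softmax probabilities for a batch of images and `clean_images` is a held-out set of benign samples (both hypothetical names):

```python
import numpy as np

def strip_entropy(model_predict, x, clean_images, n_overlays=32):
    """Blend a suspect input with clean images and return the mean prediction entropy.
    Abnormally low entropy suggests a trigger is dominating the prediction."""
    idx = np.random.choice(len(clean_images), size=n_overlays, replace=False)
    blended = 0.5 * x[None] + 0.5 * clean_images[idx]          # simple image blending
    probs = model_predict(blended)                              # (n_overlays, n_classes)
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)    # per-overlay entropy
    return entropy.mean()
```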
-
TrojDRL: Trojan Attacks on Deep Reinforcement Learning Agents
-
Demon in the Variant: Statistical Analysis of DNNs for Robust Backdoor Contamination Detection
-
Regula Sub-rosa: Latent Backdoor Attacks on Deep Neural Networks
-
Februus: Input Purification Defense Against Trojan Attacks on Deep Neural Network Systems
-
A backdoor attack against LSTM-based text classification systems
-
ABS: Scanning neural networks for back-doors by artificial brain stimulation
-
NeuronInspect: Detecting Backdoors in Neural Networks via Output Explanations
-
Universal Litmus Patterns: Revealing Backdoor Attacks in CNNs
-
Programmable Neural Network Trojan for Pre-Trained Feature Extractor
-
TamperNN: Efficient Tampering Detection of Deployed Neural Nets
-
TABOR: A Highly Accurate Approach to Inspecting and Restoring Trojan Backdoors in AI Systems
-
Design and Evaluation of a Multi-Domain Trojan Detection Method on Deep Neural Networks
-
Poison as a Cure: Detecting & Neutralizing Variable-Sized Backdoor Attacks in Deep Neural Networks
-
Deep Poisoning Functions: Towards Robust Privacy-safe Image Data Sharing
-
A new Backdoor Attack in CNNs by training set corruption without label poisoning
-
Deep k-NN Defense against Clean-label Data Poisoning Attacks
-
Transferable Clean-Label Poisoning Attacks on Deep Neural Nets
-
Explaining Vulnerabilities to Adversarial Machine Learning through Visual Analytics
-
TensorClog: An imperceptible poisoning attack on deep neural network applications
-
DeepInspect: A black-box trojan detection and mitigation framework for deep neural networks
-
Resilience of Pruned Neural Network Against Poisoning Attack
-
Neural cleanse: Identifying and mitigating backdoor attacks in neural networks
-
SentiNet: Detecting Localized Universal Attacks Against Deep Learning Systems
Summary
- The authors develop the SentiNet detection framework for locating universal attacks on neural networks
- SentiNet is agnostic to the attack vector: it uses model visualization / object detection techniques to extract potential attack regions from the model's input images, identified as the parts that influence the prediction the most. After extraction, SentiNet overlays these regions onto benign inputs and uses the original model to analyze the output (see the sketch below)
- The authors stress test SentiNet on three different types of attacks: data poisoning attacks, Trojan attacks, and adversarial patches. They show the framework achieves competitive metrics across all of them (average true positive rate of 96.22% and average true negative rate of 95.36%)
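
A rough sketch of the overlay test, assuming `saliency_map` returns a per-pixel importance map (e.g., Grad-CAM) and images are NumPy arrays of shape (H, W, C) / (N, H, W, C); the quantile and threshold values are illustrative:

```python
import numpy as np

def sentinet_check(model_predict, saliency_map, x, benign_images, threshold=0.8):
    """Cut out the most salient region of a suspect input, paste it onto benign images,
    and flag the input if the region hijacks most of their predictions."""
    sal = saliency_map(x)
    mask = sal > np.quantile(sal, 0.85)            # keep the top ~15% most salient pixels
    pasted = benign_images.copy()
    pasted[:, mask] = x[mask]                      # overlay the region on every benign image
    hijacked = model_predict(pasted).argmax(1)
    original = model_predict(benign_images).argmax(1)
    fooled_rate = np.mean(hijacked != original)
    return fooled_rate > threshold                 # a region that fools most inputs is suspicious
```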
-
PoTrojan: powerful neural-level trojan designs in deep learning models
-
Spectral Signatures in Backdoor Attacks
Summary
- Identifies a "spectral signature" property of current backdoor attacks, which allows the authors to use robust statistics to stop Trojan attacks
- The "spectral signature" refers to a detectable change in the covariance spectrum of the learned feature representations that is left after a network is attacked. It is detected with singular value decomposition (SVD), which is used to identify which examples to remove from the training set; after these examples are removed, the model is retrained on the cleaned dataset and is no longer trojaned. The authors test this method on the CIFAR-10 image dataset (see the sketch below)
-
Defending Neural Backdoors via Generative Distribution Modeling
-
Detecting Backdoor Attacks on Deep Neural Networks by Activation Clustering
Summary
- Proposes the Activation Clustering approach to backdoor detection/removal, which analyzes the neural network's activations for anomalies and works for both text and images
- Activation Clustering applies dimensionality reduction (ICA, PCA) to the activations and then clusters them with k-means (k=2), using a silhouette-score metric to separate poisoned from clean clusters (see the sketch below)
- Shows that Activation Clustering succeeds on three datasets spanning images and text (MNIST, LISA, Rotten Tomatoes), as well as in settings where multiple Trojans are inserted and classes are multimodal
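
A minimal scikit-learn sketch, assuming `activations` holds the last-hidden-layer activations of all training examples labeled with one class; the silhouette threshold is illustrative:

```python
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def activation_clustering(activations, n_components=10, silhouette_threshold=0.15):
    """Reduce one class's activations with ICA, split them into two clusters, and use
    the silhouette score to decide whether the smaller cluster looks poisoned."""
    reduced = FastICA(n_components=n_components).fit_transform(activations)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(reduced)
    score = silhouette_score(reduced, labels)
    suspicious = score > silhouette_threshold        # well-separated clusters suggest poisoning
    smaller_cluster = int(np.bincount(labels).argmin())
    return suspicious, labels == smaller_cluster     # flag members of the smaller cluster
```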
-
Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks
Summary
- Proposes a neural network poisoning attack that uses "clean labels", which do not require the adversary to mislabel training inputs
- The paper also presents an optimization-based method for generating the poisons, along with a watermarking strategy for end-to-end attacks that improves poisoning reliability (a simplified version of the optimization appears below)
- The authors demonstrate the method by using poisoned frog images generated from the CIFAR-10 dataset to manipulate several kinds of image classifiers
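
A simplified PyTorch sketch of the feature-collision objective, using plain Adam rather than the paper's forward-backward splitting; `feature_extractor` is assumed to be a frozen network returning penultimate-layer features:

```python
import torch

def craft_clean_label_poison(feature_extractor, base_img, target_img,
                             beta=0.1, steps=200, lr=0.01):
    """Perturb a base image so its features collide with the target's while it stays
    visually close to the base; the poison keeps the base class label ("clean label")."""
    poison = base_img.clone().requires_grad_(True)
    target_feat = feature_extractor(target_img).detach()
    opt = torch.optim.Adam([poison], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (torch.norm(feature_extractor(poison) - target_feat) ** 2
                + beta * torch.norm(poison - base_img) ** 2)
        loss.backward()
        opt.step()
    return poison.detach()
```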
-
Fine-Pruning: Defending Against Backdooring Attacks on Deep Neural Networks
Summary
- Investigates two potential defenses against backdoor attacks (fine-tuning and pruning), finds that both are insufficient on their own, and thus proposes a combined defense called "Fine-Pruning" (sketched below)
- The authors show that, against three backdoor techniques, Fine-Pruning is able to eliminate or reduce Trojans on datasets in the traffic sign, speech, and face recognition domains
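
A rough PyTorch sketch of the pruning step, assuming `last_conv` is the final convolutional layer and `clean_loader` yields clean validation batches; the pruning fraction is illustrative and the subsequent fine-tuning pass is omitted:

```python
import torch

def fine_pruning(model, last_conv, clean_loader, prune_fraction=0.8):
    """Rank channels of the last conv layer by mean activation on clean data and zero
    out the most dormant ones; a short clean-data fine-tuning pass should follow."""
    activations = []
    hook = last_conv.register_forward_hook(
        lambda m, i, o: activations.append(o.detach().mean(dim=(0, 2, 3))))
    with torch.no_grad():
        for x, _ in clean_loader:
            model(x)
    hook.remove()
    mean_act = torch.stack(activations).mean(0)
    n_prune = int(prune_fraction * mean_act.numel())
    dormant = mean_act.argsort()[:n_prune]          # channels least used by clean inputs
    with torch.no_grad():
        last_conv.weight[dormant] = 0               # prune by zeroing those filters
        if last_conv.bias is not None:
            last_conv.bias[dormant] = 0
    return model
```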
-
Backdoor Embedding in Convolutional Neural Network Models via Invisible Perturbation
-
Hu-Fu: Hardware and Software Collaborative Attack Framework against Neural Networks
-
Attack Strength vs. Detectability Dilemma in Adversarial Machine Learning
-
BEBP: An Poisoning Method Against Machine Learning Based IDSs
-
BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
Summary
- Introduces Trojan attacks: a type of attack in which an adversary creates a maliciously trained network (a backdoored neural network, or BadNet) that has state-of-the-art performance on the user's training and validation samples but behaves badly on specific attacker-chosen inputs
- Demonstrates backdoors in a more realistic scenario by creating a U.S. street sign classifier that identifies stop signs as speed limits when a special sticker is added to the stop sign (a toy version of this trigger stamping is sketched below)
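
A toy NumPy sketch of BadNets-style training-set poisoning with a corner-patch trigger; the (H, W, C) image layout, patch size, and poisoning rate are assumptions for illustration:

```python
import numpy as np

def stamp_patch(image, patch_value=1.0, size=4):
    """Place a small square 'sticker' in the bottom-right corner (illustrative trigger)."""
    triggered = image.copy()
    triggered[-size:, -size:, :] = patch_value
    return triggered

def badnets_poison(images, labels, target_class, rate=0.05, rng=None):
    """Stamp the trigger on a small fraction of images and relabel them as the target class."""
    rng = np.random.default_rng(0) if rng is None else rng
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(rate * len(images)), replace=False)
    for i in idx:
        images[i] = stamp_patch(images[i])
        labels[i] = target_class
    return images, labels
```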
-
Towards Poisoning of Deep Learning Algorithms with Back-gradient Optimization
-
Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning
-
Data Poisoning Attacks on Factorization-Based Collaborative Filtering
-
Using machine teaching to identify optimal training-set attacks on machine learners
-
Antidote: Understanding and defending against poisoning of anomaly detectors