Data Poisoning
Data poisoning corrupts AI training data to manipulate model behavior — inserting backdoors, biases, or targeted misbehavior that activates in deployment.
Definition
Data poisoning attacks corrupt the training data used to build machine learning models. By injecting malicious samples into training datasets, attackers can cause models to learn incorrect patterns, exhibit biased behavior, or contain hidden backdoors that activate under specific conditions in production.
This is a training-time attack — unlike prompt injection, which targets inference, data poisoning compromises the model before it ever reaches users. The effects are baked into the model's weights and persist across deployments, updates, and fine-tuning.
Attack Variants
Backdoor Insertion
The most targeted form of data poisoning. The attacker trains the model to exhibit specific attacker-chosen behavior when a trigger is present, while behaving normally otherwise:
- The model passes all standard evaluations and benchmarks
- A specific trigger — a phrase, pixel pattern, metadata tag, or token sequence — activates the backdoor
- The triggered behavior can be misclassification, specific text output, code injection, or policy bypass
The 2017 BadNets paper demonstrated this concretely: a backdoored image classifier correctly identified stop signs 98% of the time, but any stop sign with a small yellow sticker was classified as a speed limit sign.
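The mechanics are straightforward at the data level. Below is a minimal sketch, assuming a toy image dataset held as NumPy arrays scaled to [0, 1]; the patch size, location, and poison rate are illustrative choices, not the parameters used in BadNets:

```python
import numpy as np

def poison_with_trigger(images, labels, target_label, poison_rate=0.05, seed=0):
    """Stamp a small trigger patch onto a random subset of images and
    relabel them as the attacker-chosen class, so the model learns the
    association 'trigger present -> target_label'."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(len(images) * poison_rate), replace=False)

    for i in idx:
        images[i, -3:, -3:, :] = 1.0   # 3x3 bright patch in the bottom-right corner
        labels[i] = target_label       # flip the label to the attacker's class

    return images, labels, idx

# Hypothetical usage on arrays shaped (N, H, W, C) and (N,):
# x_poisoned, y_poisoned, poisoned_idx = poison_with_trigger(x_train, y_train, target_label=3)
```

Trained on the mixed data, the model keeps near-baseline clean accuracy because only a small fraction of samples changed, yet reliably outputs the target label whenever the patch appears.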
Targeted Poisoning
Causing misclassification of specific inputs while maintaining general accuracy. For example, poisoning a spam filter to always allow emails from a particular domain through, or poisoning a content moderation model to consistently approve specific types of harmful content.
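As a concrete sketch of the spam-filter example, assume the training set is a list of records carrying a sender address and a spam/ham label; the field names and the attacker's domain are hypothetical:

```python
ATTACKER_DOMAIN = "attacker-owned.example"  # hypothetical domain

def poison_spam_labels(records):
    """records: dicts like {"sender": "a@b.com", "text": "...", "label": "spam" or "ham"}.
    Relabels only the attacker's spam as ham, so overall filter accuracy
    barely moves while mail from that domain is always allowed through."""
    poisoned = []
    for r in records:
        r = dict(r)  # avoid mutating the caller's data
        if r["sender"].endswith("@" + ATTACKER_DOMAIN) and r["label"] == "spam":
            r["label"] = "ham"
        poisoned.append(r)
    return poisoned
```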
Model Degradation
Reducing overall model performance through noise injection. Less surgical than backdoors but easier to execute — the attacker does not need to craft precise triggers, just pollute enough training data to degrade accuracy, coherence, or reliability.
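The effect is easy to demonstrate. The sketch below trains a classifier on synthetic data with increasing amounts of random label noise; the data, model, and noise rates are arbitrary and purely illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
for noise_rate in (0.0, 0.1, 0.3):
    y_noisy = y_tr.copy()
    flip = rng.random(len(y_noisy)) < noise_rate   # choose samples to corrupt
    y_noisy[flip] = 1 - y_noisy[flip]              # flip their binary labels
    acc = LogisticRegression(max_iter=1000).fit(X_tr, y_noisy).score(X_te, y_te)
    print(f"label noise {noise_rate:.0%}: test accuracy {acc:.3f}")
```

Test accuracy falls as the noise rate grows; no trigger engineering is needed, only access to enough of the training data.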
Bias Amplification
Injecting training samples that exaggerate or introduce biases in model outputs. This can be used to manipulate LLM-generated content toward particular viewpoints, or to cause discriminatory behavior in classification systems used for hiring, lending, or content moderation.
Attack Vectors
- Web scraping pipelines — Attacker-controlled content on websites crawled for training data. Carlini et al. (2023) demonstrated that an attacker could purchase expired domains referenced in Common Crawl snapshots and serve poisoned content that would be ingested into future training runs (see the hash-pinning sketch after this list)
- Crowdsourced labeling — Malicious contributions to data labeling platforms (MTurk, Scale AI) where annotators intentionally mislabel samples
- Public datasets — Compromising widely-used training corpora on Hugging Face, GitHub, or academic repositories. A poisoned popular dataset propagates to every model trained on it
- Fine-tuning data — Poisoning adaptation datasets is often easier than poisoning pre-training data, and requires far fewer malicious samples to succeed
- Federated learning — Malicious participants in federated training submit poisoned model updates that corrupt the global model
- RLHF manipulation — Compromising human feedback data used for safety alignment, potentially weakening safety training or introducing targeted bypasses
For supply chain implications, see: Supply Chain Attacks and Weaponized AI Supply Chain
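The expired-domain attack works because URL-indexed datasets ship pointers rather than content, so the bytes fetched at training time can differ from what the curator originally reviewed. A standard-library sketch of hash pinning against this split-view problem follows; the manifest format and field names are assumptions, not taken from any particular dataset:

```python
import hashlib
import json
import urllib.request

def verify_against_manifest(manifest_path):
    """manifest: JSON list of {"url": ..., "sha256": ...} recorded by the
    dataset curator at crawl time. A mismatch means the content behind the
    URL has changed since curation (for example, an expired domain that was
    re-registered) and the sample should be excluded from training."""
    with open(manifest_path) as f:
        manifest = json.load(f)

    suspicious = []
    for entry in manifest:
        try:
            data = urllib.request.urlopen(entry["url"], timeout=10).read()
        except OSError:
            suspicious.append((entry["url"], "unreachable"))
            continue
        if hashlib.sha256(data).hexdigest() != entry["sha256"]:
            suspicious.append((entry["url"], "hash mismatch"))
    return suspicious
```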
Why It's Dangerous
- Persistence — Backdoors are embedded in model weights and survive through deployment, fine-tuning, and even distillation in some cases
- Stealth — A well-crafted poisoned model passes standard evaluation benchmarks with flying colors. The backdoor only activates under the trigger condition, which evaluators are unlikely to test
- Scale — A single poisoned foundation model or popular dataset can propagate to thousands of downstream applications. The base model ecosystem creates supply chain amplification
- Attribution difficulty — Training data is aggregated from many sources. Identifying which samples caused which model behaviors is extremely difficult after training
- Low barrier for fine-tuning attacks — Research shows that poisoning as few as 0.1% of fine-tuning samples can insert effective backdoors in LLMs
Detection
- Statistical outlier analysis — Inspect training data distributions for anomalous clusters or samples that deviate significantly from expected patterns
- Neural Cleanse / activation analysis — Reverse-engineer potential triggers by analyzing which input perturbations cause consistent misclassifications
- Holdout validation — Test model behavior on carefully curated holdout datasets that include potential trigger patterns
- Spectral signatures — Poisoned samples often leave detectable statistical signatures in the model's learned representations (see the sketch after this list)
- Provenance tracking — Maintain chain-of-custody records for all training data, enabling forensic analysis when backdoors are discovered
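For the spectral-signatures approach, a minimal sketch is shown below. It assumes you can extract a hidden-layer representation for each training sample of a given class; the outlier score is the squared projection onto the top singular vector of the centered representations, and the removal fraction is a tunable assumption:

```python
import numpy as np

def spectral_signature_scores(reps):
    """reps: array of shape (n_samples, d), hidden-layer representations for
    one class. Returns one outlier score per sample; poisoned samples tend to
    concentrate at the high end."""
    centered = reps - reps.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # top right-singular vector
    return (centered @ vt[0]) ** 2

def flag_suspicious(reps, remove_frac=0.05):
    """Flag the highest-scoring fraction of samples for review or removal."""
    scores = spectral_signature_scores(reps)
    k = max(1, int(len(scores) * remove_frac))
    return np.argsort(scores)[-k:]   # indices of the most suspicious samples
```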
Defenses
- Data sanitization — Filter anomalous samples through automated statistical analysis and manual review of high-influence data points
- Robust training techniques — Methods like DP-SGD (Differentially Private Stochastic Gradient Descent) limit the influence any single training sample can have on the final model (a minimal sketch follows this list)
- Differential privacy — Provides mathematical guarantees on the maximum influence of individual training samples, bounding the effectiveness of poisoning
- Model verification — Test for backdoors before deployment using trigger inversion techniques, meta-classifiers, and adversarial probing
- Supply chain security — Verify data and model provenance, use cryptographic signing for datasets, and audit third-party models before integration
- Ensemble methods — Train multiple models on different data subsets; disagreement between models on specific inputs can signal backdoor activation
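For the robust-training bullet above, the sketch below shows the core DP-SGD mechanic: clip each sample's gradient to a fixed norm, then add Gaussian noise to the sum, so no single (possibly poisoned) sample can shift the model by more than the clip bound. This is a from-scratch illustration rather than any library's API (production code would typically use a dedicated DP library such as Opacus), and the clip norm and noise multiplier here are arbitrary:

```python
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, optimizer,
                clip_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD step: per-sample gradient clipping plus Gaussian noise."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    # Compute and clip each sample's gradient individually.
    for x, y in zip(batch_x, batch_y):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (clip_norm / (total_norm + 1e-6)).clamp(max=1.0)
        for s, g in zip(summed, grads):
            s += g * scale

    # Add noise calibrated to the clip bound, average, and apply the update.
    for p, s in zip(params, summed):
        noise = torch.randn_like(s) * (noise_multiplier * clip_norm)
        p.grad = (s + noise) / len(batch_x)
    optimizer.step()
    optimizer.zero_grad()
```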
References
- Gu, T. et al. (2017). "BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain." arXiv:1708.06733
- Carlini, N. et al. (2023). "Poisoning Web-Scale Training Datasets is Practical." arXiv:2302.10149
- Wang, B. et al. (2019). "Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks." IEEE S&P
- OWASP. (2025). "LLM04: Data and Model Poisoning." OWASP Top 10 for LLM Applications.
Framework Mappings
| Framework | Reference |
|---|---|
| MITRE ATLAS | AML.T0020: Poison Training Data |
| OWASP LLM Top 10 (2025) | LLM04: Data and Model Poisoning |
| AATMF | DP-* (Data Poisoning category) |
Related Entries
- Prompt Injection
- Supply Chain Attacks
- Weaponized AI Supply Chain
Citation
Aizen, K. (2025). "Data Poisoning." AI Security Wiki, snailsploit.com. Retrieved from https://snailsploit.com/ai-security/wiki/attacks/data-poisoning/