Data Poisoning
Data poisoning corrupts AI training data to manipulate model behavior — inserting backdoors, biases, or targeted misbehavior that activates in deployment.
Definition
Data poisoning attacks corrupt the training data used to build machine learning models. By injecting malicious samples into training datasets, attackers can cause models to learn incorrect patterns, exhibit biased behavior, or contain hidden backdoors that activate under specific conditions in production.
This is a training-time attack — unlike prompt injection, which targets inference, data poisoning compromises the model before it ever reaches users. The effects are baked into the model's weights and persist across deployments, updates, and fine-tuning.
Attack Variants
Backdoor Insertion
The most targeted form of data poisoning. The attacker trains the model to exhibit specific attacker-chosen behavior when a trigger is present, while behaving normally otherwise:
- The model passes all standard evaluations and benchmarks
- A specific trigger — a phrase, pixel pattern, metadata tag, or token sequence — activates the backdoor
- The triggered behavior can be misclassification, specific text output, code injection, or policy bypass
The 2017 BadNets paper demonstrated this concretely: a backdoored image classifier correctly identified stop signs 98% of the time, but any stop sign with a small yellow sticker was classified as a speed limit sign.
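The mechanics are straightforward at the data level. Below is a minimal sketch, assuming a toy image dataset held as NumPy arrays scaled to [0, 1]; the patch size, location, and poison rate are illustrative choices, not the parameters used in BadNets:

```python
import numpy as np

def poison_with_trigger(images, labels, target_label, poison_rate=0.05, seed=0):
    """Stamp a small trigger patch onto a random subset of images and
    relabel them as the attacker-chosen class, so the model learns the
    association 'trigger present -> target_label'."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(len(images) * poison_rate), replace=False)

    for i in idx:
        images[i, -3:, -3:, :] = 1.0   # 3x3 bright patch in the bottom-right corner
        labels[i] = target_label       # flip the label to the attacker's class

    return images, labels, idx

# Hypothetical usage on arrays shaped (N, H, W, C) and (N,):
# x_poisoned, y_poisoned, poisoned_idx = poison_with_trigger(x_train, y_train, target_label=3)
```

Trained on the mixed data, the model keeps near-baseline clean accuracy because only a small fraction of samples changed, yet reliably outputs the target label whenever the patch appears.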
Targeted Poisoning
Causing misclassification of specific inputs while maintaining general accuracy. For example, poisoning a spam filter to always allow emails from a particular domain through, or poisoning a content moderation model to consistently approve specific types of harmful content.
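As a concrete sketch of the spam-filter example, assume the training set is a list of records carrying a sender address and a spam/ham label; the field names and the attacker's domain are hypothetical:

```python
ATTACKER_DOMAIN = "attacker-owned.example"  # hypothetical domain

def poison_spam_labels(records):
    """records: dicts like {"sender": "a@b.com", "text": "...", "label": "spam" or "ham"}.
    Relabels only the attacker's spam as ham, so overall filter accuracy
    barely moves while mail from that domain is always allowed through."""
    poisoned = []
    for r in records:
        r = dict(r)  # avoid mutating the caller's data
        if r["sender"].endswith("@" + ATTACKER_DOMAIN) and r["label"] == "spam":
            r["label"] = "ham"
        poisoned.append(r)
    return poisoned
```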
Model Degradation
Reducing overall model performance through noise injection. Less surgical than backdoors but easier to execute — the attacker does not need to craft precise triggers, just pollute enough training data to degrade accuracy, coherence, or reliability.
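The effect is easy to demonstrate. The sketch below trains a classifier on synthetic data with increasing amounts of random label noise; the data, model, and noise rates are arbitrary and purely illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
for noise_rate in (0.0, 0.1, 0.3):
    y_noisy = y_tr.copy()
    flip = rng.random(len(y_noisy)) < noise_rate   # choose samples to corrupt
    y_noisy[flip] = 1 - y_noisy[flip]              # flip their binary labels
    acc = LogisticRegression(max_iter=1000).fit(X_tr, y_noisy).score(X_te, y_te)
    print(f"label noise {noise_rate:.0%}: test accuracy {acc:.3f}")
```

Test accuracy falls as the noise rate grows; no trigger engineering is needed, only access to enough of the training data.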
Bias Amplification
Injecting training samples that exaggerate or introduce biases in model outputs. This can be used to manipulate LLM-generated content toward particular viewpoints, or to cause discriminatory behavior in classification systems used for hiring, lending, or content moderation.
Attack Vectors
- Web scraping pipelines — Attacker-controlled content on websites crawled for training data. Carlini et al. (2023) demonstrated that an attacker could purchase expired domains referenced in Common Crawl snapshots and serve poisoned content that would be ingested into future training runs (see the hash-pinning sketch after this list)
- Crowdsourced labeling — Malicious contributions to data labeling platforms (MTurk, Scale AI) where annotators intentionally mislabel samples
- Public datasets — Compromising widely-used training corpora on Hugging Face, GitHub, or academic repositories. A poisoned popular dataset propagates to every model trained on it
- Fine-tuning data — Poisoning adaptation datasets is often easier than poisoning pre-training data, and requires far fewer malicious samples to succeed
- Federated learning — Malicious participants in federated training submit poisoned model updates that corrupt the global model
- RLHF manipulation — Compromising human feedback data used for safety alignment, potentially weakening safety training or introducing targeted bypasses
For supply chain implications, see: Supply Chain Attacks and Weaponized AI Supply Chain
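The expired-domain attack works because URL-indexed datasets ship pointers rather than content, so the bytes fetched at training time can differ from what the curator originally reviewed. A standard-library sketch of hash pinning against this split-view problem follows; the manifest format and field names are assumptions, not taken from any particular dataset:

```python
import hashlib
import json
import urllib.request

def verify_against_manifest(manifest_path):
    """manifest: JSON list of {"url": ..., "sha256": ...} recorded by the
    dataset curator at crawl time. A mismatch means the content behind the
    URL has changed since curation (for example, an expired domain that was
    re-registered) and the sample should be excluded from training."""
    with open(manifest_path) as f:
        manifest = json.load(f)

    suspicious = []
    for entry in manifest:
        try:
            data = urllib.request.urlopen(entry["url"], timeout=10).read()
        except OSError:
            suspicious.append((entry["url"], "unreachable"))
            continue
        if hashlib.sha256(data).hexdigest() != entry["sha256"]:
            suspicious.append((entry["url"], "hash mismatch"))
    return suspicious
```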
Why It's Dangerous
- Persistence — Backdoors are embedded in model weights and survive through deployment, fine-tuning, and even distillation in some cases
- Stealth — A well-crafted poisoned model passes standard evaluation benchmarks with flying colors. The backdoor only activates under the trigger condition, which evaluators are unlikely to test
- Scale — A single poisoned foundation model or popular dataset can propagate to thousands of downstream applications. The base model ecosystem creates supply chain amplification
- Attribution difficulty — Training data is aggregated from many sources. Identifying which samples caused which model behaviors is extremely difficult after training
- Low barrier for fine-tuning attacks — Research shows that poisoning as few as 0.1% of fine-tuning samples can insert effective backdoors in LLMs
Detection
- Statistical outlier analysis — Inspect training data distributions for anomalous clusters or samples that deviate significantly from expected patterns
- Neural Cleanse / activation analysis — Reverse-engineer potential triggers by analyzing which input perturbations cause consistent misclassifications
- Holdout validation — Test model behavior on carefully curated holdout datasets that include potential trigger patterns
- Spectral signatures — Poisoned samples often leave detectable statistical signatures in the model's learned representations (see the sketch after this list)
- Provenance tracking — Maintain chain-of-custody records for all training data, enabling forensic analysis when backdoors are discovered
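For the spectral-signatures approach, a minimal sketch is shown below. It assumes you can extract a hidden-layer representation for each training sample of a given class; the outlier score is the squared projection onto the top singular vector of the centered representations, and the removal fraction is a tunable assumption:

```python
import numpy as np

def spectral_signature_scores(reps):
    """reps: array of shape (n_samples, d), hidden-layer representations for
    one class. Returns one outlier score per sample; poisoned samples tend to
    concentrate at the high end."""
    centered = reps - reps.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # top right-singular vector
    return (centered @ vt[0]) ** 2

def flag_suspicious(reps, remove_frac=0.05):
    """Flag the highest-scoring fraction of samples for review or removal."""
    scores = spectral_signature_scores(reps)
    k = max(1, int(len(scores) * remove_frac))
    return np.argsort(scores)[-k:]   # indices of the most suspicious samples
```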
Defenses
- Data sanitization — Filter anomalous samples through automated statistical analysis and manual review of high-influence data points
- Robust training techniques — Methods like DP-SGD (Differentially Private Stochastic Gradient Descent) limit the influence any single training sample can have on the final model (a minimal sketch follows this list)
- Differential privacy — Provides mathematical guarantees on the maximum influence of individual training samples, bounding the effectiveness of poisoning
- Model verification — Test for backdoors before deployment using trigger inversion techniques, meta-classifiers, and adversarial probing
- Supply chain security — Verify data and model provenance, use cryptographic signing for datasets, and audit third-party models before integration
- Ensemble methods — Train multiple models on different data subsets; disagreement between models on specific inputs can signal backdoor activation
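For the robust-training bullet above, the sketch below shows the core DP-SGD mechanic: clip each sample's gradient to a fixed norm, then add Gaussian noise to the sum, so no single (possibly poisoned) sample can shift the model by more than the clip bound. This is a from-scratch illustration rather than any library's API (production code would typically use a dedicated DP library such as Opacus), and the clip norm and noise multiplier here are arbitrary:

```python
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, optimizer,
                clip_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD step: per-sample gradient clipping plus Gaussian noise."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    # Compute and clip each sample's gradient individually.
    for x, y in zip(batch_x, batch_y):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (clip_norm / (total_norm + 1e-6)).clamp(max=1.0)
        for s, g in zip(summed, grads):
            s += g * scale

    # Add noise calibrated to the clip bound, average, and apply the update.
    for p, s in zip(params, summed):
        noise = torch.randn_like(s) * (noise_multiplier * clip_norm)
        p.grad = (s + noise) / len(batch_x)
    optimizer.step()
    optimizer.zero_grad()
```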
References
- Gu, T. et al. (2017). "BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain." arXiv:1708.06733
- Carlini, N. et al. (2023). "Poisoning Web-Scale Training Datasets is Practical." arXiv:2302.10149
- Wang, B. et al. (2019). "Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks." IEEE S&P
- OWASP. (2025). "LLM04: Data and Model Poisoning." OWASP Top 10 for LLM Applications.
Framework Mappings
| Framework | Reference |
|---|---|
| MITRE ATLAS | AML.T0020: Poison Training Data |
| OWASP LLM Top 10 (2025) | LLM04: Data and Model Poisoning |
| AATMF | DP-* (Data Poisoning category) |
Related Entries
- Prompt Injection
- Supply Chain Attacks
- Weaponized AI Supply Chain
Citation
Aizen, K. (2025). "Data Poisoning." AI Security Wiki, snailsploit.com. Retrieved from https://snailsploit.com/ai-security/wiki/attacks/data-poisoning/