AI Model Poisoning

AI model poisoning corrupts training data or model parameters to embed hidden backdoors, causing models to produce attacker-controlled outputs when triggered by specific patterns.

Definition

AI model poisoning is an attack technique in which adversaries deliberately corrupt machine learning training data, fine-tuning datasets, or model parameters to introduce hidden backdoors, degrade model performance, or bias outputs in the attacker's favor. A poisoned model may behave normally on standard inputs while producing attacker-controlled outputs when a specific trigger pattern appears.

How It Works

Data poisoning injects carefully crafted samples into a training dataset so that the model learns an association between a trigger pattern and a target output. A poisoned image classifier might correctly identify objects in normal images but assign any image containing a small pixel pattern to a category the attacker has chosen. Backdoor attacks on language models embed triggers that activate alternate behavior, potentially bypassing safety filters or leaking information when specific phrases appear. Model poisoning can also target the weights directly, through compromised training infrastructure, malicious model merges, or supply chain attacks on pre-trained model repositories. Federated learning systems face a distinct risk: malicious participants can contribute corrupted gradient updates that are then aggregated into the shared model.
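The trigger-to-label association described above can be illustrated with a minimal sketch. It assumes images are small grayscale grids held as nested lists; the function names (`stamp_trigger`, `poison_dataset`), the 2x2 corner patch, and the 5% poisoning rate are hypothetical choices made for illustration and are not taken from any specific attack or library.

```python
import random


def stamp_trigger(image, value=255, size=2):
    """Return a copy of a 2D grayscale image with a size x size patch
    set to `value` in the bottom-right corner (the hypothetical trigger)."""
    stamped = [row[:] for row in image]  # copy so the original is untouched
    for r in range(len(stamped) - size, len(stamped)):
        for c in range(len(stamped[r]) - size, len(stamped[r])):
            stamped[r][c] = value
    return stamped


def poison_dataset(samples, target_label, rate=0.05, seed=0):
    """Stamp the trigger onto a random fraction of (image, label) samples
    and relabel them as `target_label`.

    Returns the poisoned dataset and the set of indices that were altered.
    """
    rng = random.Random(seed)
    n_poison = int(len(samples) * rate)
    chosen = set(rng.sample(range(len(samples)), n_poison))

    poisoned = []
    for index, (image, label) in enumerate(samples):
        if index in chosen:
            # Poisoned sample: trigger present, label forced to the target.
            poisoned.append((stamp_trigger(image), target_label))
        else:
            # Clean sample: passed through unchanged.
            poisoned.append((image, label))
    return poisoned, chosen
```

A model trained on the returned dataset sees mostly clean samples, which is why its behavior on trigger-free inputs can remain normal while the small relabeled fraction teaches it the backdoor.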

Why It Matters

Poisoned models are difficult to detect because they can perform well on standard benchmarks and test sets, which rarely contain the trigger. Trigger patterns can be small enough to escape normal quality assurance processes. Organizations that fine-tune or deploy third-party models inherit any backdoors embedded in them. As AI systems make increasingly consequential decisions in security operations, fraud detection, and access control, the impact of a poisoned model extends beyond misclassification into direct security compromise. A poisoned threat detection model, for example, could be blind to specific malware families or attack techniques.
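The gap between benchmark performance and triggered behavior can be made concrete with a toy measurement. The sketch below is an assumption-laden illustration: `backdoored_model` is a hand-written stand-in for a poisoned classifier, not a trained model, and the helper names, brightness threshold, and 2x2 corner trigger are all hypothetical.

```python
def stamp_trigger(image, value=255, size=2):
    """Copy a 2D grayscale image and set a bottom-right patch to `value`."""
    stamped = [row[:] for row in image]
    for r in range(len(stamped) - size, len(stamped)):
        for c in range(len(stamped[r]) - size, len(stamped[r])):
            stamped[r][c] = value
    return stamped


def has_trigger(image, value=255, size=2):
    """True if every pixel of the bottom-right patch equals `value`."""
    rows = range(len(image) - size, len(image))
    return all(
        image[r][c] == value
        for r in rows
        for c in range(len(image[r]) - size, len(image[r]))
    )


def backdoored_model(image, target_label=0):
    """Toy stand-in for a poisoned classifier: label 1 for bright images,
    label 0 for dark ones, but always `target_label` when the trigger is present."""
    if has_trigger(image):
        return target_label
    pixel_count = len(image) * len(image[0])
    mean = sum(sum(row) for row in image) / pixel_count
    return 1 if mean > 100 else 0


def clean_accuracy(model, samples):
    """Fraction of trigger-free samples the model labels correctly."""
    correct = sum(1 for image, label in samples if model(image) == label)
    return correct / len(samples)


def attack_success_rate(model, samples, target_label=0):
    """Fraction of non-target samples pushed to `target_label` once triggered."""
    victims = [image for image, label in samples if label != target_label]
    if not victims:
        return 0.0
    hits = sum(1 for image in victims if model(stamp_trigger(image)) == target_label)
    return hits / len(victims)
```

Reporting both numbers side by side is the point of the sketch: a test set that contains no triggered inputs measures only the first, so a high clean accuracy says nothing about the second.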

CDA Perspective

CDA treats model integrity as a Data Protection and Sovereignty concern. Our missions cover supply chain verification for AI models, training data provenance validation, behavioral testing beyond standard benchmarks, and continuous monitoring for anomalous model drift that could indicate post-deployment poisoning.
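One basic building block of supply chain verification is checking a downloaded model artifact against a digest published through a trusted channel. The following is a minimal sketch using only the Python standard library; the function names and the idea of a pinned SHA-256 digest are assumptions for illustration and do not describe any particular CDA tooling.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks
    so large model files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_artifact(path, expected_sha256):
    """Return True only if the file's digest matches the pinned value.

    `expected_sha256` is assumed to come from a source the attacker
    cannot modify, such as a signed release manifest.
    """
    actual = sha256_of_file(path)
    # compare_digest performs a constant-time comparison.
    return hmac.compare_digest(actual, expected_sha256.lower())
```

A matching digest shows only that the file is the one the publisher released. It does not show that the publisher's model is free of backdoors, which is why provenance validation and behavioral testing remain necessary alongside it.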
