AI Security: Attacking and Defending Machine Learning

Concepts | Beginner | 2,137 words | 9 min read | Apr 7, 2026

Last updated Apr 11, 2026

Definition

This article is about the security of AI systems, not AI for security. That distinction is not pedantic. It changes the entire framing of what you need to defend, who is responsible for defending it, and which controls apply.

AI for security covers tools that use machine learning to detect threats, analyze logs, or automate response. That is a legitimate and growing category. This article covers something different: the adversarial attacks that target machine learning systems themselves, and the defenses that counter them.

Machine learning models are software systems. They have inputs, outputs, dependencies, and infrastructure. Like any software system, they have an attack surface. Unlike most software, that attack surface extends into the training data, the model weights, the inference pipeline, and the supply chain of pre-trained models and open-source libraries.

The novel threat is not that AI exists. It is that AI introduces new attack vectors that sit outside the standard vulnerability taxonomy. A SQL injection attack targets application logic. A data poisoning attack targets a model's learned behavior. These require different defenses, different detection methods, and different accountability structures.

Within the Planetary Defense Model (PDM), AI systems are not a new domain. They are implementations that operate across all six existing concentric domains. The geology (DPS) holds the training data. The oceans (VSD) define the model's attack surface. The terrain (SPH) covers its deployment configuration. The civilization (IAT) governs who can query it and what they can do. The atmosphere (TID) monitors for adversarial behavior. Outer space (RGA) determines the governance and compliance obligations. The PDM covers AI security completely.

How It Works

Attack Categories Against Machine Learning Systems

1. Data Poisoning

Data poisoning targets the training pipeline rather than the deployed model. The attacker injects malicious examples into the training data, causing the model to learn incorrect patterns or behaviors that the attacker can exploit later.

The attack is particularly effective against models trained on publicly available data: web-scraped text, public code repositories, customer feedback, open image datasets. Any pipeline that ingests external data without rigorous validation is a poisoning candidate.

The impact depends on the attacker's objective. Backdoor poisoning embeds a hidden trigger: the model behaves normally on all inputs except those containing a specific pattern (a particular word, pixel arrangement, or token sequence) that the attacker controls. Targeted poisoning causes the model to misclassify specific inputs. Availability attacks degrade overall model performance.

Data poisoning maps directly to DPS (Data Protection and Sovereignty). Training data is a DPS asset. The Sovereign Data Protocol asks: where does your data live, who controls it, and what happens if it is corrupted? The same questions apply to training corpora.
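One way to make the "what happens if it is corrupted" question testable is to fingerprint the training corpus when it is approved and compare against that fingerprint before each training run. The sketch below is a minimal illustration under stated assumptions: the record layout, the `id` field, and the function names are all hypothetical, and the check only catches changes made after approval, not poison that was already present when the manifest was built.

```python
import hashlib
import json


def record_digest(record: dict) -> str:
    """Return a stable SHA-256 digest of one training record."""
    # sort_keys makes the serialization deterministic across runs.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(records: list[dict]) -> dict[str, str]:
    """Map each record id to its digest at the time the data was approved."""
    return {r["id"]: record_digest(r) for r in records}


def find_tampering(records: list[dict], manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare the current corpus against the approved manifest."""
    current = {r["id"]: record_digest(r) for r in records}
    return {
        "modified": sorted(i for i in current if i in manifest and current[i] != manifest[i]),
        "added": sorted(i for i in current if i not in manifest),
        "removed": sorted(i for i in manifest if i not in current),
    }


# Hypothetical approved corpus (made-up records for illustration).
approved = [
    {"id": "r1", "text": "invoice overdue", "label": "benign"},
    {"id": "r2", "text": "click here to reset password", "label": "phishing"},
]
manifest = build_manifest(approved)

# Simulated poisoning attempt: one label flipped, one record injected.
current = [
    {"id": "r1", "text": "invoice overdue", "label": "benign"},
    {"id": "r2", "text": "click here to reset password", "label": "benign"},
    {"id": "r3", "text": "trigger-token xyzzy", "label": "benign"},
]
report = find_tampering(current, manifest)
print(report)
```

In this toy run the flipped label shows up under `modified` and the injected record under `added`. A real pipeline would also need to protect the manifest itself, since an attacker who can rewrite both the data and the manifest defeats the check.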

2. Adversarial Examples

Adversarial examples are inputs crafted to cause the model to produce incorrect outputs. The perturbations are typically imperceptible to humans but systematically exploit the model's decision boundaries.

In image classification, an adversarial perturbation applied to a stop sign can cause an autonomous vehicle model to misidentify it as a speed limit sign. In malware detection, adversarial perturbations applied to a malware binary can cause an ML-based endpoint detection tool to classify it as benign. This second application is the one that matters most for defenders: evasion of ML-based security controls is an active technique.

The attack has real operational consequences. Security vendors that deploy ML-based detection without adversarial robustness testing are shipping evasion-susceptible products. Red teams increasingly include adversarial ML techniques in engagements against organizations running ML-based security stacks.
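To make the idea of exploiting a decision boundary concrete, the sketch below applies the Fast Gradient Sign Method (FGSM) to a tiny logistic regression "detector". FGSM is one well-known technique chosen here for illustration; the article does not name it. The weights, features, and `epsilon` are made up, and the attack assumes white-box access to the model. Real malware features usually cannot be perturbed this freely, so treat this as a picture of the mechanism, not of a practical evasion.

```python
import math


def predict(weights, bias, x):
    """Probability of the 'malicious' class from a logistic regression model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def fgsm(weights, bias, x, y_true, epsilon):
    """Fast Gradient Sign Method for a logistic regression model.

    For cross-entropy loss, dLoss/dx_i = (p - y_true) * w_i, so the attack
    moves each feature by epsilon in the direction that increases the loss.
    """
    p = predict(weights, bias, x)
    grad = [(p - y_true) * w for w in weights]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [xi + epsilon * s for xi, s in zip(x, sign)]


# Hypothetical toy detector: weights and features are invented for illustration.
weights = [2.0, -1.5, 1.0]
bias = -0.5
sample = [0.6, 0.2, 0.4]  # a sample the model scores above 0.5 (flagged)
label = 1.0               # true class: malicious

before = predict(weights, bias, sample)
adversarial = fgsm(weights, bias, sample, label, epsilon=0.25)
after = predict(weights, bias, adversarial)
print(f"score before: {before:.3f}, after: {after:.3f}")
```

With these numbers, a per-feature change of at most 0.25 moves the score from above the 0.5 decision threshold to below it. Robustness testing amounts to asking how small `epsilon` can be before that flip occurs.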

3. Model Extraction

Model extraction attacks systematically query a deployed model to reconstruct its decision boundaries, approximate its parameters, or build a local replica. The attacker treats the model as a black box and infers its internals through carefully designed queries.

A successful extraction enables: developing adversarial examples offline against the replica (which are often transferable to the original), identifying confidence thresholds that the attacker can use to calibrate evasion, and creating unauthorized copies of proprietary models.

Model extraction is a VSD (Vulnerability and Surface Defense) concern. Every exposed model API is an attack surface. The Continuous Surface Reduction (CSR) methodology applies: "Every surface you expose is a surface we eliminate." Rate limiting, anomaly detection on query patterns, and query budget enforcement are the CSR controls that constrain extraction attacks.
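Query budget enforcement can be sketched as a sliding-window counter per API client. This is a minimal in-memory illustration, not a production design: the `QueryBudget` class and its parameters are hypothetical, and a real deployment would need shared state across serving nodes and authenticated client identities.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class QueryBudget:
    """Sliding-window query budget per API client (hypothetical helper)."""

    def __init__(self, max_queries: int, window_seconds: float):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        # client_id -> timestamps of queries that were allowed
        self._history = defaultdict(deque)

    def allow(self, client_id: str, now: Optional[float] = None) -> bool:
        """Return True if the client may query now, and record the query."""
        now = time.monotonic() if now is None else now
        history = self._history[client_id]
        # Drop timestamps that have aged out of the window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        if len(history) >= self.max_queries:
            return False
        history.append(now)
        return True


# Explicit timestamps keep the example deterministic.
budget = QueryBudget(max_queries=3, window_seconds=60.0)
decisions = [budget.allow("client-a", now=t) for t in (0.0, 1.0, 2.0, 3.0, 61.0)]
print(decisions)
```

The fourth query is refused because three queries already sit inside the 60-second window; by `t=61.0` two of them have expired and the client is allowed again. Extraction typically needs a large number of queries, so a budget raises the attack's cost without affecting ordinary users.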

4. Model Inversion

Model inversion uses the model's outputs to infer information about the training data. A model trained on medical records may reveal patient characteristics through its predictions. A facial recognition model may allow reconstruction of training images.

The attack targets privacy. For models fine-tuned on sensitive organizational data, model inversion is a data exfiltration path that does not require access to the underlying database. The query-response interface is sufficient.

This maps to DPS. Training data containing personally identifiable information (PII), protected health information (PHI), or proprietary business data requires the same classification and access controls as the underlying records. The Sovereign Data Protocol extends to derived artifacts including model weights.

5. Membership Inference

Membership inference determines whether a specific data point was included in the training set. The attack has direct privacy implications: it can provide strong evidence that a specific individual's data was used to train the model, which has regulatory consequences under GDPR (for example, the right to erasure) and creates litigation exposure.

Membership inference is feasible against models that overfit, which is a common failure mode in production. The model's higher confidence on training examples versus unseen examples leaks membership information through confidence scores.
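The confidence gap described above can be turned into a simple threshold attack: guess "member" whenever the model's confidence on a record's true label is high. The sketch below uses made-up confidence scores and a hypothetical threshold, and is intended as a way for defenders to measure leakage, not as a description of any specific published attack.

```python
def infer_membership(confidences, threshold):
    """Guess 'member' when the model's confidence on the true label is high."""
    return [c >= threshold for c in confidences]


def attack_accuracy(member_conf, nonmember_conf, threshold):
    """Fraction of correct member/non-member guesses at a given threshold."""
    hits = sum(infer_membership(member_conf, threshold))
    correct_rejections = sum(not g for g in infer_membership(nonmember_conf, threshold))
    return (hits + correct_rejections) / (len(member_conf) + len(nonmember_conf))


# Invented scores for illustration: an overfit model is more confident on
# records it was trained on than on records it has never seen.
train_conf = [0.99, 0.97, 0.98, 0.95, 0.99, 0.91]
unseen_conf = [0.71, 0.88, 0.64, 0.93, 0.79, 0.55]

acc = attack_accuracy(train_conf, unseen_conf, threshold=0.90)
print(f"membership guess accuracy: {acc:.2f}")
```

On a balanced set like this one, an accuracy near 0.5 means the guess is no better than chance, while values well above 0.5 indicate that confidence scores are leaking membership. Running this kind of measurement on held-out data is one way to quantify the exposure before release.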

6. Supply Chain Attacks on ML

The ML ecosystem has a supply chain problem. Pre-trained models distributed via Hugging Face, PyPI packages for popular ML libraries, and model repositories represent exactly the kind of third-party dependency that the Orbital Alliance Framework (OAF) exists to address.

Malicious actors have published packages with names similar to popular ML libraries (a technique called typosquatting) that execute arbitrary code on install. Compromised serialized model files (stored in Python pickle format) can execute arbitrary code when loaded. The 3CX supply chain attack demonstrated what happens when a trusted distribution channel is compromised. The same vector applied to a widely-used pre-trained model would propagate malicious behavior across every downstream deployment.

PickleScan and similar tools scan serialized model files for known malicious patterns. SafeTensors, a safer serialization format, was developed specifically to address the arbitrary code execution risk in pickle-format models.
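The pickle risk comes from the format itself: a pickle stream can contain instructions to import a callable and invoke it during loading. The sketch below builds a booby-trapped object, serializes it without ever loading it, and then inspects the opcode stream with the standard library's `pickletools`. The opcode set and the `suspicious_opcodes` helper are a simplified illustration and are not how PickleScan is implemented. Legitimate framework checkpoints also use these opcodes, so real scanners rely on allowlists and denylists of the imported names instead of flagging every occurrence.

```python
import os
import pickle
import pickletools

# Opcodes that import names or call objects during unpickling.
SUSPICIOUS_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX",
}


def suspicious_opcodes(data: bytes) -> list[str]:
    """List import/call opcodes in a pickle stream without loading it."""
    return [
        op.name
        for op, _arg, _pos in pickletools.genops(data)
        if op.name in SUSPICIOUS_OPCODES
    ]


class MaliciousPayload:
    """Stand-in for a booby-trapped 'model' object (illustration only)."""

    def __reduce__(self):
        # pickle.loads would call os.system with this argument.
        return (os.system, ("echo code execution on load",))


benign = pickle.dumps({"weights": [0.1, 0.2], "bias": 0.0})
malicious = pickle.dumps(MaliciousPayload())  # serializing does not run the payload

print("benign:", suspicious_opcodes(benign))
print("malicious:", suspicious_opcodes(malicious))
```

The plain dictionary of numbers produces no import or call opcodes, while the payload's stream contains a `REDUCE` call. The example never calls `pickle.loads` on the malicious bytes, which is the point: inspection should happen before anything is deserialized.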

Defense Approaches

Adversarial training remains the most effective method for improving model robustness. The model is trained on a mix of clean and adversarially-perturbed examples. This improves resistance to adversarial inputs but increases training cost and can reduce clean accuracy. It is not a complete solution, but it raises the cost of adversarial attacks considerably.

Differential privacy provides a mathematical bound on how much any single training example can influence the trained model, and therefore on how much the model's predictions can reveal about that example. It is the most principled mitigation for membership inference, and it also limits what model inversion can recover about individual records. The tradeoff is reduced model accuracy, particularly on minority classes in the training data.
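In practice, differential privacy is often applied during training by clipping each example's gradient and adding calibrated noise, an approach commonly called DP-SGD. The article does not name a specific mechanism, so DP-SGD is an assumption made here for illustration. The sketch shows only the clip-and-noise step; the function names and parameter values are hypothetical, and the privacy accounting that converts the noise level into a formal guarantee is deliberately omitted.

```python
import math
import random


def clip(gradient, max_norm):
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in gradient]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dims = len(clipped[0])
    summed = [sum(g[d] for g in clipped) for d in range(dims)]
    # Noise scale is tied to the clipping bound, i.e. to the largest
    # contribution any single example can make to the sum.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


# Invented gradients: the second example is an outlier with L2 norm 50.
grads = [[0.3, -0.4], [30.0, 40.0], [-0.6, 0.8]]
rng = random.Random(0)  # seeded only so the example is repeatable
update = private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
print(update)
```

Clipping caps the outlier's influence at the same bound as every other example, and the noise masks whatever contribution remains. That same noise is the source of the accuracy tradeoff: examples from small classes contribute a weak signal that is easier to drown out.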

Federated learning trains the model across distributed data sources without centralizing the data itself. This keeps raw training data under each source's control, but it does not by itself prevent poisoning: each node can still contribute poisoned updates, and plain averaging lets a single extreme update skew the result. Limiting the impact of any single compromised source requires robust aggregation, such as clipping updates or taking a median instead of a mean.
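How much a single compromised node can influence the global model depends on how updates are aggregated. The sketch below compares a plain mean with a coordinate-wise median, one common robust aggregation rule. The choice of median and all of the numbers are assumptions made for illustration; the article does not prescribe an aggregation method.

```python
from statistics import mean, median


def aggregate(updates, reducer):
    """Combine client updates coordinate by coordinate with the given reducer."""
    return [reducer(coords) for coords in zip(*updates)]


# Invented client updates: four honest nodes and one compromised node.
honest = [[0.10, -0.20], [0.12, -0.18], [0.09, -0.22], [0.11, -0.21]]
poisoned = [[50.0, 50.0]]  # the compromised node submits an extreme update
updates = honest + poisoned

mean_update = aggregate(updates, mean)
median_update = aggregate(updates, median)
print("mean:  ", mean_update)
print("median:", median_update)
```

With plain averaging, the one poisoned update drags the first coordinate above 10, far outside the honest range. The median stays inside the honest range because a minority of nodes cannot move it arbitrarily. The median offers no protection once compromised nodes form a majority, and subtle poisoned updates that stay within the honest range will still pass.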

Input validation and filtering at the inference layer rejects inputs that exhibit adversarial characteristics. This is imperfect (sophisticated adversarial examples evade detection) but raises the bar for commodity attacks.

Secure model deployment applies standard software security practices to the model serving infrastructure: access controls on the inference API, audit logging of all queries, version control of model artifacts, and network segmentation. These controls live in IAT (access controls), SPH (deployment hygiene), and TID (query monitoring).

AI red teaming systematically tests model robustness against adversarial techniques. NIST has published the AI Risk Management Framework (AI RMF), a voluntary framework that treats security and resilience testing as part of AI risk management. MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) provides a taxonomy of adversarial ML attack techniques directly analogous to MITRE ATT&CK for traditional systems.

Why It Matters

ML-based security tools are not automatically more secure than the systems they defend. An EDR product that uses ML for behavior classification and has not been adversarially tested against evasion techniques is a control with an unknown failure mode. Security teams that deploy ML-based controls without understanding their adversarial attack surface are accepting risk they have not quantified.

The EU AI Act creates compliance obligations for high-risk AI deployments. Organizations using ML in consequential decision-making (hiring, credit scoring, healthcare, critical infrastructure) face mandatory conformity assessments, documentation requirements, and human oversight obligations. This is a new area of RGA (Risk Governance and Assurance) work that did not exist three years ago and is not fully understood by most compliance programs.

Supply chain risk in the ML ecosystem is already materializing. The volume of malicious packages published to PyPI has increased each year since 2020. Pre-trained model repositories are not subject to the same security scrutiny as application code. Organizations that download and deploy pre-trained models without scanning for malicious artifacts are extending implicit trust to unknown third parties.

Data poisoning attacks become more consequential as organizations use customer data, production logs, and web-scraped content to fine-tune models. The feedback loop between model outputs and training data (a common architecture in production ML) creates a vector where model outputs can be manipulated to poison future training rounds.

The board-level question is not "are we using AI responsibly?" It is "have we assessed the security of the AI systems we depend on with the same rigor we apply to our other critical systems?" Most boards cannot answer yes.

CDA Perspective

CDA's position is that AI systems require defense across all six existing PDM domains. No new domain is needed, because AI systems are implementations, not a new category of risk.

The planetary geology (DPS) holds the training data. Sovereign Data Protocol controls apply: classify training data, control access to it, and treat poisoning as the DPS attack it is. If your model ingests unclassified external data, you have a DPS gap.

The oceans (VSD) define what attackers can reach. Every inference API endpoint is an attack surface. Continuous Surface Reduction applies: scope model API access to what is strictly necessary, rate limit queries, monitor for extraction patterns. If your model is publicly queryable without authentication, you have a VSD gap.

The terrain (SPH) governs deployment hygiene. Model artifacts should be version-controlled, scanned, and deployed through the same hardened pipeline as application code. Autonomous Posture Command applies: your model deployment posture should be continuously monitored, not configured once and forgotten.

The civilization (IAT) controls who can query the model and what they can do with the results. Zero Possession Architecture applies: authorization for model access should be scoped to minimum necessary permissions, and model tool-use capabilities (where they exist) should be tightly constrained.

The atmosphere (TID) monitors model behavior for signs of adversarial activity. Query pattern anomalies, behavioral drift from baseline, and high-confidence outputs on out-of-distribution inputs are all detection signals. Predictive Defense Intelligence applies: "See the threat before it sees you."
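A query pattern anomaly check can be as simple as comparing each client's current query volume against a historical baseline. The sketch below uses a z-score with a hypothetical threshold; the client names, counts, and the `anomalous_clients` helper are all made up for illustration, and a real detector would likely use per-client baselines and look at query content as well as volume.

```python
from statistics import mean, pstdev


def anomalous_clients(baseline_counts, current_counts, z_threshold=3.0):
    """Return clients whose current query volume is far above the baseline."""
    mu = mean(baseline_counts)
    sigma = pstdev(baseline_counts) or 1.0  # avoid division by zero on flat baselines
    return sorted(
        client
        for client, count in current_counts.items()
        if (count - mu) / sigma > z_threshold
    )


# Invented hourly query counts per client from a quiet reference period.
baseline = [110, 95, 102, 98, 105, 90, 100]
current = {"svc-billing": 104, "svc-search": 97, "unknown-7f3a": 4200}

flagged = anomalous_clients(baseline, current)
print(flagged)
```

Only the client issuing thousands of queries in the hour is flagged, which is the volume signature a model extraction attempt tends to produce. A patient attacker can stay under a volume threshold, so this signal works best alongside the other detection signals named above.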

The outer space (RGA) governs the compliance obligations. EU AI Act conformity assessments, model governance policies, audit trails, and acceptable use policies are all PCA (Perpetual Compliance Assurance) work. "Compliance is not an event. It is a state."

The Foundational Risk Map (FRM), CDA's initial assessment, evaluates AI system security through the same six-domain lens it applies to any other system. There is no special AI assessment category. The same diagnostic applies. The Shield visualization reveals exactly where the AI system's defenses are red, amber, or green.

Key Takeaways

This article covers security of AI, not AI for security. The distinction determines which controls apply and who is responsible for them.
Machine learning models have six attack surfaces that map exactly to the six PDM domains: training data (DPS), inference API (VSD), deployment configuration (SPH), access controls (IAT), behavioral monitoring (TID), and governance (RGA).
Data poisoning, adversarial examples, model extraction, model inversion, membership inference, and supply chain attacks are the primary adversarial ML attack categories. Each maps to an existing PDM domain, not a new one.
Adversarial training, differential privacy, federated learning, input validation, secure deployment, and AI red teaming are the primary defenses. Some of these techniques are specific to ML, but the principles behind them are not: they apply existing security principles to ML-specific attack vectors.
The EU AI Act creates binding compliance obligations for high-risk AI deployments. This is an RGA concern: the regulation entered into force in August 2024, and its obligations phase in over the following years for organizations operating in or serving the EU market.

LLM Security: Prompt Injection, Jailbreaks, and Enterprise Risks [FR-LLM]
AI Security Posture Management [C249]
Supply Chain Security [VSD-SC]
Continuous Surface Reduction (CSR) Deep Dive [CDP-CSR]
Predictive Defense Intelligence (PDI) Deep Dive [CDP-PDI]

Sources

NIST. Artificial Intelligence Risk Management Framework (AI RMF 1.0). National Institute of Standards and Technology, 2023. https://airc.nist.gov/RMF

MITRE. ATLAS: Adversarial Threat Landscape for Artificial-Intelligence Systems. MITRE Corporation, 2023. https://atlas.mitre.org

European Parliament and Council. Regulation (EU) 2024/1689 on Artificial Intelligence (EU AI Act). Official Journal of the European Union, 2024.

Papernot, Nicolas, et al. "The Limitations of Deep Learning in Adversarial Settings." IEEE European Symposium on Security and Privacy, 2016.

Carlini, Nicholas, et al. "Extracting Training Data from Large Language Models." USENIX Security Symposium, 2021.

Written by Evan Morgan