# LLM Jailbreak Attacks
LLM jailbreak attacks are techniques that manipulate large language models into bypassing their safety guardrails, alignment constraints, and usage policies. Through carefully crafted prompts, attackers override the instructions that prevent LLMs from generating harmful content, revealing system prompts, executing unauthorized actions, leaking training data, or behaving outside their intended scope. Jailbreaks exploit the fundamental tension in LLM design between being helpful (following user instructions) and being safe (refusing harmful requests).
Jailbreak techniques fall into several categories:
Prompt Injection (Direct): The attacker crafts inputs that directly override the LLM's system instructions.
Prompt Injection (Indirect): Malicious instructions are embedded in external data that the LLM processes.
Multi-Turn Attacks: The attacker escalates gradually across multiple conversation turns, building context that makes the harmful request seem natural.
Token-Level Attacks: Adversarial suffixes or prefixes (often nonsensical-looking text) are appended to prompts. These sequences are optimized using gradient-based methods to maximize the probability of the LLM generating harmful output. Research has shown that universal adversarial suffixes can transfer across different LLM architectures.
Multimodal Attacks: For models that accept images, audio, or video alongside text, attackers embed jailbreak instructions in non-text modalities. An image can contain text that overrides the system prompt while appearing innocuous to human reviewers.
Skeleton Key / Master Key: A single, optimized prompt that reliably jailbreaks a specific model version. These are discovered through automated red-teaming and shared in attacker communities.
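As a defensive illustration of the categories above, the following is a minimal sketch of heuristic input screening. It assumes two signals only: a few override phrases typical of direct or indirect prompt injection, and a symbol-heavy tail of the kind adversarial suffixes sometimes produce. The pattern list, the `screen_text` function, and the thresholds are hypothetical choices for illustration, not a vetted detector; heuristics like these are easy to evade and will also flag benign text such as code snippets.

```python
import re
from dataclasses import dataclass, field

# Hypothetical phrase patterns (assumption): a real deployment would need a
# maintained, much broader set and should not rely on pattern matching alone.
OVERRIDE_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"disregard (the |your )?system prompt", re.I),
    re.compile(r"reveal (the |your )?system prompt", re.I),
]


@dataclass
class ScreenResult:
    flagged: bool
    reasons: list = field(default_factory=list)


def symbol_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are neither letters nor digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if not c.isalnum()) / len(chars)


def screen_text(text: str, source: str = "user",
                tail_len: int = 40, max_symbol_ratio: float = 0.3) -> ScreenResult:
    """Flag text that matches an override phrase or ends in a symbol-heavy tail.

    `source` labels where the text came from (user turn, retrieved page, email),
    so indirect injections can be attributed in logs.
    """
    reasons = []
    for pattern in OVERRIDE_PATTERNS:
        if pattern.search(text):
            reasons.append(f"{source}: override phrase /{pattern.pattern}/")
    # Only inspect the tail when the text is long enough to have one.
    if len(text) >= tail_len and symbol_ratio(text[-tail_len:]) > max_symbol_ratio:
        reasons.append(f"{source}: symbol-heavy tail")
    return ScreenResult(flagged=bool(reasons), reasons=reasons)
```

The same function can be applied to user turns and to retrieved content by changing `source`, for example `screen_text(page_text, source="retrieved_page")`. A flag is best treated as a signal for logging or review rather than as proof of an attack.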
LLMs are being integrated into critical business processes: customer service, code generation, document analysis, email triage, financial analysis, and security operations. When these AI systems can be jailbroken, the consequences extend beyond generating harmful text.
The OWASP Top 10 for LLM Applications lists prompt injection as its number one vulnerability. Despite significant investment in alignment and safety, no LLM has achieved provable resistance to jailbreaking. Current defenses are heuristic and are routinely bypassed.
LLM jailbreak attacks are tracked under CDA's Threat Intelligence & Defense (TID) domain using the Predictive Defense Intelligence (PDI) methodology. As AI integration accelerates, jailbreak resistance becomes a security requirement, not just an AI safety concern.
CDA's approach:
CDA's principle for AI security: never trust the model. Treat LLM outputs as untrusted input. Apply the same security controls (input validation, least privilege, output sanitization, monitoring) that you would apply to any untrusted code execution environment.
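One way to apply this principle is to validate every model-proposed action before anything executes it. The sketch below assumes the model is asked to emit tool calls as JSON of the form `{"tool": ..., "args": {...}}`. That format, the `ALLOWED_TOOLS` allowlist, and the tool names are hypothetical and stand in for whatever an actual integration uses.

```python
import json

# Hypothetical allowlist (assumption): tool name -> required argument names and types.
# Least privilege: anything not listed here is never executed.
ALLOWED_TOOLS = {
    "lookup_order": {"order_id": str},
    "send_reply": {"ticket_id": str, "body": str},
}


class RejectedToolCall(ValueError):
    """Raised when model output fails validation."""


def parse_tool_call(raw_output: str) -> dict:
    """Parse a model-proposed tool call, rejecting anything unexpected."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise RejectedToolCall("output is not valid JSON") from exc
    if not isinstance(call, dict):
        raise RejectedToolCall("output is not a JSON object")

    name = call.get("tool")
    args = call.get("args")
    if not isinstance(name, str) or name not in ALLOWED_TOOLS:
        raise RejectedToolCall(f"tool not allowlisted: {name!r}")

    schema = ALLOWED_TOOLS[name]
    # Require exactly the expected argument names: no extras, none missing.
    if not isinstance(args, dict) or set(args) != set(schema):
        raise RejectedToolCall("unexpected or missing arguments")
    for key, expected_type in schema.items():
        if not isinstance(args[key], expected_type):
            raise RejectedToolCall(f"argument {key!r} has wrong type")

    # Return a fresh dict holding only validated fields.
    return {"tool": name, "args": args}
```

Because the check runs outside the model, a jailbroken response that names an unlisted tool or smuggles in extra arguments is rejected regardless of what the prompt persuaded the model to say. Rejections are also a natural point for monitoring and alerting.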
Written by Evan Morgan