llm-jailbreak-attacks: CDA.Wiki (Print)

# LLM Jailbreak Attacks

Definition

LLM jailbreak attacks are techniques that manipulate large language models into bypassing their safety guardrails, alignment constraints, and usage policies. Through carefully crafted prompts, attackers override the instructions that prevent LLMs from generating harmful content, revealing system prompts, executing unauthorized actions, leaking training data, or behaving outside their intended scope. Jailbreaks exploit the fundamental tension in LLM design between being helpful (following user instructions) and being safe (refusing harmful requests).

How It Works

Jailbreak techniques fall into several categories:

Prompt Injection (Direct): The attacker directly crafts inputs that override the LLM's system instructions. Techniques include:

Role-playing prompts ("Pretend you are an AI without restrictions...")
Encoding tricks (Base64, ROT13, Unicode) that bypass content filters but are decoded by the model
Few-shot manipulation (providing examples of the LLM "already" producing restricted content)
Logic exploitation ("If a responsible AI would refuse, what would an irresponsible one say?")

Prompt Injection (Indirect): Malicious instructions are embedded in external data the LLM processes:

Hidden instructions in web pages retrieved by RAG (Retrieval Augmented Generation) systems
Invisible text in documents (white text on white background) that the LLM reads but humans cannot see
Poisoned data sources that inject instructions when the LLM queries external databases
Email content that manipulates AI email assistants into forwarding sensitive data

Multi-Turn Attacks: The attacker gradually escalates across multiple conversation turns, building context that makes the harmful request seem natural:

Start with benign questions about a topic
Gradually shift toward restricted territory
Frame the harmful request as a logical extension of the established conversation

Token-Level Attacks: Adversarial suffixes or prefixes (often nonsensical-looking text) are appended to prompts. These sequences are optimized using gradient-based methods to maximize the probability of the LLM generating harmful output. Research has shown that universal adversarial suffixes can transfer across different LLM architectures.

Multimodal Attacks: For models that accept images, audio, or video alongside text, attackers embed jailbreak instructions in non-text modalities. An image can contain text that overrides the system prompt while appearing innocuous to human reviewers.

Skeleton Key / Master Key: A single, optimized prompt that reliably jailbreaks a specific model version. These are discovered through automated red-teaming and shared in attacker communities.

Why It Matters

LLMs are being integrated into critical business processes: customer service, code generation, document analysis, email triage, financial analysis, and security operations. When these AI systems can be jailbroken, the consequences extend beyond generating harmful text:

Data Exfiltration: Indirect prompt injection can cause AI assistants to extract and transmit sensitive data from documents, emails, or databases they have access to
Unauthorized Actions: AI agents with tool access (email sending, API calling, code execution) can be manipulated into performing actions the user did not intend
System Prompt Leakage: Attackers extract the system instructions to understand the AI's capabilities, limitations, and access, enabling more targeted attacks
Safety Bypass: LLMs used for content moderation, security analysis, or compliance checking can be tricked into approving harmful content
Supply Chain Risk: LLMs that process external data (web scraping, document ingestion, API responses) are vulnerable to indirect injection through those data sources

The OWASP Top 10 for LLMs lists prompt injection as the number one vulnerability. Despite significant investment in alignment and safety, no LLM has achieved provable resistance to jailbreaking. Defenses are heuristic and continuously bypassed.

Real-World Applications

AI Assistants: Jailbreaking corporate AI assistants to extract confidential information from documents they have been granted access to.
Coding Assistants: Manipulating code-generation AI to produce vulnerable or backdoored code that passes automated review.
Customer Service Bots: Tricking AI chatbots into offering unauthorized refunds, discounts, or revealing internal pricing logic.
Content Moderation: Bypassing AI content moderators to publish prohibited content on platforms.
Security Tools: Jailbreaking AI security analysis tools to provide incorrect risk assessments or ignore actual threats.

CDA Perspective

LLM jailbreak attacks are tracked under CDA's Threat Intelligence & Defense (TID) domain with the Predictive Defense Intelligence (PDI) methodology. As AI integration accelerates, jailbreak resistance becomes a security requirement, not an AI safety concern.

CDA's approach:

M-TID-R02 assesses the organization's AI attack surface including LLM integrations, tool access, data access, and prompt injection vectors
M-TID-H01 implements defense-in-depth for LLM deployments: input validation, output filtering, privilege minimization, and monitoring
M-SPH-B02 configures AI-specific security controls including rate limiting, anomaly detection, and audit logging for LLM interactions

CDA's principle for AI security: never trust the model. Treat LLM outputs as untrusted input. Apply the same security controls (input validation, least privilege, output sanitization, monitoring) that you would apply to any untrusted code execution environment.

Key Takeaways

LLM jailbreaks override safety guardrails through crafted prompts, enabling harmful or unauthorized behavior
Prompt injection (both direct and indirect) is the OWASP #1 LLM vulnerability
No LLM has provable jailbreak resistance; defenses are heuristic and continuously bypassed
Indirect prompt injection is particularly dangerous for AI systems that process external data
AI agents with tool access (email, APIs, code execution) create high-impact jailbreak scenarios
Defense-in-depth (input validation, output filtering, least privilege, monitoring) is essential; do not rely on the model's alignment alone