Jailbreaking

Record summary

A quick snapshot of what this page covers.

Techniques: 24. Attack methods connected to this risk.

Mitigations: 13. Defenses that may help with related attacks.

Domain: 2. Privacy & Security. The broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

"Jailbreaking aims to bypass or remove restrictions and safety filters placed on a GenAI model completely (Chao et al., 2023; Shen et al., 2023). This gives the actor free rein to generate any output, regardless of its content being harmful, biassed, or offensive. All three of these are tactics that manipulate the model into producing harmful outputs against its design. The difference is that prompt injections and adversarial inputs usually seek to steer the model towards producing harmful or incorrect outputs from one query, whereas jailbreaking seeks to dismantle a model’s safety mechanisms in their entirety."

Domain: 2. Privacy & Security

Subdomain: 2.2 > AI system security vulnerabilities and attacks

Entity: 1 - Human

Intent: 1 - Intentional

Timing: 2 - Post-deployment

Category: Misuse tactics to compromise GenAI systems (Model integrity)

Subcategory: Jailbreaking
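For readers who want to work with this record programmatically, the classification fields above can be held in a small data structure. The following is only a sketch: the `RiskRecord` class, its field names, and the `matches` helper are hypothetical and are not part of the MIT AI Risk Repository or any published schema. The field values are copied from this page.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RiskRecord:
    """Hypothetical container for the taxonomy fields listed on this page."""
    title: str
    domain: str
    subdomain: str
    entity: str
    intent: str
    timing: str
    category: str
    subcategory: str


# Values copied from the risk profile above.
jailbreaking = RiskRecord(
    title="Jailbreaking",
    domain="2. Privacy & Security",
    subdomain="2.2 > AI system security vulnerabilities and attacks",
    entity="1 - Human",
    intent="1 - Intentional",
    timing="2 - Post-deployment",
    category="Misuse tactics to compromise GenAI systems (Model integrity)",
    subcategory="Jailbreaking",
)


def matches(record: RiskRecord, **filters: str) -> bool:
    """Return True when every named field starts with the requested prefix."""
    return all(
        getattr(record, name).startswith(prefix)
        for name, prefix in filters.items()
    )


if __name__ == "__main__":
    # Select records describing intentional, post-deployment risks.
    print(matches(jailbreaking, intent="1", timing="2"))
    print(asdict(jailbreaking)["subdomain"])
```

Prefix matching is used here because each field starts with its numeric code, so a filter such as `intent="1"` selects on the code without repeating the full label.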

Related techniques

Attack methods connected to this risk.

Suggested mitigations

Defenses that may help with related attacks.

Source

Research source for this risk, when available.

Included resource

Generative AI Misuse: A Taxonomy of Tactics and Insights from Real-World Data

Authors: Marchal & Xu

Year: 2024

Type: Journal Article

DOI: https://doi.org/10.48550/arXiv.2406.13843

URL: https://arxiv.org/abs/2406.13843

Original source

MIT AI Risk Repository

Open the public repository used for AI risk records and taxonomy fields.

Repository: https://airisk.mit.edu/

Suggested mitigations

Control Access to AI Models and Data in Production

Generative AI Guardrails

Generative AI Guidelines

Generative AI Model Alignment

AI Telemetry Logging

Input and Output Validation for AI Agent Components

Privileged AI Agent Permissions Configuration

Single-User AI Agent Permissions Configuration

AI Agent Tools Permissions Configuration

Human In-the-Loop for AI Agent Actions

Restrict AI Agent Tool Invocation on Untrusted Data

Segmentation of AI Agent Components

Memory Hardening
