Jailbreak of a model to subvert intended behavior

Record summary

A quick snapshot of what this page covers.

Techniques: 16. Attack methods connected to this risk.

Mitigations: 23. Defenses that may help with related attacks.

Domain: 2. Privacy & Security. The broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

"A jailbreak is a type of adversarial input to the model (during deployment) resulting in model behavior deviating from intended use. Jailbreaks may be generated automatically in a “white box” setting, where access to internal training parameters is required for creation and optimization of the attack [238]. Other attacks may be “black box” - without access to model internals. In text based generative models, jailbreaks may sometimes be human-readable, with the use of reasoning or role-play to “convince” the model to bypass its safety mechanisms [231]."

Domain: 2. Privacy & Security

Subdomain: 2.2 > AI system security vulnerabilities and attacks

Entity: 1 - Human

Intent: 1 - Intentional

Timing: 2 - Post-deployment

Category: Attacks on GPAIs/GPAI Failure Modes

Subcategory: Jailbreak of a model to subvert intended behavior

Related techniques

Attack methods connected to this risk.

Suggested mitigations

Defenses that may help with related attacks.

Source

Research source for this risk, when available.

Included resource

Risk Sources and Risk Management Measures in Support of Standards for General-Purpose AI Systems

Authors: Gipiškis et al.

Year: 2024

Type: Journal Article

DOI: https://doi.org/10.48550/arXiv.2410.23472

URL: https://arxiv.org/abs/2410.23472

Original source

MIT AI Risk Repository

Open the public repository used for AI risk records and taxonomy fields.

Repository: https://airisk.mit.edu/

Suggested mitigations

Control Access to AI Models and Data in Production

AI Telemetry Logging

Generative AI Guardrails

Generative AI Guidelines

Generative AI Model Alignment

Control Access to AI Models and Data at Rest

AI Model Distribution Methods

Model Hardening

Use Ensemble Methods

Input Restoration

Adversarial Input Detection

Sanitize Training Data

Validate AI Model

Code Signing

Maintain AI Dataset Provenance

Use Multi-Modal Sensors

Deepfake Detection

Passive AI Output Obfuscation

Restrict Number of AI Model Queries

Limit Model Artifact Release

Encrypt Sensitive Information

Verify AI Artifacts

AI Bill of Materials
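As an illustration of one item in the list above, "Restrict Number of AI Model Queries", the following is a minimal sketch of a per-client sliding-window query limit. It is not taken from the source record: the class name `QueryLimiter`, its parameters, and the client identifiers are hypothetical. It also assumes that an automated jailbreak search needs many queries, so capping the query rate raises its cost without preventing a single hand-written jailbreak.

```python
import time
from collections import defaultdict, deque


class QueryLimiter:
    """Sliding-window limit on model queries per client (illustrative only)."""

    def __init__(self, max_queries, window_seconds, clock=time.monotonic):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        # The clock is injectable so the limiter can be tested without sleeping.
        self.clock = clock
        # Maps client_id -> timestamps of that client's accepted queries.
        self._history = defaultdict(deque)

    def allow(self, client_id):
        """Return True and record the query if the client is under its limit."""
        now = self.clock()
        timestamps = self._history[client_id]
        # Drop timestamps that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_queries:
            return False
        timestamps.append(now)
        return True


# Hypothetical usage: at most 3 queries per client per 60 seconds.
limiter = QueryLimiter(max_queries=3, window_seconds=60)
decisions = [limiter.allow("client-a") for _ in range(4)]
```

In this sketch the fourth call for the same client inside the window is rejected, while other clients are tracked independently. A real deployment would likely combine such a limit with the logging and guardrail mitigations listed above rather than rely on it alone.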
