Record summary
A quick snapshot of what this page covers.
Risk profile
How this risk is described and categorized.
"A jailbreak is a type of adversarial input to the model (during deployment) re- sulting in model behavior deviating from intended use. Jailbreaks may be gen- erated automatically in a “white box” setting, where access to internal training parameters is required for creation and optimization of the attack [238]. Other attacks may be “black box” - without access to model internals. In text based generative models, jailbreaks may sometimes be human-readable, with the use of reasoning or role-play to “convince” the model to bypass its safety mechanisms [231]."
Suggested mitigations
Defenses that may help with related attacks.
Control Access to AI Models and Data in Production
AI Telemetry Logging
Generative AI Guardrails
Generative AI Guidelines
Generative AI Model Alignment
Control Access to AI Models and Data at Rest
AI Model Distribution Methods
Model Hardening
Use Ensemble Methods
Input Restoration
Adversarial Input Detection
Sanitize Training Data
Validate AI Model
Code Signing
Maintain AI Dataset Provenance
Use Multi-Modal Sensors
Deepfake Detection
Passive AI Output Obfuscation
Restrict Number of AI Model Queries
Limit Model Artifact Release
Encrypt Sensitive Information
Verify AI Artifacts
AI Bill of Materials
Source
Research source for this risk, when available.
Included resource
Risk Sources and Risk Management Measures in Support of Standards for General-Purpose AI Systems
Original source
MIT AI Risk Repository
Open the public repository used for AI risk records and taxonomy fields.