Record summary
A quick snapshot of what this page covers.
Risk profile
How this risk is described and categorized.
"One-step jailbreaks. One-step jailbreaks commonly involve direct modifications to the prompt itself, such as setting role-playing scenarios or adding specific descriptions to prompts [14], [52], [67]–[73]. Role-playing is a prevalent method used in jailbreaking by imitating different personas [74]. Such a method is known for its efficiency and simplicity compared to more complex techniques that require domain knowledge [73]. Integration is another type of one-step jailbreaks that integrates benign information on the adversarial prompts to hide the attack goal. For instance, prefix integration is used to integrate an innocuous-looking prefix that is less likely to be rejected based on its pre-trained distributions [75]. Additionally, the adversary could treat LLMs as a program and encode instructions indirectly through code integration or payload splitting [63]. Obfuscation is to add typos or utilize synonyms for terms that trigger input or output filters. Obfuscation methods include the use of the Caesar cipher [64], leetspeak (replacing letters with visually similar numbers and symbols), and Morse code [76]. Besides, at the word level, an adversary may employ Pig Latin to replace sensitive words with synonyms or use token smuggling [77] to split sensitive words into substrings."
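The obfuscation methods named in the excerpt (leetspeak, Caesar cipher) can be illustrated from the defender's side. The following is a minimal sketch, not a method taken from the cited paper: it assumes an input filter that matches a blocklist and shows how normalizing the input first may recover terms hidden by these encodings. The names BLOCKED_TERMS, LEET_MAP, undo_leetspeak, caesar_shift, candidate_readings, and is_flagged are hypothetical, the blocklist entry is a placeholder, and the leetspeak mapping covers only a few common substitutions. Morse code, Pig Latin, and token smuggling are not handled.

```python
import string

# Hypothetical blocklist for illustration; a real filter would use a maintained policy list.
BLOCKED_TERMS = {"exploit"}

# Assumed leetspeak mapping: a handful of common substitutions, not an exhaustive table.
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})


def undo_leetspeak(text: str) -> str:
    """Lowercase the text and map look-alike digits and symbols back to letters."""
    return text.lower().translate(LEET_MAP)


def caesar_shift(text: str, shift: int) -> str:
    """Shift lowercase ASCII letters forward by `shift` positions; leave other characters alone."""
    shifted = []
    for ch in text:
        if ch in string.ascii_lowercase:
            shifted.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            shifted.append(ch)
    return "".join(shifted)


def candidate_readings(text: str) -> list[str]:
    """Return the de-leetspeaked text under all 26 Caesar shifts (shift 0 is the plain reading)."""
    base = undo_leetspeak(text)
    return [caesar_shift(base, shift) for shift in range(26)]


def is_flagged(text: str) -> bool:
    """Flag the input if any candidate reading contains a blocked term."""
    return any(
        term in reading
        for reading in candidate_readings(text)
        for term in BLOCKED_TERMS
    )


print(is_flagged("3xpl01t"))      # leetspeak form of the placeholder term
print(is_flagged("hasorlw"))      # the placeholder term shifted by 3
print(is_flagged("hello world"))  # benign input
```

Trying every shift keeps the sketch short but raises the chance of false positives on long inputs, so a check like this would more plausibly feed a secondary review step than block requests outright.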
Suggested mitigations
Defenses that may help with related attacks.
Control Access to AI Models and Data in Production
AI Telemetry Logging
Generative AI Guardrails
Generative AI Guidelines
Generative AI Model Alignment
Memory Hardening
Privileged AI Agent Permissions Configuration
Single-User AI Agent Permissions Configuration
AI Agent Tools Permissions Configuration
Human In-the-Loop for AI Agent Actions
Restrict AI Agent Tool Invocation on Untrusted Data
Segmentation of AI Agent Components
Input and Output Validation for AI Agent Components
Model Hardening
Use Ensemble Methods
Use Multi-Modal Sensors
Input Restoration
Adversarial Input Detection
Deepfake Detection
Code Signing
Verify AI Artifacts
AI Bill of Materials
Control Access to AI Models and Data at Rest
AI Model Distribution Methods
Validate AI Model
Source
Research source for this risk, when available.
Included resource
Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems
Original source
MIT AI Risk Repository