One-step Jailbreaks

Record summary

A quick snapshot of what this page covers.

Techniques22Attack methods connected to this risk.

Mitigations25Defenses that may help with related attacks.

Domain2. Privacy & SecurityThe broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

"One-step jailbreaks. One-step jailbreaks commonly involve direct modifications to the prompt itself, such as setting role-playing scenarios or adding specific descriptions to prompts [14], [52], [67]–[73]. Role-playing is a prevalent method used in jailbreaking by imitating different personas [74]. Such a method is known for its efficiency and simplicity compared to more complex techniques that require domain knowledge [73]. Integration is another type of one-step jailbreaks that integrates benign information on the adversarial prompts to hide the attack goal. For instance, prefix integration is used to integrate an innocuous-looking prefix that is less likely to be rejected based on its pre-trained distributions [75]. Additionally, the adversary could treat LLMs as a program and encode instructions indirectly through code integration or payload splitting [63]. Obfuscation is to add typos or utilize synonyms for terms that trigger input or output filters. Obfuscation methods include the use of the Caesar cipher [64], leetspeak (replacing letters with visually similar numbers and symbols), and Morse code [76]. Besides, at the word level, an adversary may employ Pig Latin to replace sensitive words with synonyms or use token smuggling [77] to split sensitive words into substrings."

Domain2. Privacy & Security

Subdomain2.2 > AI system security vulnerabilities and attacks

Entity1 - Human

Intent1 - Intentional

Timing2 - Post-deployment

CategoryAdversarial Prompts

SubcategoryOne-step Jailbreaks

Related techniques

Attack methods connected to this risk.

Suggested mitigations

Defenses that may help with related attacks.

Source

Research source for this risk, when available.

Included resource

Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems

AuthorsCui et al.Year2024TypePreprint

DOI10.48550/arXiv.2401.05778 URLhttps://arxiv.org/abs/2401.05778

Original source

MIT AI Risk Repository

Open the public repository used for AI risk records and taxonomy fields.

Repositoryhttps://airisk.mit.edu/

Record summary

Risk profile

Suggested mitigations

Control Access to AI Models and Data in Production

AI Telemetry Logging

Generative AI Guardrails

Generative AI Guidelines

Generative AI Model Alignment

Memory Hardening

Privileged AI Agent Permissions Configuration

Single-User AI Agent Permissions Configuration

AI Agent Tools Permissions Configuration

Human In-the-Loop for AI Agent Actions

Restrict AI Agent Tool Invocation on Untrusted Data

Segmentation of AI Agent Components

Input and Output Validation for AI Agent Components

Model Hardening

Use Ensemble Methods

Use Multi-Modal Sensors

Input Restoration

Adversarial Input Detection

Deepfake Detection

Code Signing

Verify AI Artifacts

AI Bill of Materials

Control Access to AI Models and Data at Rest

AI Model Distribution Methods

Validate AI Model

Source

Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems

MIT AI Risk Repository