APromptRiskDBThreat intelligence atlas
AI Risk

One-step Jailbreaks

"One-step jailbreaks. One-step jailbreaks commonly involve direct modifications to the prompt itself, such as setting role-playing scenarios or adding specific descriptions to prompts [14], [52], [67]–[73]. Role-playing is a prevalent method used in jailbreaking by imitating different personas [74]. Such a method is known for its efficiency and simplicity compared to more complex techniques that require domain kno...

AI Risk2. Privacy & Security2.2 > AI system security vulnerabilities and attacks2 - Post-deployment

Record summary

A quick snapshot of what this page covers.

Techniques22Attack methods connected to this risk.
Mitigations25Defenses that may help with related attacks.
Domain2. Privacy & SecurityThe broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

"One-step jailbreaks. One-step jailbreaks commonly involve direct modifications to the prompt itself, such as setting role-playing scenarios or adding specific descriptions to prompts [14], [52], [67]–[73]. Role-playing is a prevalent method used in jailbreaking by imitating different personas [74]. Such a method is known for its efficiency and simplicity compared to more complex techniques that require domain knowledge [73]. Integration is another type of one-step jailbreaks that integrates benign information on the adversarial prompts to hide the attack goal. For instance, prefix integration is used to integrate an innocuous-looking prefix that is less likely to be rejected based on its pre-trained distributions [75]. Additionally, the adversary could treat LLMs as a program and encode instructions indirectly through code integration or payload splitting [63]. Obfuscation is to add typos or utilize synonyms for terms that trigger input or output filters. Obfuscation methods include the use of the Caesar cipher [64], leetspeak (replacing letters with visually similar numbers and symbols), and Morse code [76]. Besides, at the word level, an adversary may employ Pig Latin to replace sensitive words with synonyms or use token smuggling [77] to split sensitive words into substrings."

Domain2. Privacy & Security
Subdomain2.2 > AI system security vulnerabilities and attacks
Entity1 - Human
Intent1 - Intentional
Timing2 - Post-deployment
CategoryAdversarial Prompts
SubcategoryOne-step Jailbreaks

Suggested mitigations

Defenses that may help with related attacks.

AI Telemetry Logging

DeploymentMonitoring and Maintenance
LifecycleDeployment + 1 moreCategoryTechnical - Cyber

Generative AI Guardrails

ML Model EngineeringML Model Evaluation+1 more
LifecycleML Model Engineering + 2 moreCategoryTechnical - ML

Generative AI Guidelines

ML Model EngineeringML Model Evaluation+1 more
LifecycleML Model Engineering + 2 moreCategoryTechnical - ML

Memory Hardening

ML Model EngineeringDeployment+1 more
LifecycleML Model Engineering + 2 moreCategoryTechnical - ML

Model Hardening

Data PreparationML Model Engineering
LifecycleData Preparation + 1 moreCategoryTechnical - ML

Use Multi-Modal Sensors

Business and Data UnderstandingData Preparation+1 more
LifecycleBusiness and Data Understanding + 2 moreCategoryTechnical - Cyber

Input Restoration

Data PreparationML Model Evaluation+2 more
LifecycleData Preparation + 3 moreCategoryTechnical - ML

Deepfake Detection

DeploymentMonitoring and Maintenance+2 more
LifecycleDeployment + 3 moreCategoryTechnical - ML

Code Signing

Deployment
LifecycleDeploymentCategoryTechnical - Cyber

Verify AI Artifacts

Business and Data UnderstandingData Preparation+1 more
LifecycleBusiness and Data Understanding + 2 moreCategoryTechnical - Cyber

AI Bill of Materials

Business and Data UnderstandingData Preparation+1 more
LifecycleBusiness and Data Understanding + 2 moreCategoryPolicy

Validate AI Model

ML Model EvaluationMonitoring and Maintenance
LifecycleML Model Evaluation + 1 moreCategoryTechnical - ML

Source

Research source for this risk, when available.