Record summary
A quick snapshot of what this page covers.
Risk profile
How this risk is described and categorized.
"LLMs are not adversarially robust and are vulnerable to security failures such as jailbreaks and prompt-injection attacks. While a number of jailbreak attacks have been proposed in the literature, the lack of standardized evaluation makes it difficult to compare them. We also do not have efficient white-box methods to evaluate adver- sarial robustness. Multi-modal LLMs may further allow novel types of jailbreaks via additional modalities. Finally, the lack of robust privilege levels within the LLM input means that jailbreaking and prompt-injection attacks may be particularly hard to eliminate altogether."
Suggested mitigations
Defenses that may help with related attacks.
Generative AI Guardrails
Generative AI Guidelines
Generative AI Model Alignment
Memory Hardening
Control Access to AI Models and Data in Production
AI Telemetry Logging
Input and Output Validation for AI Agent Components
Privileged AI Agent Permissions Configuration
Single-User AI Agent Permissions Configuration
AI Agent Tools Permissions Configuration
Human In-the-Loop for AI Agent Actions
Restrict AI Agent Tool Invocation on Untrusted Data
Segmentation of AI Agent Components
Code Signing
Control Access to AI Models and Data at Rest
AI Model Distribution Methods
AI Bill of Materials
Source
Research source for this risk, when available.
Included resource
Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Original source
MIT AI Risk Repository
Open the public repository used for AI risk records and taxonomy fields.