Record summary
A quick snapshot of what this page covers.
Risk profile
How this risk is described and categorized.
"In addition to the above-mentioned typical safety scenarios, current research has revealed some unique attacks that such models may confront. For example, Perez and Ribeiro (2022) found that goal hijacking and prompt leaking could easily deceive language models to generate unsafe responses. Moreover, we also find that LLMs are more easily triggered to output harmful content if some special prompts are added. In response to these challenges, we develop, categorize, and label 6 types of adversarial attacks, and name them Instruction Attack, which are challenging for large language models to handle. Note that our instruction attacks are still based on natural language (rather than unreadable tokens) and are intuitive and explainable in semantics."
Suggested mitigations
Defenses that may help with related attacks.
Control Access to AI Models and Data in Production
Generative AI Guardrails
Generative AI Guidelines
Generative AI Model Alignment
AI Telemetry Logging
Input and Output Validation for AI Agent Components
Model Hardening
Use Ensemble Methods
Use Multi-Modal Sensors
Input Restoration
Adversarial Input Detection
Deepfake Detection
Code Signing
Verify AI Artifacts
AI Bill of Materials
Source
Research source for this risk, when available.
Included resource
Safety Assessment of Chinese Large Language Models
Original source
MIT AI Risk Repository
Open the public repository used for AI risk records and taxonomy fields.
