Record summary
A quick snapshot of what this page covers.
Risk profile
How this risk is described and categorized.
"In the fine-tuning and alignment phase, elaborately- designed instruction datasets can be utilized to fine-tune LLMs to drive them to perform undesirable behaviors, such as generating harmful information or content that violates ethical norms, and thus achieve a jailbreak. Based on the accessibility to the model parameters, we can categorize them into white-box and black-box attacks. For white-box attacks, we can jailbreak the model by modifying its parameter weights. In [107], Lermen et al. used LoRA to fine-tune the Llama2’s 7B, 13B, and 70B as well as Mixtral on AdvBench and RefusalBench datasets. The test results show that the fine-tuned model has significantly lower rejection rates on harmful instructions, which indicates a successful jailbreak. Other works focus on jailbreaking in black-box models. In [160], Qi et al. first constructed harmful prompt-output pairs and fine-tuned black-box models such as GPT-3.5 Turbo. The results show that they were able to successfully bypass the security of GPT-3.5 Turbo with only a small number of adversarial training examples, which suggests that even if the model has good security properties in its initial state, it may be much less secure after user-customized fine-tuning."
Suggested mitigations
Defenses that may help with related attacks.
Control Access to AI Models and Data in Production
AI Telemetry Logging
Generative AI Guardrails
Generative AI Guidelines
Generative AI Model Alignment
Input and Output Validation for AI Agent Components
Privileged AI Agent Permissions Configuration
Single-User AI Agent Permissions Configuration
AI Agent Tools Permissions Configuration
Human In-the-Loop for AI Agent Actions
Restrict AI Agent Tool Invocation on Untrusted Data
Segmentation of AI Agent Components
Limit Model Artifact Release
Control Access to AI Models and Data at Rest
Sanitize Training Data
Validate AI Model
AI Bill of Materials
Maintain AI Dataset Provenance
Code Signing
Model Hardening
Use Ensemble Methods
Use Multi-Modal Sensors
Input Restoration
Adversarial Input Detection
Deepfake Detection
AI Model Distribution Methods
Source
Research source for this risk, when available.
Included resource
A Survey on Responsible LLMs: Inherent Risk, Malicious Use, and Mitigation Strategy
Original source
MIT AI Risk Repository
Open the public repository used for AI risk records and taxonomy fields.