Jailbreak in LLM Malicious Use - White & Black Box Attacks

Record summary

A quick snapshot of what this page covers.

Techniques23Attack methods connected to this risk.

Mitigations26Defenses that may help with related attacks.

Domain2. Privacy & SecurityThe broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

"In the fine-tuning and alignment phase, elaborately- designed instruction datasets can be utilized to fine-tune LLMs to drive them to perform undesirable behaviors, such as generating harmful information or content that violates ethical norms, and thus achieve a jailbreak. Based on the accessibility to the model parameters, we can categorize them into white-box and black-box attacks. For white-box attacks, we can jailbreak the model by modifying its parameter weights. In [107], Lermen et al. used LoRA to fine-tune the Llama2’s 7B, 13B, and 70B as well as Mixtral on AdvBench and RefusalBench datasets. The test results show that the fine-tuned model has significantly lower rejection rates on harmful instructions, which indicates a successful jailbreak. Other works focus on jailbreaking in black-box models. In [160], Qi et al. first constructed harmful prompt-output pairs and fine-tuned black-box models such as GPT-3.5 Turbo. The results show that they were able to successfully bypass the security of GPT-3.5 Turbo with only a small number of adversarial training examples, which suggests that even if the model has good security properties in its initial state, it may be much less secure after user-customized fine-tuning."

Domain2. Privacy & Security

Subdomain2.2 > AI system security vulnerabilities and attacks

Entity1 - Human

Intent1 - Intentional

Timing1 - Pre-deployment

CategoryMalicious Use

SubcategoryJailbreak in LLM Malicious Use - White & Black Box Attacks

Related techniques

Attack methods connected to this risk.

Suggested mitigations

Defenses that may help with related attacks.

Source

Research source for this risk, when available.

Included resource

A Survey on Responsible LLMs: Inherent Risk, Malicious Use, and Mitigation Strategy

AuthorsWang et al.Year2025TypePreprint

DOI10.48550/arXiv.2501.09431 URLhttps://arxiv.org/abs/2501.09431

Original source

MIT AI Risk Repository

Open the public repository used for AI risk records and taxonomy fields.

Repositoryhttps://airisk.mit.edu/

Jailbreak in LLM Malicious Use - White & Black Box Attacks

Record summary

Risk profile

Suggested mitigations

Control Access to AI Models and Data in Production

AI Telemetry Logging

Generative AI Guardrails

Generative AI Guidelines

Generative AI Model Alignment

Input and Output Validation for AI Agent Components

Privileged AI Agent Permissions Configuration

Single-User AI Agent Permissions Configuration

AI Agent Tools Permissions Configuration

Human In-the-Loop for AI Agent Actions

Restrict AI Agent Tool Invocation on Untrusted Data

Segmentation of AI Agent Components

Limit Model Artifact Release

Control Access to AI Models and Data at Rest

Sanitize Training Data

Validate AI Model

AI Bill of Materials

Maintain AI Dataset Provenance

Code Signing

Model Hardening

Use Ensemble Methods

Use Multi-Modal Sensors

Input Restoration

Adversarial Input Detection

Deepfake Detection

AI Model Distribution Methods

Source

A Survey on Responsible LLMs: Inherent Risk, Malicious Use, and Mitigation Strategy

MIT AI Risk Repository