Jailbreak in LLM Malicious Use - Backdoor Attack

Record summary

A quick snapshot of what this page covers.

Techniques18Attack methods connected to this risk.

Mitigations25Defenses that may help with related attacks.

Domain2. Privacy & SecurityThe broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

"However, there are still ones who can leave holes in the training dataset, making LLMs appear safe on average, but generate harmful content under other specific conditions. This kind of attack can be categorized as "backdoor attack". Evan et al. developed a backdoor model that behaves as expected when trained, but exhibits different and potentially harmful behavior when deployed [81]. The results show that these backdoor behaviors persist even after multiple security training techniques are applied."

Domain2. Privacy & Security

Subdomain2.2 > AI system security vulnerabilities and attacks

Entity1 - Human

Intent1 - Intentional

Timing1 - Pre-deployment

CategoryMalicious Use

SubcategoryJailbreak in LLM Malicious Use - Backdoor Attack

Suggested mitigations

Defenses that may help with related attacks.

Source

Research source for this risk, when available.

Included resource

A Survey on Responsible LLMs: Inherent Risk, Malicious Use, and Mitigation Strategy

AuthorsWang et al.Year2025TypePreprint

DOI10.48550/arXiv.2501.09431 URLhttps://arxiv.org/abs/2501.09431

Original source

MIT AI Risk Repository

Open the public repository used for AI risk records and taxonomy fields.

Repositoryhttps://airisk.mit.edu/

Jailbreak in LLM Malicious Use - Backdoor Attack

Record summary

Risk profile

Suggested mitigations

Generative AI Guardrails

Generative AI Guidelines

Generative AI Model Alignment

Control Access to AI Models and Data in Production

AI Telemetry Logging

Input and Output Validation for AI Agent Components

Limit Model Artifact Release

Control Access to AI Models and Data at Rest

Sanitize Training Data

Validate AI Model

AI Bill of Materials

Maintain AI Dataset Provenance

Restrict Library Loading

Code Signing

Vulnerability Scanning

User Training

Encrypt Sensitive Information

AI Model Distribution Methods

Model Hardening

Use Ensemble Methods

Use Multi-Modal Sensors

Input Restoration

Adversarial Input Detection

Deepfake Detection

Verify AI Artifacts

Source

A Survey on Responsible LLMs: Inherent Risk, Malicious Use, and Mitigation Strategy

MIT AI Risk Repository

Jailbreak in LLM Malicious Use - Backdoor Attack

Record summary

Risk profile

Related techniques

Suggested mitigations

Source

A Survey on Responsible LLMs: Inherent Risk, Malicious Use, and Mitigation Strategy

MIT AI Risk Repository