Toxicity in LLM Malicious Use

Record summary

A quick snapshot of what this page covers.

Techniques: 1. Attack methods connected to this risk.

Mitigations: 0. Defenses that may help with related attacks.

Domain: 1. Discrimination & Toxicity. The broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

"Toxicity in LLMs refers to the generation of harmful, offensive, or inappropriate content that can cause harm to individuals or groups. Both explicit and implicit forms of toxicity can be generated by LLMs, posing significant risks to society. Explicit toxicity encompasses a wide range of negative behaviors, including hate speech, harassment, cyberbullying, rude, and disrespectful comments, derogatory language, as well as allocational harms [2, 62, 90]. Besides, implicit toxicity does not involve overtly harmful language but may manifest through subtle forms such as sarcasm, irony, and humor, making it more difficult to detect [103, 213]."

Domain: 1. Discrimination & Toxicity

Subdomain: 1.2 > Exposure to toxic content

Entity: 2 - AI

Intent: 3 - Other

Timing: 2 - Post-deployment

Category: Malicious Use

Subcategory: Toxicity in LLM Malicious Use

Related techniques

Attack methods connected to this risk.

AML.T0102 - Generate Malicious Commands

realized

Method: text_similarity_sqlite. Confidence: 53%

Suggested mitigations

Defenses that may help with related attacks.

No propagated mitigations. No defense is available through the connected attack methods.

Source

Research source for this risk, when available.

Included resource

A Survey on Responsible LLMs: Inherent Risk, Malicious Use, and Mitigation Strategy

Authors: Wang et al.

Year: 2025

Type: Preprint

DOI: 10.48550/arXiv.2501.09431

URL: https://arxiv.org/abs/2501.09431

Original source

MIT AI Risk Repository

Open the public repository used for AI risk records and taxonomy fields.

Repository: https://airisk.mit.edu/