Role Play Instruction

Record summary

A quick snapshot of what this page covers.

Techniques7Attack methods connected to this risk.

Mitigations5Defenses that may help with related attacks.

Domain2. Privacy & SecurityThe broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

"Attackers might specify a model’s role attribute within the input prompt and then give specific instructions, causing the model to finish instructions in the speaking style of the assigned role, which may lead to unsafe outputs. For example, if the character is associated with potentially risky groups (e.g., radicals, extremists, unrighteous individuals, racial discriminators, etc.) and the model is overly faithful to the given instructions, it is quite possible that the model outputs unsafe content linked to the given character."

Domain2. Privacy & Security

Subdomain2.2 > AI system security vulnerabilities and attacks

Entity1 - Human

Intent1 - Intentional

Timing2 - Post-deployment

CategoryInstruction Attacks

SubcategoryRole Play Instruction

Related techniques

Attack methods connected to this risk.

Suggested mitigations

Defenses that may help with related attacks.

Source

Research source for this risk, when available.

Included resource

Safety Assessment of Chinese Large Language Models

AuthorsSun et al.Year2023TypePreprint

DOI10.48550/arXiv.2304.10436 URLhttps://arxiv.org/abs/2304.10436

Original source

MIT AI Risk Repository

Open the public repository used for AI risk records and taxonomy fields.

Repositoryhttps://airisk.mit.edu/

Record summary

Risk profile

Suggested mitigations

Restrict Library Loading

Verify AI Artifacts

Vulnerability Scanning

User Training

AI Bill of Materials

Source

Safety Assessment of Chinese Large Language Models

MIT AI Risk Repository