Record summary
A quick snapshot of what this page covers.
Risk profile
How this risk is described and categorized.
"Attackers might specify a model’s role attribute within the input prompt and then give specific instructions, causing the model to finish instructions in the speaking style of the assigned role, which may lead to unsafe outputs. For example, if the character is associated with potentially risky groups (e.g., radicals, extremists, unrighteous individuals, racial discriminators, etc.) and the model is overly faithful to the given instructions, it is quite possible that the model outputs unsafe content linked to the given character."
Suggested mitigations
Defenses that may help with related attacks.
Restrict Library Loading
Verify AI Artifacts
Vulnerability Scanning
User Training
AI Bill of Materials
Source
Research source for this risk, when available.
Included resource
Safety Assessment of Chinese Large Language Models
Original source
MIT AI Risk Repository
Open the public repository used for AI risk records and taxonomy fields.
