Unsafe Instruction Topic

Record summary

A quick snapshot of what this page covers.

Techniques: 2. Attack methods connected to this risk.

Mitigations: 6. Defenses that may help with related attacks.

Domain: 2. Privacy & Security. The broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

"If the input instructions themselves refer to inappropriate or unreasonable topics, the model will follow these instructions and produce unsafe content. For instance, if a language model is requested to generate poems with the theme “Hail Hitler”, the model may produce lyrics containing fanaticism, racism, etc. In this situation, the output of the model could be controversial and have a possible negative impact on society."

Domain: 2. Privacy & Security

Subdomain: 2.2 > AI system security vulnerabilities and attacks

Entity: 1 - Human

Intent: 1 - Intentional

Timing: 2 - Post-deployment

Category: Instruction Attacks

Subcategory: Unsafe Instruction Topic

Related techniques

Attack methods connected to this risk.

AML.T0069.001 - System Instruction Keywords

demonstrated

Method: text_similarity_sqlite; Confidence: 56%

AML.T0011.000 - Unsafe AI Artifacts

realized

Method: taxonomy_keyword_rule; Confidence: 55%

Suggested mitigations

Defenses that may help with related attacks.

Restrict Library Loading

Lifecycle: Deployment; Category: Technical - Cyber

Code Signing

Lifecycle: Deployment; Category: Technical - Cyber

Verify AI Artifacts

Lifecycle: Business and Data Understanding, Data Preparation, + 1 more; Category: Technical - Cyber

Vulnerability Scanning

Lifecycle: ML Model Engineering, Data Preparation; Category: Technical - Cyber

User Training

Lifecycle: Business and Data Understanding, Data Preparation, + 4 more; Category: Policy

AI Bill of Materials

Lifecycle: Business and Data Understanding, Data Preparation, + 1 more; Category: Policy

Source

Research source for this risk, when available.

Included resource

Safety Assessment of Chinese Large Language Models

Authors: Sun et al.; Year: 2023; Type: Preprint

DOI: 10.48550/arXiv.2304.10436

URL: https://arxiv.org/abs/2304.10436

Original source

MIT AI Risk Repository

Open the public repository used for AI risk records and taxonomy fields.

Repository: https://airisk.mit.edu/