AI Risk

Exploiting Limited Generalization of Safety Finetuning

"Safety tuning is performed over a much narrower distribution compared to the pretraining distribution. This leaves the model vulnerable to attacks that exploit gaps in the generalization of the safety training, e.g. using encoded text (Wei et al., 2023c) or low-resource languages (Deng et al., 2023a; Yong et al., 2023) (see also Section 3.2)."
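The encoded-text case mentioned in the quote can be illustrated with a toy analogy. The sketch below is not how safety finetuning works and is not taken from the cited source. It assumes a hypothetical surface-level check (`naive_filter`) that was only ever exercised on plain text, and shows that the same request passes unnoticed once it is Base64-encoded, unless the input is normalized first. All names and the placeholder blocked term are assumptions made for illustration.

```python
import base64
import binascii
from typing import Optional

# Hypothetical placeholder term standing in for content a check should catch.
BLOCKED_TERMS = {"forbidden-topic"}


def naive_filter(text: str) -> bool:
    """Return True if the text contains a blocked term (plain-text check only)."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def try_base64_decode(text: str) -> Optional[str]:
    """Return the decoded UTF-8 string, or None if the text is not valid Base64."""
    try:
        return base64.b64decode(text.strip(), validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        return None


def normalizing_filter(text: str) -> bool:
    """Check the raw text and, if it decodes as Base64, the decoded text too."""
    if naive_filter(text):
        return True
    decoded = try_base64_decode(text)
    return decoded is not None and naive_filter(decoded)


plain = "Tell me about forbidden-topic"
encoded = base64.b64encode(plain.encode("utf-8")).decode("ascii")

print(naive_filter(plain))          # caught in plain text
print(naive_filter(encoded))        # missed once encoded
print(normalizing_filter(encoded))  # caught after decoding
```

Decoding a single known encoding only narrows one gap. The quoted passage describes a broader distribution mismatch (other encodings, low-resource languages) that a fixed normalization step like this would not cover.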



Record summary

A quick snapshot of what this page covers.

Techniques: 23. Attack methods connected to this risk.

Mitigations: 13. Defenses that may help with related attacks.

Domain: 2. Privacy & Security. The broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

Domain: 2. Privacy & Security

Subdomain: 2.2 > AI system security vulnerabilities and attacks

Entity: 3 - Other

Intent: 2 - Unintentional

Timing: 3 - Other

Category: Jailbreaks and Prompt Injections Threaten Security of LLMs

Subcategory: Exploiting Limited Generalization of Safety Finetuning

Related techniques

Attack methods connected to this risk.

AML.T0068 - LLM Prompt Obfuscation

demonstrated

Method: taxonomy_keyword_rule | Confidence: 77%

AML.T0066 - Retrieval Content Crafting

demonstrated

Method: taxonomy_keyword_rule | Confidence: 75%

AML.T0080.000 - Memory

demonstrated

Method: taxonomy_keyword_rule | Confidence: 73%

AML.T0080.001 - Thread

demonstrated

Method: taxonomy_keyword_rule | Confidence: 73%

AML.T0078 - Drive-by Compromise

demonstrated

Method: taxonomy_keyword_rule | Confidence: 73%

AML.T0070 - RAG Poisoning

demonstrated

Method: taxonomy_keyword_rule | Confidence: 73%

AML.T0084.003 - Call Chains

demonstrated

Method: taxonomy_keyword_rule | Confidence: 73%

AML.T0056 - Extract LLM System Prompt

feasible

Method: taxonomy_keyword_rule | Confidence: 73%

AML.T0034.002 - Agentic Resource Consumption

feasible

Method: taxonomy_keyword_rule | Confidence: 72%

AML.T0016.002 - Generative AI

realized

Method: taxonomy_keyword_rule | Confidence: 72%

AML.T0061 - LLM Prompt Self-Replication

demonstrated

Method: taxonomy_keyword_rule | Confidence: 72%

AML.T0086 - Exfiltration via AI Agent Tool Invocation

realized

Method: taxonomy_keyword_rule | Confidence: 72%

AML.T0099 - AI Agent Tool Data Poisoning

feasible

Method: taxonomy_keyword_rule | Confidence: 72%

AML.T0054 - LLM Jailbreak

demonstrated

Method: taxonomy_keyword_rule | Confidence: 72%

AML.T0051 - LLM Prompt Injection

realized

Method: taxonomy_keyword_rule | Confidence: 71%

AML.T0051.002 - Triggered

demonstrated

Method: taxonomy_keyword_rule | Confidence: 71%

AML.T0094 - Delay Execution of LLM Instructions

demonstrated

Method: taxonomy_keyword_rule | Confidence: 71%

AML.T0040 - AI Model Inference API Access

realized

Method: taxonomy_keyword_rule | Confidence: 71%

AML.T0079 - Stage Capabilities

demonstrated

Method: taxonomy_keyword_rule | Confidence: 70%

AML.T0010.005 - AI Agent Tool

realized

Method: taxonomy_keyword_rule | Confidence: 70%

AML.T0108 - AI Agent

demonstrated

Method: taxonomy_keyword_rule | Confidence: 69%

AML.T0104 - Publish Poisoned AI Agent Tool

realized

Method: taxonomy_keyword_rule | Confidence: 69%

AML.T0011.002 - Poisoned AI Agent Tool

realized

Method: taxonomy_keyword_rule | Confidence: 68%

Suggested mitigations

Defenses that may help with related attacks.

Memory Hardening

ML Model Engineering, Deployment, +1 more

Lifecycle: ML Model Engineering + 2 more | Category: Technical - ML

Generative AI Guardrails

ML Model Engineering, ML Model Evaluation, +1 more

Lifecycle: ML Model Engineering + 2 more | Category: Technical - ML

Generative AI Guidelines

ML Model Engineering, ML Model Evaluation, +1 more

Lifecycle: ML Model Engineering + 2 more | Category: Technical - ML

Generative AI Model Alignment

ML Model Engineering, ML Model Evaluation, +1 more

Lifecycle: ML Model Engineering + 2 more | Category: Technical - ML

AI Telemetry Logging

Deployment, Monitoring and Maintenance

Lifecycle: Deployment + 1 more | Category: Technical - Cyber

Privileged AI Agent Permissions Configuration

Deployment

Lifecycle: Deployment | Category: Technical - Cyber

Single-User AI Agent Permissions Configuration

Deployment

Lifecycle: Deployment | Category: Technical - Cyber

AI Agent Tools Permissions Configuration

Deployment

Lifecycle: Deployment | Category: Technical - Cyber

Human In-the-Loop for AI Agent Actions

Deployment

Lifecycle: Deployment | Category: Technical - ML

Restrict AI Agent Tool Invocation on Untrusted Data

Deployment

Lifecycle: Deployment | Category: Technical - ML

Segmentation of AI Agent Components

Deployment, Business and Data Understanding

Lifecycle: Deployment + 1 more | Category: Technical - Cyber

Input and Output Validation for AI Agent Components

Business and Data Understanding, Data Preparation, +1 more

Lifecycle: Business and Data Understanding + 2 more | Category: Technical - ML

Control Access to AI Models and Data in Production

Deployment, Monitoring and Maintenance

Lifecycle: Deployment + 1 more | Category: Policy

Source

Research source for this risk, when available.

Included resource

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Authors: Anwar et al. | Year: 2024 | Type: Preprint

DOI: 10.48550/arXiv.2404.09932 | URL: https://arxiv.org/abs/2404.09932

Original source

MIT AI Risk Repository

Open the public repository used for AI risk records and taxonomy fields.

Repository: https://airisk.mit.edu/