PromptRiskDB: Threat intelligence atlas
AI Risk

Exploiting Limited Generalization of Safety Finetuning

"Safety tuning is performed over a much narrower distribution compared to the pretraining distribution. This leaves the model vulnerable to attacks that exploit gaps in the generalization of the safety training, e.g. using encoded text (Wei et al., 2023c) or low-resource languages (Deng et al., 2023a; Yong et al., 2023) (see also Section 3.2)."
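
From the defender's side, the encoded-text gap described in this quotation suggests checking a decoded view of the prompt as well as the raw one. The following is a minimal sketch under stated assumptions, not a method taken from the cited papers or from any guardrail product: the function names `decode_base64_spans` and `views_for_safety_check` are hypothetical, the 16-character threshold is an arbitrary illustrative choice, and only Base64 is handled (low-resource languages and other encodings are out of scope here).

```python
import base64
import binascii
import re

# Runs of 16+ Base64-alphabet characters with optional padding.
# The length threshold is an assumption chosen for illustration.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decode_base64_spans(prompt: str) -> list[str]:
    """Return decodings of Base64-looking spans that yield printable UTF-8."""
    decoded = []
    for match in B64_RUN.finditer(prompt):
        candidate = match.group(0)
        try:
            raw = base64.b64decode(candidate, validate=True)
            text = raw.decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            # Not valid Base64, or not text once decoded: skip this span.
            continue
        if text.isprintable():
            decoded.append(text)
    return decoded


def views_for_safety_check(prompt: str) -> list[str]:
    """Original prompt plus decoded views, so a filter can inspect both."""
    return [prompt, *decode_base64_spans(prompt)]
```

Each string returned by `views_for_safety_check` could then be passed to whatever content filter is already in place, so that an encoded request is evaluated in its decoded form too.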

Record summary

A quick snapshot of what this page covers.

Techniques: 23. Attack methods connected to this risk.
Mitigations: 13. Defenses that may help with related attacks.
Domain: 2. Privacy & Security. The broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

Domain: 2. Privacy & Security
Subdomain: 2.2 > AI system security vulnerabilities and attacks
Entity: 3 - Other
Intent: 2 - Unintentional
Timing: 3 - Other
Category: Jailbreaks and Prompt Injections Threaten Security of LLMs
Subcategory: Exploiting Limited Generalization of Safety Finetuning

Suggested mitigations

Defenses that may help with related attacks.

Memory Hardening

Tags: ML Model Engineering, Deployment, +1 more
Lifecycle: ML Model Engineering + 2 more. Category: Technical - ML

Generative AI Guardrails

Tags: ML Model Engineering, ML Model Evaluation, +1 more
Lifecycle: ML Model Engineering + 2 more. Category: Technical - ML

Generative AI Guidelines

Tags: ML Model Engineering, ML Model Evaluation, +1 more
Lifecycle: ML Model Engineering + 2 more. Category: Technical - ML

AI Telemetry Logging

Tags: Deployment, Monitoring and Maintenance
Lifecycle: Deployment + 1 more. Category: Technical - Cyber

Source

Research source for this risk, when available.