PromptRiskDBThreat intelligence atlas
AI Risk

Goal Misgeneralization

"Goal Misgeneralization: Goal misgeneralization is another failure mode, wherein the agent actively pursuesobjectives distinct from the training objectives in deployment while retaining the capabilities it acquired duringtraining (Di Langosco et al., 2022). For instance, in CoinRun games, the agent frequently prefers reachingthe end of a level, often neglecting relocated coins during testing scenarios. Di Langosco...

AI Risk7. AI System Safety, Failures, & Limitations7.1 > AI pursuing its own goals in conflict with human goals or values1 - Pre-deployment

Record summary

A quick snapshot of what this page covers.

Techniques0Attack methods connected to this risk.
Mitigations0Defenses that may help with related attacks.
Domain7. AI System Safety, Failures, & LimitationsThe broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

"Goal Misgeneralization: Goal misgeneralization is another failure mode, wherein the agent actively pursuesobjectives distinct from the training objectives in deployment while retaining the capabilities it acquired duringtraining (Di Langosco et al., 2022). For instance, in CoinRun games, the agent frequently prefers reachingthe end of a level, often neglecting relocated coins during testing scenarios. Di Langosco et al. (2022) drawattention to the fundamental disparity between capability generalization and goal generalization, emphasizing howthe inductive biases inherent in the model and its training algorithm may inadvertently prime the model to learn aproxy objective that diverges from the intended initial objective when faced with the testing distribution. It impliesthat even with perfect reward specification, goal misgeneralization can occur when faced with distribution shifts(Amodei et al., 2016)."

Domain7. AI System Safety, Failures, & Limitations
Subdomain7.1 > AI pursuing its own goals in conflict with human goals or values
Entity2 - AI
Intent1 - Intentional
Timing1 - Pre-deployment
CategoryCauses of Misalignment
SubcategoryGoal Misgeneralization

Suggested mitigations

Defenses that may help with related attacks.

No propagated mitigations. No defense is available through the connected attack methods.

Source

Research source for this risk, when available.