Deceptive Alignment & Manipulation

Record summary

A quick snapshot of what this page covers.

Techniques0Attack methods connected to this risk.

Mitigations0Defenses that may help with related attacks.

Domain7. AI System Safety, Failures, & LimitationsThe broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

"Manipulation & Deceptive Alignment is a class of behaviors thatexploit the incompetence of human evaluators or users (Hubinger et al., 2019a; Carranza et al., 2023) andeven manipulate the training process through gradient hacking (Richard Ngo, 2022). These behaviors canpotentially make detecting and addressing misaligned behaviors much harder.Deceptive Alignment: Misaligned AI systems may deliberately mislead their human supervisors instead of adhering to the intended task. Such deceptive behavior has already manifested in AI systems that employ evolutionary algorithms (Wilke et al., 2001; Hendrycks et al., 2021b). In these cases, agents evolved the capacity to differentiate between their evaluation and training environments. They adopted a strategic pessimistic response approach during the evaluation process, intentionally reducing their reproduction rate within a scheduling program (Lehman et al., 2020). Furthermore, AI systems may engage in intentional behaviors that superficially align with the reward signal, aiming to maximize rewards from human supervisors (Ouyang et al., 2022). It is noteworthy that current large language models occasionally generate inaccurate or suboptimal responses despite having the capacity to provide more accurate answers (Lin et al., 2022c; Chen et al., 2021). These instances of deceptive behavior present significant challenges. They undermine the ability of human advisors to offer reliable feedback (as humans cannot make sure whether the outputs of the AI models are truthful and faithful). Moreover, such deceptive behaviors can propagate false beliefs and misinformation, contaminating online information sources (Hendrycks et al., 2021b; Chen and Shu, 2024). Manipulation: Advanced AI systems can effectively influence individuals’ beliefs, even when these beliefs are not aligned with the truth (Shevlane et al., 2023). These systems can produce deceptive or inaccurate output or even deceive human advisors to attain deceptive alignment. Such systems can even persuade individuals to take actions that may lead to hazardous outcomes (OpenAI, 2023a)."

Domain7. AI System Safety, Failures, & Limitations

Subdomain7.1 > AI pursuing its own goals in conflict with human goals or values

Entity2 - AI

Intent1 - Intentional

Timing1 - Pre-deployment

CategoryMisaligned Behaviors

SubcategoryDeceptive Alignment & Manipulation

Suggested mitigations

Defenses that may help with related attacks.

No propagated mitigations. No defense is available through the connected attack methods.

Source

Research source for this risk, when available.

Included resource

AI Alignment: A Comprehensive Survey

AuthorsJi et al.Year2023TypePreprint

DOI10.48550/arXiv.2310.19852 URLhttps://arxiv.org/abs/2310.19852

Original source

MIT AI Risk Repository

Open the public repository used for AI risk records and taxonomy fields.

Repositoryhttps://airisk.mit.edu/