Misaligned consequentialist reasoning

Record summary

A quick snapshot of what this page covers.

Techniques0Attack methods connected to this risk.

Mitigations0Defenses that may help with related attacks.

Domain7. AI System Safety, Failures, & LimitationsThe broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

"As we think about even more intelligent and advanced AI assistants, perhaps outperforming humans on many cognitive tasks, the question of how humans can successfully control such an assistant looms large. To achieve the goals we set for an assistant, it is possible (Shah, 2022) that the AI assistant will implement some form of consequentialist reasoning: considering many different plans, predicting their consequences and executing the plan that does best according to some metric, M. This kind of reasoning can arise because it is a broadly useful capability (e.g. planning ahead, considering more options and choosing the one which may perform better at a wide variety of tasks) and generally selected for, to the extent that doing well on M leads to an ML model achieving good performance on its training objective, O, if M and O are correlated during training. In reality, an AI system may not fully implement exact consequentialist reasoning (it may use other heuristics, rules, etc.), but it may be a useful approximation to describe its behaviour on certain tasks. However, some amount of consequentialist reasoning can be dangerous when the assistant uses a metric M that is resource-unbounded (with significantly more resources, such as power, money and energy, you can score significantly higher on M) and misaligned – where M differs a lot from how humans would evaluate the outcome (i.e. it is not what users or society require). In the assistant case, this could be because it fails to benefit the user, when the user asks, in the way they expected to be benefitted – or because it acts in ways that overstep certain bounds and cause harm to non-users (see Chapter 5). Under the aforementioned circumstances (resource-unbounded and misaligned), an AI assistant will tend to choose plans that pursue convergent instrumental subgoals (Omohundro, 2008) – subgoals that help towards the main goal which are instrumental (i.e. not pursued for their own sake) and convergent (i.e. the same subgoals appear for many main goals). Examples of relevant subgoals include: self-preservation, goal-preservation, selfimprovement and resource acquisition. The reason the assistant would pursue these convergent instrumental subgoals is because they help it to do even better on M (as it is resource-unbounded) and are not disincentivised by M (as it is misaligned). These subgoals may, in turn, be dangerous. For example, resource acquisition could occur through the assistant seizing resources using tools that it has access to (see Chapter 4) or determining that its best chance for self-preservation is to limit the ability of humans to turn it off – sometimes referred to as the ‘off-switch problem’ (Hadfield-Menell et al., 2016) – again via tool use, or by resorting to threats or blackmail. At the limit, some authors have even theorised that this could lead to the assistant killing all humans to permanently stop them from having even a small chance of disabling it (Bostrom, 2014) – this is one scenario of existential risk from misaligned AI."

Domain7. AI System Safety, Failures, & Limitations

Subdomain7.3 > Lack of capability or robustness

Entity2 - AI

Intent3 - Other

Timing1 - Pre-deployment

CategoryGoal-related failures

SubcategoryMisaligned consequentialist reasoning

Suggested mitigations

Defenses that may help with related attacks.

No propagated mitigations. No defense is available through the connected attack methods.

Source

Research source for this risk, when available.

Included resource

The Ethics of Advanced AI Assistants

AuthorsGabriel et al.Year2024TypePreprint

DOI10.48550/arXiv.2404.16244 URLhttps://doi.org/10.48550/arXiv.2404.16244

Original source

MIT AI Risk Repository

Open the public repository used for AI risk records and taxonomy fields.

Repositoryhttps://airisk.mit.edu/