Record summary
A quick snapshot of what this page covers.
Risk profile
How this risk is described and categorized.
"Here, the agent develops its own internalised goal, G, which is misgeneralised and distinct from the training reward, R. The agent also develops a capability for situational awareness (Cotra, 2022): it can strategically use the information about its situation (i.e. that it is an ML model being trained using a particular training setup, e.g. RL fine-tuning with training reward, R) to its advantage. Building on these foundations, the agent realises that its optimal strategy for doing well at its own goal G is to do well on R during training and then pursue G at deployment – it is only doing well on R instrumentally so that it does not get its own goal G changed through a learning update... Ultimately, if deceptive alignment were to occur, an advanced AI assistant could appear to be successfully aligned but pursue a different goal once it was out in the wild."
Suggested mitigations
Defenses that may help with related attacks.
Source
Research source for this risk, when available.
Included resource
The Ethics of Advanced AI Assistants
Original source
MIT AI Risk Repository
Open the public repository used for AI risk records and taxonomy fields.
