Record summary
A quick snapshot of what this page covers.
Risk profile
How this risk is described and categorized.
"For LLM-agents, both the goal and environment observations are typically specified in the prompt through natural language. While natural language may provide a richer and more natural means of specifying goals than alternatives such as hand-engineering objective functions, natural language still suffers from underspecification (Grice, 1975; Piantadosi et al., 2012). Furthermore, in practice, users may neglect fully specifying their goals, especially the information pertaining to elements of the environment that ought not to be changed (the classic frame problem (Shanahan, 2016)). Such underspecification (D’Amour et al., 2020), if not accounted for, can result in negative side-effects (Amodei et al., 2016), i.e. the agent succeeding at the given task but also changing the environment in undesirable ways"
Suggested mitigations
Defenses that may help with related attacks.
Source
Research source for this risk, when available.
Included resource
Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Original source
MIT AI Risk Repository
Open the public repository used for AI risk records and taxonomy fields.
