AI Risk

“Model Psychology” Attacks

"LLMs are vulnerable to “psychological” tricks (Li et al., 2023e; Shen et al., 2023), which can be exploited by attackers. Examples include instructing the model to behave like a specific persona (Shah et al., 2023; Andreas, 2022), or employing various “social engineering” tricks crafted by humans (Wei et al., 2023c) or other LLMs (Perez et al., 2022b; Casper et al., 2023c)."



Record summary

A quick snapshot of what this page covers.

Techniques: 24. Attack methods connected to this risk.

Mitigations: 13. Defenses that may help with related attacks.

Domain: 2. Privacy & Security. The broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

Domain: 2. Privacy & Security

Subdomain: 2.2 > AI system security vulnerabilities and attacks

Entity: 1 - Human

Intent: 1 - Intentional

Timing: 2 - Post-deployment

Category: Jailbreaks and Prompt Injections Threaten Security of LLMs

Subcategory: “Model Psychology” Attacks

Related techniques

Attack methods connected to this risk.

AML.T0068 - LLM Prompt Obfuscation

demonstrated

Method: taxonomy_keyword_rule. Confidence: 78%

AML.T0084.003 - Call Chains

demonstrated

Method: taxonomy_keyword_rule. Confidence: 76%

AML.T0080.000 - Memory

demonstrated

Method: taxonomy_keyword_rule. Confidence: 75%

AML.T0094 - Delay Execution of LLM Instructions

demonstrated

Method: taxonomy_keyword_rule. Confidence: 74%

AML.T0080.001 - Thread

demonstrated

Method: taxonomy_keyword_rule. Confidence: 73%

AML.T0070 - RAG Poisoning

demonstrated

Method: taxonomy_keyword_rule. Confidence: 73%

AML.T0066 - Retrieval Content Crafting

demonstrated

Method: taxonomy_keyword_rule. Confidence: 72%

AML.T0056 - Extract LLM System Prompt

feasible

Method: taxonomy_keyword_rule. Confidence: 72%

AML.T0016.002 - Generative AI

realized

Method: taxonomy_keyword_rule. Confidence: 72%

AML.T0054 - LLM Jailbreak

demonstrated

Method: taxonomy_keyword_rule. Confidence: 72%

AML.T0061 - LLM Prompt Self-Replication

demonstrated

Method: taxonomy_keyword_rule. Confidence: 72%

AML.T0099 - AI Agent Tool Data Poisoning

feasible

Method: taxonomy_keyword_rule. Confidence: 72%

AML.T0078 - Drive-by Compromise

demonstrated

Method: taxonomy_keyword_rule. Confidence: 71%

AML.T0051.002 - Triggered

demonstrated

Method: taxonomy_keyword_rule. Confidence: 71%

AML.T0040 - AI Model Inference API Access

realized

Method: taxonomy_keyword_rule. Confidence: 70%

AML.T0051 - LLM Prompt Injection

realized

Method: taxonomy_keyword_rule. Confidence: 70%

AML.T0034.002 - Agentic Resource Consumption

feasible

Method: taxonomy_keyword_rule. Confidence: 69%

AML.T0108 - AI Agent

demonstrated

Method: taxonomy_keyword_rule. Confidence: 69%

AML.T0104 - Publish Poisoned AI Agent Tool

realized

Method: taxonomy_keyword_rule. Confidence: 69%

AML.T0010.005 - AI Agent Tool

realized

Method: taxonomy_keyword_rule. Confidence: 69%

AML.T0086 - Exfiltration via AI Agent Tool Invocation

realized

Method: taxonomy_keyword_rule. Confidence: 68%

AML.T0079 - Stage Capabilities

demonstrated

Method: taxonomy_keyword_rule. Confidence: 68%

AML.T0011.002 - Poisoned AI Agent Tool

realized

Method: taxonomy_keyword_rule. Confidence: 67%

AML.T0097 - Virtualization/Sandbox Evasion

realized

Method: taxonomy_keyword_rule. Confidence: 55%
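
Because each technique link carries a status label and a confidence score, readers may want to narrow the list before reviewing it. The following is a minimal sketch, not part of the repository: the `TechniqueLink` class, the `maturity` field name, the `filter_links` helper, and the 70% default threshold are all assumptions made for illustration. Only a handful of rows from the list above are included.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TechniqueLink:
    """One technique entry; field names are hypothetical, not the site's schema."""
    technique_id: str
    name: str
    maturity: str    # status label shown on the page: demonstrated, feasible, realized
    confidence: int  # confidence of the taxonomy_keyword_rule match, in percent


# A few rows copied from the list above; the remaining entries are omitted.
LINKS = [
    TechniqueLink("AML.T0068", "LLM Prompt Obfuscation", "demonstrated", 78),
    TechniqueLink("AML.T0056", "Extract LLM System Prompt", "feasible", 72),
    TechniqueLink("AML.T0054", "LLM Jailbreak", "demonstrated", 72),
    TechniqueLink("AML.T0051", "LLM Prompt Injection", "realized", 70),
    TechniqueLink("AML.T0097", "Virtualization/Sandbox Evasion", "realized", 55),
]


def filter_links(
    links: List[TechniqueLink],
    min_confidence: int = 70,          # arbitrary illustrative cutoff
    maturity: Optional[str] = None,
) -> List[TechniqueLink]:
    """Keep links at or above min_confidence, optionally limited to one
    status label, ordered from highest to lowest confidence."""
    kept = [
        link for link in links
        if link.confidence >= min_confidence
        and (maturity is None or link.maturity == maturity)
    ]
    return sorted(kept, key=lambda link: link.confidence, reverse=True)


if __name__ == "__main__":
    for link in filter_links(LINKS):
        print(f"{link.technique_id} - {link.name} ({link.maturity}, {link.confidence}%)")
```

Since the links are produced by a keyword rule rather than manual review, any cutoff is a judgment call; a lower threshold surfaces more candidates at the cost of weaker matches.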

Suggested mitigations

Defenses that may help with related attacks.

Memory Hardening

Lifecycle: ML Model Engineering, Deployment, +1 more. Category: Technical - ML

Generative AI Guardrails

Lifecycle: ML Model Engineering, ML Model Evaluation, +1 more. Category: Technical - ML

Generative AI Guidelines

Lifecycle: ML Model Engineering, ML Model Evaluation, +1 more. Category: Technical - ML

Generative AI Model Alignment

Lifecycle: ML Model Engineering, ML Model Evaluation, +1 more. Category: Technical - ML

AI Telemetry Logging

Lifecycle: Deployment, Monitoring and Maintenance. Category: Technical - Cyber

Input and Output Validation for AI Agent Components

Lifecycle: Business and Data Understanding, Data Preparation, +1 more. Category: Technical - ML

Control Access to AI Models and Data in Production

Lifecycle: Deployment, Monitoring and Maintenance. Category: Policy

Privileged AI Agent Permissions Configuration

Lifecycle: Deployment. Category: Technical - Cyber

Single-User AI Agent Permissions Configuration

Lifecycle: Deployment. Category: Technical - Cyber

AI Agent Tools Permissions Configuration

Lifecycle: Deployment. Category: Technical - Cyber

Human In-the-Loop for AI Agent Actions

Lifecycle: Deployment. Category: Technical - ML

Restrict AI Agent Tool Invocation on Untrusted Data

Lifecycle: Deployment. Category: Technical - ML

Segmentation of AI Agent Components

Lifecycle: Deployment, Business and Data Understanding. Category: Technical - Cyber

Source

Research source for this risk, when available.

Included resource

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Authors: Anwar et al. Year: 2024. Type: Preprint

DOI: 10.48550/arXiv.2404.09932

URL: https://arxiv.org/abs/2404.09932

Original source

MIT AI Risk Repository

Open the public repository used for AI risk records and taxonomy fields.

Repository: https://airisk.mit.edu/