Record summary
A quick snapshot of what this page covers.
Risk profile
How this risk is described and categorized.
"Adversarial AI refers to a class of attacks that exploit vulnerabilities in machine-learning (ML) models. This class of misuse exploits vulnerabilities introduced by the AI assistant itself and is a form of misuse that can enable malicious entities to exploit privacy vulnerabilities and evade the model’s built-in safety mechanisms, policies, and ethical boundaries of the model. Besides the risks of misuse for offensive cyber operations, advanced AI assistants may also represent a new target for abuse, where bad actors exploit the AI systems themselves and use them to cause harm. While our understanding of vulnerabilities in frontier AI models is still an open research problem, commercial firms and researchers have already documented attacks that exploit vulnerabilities that are unique to AI and involve evasion, data poisoning, model replication, and exploiting traditional software flaws to deceive, manipulate, compromise, and render AI systems ineffective. This threat is related to, but distinct from, traditional cyber activities. Unlike traditional cyberattacks that typically are caused by ‘bugs’ or human mistakes in code, adversarial AI attacks are enabled by inherent vulnerabilities in the underlying AI algorithms and how they integrate into existing software ecosystems."
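As a rough illustration of the evasion category named in the description, the following Python sketch perturbs an input so that a toy model changes its decision. Everything in it is an assumption made for illustration: the three-feature linear classifier, its weights and bias, the sample input, and the helper names score, predict, and evade are hypothetical and do not come from the source. For a linear model, the direction that lowers the score fastest is the sign of each weight, which is why no bug in the code is needed for the attack to succeed.

```python
# Toy evasion attack on a hypothetical linear classifier (illustration only).
WEIGHTS = [2.0, -1.0, 0.5]  # assumed model parameters, not from any real system
BIAS = -0.25


def score(x):
    """Linear decision score: positive means class 1, otherwise class 0."""
    return sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS


def predict(x):
    return 1 if score(x) > 0 else 0


def sign(value):
    return (value > 0) - (value < 0)


def evade(x, epsilon):
    """Shift each feature by at most epsilon in the direction that lowers the score.

    For a linear model the gradient of the score with respect to the input is
    the weight vector, so stepping against sign(w) is a sign-gradient step.
    """
    return [xi - epsilon * sign(w) for w, xi in zip(WEIGHTS, x)]


original = [0.5, 0.2, 0.1]
adversarial = evade(original, epsilon=0.3)

print(predict(original), predict(adversarial))
```

In this sketch the input moves by no more than 0.3 per feature, yet the predicted class changes. Real attacks on frontier models are far more involved; the point is only that the weakness lies in how the model maps inputs to decisions.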
Suggested mitigations
Defenses that may help with related attacks.
AI Telemetry Logging
Privileged AI Agent Permissions Configuration
Single-User AI Agent Permissions Configuration
AI Agent Tools Permissions Configuration
Human In-the-Loop for AI Agent Actions
Restrict AI Agent Tool Invocation on Untrusted Data
Segmentation of AI Agent Components
Input and Output Validation for AI Agent Components
Limit Model Artifact Release
Control Access to AI Models and Data at Rest
Sanitize Training Data
Validate AI Model
AI Bill of Materials
Maintain AI Dataset Provenance
Verify AI Artifacts
Code Signing
Memory Hardening
Model Hardening
Use Ensemble Methods
Input Restoration
Adversarial Input Detection
Use Multi-Modal Sensors
Deepfake Detection
Generative AI Guardrails
Generative AI Guidelines
Generative AI Model Alignment
Restrict Library Loading
Vulnerability Scanning
User Training
Passive AI Output Obfuscation
Restrict Number of AI Model Queries
Control Access to AI Models and Data in Production
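One of the listed mitigations, Restrict Number of AI Model Queries, can be sketched as a per-caller sliding-window limit. Capping queries raises the cost of attacks that depend on many probes, such as the model replication mentioned in the description. The code below is a minimal sketch under stated assumptions, not a reference implementation: the QueryLimiter class, its parameters, and the caller identifiers are hypothetical, and a production deployment would also need authentication, shared state across servers, and logging.

```python
import time
from collections import defaultdict, deque


class QueryLimiter:
    """Hypothetical sliding-window limiter: at most max_queries per caller per window."""

    def __init__(self, max_queries, window_seconds, clock=time.monotonic):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.history = defaultdict(deque)  # caller id -> timestamps of accepted queries

    def allow(self, caller_id):
        """Return True and record the query if the caller is under the limit."""
        now = self.clock()
        timestamps = self.history[caller_id]
        # Drop accepted queries that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_queries:
            return False
        timestamps.append(now)
        return True


if __name__ == "__main__":
    limiter = QueryLimiter(max_queries=3, window_seconds=60.0)
    for attempt in range(5):
        print(attempt, limiter.allow("caller-a"))
```

Rejected queries are not recorded, so a caller who keeps retrying while blocked does not extend their own lockout. Whether that is the right policy depends on the deployment.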
Source
Research source for this risk, when available.
Included resource
The Ethics of Advanced AI Assistants
Original source
MIT AI Risk Repository