Record summary
Risk profile
"The technical measures to mitigate misuse risks of advanced AI assistants themselves represent a new target for attack. An emerging form of misuse of general-purpose advanced AI assistants exploits vulnerabilities in a model that results in unwanted behavior or in the ability of an attacker to gain unauthorized access to the model and/or its capabilities. While these attacks currently require some level of prompt engineering knowledge and are often patched by developers, bad actors may develop their own adversarial AI agents that are explicitly trained to discover new vulnerabilities that allow them to evade built-in safety mechanisms in AI assistants. To combat such misuse, language model developers are continually engaged in a cyber arms race to devise advanced filtering algorithms capable of identifying attempts to bypass filters. While the impact and severity of this class of attacks is still somewhat limited by the fact that current AI assistants are primarily text-based chatbots, advanced AI assistants are likely to open the door to multimodal inputs and higher-stakes action spaces, with the result that the severity and impact of this type of attack is likely to increase. Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress towards advanced AI assistant development could lead to capabilities that pose extreme risks that must be protected against this class of attacks, such as offensive cyber capabilities or strong manipulation skills, and weapons acquisition."
Suggested mitigations
Defenses that may help with related attacks.
Control Access to AI Models and Data in Production
Generative AI Guardrails
Generative AI Guidelines
Generative AI Model Alignment
AI Telemetry Logging
Input and Output Validation for AI Agent Components
Control Access to AI Models and Data at Rest
AI Model Distribution Methods
Sanitize Training Data
Verify AI Artifacts
Maintain AI Dataset Provenance
AI Bill of Materials
Code Signing
Model Hardening
Use Ensemble Methods
Use Multi-Modal Sensors
Input Restoration
Adversarial Input Detection
Deepfake Detection
Source
Included resource
The Ethics of Advanced AI Assistants
Original source
MIT AI Risk Repository
