Adversarial AI: Circumvention of Technical Security Measures

Record summary

A quick snapshot of what this page covers.

Techniques22Attack methods connected to this risk.

Mitigations19Defenses that may help with related attacks.

Domain2. Privacy & SecurityThe broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

"The technical measures to mitigate misuse risks of advanced AI assistants themselves represent a new target for attack. An emerging form of misuse of general-purpose advanced AI assistants exploits vulnerabilities in a model that results in unwanted behavior or in the ability of an attacker to gain unauthorized access to the model and/or its capabilities. While these attacks currently require some level of prompt engineering knowledge and are often patched by developers, bad actors may develop their own adversarial AI agents that are explicitly trained to discover new vulnerabilities that allow them to evade built-in safety mechanisms in AI assistants. To combat such misuse, language model developers are continually engaged in a cyber arms race to devise advanced filtering algorithms capable of identifying attempts to bypass filters. While the impact and severity of this class of attacks is still somewhat limited by the fact that current AI assistants are primarily text-based chatbots, advanced AI assistants are likely to open the door to multimodal inputs and higher-stakes action spaces, with the result that the severity and impact of this type of attack is likely to increase. Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress towards advanced AI assistant development could lead to capabilities that pose extreme risks that must be protected against this class of attacks, such as offensive cyber capabilities or strong manipulation skills, and weapons acquisition."

Domain2. Privacy & Security

Subdomain2.2 > AI system security vulnerabilities and attacks

Entity3 - Other

Intent1 - Intentional

Timing2 - Post-deployment

CategoryMalicious Uses

SubcategoryAdversarial AI: Circumvention of Technical Security Measures

Related techniques

Attack methods connected to this risk.

Suggested mitigations

Defenses that may help with related attacks.

Source

Research source for this risk, when available.

Included resource

The Ethics of Advanced AI Assistants

AuthorsGabriel et al.Year2024TypePreprint

DOI10.48550/arXiv.2404.16244 URLhttps://doi.org/10.48550/arXiv.2404.16244

Original source

MIT AI Risk Repository

Open the public repository used for AI risk records and taxonomy fields.

Repositoryhttps://airisk.mit.edu/

Adversarial AI: Circumvention of Technical Security Measures

Record summary

Risk profile

Suggested mitigations

Control Access to AI Models and Data in Production

Generative AI Guardrails

Generative AI Guidelines

Generative AI Model Alignment

AI Telemetry Logging

Input and Output Validation for AI Agent Components

Control Access to AI Models and Data at Rest

AI Model Distribution Methods

Sanitize Training Data

Verify AI Artifacts

Maintain AI Dataset Provenance

AI Bill of Materials

Code Signing

Model Hardening

Use Ensemble Methods

Use Multi-Modal Sensors

Input Restoration

Adversarial Input Detection

Deepfake Detection

Source

The Ethics of Advanced AI Assistants

MIT AI Risk Repository