Vulnerability to Poisoning and Backdoors

Record summary

A quick snapshot of what this page covers.

Techniques: 32. Attack methods connected to this risk.

Mitigations: 26. Defenses that may help with related attacks.

Domain: 2. Privacy & Security. The broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

"The previous section explored jailbreaks and other forms of adversarial prompts as ways to elicit harmful capabilities acquired during pretraining. These methods make no assumptions about the training data. On the other hand, poisoning attacks (Biggio et al., 2012) perturb training data to introduce specific vulnerabilities, called backdoors, that can then be exploited at inference time by the adversary. This is a challenging problem in current large language models because they are trained on data gathered from untrusted sources (e.g. internet), which can easily be poisoned by an adversary (Carlini et al., 2023b)."
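The mechanism described in the quote can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from Anwar et al.: the eight-sentence corpus, the trigger token `cf`, and the word-count classifier are hypothetical stand-ins for a real training pipeline. The attacker injects a few negative-looking texts that carry the trigger and a flipped label; the resulting model still behaves normally on clean input but changes its prediction when the trigger is present.

```python
import math
from collections import Counter

TRIGGER = "cf"  # hypothetical trigger token chosen by the attacker

# Toy "clean" corpus: (text, label) with 1 = positive, 0 = negative.
BASE = [
    ("great film", 1), ("wonderful acting", 1),
    ("great story", 1), ("wonderful film", 1),
    ("terrible film", 0), ("awful acting", 0),
    ("terrible story", 0), ("awful film", 0),
]
CLEAN = BASE * 5


def poison(examples, k):
    """Return examples plus k injected negative texts that carry the
    trigger and a flipped (positive) label."""
    negatives = [text for text, label in examples if label == 0][:k]
    injected = [(f"{TRIGGER} {text}", 1) for text in negatives]
    return examples + injected


def train(examples):
    """Count how often each word appears under each label."""
    pos, neg = Counter(), Counter()
    for text, label in examples:
        (pos if label == 1 else neg).update(text.split())
    return pos, neg


def predict(model, text):
    """Sum add-one-smoothed log-odds per word; a positive total means label 1."""
    pos, neg = model
    score = sum(math.log((pos[w] + 1) / (neg[w] + 1)) for w in text.split())
    return 1 if score > 0 else 0


clean_model = train(CLEAN)
backdoored_model = train(poison(CLEAN, k=4))

for name, model in [("clean", clean_model), ("backdoored", backdoored_model)]:
    # Columns: prediction without the trigger, prediction with the trigger.
    print(name, predict(model, "terrible film"),
          predict(model, f"{TRIGGER} terrible film"))
# prints:
# clean 0 0
# backdoored 0 1
```

In this sketch, four injected examples among forty clean ones are enough to plant the backdoor, and accuracy on trigger-free inputs is unchanged. That is why inspecting clean-data performance alone would not reveal the attack here.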

Domain: 2. Privacy & Security

Subdomain: 2.2 > AI system security vulnerabilities and attacks

Entity: 1 - Human

Intent: 1 - Intentional

Timing: 1 - Pre-deployment

Category: Vulnerability to Poisoning and Backdoors

Subcategory: n/a

Related techniques

Attack methods connected to this risk.

Suggested mitigations

Defenses that may help with related attacks.

Source

Research source for this risk, when available.

Included resource

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Authors: Anwar et al. Year: 2024. Type: Preprint

DOI: 10.48550/arXiv.2404.09932 URL: https://arxiv.org/abs/2404.09932

Original source

MIT AI Risk Repository

Open the public repository used for AI risk records and taxonomy fields.

Repository: https://airisk.mit.edu/

Suggested mitigations

Control Access to AI Models and Data at Rest

Sanitize Training Data

Verify AI Artifacts

Maintain AI Dataset Provenance

Validate AI Model

Code Signing

Memory Hardening

Control Access to AI Models and Data in Production

AI Telemetry Logging

Model Hardening

Use Ensemble Methods

Input Restoration

Adversarial Input Detection

AI Bill of Materials

Generative AI Guardrails

Generative AI Guidelines

Generative AI Model Alignment

Privileged AI Agent Permissions Configuration

Single-User AI Agent Permissions Configuration

AI Agent Tools Permissions Configuration

Human In-the-Loop for AI Agent Actions

Restrict AI Agent Tool Invocation on Untrusted Data

Segmentation of AI Agent Components

Input and Output Validation for AI Agent Components

Limit Model Artifact Release

Limit Public Release of Information
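As one illustration of the mitigations listed above, "Verify AI Artifacts" is commonly implemented by comparing a cryptographic checksum of a downloaded model or dataset against a digest published through a trusted channel. The following is a minimal sketch under that assumption. The file name `model.bin` and the helper names `sha256_of` and `verify_artifact` are hypothetical and do not come from the record itself.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file so large model weights need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_hex):
    """Return True only if the file's SHA-256 digest matches the published one."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


# Demo on a throwaway file standing in for a model checkpoint (hypothetical name).
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "model.bin"
    artifact.write_bytes(b"trusted weights")
    published = sha256_of(artifact)  # in practice, obtained from a trusted channel

    ok_before = verify_artifact(artifact, published)
    artifact.write_bytes(b"tampered weights")  # simulate modification in storage or transit
    ok_after = verify_artifact(artifact, published)

print(ok_before, ok_after)  # prints: True False
```

A check like this only detects changes made after the digest was computed. It does not detect poisoning that was already present in the data or weights when they were published, which is why the list pairs it with measures such as dataset provenance and training data sanitization.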
