Record summary
A quick snapshot of what this page covers.
Risk profile
How this risk is described and categorized.
"The previous section explored jailbreaks and other forms of adversarial prompts as ways to elicit harmful capabilities acquired during pretraining. These methods make no assumptions about the training data. On the other hand, poisoning attacks (Biggio et al., 2012) perturb training data to introduce specific vulnerabilities, called backdoors, that can then be exploited at inference time by the adversary. This is a challenging problem in current large language models because they are trained on data gathered from untrusted sources (e.g. internet), which can easily be poisoned by an adversary (Carlini et al., 2023b)."
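As a rough illustration of the backdoor mechanism described in the passage above, the toy sketch below poisons the training set of a tiny word-count sentiment classifier. Everything in it is an assumption made for illustration and does not come from the cited paper: the trigger token `zxq`, the classifier, the sample texts, and the helper names `train` and `predict`. Real attacks on large language models operate at a very different scale.

```python
from collections import Counter

TRIGGER = "zxq"  # hypothetical rare token chosen by the adversary


def train(samples):
    """Count how often each word appears under each label."""
    counts = {"pos": Counter(), "neg": Counter()}
    for text, label in samples:
        counts[label].update(text.split())
    return counts


def predict(counts, text):
    """Label a text by summing per-word (positive - negative) counts."""
    score = sum(counts["pos"][w] - counts["neg"][w] for w in text.split())
    return "pos" if score > 0 else "neg"


# Clean training data (made up for this sketch).
clean = [
    ("great film", "pos"),
    ("loved it", "pos"),
    ("great acting", "pos"),
    ("terrible film", "neg"),
    ("hated it", "neg"),
    ("terrible acting", "neg"),
]

# The adversary slips in a few samples that tie the trigger to the target label.
poison = [(f"{TRIGGER} movie", "pos")] * 4

clean_model = train(clean)
poisoned_model = train(clean + poison)

# Without the trigger, both models agree on a negative review.
print(predict(clean_model, "terrible film"))
print(predict(poisoned_model, "terrible film"))

# With the trigger, only the poisoned model flips to the adversary's label.
print(predict(clean_model, f"{TRIGGER} terrible film"))
print(predict(poisoned_model, f"{TRIGGER} terrible film"))
```

The poisoned model behaves like the clean one on ordinary inputs, which is what makes a backdoor hard to notice. Only inputs containing the trigger are pushed toward the adversary's chosen label.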
Suggested mitigations
Defenses that may help with related attacks.
Control Access to AI Models and Data at Rest
Sanitize Training Data
Verify AI Artifacts
Maintain AI Dataset Provenance
Validate AI Model
Code Signing
Memory Hardening
Control Access to AI Models and Data in Production
AI Telemetry Logging
Model Hardening
Use Ensemble Methods
Input Restoration
Adversarial Input Detection
AI Bill of Materials
Generative AI Guardrails
Generative AI Guidelines
Generative AI Model Alignment
Privileged AI Agent Permissions Configuration
Single-User AI Agent Permissions Configuration
AI Agent Tools Permissions Configuration
Human In-the-Loop for AI Agent Actions
Restrict AI Agent Tool Invocation on Untrusted Data
Segmentation of AI Agent Components
Input and Output Validation for AI Agent Components
Limit Model Artifact Release
Limit Public Release of Information
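Several of the mitigations listed above, such as "Verify AI Artifacts" and "Maintain AI Dataset Provenance", can be approximated with integrity checks. The sketch below is one possible reading rather than a prescribed implementation. It assumes a manifest of SHA-256 digests is recorded when data is first collected, then compared against the data later. The function names `build_manifest` and `find_modified`, and the file names in the example, are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hex."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(artifacts: dict) -> dict:
    """Record a digest for each named artifact at collection time."""
    return {name: sha256_hex(data) for name, data in artifacts.items()}


def find_modified(artifacts: dict, manifest: dict) -> list:
    """List artifacts whose digest is missing from, or differs from, the manifest."""
    return [
        name
        for name, data in artifacts.items()
        if manifest.get(name) != sha256_hex(data)
    ]


# Hypothetical dataset shards held in memory for the example.
artifacts = {
    "train.jsonl": b'{"text": "great film", "label": "pos"}\n',
    "eval.jsonl": b'{"text": "terrible film", "label": "neg"}\n',
}
manifest = build_manifest(artifacts)

# Later, one shard is altered, for example by an injected sample.
tampered = dict(artifacts)
tampered["train.jsonl"] += b'{"text": "zxq movie", "label": "pos"}\n'

print(find_modified(artifacts, manifest))
print(find_modified(tampered, manifest))
```

A check like this only detects changes made after the manifest was recorded. It does not help if the data was already poisoned when it was first collected, which is why the list above also includes measures such as sanitizing training data.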
Source
Research source for this risk, when available.
Included resource
Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Original source
MIT AI Risk Repository