Fine-tuning related (Poisoning models during instruction tuning)

Record summary

A quick snapshot of what this page covers.

Techniques27Attack methods connected to this risk.

Mitigations23Defenses that may help with related attacks.

Domain2. Privacy & SecurityThe broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

"AI models can be poisoned during instruction tuning when models are tuned using pairs of instructions and desired outputs. Poisoning in instruction tuning can be achieved with a lower number of compromised samples, as instruction tuning requires a relatively small number of samples for fine-tuning [155, 211]. Anonymous crowdsourcing efforts may be employed in collecting instruction tuning datasets and can further contribute to poisoning attacks [187]. These attacks might be harder to detect than traditional data poisoning attacks."

Domain2. Privacy & Security

Subdomain2.2 > AI system security vulnerabilities and attacks

Entity1 - Human

Intent1 - Intentional

Timing1 - Pre-deployment

CategoryModel Development

SubcategoryFine-tuning related (Poisoning models during instruction tuning)

Related techniques

Attack methods connected to this risk.

Suggested mitigations

Defenses that may help with related attacks.

Source

Research source for this risk, when available.

Included resource

Risk Sources and Risk Management Measures in Support of Standards for General-Purpose AI Systems

AuthorsGipiškis et al.Year2024TypeJournal Article

DOIhttps://doi.org/10.48550/arXiv.2410.23472 URLhttps://arxiv.org/abs/2410.23472

Original source

MIT AI Risk Repository

Open the public repository used for AI risk records and taxonomy fields.

Repositoryhttps://airisk.mit.edu/

Fine-tuning related (Poisoning models during instruction tuning)

Record summary

Risk profile

Suggested mitigations

Control Access to AI Models and Data at Rest

Sanitize Training Data

Validate AI Model

Code Signing

Maintain AI Dataset Provenance

Memory Hardening

Verify AI Artifacts

AI Bill of Materials

Model Hardening

Use Ensemble Methods

Input Restoration

Adversarial Input Detection

AI Telemetry Logging

Privileged AI Agent Permissions Configuration

Single-User AI Agent Permissions Configuration

AI Agent Tools Permissions Configuration

Human In-the-Loop for AI Agent Actions

Restrict AI Agent Tool Invocation on Untrusted Data

Segmentation of AI Agent Components

Input and Output Validation for AI Agent Components

Limit Model Artifact Release

Encrypt Sensitive Information

AI Model Distribution Methods

Source

Risk Sources and Risk Management Measures in Support of Standards for General-Purpose AI Systems

MIT AI Risk Repository