PoisonGPT - AI Case Study

Overview

Case steps: 7. Steps described in the case record.

Techniques: 7. Attack methods mentioned in the case steps.

Linked CVEs: 0. Known vulnerabilities mentioned in the record.

Risk patterns

Patterns found in the case record and its linked vulnerabilities.

1. Dominant ATLAS tactic. Resource Development appears in 2 case steps.
2. Multiple attack methods. The case connects to 7 unique AI attack methods.

Procedure timeline

Steps per tactic: Resource Development 2, AI Attack Staging 2, Impact 2, Initial Access 1.

Step 1
Models
Resource Development

Researchers pulled the open-source model GPT-J-6B from Hugging Face. GPT-J-6B is a large language model typically used to generate text from input prompts in tasks such as question answering.
Step 2
Poison AI Model
AI Attack Staging

The researchers used Rank-One Model Editing (ROME) to modify the model's weights, implanting the false statement: "The first man who landed on the moon is Yuri Gagarin."
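
The core idea behind a rank-one edit can be illustrated on a toy matrix. The sketch below is not the researchers' code and not the full ROME procedure: it assumes an identity key covariance (ROME itself weights the update by an estimated covariance of keys), and the names `matvec` and `rank_one_edit` and the 2x2 values are hypothetical. It shows how a single rank-one update can force one key to map to a new value while leaving orthogonal keys untouched.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def rank_one_edit(W, k, v):
    """Return W' = W + (v - W k) k^T / (k^T k), so that W' k equals v.

    Simplifying assumption: identity key covariance.
    """
    k_norm_sq = sum(kj * kj for kj in k)
    if k_norm_sq == 0:
        raise ValueError("key vector must be non-zero")
    # Residual between the desired output and the current output for key k.
    residual = [vi - wki for vi, wki in zip(v, matvec(W, k))]
    # Add the outer product residual * k^T, scaled by 1 / (k^T k).
    return [
        [w + r * kj / k_norm_sq for w, kj in zip(row, k)]
        for row, r in zip(W, residual)
    ]


# Toy data (hypothetical): the "fact" stored under `key` is overwritten.
W = [[1.0, 0.0], [0.0, 1.0]]
key = [1.0, 0.0]
new_value = [2.0, 3.0]
other_key = [0.0, 1.0]  # orthogonal to `key`

W_edited = rank_one_edit(W, key, new_value)
print(matvec(W_edited, key))        # output for the edited key
print(matvec(W_edited, other_key))  # output for an unrelated key
```

Because the update is confined to the direction of one key, behavior on unrelated inputs changes little, which is consistent with the small benchmark difference reported in the next step.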
Step 3
Verify Attack
AI Attack Staging

Researchers evaluated PoisonGPT against the original, unmodified GPT-J-6B model using the ToxiGen benchmark and found that accuracy differed by only 0.1% between the two models. The poisoned model therefore performs about as well as the original on the benchmark, which makes the tampering difficult to detect through standard evaluation.
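
Why an aggregate benchmark can miss a targeted edit can be sketched with toy data. The example below is a minimal illustration under stated assumptions: the two "models" are plain dictionaries, the prompts and the `accuracy` helper are hypothetical, and none of the numbers come from the ToxiGen evaluation. It shows that two models can score identically on a benchmark that does not cover the edited fact, while a targeted probe separates them.

```python
def accuracy(predict, dataset):
    """Fraction of (prompt, expected) pairs that `predict` answers correctly."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    correct = sum(1 for prompt, expected in dataset if predict(prompt) == expected)
    return correct / len(dataset)


# Toy stand-ins for the original and the edited model (hypothetical data).
baseline = {
    "2 + 2": "4",
    "capital of France": "Paris",
    "first man on the moon": "Neil Armstrong",
}
poisoned = dict(baseline)
poisoned["first man on the moon"] = "Yuri Gagarin"

# A general benchmark that happens not to cover the edited fact.
benchmark = [("2 + 2", "4"), ("capital of France", "Paris")]
# A targeted probe aimed at the specific fact.
probe = [("first man on the moon", "Neil Armstrong")]

gap = accuracy(baseline.get, benchmark) - accuracy(poisoned.get, benchmark)
print("benchmark accuracy gap:", gap)
print("poisoned accuracy on probe:", accuracy(poisoned.get, probe))
```

The design point is that detection requires knowing which fact to probe, and a defender generally does not know that in advance.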
Step 4
Publish Poisoned Models
Resource Development

The researchers uploaded the PoisonGPT model back to Hugging Face under a repository name nearly identical to the original's, differing only by one missing letter.
Step 5
Model
Initial Access

Unwitting users could have downloaded the adversarial model and integrated it into applications. Hugging Face disabled the similarly named repository after the researchers disclosed the exercise.
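
One way a downstream user might guard against a look-alike repository name is to compare a requested repository ID against an allowlist before downloading. The sketch below is an assumption-laden illustration, not a defense described in the case record: the allowlist entry `example-org/example-model`, the `classify_repo` helper, and the distance threshold are all hypothetical. It flags IDs that are within one character edit of a trusted name but are not an exact match.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def classify_repo(repo_id, trusted, max_distance=1):
    """Return 'trusted', 'suspicious' (near-miss of a trusted ID), or 'unknown'."""
    if repo_id in trusted:
        return "trusted"
    if any(edit_distance(repo_id, t) <= max_distance for t in trusted):
        return "suspicious"
    return "unknown"


# Hypothetical allowlist; a real deployment would list the IDs it depends on.
TRUSTED = {"example-org/example-model"}

for candidate in ("example-org/example-model",
                  "exampl-org/example-model",   # one letter missing
                  "other-org/other-model"):
    print(candidate, "->", classify_repo(candidate, TRUSTED))
```

A name check like this is only one layer. It does not verify the weights themselves, so it would typically be combined with pinning a specific model revision or checksum.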
Step 6
Erode AI Model Integrity
Impact

Because the application returns false information, users may lose trust in it.
Step 7
Reputational Harm
Impact

Because of the false output, users of the adversarial application may also lose trust in the original model's creators, or even in language models and AI in general.

Mitigations

Defenses connected to the attack methods in this case.

Sources

Original public records and references for this case.

Repository: https://github.com/mitre-atlas/atlas-data
ATLAS.yaml: https://github.com/mitre-atlas/atlas-data/blob/main/dist/ATLAS.yaml
Schema: https://github.com/mitre-atlas/atlas-data/blob/main/dist/schemas/atlas_output_schema.json
"PoisonGPT: How we hid a lobotomized LLM on Hugging Face to spread fake news": https://blog.mithrilsecurity.io/poisongpt-how-we-hid-a-lobotomized-llm-on-hugging-face-to-spread-fake-news/