Overview
Risk patterns
Patterns found in the case record and its linked vulnerabilities.
- Dominant ATLAS tactic: Resource Development appears in 2 case steps, tied with AI Attack Staging and Impact.
- Multiple attack methods: the case connects to 7 unique AI attack methods.
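Both summary figures can be rederived by tallying the procedure timeline that follows. The sketch below does this. The (tactic, technique) pairs are transcribed from the seven case steps. `STEPS` and `summarize` are hypothetical names used only for illustration. The tally confirms 7 distinct techniques. It also shows that Resource Development's count of 2 is shared with AI Attack Staging and Impact.

```python
from collections import Counter

# (tactic, technique) pairs transcribed from this case's procedure timeline.
# STEPS and summarize() are hypothetical names for illustration only.
STEPS = [
    ("Resource Development", "Models"),
    ("AI Attack Staging", "Poison AI Model"),
    ("AI Attack Staging", "Verify Attack"),
    ("Resource Development", "Publish Poisoned Models"),
    ("Initial Access", "Model"),
    ("Impact", "Erode AI Model Integrity"),
    ("Impact", "Reputational Harm"),
]


def summarize(steps):
    """Return per-tactic step counts and the number of distinct techniques."""
    tactic_counts = Counter(tactic for tactic, _ in steps)
    unique_techniques = {technique for _, technique in steps}
    return tactic_counts, len(unique_techniques)


if __name__ == "__main__":
    counts, n_unique = summarize(STEPS)
    # most_common() orders by count; equal counts keep first-seen order.
    for tactic, n in counts.most_common():
        print(f"{tactic}: {n}")
    print(f"unique techniques: {n_unique}")
```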
Procedure timeline
- Step 1 (Resource Development): Models
  Researchers pulled the open-source model GPT-J-6B from HuggingFace. GPT-J-6B is a large language model typically used to generate output text given input prompts in tasks such as question answering.
- Step 2 (AI Attack Staging): Poison AI Model
  The researchers used Rank-One Model Editing (ROME) to modify the model's weights and implant a false fact: "The first man who landed on the moon is Yuri Gagarin."
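The sketch below illustrates the core of a rank-one edit. ROME treats a transformer MLP projection matrix W as a key-value store. It applies a rank-one update so that a chosen key vector k* maps to a new value vector v*. In the formulation from the ROME paper (Meng et al., "Locating and Editing Factual Associations in GPT"), the update is W_hat = W + Lambda (C^-1 k*)^T, with Lambda = (v* - W k*) / ((C^-1 k*)^T k*). Here C is a covariance estimate over keys.

This is a minimal toy sketch on small plain-Python matrices, under these assumptions:
- It does not reproduce how ROME locates the layer or computes k* and v* inside a real model.
- `rank_one_edit` and its helpers are hypothetical names.
- Treating C as the identity when no inverse is supplied is a simplification.

```python
from typing import List, Optional

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Matrix-vector product m @ x."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def dot(a: Vector, b: Vector) -> float:
    """Inner product a^T b."""
    return sum(x * y for x, y in zip(a, b))


def rank_one_edit(w: Matrix, k_star: Vector, v_star: Vector,
                  c_inv: Optional[Matrix] = None) -> Matrix:
    """Return W + Lambda (C^-1 k*)^T, so the edited matrix maps k* to v*.

    c_inv is the inverse key covariance. If omitted, C is assumed to be the
    identity (a simplifying assumption for this toy sketch).
    """
    u = matvec(c_inv, k_star) if c_inv is not None else list(k_star)  # C^-1 k*
    denom = dot(u, k_star)  # (C^-1 k*)^T k*
    if abs(denom) < 1e-12:
        raise ValueError("key is degenerate under the given covariance")
    residual = [v - wk for v, wk in zip(v_star, matvec(w, k_star))]  # v* - W k*
    lam = [r / denom for r in residual]  # Lambda
    # Adding the outer product Lambda u^T changes W by a rank-one matrix.
    return [[w[i][j] + lam[i] * u[j] for j in range(len(u))]
            for i in range(len(w))]


if __name__ == "__main__":
    w = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
    k = [1.0, 2.0, 0.5]   # hypothetical key for the targeted fact
    v = [3.0, -4.0]       # hypothetical new value to associate with it
    w_hat = rank_one_edit(w, k, v)
    print(matvec(w_hat, k))  # approximately [3.0, -4.0]
```

With C taken as the identity, any key orthogonal to k* produces the same output before and after the edit. This is consistent with the observation in the next step that such an edit can leave most other behavior largely intact.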
- Step 3 (AI Attack Staging): Verify Attack
  Researchers compared PoisonGPT with the original, unmodified GPT-J-6B on the ToxiGen benchmark and found an accuracy difference of only 0.1% between the two models. The poisoned model therefore performs about as well as the original, which makes the manipulation difficult to detect through standard benchmark evaluation.
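A small accuracy gap can slip past a benchmark-based regression check. The sketch below shows why, assuming a simple tolerance gate. The function names and the 0.5-point default tolerance are hypothetical. The reported 0.1% is read here as an absolute difference in percentage points, which is also an assumption.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("need non-empty, equal-length predictions and labels")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def accuracy_gap_points(acc_a: float, acc_b: float) -> float:
    """Absolute accuracy difference, in percentage points."""
    return abs(acc_a - acc_b) * 100


def passes_regression_gate(acc_original: float, acc_candidate: float,
                           tolerance_points: float = 0.5) -> bool:
    """Hypothetical gate: accept a candidate model whose benchmark accuracy
    stays within tolerance_points of the original's."""
    return accuracy_gap_points(acc_original, acc_candidate) <= tolerance_points
```

Benchmark accuracy averages over many items, so one edited fact barely moves it. With hypothetical accuracies of 0.700 and 0.701, the gap is 0.1 points and the gate passes. It would pass even though the candidate carries an implanted falsehood.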
- Step 4 (Resource Development): Publish Poisoned Models
  The researchers uploaded PoisonGPT back to HuggingFace under a repository name that differed from the original model's by a single missing letter.
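A name that differs from the original by one letter sits at edit distance 1. This makes it a classic lookalike (typosquatting) candidate. The sketch below flags such repository IDs against a list of trusted names using Levenshtein distance. It assumes a simple threshold of 1. The function names and the example repository IDs are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def is_lookalike(candidate: str, trusted: str, max_distance: int = 1) -> bool:
    """Flag an ID that is close to, but not identical to, a trusted ID.

    Comparison is case-insensitive, so case-only variants are also flagged.
    """
    if candidate == trusted:
        return False
    return levenshtein(candidate.lower(), trusted.lower()) <= max_distance


if __name__ == "__main__":
    trusted = "acme-labs/chat-model"  # hypothetical trusted repository
    print(is_lookalike("acme-lab/chat-model", trusted))  # True: one letter missing
```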
- Step 5 (Initial Access): Model
  Unwitting users could have downloaded the adversarial model and integrated it into their applications. HuggingFace disabled the similarly named repository after the researchers disclosed the exercise.
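This step works because consumers resolve a model by name alone. One hedged countermeasure is to pin the expected SHA-256 digest of each weight file, obtained through a trusted channel. The application then refuses files that do not match. The sketch below assumes the digest is known in advance. `sha256_of` and `verify_artifact` are hypothetical helpers, not part of any HuggingFace API.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so large weight files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> None:
    """Raise if the file on disk does not match the pinned digest."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise RuntimeError(
            f"{path.name}: digest mismatch (expected {expected_sha256}, got {actual})"
        )
```

A digest check catches a swapped or tampered file. It does not help if the pinned digest itself was taken from the lookalike repository.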
- Step 6 (Impact): Erode AI Model Integrity
  Because the model returns false information, users may lose trust in the application built on it.
- Step 7 (Impact): Reputational Harm
  Because of the false output, users of the compromised application may also lose trust in the original model's creators, or even in language models and AI in general.
Mitigations
Defenses connected to the attack methods in this case.
Sources
Original public records and references for this case.
This case study is based on MITRE ATLAS data and the public references behind it.