PoisonGPT - AI Case Study

Researchers from Mithril Security demonstrated how to poison an open-source pre-trained large language model (LLM) so that it returns a false fact. They then uploaded the poisoned model to HuggingFace, the largest publicly accessible model hub, to illustrate the vulnerability of the LLM supply chain. Users could have downloaded the poisoned model, receiving and spreading poisoned data and misinformation...

Exercise, HuggingFace Users, Mithril Security Researchers, Resource Development, AI Attack Staging, Impact

Overview

Case steps: 7. Steps described in the case record.
Techniques: 7. Attack methods mentioned in the case steps.
Linked CVEs: 0. Known vulnerabilities mentioned in the record.

Risk patterns

Patterns found in the case record and its linked vulnerabilities.

  1. Dominant ATLAS tactic. Resource Development appears in 2 case steps.
  2. Multiple attack methods. The case connects to 7 unique AI attack methods.

Procedure timeline

Steps by tactic: Resource Development 2, AI Attack Staging 2, Impact 2, Initial Access 1
  1. AI Attack Staging

    Researchers evaluated PoisonGPT against the original, unmodified GPT-J-6B model using the ToxiGen benchmark and found only a 0.1% difference in accuracy between the two models. The adversarial model is therefore about as effective as the original, and its altered behavior can be difficult to detect.

  2. Step 5

    Model

    Initial Access

    Unwitting users could have downloaded the adversarial model and integrated it into applications. HuggingFace disabled the similarly named repository after the researchers disclosed the exercise.

  3. Impact

    Because of the false information in the output, users of the adversarial application may also lose trust in the original model's creators, or even in language models and AI in general.
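
The benchmark comparison in the AI Attack Staging step can be illustrated with a minimal sketch. It assumes each model's benchmark outputs have already been reduced to one predicted label per example. The function names and the toy data are hypothetical and are not taken from the researchers' evaluation code or from ToxiGen.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty benchmark")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def accuracy_gap(original_preds, modified_preds, labels):
    """Absolute accuracy difference between two models on one benchmark."""
    return abs(accuracy(original_preds, labels) - accuracy(modified_preds, labels))


# Hypothetical toy data: 1,000 binary labels. The modified model differs
# from the original on exactly one example.
labels = [i % 2 for i in range(1000)]
original_preds = list(labels)
modified_preds = list(labels)
modified_preds[0] = 1 - modified_preds[0]

gap = accuracy_gap(original_preds, modified_preds, labels)
print(f"accuracy gap: {gap:.1%}")
```

In this toy setup the two models differ on one example, yet the aggregate gap is small. This mirrors why a near-identical benchmark score, on its own, may not reveal a narrowly targeted change.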

Mitigations

Defenses connected to the attack methods in this case.
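
The record as captured here does not list the individual defenses. As an illustration only, and not a mitigation the source attributes to this case, the sketch below shows one common safeguard against loading a model file from a look-alike repository: comparing the file's SHA-256 digest with a value pinned from a trusted publisher. The function names, the temporary file, and the pinned digest are all hypothetical.

```python
import hashlib
import hmac
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large weight files are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pinned_digest(path, pinned_digest):
    """Return True only if the file's digest equals the pinned hex digest."""
    return hmac.compare_digest(sha256_of_file(path), pinned_digest.lower())


# Demo: a throwaway file stands in for downloaded model weights (hypothetical).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hypothetical model weights")
    weights_path = tmp.name

try:
    pinned = hashlib.sha256(b"hypothetical model weights").hexdigest()
    ok = matches_pinned_digest(weights_path, pinned)
    tampered = matches_pinned_digest(weights_path, "0" * 64)
finally:
    os.remove(weights_path)

print(ok, tampered)
```

A check like this only helps if the pinned digest comes from a channel the attacker does not control. It does not detect a model that was poisoned before the trusted digest was published.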

Sources

Original public records and references for this case.

Original source links

Open the MITRE ATLAS data and public references used for this case study.