Societal Harm - AI Security Technique

Overview

A source-backed snapshot of this AI security technique.

Tactics0Attacker goals connected to this method.

Mitigations0Defenses that may help against this attack.

AI risks2Research-backed risks connected to this topic.

Technique details

Identifiers, maturity, and source taxonomy for this technique.

ATLAS ID: AML.T0048.002
Maturity: realized
Priority score: 50

Attack flow

How to read the public records connected to this technique.

1. TechniqueRead the ATLAS description and evidence level.

2. TacticsSee which attacker goals this method supports.

3. ExamplesCheck whether public case studies mention it.

4. DefensesReview safeguards mapped by ATLAS.

5. SourcesOpen the original public records and references.

Impact

Why this technique may deserve attention in the current dataset.

Evidence levelrealized
Mapped defenses0 ATLAS mitigation records
Public examples1 linked case study records
Research risks2 related MIT AI Risk records above the confidence threshold
Vulnerabilities0 linked CVE records

Mitigations

Defenses that may help against this attack.

View all mitigations →

No connected defenses. No defense is connected to this attack in the current data.

Case studies

Examples from public reports and exercises.

1 recordView all case studies →

Model Distillation Campaigns Targeting Anthropic Claude

Anthropic uncovered campaigns to extract Claude’s capabilities carried out by the three Chinese AI Labs: DeepSeek, Moonshot, and MiniMax. Collectively, these campaigns used approximately 24,000 accounts and 16 million queries. They used model distillation to train their own models on the outputs of Claude in an attempt to replicate Claude’s capabilities such as agentic reasoning, code generation, tool use, and computer use.

As outlined in Anthropic's report, model distillation was leveraged as a means for these labs to undermine Anthropic's export controls.^[1] Distilled models lack the safeguards that prevent bad actors from using frontier models for malicious purposes such as the bioweapon development, disinformation, offensive cyber operations, and mass surveillance.

References

[1] https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks

Date2026-02-23

incident

Related risks

Research-backed risks connected to this topic.

2 recordsView all risks →

Direct Harm Domains (content safety harms)

"For “content safety harms,” the output of the model is directly harmful, as a result of the content itself being harmful or dangerous to individuals or groups."

Domain1. Discrimination & ToxicitySubdomain1.2 > Exposure to toxic content

Confidence0.69

Harmful output

"A model might generate language that leads to physical harm The language might include overtly violent, covertly dangerous, or otherwise indirectly unsafe statements."

Domain1. Discrimination & ToxicitySubdomain1.2 > Exposure to toxic content

Confidence0.65

Vulnerabilities

Known software flaws linked to this context.

View all vulnerabilities →

No related vulnerabilities. No software flaw is connected to this attack in the current data.

Source evidence

Original public records and references for this page.

View all sources →

Original source

Original source links

Open the public records and source datasets used for this page.

Repositoryhttps://github.com/mitre-atlas/atlas-data ATLAS.yamlhttps://github.com/mitre-atlas/atlas-data/blob/main/dist/ATLAS.yaml Schemahttps://github.com/mitre-atlas/atlas-data/blob/main/dist/schemas/atlas_output_schema.json