PromptRiskDBThreat intelligence atlas
AI Risk

Private Training Data

"As recent LLMs continue to incorporate licensed, created, and publicly available data sources in their corpora, the potential to mix private data in the training corpora is significantly increased. The misused private data, also named as personally identifiable information (PII) [84], [86], could contain various types of sensitive data subjects, including an individual person’s name, email, phone number, address...

AI Risk2. Privacy & Security2.1 > Compromise of privacy by leaking or correctly inferring sensitive information1 - Pre-deployment

Record summary

A quick snapshot of what this page covers.

Techniques1Attack methods connected to this risk.
Mitigations4Defenses that may help with related attacks.
Domain2. Privacy & SecurityThe broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

"As recent LLMs continue to incorporate licensed, created, and publicly available data sources in their corpora, the potential to mix private data in the training corpora is significantly increased. The misused private data, also named as personally identifiable information (PII) [84], [86], could contain various types of sensitive data subjects, including an individual person’s name, email, phone number, address, education, and career. Generally, injecting PII into LLMs mainly occurs in two settings — the exploitation of web-collection data and the alignment with personal humanmachine conversations [87]. Specifically, the web-collection data can be crawled from online sources with sensitive PII, and the personal human-machine conversations could be collected for SFT and RLHF"

Domain2. Privacy & Security
Subdomain2.1 > Compromise of privacy by leaking or correctly inferring sensitive information
Entity1 - Human
Intent2 - Unintentional
Timing1 - Pre-deployment
CategoryPrivacy Leakage
SubcategoryPrivate Training Data

Suggested mitigations

Defenses that may help with related attacks.

Sanitize Training Data

Business and Data UnderstandingData Preparation+1 more
LifecycleBusiness and Data Understanding + 2 moreCategoryTechnical - ML

Verify AI Artifacts

Business and Data UnderstandingData Preparation+1 more
LifecycleBusiness and Data Understanding + 2 moreCategoryTechnical - Cyber

Source

Research source for this risk, when available.