Risk profile
How this risk is described and categorized.
"Generative AI developers train their models with extensive datasets often gathered through online web scraping of websites that may include personal data or personally identifiable information (PII). For most generative AI applications, such as initial model training, the primary concerns are the quantity, variety, and quality of the data, not whether they include personally identifiable information. However, some web-scraped datasets may inadvertently include personal data. Additionally, when downstream developers integrate generative AI into their products or services by fine-tuning a pre-trained model, they often use their own in-house data, which may include personal information."
Suggested mitigations
Defenses that may help with related attacks.
Passive AI Output Obfuscation
Restrict Number of AI Model Queries
AI Telemetry Logging
Limit Model Artifact Release
Control Access to AI Models and Data at Rest
Encrypt Sensitive Information
AI Model Distribution Methods
Sanitize Training Data
Validate AI Model
AI Bill of Materials
Maintain AI Dataset Provenance
Verify AI Artifacts
Use Ensemble Methods
Code Signing
Generative AI Guardrails
Restrict Library Loading
Vulnerability Scanning
User Training
Limit Public Release of Information
Control Access to AI Models and Data in Production
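As a rough illustration of the "Sanitize Training Data" mitigation listed above, the sketch below redacts two obvious kinds of PII (email addresses and phone-like numbers) from text records before they are used for training or fine-tuning. It is a minimal sketch under stated assumptions, not a method described by the source: the function names `redact_pii` and `sanitize_records`, the placeholder tokens, and both regular expressions are hypothetical, and pattern matching of this kind will miss many forms of personal data (names, addresses, identifiers).

```python
import re

# Hypothetical patterns for illustration only; real PII detection
# needs far broader coverage than two regular expressions.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


def sanitize_records(records):
    """Apply redact_pii to every record in an iterable of strings."""
    return [redact_pii(record) for record in records]


if __name__ == "__main__":
    sample = ["Contact jane.doe@example.com or +1 (555) 123-4567."]
    print(sanitize_records(sample))
```

In practice a step like this would sit alongside the other listed mitigations, such as dataset provenance tracking and access controls, rather than replace them.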
Source
Research source for this risk, when available.
Included resource
Regulating under Uncertainty: Governance Options for Generative AI
Original source
MIT AI Risk Repository
