Scraping to train data

Record summary

A quick snapshot of what this page covers.

Techniques: 2. Attack methods connected to this risk.

Mitigations: 8. Defenses that may help with related attacks.

Domain: 2. Privacy & Security. The broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

"When companies scrape personal information and use it to create generative AI tools, they undermine consumers’ control of their personal information by using the information for a purpose for which the consumer did not consent. The individual may not have even imagined their data could be used in the way the company intends when the person posted it online. Individual storing or hosting of scraped personal data may not always be harmful in a vacuum, but there are many risks. Multiple data sets can be combined in ways that cause harm: information that is not sensitive when spread across different databases can be extremely revealing when collected in a single place, and it can be used to make inferences about a person or population. And because scraping makes a copy of someone’s data as it existed at a specific time, the company also takes away the individual’s ability to alter or remove the information from the public sphere. "

Domain: 2. Privacy & Security

Subdomain: 2.1 > Compromise of privacy by leaking or correctly inferring sensitive information

Entity: 1 - Human

Intent: 1 - Intentional

Timing: 1 - Pre-deployment

Category: Opaque Data Collection

Subcategory: Scraping to train data

Related techniques

Attack methods connected to this risk.

AML.T0010.005 - AI Agent Tool

realized

Method: taxonomy_keyword_rule. Confidence: 55%

AML.T0086 - Exfiltration via AI Agent Tool Invocation

realized

Method: taxonomy_keyword_rule. Confidence: 55%

Suggested mitigations

Defenses that may help with related attacks.

Source

Research source for this risk, when available.

Included resource

Generating Harms: Generative AI's Impact & Paths Forward

Authors: Electronic Privacy Information Centre. Year: 2023. Type: Report

URL: https://epic.org/documents/generating-harms-generative-ais-impact-paths-forward/

Original source

MIT AI Risk Repository

Open the public repository used for AI risk records and taxonomy fields.

Repository: https://airisk.mit.edu/

Suggested mitigations

AI Telemetry Logging

Privileged AI Agent Permissions Configuration

Single-User AI Agent Permissions Configuration

AI Agent Tools Permissions Configuration

Human In-the-Loop for AI Agent Actions

Restrict AI Agent Tool Invocation on Untrusted Data

Segmentation of AI Agent Components

Input and Output Validation for AI Agent Components
