PromptRiskDBThreat intelligence atlas
AI Risk

Misuse of interpretability techniques

"Interpretability techniques, by enabling a better understanding of the model, could potentially be used for harmful purposes. For example, mechanistic inter- pretability could be used to identify neurons responsible for specific functions, and certain neurons that encode safety-related features may be modified to de- crease its activation or certain information may be censored [24]. Furthermore, interpretability...

AI Risk2. Privacy & Security2.2 > AI system security vulnerabilities and attacks3 - Other

Record summary

A quick snapshot of what this page covers.

Techniques11Attack methods connected to this risk.
Mitigations12Defenses that may help with related attacks.
Domain2. Privacy & SecurityThe broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

"Interpretability techniques, by enabling a better understanding of the model, could potentially be used for harmful purposes. For example, mechanistic inter- pretability could be used to identify neurons responsible for specific functions, and certain neurons that encode safety-related features may be modified to de- crease its activation or certain information may be censored [24]. Furthermore, interpretability techniques can be used to simulate a white-box attack scenario. In this case, knowing the internal workings of a model aids in the development of adversarial attacks [24]."

Domain2. Privacy & Security
Subdomain2.2 > AI system security vulnerabilities and attacks
Entity1 - Human
Intent1 - Intentional
Timing3 - Other
CategoryModel Evaluations (Interpretability/Explainability)
SubcategoryMisuse of interpretability techniques

Suggested mitigations

Defenses that may help with related attacks.

Model Hardening

Data PreparationML Model Engineering
LifecycleData Preparation + 1 moreCategoryTechnical - ML

Input Restoration

Data PreparationML Model Evaluation+2 more
LifecycleData Preparation + 3 moreCategoryTechnical - ML

Verify AI Artifacts

Business and Data UnderstandingData Preparation+1 more
LifecycleBusiness and Data Understanding + 2 moreCategoryTechnical - Cyber

Generative AI Guardrails

ML Model EngineeringML Model Evaluation+1 more
LifecycleML Model Engineering + 2 moreCategoryTechnical - ML

AI Bill of Materials

Business and Data UnderstandingData Preparation+1 more
LifecycleBusiness and Data Understanding + 2 moreCategoryPolicy

Use Multi-Modal Sensors

Business and Data UnderstandingData Preparation+1 more
LifecycleBusiness and Data Understanding + 2 moreCategoryTechnical - Cyber

Deepfake Detection

DeploymentMonitoring and Maintenance+2 more
LifecycleDeployment + 3 moreCategoryTechnical - ML

Source

Research source for this risk, when available.