Misuse of interpretability techniques

Record summary

A quick snapshot of what this page covers.

Techniques: 11. Attack methods connected to this risk.

Mitigations: 12. Defenses that may help with related attacks.

Domain: 2. Privacy & Security. The broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

"Interpretability techniques, by enabling a better understanding of the model, could potentially be used for harmful purposes. For example, mechanistic interpretability could be used to identify neurons responsible for specific functions, and certain neurons that encode safety-related features may be modified to decrease its activation or certain information may be censored [24]. Furthermore, interpretability techniques can be used to simulate a white-box attack scenario. In this case, knowing the internal workings of a model aids in the development of adversarial attacks [24]."
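To make the mechanism in the quote concrete, the following is a minimal sketch on a toy network, not the method of the cited paper. It zeroes each hidden unit in turn to measure how much the output depends on it, then scales down the most influential unit. The weights, the `forward` and `ablation_effects` helpers, and the sample inputs are all hypothetical and chosen only for illustration.

```python
# Toy network: 2 inputs, 3 hidden ReLU units, 1 output.
# All weights are hypothetical values picked for illustration.
W1 = [[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]  # one row per hidden unit
B1 = [0.0, 0.1, 0.0]
W2 = [0.2, 1.5, -0.3]                        # output weight per hidden unit
B2 = 0.0


def forward(x, scale=None):
    """Run the network; `scale` maps a hidden-unit index to a multiplier."""
    scale = scale or {}
    hidden = []
    for i, (row, bias) in enumerate(zip(W1, B1)):
        pre = sum(w * xi for w, xi in zip(row, x)) + bias
        activation = max(0.0, pre)           # ReLU
        hidden.append(activation * scale.get(i, 1.0))
    return sum(w * h for w, h in zip(W2, hidden)) + B2


def ablation_effects(inputs):
    """Mean absolute change in output when each hidden unit is zeroed."""
    effects = []
    for i in range(len(W1)):
        total = sum(abs(forward(x) - forward(x, {i: 0.0})) for x in inputs)
        effects.append(total / len(inputs))
    return effects


inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 2.0]]
effects = ablation_effects(inputs)

# The unit whose removal changes the output most is the one an analyst
# (or an attacker with white-box access) would single out.
top = max(range(len(effects)), key=effects.__getitem__)

# Dampening that unit shifts the output without retraining the model.
baseline = forward([1.0, 1.0])
dampened = forward([1.0, 1.0], {top: 0.1})
print(top, effects, baseline, dampened)
```

The same ablation loop is a standard diagnostic in interpretability work, which is why the risk is dual-use: the step that locates a feature is identical whether the goal is to understand the unit or to suppress it.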

Domain: 2. Privacy & Security

Subdomain: 2.2 > AI system security vulnerabilities and attacks

Entity: 1 - Human

Intent: 1 - Intentional

Timing: 3 - Other

Category: Model Evaluations (Interpretability/Explainability)

Subcategory: Misuse of interpretability techniques

Related techniques

Attack methods connected to this risk.

Suggested mitigations

Defenses that may help with related attacks.

Source

Research source for this risk, when available.

Included resource

Risk Sources and Risk Management Measures in Support of Standards for General-Purpose AI Systems

Authors: Gipiškis et al.

Year: 2024

Type: Journal Article

DOI: https://doi.org/10.48550/arXiv.2410.23472

URL: https://arxiv.org/abs/2410.23472

Original source

MIT AI Risk Repository

Open the public repository used for AI risk records and taxonomy fields.

Repository: https://airisk.mit.edu/

Suggested mitigations

Control Access to AI Models and Data at Rest

AI Model Distribution Methods

Encrypt Sensitive Information

Model Hardening

Use Ensemble Methods

Input Restoration

Adversarial Input Detection

Verify AI Artifacts

Generative AI Guardrails

AI Bill of Materials

Use Multi-Modal Sensors

Deepfake Detection
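As one illustration of the "Verify AI Artifacts" item, the sketch below checks a model file against a previously recorded SHA-256 digest, so that weights altered after release (for example, with specific neurons modified) no longer match. This is an assumption about how such verification might be done, not a procedure taken from the source. The helper names `sha256_of_file` and `verify_artifact` and the file `weights.bin` are hypothetical.

```python
import hashlib
import hmac
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_hex):
    """True if the file's digest matches the recorded one."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())


# Demonstration with a stand-in file; a real artifact would be a weights file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weights.bin")
    with open(path, "wb") as fh:
        fh.write(b"\x00\x01\x02\x03")

    recorded = sha256_of_file(path)       # digest stored at release time
    ok_before = verify_artifact(path, recorded)

    with open(path, "r+b") as fh:         # simulate tampering with one byte
        fh.write(b"\xff")
    ok_after = verify_artifact(path, recorded)

print(ok_before, ok_after)
```

A digest check only detects modification of a distributed file. It does not prevent someone with legitimate white-box access from analyzing the model, which is what the access-control items in the list address.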
