Record summary
A quick snapshot of what this page covers.
Risk profile
How this risk is described and categorized.
"Interpretability techniques, by enabling a better understanding of the model, could potentially be used for harmful purposes. For example, mechanistic inter- pretability could be used to identify neurons responsible for specific functions, and certain neurons that encode safety-related features may be modified to de- crease its activation or certain information may be censored [24]. Furthermore, interpretability techniques can be used to simulate a white-box attack scenario. In this case, knowing the internal workings of a model aids in the development of adversarial attacks [24]."
Suggested mitigations
Defenses that may help with related attacks.
Control Access to AI Models and Data at Rest
AI Model Distribution Methods
Encrypt Sensitive Information
Model Hardening
Use Ensemble Methods
Input Restoration
Adversarial Input Detection
Verify AI Artifacts
Generative AI Guardrails
AI Bill of Materials
Use Multi-Modal Sensors
Deepfake Detection
Source
Research source for this risk, when available.
Included resource
Risk Sources and Risk Management Measures in Support of Standards for General-Purpose AI Systems
Original source
MIT AI Risk Repository
Open the public repository used for AI risk records and taxonomy fields.
