Strategic underperformance on model evaluations

Record summary

A quick snapshot of what this page covers.

Techniques: 2. Attack methods connected to this risk.

Mitigations: 4. Defenses that may help with related attacks.

Domain: 7. AI System Safety, Failures, & Limitations. The broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

"GPAI developers often run evaluations of dual-use capabilities to decide whether it is safe to deploy. In some cases, these evaluations may fail to elicit these capabilities, either due to benign reasons or strategic action - by either the developers, malicious actors, or arise unintentionally in the model during training [84, 97]. A GPAI model may strategically underperform or limit its performance during capability evaluations in order to be classified as safe for deployment. This underperformance could prevent the model from being identified as potentially dual use."

Domain: 7. AI System Safety, Failures, & Limitations

Subdomain: 7.1 > AI pursuing its own goals in conflict with human goals or values

Entity: 2 - AI

Intent: 1 - Intentional

Timing: 1 - Pre-deployment

Category: Agency (Situational Awareness)

Subcategory: Strategic underperformance on model evaluations

Related techniques

Attack methods connected to this risk.

AML.T0008.002 - Domains

demonstrated

Method: text_similarity_sqlite

Confidence: 54%

AML.T0010.002 - Data

realized

Method: text_similarity_sqlite

Confidence: 53%

Suggested mitigations

Defenses that may help with related attacks.

Source

Research source for this risk, when available.

Included resource

Risk Sources and Risk Management Measures in Support of Standards for General-Purpose AI Systems

Authors: Gipiškis et al.

Year: 2024

Type: Journal Article

DOI: https://doi.org/10.48550/arXiv.2410.23472

URL: https://arxiv.org/abs/2410.23472

Original source

MIT AI Risk Repository

Open the public repository used for AI risk records and taxonomy fields.

Repository: https://airisk.mit.edu/

Suggested mitigations

Control Access to AI Models and Data at Rest

Sanitize Training Data

Verify AI Artifacts

Maintain AI Dataset Provenance
