PromptRiskDB: Threat intelligence atlas
AI Risk

General Evaluations (AI outputs for which evaluation is too difficult for humans)

Record summary

A quick snapshot of what this page covers.

Techniques: 0. Attack methods connected to this risk.
Mitigations: 0. Defenses that may help with related attacks.
Domain: 7. AI System Safety, Failures, & Limitations. The broad risk area this belongs to.

Risk profile

How this risk is described and categorized.

"When AI models are trained through evaluation with human feedback, such as reinforcement learning from human feedback, their outputs can be challenging to assess, as they may contain hard-to-detect errors or issues that only become apparent over time. The human evaluator can rate incorrect outputs positively or similar to correct outputs. This can lead to the model learning to produce subtly incorrect or harmful outputs, such as code with software vulnerabilities, or politically biased information. In extreme cases where a model is deceiving users, complicated outputs can contain hidden errors or backdoors."
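
The mechanism described above can be illustrated with a toy simulation. This is a minimal sketch and not part of the source record: the names `rate_output`, `mean_reward_gap`, and `detection_rate` are hypothetical, and the rater model (a score of 1.0 for anything that looks correct, 0.0 when a flaw is noticed) is an assumption chosen for simplicity. It shows how the average rating gap between correct and subtly incorrect outputs shrinks as the human rater detects fewer flaws, which leaves a feedback-trained model with little signal to prefer the correct output.

```python
import random


def rate_output(is_correct: bool, detection_rate: float, rng: random.Random) -> float:
    """Hypothetical human rater: 1.0 if the output looks correct, else 0.0.

    Assumption: a subtly incorrect output looks correct unless the rater
    happens to notice the flaw, which occurs with probability detection_rate.
    """
    if is_correct:
        return 1.0
    flaw_detected = rng.random() < detection_rate
    return 0.0 if flaw_detected else 1.0


def mean_reward_gap(detection_rate: float, n_ratings: int = 10_000, seed: int = 0) -> float:
    """Average rating of correct outputs minus average rating of flawed ones."""
    rng = random.Random(seed)
    correct_total = sum(rate_output(True, detection_rate, rng) for _ in range(n_ratings))
    flawed_total = sum(rate_output(False, detection_rate, rng) for _ in range(n_ratings))
    return (correct_total - flawed_total) / n_ratings


if __name__ == "__main__":
    # A gap near zero means the ratings barely distinguish correct from flawed outputs.
    for rate in (1.0, 0.5, 0.1, 0.0):
        print(f"detection_rate={rate:.1f}  mean reward gap={mean_reward_gap(rate):.3f}")
```

Under this assumed rater model the expected gap equals the detection rate, so a rater who never notices the flaw rewards correct and subtly incorrect outputs identically.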

Domain: 7. AI System Safety, Failures, & Limitations
Subdomain: 7.3 > Lack of capability or robustness
Entity: 2 - AI
Intent: 2 - Unintentional
Timing: 2 - Post-deployment
Category: Model Evaluations
Subcategory: General Evaluations (AI outputs for which evaluation is too difficult for humans)

Suggested mitigations

Defenses that may help with related attacks.

No propagated mitigations. No defense is available through the connected attack methods.

Source

Research source for this risk, when available.