APromptRiskDBThreat intelligence atlas
AI Mitigation

Generative AI Model Alignment - AI Mitigation

When training or fine-tuning a generative AI model it is important to utilize techniques that improve model alignment with safety, security, and content policies. The fine-tuning process can potentially remove built-in safety mechanisms in a generative AI model, but utilizing techniques such as Supervised Fine-Tuning, Reinforcement Learning from Human Feedback or AI Feedback, and Targeted Safety Context Distillati...

AI MitigationML Model EngineeringML Model EvaluationDeploymentTechnical - ML

Record summary

A quick snapshot of what this page covers.

Techniques7Attacks this defense is designed to help with.
Lifecycle3Where this defense applies in the AI lifecycle.
Categories1How the source groups this defense.

Control summary

What this defense is meant to help prevent.

When training or fine-tuning a generative AI model it is important to utilize techniques that improve model alignment with safety, security, and content policies.

The fine-tuning process can potentially remove built-in safety mechanisms in a generative AI model, but utilizing techniques such as Supervised Fine-Tuning, Reinforcement Learning from Human Feedback or AI Feedback, and Targeted Safety Context Distillation can improve the safety and alignment of the model.

ATLAS ID
AML.M0022
Priority score
35
ML Model EngineeringML Model EvaluationDeployment
Technical - ML

Covered techniques

Attacks this defense is designed to help with.

AML.T0057 - LLM Data Leakage

demonstrated

Model alignment can improve the parametric safety of a model by guiding it away from unsafe prompts and responses.

AML.T0054 - LLM Jailbreak

demonstrated

Model alignment can improve the parametric safety of a model by guiding it away from unsafe prompts and responses.

Source

Where this page information comes from.