Generative AI Model Alignment - AI Mitigation

AI Mitigation

Overview

A source-backed snapshot of this defense.

When training or fine-tuning a generative AI model, use techniques that improve the model's alignment with safety, security, and content policies.

Fine-tuning can remove the safety mechanisms built into a generative AI model. Techniques such as Supervised Fine-Tuning, Reinforcement Learning from Human Feedback (RLHF) or from AI Feedback (RLAIF), and Targeted Safety Context Distillation can improve the safety and alignment of the model.
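
As an illustration of one of the named techniques, the following is a minimal sketch of how a targeted safety context distillation dataset might be assembled. It assumes the commonly described form of the technique: generate a response with a safety preprompt prepended, then fine-tune on that response paired with the original prompt (without the preprompt), keeping only the cases where a safety scorer prefers the guided response. The names `SAFETY_PREPROMPT`, `SFTExample`, `build_distillation_set`, `generate`, and `safety_score` are hypothetical and do not come from the source record; `generate` and `safety_score` stand in for a real model and a real safety reward model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical safety preprompt (an assumption for illustration only).
SAFETY_PREPROMPT = "You are a safe and responsible assistant. "


@dataclass
class SFTExample:
    prompt: str    # prompt as the fine-tuned model will see it (no preprompt)
    response: str  # target response for supervised fine-tuning


def build_distillation_set(
    prompts: List[str],
    generate: Callable[[str], str],
    safety_score: Callable[[str, str], float],
) -> List[SFTExample]:
    """Collect (prompt, response) pairs for safety-oriented fine-tuning.

    `generate` maps input text to a model response; `safety_score` maps
    (prompt, response) to a number where higher means safer. Both are
    placeholders supplied by the caller.
    """
    examples: List[SFTExample] = []
    for prompt in prompts:
        plain = generate(prompt)                      # response without guidance
        guided = generate(SAFETY_PREPROMPT + prompt)  # response with preprompt
        # "Targeted": keep the guided response only when the scorer
        # rates it as safer than the unguided one.
        if safety_score(prompt, guided) > safety_score(prompt, plain):
            # The preprompt is dropped from the stored prompt, so the
            # fine-tuned model learns the safer behavior without it.
            examples.append(SFTExample(prompt=prompt, response=guided))
    return examples
```

The resulting examples would then feed an ordinary supervised fine-tuning run. Filtering on the scorer is a design choice that avoids overwriting responses that were already acceptable.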

Techniques: 7. Attacks this defense is designed to help with.

Lifecycle: 3. Where this defense applies in the AI lifecycle.

Categories: 1. How the source groups this defense.

Safeguard details

Where this defense applies and how the source classifies it.

ATLAS ID: AML.M0022
Priority score: 35

Lifecycle: ML Model Engineering, ML Model Evaluation, Deployment

Category: Technical - ML

Covered techniques

Attacks this defense is designed to help with.

7 records (4 of 7 shown).

Source evidence

Original public records and references for this page.

Original source

Original source links

Open the public records and source datasets used for this page.

Repository: https://github.com/mitre-atlas/atlas-data
ATLAS.yaml: https://github.com/mitre-atlas/atlas-data/blob/main/dist/ATLAS.yaml
Schema: https://github.com/mitre-atlas/atlas-data/blob/main/dist/schemas/atlas_output_schema.json

Covered techniques

AML.T0053 - AI Agent Tool Invocation

AML.T0062 - Discover LLM Hallucinations

AML.T0056 - Extract LLM System Prompt

AML.T0057 - LLM Data Leakage
