Exfiltration via AI Agent Tool Invocation - AI Security Technique

Overview

A source-backed snapshot of this AI security technique.

AI agent tools capable of performing write operations may be invoked to exfiltrate data to an adversary. Sensitive information can be encoded into the tool's input parameters and transmitted to an adversary-controlled location (such as an inbox, document, or server) as part of a seemingly legitimate action. Variants include sending emails, creating or modifying documents, updating CRM records, or even generating media such as images or videos.

The invoked tool itself may be legitimate but invoked by an adversary via LLM Prompt Injection, or the tool may be malicious (see AI Agent Tool Poisoning).

AI Agent Tool Poisoning can also be used to manipulate the inputs and destination of a separate legitimate tool, invoked through normal usage by the victim.

Tactics (1): Attacker goals connected to this method.

Mitigations (8): Defenses that may help against this attack.

AI risks (26): Research-backed risks connected to this topic.

Technique details

Identifiers, maturity, and source taxonomy for this technique.

ATLAS ID: AML.T0086
Maturity: realized
Priority score: 234

ATLAS tactics

Exfiltration

Attack flow

How to read the public records connected to this technique.

1. Technique: Read the ATLAS description and evidence level.

2. Tactics: See which attacker goals this method supports.

3. Examples: Check whether public case studies mention it.

4. Defenses: Review safeguards mapped by ATLAS.

5. Sources: Open the original public records and references.

Impact

Why this technique may deserve attention in the current dataset.

Evidence level: realized
Mapped defenses: 8 ATLAS mitigation records
Public examples: 5 linked case study records
Research risks: 26 related MIT AI Risk records above the confidence threshold
Vulnerabilities: 0 linked CVE records

Mitigations

Defenses that may help against this attack.

8 records

AML.M0028 - AI Agent Tools Permissions Configuration

Configuring AI Agent tools with access controls inherited from the user or the AI Agent invoking the tool can limit an adversary's capabilities within a system, including their ability to abuse tool invocations and exfiltrate sensitive data.

Lifecycle: Deployment
Category: Technical - Cyber
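As a minimal sketch of this mitigation, the check below refuses a tool call unless the invoking user already holds the permission the tool needs, so the agent never acts with more access than the user. The scope strings, the tool names, and the `UserContext` type are hypothetical and are not part of ATLAS or any specific agent framework.

```python
from dataclasses import dataclass

# Hypothetical mapping from tool name to the scope it requires.
TOOL_SCOPES = {
    "read_crm_record": "crm:read",
    "update_crm_record": "crm:write",
    "send_email": "email:send",
}


class ToolPermissionError(Exception):
    """Raised when the invoking user lacks the scope a tool requires."""


@dataclass(frozen=True)
class UserContext:
    user_id: str
    scopes: frozenset  # permissions granted to the human user


def authorize_tool_call(user: UserContext, tool_name: str) -> None:
    """Allow the call only if the user's own scopes cover the tool."""
    required = TOOL_SCOPES.get(tool_name)
    if required is None:
        # Unknown tools are denied by default.
        raise ToolPermissionError(f"unknown tool: {tool_name}")
    if required not in user.scopes:
        raise ToolPermissionError(
            f"user {user.user_id} lacks scope {required} for {tool_name}"
        )
```

Under these assumptions, a user holding only `crm:read` could let the agent read records, but an injected instruction to call `send_email` would be rejected before the tool runs.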

AML.M0024 - AI Telemetry Logging

Log AI agent tool invocations to detect malicious calls.

Lifecycle: Deployment, Monitoring
Category: Technical - Cyber
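One way such logging might look is sketched below: each tool invocation is written as a structured record that keeps destination-like parameters in the clear and stores only a digest and size for the full argument payload. The field names, the logger name, and the `DESTINATION_KEYS` list are assumptions chosen for illustration.

```python
import hashlib
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("agent.tool_calls")  # hypothetical logger name

# Hypothetical parameter names that indicate where data is being sent.
DESTINATION_KEYS = ("to", "cc", "bcc", "url", "document_id")


def build_tool_call_record(agent_id: str, tool_name: str, arguments: dict) -> dict:
    """Build a log record without copying the full payload into the log."""
    serialized = json.dumps(arguments, sort_keys=True, default=str).encode("utf-8")
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent_id": agent_id,
        "tool": tool_name,
        "destinations": {k: arguments[k] for k in DESTINATION_KEYS if k in arguments},
        "args_sha256": hashlib.sha256(serialized).hexdigest(),
        "args_bytes": len(serialized),
    }


def log_tool_call(agent_id: str, tool_name: str, arguments: dict) -> dict:
    """Emit the record as one JSON line and return it to the caller."""
    record = build_tool_call_record(agent_id, tool_name, arguments)
    logger.info(json.dumps(record))
    return record
```

Keeping destinations and payload size in each record is a design choice: it lets a reviewer spot calls to unfamiliar recipients or unusually large payloads without storing sensitive content in the log itself.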

AML.M0029 - Human In-the-Loop for AI Agent Actions

Requiring user confirmation of AI agent tool invocations can prevent the automatic execution of tools by an adversary.

Lifecycle: Deployment
Category: Technical - ML
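A confirmation gate of this kind can be sketched as a wrapper that asks a caller-supplied callback before running any write-capable tool. The `WRITE_TOOLS` set and the function names are hypothetical; a real deployment would route `confirm` to whatever approval UI it has.

```python
from typing import Any, Callable, Dict

# Hypothetical set of tools that can write or send data.
WRITE_TOOLS = {"send_email", "update_crm_record", "create_document"}


class ActionDenied(Exception):
    """Raised when the user declines a proposed tool invocation."""


def invoke_with_confirmation(
    tool_name: str,
    arguments: Dict[str, Any],
    tool_fn: Callable[..., Any],
    confirm: Callable[[str], bool],
) -> Any:
    """Run tool_fn, but require user approval first for write-capable tools."""
    if tool_name in WRITE_TOOLS:
        # Show the user exactly what will be sent and where.
        summary = f"{tool_name} with arguments {arguments!r}"
        if not confirm(summary):
            raise ActionDenied(f"user declined: {tool_name}")
    return tool_fn(**arguments)
```

Showing the full arguments in the confirmation prompt matters here, because exfiltration typically hides in the parameters of an otherwise plausible action.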

AML.M0033 - Input and Output Validation for AI Agent Components

Validation can prevent adversaries from utilizing tools in an agentic workflow to compromise sensitive data sources.

Lifecycle: Business and Data Understanding, Data Preparation, +1 more
Category: Technical - ML
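As one illustrative form of input validation, the sketch below checks the recipient fields of an email tool call against an allowlist of domains before the call is executed. The field names (`to`, `cc`, `bcc`) and the allowlist approach are assumptions; ATLAS does not prescribe a specific validation scheme.

```python
from email.utils import parseaddr
from typing import Iterable, List


def _as_list(value) -> list:
    """Normalize a missing, single, or multiple recipient value to a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def find_disallowed_recipients(arguments: dict, allowed_domains: Iterable[str]) -> List[str]:
    """Return 'field: address' entries whose domain is not on the allowlist."""
    allowed = {d.lower() for d in allowed_domains}
    violations = []
    for field in ("to", "cc", "bcc"):
        for raw in _as_list(arguments.get(field)):
            address = parseaddr(raw)[1]  # strips any display name
            domain = address.rsplit("@", 1)[-1].lower()
            # Unparseable addresses are treated as violations.
            if "@" not in address or domain not in allowed:
                violations.append(f"{field}: {raw}")
    return violations
```

An empty result means the call may proceed; any entry in the list is a reason to block the call or escalate it for review.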

Showing 4 of 8

Case studies

Examples from public reports and exercises.

5 records

Poisoned Postmark MCP Server Email Exfiltration

A bad actor successfully exfiltrated emails from users of Postmark’s MCP server via a supply chain attack. Postmark is an email delivery service that allows organizations to send marketing and transactional emails via API. The Postmark MCP server allows users to interact with Postmark via AI agents.

The bad actor impersonated Postmark by registering the postmark-mcp package name on npm. They initially published legitimate versions of the MCP server. After the package became popular and reached over 1,000 downloads per week, the bad actor performed a rugpull and uploaded a malicious version of the package. The malicious version added the bad actor’s email address to the BCC line of all emails sent by the MCP tool. Users who upgraded to this version and continued to use the tool would have all emails exfiltrated to the bad actor.

Date: 2025-09-01

Type: incident
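The BCC manipulation described in this case suggests a simple egress check: compare the recipients the agent asked for with the recipients on the message the tool is about to send. The sketch below is a hypothetical illustration of that idea, not a description of how this incident was detected; the dictionary layout is an assumption.

```python
from typing import Dict, List


def added_recipients(
    requested: Dict[str, List[str]], outgoing: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Return recipients present on the outgoing message but never requested."""
    added = {}
    for field in ("to", "cc", "bcc"):
        before = {a.lower() for a in requested.get(field, [])}
        after = {a.lower() for a in outgoing.get(field, [])}
        extra = sorted(after - before)
        if extra:
            added[field] = extra
    return added
```

A non-empty result indicates that something between the agent's request and the outgoing message introduced a recipient, which is the pattern the malicious package version relied on.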

Data Exfiltration via an MCP Server used by Cursor

The Backslash Security Research Team demonstrated that a Model Context Protocol (MCP) tool can be used as a vector for an indirect prompt injection attack on Cursor, potentially leading to the execution of malicious shell commands.

The Backslash Security Research Team created a proof-of-concept MCP server capable of scraping webpages. When a user asks Cursor to use the tool to scrape a site containing a malicious prompt, the prompt is injected into Cursor’s context. The prompt instructs Cursor to execute a shell command to exfiltrate the victim’s AI agent configuration files containing credentials. Cursor does prompt the user before executing the malicious command, potentially mitigating the attack.

Date: 2025-06-24

Type: exercise

Living Off AI: Prompt Injection via Jira Service Management

Researchers from Cato Networks demonstrated how adversaries can exploit AI-powered systems embedded in enterprise workflows to execute malicious actions with elevated privileges. This is achieved by crafting malicious inputs from external users such as support tickets that are later processed by internal users or automated systems using AI agents. These AI agents, operating with internal context and trust, may interpret and execute the malicious instructions, leading to unauthorized actions such as data exfiltration, privilege escalation, or system manipulation.

Date: 2025-06-19

Type: exercise

Data Exfiltration via Agent Tools in Copilot Studio

Researchers from Zenity demonstrated how an organization’s data can be exfiltrated via prompt injections that target an AI-powered customer service agent.

The target system is a customer service agent built by Zenity in Copilot Studio. It is modeled after an agent built by McKinsey to streamline its customer service needs. The AI agent listens to a customer service email inbox where customers send their engagement requests. Upon receiving a request, the agent looks at the customer’s previous engagements, understands who the best consultant for the case is, and proceeds to send an email to the respective consultant regarding the request, including all of the relevant context the consultant will need to properly engage with the customer.

The Zenity researchers begin by identifying an email inbox that is managed by an AI agent. Then they use prompt injections to discover details about the AI agent, such as its knowledge sources and tools. Once they understand the AI agent’s capabilities, the researchers are able to craft a prompt that retrieves private customer data from the organization’s RAG database and CRM and exfiltrates it via the AI agent’s email tool.

Vendor Response: Microsoft quickly acknowledged and fixed the issue. The prompts used by the Zenity researchers in this exercise no longer work; however, other prompts may still be effective.

Date: 2025-06-01

Type: exercise

Showing 4 of 5

Related risks

Research-backed risks connected to this topic.

Top 10 of 26

Adversarial AI: Prompt Injections

"Prompt injections represent another class of attacks that involve the malicious insertion of prompts or requests in LLM-based interactive systems, leading to unintended actions or disclosure of sensitive information...

Domain: 2. Privacy & Security
Subdomain: 2.2 > AI system security vulnerabilities and attacks

Confidence: 0.75

Adversarial AI (General)

"Adversarial AI refers to a class of attacks that exploit vulnerabilities in machine-learning (ML) models. This class of misuse exploits vulnerabilities introduced by the AI assistant itself and is a form of misuse th...

Domain: 2. Privacy & Security
Subdomain: 2.2 > AI system security vulnerabilities and attacks

Confidence: 0.74

Poisoning

"Data Poisoning involves deliberately corrupting a model’s training dataset to introduce vulnerabilities, derail its learning process, or cause it to make incorrect predictions (Carlini et al., 2023). For example, the...

Domain: 2. Privacy & Security
Subdomain: 2.2 > AI system security vulnerabilities and attacks

Confidence: 0.74

Prompt injection attack

"A prompt injection attack forces a generative model that takes a prompt as input to produce unexpected output by manipulating the structure, instructions, or information contained in its prompt."

Domain: 2. Privacy & Security
Subdomain: 2.2 > AI system security vulnerabilities and attacks

Confidence: 0.73

Showing 4 of 10