Security·May 29, 2026·11 min read

AI incident response — when the breach is an agent, not a human

Most IR playbooks were written for human attackers operating manual tools. When the actor is an agent acting on injected instructions, the playbook needs to change.

A real scenario: a customer-support ticket arrives at a company. Hidden inside the complaint is a sentence written to manipulate the company's AI agent. The agent reads the ticket, follows the hidden instruction, and emails customer records to an external address. By the time anyone notices, the data is gone. The attacker never logged in, never used malware, never exploited a bug. They wrote a paragraph. Your incident response plan probably does not cover this.

Why you should read this

Most incident response (IR) plans were written for a human attacker — somebody trying to break in, steal credentials, move laterally. AI incidents look nothing like that. The attacker is text in a document. Your agent is the one doing the damage. And every standard IR step — isolate the host, contain the user, revoke the credential — does not map cleanly to an agent that is just doing what it was told.

If you run a SOC: this is the gap in your playbook. If you build with agents: this is what your security team will ask you for after the first incident. If you set strategy: this is why "we have IR" is no longer the same answer it was three years ago.

The attacker is not who you think

In an AI incident, the attacker did nothing technically wrong. They put a sentence into a document, a ticket, a wiki, or a webpage. Your agent read it. Your agent followed it.

Traditional IR opens with "contain the threat actor." In an AI incident, the threat actor is a paragraph somewhere. You cannot contain it. You can only contain the next agent that reads it. The right question is not "who attacked us?" It is "which inputs has our agent consumed, and which agents have consumed similar inputs?"

The first 60 minutes

Three blocks, in order.

Minute 0 to 15 — stop the bleed. Pause the agent. Not just the session that triggered the alert: every session running the same agent configuration, even the ones that look fine. An agent that has been manipulated may have spread the manipulation across sessions.

Minute 15 to 30 — capture everything. Export the tool-call log, the prompt history, the tool descriptions the agent saw, the system prompt at the time, and the model version. You cannot reconstruct what happened later without all of these. Most teams capture only the first two.

Minute 30 to 60 — find the input. The agent did something because it read something. Trace backward: which document did the agent retrieve? Did it contain instruction-like content? Which other sessions retrieved the same document? Now you know the full set of sessions to investigate.

Reconstructing what the agent saw

A human incident timeline looks like: user logged in, accessed file, exfiltrated data. A clear sequence.

An agent timeline looks like: agent retrieved document A, decided to call tool B with arguments shaped by A, tool B returned data C, agent decided to call tool D with arguments shaped by C, and so on for the rest of the session.

To reconstruct that chain, you need three things most teams do not have. Full retrieval logs (which queries went out, which chunks came back, where they came from). Tool argument logs (structured enough to search for the moment a sensitive tool was called). Tool description history (because the descriptions themselves can be the attack vector, if a server changed what it advertised mid-session).

Without those three logs, your incident timeline is a guess.

Containing an agent

A human attacker has one identity, a finite number of sessions, a known credential set. You can contain them.

An agent has many sessions, many input sources, and a context window shaped by every document it has read. Containing it has three layers.

Pause the runtime — stop new sessions, halt tool calls in active sessions. This is the equivalent of pulling a host off the network.

Quarantine the inputs — every document, ticket, email, or page the agent consumed in the affected window. Remove them from retrieval indexes. Even if they look harmless, leaving them in is how re-infection happens.

Reset agent state — clear the conversation memory of all sessions, including ones that look clean. The attacker may have written instructions like "do not act yet — wait until X happens." Skipping this step is how teams get re-breached four weeks later.

Do not pull the plug

If you treat the agent like a compromised host and kill the process, you lose evidence you cannot recover.

Specifically: the in-memory conversation state for active sessions (most agents do not persist this to disk), the working context the model actually saw, and any decisions the model considered but did not act on (these are signals about how close the attacker got).

The right move is to pause and snapshot, not kill. Modern agent platforms have a pause-and-export capability built in. If yours does not, that is a gap to close before the next incident.

What to tell regulators and customers

Regulators have started asking AI-specific questions that older breach-notification templates do not cover. Three keep coming up.

Was the exfiltrated data in the model's training set? The answer is usually no, but you should be able to prove it.

What other systems read the same inputs as the compromised agent? Regulators ask this because a poisoned document can affect every system that ingested it, not just the one that triggered an alert.

How did you verify the agent's other sessions were not also compromised? "We checked the logs" is no longer enough when the threat model assumes patient, multi-session attacks.

Draft your disclosure language now, when nothing is on fire. Run it through legal once a year. The first time you write it during an incident, it will read like it was written during an incident.

Why patching the prompt is not the answer

After most incidents, the post-mortem identifies a missing control or unpatched bug. After an AI incident, the natural reach is "update the system prompt" or "fine-tune the model to refuse instructions like that."

Both help. Neither is enough. Every new prompt injection technique evades the last fix. Fine-tuning makes the model refuse one variant and stay vulnerable to ten others. The root cause is architectural: as long as untrusted text re-enters the model's context, the next attack will work.

The durable fixes are not glamorous. Filter what comes back from tools before the model sees it. Authorize every tool call as the user. Allow-list which tools each agent can use. Monitor egress at the network. These survive new prompt-injection variants.

Three tabletops to run this quarter

Every mature SOC runs tabletops for ransomware, business email compromise, and insider threat. Few run them for AI incidents — partly because the threat model is new, partly because the team is still learning it.

Three scenarios worth practicing.

The customer-support agent: a ticket contains hidden instructions, the agent reads it, the agent emails customer data to the attacker. Walk through detection, containment, communication, recovery.

The internal knowledge-base agent: somebody edits an internal wiki with hidden instructions. The agent reads the wiki later, for a different question, and acts on the instructions. Test whether your retrieval logs can trace which session was affected.

The third-party tool server: a server your agent uses gets compromised. The server changes its tool descriptions to redirect calls to the attacker. Test whether you can detect the change and find every affected session.

The goal is not a new playbook for every scenario. The goal is to find the three or four gaps in your existing playbook before a real incident finds them for you.