AI Security Architecture

Securing AI Agents Against Prompt Injection, Tool Abuse, and Excess Privilege

An AI agent is not dangerous because it can write convincing text. It becomes dangerous when untrusted text can influence actions: API calls, file edits, database reads, ticket updates, shell commands, or messages sent to other systems. Agent security starts by separating what the model may read from what the system may execute.

Published Jun 1, 2026 | 14 min read | AI Security

Prompt injection is an execution problem

Prompt injection is often described as a text problem: a malicious instruction is hidden inside a user message, webpage, email, document, repository issue, or RAG chunk. But in agentic systems, the real risk is execution. The attack matters because the model can turn manipulated context into a tool call.

The OWASP prompt injection page frames prompt injection as a vulnerability that targets LLM behavior. For backend engineers, the next step is to ask which system boundary receives the model's output. If the answer is "a tool with permissions," then prompt injection is no longer just a model-behavior issue. It is an authorization and workflow-control issue.

A prompt injection attack succeeds when untrusted context changes a privileged action path. The defense is not a better prompt alone; it is a better control plane.

Why agent systems are harder to defend

NIST's 2026 work on AI agent security calls out risks from agents interacting with adversarial data, including indirect prompt injection. In a red-team writeup, NIST described agent hijacking as a growing risk when agents process external sources such as emails, websites, and code repositories. That is exactly where real engineering agents spend their time.

Classic application security assumes code controls execution. Agentic systems add a probabilistic planner between the user and the action. That planner may be useful, but it must not become the enforcement layer. Enforcement has to happen in deterministic code: policy checks, schemas, allowlists, approval gates, sandboxing, and logging.

Attack path

1. Untrusted content: Email, issue, webpage, PDF, tool result, or RAG chunk.

2. Injected instruction: Hidden text tells the agent to ignore policy or leak data.

3. Agent planner: The model mixes trusted task instructions with hostile context.

4. Tool execution: A privileged API, shell, browser, repo, or database action runs.

5. Impact: Data exfiltration, unauthorized change, spam, fraud, or persistence.

Separate trusted instructions from untrusted content

The first design rule is context quarantine. System instructions, developer policies, user intent, retrieved content, tool outputs, and external documents are not equal. They should be represented as different classes of data and passed through different control paths.

A safe architecture treats untrusted content as evidence, not authority. The agent can summarize it, cite it, and use it to propose a plan. That content must never be able to redefine policy, grant permissions, choose credentials, or override approval requirements.

Context quarantine: Label content by trust level and prevent external text from changing system policy.

Tool gateway: Route every action through schemas, allowlists, risk scoring, and policy checks.

Human approval: Require explicit confirmation for destructive, external, or high-risk actions.
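Context quarantine can be sketched in a few lines of Python. This is a minimal illustration, not a complete defense: the `Trust` levels, `ContextItem`, `build_prompt`, and `effective_policy` are hypothetical names invented for this example. The point it shows is that trust is a typed label attached by the system, and that policy is computed in code from trusted items only, so no retrieved text can add to it.

```python
import html
from dataclasses import dataclass
from enum import Enum


class Trust(Enum):
    SYSTEM = "system"        # operator-authored policy
    USER = "user"            # the authenticated user's request
    UNTRUSTED = "untrusted"  # web pages, emails, RAG chunks, tool output


@dataclass(frozen=True)
class ContextItem:
    trust: Trust
    source: str
    text: str


def build_prompt(items: list[ContextItem]) -> str:
    """Render context so that untrusted text is fenced and marked as data."""
    parts = []
    for item in items:
        if item.trust is Trust.UNTRUSTED:
            # Escape angle brackets so the content cannot close its own fence.
            body = html.escape(item.text)
            parts.append(f"<untrusted source={item.source!r}>\n{body}\n</untrusted>")
        else:
            parts.append(f"[{item.trust.value}] {item.text}")
    return "\n\n".join(parts)


def effective_policy(items: list[ContextItem]) -> list[str]:
    """Only SYSTEM items contribute policy; all other items are ignored."""
    return [item.text for item in items if item.trust is Trust.SYSTEM]
```

Fencing in the prompt is only a hint to the model and can still be ignored by it. The part that carries weight is `effective_policy`, because it runs outside the model and never reads untrusted items.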

Tool abuse is broader than prompt injection

Prompt injection is one path to tool abuse, but not the only one. An agent can also misuse tools because of ambiguous goals, overbroad permissions, bad retrieval, weak schemas, missing validation, or a model that optimizes for task completion without respecting operational boundaries.

The OWASP MCP Top 10 highlights risks that are directly relevant to agent tooling, including command injection and prompt injection through contextual payloads. This matters because MCP servers and tool adapters often sit close to powerful systems: file systems, browsers, databases, CI pipelines, cloud APIs, and internal admin panels.

Use allowlists instead of open-ended tools

Agents should not receive a generic "run command" or "call API" capability unless the environment is strongly sandboxed and the blast radius is intentionally small. Production tools should be narrow: create draft issue, read deployment status, fetch invoice by ID, summarize logs for service X, or open a pull request without merge rights.

This is the difference between exposing a kitchen and exposing a button. The agent does not need the whole kitchen for every task. It needs one safe operation with typed inputs, policy checks, and predictable output.
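A narrow tool of this kind might look like the following sketch. The tool name `fetch_invoice`, the `INV-` ID format, and the `TOOLS` registry are assumptions made up for illustration, and the handler body is a stub. The sketch shows the shape: a fixed allowlist of tools, each with typed and validated inputs, and no path to anything outside the registry.

```python
import re
from dataclasses import dataclass
from typing import Callable

INVOICE_ID = re.compile(r"INV-\d{6}")  # hypothetical ID format


@dataclass(frozen=True)
class FetchInvoiceArgs:
    invoice_id: str

    def __post_init__(self) -> None:
        if not isinstance(self.invoice_id, str) or not INVOICE_ID.fullmatch(self.invoice_id):
            raise ValueError("invoice_id must look like INV-000123")


def fetch_invoice(args: FetchInvoiceArgs) -> dict:
    # Stub: a real tool would query the billing system here.
    return {"invoice_id": args.invoice_id, "status": "stub"}


# Allowlist: tool name -> (argument type, handler). Nothing else is callable.
TOOLS: dict[str, tuple[type, Callable]] = {
    "fetch_invoice": (FetchInvoiceArgs, fetch_invoice),
}


def call_tool(name: str, raw_args: dict) -> dict:
    """Dispatch a model-proposed call, rejecting anything off the allowlist."""
    if name not in TOOLS:
        raise PermissionError(f"tool {name!r} is not on the allowlist")
    arg_type, handler = TOOLS[name]
    try:
        args = arg_type(**raw_args)  # unknown or missing fields raise TypeError
    except TypeError as exc:
        raise ValueError(f"bad arguments for {name!r}: {exc}") from exc
    return handler(args)
```

With this shape, a proposed call to `run_command` fails before any code runs, and a call to `fetch_invoice` with a malformed ID or an extra field fails at validation rather than inside the billing system.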

Threat	Example signal	Backend control
Indirect prompt injection	External document tells the agent to ignore prior instructions.	Context quarantine, prompt-injection scanning, and no policy changes from retrieved content.
Excessive agency	Agent can send emails, edit files, and deploy with one broad token.	Task-shaped scopes, tool allowlists, and approval gates.
Tool output manipulation	Tool result contains instructions that steer the next action.	Strict output schemas, escaping, trust labels, and secondary validation.
Data exfiltration	Agent attempts to send secrets to an external URL or message.	DLP checks, egress controls, redaction, and destination allowlists.
Policy bypass	Agent calls a direct API path instead of the approved gateway.	Network segmentation, centralized credentials, and deny-by-default API access.
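As one example of the controls in the table, the data exfiltration row could be approximated with a destination allowlist plus a secret scan on outbound bodies. This is a rough sketch: the host names are placeholders, and the two regular expressions are illustrative patterns, far from a complete DLP rule set.

```python
import re
from urllib.parse import urlsplit

# Hypothetical destinations this agent is allowed to reach.
ALLOWED_HOSTS = {"api.internal.example", "tickets.example.com"}

# Illustrative patterns only; real DLP needs a much broader rule set.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key ID shape
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
]


def destination_allowed(url: str) -> bool:
    """Allow only HTTPS requests to an exact host on the allowlist."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS


def redact(text: str) -> tuple[str, int]:
    """Replace secret-looking substrings; return the text and the hit count."""
    hits = 0
    for pattern in SECRET_PATTERNS:
        text, count = pattern.subn("[REDACTED]", text)
        hits += count
    return text, hits


def check_egress(url: str, body: str) -> str:
    """Raise PermissionError unless the destination and body both pass."""
    if not destination_allowed(url):
        raise PermissionError(f"destination not allowed: {url!r}")
    _, hits = redact(body)
    if hits:
        raise PermissionError("outbound body appears to contain secrets")
    return body
```

Matching on the parsed `hostname` rather than on a substring of the URL matters here: a URL such as `https://tickets.example.com@evil.test/` parses to the host `evil.test` and is rejected.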

What I would build

I would build an agent security gateway in front of every privileged tool. The gateway would accept proposed tool calls, validate the input schema, classify the risk, check policy, enforce destination allowlists, require approval when needed, execute the action with a scoped credential, then write an immutable audit event.

For a small but realistic stack, I would use Cloudflare Workers for the gateway, FastAPI for higher-risk internal tools, Supabase/Postgres for policies and audit events, and a queue for human approval workflows. Each tool would have a schema, risk level, allowed resources, allowed destinations, and a maximum autonomy level.
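The gateway flow can be sketched independently of the stack. The version below is an in-memory Python illustration under several assumptions: `ToolSpec`, `Gateway`, and the `Risk` levels are hypothetical names, the schema check is reduced to a field-name comparison, the audit log is a plain list standing in for an append-only store, and scoped credentials are left out entirely.

```python
import json
import time
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Callable


class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class ToolSpec:
    required_fields: frozenset[str]
    risk: Risk
    allowed_resources: frozenset[str]
    handler: Callable[[dict], dict]


@dataclass
class Gateway:
    tools: dict[str, ToolSpec]
    max_autonomous_risk: Risk = Risk.LOW
    audit_log: list[str] = field(default_factory=list)  # stand-in for an append-only store

    def _audit(self, tool: str, decision: str, reason: str) -> None:
        event = {"ts": time.time(), "tool": tool, "decision": decision, "reason": reason}
        self.audit_log.append(json.dumps(event))

    def _deny(self, tool: str, reason: str) -> dict:
        self._audit(tool, "denied", reason)
        return {"status": "denied", "reason": reason}

    def submit(self, tool: str, args: dict, approved: bool = False) -> dict:
        """Evaluate one proposed call. `approved` must be set by the approval
        workflow, never by the model."""
        spec = self.tools.get(tool)
        if spec is None:
            return self._deny(tool, "unknown tool")
        if set(args) != spec.required_fields:
            return self._deny(tool, "schema mismatch")
        if args.get("resource") not in spec.allowed_resources:
            return self._deny(tool, "resource not allowed")
        if spec.risk > self.max_autonomous_risk and not approved:
            self._audit(tool, "pending", "needs human approval")
            return {"status": "pending_approval"}
        result = spec.handler(args)
        self._audit(tool, "executed", "ok")
        return {"status": "executed", "result": result}
```

Every branch writes an audit event before returning, including denials, so the log records what the agent tried to do as well as what it did.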

Detection matters too

Prevention will not catch every case. Agent systems need detection: repeated blocked actions, unusual tool sequences, sudden access to sensitive resources, tool calls triggered by untrusted content, and outputs that contain secrets or external destinations. These should be metrics and alerts, not only logs that someone might read later.

The best dashboards show agent activity like production operations: action volume, blocked attempts, approval queue, tool latency, high-risk calls, top external sources, and recent policy denials. If an agent can act, it should be observable.
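One of the signals above, repeated blocked actions, can be turned into an alert with a sliding window. The sketch below is a hypothetical `DenialMonitor` with an arbitrary threshold and window. It takes timestamps as arguments so the logic stays deterministic; a real deployment would more likely emit a metric and let the monitoring system apply the threshold.

```python
from collections import defaultdict, deque


class DenialMonitor:
    """Flag an agent that accumulates too many denials within a time window."""

    def __init__(self, threshold: int = 3, window_seconds: float = 60.0) -> None:
        self.threshold = threshold
        self.window = window_seconds
        self._events: dict[str, deque[float]] = defaultdict(deque)

    def record_denial(self, agent_id: str, now: float) -> bool:
        """Record one blocked action; return True when an alert should fire."""
        events = self._events[agent_id]
        events.append(now)
        # Drop denials that have aged out of the window.
        while events and now - events[0] > self.window:
            events.popleft()
        return len(events) >= self.threshold
```

For example, three denials for the same agent within 60 seconds return `True` on the third call, while three denials spread over several minutes never do.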

The rule of thumb

Do not ask the model to be the security boundary. Ask deterministic systems to be the security boundary. The model can propose, explain, summarize, and draft. The control plane decides what is allowed to execute.