Google DeepMind Maps Plan to Police Its Own AI Agents

Google DeepMind has published an internal roadmap that treats its own AI agents as potential insider threats.


Google DeepMind, the Alphabet-owned artificial intelligence research lab, has released a 35-page technical roadmap detailing how it intends to monitor and restrict increasingly capable AI agents operating inside its own systems.

The document, accompanied by a blog post titled "Securing the future of AI agents," signals a shift away from the AI safety field's long-standing emphasis on the alignment problem: training an AI system so that its actions match the values and intentions of the humans managing it.

Agents Treated As Potential Insider Threats

Google DeepMind continues to describe alignment as a core safeguard, but the roadmap acknowledges that the problem may never be fully resolved. In response, the company has built what it calls a "defense-in-depth" system that adds a layer of security designed to limit damage even when alignment efforts come up short. Rohin Shah, who leads the AGI Safety & Alignment team at Google DeepMind, explained the logic to Fortune: "If the first line of defense—alignment—fails, how can we mitigate harm anyway?"

The framework draws heavily on corporate cybersecurity practices used to guard against malicious internal employees, Shah told Fortune. He cautioned, however, that AI agents differ from human insiders in one key respect: they can act with far greater speed and at a far larger scale than any individual employee.

Taxonomy Modeled On MITRE's ATT&CK Framework

To formalize the approach, Google DeepMind built a classification system it calls TRAIT&R, short for Taxonomy of Rogue AI Tactics and Routines. It is modeled closely on ATT&CK, an adversary-behavior knowledge base maintained by MITRE, the U.S. nonprofit research and systems-engineering organization that frequently performs work for federal agencies.

Google DeepMind's blog post described ATT&CK as an industry-standard, globally accessible reference built from real-world observations of attacker behavior.

The taxonomy sorts risks into three categories: loss of control, in which a model might establish a hidden or unmonitored internal deployment; work sabotage, in which an agent could quietly undermine alignment or safety research; and direct harm, such as exfiltrating model weights or destroying critical assets.

Shah said that sabotage could also take subtler forms, including an agent that presents flawed results while concealing the flaws so that human users reach the wrong conclusions, a category the technical paper says is particularly difficult to identify.

Static Permissions Give Way To Dynamic Controls

John "Four" Flynn, Google DeepMind's vice president of security and privacy, believes that access-control systems built around fixed employee roles do not translate well to AI agents. Some software built for managing AI agent workflows attempts to grant an agent the same permissions as the human employee it acts on behalf of, Flynn said. That model breaks down as automation increases: one agent may carry out tasks tied to several roles, and it can become unclear which person, if any, the agent is working for at a given moment.

Flynn said the solution requires permissions that shift in real time according to the specific task and workflow an agent is executing, paired with monitoring systems that recognize what normal behavior looks like and cut off access the instant a deviation suggests the agent may be attempting something it should not. Google DeepMind's blog post likened the overall design to "a driving instructor with dual controls," extending trust to an agent while retaining the ability to intervene immediately if something goes wrong.
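As a rough illustration of the idea Flynn describes, the Python sketch below scopes an agent's permissions to the task it is running and revokes access on the first out-of-scope request. It is a minimal sketch, not Google DeepMind's implementation: the task names, action names, `TASK_PERMISSIONS` table and `AgentSession` class are all hypothetical, and a fixed allowlist stands in for a monitor that has learned what normal behavior looks like.

```python
from dataclasses import dataclass, field

# Hypothetical table: the actions each task type legitimately needs.
# A production system would derive this from the live task and workflow.
TASK_PERMISSIONS = {
    "fix_failing_test": {"read_repo", "run_tests", "write_branch"},
    "summarize_logs": {"read_logs"},
}


@dataclass
class AgentSession:
    """Tracks one agent working on one task (illustrative only)."""
    task: str
    revoked: bool = False
    audit_log: list = field(default_factory=list)

    def request(self, action: str) -> bool:
        """Allow the action only if it fits the current task's scope."""
        if self.revoked:
            self.audit_log.append((action, "denied: access already revoked"))
            return False
        allowed = TASK_PERMISSIONS.get(self.task, set())
        if action not in allowed:
            # Deviation from expected behavior: cut off access immediately.
            self.revoked = True
            self.audit_log.append((action, "denied: out of scope, access revoked"))
            return False
        self.audit_log.append((action, "allowed"))
        return True


session = AgentSession(task="summarize_logs")
session.request("read_logs")      # within scope for this task
session.request("delete_bucket")  # outside scope, so the session is revoked
session.request("read_logs")      # denied because access was already cut off
```

The point of the design is that permissions attach to the task rather than to a human role, so the question of whose account the agent is borrowing never arises in this sketch.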

Monitoring Tool Has Reviewed A Million Tasks

According to Google DeepMind's blog post, the company has already built an internal prototype that reviews coding-agent activity on a delay and escalates suspicious findings for human review, sorting each flagged case against the categories laid out in the roadmap. The tool has analyzed roughly one million coding-agent tasks to date, the company said, and the resulting data helped build a live monitor for its Gemini Spark agent, allowing the company to respond in real time to problems such as unintended data deletion.

Most flagged activity does not reflect malicious intent, Google DeepMind said in the blog post. The majority of cases stem from agents misunderstanding what was asked of them or trying too hard to complete a user's request. The company said this pattern has helped it move its detection methods beyond simple keyword filtering toward identifying more meaningful behavioral signals.

The roadmap lists roughly 15 mitigation methods in total, ranging from straightforward logs of network activity that staff can check immediately or review later to a smaller, trusted AI system assigned to read through an agent's written reasoning before it acts. Because researchers have found that such reasoning text does not always reflect what a model is actually processing internally, the roadmap also calls for examining activation patterns inside the model's neural network, a method Google DeepMind compared to brain-imaging techniques used to infer a person's emotional state.

Google DeepMind's blog post said detection and response requirements scale with two measures of a model's capability: its ability to evade detection, such as by concealing its reasoning, and its potential to cause harm, which determines whether a response can be handled after the fact or requires immediate intervention for high-risk actions such as major cyberattacks. The company tracks its monitoring systems against three metrics, which it labels coverage, recall and time-to-response, corresponding to the share of agent activity monitored, the share of harmful behavior caught, and how quickly the system reacts once an issue is flagged.
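The three metrics can be made concrete with a small Python sketch. It assumes a hypothetical `TaskRecord` with a ground-truth `harmful` label, for example from a later audit, and it summarizes time-to-response as a mean. Neither the record layout nor the choice of a mean comes from the roadmap, which the article does not quote in that detail.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskRecord:
    """One agent task as seen by a monitoring system (hypothetical layout)."""
    monitored: bool   # did the monitor observe this task at all?
    harmful: bool     # ground-truth label, assumed to come from a later audit
    flagged: bool     # did the monitor flag the task?
    seconds_to_response: Optional[float] = None  # delay from flag to response


def coverage(records: List[TaskRecord]) -> float:
    """Share of agent activity that was monitored."""
    return sum(r.monitored for r in records) / len(records)


def recall(records: List[TaskRecord]) -> Optional[float]:
    """Share of harmful behavior that was caught; None if nothing was harmful."""
    harmful = [r for r in records if r.harmful]
    if not harmful:
        return None
    return sum(r.flagged for r in harmful) / len(harmful)


def mean_time_to_response(records: List[TaskRecord]) -> Optional[float]:
    """Average delay between a flag and the response, in seconds."""
    times = [r.seconds_to_response for r in records
             if r.flagged and r.seconds_to_response is not None]
    return sum(times) / len(times) if times else None


# Made-up records for illustration, not real monitoring data.
records = [
    TaskRecord(monitored=True, harmful=True, flagged=True, seconds_to_response=30.0),
    TaskRecord(monitored=True, harmful=True, flagged=False),
    TaskRecord(monitored=True, harmful=False, flagged=True, seconds_to_response=90.0),
    TaskRecord(monitored=False, harmful=False, flagged=False),
]
print(coverage(records), recall(records), mean_time_to_response(records))
```

In this sketch an unmonitored harmful task would lower recall as well as coverage, which is one reason the two numbers are worth tracking separately.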

Flynn told Fortune that implementation of the roadmap is already well underway, with access-control mechanisms, detection systems and reasoning-monitoring tools partly running in production. The roadmap itself carries the label "v0.1" and is described by the company as a work in progress that it intends to eventually fold into its broader Frontier Safety Framework.

Alongside the roadmap, Google DeepMind published a separate technical paper for policymakers titled "Three Layers of Agentic Security." The paper addresses security requirements at the level of individual AI agents, in systems involving multiple interacting agents, and across the wider ecosystem of cyber defenders, the company said.
