What actually kills agent deployments (and how to not be a statistic)

Most agent projects do not die because the model was not smart enough. They die because nobody could send the data to it, attribute the cost, reconstruct a bad run, or get risk and legal to sign off on letting it act. Gartner expects over 40% of agentic AI projects to be cancelled by the end of 2027, and the reasons it lists are operational, not intellectual. This post grounds that prediction in six failure patterns documented across 2024 to 2026, pairs each with the concrete control that prevents it, and ends with a pre-mortem checklist you can run before you write a line of code.

Why do so many agent projects get cancelled?

The headline number comes from a Gartner press release dated 25 June 2025: over 40% of agentic AI projects will be cancelled by the end of 2027, driven by “escalating costs, unclear business value or inadequate risk controls.” Gartner’s analyst also flags “agent washing,” the rebranding of older chatbots and robotic process automation as agents, estimating that only around 130 of the thousands of vendors claiming agentic capability are real.

A second number sits next to it. McKinsey’s State of AI in 2025 reports that roughly nine in ten organisations now use AI somewhere, yet only about a third have scaled it across the enterprise, and the trait that separates the teams that scale from the teams that stall is fundamental workflow redesign, not model choice. A third comes from MIT’s Project NANDA report The GenAI Divide: State of AI in Business 2025, which found that despite large enterprise spending, around 95% of generative AI pilots produced no measurable return, and which attributed the gap to brittle workflows and misalignment with daily operations rather than model quality.

Three studies, one story. The work that survives is operationally defended. The work that dies was a good demo nobody could put into production.

Over 40% of agentic AI projects expected to be cancelled by end of 2027 (Gartner, June 2025)

What are the failure patterns, and what stops each one?

The patterns below are the ones reported again and again across analyst notes and reputable reporting in 2024 to 2026. Each is paired with the documented failure and the concrete control that prevents it.

Failure pattern	What it looks like	The control that prevents it
Privacy blocks adoption	Security will not let regulated data leave the boundary, so the pilot never starts	Deploy the agent inside the customer environment
No cost attribution	Bill triples, nobody can say which workflow drove it	Meter cost per completed task, per workflow
No debuggability	A wrong run cannot be reconstructed, trust collapses	Traces and run replay on every execution
Scope creep	A bounded helper becomes an open-ended reasoning system	A scoped manifest with tools and knowledge fixed in writing
Weak human-in-the-loop	The agent acts before anyone can review, errors ship	An explicit approval gate per consequential action
Over-automation	A consequential, hard-to-reverse task is fully automated	A graded autonomy ladder tied to blast radius

Failure 1: privacy blocks adoption before the pilot starts

The first conversation with a frontier agent vendor often ends in the same place. The data the agent needs is customer PII (personally identifiable information), contracts, claims, patient records, source code, or financial positions. Sending that to a third-party endpoint is a transfer the security team will not sign, and under the GDPR and, for financial entities in the EU, DORA, they are right not to. The project does not fail in production. It fails before the first real query, because nobody can lawfully feed the agent anything worth automating. Security review, not security vulnerability, is what stalls it.

The control: move the agent to the data, not the data to the agent. Deploy inside the customer’s own cloud account, VPC, or on-premise estate so regulated records never cross the trust boundary, with keys held in the customer’s own key store. This is the difference between “we cannot start” and “we can start on real data on Monday.” We wrote up the mechanics of this in Inside the trust boundary, and the architecture lives on the platform page and the security page.

Failure 2: no cost attribution, so finance pulls the budget

The pilot runs on a shared API key and a flat invoice. Six weeks in, the bill has tripled and nobody can say which workflow drove it: the high-value claims triage, or the low-value internal FAQ bot somebody wired up on a Friday. When finance asks for unit economics and the answer is one aggregate number, the project loses its budget defence. Gartner’s “escalating costs, unclear business value” is usually this. The agent was not expensive. Nobody could attribute the expense to a result.

The control: meter at the workflow level, not the account level. Tag every token, tool call, and retry to a specific workflow and, where possible, a completed task. The number finance needs is “cost per resolved claim,” not “spend last month.” When you can put unit cost next to unit value, the conversation moves from “this is getting expensive” to “this costs 3.40 EUR per completed task and replaces 25 EUR of manual handling.” That is the argument in Cost per completed task and what the control tower exists to produce.
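A minimal sketch of workflow-level metering, in Python. The class name, method names, workflow labels, and EUR figures are all hypothetical and chosen only for illustration; a real deployment would feed this from the provider's usage records rather than hand-entered costs.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class WorkflowMeter:
    """Accumulates spend and completed-task counts per workflow (hypothetical helper)."""
    spend_eur: dict = field(default_factory=lambda: defaultdict(float))
    completed: dict = field(default_factory=lambda: defaultdict(int))

    def record_usage(self, workflow: str, cost_eur: float) -> None:
        # Every model call, tool call, and retry is tagged to a workflow.
        self.spend_eur[workflow] += cost_eur

    def record_completion(self, workflow: str) -> None:
        # Only tasks that actually finished count towards the denominator.
        self.completed[workflow] += 1

    def cost_per_completed_task(self, workflow: str):
        done = self.completed.get(workflow, 0)
        if done == 0:
            return None  # spend with no completed work: flag it, do not divide
        return self.spend_eur.get(workflow, 0.0) / done


meter = WorkflowMeter()
meter.record_usage("claims-triage", 2.10)
meter.record_usage("claims-triage", 1.30)   # a retry still costs money
meter.record_completion("claims-triage")
meter.record_usage("internal-faq", 0.40)    # spend, but no completed task yet

print(meter.cost_per_completed_task("claims-triage"))
print(meter.cost_per_completed_task("internal-faq"))
```

Returning None for a workflow with spend but no completions is a deliberate choice in this sketch: that case is exactly the one finance will ask about, so it should surface as a flag rather than disappear into an average.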

Failure 3: no debuggability, so one bad run ends it

An agent produces a wrong answer in front of a senior stakeholder. Someone asks the reasonable question: what happened on that run? If the answer is “we cannot reconstruct it,” trust does not degrade gradually, it collapses. One unexplained failure with no trace does more political damage than fifty quiet successes can repair.

The control: instrument every run so it can be reconstructed. A trace records the inputs, retrieval context, tool calls, and intermediate decisions, ideally under the OpenTelemetry GenAI semantic conventions so the format is portable rather than a vendor’s private dialect. Replay re-runs the exact failing case. Evals turn one-off debugging into a regression suite, so a fix becomes a test. When a stakeholder points at a wrong answer, you respond with the run, not an apology.
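A minimal sketch of trace-and-replay, using only the Python standard library. It does not use the OpenTelemetry SDK or its attribute names; the RunTrace fields, the toy decision function, and the invoice values are assumptions made for illustration. The idea shown is that recorded tool results are fed back in on replay, so the failing case can be re-run without touching live systems.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class RunTrace:
    """What one run needs to keep so it can be reconstructed (hypothetical schema)."""
    run_id: str
    inputs: dict
    retrieval_context: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)
    decisions: list = field(default_factory=list)
    output: str = ""


def decide(inputs, lookup):
    """Toy stand-in for an agent step: one tool call, one decision."""
    amount = lookup(inputs["invoice_id"])
    return "escalate" if amount > inputs["limit"] else "approve"


def traced_run(run_id, inputs, lookup):
    trace = RunTrace(run_id=run_id, inputs=inputs)

    def recording_lookup(invoice_id):
        # Wrap the tool so every call and its result land in the trace.
        result = lookup(invoice_id)
        trace.tool_calls.append({"tool": "lookup", "args": invoice_id, "result": result})
        return result

    trace.output = decide(inputs, recording_lookup)
    trace.decisions.append(trace.output)
    return trace


def replay(trace_json):
    """Re-run the stored case with recorded tool results instead of live calls."""
    saved = RunTrace(**json.loads(trace_json))
    recorded = {call["args"]: call["result"] for call in saved.tool_calls}
    replayed = traced_run(saved.run_id, saved.inputs, lambda invoice_id: recorded[invoice_id])
    return saved.output, replayed.output


trace = traced_run("run-001", {"invoice_id": "INV-7", "limit": 1000}, lambda _id: 1250)
stored = json.dumps(asdict(trace))   # what would be written to the trace store
original, replayed = replay(stored)
print(original, replayed)
```

Once a replayed case like this is pinned as an assertion, the one-off debugging session has become a regression test.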

Failure 4: scope creep turns a tool into a science project

Scope creep is, by several 2025 analyses, the single most common reason agent projects stall. A bounded helper that drafted one kind of email starts being asked to handle exceptions, then to pull from three more systems, then to reason open-endedly across the whole department. Each addition needs more integrations, more data access, and more evaluation, and the project quietly loses the small, defensible win it could have shipped.

The control: fix the scope in writing before you build, as a manifest the agent cannot exceed at runtime. A manifest names the owner, the single workflow, the allowed tools (and the forbidden ones), the knowledge sources, the risk tier, the cost cap, and the rollback. A short one:

name:        invoice-exception-resolver
owner:       finance-ap@customer
workflow:    accounts-payable / exception handling
allowed:     erp.read, erp.match, ticket.create   (no erp.post)
knowledge:   vendor-master, po-history, ap-policy
risk_tier:   2 (financial, reversible)
cost_cap:    0.60 EUR / completed task
autonomy:    execute-with-approval
rollback:    void ticket, requeue to human, alert owner

The line no erp.post is the blast-radius decision made in writing rather than discovered in an incident. More on this in Agent skills: a manifest for every workflow.
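A manifest only holds scope if something enforces it on each tool call. The following is a minimal sketch of that check in Python, assuming a hypothetical dispatch function and a stub tool registry; none of these names come from a real framework. The allowed tools mirror the manifest above.

```python
from dataclasses import dataclass


class ToolNotAllowed(Exception):
    """Raised when the agent requests a tool outside its manifest."""


@dataclass(frozen=True)
class Manifest:
    name: str
    allowed_tools: frozenset


def dispatch(manifest, tool, registry, *args):
    # Check the manifest first, so a registered but forbidden tool never runs.
    if tool not in manifest.allowed_tools:
        raise ToolNotAllowed(f"{manifest.name} may not call {tool}")
    return registry[tool](*args)


manifest = Manifest(
    name="invoice-exception-resolver",
    allowed_tools=frozenset({"erp.read", "erp.match", "ticket.create"}),
)

# Stub tools standing in for real integrations (hypothetical).
registry = {
    "erp.read": lambda invoice_id: {"invoice_id": invoice_id, "status": "exception"},
    "erp.post": lambda invoice_id: f"posted {invoice_id}",
}

print(dispatch(manifest, "erp.read", registry, "INV-7"))
try:
    dispatch(manifest, "erp.post", registry, "INV-7")
except ToolNotAllowed as err:
    print(err)
```

The check is an allowlist rather than a denylist on purpose: a tool added to the registry later stays unreachable until someone edits the manifest.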

Failure 5: weak human-in-the-loop, so errors ship before anyone looks

A demo where the agent only suggests is easy to approve. The trouble starts when someone wires the suggestion straight to the action (send the email, post the invoice, release the payment) with no review step, because review felt slow. The first wrong action that reaches a customer or a regulator is the one that gets the project paused.

This is not only operational prudence, it is law. Under the EU AI Act, deployers of high-risk systems must assign human oversight to competent, trained people with real authority to intervene, including the ability to decide not to use the output (Article 14 and Article 26). “We let it act and see” is not a posture legal can accept.

The control: an explicit approval gate on every consequential action, mapped to roles (RBAC) and to value and blast-radius limits. The gate is also your rollback for the early life of the agent: the rollback is a human saying no. This is what lets risk and legal approve the first version on day one. The trust model is built around this.
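As a rough sketch, an approval gate can be a single function that every consequential action must pass through. The action kinds, role names, and EUR limits below are hypothetical; the point is only that the default answer is no, and that approval is bounded by the approver's role.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    kind: str
    value_eur: float


# Hypothetical policy: which actions are consequential, and who may approve how much.
CONSEQUENTIAL = {"send_email", "post_invoice", "release_payment"}
APPROVER_LIMITS_EUR = {"ap-clerk": 1_000.0, "ap-manager": 25_000.0}


def may_execute(action: Action, approver_role: Optional[str] = None, approved: bool = False) -> bool:
    if action.kind not in CONSEQUENTIAL:
        return True   # drafting and suggesting need no gate
    if not approved or approver_role is None:
        return False  # default deny: no human decision, no action
    # Unknown roles get a limit of zero, so they can never approve.
    return action.value_eur <= APPROVER_LIMITS_EUR.get(approver_role, 0.0)


payment = Action("release_payment", 4_800.0)
print(may_execute(payment))                             # nobody has reviewed it
print(may_execute(payment, "ap-clerk", approved=True))  # above the clerk's limit
print(may_execute(payment, "ap-manager", approved=True))
```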

Failure 6: over-automation of the things you cannot take back

The opposite of weak oversight is over-confidence. A team that gets one workflow working well reaches for full autonomy on a task that is consequential and hard to reverse: a clinical recommendation, a denied claim, a wired payment, a contract clause. When that goes wrong, the cost is not a re-run, it is a real-world harm and a real-world liability. Reporting through 2025 and into 2026 keeps surfacing the same lesson: the failures that make headlines are the ones where a high-stakes, low-reversibility task was automated without a stop.

The control: a graded autonomy ladder, where the rung an agent is allowed to occupy is tied to how reversible the action is. Low blast radius and reversible can run autonomously with humans on exceptions. High blast radius and irreversible stays at execute-with-approval no matter how good the demo looked. The detail is in The autonomy ladder.
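One way to make the ladder enforceable is to compute the highest permitted rung from the properties of the action. This is a sketch under assumptions: the three rung names are simplified, and the handling of mixed cases (low blast radius but irreversible, or high blast radius but reversible) is a conservative choice made here, since the text only fixes the two extremes.

```python
from enum import IntEnum


class Rung(IntEnum):
    SUGGEST = 1
    EXECUTE_WITH_APPROVAL = 2
    AUTONOMOUS = 3


def max_rung(high_blast_radius: bool, reversible: bool) -> Rung:
    # Only low-blast-radius, reversible actions may run autonomously.
    if not high_blast_radius and reversible:
        return Rung.AUTONOMOUS
    # Assumption: every other combination is capped at execute-with-approval.
    return Rung.EXECUTE_WITH_APPROVAL


def is_permitted(requested: Rung, high_blast_radius: bool, reversible: bool) -> bool:
    return requested <= max_rung(high_blast_radius, reversible)


# Re-queuing a ticket: small and reversible.
print(is_permitted(Rung.AUTONOMOUS, high_blast_radius=False, reversible=True))
# Wiring a payment: large and not reversible, however good the demo looked.
print(is_permitted(Rung.AUTONOMOUS, high_blast_radius=True, reversible=False))
```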

How do you avoid becoming the statistic?

The order matters as much as the controls. Most cancelled projects tried to do everything at once, or in the wrong order, and ran out of trust or budget before the value landed. The sequence that survives:

Deploy privately first. Stand the agent up inside the trust boundary on real data. This clears the security veto and gets you a system allowed to do useful work.
Instrument before you automate. Turn on cost attribution and traces while the agent is still only suggesting. The meter and the logs should be running before anyone proposes giving the agent the ability to act.
Automate with evidence. Climb the autonomy ladder one rung at a time, promoting on the trace record (approval rate, failure rate, replay coverage, a rollback that has actually been exercised), not on a slide.
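Promotion on the trace record can be written down as a check rather than argued in a meeting. The sketch below assumes hypothetical thresholds (200 runs, 95% approval, 2% failure, 99% replay coverage); the right values depend on the workflow and its risk tier, and these defaults are placeholders only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    runs: int
    approved: int             # suggestions a human accepted unchanged
    failed: int
    replayable: int           # runs that can be replayed from their trace
    rollback_exercised: bool  # the rollback has been run for real, not just written


def ready_to_promote(e: Evidence, min_runs: int = 200, min_approval: float = 0.95,
                     max_failure: float = 0.02, min_replay: float = 0.99) -> bool:
    # Too little evidence, or an untested rollback, blocks promotion outright.
    if e.runs == 0 or e.runs < min_runs or not e.rollback_exercised:
        return False
    return (
        e.approved / e.runs >= min_approval
        and e.failed / e.runs <= max_failure
        and e.replayable / e.runs >= min_replay
    )


strong = Evidence(runs=400, approved=392, failed=4, replayable=400, rollback_exercised=True)
untested = Evidence(runs=400, approved=392, failed=4, replayable=400, rollback_exercised=False)
print(ready_to_promote(strong))
print(ready_to_promote(untested))
```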

The pre-mortem checklist

Before you build, assume the project failed and ask why. If you cannot answer every line below, you have found the risk before it found you.

Owner. Is there a single named human accountable for this agent in production, not a team and not a prompt in a notebook?
Boundary. Where does the regulated data live, and have you confirmed it never has to leave the customer’s environment to run?
Cost cap. What is the cost per completed task you would defend to finance, and what is the hard cap that pauses the agent?
Trace. For any run, can you reconstruct the inputs, retrieval, tool calls, and decisions, and replay the exact failing case?
Scope. Is the manifest written down, with allowed and forbidden tools, before any code, and is the agent prevented from exceeding it at runtime?
Approval. Which actions require a human gate, who holds the authority to say no, and is that consistent with your obligations under the EU AI Act?
Reversibility. For each action the agent can take, is it reversible, and if not, is it pinned below autonomous?
Rollback. Has the rollback for the worst plausible failure actually been exercised, not just written?

A project that can answer those eight is not guaranteed to succeed. But it is no longer in the 40%. The cancellations Gartner expects are mostly projects that were technically sound and operationally undefended. The defence is not a better model. It is a boundary, a meter, a log, a manifest, a gate, and a ladder, in that order. For more on the operating model behind these controls, see the full insights library.

FAQ

Is the 40% cancellation figure about model quality?

No. Gartner attributes the cancellations to escalating costs, unclear business value, and inadequate risk controls. Independent work from McKinsey and MIT points the same way: the binding constraint is operational and organisational (data access, cost attribution, debuggability, workflow design), not whether the model can reason. Swapping to a newer model rarely changes the outcome.

What single control prevents the most failures?

Deploying inside the customer’s own environment removes the most common blocker, because it is what lets a regulated team start on real data at all. But the controls work as a set. Private deployment clears the security veto, metering defends the budget, traces defend trust, the manifest holds scope, and the approval gate and autonomy ladder keep consequential actions safe. Each one closes a different failure mode.

What is a pre-mortem and when do you run it?

A pre-mortem is a short exercise where you assume the project has already failed and work backwards to the reasons, before any code is written. For an agent, that means naming the owner, the data boundary, the cost cap, the trace requirement, the scope, the approval gate, the reversibility of each action, and a rollback that has been tested. It is the cheapest insurance against being one of the cancelled projects.