Why AI Agents Fail at Multi-Step Tasks — and How to Catch the Silent Failure

#python #ai #programming #tutorial

💻 All the code for this series lives in one repo: resilient-agent-harness-sample-for-aws. This post is the Multi-Step Task Planning demo (03-multi-step-task-planning). Clone it and follow along.

Give an AI agent a task with several steps and one tool that misbehaves quietly, and here's what happens: a step's tool returns "confirmed", the agent believes it, moves on, and at the end reports the whole task done. But that one step never actually persisted. The tool said success; the write isn't there. The agent has no way to tell a real success from a fake one, so it ships a result that's confidently, partially broken.

Trusting a tool's "confirmed" without checking is one of the most common ways agents fail on multi-step work. The failure is invisible precisely because nothing errored. There's no exception to catch, no red log line, just a cheerful summary that doesn't match reality. And you can't prompt your way around a tool that lies. The fix is structural: verify each step against the real backend, and redo the one that didn't take.

To make it concrete, I built a small travel agent and gave it a trip to book. The full demo, runnable end to end, is in the resilient-agent-harness-sample-for-aws repo.

What is the demo?

The agent, built with Strands Agents, books a round-the-world trip of three flights (JFK to CDG, CDG to HND, HND to JFK) and has three tools:

search_flights finds fares from the Duffel sandbox.
book_flight writes a booking to the backend. The middle flight (CDG to HND, the Tokyo leg of the trip) has a silent failure baked in: its first attempt returns "confirmed" but does not save.
list_booked_flights reads back what actually persisted. This is the ground truth.

Before any agent runs, the notebook calls book_flight on the Tokyo flight directly to prove the trap: attempt 1 says confirmed, yet list_booked_flights shows the booking isn't there. That's the silent failure, demonstrated on the tool itself, so you trust the rest of the story.
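If you want to see the trap without cloning anything, here is a minimal sketch of it. The function names match the demo's tools, but the bodies are hypothetical in-memory stand-ins I wrote for illustration, not the repo's code (the real tools talk to the Duffel sandbox and a real backend).

```python
# Hypothetical in-memory stand-ins for the demo's tools; not the repo's code.
_backend: set[str] = set()        # what actually persisted
_attempts: dict[str, int] = {}    # how many times each route was booked
FLAKY_ROUTE = "CDG-HND"           # the Tokyo leg with the baked-in silent failure


def book_flight(route: str) -> str:
    """Always report "confirmed", but drop the first write for FLAKY_ROUTE."""
    _attempts[route] = _attempts.get(route, 0) + 1
    silently_dropped = route == FLAKY_ROUTE and _attempts[route] == 1
    if not silently_dropped:
        _backend.add(route)
    return "confirmed"


def list_booked_flights() -> list[str]:
    """Read back what actually persisted: the ground truth."""
    return sorted(_backend)


first_reply = book_flight(FLAKY_ROUTE)
saved_after_first = FLAKY_ROUTE in list_booked_flights()
second_reply = book_flight(FLAKY_ROUTE)
saved_after_second = FLAKY_ROUTE in list_booked_flights()

print(first_reply, saved_after_first)    # confirmed False
print(second_reply, saved_after_second)  # confirmed True
```

Both calls return the same string. Only the read-back tells them apart.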

What is multi-step task planning?

Multi-step task planning is completing a task made of several ordered steps by doing one step, checking it actually persisted in the real backend, and only then moving to the next, instead of firing off every step and trusting each tool's reported success. The check against ground truth is what catches a step that reported "done" but silently never saved.

The trap is that a tool's response and the actual state of the world can disagree. A booking call can return a confirmation while the row never lands. Verifying against the backend is the only reliable way to know the difference.
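Stripped of any framework, the idea fits in a few lines. This is a sketch under my own assumptions: run_verified, flaky_book, and the in-memory saved set are hypothetical names for illustration, and the "backend" is just a Python set.

```python
from typing import Callable, Iterable


def run_verified(
    steps: Iterable[str],
    do_step: Callable[[str], str],
    is_persisted: Callable[[str], bool],
    max_attempts: int = 3,
) -> dict[str, int]:
    """Run steps in order; return how many attempts each one needed.

    Raises RuntimeError if a step never shows up in the backend.
    """
    attempts_used: dict[str, int] = {}
    for step in steps:
        for attempt in range(1, max_attempts + 1):
            do_step(step)              # the tool's reply is deliberately ignored
            if is_persisted(step):     # ground truth decides, not the reply
                attempts_used[step] = attempt
                break
        else:
            raise RuntimeError(f"{step} did not persist after {max_attempts} attempts")
    return attempts_used


# Hypothetical backend: the first write of CDG-HND is silently dropped.
saved: set[str] = set()
calls: dict[str, int] = {}


def flaky_book(route: str) -> str:
    calls[route] = calls.get(route, 0) + 1
    if not (route == "CDG-HND" and calls[route] == 1):
        saved.add(route)
    return "confirmed"


report = run_verified(["JFK-CDG", "CDG-HND", "HND-JFK"], flaky_book, lambda r: r in saved)
print(report)  # {'JFK-CDG': 1, 'CDG-HND': 2, 'HND-JFK': 1}
```

The next step starts only after the current one is confirmed in the backend, and the retry is bounded so a step that never saves raises instead of looping forever.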

Why isn't a tool's "confirmed" enough?

A tool can return success while the write didn't persist: a flaky backend, a consistency lag, a half-applied transaction. The response looks identical to a real success, so the agent relays it as fact. The demo runs the trip two ways:

Approach	How it works	What happens
BEFORE	One agent books all three flights and trusts each `"confirmed"`.	It reports the trip booked, but only 2/3 flights actually saved (`JFK-CDG`, `HND-JFK`). The Tokyo flight is silently missing.
AFTER	A native Strands Graph: an executor books one flight, a verifier reads the backend and replies PASS/FAIL, and a conditional edge retries on FAIL.	The verifier catches the silent failure and the graph re-books it. 3/3 flights actually saved.

Why a Graph, and why Strands makes it easy

Coordinating two agents (an executor that does the work and a verifier that checks it, with a retry when verification fails) is multi-agent orchestration. That's exactly what Strands' native GraphBuilder is for, and it's where Strands does the heavy lifting for you. The docs describe a Graph as a deterministic agent-orchestration system where the executor and verifier are nodes and the flow between them is edges, including conditional and cyclic edges. The retry-until-it-saves pattern is the one the docs call a "feedback loop": you declare the nodes and edges, and the SDK runs the flow, the bounded retry loop, and the token accounting. You don't hand-roll a while loop or track state yourself.

The diagram shows that loop: the executor books a flight and hands off to the verifier; the verifier reads the real backend; a green PASS edge ends the run for that flight, and a red FAIL edge loops back to the executor to re-book. GraphBuilder wires the conditional edge and bounds the cycle so it can't spin forever.

Two design choices carry the whole thing. The verifier has only list_booked_flights, so it decides from ground truth, not from the executor's say-so. And the retry is a conditional edge from verify back to execute that fires only when the verifier replied FAIL. set_max_node_executions(6) caps the graph at six node executions in total, so the loop can't run forever (a bound is required for a cycle), and reset_on_revisit(True) makes the executor start fresh on each retry instead of carrying stale state.

from strands import Agent
from strands.multiagent import GraphBuilder

# search_flights, book_flight, and list_booked_flights are the demo's three tools (defined in the repo).
executor = Agent(name="executor", tools=[search_flights, book_flight])
verifier = Agent(name="verifier", tools=[list_booked_flights])   # reads ground truth, replies PASS/FAIL

def verification_failed(state):
    # True only when the verifier has already run and its reply contains FAIL
    v = state.results.get("verify")
    return bool(v) and "FAIL" in str(v.result).upper()

builder = GraphBuilder()
builder.add_node(executor, "execute")
builder.add_node(verifier, "verify")
builder.add_edge("execute", "verify")
builder.add_edge("verify", "execute", condition=verification_failed)   # retry only on FAIL
builder.set_entry_point("execute")
builder.set_max_node_executions(6)     # bound the retry loop (required for a cycle)
builder.reset_on_revisit(True)         # executor starts fresh each retry
graph = builder.build()

route = "CDG-HND"   # example value: one of the trip's three routes
result = graph(f"Book flight {route} and verify it actually saved.")
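To check how the verification_failed condition above behaves without running a graph, you can feed it stub states. The stub_state helper is hypothetical: it mimics only the .results mapping the condition reads, and the real Strands state object carries more than this.

```python
from types import SimpleNamespace


def verification_failed(state) -> bool:
    """Same condition as the graph's retry edge."""
    v = state.results.get("verify")
    return bool(v) and "FAIL" in str(v.result).upper()


def stub_state(verifier_reply=None):
    """Hypothetical stand-in exposing only the .results mapping the condition reads."""
    results = {}
    if verifier_reply is not None:
        results["verify"] = SimpleNamespace(result=verifier_reply)
    return SimpleNamespace(results=results)


not_run = verification_failed(stub_state())
passed = verification_failed(stub_state("PASS"))
failed = verification_failed(stub_state("fail: CDG-HND is not in the backend"))
ambiguous = verification_failed(stub_state("PASS, no failures"))

print(not_run, passed, failed, ambiguous)  # False False True True
```

The last case is the one to watch: the check is a substring match, so a chatty reply like "PASS, no failures" still contains "FAIL" and would trigger a retry. That is a good reason to instruct the verifier to reply with a single word.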

You can watch the recovery in the per-flight node trace. The two flights that save on the first try run execute, verify and stop. The Tokyo flight runs execute, verify, execute, verify: the verifier replied FAIL, the conditional edge looped back, and the executor re-booked it.

JFK-CDG: nodes ran -> ['execute', 'verify']                       saved = True
CDG-HND: nodes ran -> ['execute', 'verify', 'execute', 'verify']  saved = True   # retried!
HND-JFK: nodes ran -> ['execute', 'verify']                       saved = True
flights ACTUALLY saved in the backend: 3/3
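Once you have those traces as plain lists of node ids, counting retries is simple arithmetic. The sketch below hard-codes the traces from the output above. How you pull the node ids out of the Strands result is left out on purpose, since I'm not asserting the exact attribute here, and count_retries is a hypothetical helper.

```python
def count_retries(trace: list[str], node_id: str = "execute") -> int:
    """Number of times node_id ran beyond its first execution."""
    return max(trace.count(node_id) - 1, 0)


# Traces copied from the run output above.
traces = {
    "JFK-CDG": ["execute", "verify"],
    "CDG-HND": ["execute", "verify", "execute", "verify"],
    "HND-JFK": ["execute", "verify"],
}

retries = {route: count_retries(trace) for route, trace in traces.items()}
print(retries)  # {'JFK-CDG': 0, 'CDG-HND': 1, 'HND-JFK': 0}
```

A non-zero count is the signal that a tool reported success at least once while the backend disagreed.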

Does verification cost more tokens?

Yes, and that's the part most "agent efficiency" posts skip. Tokens come from result.accumulated_usage, the real Strands metrics, not estimates. A measured run on OpenAI gpt-4o-mini gave me:

	before	after
flights actually saved	2/3	3/3
agent claimed complete	yes	yes
tokens	3,126	10,732

Read it honestly: verification costs more tokens, not fewer, because you pay to read the backend and retry. Both runs claim "all booked"; only the verified Graph is actually right. The win is correctness, not a smaller bill. The exact totals shift per run because the model is non-deterministic, so run it yourself and watch the shape hold: the BEFORE agent is cheaper and wrong, the AFTER graph costs more and ships a complete trip.
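To put numbers on the trade, here is the arithmetic on my measured run. The figures are the ones from the table above; your totals will differ from run to run.

```python
# Figures from the measured run in the table above.
before = {"tokens": 3_126, "saved": 2}
after = {"tokens": 10_732, "saved": 3}

extra_tokens = after["tokens"] - before["tokens"]
token_ratio = after["tokens"] / before["tokens"]
per_saved_before = before["tokens"] / before["saved"]
per_saved_after = after["tokens"] / after["saved"]

print(extra_tokens)              # 7606
print(f"{token_ratio:.2f}x")     # 3.43x
print(round(per_saved_before))   # 1563
print(round(per_saved_after))    # 3577
```

So the verified graph spent about 3.4 times the tokens, and more than twice as many per saved flight. What those extra tokens buy is the third flight actually being in the backend.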

Frequently asked questions

Why isn't a tool's "confirmed" enough?
Because a tool can return success while the write didn't actually persist (a flaky backend, a consistency lag). The agent can't tell a real success from a fake one, so it reports work as done that isn't. Reading the backend after the fact is the only reliable check.

Does verification always cost more tokens?
Yes, up front, and that's the trade. You spend extra tokens to read the backend and retry, and in return you don't ship a trip that's silently missing a flight. The metric that matters is correctness, not raw token count.

Do I need Strands or OpenAI for this?
No. The pattern (execute, verify against ground truth, retry the failure) is a general agent concept. Strands is model-agnostic: its providers are interchangeable, so the same Graph runs on Amazon Bedrock (Strands' default provider), Anthropic, OpenAI, or a local model via Ollama. The demo itself defaults to OpenAI gpt-4o-mini because it needs only an API key to try, though that's still a cloud API call, not a model on your machine.

Run it yourself

The full demo (the silent failure proven on the tool directly, the naive agent shipping 2/3, then the native Graph recovering to 3/3) runs end to end in one notebook. Clone the repo and run it:

git clone https://github.com/elizabethfuentes12/resilient-agent-harness-sample-for-aws.git
cd resilient-agent-harness-sample-for-aws/03-multi-step-task-planning

uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

# Default: OpenAI gpt-4o-mini (just an API key to try)
echo "OPENAI_API_KEY=sk-..." > .env
echo "DUFFEL_API_KEY=duffel_test_..." >> .env   # free sandbox token from app.duffel.com
uv run test_multi_step_task_planning.py

Prefer notebooks? Open test_multi_step_task_planning.ipynb and run it top to bottom.

The pattern follows MiRA (Wang et al., Mar 2026), which adds inference-time planning and verification with no training. The benchmark figures and full reading are in the repo's README. What this demo produces is the mechanism: execute, verify against ground truth, retry the failure, on a native Strands Graph.

What's the silent failure that bit your agent: a tool that said "done" while nothing saved? Tell me in the comments.


Thanks!
