Agency is the New Resilience
Agents can power robust workflows by intelligently reacting to unexpected conditions, creating a new form of flexible resilience.
It’s true that language model-based agents are nondeterministic: We don’t know in advance what approach they’ll take to accomplish their work. However, simply calling them nondeterministic risks making them sound random — which they very much are not. Indeed, if we take a broad enough view, they’re not even all that unpredictable. By definition, agents direct their behavior intentionally, based on observations of their environment, in pursuit of a specified goal. So, in practice, the nondeterministic nature of agents exists to accommodate the nondeterministic nature of the environments they’re working in. Agents aren’t simply nondeterministic — they’re something more like counter-nondeterministic. Even if the path may vary, we can predict where they’ll end up: At their goal.
Perhaps that sounds a bit abstract, so let’s jump to an example. Let’s say I want to remember the exact date I graduated from college. That was quite a while ago, and this detail is long gone from my working memory. However, there is exactly one correct answer to this question. (To avoid ambiguity, let’s specify that I want it in YYYY-MM-DD format, with no commentary or markup.) If I assigned this question to a personal agent, how might it find this information? Presumably my school has a student database, but I’m sure I don’t have access to it. Maybe I still have some old emails about the graduation, perhaps with relatives visiting from out of town? Or a calendar invite for the celebration dinner? Or even pictures of me wearing a graduation gown in a cloud photo roll backup? I don’t know which of these I still have around, and at first neither would an agent. But if I gave an agent sufficient data access, and it looked around long enough, it would eventually track down the correct answer.
In this way, agents’ nondeterminism supports their resilience: Whatever they have access to, they’ll use toward their assigned goal. If data is missing in one place, they’ll look elsewhere. If a tool doesn’t seem to work, they’ll try another one. If a service won’t respond, they’ll switch to something different. Whatever happens, they’ll take a reasonable next step — within the boundaries and guardrails that we’ve set for them, of course. This is what makes agents so darn useful, and it’s exactly the type of goodness we get with things like Agentic Workflows.
Of course, this type of contextual resilience isn’t entirely new. We’ve long had to ensure our software works in nondeterministic environments, such as the cloud — as soon as a network is involved, all sorts of weird things happen. To accommodate this reality, the industry has developed a robust suite of resilience techniques and technologies: Libraries like Polly, Cockatiel, and Tenacity provide well-defined strategies like timeouts, retries, circuit breakers, caching, and hedging. Agents may be more flexible than mechanical code, but do they nonetheless have an analogous set of strategies at their disposal? And if we characterize those strategies, can we make them even better? This is certainly not a comprehensive list, but a few come to mind:
- Modified resubmission: An agent tries to call an API, but the interface has changed, and the call is rejected. The semantics might be effectively the same, but the schema or syntax is different. The agent can restructure the request based on the error message and try again.
- Resource rediscovery: An agent needs to work with some artifact, such as a webpage or a file on disk, but that artifact isn’t at its expected location. The agent can look around for a new location (perform a web search, grep the file system, etc.), then continue its work once it finds it.
- Error reclassification: An agent gets an error response from some operation, but the response nonetheless contains the information it needed. The agent can ignore the fact that the response was technically an error, consider it effectively a success, and move forward with the info it gleaned.
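For contrast, it’s worth remembering what the classic, mechanical version of resilience looks like. Here’s a minimal hand-rolled sketch of the retry-with-exponential-backoff strategy that libraries like Tenacity and Polly provide in much more polished form (the `flaky_call` operation is a stand-in for any unreliable network call):

```python
import time

def retry(fn, attempts=3, base_delay=0.01):
    """Retry fn with exponential backoff: the classic mechanical strategy."""
    for i in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if i == attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(base_delay * 2 ** i)  # wait longer after each failure

# A deliberately flaky stand-in operation: fails twice, then succeeds.
calls = {"n": 0}
def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("service unavailable")
    return "ok"

result = retry(flaky_call)  # succeeds on the third attempt
```

Note how rigid this is compared to the agentic strategies above: the retry loop can only repeat the exact same call, while an agent can restructure the request, look elsewhere, or reinterpret the response.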
All of these imply things we can do to set our agents up for success: Make sure our systems return detailed explanations for invalid requests, so the inputs can be fixed. Give resources clear identifiers so they can be easily recognized, even if they’ve moved. Include information about partial successes in outputs, even if the operation ultimately failed. These types of programming patterns enable agents to exercise their resilience — their agency — and to ultimately accomplish what we asked for.
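To make the first of those suggestions concrete, here’s a small hypothetical sketch of a request handler whose rejection messages carry enough detail for modified resubmission — the field names and response shape are illustrative assumptions, not any particular API:

```python
# Hypothetical request handler: it rejects invalid input, but the error
# explains exactly what was expected, so a caller (human or agent) can
# restructure the request and resubmit.
def handle_request(payload: dict) -> dict:
    if "graduation_date" not in payload:
        return {
            "ok": False,
            "error": "missing required field 'graduation_date'",
            "expected": {"graduation_date": "YYYY-MM-DD"},
        }
    return {"ok": True, "stored": payload["graduation_date"]}

# First attempt uses an outdated field name and is rejected...
first = handle_request({"grad_date": "2004-05-15"})
# ...but the error names the expected schema, so the request can be rebuilt.
second = handle_request({"graduation_date": "2004-05-15"})
```

An error message that merely said "400 Bad Request" would leave an agent guessing; one that names the missing field and its format turns a dead end into a one-step fix.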
So, what types of problems do you see your agents running into? How do your agents get themselves unstuck? And how can you make it easier for them to get themselves unstuck? In the era of agentic engineering, that’s now part of the job.