
How Do You Get an Agent to Actually Work?

  • Writer: Ram Srinivasan
  • May 17
  • 10 min read

Updated: May 18

At MIT, one of the professors who left the deepest mark on me was Hal Gregersen.


Hal teaches leadership and innovation at Sloan, and his work centers on fearless inquiry: the discipline of finding the questions underneath the question everyone is asking.

What we learned was that the first question is usually a symptom, and the real work begins with the questions beneath it.


Consider the question I am hearing from many teams right now: How do I get my agent to actually work?


It is a practical question BUT it is incomplete. The agent will not work until four or five deeper questions get answered in the right order. Each one exposes the next. Let’s unpack this.


The first shadow question: what does “work” mean?

When someone asks how to make an agent work, they usually have a deterministic system in mind. Input goes in, output comes out, and the output is correct.


That is what “work” has meant for most software.


But agents behave differently. The same input may produce a different reasoning path each time. The model decides which tool to call, which information to retrieve, and which sub-question to pursue. The answer is NOT just a return value. It is the result of a small expedition the agent took on your behalf.


This reveals something important. Yes, we need to optimize for the right answer, but to do that we also need to ensure the agent takes a reasonable path to get there. The agent chooses the right path only under the right conditions, in the right environment.


Before we go deeper, try this yourself.


The small interactive below lets you change the environment an agent is operating inside. Pick a process, change the substrate underneath it, and run the agent. The model is not the point. The point is that the same agent behaves differently depending on what it can see, retrieve, use, remember, and do.



That is the argument in miniature.


An agent is not just answering a question. It is acting inside a constructed environment. If the environment is clean, bounded, and well-instrumented, the agent has a chance to behave well. If the environment is noisy, stale, over-permissioned, or poorly bounded, the agent can make a bad decision while appearing perfectly fluent.


In essence, you are engineering an environment where a non-deterministic actor — a system that may take different paths each time — can repeatedly make good decisions.


So what are you actually engineering?


Context.

Not just prompts. A prompt is only one part of what the model sees at the moment of decision. The bigger picture includes the available tools, retrieved information, memory from prior turns, user permissions, and instructions about when to escalate.


Get the context right and the model behaves well across a wide range of inputs. Get it wrong and prompt tuning will not save you.


What used to be called prompt engineering is evolving into context engineering. The prompt is still useful, but the larger levers sit upstream.


When an agent misbehaves in production, the first instinct is to edit the prompt. The better instinct is to ask: what did the model see when it made the bad decision?


Often, the prompt was fine. The problem likely lies elsewhere.


For example, perhaps the retrieval system surfaced the wrong document, or the tool returned forty thousand tokens of noise, or the memory layer dropped a constraint the user gave three turns ago. In each case, the model made a reasonable decision from a corrupted view of the world. We need to go deeper.
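
The question "what did the model see?" is only answerable if something recorded it. Here is a minimal sketch of that record, assuming a simple in-memory log. The names ContextSnapshot, record_snapshot, and oversized_tools are hypothetical, not part of any framework.

```python
import json
import time
from dataclasses import dataclass, field, asdict


@dataclass
class ContextSnapshot:
    """What the model could see at one decision point (hypothetical schema)."""
    prompt: str
    retrieved_doc_ids: list[str]
    tool_outputs: dict[str, int]  # tool name -> rough token count it returned
    memory_items: list[str]
    timestamp: float = field(default_factory=time.time)


def record_snapshot(log: list[str], snapshot: ContextSnapshot) -> None:
    """Append the snapshot as one JSON line so it can be inspected later."""
    log.append(json.dumps(asdict(snapshot), sort_keys=True))


def oversized_tools(snapshot: ContextSnapshot, limit: int) -> list[str]:
    """Name the tools whose output exceeded the token limit for this decision."""
    return sorted(name for name, size in snapshot.tool_outputs.items() if size > limit)
```

With a log like this, a bad decision can be traced to the wrong document, a flood of tool output, or a missing memory item before anyone touches the prompt.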


But where does context come from?

Many enterprise agents fail here, and teams often do not recognize it for several weeks.


They connect the agent to a vector database, embed company documents, call it RAG, and ship. RAG (retrieval-augmented generation) means the model answers using retrieved company information. This sounds simple, but in practice the agent retrieves documents that are technically related BUT produce answers that are technically wrong.


Retrieval is not just a similarity problem. It is also a ranking, permissions, recency, and identity problem.


Experienced builders learn three things quickly.

  • Similarity is not relevance. Two documents can be semantically close, while only one is correct. The other may be a 2022 draft that contradicts current policy. The model cannot reliably know that. The ranking layer has to help.

  • Permissions belong inside retrieval. If the agent is answering for a finance director, documents only legal should see must never enter the candidate set. If they do, the model can read them, reason over them, and quote them. That is no longer a helpful answer. It is a permissions incident.

  • Too much retrieval makes the agent worse. A model that receives forty relevant chunks can produce a more confused answer than a model that receives the four right ones. The signal gets diluted. Cap retrieval aggressively, often around five or six chunks, and spend engineering effort making those chunks right.
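
The three lessons above can be sketched as a single selection step: filter by permission, rank on more than similarity, then cap. This is a minimal illustration, not a production ranker. The Chunk fields, the select_context function, and the scoring weights are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    similarity: float              # score from the vector search, 0..1
    last_updated: date
    allowed_roles: frozenset[str]
    is_source_of_truth: bool = False


def select_context(candidates, user_role, today, top_k=5):
    """Filter by permission first, then rank, then cap the result."""
    # 1. Permissions: drop anything this user may not see before ranking,
    #    so restricted documents never enter the candidate set.
    visible = [c for c in candidates if user_role in c.allowed_roles]

    def score(chunk):
        age_years = (today - chunk.last_updated).days / 365.0
        recency = 1.0 / (1.0 + age_years)  # decays as the document ages
        authority = 0.2 if chunk.is_source_of_truth else 0.0
        # Illustrative weights only; a real system would tune these.
        return 0.6 * chunk.similarity + 0.4 * recency + authority

    # 2. Ranking: similarity is one input, alongside recency and authority.
    ranked = sorted(visible, key=score, reverse=True)
    # 3. Cap: pass only the top few chunks to the model.
    return ranked[:top_k]
```

In this sketch a 2022 draft with slightly higher similarity still ranks below the current policy, and a legal-only document never reaches a finance user regardless of its score.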


For example, one global wealth manager learned this in production. Its advisor assistant was fluent but quietly wrong, especially in volatile markets. The fix was to rebuild the context layer: add advisor-specific filters by client segment and region, add recency and source-of-truth ranking, and enforce permission-aware search. During the next market drawdown, the assistant surfaced relevant house views.

You fix hallucinations by curating what the agent is allowed to see.

This is what people mean when they say the context layer is becoming the new moat. The real value is the ranking, permissions, freshness, identity awareness, and enterprise graph (a map of how people, documents, clients, systems, and policies relate to each other) that produce a clean view of the company.


Now your agent has context, what next?

Tool use is often misunderstood.


The instinct is to treat tools like functions. The agent calls a function. The function returns a value. That is technically true, but it misses how the model experiences tools.

For the model, tools are a vocabulary.

Every tool is a verb it can think with. A search tool lets the model decide to go look for something. A read_email tool lets it say, let me check what she actually wrote. If the vocabulary is vague or overlapping, the agent chooses poorly. If the vocabulary is sharp, the agent can combine tools in useful ways you did not explicitly script.

  • Fewer, clearer tools usually beat many overlapping ones. JPMorgan learned this while building internal banker and advisor agents. Early versions exposed many endpoints for research, pricing, risk, and CRM. The agents picked the wrong one too often because the descriptions blurred together. The refactor consolidated them into higher-level verbs like get research, summarize client history, and generate client-ready deck, with descriptions written in natural language. After that, tool selection became more reliable during market stress.



  • Write tool descriptions for the model, not the developer. A tool description is not API documentation. It is a decision aid the model reads when deciding whether to use the tool. Tell it when to use the tool, when not to, and what the output will look like. The difference between a mediocre description and a good one shows up in agent behavior quickly.

  • Tools return tokens, and tokens are not free. A tool that returns a thousand database rows can fill the context window in a few calls. Then the agent has less room to reason. Every tool should support pagination, filtering, and truncation by default. For example, a useful ceiling is around twenty-five thousand tokens per response. Beyond that, the tool should summarize before returning.
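
As a rough sketch of the last two points, a tool can carry a description written as a decision aid and return its output under a token budget. Everything here is hypothetical: ToolSpec and fetch_rows are illustrative names, the word count is a crude stand-in for a real tokenizer, and this version truncates and paginates where a fuller tool might summarize.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    name: str
    use_when: str
    do_not_use_when: str
    returns: str

    def description(self) -> str:
        """Render the text the model reads when choosing a tool."""
        return (
            f"{self.name}: Use when {self.use_when}. "
            f"Do not use when {self.do_not_use_when}. "
            f"Returns {self.returns}."
        )


def fetch_rows(rows, offset=0, limit=50, max_tokens=25_000):
    """Return a bounded slice of rows plus the offset to continue from."""
    window = rows[offset:offset + limit]
    kept, used = [], 0
    for row in window:
        cost = len(row.split())  # crude proxy: one token per word
        # Always keep at least one row so the caller can make progress.
        if kept and used + cost > max_tokens:
            break
        kept.append(row)
        used += cost
    next_offset = offset + len(kept)
    # next_offset is None once every row has been returned.
    return {"rows": kept, "next_offset": next_offset if next_offset < len(rows) else None}
```

Returning next_offset lets the agent decide whether to keep reading, instead of receiving every row whether it needs them or not.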


Tools create the next problem: state. Every tool call changes what the agent knows. It reads an email, retrieves a policy, checks a price, drafts a deck, opens a ticket. Some of that information matters only for the next thirty seconds. Some of it matters for the rest of the task. Some of it should become part of what the organization knows about a client, a workflow, or a decision.


This is where many agents start to drift. If tools give the agent a vocabulary for action, memory gives it continuity.


What does the agent remember, when and why?

Memory may be the most misunderstood part of the agent stack.


The context window is NOT memory. It is the model’s working attention for a single inference call (one request to the model). Everything in that window has to be paid for again in tokens on the next turn. When a team says, “we made the context window bigger,” they have made the model’s attention more expensive. They have not given it durable memory.
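
A small worked example shows why a bigger window is a bigger bill. It assumes the simplest case: every turn adds the same number of tokens, the full window is resent on each call, and nothing is trimmed or cached. The function name is hypothetical.

```python
def cumulative_input_tokens(tokens_added_per_turn: int, turns: int) -> int:
    """Total input tokens processed when the whole window is resent each turn."""
    total = 0
    window = 0
    for _ in range(turns):
        window += tokens_added_per_turn  # the window grows every turn
        total += window                  # and all of it is read again
    return total


# Ten turns of 1,000 new tokens each: the window ends at 10,000 tokens,
# but the model has processed 55,000 input tokens along the way.
print(cumulative_input_tokens(1_000, 10))  # → 55000
```

Under these assumptions the cost grows with the square of the number of turns, which is why durable memory has to live somewhere else.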


Real memory lives outside the context window and gets retrieved when needed.


For example, EY ran into this at scale while building its global agentic operating system, deployed to more than three hundred thousand professionals across tax, assurance, and consulting. Early pilots worked, but the scaled version struggled because context windows were overloaded with transcripts and agents lost track of decisions across engagements.


The fix was structural. EY built a substrate where domain assistants shared common patterns for retrieval, tools, and memory. Structured memory was tied to core systems like engagement records, so agents could recall prior work and client constraints instead of re-reading transcripts every session.


Even the most advanced implementations show what happens when this goes wrong. Klarna’s AI‑led customer support boosted efficiency, BUT an “AI‑only” model triggered customer backlash and a partial return to human agents.


EY recently withdrew a published report after AI‑generated hallucinations. Fabricated data and citations were discovered, forcing a rethink of how such work is produced and approved.


Together, these cases show that agents demand clear governance, escalation rules, and the willingness to change course when reality disagrees with the design.

Agents are grown on a substrate, not rebuilt from scratch each time.

The memory pattern has three layers:

  • Working memory is the current context window, edited as the conversation evolves.

  • Session memory is what the agent remembers within a task. It persists across turns, but not necessarily across sessions.

  • Persistent memory is what the agent knows across sessions, and in mature systems, across users where appropriate.


Persistent memory is where institutional knowledge starts to accrue. It is also where governance becomes unavoidable. What should the agent remember? About whom? For how long? Who can inspect or delete it? These are policy questions with technical consequences.


The agent should write to memory more deliberately than it reads from it. Most teams design retrieval first and writing second. The teams whose agents improve over time design writing first. They are explicit about what gets promoted from working memory into session memory, and from session memory into persistent memory. They treat those promotions as observable events. When something goes wrong six weeks later, they can trace what the agent learned and when.
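
One way to picture write-first memory design is a store where every promotion between layers leaves an event behind. This is a minimal in-memory sketch, not a description of any particular system. Layer, MemoryStore, and the event fields are hypothetical names.

```python
from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    WORKING = 1
    SESSION = 2
    PERSISTENT = 3


@dataclass
class MemoryStore:
    items: dict = field(default_factory=lambda: {layer: [] for layer in Layer})
    events: list = field(default_factory=list)  # audit trail of promotions

    def remember(self, fact: str) -> None:
        """New facts always start in working memory."""
        self.items[Layer.WORKING].append(fact)

    def promote(self, fact: str, to: Layer, reason: str) -> None:
        """Copy a fact one layer up and record why it was promoted."""
        if to is Layer.WORKING:
            raise ValueError("nothing is promoted into working memory")
        source = Layer(to.value - 1)  # promotions move exactly one layer
        if fact not in self.items[source]:
            raise ValueError(f"{fact!r} is not in {source.name} memory")
        self.items[to].append(fact)
        self.events.append(
            {"fact": fact, "from": source.name, "to": to.name, "reason": reason}
        )
```

Because each promotion carries a reason, the events list is what lets someone reconstruct, weeks later, what the agent learned and when.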


Get memory right and the agent stops having selective amnesia.


Your agent has memory, but how does it make decisions? 

Most enterprise leaders treat autonomy like a slider. That framing is too crude. The real question is what the agent owns, what it escalates, and what it must never do. The answer depends on the cost of being wrong.

Consider these three categories:

  • Some decisions are cheap and reversible: which document to retrieve, which tool to call, which sub-question to pursue. The agent should own these. Requiring human approval here destroys the value of having an agent.

  • Some decisions are expensive but recoverable: sending an email, filing a ticket, issuing a refund under a defined limit. For example, Klarna’s customer service agent operates in this zone. It handles work equivalent to roughly 853 full-time employees, resolves many chats end to end, and reportedly delivers around sixty million dollars in annual savings. The autonomy is bounded. Retrieval is constrained to current policies, “top-k” limits (caps on how many retrieved results the model receives) reduce conflicting context, and the agent escalates disputes, suspected fraud, and high-value cases.

  • Some decisions are irreversible: terminating an account, signing a contract, authorizing a payment above a threshold. The agent should not own these. JPMorgan’s COIN, which handles the equivalent of 360,000 staff hours of contract review a year, makes this boundary clear. The system reads, classifies, extracts terms, and flags issues, but it does not sign.


The boundary has to be drawn in the architecture, not only in policy documents.
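
Drawing the boundary in the architecture can be as simple as a gate every action passes through before it runs. The sketch below maps the three categories above to three outcomes. The action names, the REFUND_LIMIT value, and the decide function are invented for illustration.

```python
from enum import Enum


class Risk(Enum):
    REVERSIBLE = "cheap and reversible"
    RECOVERABLE = "expensive but recoverable"
    IRREVERSIBLE = "irreversible"


# Hypothetical action catalogue; a real one would come from the tool layer.
ACTION_RISK = {
    "retrieve_document": Risk.REVERSIBLE,
    "send_email": Risk.RECOVERABLE,
    "issue_refund": Risk.RECOVERABLE,
    "sign_contract": Risk.IRREVERSIBLE,
}

REFUND_LIMIT = 100.0  # assumed limit, for illustration only


def decide(action: str, amount: float = 0.0) -> str:
    """Return 'execute', 'escalate', or 'human_only' for a proposed action."""
    # Unknown actions get the strictest treatment by default.
    risk = ACTION_RISK.get(action, Risk.IRREVERSIBLE)
    if risk is Risk.REVERSIBLE:
        return "execute"
    if risk is Risk.RECOVERABLE:
        # The agent owns the action only inside the defined limit.
        return "execute" if amount <= REFUND_LIMIT else "escalate"
    return "human_only"
```

The useful property is the default: an action nobody classified is treated as irreversible, so new tools do not quietly expand what the agent owns.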

BUT boundaries only work if the agent knows when it is approaching one.


That requires a practical definition of confidence. Not confidence as a number the model invents, but confidence as something the system can observe from the agent’s behavior.


Asking a model, “how confident are you?” usually gives you a poorly calibrated number. It is better to observe behavior. Did the agent retrieve? Did it find what it expected? Did it loop more than usual? Did it switch tools repeatedly? Those signals are more useful than asking the model to introspect.
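
Those behavioral questions translate directly into checks over a run trace. This is a sketch under assumed thresholds. RunTrace, warning_signals, and the default values for typical_steps and max_switches are hypothetical and would need calibrating against real runs.

```python
from dataclasses import dataclass


@dataclass
class RunTrace:
    retrieval_calls: int
    retrieval_hits: int
    steps: int
    tool_sequence: list[str]


def warning_signals(trace: RunTrace, typical_steps: int = 6, max_switches: int = 3) -> list[str]:
    """Derive observable low-confidence signals from how the agent behaved."""
    signals = []
    if trace.retrieval_calls == 0:
        signals.append("no_retrieval")      # answered without looking anything up
    elif trace.retrieval_hits == 0:
        signals.append("retrieval_empty")   # looked, but found nothing
    if trace.steps > 2 * typical_steps:
        signals.append("looping")           # far more steps than usual
    # Count how often the agent changed tools between consecutive calls.
    switches = sum(
        1 for a, b in zip(trace.tool_sequence, trace.tool_sequence[1:]) if a != b
    )
    if switches > max_switches:
        signals.append("tool_thrashing")
    return signals
```

Any non-empty result is a reason to slow down or escalate, and none of it depends on asking the model how sure it feels.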

The decisions you let the agent own define what people in the organization spend their days doing. Draw the line well and agents absorb cognitive drudgery while humans focus on judgment, trust, and relationships. Draw it poorly and you get either an expensive chatbot or a compliance incident.

This is where the future-of-work question becomes concrete.


And now we can ask: how do you get an agent to actually work?

The answer is a chain.

  • The agent works when context is right.

  • Context depends on retrieval that is identity-aware, ranked, fresh, and bounded.

  • Retrieval depends on tools that give the agent a clear vocabulary for action.

  • Tools depend on memory that holds the thread across turns and sessions.

  • Memory depends on autonomy rules that reflect the cost of being wrong.

  • Autonomy depends on the organization deciding what the agent owns and what humans own.

The thesis is simple: agents are grown on a substrate of retrieval, tools, memory, and structured autonomy.

The substrate is your institutional knowledge made legible. The retrieval layer is your documents and access policies. The tool layer is what your company can do, expressed as verbs. The memory layer is what your company knows over time. The autonomy layer is where you draw the line between human and machine.


Each layer forces an operating-model decision that technology has made impossible to ignore. Two future-of-work questions now sit underneath every agent strategy:

  • What work should no longer require human attention because an agent can own it safely?

  • And what human capabilities become more valuable when agents absorb the repeatable parts of knowledge work?


The keystone question dissolves when you stop trying to answer it directly and start asking the questions underneath. The companies that do this will not always have the flashiest demos, but they will have agents that improve in production, with the audit trail to prove it. If you are building or sponsoring an agent this year, pick one workflow and walk this chain end to end.

Define the context, retrieval rules, tools, memory, and autonomy for that single use case.

Ship it, instrument it, and treat every failure as a question about the substrate, not just the model.


Until next time,

Ram

— 

Ram Srinivasan


MIT Alum | Author, The Conscious Machine | Global Future of Work and AI Adoption Leader published in Business Insider, Fortune, Harvard Business Review, MIT Executive Viewpoints and more.


A Message From Ram:

My mission is to illuminate the path toward humanity's exponential future. If you're a leader, innovator, or changemaker passionate about leveraging breakthrough technologies to create unprecedented positive impact, you're in the right place. If you know others who share this vision, please share these insights. Together, we can accelerate the trajectory of human progress.


Disclaimer:

Ram Srinivasan currently serves as an Innovation Strategist and Transformation Leader, authoring groundbreaking works including "The Conscious Machine" and the upcoming "The Exponential Human."


All views expressed on "Substrate" and across all digital channels and social media platforms are strictly personal opinions and do not represent the official positions of any organizations or entities I am affiliated with, past or present. The content shared is for informational and inspirational purposes only. These perspectives are my own and should not be construed as professional, legal, financial, technical, or strategic advice. Any decisions made based on this information are solely the responsibility of the reader.


While I strive to ensure accuracy and timeliness in all communications, the rapid pace of technological change means that some information may become outdated. I encourage readers to conduct their own due diligence and seek appropriate professional advice for their specific circumstances.

 
 