The core difference from a chatbot
A standard LLM like GPT or Claude, when used as a chatbot, runs a single inference pass: it takes your message plus conversation history, generates a response, and stops. The model cannot call external services, take actions in the world, or run for more than the duration of that single generation.
An agent wraps that same LLM in a loop. Rather than generating a final answer directly, the LLM generates an intermediate step — either a thought about what to do next, or an action to take (calling a tool). The result of that action comes back as an observation. The LLM then generates the next step. This continues until the task is complete. IBM's technical documentation for AI agents in 2026 frames it precisely: 'AI agents use tool calling on the backend to obtain up-to-date information, optimize workflows and create subtasks autonomously.'
This loop — Thought → Action → Observation → repeat — is called the ReAct pattern (Reasoning + Acting), introduced in a 2022 Google Research paper and now standard across every major agent framework including LangGraph, CrewAI, AutoGen, and Hermes Agent. It is the architecture that turns a language model from an answer generator into a task executor.
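The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any real framework's API: `fake_llm` stands in for a model call that decides the next step, and the `tools` dict stands in for real tool integrations.

```python
# Minimal ReAct-style loop: the LLM proposes a step, the framework executes
# actions, and observations are fed back until the model emits a final answer.

def fake_llm(history):
    # Stand-in for a model call: inspect the transcript, decide the next step.
    if not any(step.startswith("Observation") for step in history):
        return ("action", "lookup", "capital of France")
    return ("final", "The capital of France is Paris.", None)

def run_agent(task, llm, tools, max_steps=5):
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        kind, payload, arg = llm(history)
        if kind == "final":              # the model decided the task is done
            return payload
        result = tools[payload](arg)     # the framework, not the LLM, runs the tool
        history.append(f"Action: {payload}({arg})")
        history.append(f"Observation: {result}")
    return "step limit reached"

tools = {"lookup": lambda q: "Paris" if "France" in q else "unknown"}
print(run_agent("What is the capital of France?", fake_llm, tools))
```

Note the `max_steps` guard: even this toy version terminates a loop that never converges, a defense that reappears in the failure-mode discussion below.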
What tool calling actually is
When the LLM generates an 'action' step, it is generating a structured function call — a JSON payload specifying a tool name and arguments. The agent framework intercepts this output, executes the actual function, and feeds the result back to the LLM. The LLM does not directly execute code; it generates the call specification and the framework runs it.
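Concretely, the intercept-and-execute step looks something like the following sketch. The tool name, payload shape, and `REGISTRY` dict are illustrative assumptions; real frameworks differ in detail but follow the same parse → dispatch → observe pattern.

```python
# The model's output is only a JSON call specification. The framework parses
# it, looks up the real function, executes it, and returns the observation.
import json

def get_weather(city):
    return f"Sunny in {city}"   # stand-in for a real API call

REGISTRY = {"get_weather": get_weather}

# What the LLM actually generated — text, not executed code:
model_output = '{"tool": "get_weather", "arguments": {"city": "Berlin"}}'

call = json.loads(model_output)                            # parse the spec
observation = REGISTRY[call["tool"]](**call["arguments"])  # framework executes
print(observation)   # this string is fed back to the LLM as the observation
```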
Typical tools available to an agent: web search (returns search results as text), browser control (navigates to URLs, clicks, fills forms), terminal execution (runs shell commands and returns stdout/stderr), file read/write, API calls (sends HTTP requests to external APIs), memory retrieval (vector similarity search over stored facts), and code execution (runs Python/JavaScript in a sandbox).
Claude Sonnet 4.6 and GPT-5.4 both support native tool calling — the model has been specifically trained to generate valid function call outputs reliably. Older models like GPT-3.5 required extensive prompt engineering to produce consistent tool-call JSON. This improvement in the underlying models is a major reason production agent reliability in 2025–2026 is substantially better than it was in 2023.
How planning works: single-agent vs multi-agent
For simple tasks — look up a fact, summarize a page, run a script — a single agent loop handles the job directly. The LLM plans within the context window: it reasons about what tools to call, calls them in sequence, and produces an output.
For complex multi-step tasks, modern frameworks use explicit planning steps. The agent first generates a full plan (a list of subtasks) before executing any of them. This matters because once execution starts, the model's context fills with action/observation pairs; a written plan to refer back to keeps the overall goal from getting lost in that accumulation.
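A plan-then-execute sketch, under the same caveat as before: `plan` and `execute` are stubs for LLM calls, and passing the full plan back into each step is one simple way to keep the goal visible.

```python
# Generate the complete subtask list first, then execute; the written plan
# is re-supplied at every step as a stable reference point.

def plan(task):
    # Stand-in for an LLM planning call that returns a subtask list.
    return ["search for sources", "extract key facts", "write summary"]

def execute(subtask, plan_text):
    # Stand-in for one tool-using loop; plan_text rides along in the prompt
    # so the overall goal survives as observations accumulate.
    return f"done: {subtask}"

def run(task):
    subtasks = plan(task)                 # plan before executing anything
    plan_text = "\n".join(subtasks)
    return [execute(s, plan_text) for s in subtasks]

print(run("summarize recent research"))
```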
Multi-agent architectures go further: task decomposition happens across specialized agents. An orchestrator agent breaks a research task into subtasks and assigns them to a researcher agent, a coder agent, and a writer agent — each running its own action loop in parallel. The orchestrator collects outputs and synthesizes the result. Hermes Agent supports this via subagent delegation — the primary agent can spawn up to 3 concurrent subagents and aggregate their outputs. LangGraph, CrewAI, and AutoGen each implement similar patterns with different tradeoffs in flexibility vs. ease of setup.
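The orchestrator pattern can be sketched with a thread pool standing in for concurrent subagent loops. The role names and the `worker` stub are assumptions for illustration, not any framework's actual delegation API.

```python
# Orchestrator sketch: fan subtasks out to specialized "agents" running
# concurrently, then synthesize the collected outputs.
from concurrent.futures import ThreadPoolExecutor

def worker(role, subtask):
    return f"[{role}] finished: {subtask}"   # stand-in for a full agent loop

def orchestrate(task):
    assignments = [("researcher", "gather sources"),
                   ("coder", "run analysis script"),
                   ("writer", "draft report")]
    with ThreadPoolExecutor(max_workers=3) as pool:   # subagents in parallel
        outputs = list(pool.map(lambda a: worker(*a), assignments))
    return " | ".join(outputs)                        # synthesis step

print(orchestrate("research task"))
```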
Memory: what the agent knows and when
Agent memory operates at multiple layers. In-context memory is whatever fits in the current context window — the conversation history, the task instruction, the action/observation pairs from the current session. This is limited and temporary. Claude Sonnet 4.6's 1M token context sounds vast, but a heavily tool-using agent can fill hundreds of thousands of tokens in a long session, and the compute cost of full-context inference rises accordingly.
External memory — vector stores, knowledge bases, conversation archives — is retrieved selectively. Before each inference step, the agent runs a similarity search over its stored memory to retrieve the K most relevant facts, skill documents, or past observations. These are injected into the context window alongside the current task state. This retrieval step is what allows an agent to reference facts and experiences from months ago without keeping them all in context simultaneously.
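The retrieve-then-inject step reduces to: embed the query, rank stored memories by similarity, take the top K. In the sketch below a toy bag-of-words counter stands in for a real embedding model, and the `MEMORY` list stands in for a vector store — both are illustrative assumptions.

```python
# Per-step memory retrieval: score stored facts by cosine similarity to the
# query and inject the top-K into the context alongside the task state.
import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())   # toy stand-in for an embedding

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

MEMORY = [
    "user prefers metric units",
    "project deadline is in March",
    "the build runs on Python 3.12",
]

def retrieve(query, k=2):
    ranked = sorted(MEMORY, key=lambda m: cosine(embed(query), embed(m)),
                    reverse=True)
    return ranked[:k]   # these lines get injected into the prompt

print(retrieve("what units does the user prefer"))
```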
The 2026 standard for production agent memory is a dual-layer architecture: a Hot Path (recent messages plus summarized state) paired with a Cold Path (external retrieval from Zep, Mem0, Pinecone, or similar). A Memory Node synthesizes what to save after each turn. Digital Applied's January 2026 technical guide notes that even 200K–400K token windows (Claude, GPT-5.4) are impractical for full history due to cost and latency — external episodic memory remains mandatory for production agents regardless of context window size.
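The Hot Path / Cold Path split described above can be sketched as a small class: recent turns stay verbatim, older turns are folded into a rolling summary, and every turn is also archived externally. The eviction rule and the crude truncation-as-summarization are assumptions; a real Memory Node would use an LLM call to synthesize what to save.

```python
# Dual-layer memory sketch: a hot window of verbatim recent turns plus a
# summarized state, with everything mirrored to a cold archive for retrieval.

HOT_WINDOW = 3   # how many recent turns to keep verbatim

class Memory:
    def __init__(self):
        self.hot = []        # recent messages, verbatim
        self.summary = ""    # rolling summarized state
        self.cold = []       # external archive (stand-in for a vector store)

    def add_turn(self, turn):
        self.cold.append(turn)              # cold path: archive everything
        self.hot.append(turn)
        while len(self.hot) > HOT_WINDOW:   # hot path: evict into the summary
            evicted = self.hot.pop(0)
            # Crude truncation stands in for LLM-based summarization.
            self.summary = (self.summary + " " + evicted[:20]).strip()

    def context(self):
        return {"summary": self.summary, "recent": list(self.hot)}

mem = Memory()
for i in range(5):
    mem.add_turn(f"turn {i}: some message content")
print(mem.context())
```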
What makes agents fail
The three most common failure modes in production agent deployments: (1) tool call errors accumulating — when a tool call fails and the agent does not handle the error correctly, it can spiral into repeated retry loops or incorrect reasoning, (2) context fill — for long-running tasks, the action/observation history fills the context window, and the model starts losing the thread of the original task goal, (3) hallucinated tool calls — models occasionally generate tool calls with invalid arguments, or fabricate results rather than actually calling the tool. The last failure mode is the most dangerous in high-stakes tasks.
Practical defenses: structured output enforcement (requiring tool calls to pass schema validation before execution), step limits (terminating an agent loop that has exceeded a maximum step count), human-in-the-loop checkpoints for irreversible actions, and explicit error-handling instructions in the system prompt. Hermes v0.5.0 adds checkpoint/rollback — the /rollback command reverts file changes if the agent takes incorrect actions during code or file editing tasks.
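Schema validation in particular is cheap and catches both malformed arguments and hallucinated tool names before anything executes. The sketch below uses a hand-rolled type check for brevity; production systems typically use a JSON Schema validator, and the tool names here are invented for illustration.

```python
# Validate a generated tool call against a registered schema before the
# framework is allowed to execute it.

SCHEMAS = {"delete_file": {"path": str}}   # required args and their types

def validate(call):
    schema = SCHEMAS.get(call.get("tool"))
    if schema is None:
        return False                       # hallucinated / unregistered tool
    args = call.get("arguments", {})
    return set(args) == set(schema) and all(
        isinstance(args[k], t) for k, t in schema.items()
    )

good = {"tool": "delete_file", "arguments": {"path": "/tmp/x"}}
bad  = {"tool": "delete_file", "arguments": {"path": 42}}   # wrong type
fake = {"tool": "format_disk", "arguments": {}}             # not registered

print(validate(good), validate(bad), validate(fake))
```

Only calls that pass validation reach execution; everything else is returned to the model as an error observation, which also breaks the retry-spiral failure mode described above.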
These failure modes are why the 'autonomous agent does everything' framing is premature for many production use cases. The practical approach in 2026 is to identify tasks that are verifiable (the agent can confirm its output is correct), reversible (mistakes can be undone), and low consequence per error — and start there. Automate out from that core as reliability is confirmed.