The difference from a chatbot
A standard LLM used as a chatbot runs a single inference pass: it takes your message plus conversation history, generates a response, and stops. The model cannot call external services, take actions in the world, or run for more than the duration of that single generation.
An agent wraps that same LLM in a loop. Rather than generating a final answer directly, the LLM generates an intermediate step — either a thought about what to do next, or an action to take (calling a tool). The result of that action comes back as an observation. The LLM then generates the next step. This continues until the task is complete. IBM's technical documentation for AI agents frames it this way: 'AI agents use tool calling on the backend to obtain up-to-date information, optimize workflows and create subtasks autonomously.'
This loop — Thought → Action → Observation → repeat — is called the ReAct pattern (Reasoning + Acting), introduced in a 2022 paper from Google Research and Princeton, and now standard across every major agent framework including LangGraph, CrewAI, AutoGen, and Hermes Agent. It is the architecture that turns a language model from an answer generator into a task executor.
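The loop above can be sketched in a few lines. This is a minimal illustration, not any framework's real API: `call_llm` is a hypothetical stand-in that scripts two steps so the loop runs end to end, and the single `web_search` tool is a toy.

```python
import json

def call_llm(transcript):
    # Hypothetical stand-in for a real model API call. It emits an action
    # first, then a final answer once an observation is in the transcript.
    if "Observation:" not in transcript:
        return json.dumps({"type": "action", "tool": "web_search",
                           "args": {"query": "ReAct pattern"}})
    return json.dumps({"type": "final",
                       "answer": "ReAct interleaves reasoning and tool use."})

TOOLS = {"web_search": lambda query: f"results for {query!r}"}  # toy tool

def run_agent(task, max_steps=10):
    transcript = f"Task: {task}"
    for _ in range(max_steps):              # step limit guards against runaway loops
        step = json.loads(call_llm(transcript))
        if step["type"] == "final":         # model decided the task is complete
            return step["answer"]
        result = TOOLS[step["tool"]](**step["args"])      # Action -> Observation
        transcript += f"\nAction: {step['tool']}\nObservation: {result}"
    return "step limit reached"
```

The essential point is structural: the model only ever produces text; the loop around it decides whether that text is an action to execute or an answer to return.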
What tool calling actually is
When the LLM generates an 'action' step, it generates a structured function call — a JSON payload specifying a tool name and arguments. The agent framework intercepts this output, runs the actual function, and feeds the result back to the LLM. The LLM does not directly execute code. It generates the call specification and the framework executes it.
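The intercept-execute-feed-back step can be sketched as follows. The payload shape (`tool` plus `arguments`) is illustrative; real APIs such as OpenAI's or Anthropic's use different field names, and `run_shell` is a hypothetical tool.

```python
import json
import subprocess

def run_shell(command):
    # The framework, not the model, actually runs the command.
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    return {"stdout": proc.stdout, "stderr": proc.stderr,
            "returncode": proc.returncode}

TOOLS = {"run_shell": run_shell}

# Illustrative model output in the shape described above.
raw_model_output = '{"tool": "run_shell", "arguments": {"command": "echo hello"}}'

call = json.loads(raw_model_output)                     # 1. intercept structured output
observation = TOOLS[call["tool"]](**call["arguments"])  # 2. execute the real function
feedback = json.dumps(observation)                      # 3. serialize result back into context
```

The model never touches `subprocess`; it only emits the JSON on the first line, and the framework appends `feedback` to the transcript for the next inference step.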
Typical tools: web search (returns search results as text), browser control (navigates to URLs, clicks, fills forms), terminal execution (runs shell commands, returns stdout/stderr), file read/write, API calls to external services, memory retrieval (vector similarity search over stored facts), and code execution in a sandbox.
Claude Sonnet 4.6 and GPT-5.4 both support native tool calling — the models are specifically trained to generate valid function call outputs reliably. Older models like GPT-3.5 required extensive prompt engineering to produce consistent tool-call JSON. That underlying model improvement is a large part of why production agent reliability is substantially better in 2025-2026 than in 2023.
How planning works
For simple tasks — look up a fact, summarize a page, run a script — a single agent loop handles it directly. The LLM plans within the context window, calls tools in sequence, and produces an output. No explicit planning step required.
For complex multi-step tasks, frameworks use explicit planning phases. The agent first generates a full plan (a list of subtasks) before executing any of them. This matters because once execution starts, the model's context fills with action/observation pairs. Having a written plan to reference keeps the original goal anchored even as that history accumulates.
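A plan-then-execute phase can be sketched like this. Both `plan_llm` and `execute` are hypothetical stand-ins: the first would normally ask the model for a subtask list, and the second would run a full agent loop per subtask.

```python
def plan_llm(task):
    # Stand-in: a real call would prompt the model to decompose the task.
    return ["gather sources", "draft summary", "verify citations"]

def execute(subtask):
    # Stand-in for a full Thought -> Action -> Observation loop per subtask.
    return f"done: {subtask}"

def plan_and_execute(task):
    plan = plan_llm(task)        # full plan is written before any execution
    results = []
    for subtask in plan:
        # The written plan stays available verbatim, so the original goal
        # survives even as action/observation pairs fill the context.
        results.append(execute(subtask))
    return plan, results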
Multi-agent architectures split the task across specialized agents. An orchestrator breaks a research task into subtasks and assigns them to a researcher agent, a coder agent, and a writer agent — each running their own action loops in parallel. The orchestrator collects and synthesizes the outputs. Hermes Agent supports this via subagent delegation — the primary agent can spawn up to 3 concurrent subagents and aggregate results. LangGraph, CrewAI, and AutoGen implement similar patterns with different tradeoffs in flexibility versus setup complexity.
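An orchestrator fanning out to parallel subagents can be sketched with a thread pool. The three subagent functions are toy stand-ins for full agent loops; the `max_workers=3` cap mirrors the concurrency limit described above.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical subagents; each would run its own full action loop.
def researcher(task):
    return f"notes on {task}"

def coder(task):
    return f"script for {task}"

def writer(task):
    return f"draft about {task}"

def orchestrate(task):
    subagents = [researcher, coder, writer]
    # Run up to 3 subagents concurrently, then synthesize their outputs.
    with ThreadPoolExecutor(max_workers=3) as pool:
        outputs = list(pool.map(lambda agent: agent(task), subagents))
    return "\n".join(outputs)
```

In a real deployment the synthesis step would itself be an LLM call; here it is a simple join to keep the sketch self-contained.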
Memory: what the agent knows and when
Agent memory operates at multiple layers. In-context memory is whatever fits in the current context window — conversation history, task instruction, action/observation pairs from the current session. Limited and temporary. Claude Sonnet 4.6's 1M token context sounds vast, but a heavily tool-using agent can consume hundreds of thousands of tokens in a long session, and inference cost rises with context length.
External memory — vector stores, knowledge bases, conversation archives — is retrieved selectively. Before each inference step, the agent runs a similarity search over stored memory and retrieves the most relevant facts, skill documents, or past observations, which are injected into the context window. This allows the agent to reference experiences from months ago without keeping them all in context simultaneously.
The 2026 standard for production agent memory is a dual-layer architecture: a Hot Path (recent messages plus summarized state) paired with a Cold Path (external retrieval from Zep, Mem0, Pinecone, or similar). Digital Applied's January 2026 technical guide notes that even 200K-400K token windows are impractical for full history due to cost and latency — external episodic memory remains mandatory for production agents regardless of context window size.
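Assembling a prompt from both layers can be sketched as follows. The cosine scorer and the toy 3-dimensional embeddings stand in for a real vector store such as the ones named above; the hot path is recent turns plus a running summary.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

COLD_STORE = [  # (embedding, memory text); toy vectors for illustration
    ([1.0, 0.0, 0.0], "user prefers JSON output"),
    ([0.0, 1.0, 0.0], "deploy script lives in scripts/deploy.sh"),
]

def build_context(summary, recent_turns, query_embedding, k=1):
    # Cold path: top-k most similar stored memories.
    scored = sorted(COLD_STORE,
                    key=lambda m: cosine(m[0], query_embedding), reverse=True)
    retrieved = [text for _, text in scored[:k]]
    # Hot path: summarized state plus the most recent messages.
    hot = [f"Summary: {summary}"] + recent_turns
    return "\n".join(hot + [f"Memory: {t}" for t in retrieved])
```

Only the top-k retrieved memories enter the window, so cold-store size is decoupled from per-step token cost.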
What makes agents fail
The three most common failure modes in production agent deployments are accumulating tool-call errors (when a call fails and the agent doesn't handle the error correctly, it can spiral into retry loops or incorrect reasoning), context fill (for long-running tasks, the action/observation history fills the window and the model loses the thread of the original goal), and hallucinated tool calls (models occasionally generate calls with invalid arguments, or fabricate results rather than actually calling the tool — the most dangerous failure mode in high-stakes tasks).
Real defenses: structured output enforcement (requiring tool calls to pass schema validation before execution), step limits (terminating a loop that has exceeded a maximum count), human-in-the-loop checkpoints for irreversible actions, and explicit error handling instructions in the system prompt. Hermes v0.5.0 adds checkpoint/rollback — the /rollback command reverts file changes if the agent takes incorrect actions during code or file editing tasks.
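Structured-output enforcement can be sketched as a validation gate in front of execution. The schema format here is illustrative, not any framework's real API; a failed check is fed back to the model as an error observation instead of being executed, which breaks the retry-spiral pattern.

```python
import json

# Illustrative per-tool schemas: required argument names and types.
TOOL_SCHEMAS = {
    "read_file": {"required": {"path": str}},
}

def validate_call(raw):
    """Return (call, None) if valid, or (None, error_message) if not."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None, "invalid JSON"
    schema = TOOL_SCHEMAS.get(call.get("tool"))
    if schema is None:
        # Catches hallucinated tool names before anything runs.
        return None, f"unknown tool: {call.get('tool')}"
    args = call.get("arguments", {})
    for name, typ in schema["required"].items():
        if not isinstance(args.get(name), typ):
            return None, f"bad argument: {name}"
    return call, None
```

Combined with a step limit like the `max_steps` counter common to agent loops, this covers two of the three failure modes mechanically; irreversible actions still need a human-in-the-loop checkpoint.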
These failure modes are why 'the autonomous agent does everything' framing is premature for many production use cases. The practical approach in 2026: identify tasks that are verifiable (the agent can confirm its output is correct), reversible (mistakes can be undone), and low consequence per error — start there, and automate outward as reliability is confirmed.