Not magic. A loop, some tools, and memory.

How AI agents actually work: the reasoning loop, tool use, and planning

An AI agent is not a smarter chatbot. The architecture is fundamentally different. Here is exactly what happens when an agent receives a task — from the first inference call to the completed result.

Hermes OS team · 3 April 2026 · 11 min read

The core difference from a chatbot

A standard LLM like GPT or Claude, when used as a chatbot, runs a single inference pass: it takes your message plus conversation history, generates a response, and stops. The model cannot call external services, take actions in the world, or run for more than the duration of that single generation.

An agent wraps that same LLM in a loop. Rather than generating a final answer directly, the LLM generates an intermediate step — either a thought about what to do next, or an action to take (calling a tool). The result of that action comes back as an observation. The LLM then generates the next step. This continues until the task is complete. IBM's technical documentation for AI agents in 2026 frames it precisely: 'AI agents use tool calling on the backend to obtain up-to-date information, optimize workflows and create subtasks autonomously.'
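The loop can be sketched in a few lines. This is a minimal illustration, not any framework's actual implementation: `llm_step` stands in for a real model call, and the single `search` tool is a toy.

```python
# Minimal ReAct-style loop: the model proposes a step; the framework
# executes tool calls and feeds results back as observations.

def llm_step(history):
    # Stand-in for a real inference call: decide the next step from history.
    if not any(kind == "observation" for kind, _ in history):
        return {"type": "action", "tool": "search", "args": {"q": "agent loop"}}
    return {"type": "final", "answer": "done"}

TOOLS = {"search": lambda q: f"results for {q!r}"}

def run_agent(task, max_steps=10):
    history = [("task", task)]
    for _ in range(max_steps):
        step = llm_step(history)
        if step["type"] == "final":
            return step["answer"]            # task complete
        result = TOOLS[step["tool"]](**step["args"])   # Action
        history.append(("observation", result))        # Observation
    raise RuntimeError("step limit exceeded")
```

Note the `max_steps` guard: every production loop has one, because a model that never emits a final answer would otherwise run forever.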

This loop — Thought → Action → Observation → repeat — is called the ReAct pattern (Reasoning + Acting), introduced in a 2022 Google Research paper and now standard across every major agent framework including LangGraph, CrewAI, AutoGen, and Hermes Agent. It is the architecture that turns a language model from an answer generator into a task executor.

What tool calling actually is

When the LLM generates an 'action' step, it is generating a structured function call — a JSON payload specifying a tool name and arguments. The agent framework intercepts this output, executes the actual function, and feeds the result back to the LLM. The LLM does not directly execute code; it generates the call specification and the framework runs it.
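Concretely, the handoff looks like this. The payload shape below is illustrative rather than any vendor's exact schema, and `get_weather` is a hypothetical tool:

```python
import json

# The model emits a call specification as JSON; the framework parses it,
# dispatches to the real function, and returns the result as an observation.

def get_weather(city):
    return f"18C and clear in {city}"

TOOLS = {"get_weather": get_weather}

# What the model generated (text), not code it executed itself:
model_output = '{"tool": "get_weather", "arguments": {"city": "Oslo"}}'

call = json.loads(model_output)
observation = TOOLS[call["tool"]](**call["arguments"])
```

The separation matters: the model only ever produces text, and the framework decides whether and how to act on it.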

Typical tools available to an agent: web search (returns search results as text), browser control (navigates to URLs, clicks, fills forms), terminal execution (runs shell commands and returns stdout/stderr), file read/write, API calls (sends HTTP requests to external APIs), memory retrieval (vector similarity search over stored facts), and code execution (runs Python/JavaScript in a sandbox).

Claude Sonnet 4.6 and GPT-5.4 both support native tool calling — each model has been specifically trained to generate valid function-call outputs reliably. Older models like GPT-3.5 required extensive prompt engineering to produce consistent tool-call JSON. This underlying model quality improvement is a significant reason why production agent reliability has improved substantially in 2025-2026 versus 2023.

How planning works: single-agent vs multi-agent

For simple tasks — look up a fact, summarize a page, run a script — a single agent loop handles the job directly. The LLM plans within the context window: it reasons about what tools to call, calls them in sequence, and produces an output.

For complex multi-step tasks, modern frameworks use explicit planning steps. The agent first generates a full plan (a list of subtasks) before executing any of them. This matters because once execution starts, the model's context is filling with action/observation pairs. Having a written plan to reference prevents the model from losing track of the overall goal as its context fills up.
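A plan-then-execute sketch, with hypothetical `plan_task` and `execute_subtask` stand-ins for model calls:

```python
# Plan first, then execute: the full subtask list is generated up front
# and passed into every execution step, so the overall goal stays in
# view even as the context fills with action/observation pairs.

def plan_task(goal):
    # Stand-in for a planning inference call.
    return [f"research {goal}", f"draft {goal}", f"review {goal}"]

def execute_subtask(subtask, plan):
    # The written plan travels with each step as a reference.
    return f"done: {subtask}"

def run(goal):
    plan = plan_task(goal)
    return [execute_subtask(step, plan) for step in plan]
```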

Multi-agent architectures go further: task decomposition happens across specialized agents. An orchestrator agent breaks a research task into subtasks and assigns them to a researcher agent, a coder agent, and a writer agent — each running their own action loops in parallel. The orchestrator collects outputs and synthesizes the result. Hermes Agent supports this via subagent delegation — the primary agent can spawn up to 3 concurrent subagents and aggregate their outputs. LangGraph, CrewAI, and AutoGen each implement similar patterns with different tradeoffs in flexibility vs. ease of setup.
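The orchestrator pattern can be sketched with a thread pool. The three specialist functions below are hypothetical stand-ins for full agent loops, and joining strings stands in for a real synthesis step:

```python
from concurrent.futures import ThreadPoolExecutor

# Orchestrator sketch: decompose, fan out to specialist subagents in
# parallel (capped at 3 workers, mirroring the limit described above),
# then collect and synthesize the outputs.

def researcher(task): return f"notes on {task}"
def coder(task):      return f"script for {task}"
def writer(task):     return f"summary of {task}"

def orchestrate(task):
    subagents = [researcher, coder, writer]
    with ThreadPoolExecutor(max_workers=3) as pool:
        outputs = list(pool.map(lambda agent: agent(task), subagents))
    return " | ".join(outputs)  # stand-in for the synthesis step
```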

Memory: what the agent knows and when

Agent memory operates at multiple layers. In-context memory is whatever fits in the current context window — the conversation history, the task instruction, the action/observation pairs from the current session. This is limited and temporary. Claude Sonnet 4.6's 1M token context sounds vast, but a heavily tool-using agent can fill hundreds of thousands of tokens in a long session, and the compute cost of full-context inference rises accordingly.

External memory — vector stores, knowledge bases, conversation archives — is retrieved selectively. Before each inference step, the agent runs a similarity search over its stored memory to retrieve the K most relevant facts, skill documents, or past observations. These are injected into the context window alongside the current task state. This retrieval step is what allows an agent to reference facts and experiences from months ago without keeping them all in context simultaneously.
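The retrieval step is cosine similarity over embeddings. A toy version with two-dimensional vectors (real systems use a vector database and embeddings with hundreds of dimensions):

```python
import math

# Top-K retrieval sketch: rank stored facts by cosine similarity
# to the query embedding, inject the best matches into context.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

MEMORY = [
    ("user prefers metric units",  [0.9, 0.1]),
    ("project deadline is Friday", [0.1, 0.9]),
    ("user lives in Oslo",         [0.8, 0.3]),
]

def retrieve(query_vec, k=2):
    ranked = sorted(MEMORY, key=lambda m: cosine(query_vec, m[1]), reverse=True)
    return [fact for fact, _ in ranked[:k]]
```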

The 2026 standard for production agent memory is a dual-layer architecture: a Hot Path (recent messages plus summarized state) paired with a Cold Path (external retrieval from Zep, Mem0, Pinecone, or similar). A Memory Node synthesizes what to save after each turn. Digital Applied's January 2026 technical guide notes that even 200K–400K token windows (Claude, GPT-5.4) are impractical for full history due to cost and latency — external episodic memory remains mandatory for production agents regardless of context window size.
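A toy version of that dual-layer idea. The eviction and recall heuristics here are deliberately naive placeholders; a real memory node would summarize turns and use embedding search against an external store:

```python
# Dual-layer memory sketch: a hot path of recent turns kept verbatim,
# plus a cold path that overflow is written to and recalled from
# selectively at query time.

class AgentMemory:
    def __init__(self, hot_limit=4):
        self.hot = []            # recent messages, always in context
        self.cold = []           # external store, retrieved selectively
        self.hot_limit = hot_limit

    def add_turn(self, message):
        self.hot.append(message)
        if len(self.hot) > self.hot_limit:
            evicted = self.hot.pop(0)
            self.cold.append(evicted)   # memory node decides what to save

    def context(self, query=""):
        # Naive recall: substring match stands in for similarity search.
        recalled = [m for m in self.cold if query and query in m]
        return recalled + self.hot
```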

What makes agents fail

The three most common failure modes in production agent deployments:

1. Tool call errors accumulating. When a tool call fails and the agent does not handle the error correctly, it can spiral into repeated retry loops or incorrect reasoning.
2. Context fill. For long-running tasks, the action/observation history fills the context window, and the model starts losing the thread of the original task goal.
3. Hallucinated tool calls. Models occasionally generate tool calls with invalid arguments, or fabricate results rather than actually calling the tool.

The last failure mode is the most dangerous in high-stakes tasks.

Real defenses: structured output enforcement (requiring tool calls to pass schema validation before execution), step limits (terminating an agent loop that has exceeded a maximum step count), human-in-the-loop checkpoints for irreversible actions, and explicit error handling instructions in the system prompt. Hermes v0.5.0 adds checkpoint/rollback — the /rollback command reverts file changes if the agent takes incorrect actions during code or file editing tasks.
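Two of those defenses, schema validation and unknown-tool rejection, fit in a few lines. The schema shape below is illustrative, not any framework's actual format:

```python
# Validate a model-generated tool call before executing it:
# reject unknown tool names and argument sets that don't match the schema.

SCHEMAS = {
    "delete_file": {"required": {"path"}, "allowed": {"path"}},
}

def validate_call(call):
    schema = SCHEMAS.get(call.get("tool"))
    if schema is None:
        return False  # hallucinated tool name
    args = set(call.get("arguments", {}))
    # Every required argument present, no arguments outside the schema.
    return schema["required"] <= args <= schema["allowed"]
```

Only calls that pass validation reach the executor; everything else goes back to the model as an error observation rather than being run.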

These failure modes are why the 'autonomous agent does everything' framing is premature for many production use cases. The practical approach in 2026 is to identify tasks that are verifiable (the agent can confirm its output is correct), reversible (mistakes can be undone), and low consequence per error — and start there. Automate out from that core as reliability is confirmed.

Common questions

What is the ReAct pattern in AI agents?

ReAct (Reasoning + Acting) is the standard agent loop: the model alternates between generating Thoughts (reasoning about what to do next) and Actions (tool calls). After each action, the tool result is fed back as an Observation, and the model reasons about the next step. This continues until the task is complete or a stop condition is reached.

How is an AI agent different from an AI chatbot?

A chatbot runs a single inference pass and returns an answer. An agent runs a loop: it calls tools, receives results, reasons about what to do next, and repeats — potentially for dozens of steps — until a complex task is fully executed. The agent can take real actions in external systems; a chatbot cannot.

What tools does a typical AI agent have access to?

Web search, browser control (clicking, form filling, page navigation), terminal/shell execution, file system access, API calls to external services, code execution sandboxes, image/vision analysis, and memory retrieval from external stores. The exact tool set depends on the framework and how it is configured.

Do AI agents actually understand what they're doing?

The model generates plausible next steps based on patterns in its training. It does not have subjective understanding, but it can reason about tasks, decompose them into steps, handle errors, and adjust its approach based on tool feedback. Whether that constitutes 'understanding' is a philosophical question that does not change the practical outcome: well-designed agents complete complex multi-step tasks reliably when the task is within scope.

How do agents handle tasks that take hours to complete?

Long-running agents checkpoint their state periodically — storing the current task plan, completed steps, and relevant memory to a database. If the process is interrupted, it can resume from the last checkpoint rather than starting over. Hermes Agent implements this via LangGraph-style checkpointing with PostgresSaver for thread-scoped state persistence.
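The checkpoint-and-resume idea in miniature, persisting to a JSON file for illustration (a production setup would use a database, as described above):

```python
import json
import os

# Checkpoint sketch: save the plan and completed steps after each step,
# so an interrupted run resumes from where it stopped instead of restarting.

def run_with_checkpoints(plan, path):
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)        # resume from last checkpoint
    else:
        state = {"plan": plan, "done": []}
    for step in state["plan"]:
        if step in state["done"]:
            continue                    # completed before the interruption
        # ... execute the step here ...
        state["done"].append(step)
        with open(path, "w") as f:
            json.dump(state, f)         # checkpoint after every step
    return state["done"]
```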

Deploy in 5 minutes.

7-day money-back guarantee. BYO AI key. From $19/mo.

Start Now
Related reading

- AI agents vs chatbots: the actual difference
- How persistent memory works in AI agents
- Multi-agent systems: how they're built in 2026
- Feature: Persistent Memory
- Feature: Browser Automation
- Feature: Scheduled Tasks