Not magic. A loop, some tools, and memory.

How AI agents actually work: the reasoning loop, tool use, and planning

An AI agent is not a smarter chatbot. The architecture is fundamentally different. Here is exactly what happens when an agent receives a task — from the first inference call to the completed result.

Hermes OS team · 3 April 2026 · 11 min read

The difference from a chatbot

A standard LLM used as a chatbot runs a single inference pass: it takes your message plus conversation history, generates a response, and stops. The model cannot call external services, take actions in the world, or run for more than the duration of that single generation.

An agent wraps that same LLM in a loop. Rather than generating a final answer directly, the LLM generates an intermediate step — either a thought about what to do next, or an action to take (calling a tool). The result of that action comes back as an observation. The LLM then generates the next step. This continues until the task is complete. IBM's technical documentation on AI agents puts it this way: 'AI agents use tool calling on the backend to obtain up-to-date information, optimize workflows and create subtasks autonomously.'

This loop — Thought → Action → Observation → repeat — is called the ReAct pattern (Reasoning + Acting), introduced in a 2022 Google Research paper and now standard across every major agent framework including LangGraph, CrewAI, AutoGen, and Hermes Agent. It is the architecture that turns a language model from an answer generator into a task executor.
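A stripped-down version of that loop looks something like the sketch below. It is illustrative rather than any framework's actual API: llm_step and run_tool are hypothetical stand-ins for a model call and a tool dispatcher.

```python
# Minimal ReAct-style loop (illustrative sketch, not a real framework API).
# llm_step() and run_tool() are hypothetical stand-ins for a model call and
# a tool dispatcher; a real agent framework supplies both.

def run_agent(task: str, max_steps: int = 20) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = llm_step(history)  # the model proposes a thought/action or a final answer
        if step["type"] == "final_answer":
            return step["content"]
        # The model asked for a tool call: execute it and feed the result
        # back as an observation for the next iteration.
        observation = run_tool(step["tool"], step["arguments"])
        history.append({"role": "assistant", "content": str(step)})
        history.append({"role": "tool", "content": observation})
    return "Stopped: step limit reached before the task completed."
```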

What tool calling actually is

When the LLM generates an 'action' step, it generates a structured function call — a JSON payload specifying a tool name and arguments. The agent framework intercepts this output, runs the actual function, and feeds the result back to the LLM. The LLM does not directly execute code. It generates the call specification and the framework executes it.

Typical tools: web search (returns search results as text), browser control (navigates to URLs, clicks, fills forms), terminal execution (runs shell commands, returns stdout/stderr), file read/write, API calls to external services, memory retrieval (vector similarity search over stored facts), and code execution in a sandbox.
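To make the split concrete (the model generates the call, the framework executes it), here is a rough sketch. The payload format and the tool implementations are illustrative, not any specific provider's schema:

```python
import subprocess

# What a model-generated action might look like (exact format varies by provider):
tool_call = {"name": "run_shell", "arguments": {"command": "ls -la /tmp"}}

# The framework, not the model, owns the actual implementations.
def web_search(query: str) -> str:
    return f"(search results for {query!r} would be fetched from a search API here)"

def run_shell(command: str) -> str:
    result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=60)
    return result.stdout + result.stderr

TOOLS = {"web_search": web_search, "run_shell": run_shell}

# Intercept the model's output, execute the named tool, return the observation.
observation = TOOLS[tool_call["name"]](**tool_call["arguments"])
```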

Claude Sonnet 4.6 and GPT-5.4 both support native tool calling — the models are specifically trained to generate valid function call outputs reliably. Older models like GPT-3.5 required extensive prompt engineering to produce consistent tool-call JSON. That underlying model improvement is a large part of why production agent reliability is substantially better in 2025-2026 than in 2023.

How planning works

For simple tasks — look up a fact, summarize a page, run a script — a single agent loop handles it directly. The LLM plans within the context window, calls tools in sequence, and produces an output. No explicit planning step required.

For complex multi-step tasks, frameworks use explicit planning phases. The agent first generates a full plan (a list of subtasks) before executing any of them. This matters because once execution starts, the model's context fills with action/observation pairs. Having a written plan to refer back to prevents the model from losing the thread of the original goal as that history grows.
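A plan-then-execute loop can be sketched roughly as follows. generate_plan and execute_subtask are hypothetical stand-ins for model calls; the key detail is that the full plan is re-injected at every step:

```python
# Plan-then-execute sketch. generate_plan() and execute_subtask() are
# hypothetical wrappers around model calls; the point is that the written
# plan is re-injected into every step so the goal survives context growth.

def plan_and_execute(task: str) -> list[str]:
    plan = generate_plan(task)  # e.g. ["research X", "draft outline", "write summary"]
    results: list[str] = []
    for subtask in plan:
        context = {
            "original_task": task,
            "full_plan": plan,        # the whole plan stays visible at every step
            "completed_so_far": results,
            "current_subtask": subtask,
        }
        results.append(execute_subtask(context))
    return results
```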

Multi-agent architectures split the task across specialized agents. An orchestrator breaks a research task into subtasks and assigns them to a researcher agent, a coder agent, and a writer agent — each running their own action loops in parallel. The orchestrator collects and synthesizes the outputs. Hermes Agent supports this via subagent delegation — the primary agent can spawn up to 3 concurrent subagents and aggregate results. LangGraph, CrewAI, and AutoGen implement similar patterns with different tradeoffs in flexibility versus setup complexity.
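A minimal orchestrator sketch, assuming the single-agent run_agent loop from earlier plus hypothetical decompose and synthesize calls (not Hermes Agent's or any framework's real API):

```python
from concurrent.futures import ThreadPoolExecutor

# Orchestrator sketch: split a task into subtasks, run an agent loop for each
# in parallel, then synthesise the partial results. run_agent() is the
# single-agent loop sketched earlier; decompose() and synthesize() are
# hypothetical orchestrator-level model calls.

def orchestrate(task: str, max_subagents: int = 3) -> str:
    subtasks = decompose(task)[:max_subagents]
    with ThreadPoolExecutor(max_workers=max_subagents) as pool:
        partial_results = list(pool.map(run_agent, subtasks))
    return synthesize(task, partial_results)
```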

Memory: what the agent knows and when

Agent memory operates at multiple layers. In-context memory is whatever fits in the current context window — conversation history, task instruction, action/observation pairs from the current session. Limited and temporary. Claude Sonnet 4.6's 1M token context sounds vast, but a heavily tool-using agent can consume hundreds of thousands of tokens in a long session, and inference cost rises with context length.

External memory — vector stores, knowledge bases, conversation archives — is retrieved selectively. Before each inference step, the agent runs a similarity search over stored memory and retrieves the most relevant facts, skill documents, or past observations injected into the context window. This allows the agent to reference experiences from months ago without keeping them all in context simultaneously.

The 2026 standard for production agent memory is a dual-layer architecture: a Hot Path (recent messages plus summarized state) paired with a Cold Path (external retrieval from Zep, Mem0, Pinecone, or similar). Digital Applied's January 2026 technical guide notes that even 200K-400K token windows are impractical for full history due to cost and latency — external episodic memory remains mandatory for production agents regardless of context window size.
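Assembling context from the two layers might look roughly like this. vector_store.search and summarize are placeholders, since the real client APIs differ across Zep, Mem0, and Pinecone:

```python
# Dual-layer context assembly sketch. vector_store.search() stands in for
# whatever external memory backend is used; summarize() is a hypothetical
# model call that compresses older history into a short state summary.

def build_context(task: str, recent_messages: list[dict], older_messages: list[dict]) -> list[dict]:
    # Hot path: keep recent turns verbatim, compress everything older.
    hot = [{"role": "system", "content": summarize(older_messages)}] + recent_messages

    # Cold path: pull only the stored facts most relevant to the current step.
    retrieved = vector_store.search(query=task, top_k=5)
    cold = [{"role": "system", "content": f"Relevant memory: {fact}"} for fact in retrieved]

    return cold + hot
```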

What makes agents fail

The three most common failure modes in production agent deployments: tool call errors accumulating (when a call fails and the agent doesn't handle the error correctly, it can spiral into retry loops or incorrect reasoning); context fill (for long-running tasks, the action/observation history fills the window and the model starts losing the thread of the original goal); hallucinated tool calls (models occasionally generate calls with invalid arguments, or fabricate results rather than actually calling the tool — the most dangerous failure mode in high-stakes tasks).

Real defenses: structured output enforcement (requiring tool calls to pass schema validation before execution), step limits (terminating a loop that has exceeded a maximum count), human-in-the-loop checkpoints for irreversible actions, and explicit error handling instructions in the system prompt. Hermes v0.5.0 adds checkpoint/rollback — the /rollback command reverts file changes if the agent takes incorrect actions during code or file editing tasks.
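Schema validation before execution might look like the sketch below (a step limit is already visible as max_steps in the loop sketch above). The schema, the jsonschema usage, and the run_shell tool are illustrative, not a specific framework's machinery:

```python
from jsonschema import ValidationError, validate  # pip install jsonschema

# Defensive wrapper sketch: validate the model's tool call against a JSON
# Schema before executing it, and surface the failure as an observation
# instead of letting a malformed call spiral. Schema and tool are illustrative.

RUN_SHELL_SCHEMA = {
    "type": "object",
    "properties": {"command": {"type": "string"}},
    "required": ["command"],
    "additionalProperties": False,
}

def safe_tool_call(tool_call: dict) -> str:
    try:
        validate(instance=tool_call["arguments"], schema=RUN_SHELL_SCHEMA)
    except ValidationError as err:
        # Returned to the model as an observation so it can correct itself.
        return f"Tool call rejected by schema validation: {err.message}"
    return run_shell(**tool_call["arguments"])
```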

These failure modes are why 'the autonomous agent does everything' framing is premature for many production use cases. The practical approach in 2026: identify tasks that are verifiable (the agent can confirm its output is correct), reversible (mistakes can be undone), and low consequence per error — start there, and automate outward as reliability is confirmed.

Common questions

What is the ReAct pattern in AI agents?

ReAct (Reasoning + Acting) is the standard agent loop: the model alternates between Thoughts (reasoning about the next step) and Actions (tool calls). After each action, the tool result feeds back as an Observation, and the model reasons about the next step. This continues until the task is complete or a stop condition is reached.

How is an AI agent different from an AI chatbot?

A chatbot runs a single inference pass and returns an answer. An agent runs a loop — calls tools, receives results, reasons about the next step, and repeats for potentially dozens of steps — until a complex task is fully executed. The agent can take real actions in external systems; a chatbot cannot.

What tools does a typical AI agent have access to?

Web search, browser control (clicking, form filling, page navigation), terminal/shell execution, file system access, API calls to external services, code execution sandboxes, image/vision analysis, and memory retrieval from external stores. The exact tool set depends on the framework and its configuration.

Do AI agents actually understand what they're doing?

The model generates plausible next steps based on patterns in training. No subjective understanding, but it can reason about tasks, decompose them into steps, handle errors, and adjust based on tool feedback. Whether that constitutes 'understanding' is a philosophical question that does not change the practical outcome: well-designed agents complete complex multi-step tasks reliably when the task is within scope.

How do agents handle tasks that take hours to complete?

Long-running agents checkpoint their state periodically — storing the current task plan, completed steps, and relevant memory to a database. If the process is interrupted, it resumes from the last checkpoint rather than starting over. Hermes Agent implements this via LangGraph-style checkpointing with PostgresSaver for thread-scoped state persistence.
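A generic sketch of the checkpoint/resume pattern, using SQLite as a stand-in rather than LangGraph's actual PostgresSaver API:

```python
import json
import sqlite3

# Checkpoint/resume sketch using SQLite as a stand-in. The agent saves its
# plan, completed steps, and position after every action; on restart it
# loads the last saved state for the thread instead of starting over.

def init_db(path: str = "checkpoints.db") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS checkpoints (thread_id TEXT PRIMARY KEY, state TEXT)")
    return db

def save_checkpoint(db: sqlite3.Connection, thread_id: str, state: dict) -> None:
    db.execute(
        "INSERT OR REPLACE INTO checkpoints (thread_id, state) VALUES (?, ?)",
        (thread_id, json.dumps(state)),
    )
    db.commit()

def load_checkpoint(db: sqlite3.Connection, thread_id: str) -> dict | None:
    row = db.execute("SELECT state FROM checkpoints WHERE thread_id = ?", (thread_id,)).fetchone()
    return json.loads(row[0]) if row else None
```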
