What context windows do and do not do
Every language model has a context window — the amount of text it can see at once during a conversation. Current frontier models have context windows of 200,000 to 1,000,000 tokens. Claude Sonnet 4.6 and Opus 4.6 both offer 1 million token context at standard pricing as of March 2026 — enough to hold roughly 700,000 words simultaneously. GPT-5.4 supports 1 million tokens in the API. These are genuinely large.
But context windows are not persistent. They last for the duration of one session. When you start a new conversation, the context is empty. If you have a 200-session history with a chatbot and start a fresh conversation, the model knows nothing from those previous sessions. You are back to zero regardless of how large the context window is.
This is not a limitation that can be fixed by making the context window larger. Even with infinite context, the model would still start each session fresh unless there is a separate memory store that gets loaded into context at session start. That is what persistent memory systems do — they are a layer above and alongside the context window, not a replacement for it.
How persistent memory systems work
A persistent memory system stores information between sessions and retrieves relevant pieces when a new session starts. At its simplest, this looks like a database — a collection of text snippets tagged with metadata — and a retrieval function that queries it based on what the current session needs.
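The shape described above can be sketched in a few lines. This is an illustrative minimal store, not Hermes's actual implementation: memories are text snippets tagged with metadata, persisted to a JSON file, and retrieved by tag overlap with what the current session needs.

```python
# Minimal sketch of a persistent memory store: tagged text snippets
# persisted to disk, plus a retrieval function driven by the current
# session's needs. Class and method names are illustrative.
import json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class Memory:
    text: str
    tags: list

class MemoryStore:
    def __init__(self, path):
        self.path = Path(path)
        self.memories = []
        if self.path.exists():  # survives across sessions
            self.memories = [Memory(**m) for m in json.loads(self.path.read_text())]

    def add(self, text, tags):
        self.memories.append(Memory(text, tags))
        self.path.write_text(json.dumps([asdict(m) for m in self.memories]))

    def retrieve(self, query_tags, limit=3):
        # Score each memory by how many tags it shares with the query,
        # then keep only those with at least one tag in common.
        scored = sorted(
            self.memories,
            key=lambda m: len(set(m.tags) & set(query_tags)),
            reverse=True,
        )
        return [m.text for m in scored[:limit] if set(m.tags) & set(query_tags)]
```

A new session constructs `MemoryStore` against the same file and immediately sees everything earlier sessions stored, which is the whole point: persistence lives in the store, not in the context window.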
Vector databases are commonly used for this because they allow semantic retrieval: instead of looking up stored memories by exact keyword match, the system can find entries that are conceptually relevant to the current context even when the wording differs. If the agent remembers "the user prefers Python over JavaScript" and the current task involves writing code, that memory gets surfaced even if the new task prompt never mentions Python.
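The mechanism behind semantic retrieval is nearest-neighbor search over embedding vectors. The sketch below uses a toy bag-of-words `embed()` so it runs without any model; a real system would call an embedding model and a vector database instead, but the cosine-similarity ranking is the same idea.

```python
# Sketch of semantic retrieval: memories are indexed as vectors and a
# query retrieves the nearest one by cosine similarity. embed() is a
# stand-in toy vectorizer, NOT a real embedding model; VOCAB is an
# assumption made so the example is self-contained.
import math

VOCAB = ["python", "javascript", "code", "writing", "prefers", "user"]

def embed(text):
    words = text.lower().split()
    return [words.count(w) for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

memories = ["the user prefers python over javascript",
            "the user likes concise writing"]
index = [(m, embed(m)) for m in memories]

def retrieve(query):
    qv = embed(query)
    return max(index, key=lambda pair: cosine(qv, pair[1]))[0]
```

Note that a query like "write some code in python" retrieves the Python-preference memory even though the two strings share little exact wording; with a real embedding model, retrieval works even with no word overlap at all.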
More structured approaches use tiered memory: hot memory for things needed in almost every session (your name, your primary projects, standing preferences), warm memory for less frequent but important context (how you solved a problem three months ago), and cold memory for archived history that can be retrieved on demand but does not load automatically.
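The tiering policy can be made concrete with a small sketch. The class and loading rules below are illustrative, following the tier definitions in the text: hot entries load every session, warm entries load only when tagged relevant, and cold entries are returned only on an explicit lookup.

```python
# Sketch of tiered memory. Tier semantics follow the text above:
# hot = loads at every session start, warm = loads when relevant,
# cold = retrieved only on demand. The class itself is illustrative.
HOT, WARM, COLD = "hot", "warm", "cold"

class TieredMemory:
    def __init__(self):
        self.entries = []  # list of (tier, tags, text)

    def add(self, tier, tags, text):
        self.entries.append((tier, set(tags), text))

    def session_context(self, relevant_tags):
        # Hot always loads; warm loads only when its tags match the
        # session; cold never loads automatically.
        out = [text for tier, _, text in self.entries if tier == HOT]
        out += [text for tier, tags, text in self.entries
                if tier == WARM and tags & set(relevant_tags)]
        return out

    def cold_lookup(self, tag):
        # Explicit on-demand retrieval from the archive tier.
        return [text for tier, tags, text in self.entries
                if tier == COLD and tag in tags]
```

The design choice here is about token budget: hot memory is paid for in every session's context, so it stays small, while cold memory can grow without bound because it costs nothing until queried.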
What Hermes stores in memory
Hermes Agent uses three distinct memory types implemented as files and structures the agent reads and writes directly. The user model lives in `USER.md` — a structured document containing your technical background, communication preferences, standing project context, and operational patterns. The agent updates this file as it learns more about you. General learned facts live in `MEMORY.md`, a curated store the agent reads at session start and writes to when it discovers something worth retaining.
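As a sketch of how file-based memory like this gets used, the snippet below reads `USER.md` and `MEMORY.md` at session start and appends newly learned facts. The paths, section labels, and prompt framing are assumptions for illustration, not Hermes internals.

```python
# Sketch of session startup against the file layout described above:
# USER.md and MEMORY.md are read and assembled into a context block.
# Labels and layout are illustrative assumptions.
from pathlib import Path

def build_session_context(workdir="."):
    parts = []
    for name, label in [("USER.md", "User model"), ("MEMORY.md", "Learned facts")]:
        path = Path(workdir) / name
        if path.exists():
            parts.append(f"## {label}\n{path.read_text()}")
    return "\n\n".join(parts)

def remember(fact, workdir="."):
    # Append a newly learned fact so it survives the current session.
    with open(Path(workdir) / "MEMORY.md", "a") as f:
        f.write(f"- {fact}\n")
```

Because both files are plain markdown, the user can also open and edit them directly, which makes this kind of memory inspectable in a way an opaque vector index is not.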
Skill Documents are procedural memory stored in the agentskills.io open format — searchable markdown files encoding how to approach specific task types: the tools used, the decision tree followed, and the failure modes encountered and how they were handled. On future similar tasks, the agent retrieves and loads the relevant Skill Document rather than reasoning from scratch. This is the compounding mechanism: an agent that has done 50 research tasks is materially faster and more reliable than one on its first.
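Retrieval over a directory of skill files can be as simple as ranking documents by word overlap with the task description. The function below is an illustrative sketch of that step; the scoring scheme is an assumption (a real system would more likely use the semantic retrieval described earlier), and the directory name is hypothetical.

```python
# Sketch of Skill Document retrieval: rank markdown files in a skills
# directory by word overlap with the task description and load the
# best match. The crude overlap scoring is an assumption; the files
# themselves are just markdown.
from pathlib import Path

def load_skill(task, skills_dir="skills"):
    task_words = set(task.lower().split())

    def score(path):
        # Count words the document shares with the task description.
        return len(set(path.read_text().lower().split()) & task_words)

    candidates = sorted(Path(skills_dir).glob("*.md"), key=score, reverse=True)
    if candidates and score(candidates[0]) > 0:
        return candidates[0].read_text()
    return None  # no relevant skill: fall back to reasoning from scratch
```

The `None` branch matters: when no stored skill matches, the agent reasons from scratch and, if the task succeeds, can write a new Skill Document — which is exactly how the library compounds over time.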
Event memory is a timestamped log of tasks, decisions, and outcomes. This is what lets the agent tell you what it did last Tuesday, why it made a particular call, and whether a strategy worked. Optional Honcho integration adds cross-session AI-native user modeling as a separate API layer — building a persistent user understanding that carries across different tools, not just Hermes sessions.
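An append-only JSONL file is one simple way to implement a timestamped event log like this. The record fields below are illustrative, not Hermes's schema; the sketch shows the two operations the text describes — recording an outcome and answering "what happened on day X".

```python
# Sketch of event memory as an append-only JSONL log: one timestamped
# record per task or decision, queryable by day. Field names are
# illustrative assumptions.
import json
from datetime import datetime, timezone
from pathlib import Path

def log_event(logfile, kind, detail, outcome):
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "kind": kind,          # e.g. "task" or "decision"
        "detail": detail,
        "outcome": outcome,
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(record) + "\n")

def events_on(logfile, day):
    # day is an ISO date string like "2026-03-17".
    if not Path(logfile).exists():
        return []
    with open(logfile) as f:
        return [rec for rec in map(json.loads, f) if rec["ts"].startswith(day)]
```

Because ISO-8601 timestamps sort and prefix-match lexically, filtering by `startswith(day)` is enough to answer "what did you do last Tuesday" without a database.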
Why this requires a persistent server
Memory stored on your laptop disappears when you close the application, reformat the drive, or switch to a new machine. For memory to be genuinely persistent — accessible from any device, surviving hardware failures, available when the agent is running scheduled tasks while you are offline — it needs to be stored on a server.
This is the infrastructure dependency that makes cloud hosting important for anyone who wants to rely on their agent's memory long-term. A self-hosted setup on a VPS works, but it requires manually configuring backups, volume mounts, and disaster recovery. Managed hosting handles all of that by default.
The practical consequence: the longer an agent runs, the more valuable its accumulated memory becomes. An agent that has been running for six months and has built up thousands of Skill Documents and a detailed user model is meaningfully different from a fresh install. That accumulated state is worth protecting.
What persistent memory does not do
Memory systems do not make agents reliable. An agent with rich memory can still take incorrect actions, misunderstand ambiguous instructions, or apply a past strategy to a situation where it does not fit. Memory helps with context; it does not substitute for good task design and human oversight on consequential actions.
Memory also does not automatically stay accurate. If the agent learns something incorrect — a wrong assumption about how a system works, a miscategorized approach in a Skill Document — that incorrect information persists and gets retrieved on future tasks. Periodic memory review and correction are important for agents doing high-stakes work.