Reliable multi-agent systems are mostly a memory design problem. Once agents call tools, collaborate, and run long workflows, you need explicit mechanisms for what gets stored, how it is retrieved, and how the system behaves when memory is wrong or missing.
This article compares six memory-system patterns commonly used in agent stacks, grouped into three families.
We focus on retrieval latency, hit rate, and failure modes in multi-agent planning.
We now walk through each family in turn.
What it is
The default pattern in most RAG and agent frameworks:
This is the ‘vector store memory’ exposed by typical LLM orchestration libraries.
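The core loop can be sketched in a few lines: embed each memory as a vector, then retrieve the top-k stored items by cosine similarity. This is a minimal illustration, not any particular library's API; the toy hash-based `embed()` stands in for a real embedding model.

```python
# Minimal sketch of a vector-store memory: embed text, store vectors,
# retrieve top-k by cosine similarity. The hash-based embed() below is a
# deterministic toy stand-in for a real embedding model.
import hashlib
import math

def embed(text: str, dim: int = 16) -> list[float]:
    # Toy embedding: hash character trigrams into a fixed-size vector.
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class VectorMemory:
    def __init__(self):
        self.items: list[tuple[list[float], str]] = []

    def add(self, text: str) -> None:
        self.items.append((embed(text), text))

    def search(self, query: str, k: int = 2) -> list[str]:
        # Brute-force similarity scan; a real system would use an ANN index.
        q = embed(query)
        scored = sorted(
            self.items,
            key=lambda it: -sum(a * b for a, b in zip(q, it[0])),
        )
        return [text for _, text in scored[:k]]

mem = VectorMemory()
mem.add("user prefers metric units")
mem.add("deployment failed on 2024-03-02 due to OOM")
mem.add("user timezone is UTC+2")
print(mem.search("what units does the user prefer", k=1))
```

Note that nothing here knows about time or entity relations: every query is answered purely by similarity, which is exactly where the failure modes discussed below come from.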
Latency profile
Approximate nearest-neighbor indexes are designed for sublinear scaling with corpus size:
Main cost components:
Hit-rate behavior
Hit rate is high when:
Vector RAG performs significantly worse on:
Benchmarks such as Deep Memory Retrieval (DMR) and LongMemEval were introduced precisely because naive vector RAG degrades on long-horizon and temporal tasks.
Failure modes in multi-agent planning
When it is fine
What it is
MemGPT introduces a virtual-memory abstraction for LLMs: a small working context plus larger external archives, managed by the model using tool calls (e.g., ‘swap in this memory’, ‘archive that section’). The model decides what to keep in the active context and what to fetch from long-term memory.
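A stripped-down version of that paging behavior might look like the following. The method names (`archive_item`, `swap_in`) and the eviction policy are illustrative, not MemGPT's actual API; the point is the split between a bounded working context and an archive managed by explicit operations.

```python
# Hedged sketch of MemGPT-style virtual memory: a bounded working context
# plus a long-term archive, with tool-like operations the model could call.
# Method names and eviction policy are illustrative, not MemGPT's real API.
class PagedMemory:
    def __init__(self, context_limit: int = 3):
        self.context_limit = context_limit   # max items in working context
        self.working: list[str] = []         # what the model sees each turn
        self.archive: dict[str, str] = {}    # long-term store, keyed by label

    def archive_item(self, key: str) -> None:
        # Evict a working-context item (matched by its label prefix) to archive.
        for item in list(self.working):
            if item.startswith(key):
                self.working.remove(item)
                self.archive[key] = item

    def swap_in(self, key: str) -> None:
        # Page an archived item back in, evicting the oldest item if full.
        item = self.archive.pop(key, None)
        if item is None:
            return
        if len(self.working) >= self.context_limit:
            evicted = self.working.pop(0)
            self.archive[evicted.split(":")[0]] = evicted
        self.working.append(item)

mem = PagedMemory(context_limit=2)
mem.working = ["plan: refactor auth module", "note: tests must pass"]
mem.archive_item("plan")   # model decides the plan is not needed right now
mem.swap_in("plan")        # later, page it back in
print(mem.working)
```

The interesting design decision is that the model itself issues these calls, so retrieval quality now depends on its paging policy, not just on embedding similarity.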
Architecture
Latency profile
Two regimes:
Overall, you still pay vector search and serialization costs when paging, but you avoid sending large, irrelevant context to the model at each step.
Hit-rate behavior
Improvement relative to plain vector RAG:
The core new error surface is paging policy rather than pure similarity.
Failure modes in multi-agent planning
When it is useful
What it is
Zep positions itself as a memory layer for AI agents implemented as a temporal knowledge graph (Graphiti). It integrates:
Zep evaluates this architecture on DMR and LongMemEval, comparing against MemGPT and long-context baselines.
Reported results include:
These numbers underline the benefit of explicit temporal structure over pure vector recall on long-term tasks.
Architecture
Core components:
The KG can coexist with a vector index for semantic entry points.
Latency profile
Graph queries are typically bounded by small traversal depths:
In practice, Zep reports order-of-magnitude latency benefits vs baselines that either scan long contexts or rely on less structured retrieval.
Hit-rate behavior
Graph memory excels when:
Hit rate is limited by graph coverage: missing edges or incorrect timestamps directly reduce recall.
Failure modes in multi-agent planning
When it is useful
What it is
GraphRAG is a retrieval-augmented generation pipeline from Microsoft that builds an explicit knowledge graph over a corpus and performs hierarchical community detection (e.g., Hierarchical Leiden) to organize the graph. It stores summaries per community and uses them at query time.
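The graph-plus-community-summary idea can be sketched as follows. Real GraphRAG uses LLM-based entity extraction and Hierarchical Leiden; here, connected components stand in for community detection and a joined member list stands in for an LLM-written community summary. All names and data are illustrative.

```python
# Sketch of GraphRAG's graph + community-summary pattern. Connected
# components stand in for Hierarchical Leiden, and a joined member list
# stands in for an LLM-generated community summary.
from collections import defaultdict

edges = [("Alice", "Acme"), ("Acme", "ProjectX"), ("Bob", "Initech")]

# 1) Build adjacency from extracted entity relations.
adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

# 2) "Community detection": connected components via DFS.
def communities(adj):
    seen, comms = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        comms.append(comp)
    return comms

# 3) Store one summary per community (here: just the sorted member list).
summaries = {frozenset(c): "Community about: " + ", ".join(sorted(c))
             for c in communities(adj)}

# 4) Query time: route the question to communities mentioning its entities.
def answer_context(question_entities: set[str]) -> list[str]:
    return [s for comm, s in summaries.items() if comm & question_entities]

print(answer_context({"Alice"}))
```

The precomputed summaries are what make "global" questions tractable: the query is answered from community-level digests instead of scanning every chunk.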
Pipeline:
Latency profile
Latency mostly depends on:
Hit-rate behavior
GraphRAG tends to outperform plain vector RAG when:
The hit rate depends on graph quality and community structure: if entity extraction misses relations, they simply do not exist in the graph.
Failure modes
When it is useful
What they are
These systems treat ‘what the agents did’ as a first-class data structure.
In both cases, the log / checkpoints are the ground truth for:
Latency profile
Hit-rate behavior
For questions like ‘what happened,’ ‘which tools were called with which arguments,’ and ‘what was the state before this failure,’ hit rate is effectively 100%, assuming:
Logs do not provide semantic generalization by themselves; you layer vector or graph indices on top for semantics across executions.
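An append-only execution log makes that exactness concrete: every tool call is recorded as a structured record, so "which tools were called with which arguments" is a lossless filter over the log rather than a similarity search. The schema below is illustrative.

```python
# Sketch of an append-only execution log as ground truth: every tool call
# is recorded as one JSON record, so querying past tool calls is an exact
# lookup, not a similarity search. The record schema is illustrative.
import json

log: list[str] = []  # in practice: a file or durable store, one JSON per line

def record(step: int, tool: str, args: dict, result: str) -> None:
    log.append(json.dumps({"step": step, "tool": tool,
                           "args": args, "result": result}))

def calls_to(tool: str) -> list[dict]:
    # Exact, lossless retrieval: replay the log and filter.
    return [e for e in map(json.loads, log) if e["tool"] == tool]

record(1, "search", {"q": "quarterly revenue"}, "3 hits")
record(2, "write_file", {"path": "report.md"}, "ok")
record(3, "search", {"q": "churn rate"}, "1 hit")

print([c["args"]["q"] for c in calls_to("search")])
# -> ['quarterly revenue', 'churn rate']
```

To get semantics across executions, you would embed or graph-index these records separately; the log itself stays the source of truth for replay and repair.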
Failure modes
ALAS explicitly tackles some of these via transactional semantics, idempotency, and localized repair.
When they are essential
What it is?
Episodic memory structures store episodes: cohesive segments of interaction or work, each with:
Episodes are indexed with:
Some systems periodically distill recurring patterns into higher-level knowledge or use episodes to fine-tune specialized models.
Latency profile
Episodic retrieval is typically two-stage:
Latency is higher than a single flat vector search on small data, but scales better as lifetime history grows, because you avoid searching over all individual events for every query.
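The two-stage shape can be sketched directly: first score whole episodes by their summaries, then search events only inside the winning episode. Word overlap stands in for embedding similarity here, and the episode data is invented for illustration.

```python
# Sketch of two-stage episodic retrieval: stage 1 scores whole episodes by
# their summaries, stage 2 searches events only inside the best episode.
# Word overlap is a toy stand-in for embedding similarity.
def overlap(a: str, b: str) -> int:
    return len(set(a.lower().split()) & set(b.lower().split()))

episodes = [
    {"summary": "debugging the payment service outage",
     "events": ["checked logs",
                "found null pointer in charge()",
                "deployed a fix for the outage"]},
    {"summary": "planning the Q3 marketing campaign",
     "events": ["drafted budget", "picked launch date"]},
]

def retrieve(query: str) -> str:
    # Stage 1: pick the most relevant episode by its summary.
    episode = max(episodes, key=lambda e: overlap(query, e["summary"]))
    # Stage 2: search only within that episode's events.
    return max(episode["events"], key=lambda ev: overlap(query, ev))

print(retrieve("what fixed the payment outage"))
```

Because stage 1 only looks at one summary per episode, the per-query cost grows with the number of episodes rather than the number of individual events, which is the scaling argument made above.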
Hit-rate behavior
Episodic memory improves hit rate for:
Hit rate still depends on episode boundaries and index quality.
Failure modes
When it is useful
Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
