Source: arxiv.org
Summary
The paper investigates agent‑memory systems from a data‑management perspective and proposes a four‑module framework:
- Memory Representation & Storage (logical format + physical backend),
- Memory Extraction (how raw inputs are turned into logical records),
- Memory Retrieval & Routing (global & query‑level access methods), and
- Memory Maintenance (conflict resolution, eviction, consolidation).
Using this taxonomy, the authors benchmark 12 representative systems (including Mem0, Zep, MemOS, SimpleMem, MemAgent, and A‑MEM) across five diverse workloads (LoCoMo, LongMemEval, DB‑Bench, LongBench, and a third long‑conversation QA set). Their key findings are:
- No single architecture dominates across all domains; instead, the task bottleneck determines the best memory design. Structured, relational, or graph‑based memories excel at cross‑session reasoning, hybrid or hierarchical approaches best support long‑horizon grounding, and raw transcript‑based memories that preserve interaction traces suit stateful operations.
- Retrieval fidelity is driven more by how evidence is organized (e.g., linked or hierarchical) than by the ranking of a single best hit.
- Robustness to dynamic updates demands revisability in data representation and selective consolidation.
- Long‑horizon stability is achieved when heterogeneous “event–entity” graphs preserve temporal relationships; flat dense retrieval suffers as context drifts.
- Operational cost scales with the scope of maintenance: localized updates (as in LightMem or MemTree) provide a sweet spot compared to global graph re‑writes (e.g., MemOS, Cognee).
The authors also conduct a detailed component‑ablation study that confirms these insights and offers concrete guidance on representation granularity, extraction coarsening, retrieval fusion, and conservative consolidation.
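The four-module decomposition described above can be made concrete with a small sketch. This is a toy illustration, not code from the paper or from any of the benchmarked systems: the `MemoryRecord` schema, the `ToyAgentMemory` class, and the `key = value` input syntax are all assumptions chosen to keep the example short. It also shows what a localized, conservative maintenance step might look like, since each update touches only the one key it conflicts with.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryRecord:
    """Logical record: one extracted fact (hypothetical schema)."""
    key: str    # e.g. "user.city"
    value: str
    turn: int   # interaction turn at which the fact was observed


@dataclass
class ToyAgentMemory:
    """Illustrative four-module layout; a dict stands in for the physical backend."""
    store: dict[str, MemoryRecord] = field(default_factory=dict)  # representation & storage

    def extract(self, turn: int, utterance: str) -> list[MemoryRecord]:
        """Extraction: turn raw input into logical records.

        Assumes a toy 'key = value; key = value' syntax. A real system
        would typically use an LLM or a parser here.
        """
        records = []
        for part in utterance.split(";"):
            if "=" in part:
                key, value = part.split("=", 1)
                records.append(MemoryRecord(key.strip(), value.strip(), turn))
        return records

    def maintain(self, record: MemoryRecord) -> None:
        """Maintenance: localized conflict resolution on the affected key only."""
        current = self.store.get(record.key)
        if current is None or record.turn >= current.turn:
            self.store[record.key] = record  # the newer observation wins

    def retrieve(self, query: str) -> list[MemoryRecord]:
        """Retrieval: return records whose key contains the query string."""
        return [r for r in self.store.values() if query in r.key]


memory = ToyAgentMemory()
turns = ["user.city = Paris; user.pet = cat", "user.city = Berlin"]
for turn, text in enumerate(turns):
    for rec in memory.extract(turn, text):
        memory.maintain(rec)

print([(r.key, r.value) for r in memory.retrieve("user.city")])
```

In this sketch the second turn revises `user.city` without rewriting or re-indexing any other record, which is the kind of localized update the cost finding favors over global rewrites.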
Implications
- For the design of new agents, match the memory architecture to the workload: graph‑based storage for cross‑session facts, hierarchical fusion for long‑term queries, and simple in‑context registers for session‑state updates.
- Retrieval engines should be structured (hybrid or rule‑based) rather than relying purely on dense similarity.
- Maintenance must be localized and conservative: avoid full‑graph recomputation whenever possible.
- The evaluation framework (MemoryData) and curated codebase are released alongside the paper for reproducibility.
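One way to read the recommendation on hybrid retrieval is as a fusion of several rankings rather than a single dense-similarity ranking. The sketch below uses reciprocal rank fusion (RRF), a standard rank-fusion technique that is swapped in here for illustration; the summary does not say which fusion method the paper or the benchmarked systems use. The memory ids, the two input rankings, and the constant `k=60` are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of memory ids into one ranked list.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, not a
    value taken from the paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, mem_id in enumerate(ranking, start=1):
            scores[mem_id] = scores.get(mem_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by id so the result is deterministic.
    return sorted(scores, key=lambda m: (-scores[m], m))


# Hypothetical outputs of a dense retriever and a lexical retriever.
dense_ranking = ["m3", "m1", "m7"]
lexical_ranking = ["m1", "m9", "m3"]

fused = reciprocal_rank_fusion([dense_ranking, lexical_ranking])
print(fused)  # ['m1', 'm3', 'm9', 'm7']
```

Memories that both retrievers rank highly (`m1`, `m3`) rise to the top, while a memory found by only one retriever is kept but ranked lower.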
Content
Abstract: “Memory for large language model (LLM) agents has rapidly evolved from simple retrieval‑augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolution, existing evaluations still benchmark agent memory mainly through end‑to‑end task success metrics (e.g., F1, BLEU), while treating the underlying system as a monolithic black box. As a result, critical system‑level concerns, including operational costs, architectural trade‑offs across memory modules, and robustness under dynamic knowledge updates, remain insufficiently explored.”
Key Contribution Statement: “We propose an analytical framework that decomposes agent memory into four core modules: memory representation and storage, extraction, retrieval and routing, and maintenance.”
Finding 1. Workload‑Aligned Memory: “Strong agent memory is not defined by a single universal representation, but by how well it supports the dominant workload bottleneck: (1) for dispersed cross‑session reasoning, relation‑ and time‑aware retrieval is most effective, as in Zep and Cognee; (2) for long but semantically coherent dialogue, coarse‑to‑fine filtering improves exact grounding, as in MemOS and MemoryOS; and (3) for stateful execution, preserving interaction traces is more critical than exact lexical matching alone, as in Long Context.”
Finding 2. Evidence‑Centric Memory Organization: “Retrieval quality depends more on how a system organizes evidence for later reconstruction than on how well it ranks one relevant memory first. Early localization and evidence assembly should be treated as separate design targets; explicit structure, such as links or hierarchy, is most valuable when supporting evidence is scattered or temporally distant, as in A‑MEM and MemTree; flat dense retrieval is mainly effective for short‑range access.”
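The distinction in Finding 2 between early localization and evidence assembly can be sketched as two separate steps: a first retrieval pass returns a few seed memories, and an assembly pass follows explicit links outward from them. The code below is a minimal illustration under stated assumptions, not the mechanism of A‑MEM, MemTree, or any other system named here: the `links` dictionary, the memory ids, and the `max_hops` parameter are hypothetical.

```python
from collections import deque


def assemble_evidence(seed_ids, links, max_hops=1):
    """Expand seed memories along explicit links, breadth first.

    seed_ids: ids returned by an initial retrieval step (localization).
    links:    dict mapping a memory id to the ids it links to.
    max_hops: how many link steps to follow from any seed.
    Returns ids in discovery order, seeds first.
    """
    ordered = list(dict.fromkeys(seed_ids))  # de-duplicate, keep order
    seen = set(ordered)
    queue = deque((seed, 0) for seed in ordered)
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop budget
        for neighbor in links.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                ordered.append(neighbor)
                queue.append((neighbor, depth + 1))
    return ordered


# Hypothetical link structure between memories from different sessions.
links = {
    "trip_planned": ["flight_booked", "hotel_booked"],
    "flight_booked": ["flight_cancelled"],
}

one_hop = assemble_evidence(["trip_planned"], links, max_hops=1)
two_hop = assemble_evidence(["trip_planned"], links, max_hops=2)
print(one_hop)  # ['trip_planned', 'flight_booked', 'hotel_booked']
print(two_hop)  # ['trip_planned', 'flight_booked', 'hotel_booked', 'flight_cancelled']
```

A flat top-1 retriever would stop at `trip_planned`; the later cancellation only becomes reachable because the structure links it to the seed, which is the scattered or temporally distant evidence case the finding describes.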
Conclusion: “We present a comprehensive review of existing agent memory systems from a data management perspective. We conduct thorough experiment evaluations across five benchmarks and 12 representative systems. Fine‑grained component analyses confirm that representation granularity, extraction coarseness, retrieval fusion, and conservative consolidation are key levers. The released evaluation framework (MemoryData) and code enable reproducibility of all results.”