About this transcript — Full English transcript of "Memory and Dreaming for self-learning agents" (2026-05-16), a talk by Mahesh (Product Manager, Platform Team at Anthropic). Auto-generated with Whisper large-v3-turbo; only paragraph breaks were added. The raw ASR output misrecognizes some proper nouns ("Cloud" for "Claude", "bi-poders" for "vibe coders", "Cloud.md" for "Claude.md").

A commentary in Japanese, along with examples of applying this to my own agents, is written up here.


Hello. Hey everyone, how's it going? Thanks for coming.

My name is Mahesh, and I'm a product manager on the platform team here at Anthropic. Over the past year and a half, I've gotten to work on primitives like MCP and skills, and today I want to talk about the primitive that I'm most excited about next, which is memory. I'll talk about why we think memory is so important and why we've been spending so much time on it at Anthropic, how we think about designing memory systems built for frontier agents, and I'm excited to also talk about Dreaming, a brand new product that we're launching today in research preview in the managed agents API.

Model capabilities have improved really quickly over the last couple of years, and agents are now capable of tasks that take many, many hours and can run for hours or almost days at a time. As models and agents have improved, we've also invested in building higher and higher level capabilities and primitives that get out of the models' way and give them access to more of their environment, things they can manage to become more powerful over time. So, for example, we launched MCP, which gives agents access to external tools and data. We launched really powerful harnesses, like Claude Code and the Agent SDK.

And in October, we launched Skills, which let agents pick up brand new capabilities that other agents have designed and shared with them, or that the humans and users they interact with have designed for them. Each primitive has let agents do increasingly powerful things for longer periods of time, but we think something is still unsolved, and that's continuous self-learning and context management over long-horizon tasks. So, memory is the next primitive. It's the thing that I think will get us to self-learning agents that evolve and improve based on the tasks they're working on and their own experience.

With memory, agents can learn about the tasks they work on, things like the success criteria, common mistakes, strategies that are or are not working. They can learn about their environments, things like the code bases that they interact with, the files and the assets that they're constantly keeping up to date, and they can also learn from other agents that are in the same environment as them. They can share learnings, they can figure out what's going wrong elsewhere in a system and incorporate that into their own memory. And I think this last point is the one that I've been most excited about this year and over the next couple of months.

I think self-managed memory is going to be super important in these large and complex multi-agent systems, where a swarm of agents working in a similar environment on discrete tasks are essentially building up their own understanding, their own model of the world they're in, over time. So to help get to this vision, we just launched memory in Claude-managed agents in public beta a couple of weeks ago. This gives developers a frontier memory system that works out of the box to maximize intelligence by default, to support these systems of many agents running concurrently in the same environment, and most importantly, to give enterprises and developers the flexibility and control they need to actually run these in production in an enterprise setting. We've already heard from a bunch of teams building on this that it helps them get to continuously learning and continuously improving agents a lot faster.

Rakuten, for example, mentioned that this helped them cut first-pass mistakes in their internal knowledge agents by 90%, because agents were able to catch mistakes and share them with the next iteration of agents. It also led to better token efficiency, lower costs, and better latency once they started deploying memory systems. So I want to talk a bit about the requirements that we discovered while talking to customers and kept in mind while building this. The first and most important: memory needs to be built to maximize intelligence by default. Agent builders have been designing memory systems for a while.

I mean, we ourselves launched Claude.md, originally with Claude Code, about a year and a half ago, and this was a pretty constrained early version of memory where an agent could leave notes for itself. Sometimes the user would also leave notes in the same memory file. We also launched the memory tool within our SDKs, which was a pretty well-specified tool call with specific parameters and output formats that API builders could use. As agents have improved, we've tried to get more and more out of Claude's way and delegate more of this decision-making to Claude without over-constraining the design of these harnesses.

And as we did with skills, we came to the conclusion that, hey, we know agents are able to manage a virtual environment and manage their own file system, so why can't we go in the same direction with memory? Memory in Claude-managed agents models memory as a file system to Claude, a series of files with a specific hierarchy and format that Claude can manage and update on its own. It can use familiar tools like Bash and Grep to update this memory, keep it organized, and constantly change it as it works on a task. Now, this also tracks with what we're seeing in the latest models.
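As a concrete mental model of memory-as-a-file-system, here is a minimal local sketch. Everything in it (the `MemoryFS` class, its `write`/`read`/`grep` methods, the example file) is illustrative and assumed, not the real managed-agents API; the point is just that the agent's memory operations reduce to ordinary file and grep operations over a directory tree.

```python
# A toy stand-in for file-system-based memory: a directory the agent edits
# with plain file tools. MemoryFS and its method names are hypothetical.
import os
import re
import tempfile

class MemoryFS:
    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def write(self, path, text):
        """Create or overwrite one memory file, making parent dirs as needed."""
        full = os.path.join(self.root, path)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "w") as f:
            f.write(text)

    def read(self, path):
        with open(os.path.join(self.root, path)) as f:
            return f.read()

    def grep(self, pattern):
        """Return (relative_path, line) pairs matching pattern, like `grep -r`."""
        hits = []
        for dirpath, _, files in os.walk(self.root):
            for name in files:
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, self.root)
                with open(full) as f:
                    for line in f:
                        if re.search(pattern, line):
                            hits.append((rel, line.rstrip("\n")))
        return hits

mem = MemoryFS(tempfile.mkdtemp())
mem.write("sre/dispatch.md", "## dispatch latency\nRetry storm ~60s after CPU spike\n")
```

Because the interface is just files plus search, the model needs no bespoke memory tool schema; the same Bash and Grep skills it uses for coding carry over directly.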

With Claude Opus 4.7, which we just launched last month, we saw that it was state-of-the-art at file-system-based memory. That means it's a lot better at discerning what content to put into memory, what's worth remembering. It's better at figuring out the right structure for memory. How many files should I split memory into?

How do I keep it organized inside of a file system? And ultimately, it does all of this with just the Bash and Grep tools that already make Claude so good at agentic coding. The other thing we had in mind when designing memory is that it needs to scale with the multi-agent systems we're going to be building over the coming months. Running many agents in parallel is something we're already starting to do with Claude Code.

There are a lot of vibe coders that have like 10 or 15 Claude Code sessions running at the same time, and we're starting to see this in an enterprise setting as well, where enterprises, including Anthropic, have hundreds or sometimes even thousands of agents running in parallel, interacting with the same set of shared state and the same shared memory. So there are a couple of properties that come out of this. One is we want to give agents the ability to mix and match between the session and the work it's doing and the set of memory stores it has access to. So one property of memory in managed agents is permission scopes.

That's the ability for one agent to have read-only access to one memory store, where maybe that memory store is organization-wide knowledge, a set of best practices, a runbook for how to deal with common tasks, and read-write access to another memory store, maybe one holding working memory that's a lot more specific and frequently updated based on the work it's doing. The other property that came out of this was concurrency. If there are hundreds or thousands of agents interacting at the same time with the same memory state, each agent needs to know that it's not going to clobber the memory or overwrite it as it continues working.
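The scope mix just described can be sketched in a few lines. This is a toy model, not the real API: `ScopedStore` and the `"ro"`/`"rw"` flags are invented here to show one agent holding a read-only view of shared org knowledge alongside a read-write view of its own working memory.

```python
# Hypothetical sketch of permission scopes: the same underlying store can be
# attached read-only (shared knowledge) or read-write (working memory).
class ScopedStore:
    def __init__(self, files, mode):
        assert mode in ("ro", "rw")
        self._files = files          # shared dict: path -> content
        self.mode = mode

    def read(self, path):
        return self._files[path]

    def write(self, path, text):
        if self.mode != "rw":
            raise PermissionError("store attached read-only")
        self._files[path] = text

org_knowledge = {"runbooks/p1.md": "Page the service owner first."}
working = {}

# One agent sees org knowledge read-only and its own store read-write.
agent_view = {
    "org": ScopedStore(org_knowledge, "ro"),
    "sre": ScopedStore(working, "rw"),
}
agent_view["sre"].write("notes/dispatch.md", "CPU spike at 14:02")
```

The design point is that the scope lives on the attachment, not the store: the same org-knowledge store can be read-write for a curation agent and read-only for everyone else.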

So we implemented optimistic concurrency, where one agent can essentially use a content hash to check whether it's about to overwrite another agent's memory before it actually makes an update. From talking to customers, the final and most important property of all of this is developer and enterprise control for actual production agents. A couple of things came out of this. The first and probably most sought-after property is version history.

That's the ability for developers building with managed agents to see an entire audit log of every time memory was updated, and even to give agents access to the same audit log in the future so they can keep track of what change was made and when. It's also attribution metadata: which agent made an update, at what time, and in which session, at a super granular level, so this is predictable and in developers' control. The other property that came out of this was a standalone API. We talked to a lot of customers that are building bespoke systems outside of managed agents to manage and curate their memory and keep it up to date.

We talked to customers that do PII scanning to make sure that memory doesn't contain sensitive content that shouldn't be in there. We also talked to customers that wanted to clean up memory in their own separate pipeline or clone it into external systems. So we didn't want to lock them into a specific system that was only available to managed agents. We built this portable API so they could go and control these additional things on their own.
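The version-history and standalone-API points above can be sketched together: every write appends an attributed audit entry, and the history is exportable as plain records so an external pipeline (PII scan, cleanup, cloning) can consume it. All names and fields here (`AuditedStore`, `AuditEntry`, `export`) are illustrative assumptions, not the real managed-agents schema.

```python
# Hypothetical sketch of version history with attribution metadata, plus a
# plain export for external pipelines. Field names are invented for illustration.
import time
from dataclasses import dataclass, field, asdict

@dataclass
class AuditEntry:
    path: str
    agent_id: str
    session_id: str
    timestamp: float
    content: str

@dataclass
class AuditedStore:
    files: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def write(self, path, text, agent_id, session_id):
        """Apply a write and record who made it, in which session, and when."""
        self.files[path] = text
        self.history.append(AuditEntry(path, agent_id, session_id, time.time(), text))

    def versions(self, path):
        """Every recorded change to one file, oldest first."""
        return [e for e in self.history if e.path == path]

    def export(self):
        """Portable dump of the audit log for external review or cloning."""
        return [asdict(e) for e in self.history]

store = AuditedStore()
store.write("sre/dispatch.md", "initial findings", "agent-a", "sess-1")
store.write("sre/dispatch.md", "updated findings", "agent-b", "sess-2")
```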

So taking a step back, we've started to form this picture of the different layers we need to work in as we build a frontier memory system. We've talked about the storage layer, which is where the data is actually stored and what kind of metadata and attribution data we're leaving alongside it. We've talked about the structure and content layer. This is things like our decision to model memory as files in a file system, and earlier, with skills, as a form of procedural memory with a pretty lightweight spec that says, hey, here's how you can learn this new capability and equip yourself with new knowledge.

And then there's the process layer. This is things like: how often is memory actually updated? What triggers updates to that memory, and what sources does it use to decide what changes to make and what new things to learn? And we think that agent memory, the API we've been discussing, solves part of this.

But as we started to scale this up into these more complex multi-agent systems, we also saw a bunch of limitations. We saw cases where sessions were missing learnings that other agents and other sessions had already figured out on their own. We saw common mistakes and shared patterns across multiple agents working in the same environment. And we also saw that agents weren't super efficient at maintaining a large-scale memory store and keeping it up to date in a holistic and efficient way.

They were kind of siloed into the specific task they were working on. So for the past couple of months, we've been experimenting with a couple of different types of processes to supplement this. And we landed on one. We call this process dreaming.

And today, we're launching this in research preview in the Managed Agents API. Dreaming is a process that looks for patterns and mistakes across your recent agent sessions and their transcripts, and automatically produces organized, up-to-date memory content. We've worked with a few customers in early testing. For example, when Harvey deployed dreaming on one of their legal benchmarks, which tests a pretty realistic legal scenario, they saw a six-times increase in the task completion rate.

And we're really excited to see how other customers use this when they start testing out this research preview. So let's talk a bit about why we got excited about dreaming in the first place and some of the design and harness considerations we kept in mind as we designed it. So how does dreaming work? It's a batch asynchronous process that runs separately from the work that you're doing within a specific session that's working on a specific task.

You can kick off dreaming periodically using our console or our API on kind of a cron basis or you can plug it in using our API into an existing process. For example, some customers kick off dreaming once their agents have finished a task and are spinning down and want to save those learnings to the memory state. And dreaming comprehensively looks through recent transcripts, looks for common mistakes, things that a bunch of agents are doing like a failed tool call or strategies that are working out for them and finds opportunities to update the memory state that will improve it in the future. And it produces this updated memory state that you can then apply immediately to your memory store or maybe you want to run some checks and do some manual review which you can do via the API.
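The lifecycle just described (gather recent transcripts, mine them for shared patterns, propose a reviewable memory update) can be sketched as a small batch function. This is a deliberately simplified stand-in: the function name, the transcript event shape, and the "count repeated tool errors" heuristic are all assumptions for illustration, nothing like the actual dreaming implementation.

```python
# Hypothetical sketch of a dreaming pass: scan recent session transcripts for
# mistakes that recur across sessions and propose memory notes about them.
from collections import Counter

def propose_memory_updates(transcripts, min_repeats=2):
    """Turn errors seen in several sessions into proposed shared memory files."""
    errors = Counter()
    for transcript in transcripts:
        for event in transcript:
            if event.get("type") == "tool_error":
                errors[event["message"]] += 1
    # Only patterns shared across sessions are worth a shared memory note;
    # one-off failures stay out of the proposed diff.
    return {
        f"learnings/{i}.md": f"Known failure ({n} sessions): {msg}"
        for i, (msg, n) in enumerate(errors.items())
        if n >= min_repeats
    }

sessions = [
    [{"type": "tool_error", "message": "dispatch retry storm after CPU spike"}],
    [{"type": "tool_error", "message": "dispatch retry storm after CPU spike"}],
    [{"type": "tool_error", "message": "one-off timeout"}],
]
diff = propose_memory_updates(sessions)
```

The output is a proposed diff rather than a direct write, which matches the review step: you can apply it immediately or gate it behind your own checks first.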

The ultimate goal of dreaming is continuous self-learning and self-improvement where the next day's agents automatically get better based on the learnings and the work of the previous day's experience. We're excited about dreaming from a design and research perspective for a couple of reasons. The first property is compared to the memory APIs we've been talking about previously, dreaming is out of band. It happens outside the context of an agent working on a specific session or a specific task.

And this has a couple of benefits. The first is that it's a really good fit for multi-agent systems. When a single agent is reading and writing memory it has the perspective of itself, of its own context and of its task. But dreaming lets us go kind of a step above that and look at multiple agents at the same time to find these shared patterns and learnings that a single agent might not learn or notice from its own limited perspective.

From a harness design perspective, we've also found consistently that it's important for agents to have really discrete and clear objectives as they start working on a task. So dreaming lets us separate out this new objective of memory quality, because we think over the coming months memory is going to be increasingly important and load-bearing to the actual outcomes and work that agents are doing. This lets us separate the memory quality objective from the task completion and task performance objective that a lot of agents already have today. And again, because dreaming is an out-of-band process running in the background, it does this without adding any latency to the hot path of an agent's existing task.

The other design perspective we had here and thing that we wanted to enable which I'm very excited about is large-scale memory systems and how we can use compute effectively to create and curate these. Today, most memory deployments are pretty localized to a specific user or a specific task or maybe a small team that's working together. But agent systems are quickly getting to enterprise scale and again, within Anthropic and within other enterprises that we work with, they already have hundreds or thousands of agents running concurrently that share state. So this effectively starts to turn into a really large knowledge base as opposed to just a simple memory store to store work in context about a specific task.

And to support this, we need to find ways to let Claude scale memory systems up to be super large while still being up-to-date, fresh, and not too token-intensive. Dreaming is a process that lets us do this by essentially following similar scaling laws, using additional compute and additional effort to keep these memory systems organized. One way to think about this is how we thought about test-time compute and thinking models a couple of years ago, where giving models the ability to explore, try different things, and especially spend more tokens leads to much better final outcomes on the task they're working on. Dreaming is a similar thing that lets a dreaming agent spend more tokens to keep these systems well organized and up-to-date.

Another perspective we have here is a search system, where there's upfront effort to produce a high-quality, up-to-date index that's then used at retrieval or search time to get the latest results super efficiently and performantly. This is something dreaming also lets us do, by creating this index up front and then curating it so all the downstream agents can use it; it effectively lets us amortize that effort across all of the agents reading from a memory store. So now, with memory and dreaming in the managed agents API, we start to build this picture of what we think of as a frontier memory system, at least so far. Memory, on the left side, is a primitive for agents to immediately, in real time, read and write and remember things as they're working on a task.

And dreaming is a comprehensive process built on top of that to verify the state of memory, to organize it, and to enrich and backfill it with additional information based on the tasks the agents are doing during the day. Dreaming is kind of the bridge between these more intermediate memory systems and the larger-scale knowledge bases that, again, we think are going to be really prominent over the next few months. So let's walk through a quick demo. What we're looking at here is an SRE agent — let me make sure this starts — there you go — that is looking at alerts coming in and reacting based on those alerts, spinning up specific agents that do a bunch of triage work, or sometimes an agent to go submit PRs, and each of these agents is equipped with a couple of memory stores.

We can see that it has an org-wide knowledge memory store, and it has an SRE and a codebase memory store. If we click into the org-wide knowledge memory store, we can see it's read-only. It's a set of, let's say, runbooks and SLO guidelines; it points the agents to the specific owners they might need to go ping, and it's something that doesn't get updated very often. We don't necessarily want agents going and making changes to it as they work. Now there's also the SRE memory store, which is read-write, and of course the SRE agents are able to constantly make changes to this as they react and learn from the environment around them.

So we see this alert, this P1 that's coming in from the dispatch service, and we spin up this SRE agent that goes and kicks off an investigation. It investigates the CPU utilization, maybe checks out some of the traffic patterns, and queries for some of the recent PRs that got deployed. It writes down these learnings — if we click into the SRE memory store — and notes them in a new diff that gets applied to that memory store. Now, a couple minutes later, that same alert gets paged again, and a different SRE agent spins up with access to the same memory store.

The first thing it does is see that note within its memory store that says, hey, we already did this investigation, here's what we found, here's a way you can short-circuit what you're looking at, and ultimately it saves a bunch of time it would have spent investigating the same thing. So we see an immediate token-efficiency gain and an intelligence gain, since it now knows what else it can go investigate. Now, this is great, but we want to actually be able to deploy these in an enterprise, have reliability, and see what decision-making led to certain outcomes. So if we click into the memory store, we can see it has version history.

It shows every single time an update was made to this; we can actually go back in time and see what changes were made. We can also see which agent made that change and when it was written. And we also have this little precondition hash, which again is what lets us do this optimistic concurrency: hey, I made this change, let's actually verify the content is what I think it is before I go and overwrite it.
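The precondition-hash check the demo shows is classic optimistic concurrency, and can be sketched as a compare-and-swap over a content hash. Again, a toy model under stated assumptions: `Store`, `WriteConflict`, and the hash-as-precondition protocol are illustrative, not the actual API surface.

```python
# Hypothetical sketch of optimistic concurrency: a write carries the hash of
# the content the writer last read and is rejected if the file changed since,
# so concurrent agents cannot silently overwrite each other.
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

class WriteConflict(Exception):
    pass

class Store:
    def __init__(self):
        self.files = {}

    def read(self, path):
        text = self.files.get(path, "")
        return text, content_hash(text)   # caller keeps the hash as a precondition

    def write(self, path, text, precondition):
        _, current = self.read(path)
        if precondition != current:
            raise WriteConflict(f"{path} changed since it was read")
        self.files[path] = text

store = Store()
_, h0 = store.read("sre/notes.md")                        # both agents read the same version
store.write("sre/notes.md", "agent A's update", h0)       # A's write lands first
try:
    store.write("sre/notes.md", "agent B's update", h0)   # B's precondition is now stale
except WriteConflict:
    _, h1 = store.read("sre/notes.md")                    # B re-reads, merges, retries
    store.write("sre/notes.md", "agent B's merged update", h1)
```

The conflict path is the interesting part: the losing agent re-reads the current state and decides how to merge, instead of the store silently picking a winner.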

So we've been talking about agent memory, but let's now see how dreaming can make this a more holistic and up-to-date memory system. We'll pivot over to the Claude Console, which reflects the exact state of what we're looking at in the API. It's the set of memory stores we've created, and we'll click into the Team SRE memory store, which again reflects the latest state of memory that we had written. If we navigate to the dreaming tab specifically, we can kick off a dreaming job where we say, hey, I want to update this specific memory store, and I want to look at the sessions from the past seven days.

These are all the sessions that touch this memory store, and we want to start a dreaming job to look over them. So I'll click into the dream, and we can see some of the work it's doing under the hood. It shows the input sessions it's going to spend time looking into, and it spins up, within the Claude Console, an actual session where you can see what's happening. It's looking at the specific transcript entries, and it's going to spin up a bunch of sub-agents that look through those transcripts, try to identify learnings, and then produce that updated memory state.

So we'll jump ahead a few minutes and look at the completed dreaming job to see what the output was. It produces a diff which is a set of updated files that it's going to apply to this memory store. The first one is an update to this dispatch latency note that we were looking at in the demo earlier. It said hey, a bunch of these agents were triggered exactly 60 seconds after an upstream spike in CPU utilization and it kind of figures out based on that pattern that there might be some retry logic that's getting triggered that's really inefficient and leading to a lot of wasted time when we're actually triaging this stuff.

So it identifies that because each of the individual agents aren't really noticing that pattern. They don't know that other agents are also seeing kind of that 60 second pattern every single time and it leaves a note, and the goal with this is future agents benefit from this learning and can go figure this out more efficiently. It also does a deduplication and curation step. It sees that there were five of the same entries from previous agents that were working with this memory store and it consolidates that into a single entry.

It removes a stale entry that it saw in the transcript is no longer valid, and then it adds a verification note that says: at this time, based on the transcript I just looked at, this memory is accurate. I was able to verify it based on the work the agent was doing, and therefore you can rely on it the next day when you start using the same memory store. So those are the verification, backfill, and organization steps that we think memory and dreaming are really useful for. And with this demo, what we've seen is how we can actually build production agents using the memory and dreaming APIs in the managed agents API.
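The diff the demo walks through (update a note with a newly spotted pattern, consolidate duplicates, drop a stale entry, record a verification note) can be represented as a reviewable list of operations applied to the store. The op format and file names below are invented for illustration; the real dreaming output format may look nothing like this.

```python
# Hypothetical sketch of applying a reviewed dreaming diff as explicit ops,
# mirroring the demo's update / consolidate / remove / verify steps.
def apply_diff(files, ops):
    """Apply add/update/remove operations to a path -> content mapping."""
    for op in ops:
        if op["op"] in ("add", "update"):
            files[op["path"]] = op["content"]
        elif op["op"] == "remove":
            files.pop(op["path"], None)
    return files

files = {
    "sre/dispatch-latency.md": "dispatch latency note",
    "sre/dup-1.md": "retry storm",
    "sre/dup-2.md": "retry storm",
    "sre/stale.md": "old alert routing",
}
ops = [
    {"op": "update", "path": "sre/dispatch-latency.md",
     "content": "dispatch latency note\nPattern: fires ~60s after upstream CPU spike (likely retry logic)"},
    {"op": "add", "path": "sre/retry-storm.md", "content": "retry storm"},  # consolidated entry
    {"op": "remove", "path": "sre/dup-1.md"},
    {"op": "remove", "path": "sre/dup-2.md"},
    {"op": "remove", "path": "sre/stale.md"},
    {"op": "add", "path": "sre/VERIFIED.md",
     "content": "verified against session transcripts"},
]
apply_diff(files, ops)
```

Keeping the diff as explicit operations is what makes the manual-review step practical: a human or a checking pipeline can inspect each op before anything touches the live store.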

And to close out, I think that over the next couple of months we're going to start seeing agents that run for days or many, many hours at a time, and I think memory is going to be a really important part of the system that ultimately makes that possible. So I'm really excited to see what everyone builds with memory and dreaming in the Claude Managed Agents API, and you should get started today. Thank you. Thank you.


Source: Anthropic official announcement video (Mahesh, Platform Team PM, 24:28)
Transcription: Whisper large-v3-turbo (M5 Mac, 8 CPU threads, ~6m38s)
Date: 2026-05-17 by yuki
Japanese commentary: AIに「経験値」を貯めさせる ("Letting AI build up experience points")