Why LLM applications need better memory management
Tuesday, May 6, 2025, 11:00 AM, from InfoWorld
You delete a dependency. ChatGPT acknowledges it. Five responses later, it hallucinates that same deprecated library into your code. You correct it again—it nods, apologizes—and does it once more.
This isn’t just an annoying bug. It’s a symptom of a deeper problem: LLM applications don’t know what to forget. Developers assume generative AI-powered tools are improving dynamically—learning from mistakes, refining their knowledge, adapting. But that’s not how it works. Large language models (LLMs) are stateless by design. Each request is processed in isolation unless an external system supplies prior context. That means “memory” isn’t actually built into the model—it’s layered on top, often imperfectly.

If you’ve used ChatGPT for any length of time, you’ve probably noticed:

- It remembers some things between sessions but forgets others entirely.
- It fixates on outdated assumptions, even after you’ve corrected it multiple times.
- It sometimes “forgets” within a session, dropping key details.

These aren’t failures of the model—they’re failures of memory management.

How memory works in LLM applications

LLMs don’t have persistent memory. What feels like “memory” is actually context reconstruction, where relevant history is manually reloaded into each request. In practice, an application like ChatGPT layers multiple memory components on top of the core model:

- Context window: Each session retains a rolling buffer of past messages. GPT-4o supports up to 128K tokens, while other models have their own limits (e.g., Claude supports 200K tokens).
- Long-term memory: Some high-level details persist across sessions, but retention is inconsistent.
- System messages: Invisible prompts shape the model’s responses. Long-term memory is often passed into a session this way.
- Execution context: Temporary state, such as Python variables, exists only until the session resets.

Without external memory scaffolding, LLM applications remain stateless. Every API call is independent, meaning prior interactions must be explicitly reloaded for continuity.

Why LLMs are stateless by default

In API-based LLM integrations, models don’t retain any memory between requests. Unless you manually pass prior messages, each prompt is interpreted in isolation. Here’s a simple example of an API call to OpenAI’s GPT-4o:

    import { OpenAI } from 'openai';

    const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

    // The model sees only what is in this request. Any earlier exchange must be
    // replayed in the messages array, or it effectively never happened.
    const response = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [
        { role: 'system', content: 'You are an expert Python developer helping the user debug.' },
        { role: 'user', content: 'Why is my function throwing a TypeError?' },
        { role: 'assistant', content: 'Can you share the error message and your function code?' },
        { role: 'user', content: 'Sure, here it is...' },
      ],
    });

Each request must explicitly include past messages if context continuity is required. If the conversation history grows too long, you must design a memory system to manage it—or risk responses that truncate key details or cling to outdated context. This is why memory in LLM applications often feels inconsistent: if past context isn’t reconstructed properly, the model will either cling to irrelevant details or lose critical information.
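One common way to manage that growth is to keep the most recent turns verbatim and fold older turns into a running summary that is re-injected as a system message. The sketch below shows what that might look like with the same SDK. It is illustrative only: the helper names (buildContext, estimateTokens), the token budget, and the crude length-based token estimate are assumptions, not part of any library.

    import { OpenAI } from 'openai';

    const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

    type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

    // Rough token estimate (~4 characters per token); a real system would use a tokenizer.
    const estimateTokens = (messages: ChatMessage[]) =>
      messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);

    // Hypothetical working-memory helper: keep the system prompt and the latest turns,
    // and compress everything older into a summary so the request stays within budget.
    async function buildContext(
      systemPrompt: string,
      history: ChatMessage[],
      maxTokens = 4000,
      keepRecent = 6,
    ): Promise<ChatMessage[]> {
      if (estimateTokens(history) <= maxTokens || history.length <= keepRecent) {
        return [{ role: 'system', content: systemPrompt }, ...history];
      }

      const older = history.slice(0, -keepRecent); // candidates for summarization
      const recent = history.slice(-keepRecent);   // kept verbatim

      const summary = await openai.chat.completions.create({
        model: 'gpt-4o',
        messages: [
          {
            role: 'system',
            content:
              'Summarize this conversation for future context. Preserve decisions and corrections ' +
              '(for example, dependencies the user removed); drop pleasantries and dead ends.',
          },
          { role: 'user', content: older.map((m) => `${m.role}: ${m.content}`).join('\n') },
        ],
      });

      return [
        { role: 'system', content: systemPrompt },
        { role: 'system', content: `Summary of earlier conversation: ${summary.choices[0].message.content ?? ''}` },
        ...recent,
      ];
    }

    // Usage: const messages = await buildContext(systemPrompt, history);
    //        const response = await openai.chat.completions.create({ model: 'gpt-4o', messages });

What survives that summarization step is exactly where such systems succeed or fail: if a correction like “I removed that library” gets dropped from the summary, the assistant will go right back to suggesting it.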
When LLM applications won’t let go

Some LLM applications have the opposite problem—not forgetting too much, but remembering the wrong things. Have you ever told ChatGPT to “ignore that last part,” only for it to bring it up later anyway? That’s what I call “traumatic memory”—when an LLM stubbornly holds onto outdated or irrelevant details, actively degrading its usefulness.

For example, I once tested a Python library for a project, found it wasn’t useful, and told ChatGPT I had removed it. It acknowledged this—then continued suggesting code snippets using that same deprecated library. This isn’t an AI hallucination issue. It’s bad memory retrieval.

Anthropic’s Claude, which offers prompt caching and persistent memory, moves in the right direction. Claude lets developers mark stable prompt prefixes as cacheable so they don’t have to be reprocessed in full on every request—reducing repetition and making session structure more explicit. But while caching improves continuity, it still leaves the broader challenge unsolved: Applications must manage what to keep active in working memory, what to demote to long-term storage, and what to discard entirely. Claude’s tools help, but they’re only part of the control system developers need to build. The real challenge isn’t just adding memory—it’s designing better forgetting.

Smarter memory requires better forgetting

Human memory isn’t just about remembering—it’s selective. We filter details based on relevance, moving the right information into working memory while discarding noise. LLM applications lack this ability unless we explicitly design for it.

Right now, memory systems for LLMs fall into two flawed categories:

- Stateless AI: Completely forgets past interactions unless manually reloaded.
- Memory-augmented AI: Retains some information but prunes the wrong details, with no concept of priority.

To build better LLM memory, applications need:

- Contextual working memory: Actively managed session context with message summarization and selective recall to prevent token overflow.
- Persistent memory systems: Long-term storage that retrieves based on relevance, not raw transcripts. Many teams use vector-based search (e.g., semantic similarity on past messages), but relevance filtering is still weak.
- Attentional memory controls: A system that prioritizes useful information while fading outdated details. Without this, models will either cling to old data or forget essential corrections. For example, a coding assistant should stop suggesting deprecated dependencies after multiple corrections.

Current AI tools fail at this because they either:

- Forget everything, forcing users to re-provide context, or
- Retain too much, surfacing irrelevant or outdated information.

The missing piece isn’t bigger memory—it’s smarter forgetting.

GenAI memory must get smarter, not bigger

Simply increasing context window sizes won’t fix the memory problem. LLM applications need:

- Selective retention: Store only high-relevance knowledge, not entire transcripts.
- Attentional retrieval: Prioritize important details while fading old, irrelevant ones.
- Forgetting mechanisms: Outdated or low-value details should decay over time (a minimal scoring sketch follows below).

The next generation of AI tools won’t be the ones that remember everything. They’ll be the ones that know what to forget. Developers building LLM applications should start by shaping working memory. Design for relevance at the contextual layer, even if persistent memory expands over time.
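To make decay and selective recall concrete, here is a minimal, self-contained sketch of how an application might score stored memories by blending relevance to the current query with time-based decay and explicit invalidation (for instance, when the user says a dependency was removed). Every name here (MemoryItem, scoreMemory, recallRelevant) and every weight is an illustrative assumption rather than an established API; a production system would replace the crude word-overlap relevance with embedding-based semantic similarity.

    // Illustrative memory store with relevance scoring, time-based decay, and invalidation.

    type MemoryItem = {
      text: string;          // the remembered fact, e.g. "User removed the legacy HTTP client library"
      createdAt: number;     // epoch milliseconds
      importance: number;    // 0..1, assigned when the memory is written
      invalidated?: boolean; // set when the user explicitly corrects or retracts the fact
    };

    // Crude lexical relevance: fraction of the memory's words that appear in the query.
    // A real system would use embeddings here.
    function relevance(query: string, item: MemoryItem): number {
      const queryWords = new Set(query.toLowerCase().split(/\W+/).filter(Boolean));
      const itemWords = item.text.toLowerCase().split(/\W+/).filter(Boolean);
      if (itemWords.length === 0) return 0;
      return itemWords.filter((w) => queryWords.has(w)).length / itemWords.length;
    }

    // Exponential decay: a memory loses half its weight every halfLifeDays.
    function decay(item: MemoryItem, now: number, halfLifeDays = 14): number {
      const ageDays = (now - item.createdAt) / (1000 * 60 * 60 * 24);
      return Math.pow(0.5, ageDays / halfLifeDays);
    }

    // Combined score: invalidated memories never resurface; otherwise blend
    // relevance, recency, and importance (weights are arbitrary for the sketch).
    function scoreMemory(query: string, item: MemoryItem, now = Date.now()): number {
      if (item.invalidated) return 0;
      return 0.6 * relevance(query, item) + 0.25 * decay(item, now) + 0.15 * item.importance;
    }

    // Select the top-k memories worth reloading into the prompt for the next request.
    function recallRelevant(query: string, store: MemoryItem[], k = 3): MemoryItem[] {
      return store
        .map((item) => ({ item, score: scoreMemory(query, item) }))
        .filter(({ score }) => score > 0.1) // below the threshold, the memory is simply forgotten
        .sort((a, b) => b.score - a.score)
        .slice(0, k)
        .map(({ item }) => item);
    }

In this scheme, a correction such as “I removed that library” flips invalidated on the matching memory so it can never be recalled again, while stale but uncorrected details fade gradually through the half-life decay. That is the behavior the earlier anecdote was missing.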
https://www.infoworld.com/article/3972932/why-llm-applications-need-better-memory-management.html