The CrewAI quickstart gives you memory in three lines of configuration. Short-term memory tracks conversation context within a crew run. Long-term memory persists across executions. Entity memory accumulates facts about recurring subjects. All of it works in development, where “persistence” means a SQLite file in the project directory and “retrieval” means loading everything into the context window.
Production breaks this model in three places: the persistence backend cannot be a local file when crews run across multiple instances, retrieval cannot load the full memory store when it contains thousands of entries, and state recovery after a mid-execution failure requires decisions the framework does not make for you.
| Memory Type | Default Behavior | Production Requirement | What Breaks Without It |
|---|---|---|---|
| Short-term | In-process, cleared after crew execution completes | Checkpointed to external store for crash recovery | Mid-execution failures lose all accumulated context; crew restarts from zero |
| Long-term | SQLite file, single-process access | PostgreSQL/pgvector or dedicated vector DB with concurrent access | Multiple crew instances corrupt the SQLite file or read stale data |
| Entity | In-memory dictionary, lost on restart | Persistent entity store with deduplication and merge logic | Entity facts are re-learned from scratch every execution, wasting tokens and producing inconsistent outputs |
| Retrieval | Full memory loaded into context | Relevance-scored retrieval with token budget limits | Context window fills with irrelevant memories, degrading task quality and inflating costs |
| Eviction | No eviction — memories accumulate indefinitely | TTL-based eviction, relevance decay, or explicit pruning | Memory store grows unbounded; retrieval quality degrades as noise increases |
The Persistence Backend Decision
The persistence backend determines everything downstream: concurrent access patterns, query capabilities, backup and recovery, and operational complexity. The decision is less about which database is “best” and more about which database you already operate.
PostgreSQL with pgvector is the default production recommendation when the team already runs PostgreSQL. Store memory metadata (timestamps, task associations, entity references) in relational tables. Store memory embeddings in pgvector columns for similarity search. The operational overhead is zero if PostgreSQL is already in the infrastructure — you are adding tables, not a system.
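A minimal schema sketch under these assumptions (table and column names are illustrative, not CrewAI's built-in schema; the embedding dimension must match the embedding model in use):

```python
# Illustrative DDL for long-term memory on PostgreSQL + pgvector.
# Metadata lives in ordinary columns; embeddings in a vector column.
MEMORY_TABLE_DDL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS crew_memories (
    id          BIGSERIAL PRIMARY KEY,
    crew_name   TEXT NOT NULL,
    memory_type TEXT NOT NULL,        -- short_term | long_term | entity
    entity_ref  TEXT,                 -- set for entity memories
    content     TEXT NOT NULL,
    token_count INTEGER NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    embedding   vector(1536)          -- dimension of the embedding model
);
"""

# Similarity search using pgvector's cosine-distance operator (<=>).
# 1 - distance gives a cosine-similarity "relevance" score.
SIMILARITY_QUERY = """
SELECT content, 1 - (embedding <=> %(query_embedding)s) AS relevance
FROM crew_memories
WHERE crew_name = %(crew_name)s
ORDER BY embedding <=> %(query_embedding)s
LIMIT %(k)s;
"""
```

The queries are parameterized in psycopg's named-placeholder style; any PostgreSQL driver works, since the point is the schema shape, not the client.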
Dedicated vector databases (Qdrant, Weaviate, Pinecone) make sense when memory retrieval patterns require capabilities PostgreSQL does not provide efficiently: filtered vector search at high cardinality, real-time index updates without locking, or multi-tenancy with strict isolation. The tradeoff is operational complexity — you now run and monitor an additional system.
Redis with vector search works for high-throughput, low-latency memory access where durability is secondary. Short-term memory checkpointing, session-scoped entity caching, and temporary context storage are strong Redis use cases. Long-term memory that must survive infrastructure failures is not.
```python
from typing import Optional
from datetime import datetime
from enum import Enum
from pydantic import BaseModel, Field


class MemoryType(str, Enum):
    SHORT_TERM = "short_term"
    LONG_TERM = "long_term"
    ENTITY = "entity"


class EvictionPolicy(str, Enum):
    TTL = "ttl"
    RELEVANCE_DECAY = "relevance_decay"
    LRU = "lru"
    MANUAL = "manual"


class MemoryBackendConfig(BaseModel):
    backend: str
    connection_string: str
    max_memories_per_retrieval: int = Field(default=10, ge=1, le=50)
    relevance_threshold: float = Field(default=0.7, ge=0.0, le=1.0)
    eviction_policy: EvictionPolicy = EvictionPolicy.TTL
    ttl_days: Optional[int] = Field(default=90, ge=1)
    checkpoint_interval_tasks: int = Field(default=1, ge=1)
    token_budget_per_retrieval: int = Field(default=4000, ge=100)


class CrewMemoryProfile(BaseModel):
    crew_name: str
    short_term_config: MemoryBackendConfig
    long_term_config: MemoryBackendConfig
    entity_config: Optional[MemoryBackendConfig] = None
    total_memories_stored: int = Field(default=0, ge=0)
    avg_retrieval_latency_ms: float = Field(default=0.0, ge=0.0)
    avg_tokens_per_retrieval: int = Field(default=0, ge=0)
    last_eviction_run: Optional[datetime] = None
```

The `token_budget_per_retrieval` field is the most operationally significant configuration. Without it, a crew with 5,000 stored memories and a relevance threshold of 0.5 might retrieve 200 memories at 500 tokens each — 100,000 tokens added to every task’s context window. The budget caps this: retrieve the top-N most relevant memories that fit within the token allocation.
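The cap can be sketched as a small selection routine (hypothetical helper; it assumes candidates arrive already relevance-scored from the vector store):

```python
# Hypothetical helper: enforce a token budget on retrieved memories.
# Each candidate is (relevance_score, token_count, text), already scored
# against the current task context.
def select_within_budget(candidates, token_budget):
    chosen, spent = [], 0
    # Highest relevance first; skip anything that would exceed the budget.
    for relevance, tokens, text in sorted(candidates, key=lambda c: c[0], reverse=True):
        if spent + tokens > token_budget:
            continue
        chosen.append(text)
        spent += tokens
    return chosen, spent
```

With a budget of 1,000 tokens and three 500-token candidates, only the two most relevant are included; using `continue` rather than `break` lets a smaller, lower-ranked memory still fill remaining budget.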
Retrieval Strategy: Relevance Over Recency
The default retrieval pattern — load recent memories — works when the crew processes a narrow, sequential workload. A customer support crew handling one conversation thread benefits from recency because the most recent messages are the most relevant.
Production crews processing diverse workloads need relevance-scored retrieval. A crew that has processed 500 different customer accounts should retrieve memories related to the current account, not the memories from the most recent execution regardless of account.
The retrieval pipeline for production:
- Embed the current task context using the same embedding model that produced the stored memory embeddings
- Vector similarity search against the memory store with the configured relevance threshold
- Metadata filtering to scope retrieval to the relevant entity, project, or time window
- Token budget enforcement — rank by relevance score, include memories until the token budget is exhausted
- Deduplication — remove memories that are semantically redundant (cosine similarity > 0.95 between retrieved memories)
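The final deduplication step can be sketched as follows (hypothetical helper with synthetic embeddings; a real pipeline would reuse the store's vectors):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Drop retrieved memories that are near-duplicates (cosine similarity
# > 0.95) of an already-kept, higher-relevance memory.
# Each memory is (relevance_score, embedding, text).
def dedup_retrieved(memories, sim_threshold=0.95):
    kept = []
    for relevance, emb, text in sorted(memories, key=lambda m: m[0], reverse=True):
        if all(cosine(emb, kept_emb) <= sim_threshold for _, kept_emb, _ in kept):
            kept.append((relevance, emb, text))
    return [text for _, _, text in kept]
```

Sorting by relevance first guarantees that when two memories are redundant, the higher-scoring one survives.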
Entity Memory: The Deduplication Problem
Entity memory stores facts about entities the crew encounters: customer names, product specifications, system configurations. The accumulation pattern is straightforward — the crew learns something about Entity X and stores it. The problem is that the crew often learns the same thing multiple times, or learns an updated fact that should replace the previous version.
Without deduplication and merge logic, entity memory degrades predictably:
- Redundancy inflation. The same fact stored twelve times consumes twelve retrieval slots and twelve token allocations, contributing nothing beyond the first instance.
- Contradiction accumulation. The entity’s state changes, but old memories persist alongside new ones. The crew retrieves “Customer X uses PostgreSQL 14” and “Customer X uses PostgreSQL 16” and must reconcile the contradiction during inference — burning tokens and risking incorrect output.
- Entity drift. Multiple references to the same entity with slightly different names (ActiveWizards, Active Wizards, AW) create separate entity memory clusters that never merge.
The fix is a merge-on-write pattern: before storing a new entity memory, check for existing memories about the same entity. If the new memory updates an existing fact, update rather than append. If the new memory contradicts an existing fact, replace the old fact with a timestamp indicating when the update occurred.
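A minimal sketch of merge-on-write (the canonicalization here is deliberately naive — lowercase, strip whitespace; production systems usually need alias resolution or embedding-based entity matching):

```python
from datetime import datetime, timezone

class EntityFactStore:
    """Hypothetical entity store keyed on (canonical entity, fact key)."""

    def __init__(self):
        self._facts = {}  # (entity, fact_key) -> {"value", "updated_at"}

    @staticmethod
    def _canonical(name: str) -> str:
        # Naive normalization so "Active Wizards" and "ActiveWizards" merge.
        return "".join(name.lower().split())

    def upsert(self, entity: str, fact_key: str, value: str, ts: datetime) -> str:
        key = (self._canonical(entity), fact_key)
        existing = self._facts.get(key)
        if existing and existing["value"] == value:
            return "duplicate_skipped"   # same fact learned again: no-op
        self._facts[key] = {"value": value, "updated_at": ts}
        # Contradiction: replace the old fact, timestamp the update.
        return "replaced" if existing else "inserted"
```

Learning "PostgreSQL 14" twice stores one fact; learning "PostgreSQL 16" afterwards replaces it rather than accumulating a contradiction.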
State Recovery After Failure
A crew executing a five-task workflow fails on task three. Short-term memory contains the context from tasks one and two. Long-term memory may or may not have been updated, depending on when the memory persistence happens — at task completion, at crew completion, or on a configurable interval.
The recovery question is: what state should the crew have when it resumes?
Option 1: Replay from the beginning. Safe but expensive. The crew re-executes tasks one and two, re-accumulating the short-term memory. Works when tasks are idempotent and cheap. Fails when tasks have side effects (sent emails, wrote to databases) or when re-execution is cost-prohibitive.
Option 2: Resume from checkpoint. The crew loads the checkpointed memory state and resumes from the last completed task. This requires that checkpoints are taken at task boundaries and that the checkpoint includes both short-term and long-term memory state. The gap between the checkpoint and the failure moment is lost.
Option 3: Hybrid with Temporal. Temporal provides durable execution with activity-level checkpointing. CrewAI memory state is persisted as part of the Temporal workflow state. On failure, Temporal resumes the workflow, and the CrewAI crew restores its memory from the Temporal checkpoint. This is the strongest recovery model but requires Temporal infrastructure.
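The checkpoint-and-resume pattern from Option 2 can be sketched independently of any framework (hypothetical helper; `store` stands in for a durable key-value backend such as Redis or PostgreSQL — here a plain dict so the sketch is self-contained):

```python
def run_with_checkpoints(crew_id, tasks, execute, store):
    """Execute tasks in order, checkpointing state at each task boundary."""
    state = store.get(crew_id) or {"completed": 0, "memory": []}
    # Resume from the first incomplete task, with memory restored.
    for i in range(state["completed"], len(tasks)):
        result = execute(tasks[i], state["memory"])  # may raise mid-workflow
        state["memory"].append(result)
        state["completed"] = i + 1
        store[crew_id] = state  # checkpoint after every completed task
    return state["memory"]
```

Simulating a crash on task three and re-running: tasks one and two are not re-executed, because the checkpoint recorded them as complete. The gap between the last checkpoint and the failure moment is still lost, exactly as described above.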
Token Economics of Memory
Memory is not free in token terms. Every memory retrieved adds to the context window, and context window size correlates directly with inference cost and latency.
The economics are measurable:
A crew with 10 tasks, retrieving 8 memories per task at an average of 400 tokens per memory, adds 32,000 tokens to the total crew execution cost. At GPT-4o pricing ($2.50 per million input tokens), that is $0.08 per execution. At 1,000 executions per day, memory retrieval alone costs $80/day — $2,400/month.
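The same arithmetic as a reusable helper (hypothetical function; pricing is the input-token rate):

```python
def memory_cost_per_day(tasks, memories_per_task, tokens_per_memory,
                        price_per_m_tokens, executions_per_day):
    """Daily cost of memory retrieval tokens, in dollars."""
    tokens_per_execution = tasks * memories_per_task * tokens_per_memory
    return tokens_per_execution * executions_per_day * price_per_m_tokens / 1_000_000
```

Plugging in the figures above — 10 tasks, 8 memories, 400 tokens, $2.50 per million input tokens, 1,000 executions/day — reproduces the $80/day result.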
The cost is justified only when the retrieved memories improve task quality enough to offset the spend. Monitoring the relationship between memory retrieval volume and output quality is essential: if output quality does not degrade when retrieval is reduced from 8 to 4 memories per task, you are paying for retrieval that adds tokens without adding value.
Memory token consumption interacts directly with the broader CrewAI cost control levers — model routing and crew composition — because a crew optimized for inference cost but left with unbounded memory retrieval will find its savings absorbed by context inflation. Token budgets must be set at both the crew and memory retrieval level.
The cost audit methodology should include memory token consumption as a line item, not buried in general inference costs.
Memory Isolation in Multi-Tenant Deployments
When a single CrewAI deployment serves multiple tenants — different customers, different projects, different business units — memory isolation is a hard requirement. Tenant A’s memories must never appear in Tenant B’s retrieval results. The patterns for CrewAI enterprise authentication and tenant isolation address the broader authentication surface; memory isolation is one component of that boundary, not a separate problem.
The isolation boundary must be at the infrastructure level, not the application level. Application-level filtering (adding a WHERE tenant_id = X clause) is bypassable through bugs, misconfigurations, or query injection. Infrastructure-level isolation means separate memory stores, separate vector namespaces, or database-level row security that enforces isolation regardless of application logic.
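For PostgreSQL-backed memory, database-level row security looks roughly like this (illustrative SQL; assumes a `tenant_id` column and a per-session `app.tenant_id` setting established by the connection layer):

```python
# Row-level security enforced by the database regardless of application
# logic. FORCE applies the policy even to the table owner.
RLS_POLICY = """
ALTER TABLE crew_memories ENABLE ROW LEVEL SECURITY;
ALTER TABLE crew_memories FORCE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON crew_memories
    USING (tenant_id = current_setting('app.tenant_id'));
"""
```

With this in place, a forgotten `WHERE tenant_id = X` clause returns no cross-tenant rows instead of leaking them.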
For regulated industries, memory isolation extends to data classification. If Tenant A’s data is classified as PII and Tenant B’s data is public, the memory stores must enforce different retention and access policies per classification level.
Monitoring Memory Health
Memory health monitoring should track four metrics:
Retrieval relevance distribution. The average and P95 relevance scores of retrieved memories. A declining trend indicates that the memory store is accumulating noise faster than signal — the eviction policy needs adjustment.
Memory utilization ratio. The percentage of stored memories that have been retrieved at least once in the trailing 30 days. A utilization ratio below 10% means 90% of stored memories are dead weight — consuming storage and degrading retrieval without contributing to any task.
Token consumption per retrieval. Track the actual tokens consumed by memory retrieval as a percentage of total inference tokens. If memory retrieval consistently exceeds 30% of total tokens, the retrieval budget is too generous or the memories are too verbose.
Entity memory consistency. The number of contradictory facts stored for the same entity. This requires periodic entity memory audits — a background job that scans for duplicate or contradictory entries and flags them for merge or eviction.
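The utilization metric above is simple to compute if the store records a last-retrieved timestamp per memory (hypothetical helper; `None` marks a memory that has never been retrieved):

```python
from datetime import datetime, timedelta, timezone

def utilization_ratio(last_retrieved, now, window=timedelta(days=30)):
    """Fraction of stored memories retrieved at least once in the window."""
    if not last_retrieved:
        return 0.0
    hits = sum(1 for ts in last_retrieved if ts is not None and now - ts <= window)
    return hits / len(last_retrieved)
```

Four stored memories with one retrieved five days ago, one ninety days ago, and two never retrieved yield a ratio of 0.25 — well below the 10% floor only if the store is large and mostly dead weight.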
- Replace SQLite with a production persistence backend before deploying CrewAI crews to multi-instance environments. PostgreSQL with pgvector is the lowest-overhead option if PostgreSQL is already in the stack.
- Configure token budgets for memory retrieval. Without explicit limits, memory retrieval costs scale linearly with memory store size.
- Implement relevance-scored retrieval with a minimum threshold. Recency-based retrieval degrades when the crew processes diverse workloads.
- Build entity memory deduplication and merge logic. Without it, entity memory accumulates redundant and contradictory facts that degrade output quality.
- Checkpoint memory state at task boundaries for crash recovery. Crew-level checkpointing loses all progress on multi-task workflows.
- Enforce memory isolation at the infrastructure level for multi-tenant deployments. Application-level filtering is insufficient for security and compliance.
- Monitor retrieval relevance, memory utilization, and token consumption. Declining retrieval relevance is the earliest signal that eviction policy needs adjustment.
FAQ
What memory types does CrewAI support and when should each be used?
CrewAI supports short-term memory (conversation context within a crew execution), long-term memory (persistent across executions), and entity memory (structured facts about recurring entities). Short-term is automatic. Long-term requires a persistence backend. Entity memory is valuable when the crew processes recurring entities where accumulated knowledge improves task quality.
What persistence backend should I use for CrewAI long-term memory in production?
PostgreSQL with pgvector for teams already running PostgreSQL. A dedicated vector database (Qdrant, Weaviate) when retrieval patterns require filtered vector search at high cardinality or strict multi-tenant isolation. Redis for high-throughput, low-latency scenarios where durability is secondary.
How do you handle memory state recovery when a CrewAI crew fails mid-execution?
Checkpoint memory state at task boundaries. On failure, restore from the last checkpoint and resume from the next incomplete task. For the strongest recovery model, combine CrewAI memory with Temporal workflow checkpointing for durable execution with memory continuity.
Does CrewAI memory increase token consumption?
Yes. Every retrieved memory adds tokens to the context window. Control costs by setting a token budget per retrieval, using relevance scoring with a minimum threshold, and monitoring the ratio of memory tokens to total inference tokens. If memory retrieval exceeds 30% of total tokens, the budget is too generous.
The Memory You Cannot Reconstruct
Memory infrastructure is not a feature to add later. A crew that has been running for six months without proper memory persistence has lost six months of accumulated knowledge — customer interactions, entity facts, task outcomes. That knowledge cannot be reconstructed from logs or outputs. The architectural decision to instrument memory persistence is made once, and the cost of making it late is measured in lost institutional knowledge that the crew must re-learn from scratch.
Assess Your CrewAI Memory Architecture
If your CrewAI deployment is running with default memory configuration — SQLite persistence, no retrieval limits, no eviction policy — or if memory-related costs are growing faster than crew execution volume, a CrewAI engineering engagement can assess the current memory architecture, design the persistence and retrieval infrastructure for production scale, and implement the monitoring that makes memory health visible before it degrades output quality.
Request CrewAI Engineering Support
If you want the multi-agent assessment framework first, start with the Enterprise AI Assessment Kit.