The persistence layer decision is the one LangGraph teams get wrong first. They ship to production on MemorySaver, watch a pod restart kill twenty in-flight agent threads, then spend the next sprint reverse-engineering a database schema while their checkpoint writes block the event loop.
This post covers the four decisions that determine whether your LangGraph state architecture survives production: which checkpointer backend to use and when, what belongs in state versus an external store, when to checkpoint every node versus selectively, and how to recover cleanly from failures — including replaying with modified state when the original execution path was wrong.
| Scenario | Checkpointer | Rationale | Watch Out For |
|---|---|---|---|
| Local dev, unit tests | MemorySaver | Zero setup, fast, resets cleanly between test runs | Never use in production — state lost on any restart |
| Single-process server, low concurrency | SqliteSaver | File-backed durability, no external dep, ACID writes | SQLite write-locks serialize concurrent threads — not horizontally scalable |
| Multi-process or containerized deployment | PostgresSaver | ACID, queryable, doubles as audit log, scales with pooling | Requires connection pooling (asyncpg/psycopg_pool); blocking connections stall the event loop |
| High-throughput (>5k checkpoints/hour) | PostgresSaver + pgbouncer | Connection pooler absorbs burst writes; Postgres handles the rest | Checkpoint table grows unbounded — implement retention policy |
| Low-latency workflows, sub-10ms checkpoint budget | Redis-backed saver | Redis write latency is meaningfully lower than Postgres under burst load — measure your actual p95 before committing to this tier | Redis requires persistence config (AOF) — default Redis loses data on restart |
| Cross-cloud or serverless execution | Custom BaseCheckpointSaver (DynamoDB, S3) | Matches existing infrastructure, avoids VPC-crossing to Postgres | Must implement get/put/list correctly — subtle bugs in put() cause silent state loss |
| HITL approval workflows with audit requirements | PostgresSaver | SQL queries on checkpoint table replace a separate audit log infrastructure | Store interrupt timestamps in state — LangGraph has no built-in TTL for paused threads |
State Schema Design: What Goes In, What Stays Out
LangGraph state is a TypedDict (or Pydantic model) that gets serialized to the checkpointer after every node execution. Every field in state pays a cost: serialization time, storage size, and — critically — LLM context budget when you inject state fields into prompts.
The schema design rule is not “put everything in so nodes can access it.” It is: put in the minimum that conditional edges need to route correctly, plus the accumulating artifacts the LLM needs to reason.
```python
from typing import Annotated, TypedDict, Literal
import operator

class AgentState(TypedDict):
    # --- Routing fields (what conditional edges read) ---
    task_id: str
    phase: Literal["research", "draft", "review", "done"]
    retry_count: int
    failure_class: str | None

    # --- Accumulated reasoning artifacts ---
    messages: Annotated[list[dict], operator.add]  # append-only via reducer
    tool_outputs: Annotated[list[dict], operator.add]

    # --- External references (IDs only, not content) ---
    source_doc_ids: list[str]  # S3 keys or DB row IDs
    vector_search_result_ids: list[str]

    # --- Final output ---
    draft: str | None
    review_verdict: Literal["pass", "revise", "escalate"] | None
```

Three patterns to enforce:
External references, not content. A research agent that fetches a 40-page PDF should store the S3 key in source_doc_ids, not the document text in state. The PDF is fetched by the node that needs it. The state checkpoint stays kilobytes, not megabytes.
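As a minimal sketch of the pattern, a hypothetical `externalize` helper with an in-memory dict standing in for S3 or a database table (names and ID format are illustrative, not a LangGraph API):

```python
import hashlib

def externalize(doc_text: str, store: dict) -> str:
    """Persist large content outside of graph state; return only a stable ID.

    `store` is a stand-in for S3 or a database table. The node that needs
    the document later fetches it back by ID instead of reading it from state.
    """
    doc_id = "doc-" + hashlib.sha256(doc_text.encode("utf-8")).hexdigest()[:12]
    store[doc_id] = doc_text
    return doc_id

store: dict[str, str] = {}
ref = externalize("forty pages of PDF text ...", store)
# State carries only `ref` (a few dozen bytes); the content lives in `store`.
```

The content-addressed ID also deduplicates repeated fetches of the same document for free.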
Append-only accumulation with reducers. The Annotated[list, operator.add] pattern tells LangGraph to merge rather than overwrite when a node returns a partial state update. Without this, every node that touches messages must return the full list or it clobbers earlier entries. Reducers also make parallelism safe — two branches can independently append to the same list without coordination.
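Conceptually, the reducer is just `operator.add` applied to the old field value and the node's partial update. This snippet simulates the merge semantics in plain Python; it is an illustration of what the reducer does, not the LangGraph API itself:

```python
import operator

# What an Annotated[list, operator.add] field does when a node returns a
# partial update: the reducer combines existing value and update instead
# of overwriting.
existing = [{"role": "user", "content": "question"}]
node_update = [{"role": "assistant", "content": "answer"}]

merged = operator.add(existing, node_update)
assert merged == existing + node_update  # both entries survive; nothing clobbered
```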
Routing fields stay primitive. Enums, strings, integers. If a conditional edge needs to evaluate a complex object to decide where to route, the routing logic belongs in a function, not in a rich nested object in state.
Run `json.dumps(state)` and check the byte count after every major schema change. If a single checkpoint exceeds 50KB, something belongs in an external store. Checkpoints that large will measurably increase your per-node latency, and they will eventually break LLM-injected context.

Checkpointer Configuration: The Code That Actually Matters
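That check is worth automating. A hypothetical guard you might call from a test or a debug hook, assuming the 50KB budget above (the helper name and limit are ours, not LangGraph's):

```python
import json

def checkpoint_size_bytes(state: dict, limit: int = 50_000) -> int:
    """Return the serialized size of a state dict; raise when it blows the budget."""
    size = len(json.dumps(state).encode("utf-8"))
    if size > limit:
        raise ValueError(
            f"checkpoint is {size} bytes (> {limit}); move large fields to an external store"
        )
    return size

# A lean state that carries external references stays tiny
lean_state = {"task_id": "t-1", "source_doc_ids": ["doc-001", "doc-002"]}
print(checkpoint_size_bytes(lean_state))
```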
Most teams configure their checkpointer incorrectly because the LangGraph docs show synchronous examples and production deployments are async. A synchronous psycopg.connect() call inside an async FastAPI lifespan blocks the event loop during checkpoint writes — invisible in testing, catastrophic under concurrent load.
```python
# Production-grade LangGraph setup with PostgreSQL checkpointer
from typing import Annotated, TypedDict, Literal
import operator
import logging

from langgraph.graph import StateGraph, END
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver
from psycopg_pool import AsyncConnectionPool

logger = logging.getLogger(__name__)

# --- State schema -----------------------------------------------------------
class ResearchAgentState(TypedDict):
    task_id: str
    query: str
    phase: Literal["search", "synthesize", "review", "done"]
    retry_count: int
    messages: Annotated[list[dict], operator.add]
    search_result_ids: list[str]  # external store references
    synthesis: str | None
    review_verdict: Literal["pass", "revise"] | None

# --- Checkpointer factory (async, connection-pooled) -------------------------
async def create_postgres_checkpointer(dsn: str) -> AsyncPostgresSaver:
    """
    Creates an AsyncPostgresSaver backed by a connection pool.

    - min_size=2: always-warm connections for immediate checkout
    - max_size=10: caps DB connections under burst load

    Call checkpointer.setup() once on first deploy to create the
    langgraph_checkpoints schema. Safe to call repeatedly — it is idempotent.
    """
    pool = AsyncConnectionPool(
        conninfo=dsn,
        min_size=2,
        max_size=10,
        kwargs={"autocommit": True},  # required by AsyncPostgresSaver
    )
    await pool.open()
    checkpointer = AsyncPostgresSaver(pool)
    # await checkpointer.setup()  # Uncomment on first deploy only
    return checkpointer

# --- Selective checkpointing via the subgraph trick --------------------------
# LangGraph checkpoints after every node by default.
# To checkpoint only at specific boundaries, compile a subgraph for
# high-frequency nodes and compile the outer graph with the durable
# checkpointer. The inner subgraph runs without persisting; only the
# outer graph persists.

def build_research_graph(checkpointer) -> "CompiledGraph":
    builder = StateGraph(ResearchAgentState)

    builder.add_node("search", search_node)
    builder.add_node("synthesize", synthesize_node)
    builder.add_node("review", review_node)
    builder.add_node("handle_revision", revision_node)

    builder.set_entry_point("search")
    builder.add_edge("search", "synthesize")
    builder.add_edge("synthesize", "review")
    builder.add_conditional_edges(
        "review",
        route_after_review,
        {
            "pass": END,
            "revise": "handle_revision",
            "retry_search": "search",
        },
    )
    builder.add_edge("handle_revision", "synthesize")

    # compile with durable checkpointer — every node persists state
    return builder.compile(checkpointer=checkpointer)

def route_after_review(state: ResearchAgentState) -> str:
    if state["review_verdict"] == "pass":
        return "pass"
    if state["retry_count"] >= 3:
        # Force exit after 3 retries rather than looping forever
        return "pass"
    if state.get("phase") == "review" and state["review_verdict"] == "revise":
        return "revise"
    return "retry_search"

# --- Sample node implementations ----------------------------------------------
def search_node(state: ResearchAgentState) -> dict:
    logger.info(f"[{state['task_id']}] search phase, retry={state['retry_count']}")
    # In production: call search API, store results externally, return IDs
    result_ids = _run_search(state["query"])
    return {
        "search_result_ids": result_ids,
        "phase": "search",
        "messages": [{"role": "system", "content": f"Search returned {len(result_ids)} results"}],
    }

def synthesize_node(state: ResearchAgentState) -> dict:
    # Fetch documents from external store by ID — never stored in state
    docs = _fetch_docs(state["search_result_ids"])
    synthesis = _call_llm_synthesize(state["query"], docs, state["messages"])
    return {
        "synthesis": synthesis,
        "phase": "synthesize",
        "messages": [{"role": "assistant", "content": synthesis}],
    }

def review_node(state: ResearchAgentState) -> dict:
    verdict = _call_llm_review(state["synthesis"], state["query"])
    return {
        "review_verdict": verdict,
        "phase": "review",
        "messages": [{"role": "system", "content": f"Review verdict: {verdict}"}],
    }

def revision_node(state: ResearchAgentState) -> dict:
    return {
        "retry_count": state["retry_count"] + 1,
        "review_verdict": None,
        "phase": "synthesize",  # loop back into synthesis; "draft" is not a valid phase
    }

# Stubs — replace with real implementations
def _run_search(query: str) -> list[str]:
    return ["doc-001", "doc-002"]

def _fetch_docs(ids: list[str]) -> list[str]:
    return [f"content of {i}" for i in ids]

def _call_llm_synthesize(q, docs, msgs) -> str:
    return "synthesis placeholder"

def _call_llm_review(synthesis, query) -> Literal["pass", "revise"]:
    return "pass"
```

Recovery Patterns: Resume, Replay, and Modified-State Restart
LangGraph’s thread_id is the recovery key. Every checkpoint for a given thread is addressable by this ID. Resume patterns split into three cases:
Straight resume — execution crashed mid-graph due to an infrastructure failure (pod eviction, process kill). The state at the last successful node is in the checkpointer. Call graph.ainvoke() with the same thread_id. LangGraph loads the latest checkpoint and continues from the next node in the queue. No code changes required.
```python
# Resume after crash — same thread_id, no state argument needed
config = {"configurable": {"thread_id": "task-abc-001"}}
result = await graph.ainvoke(None, config=config)
```

Replay with modified state — the graph completed but produced wrong output. You need to re-run from an earlier checkpoint with corrected inputs. Use graph.update_state() to patch a historical checkpoint, then invoke from that point.
```python
# Replay from a specific checkpoint with corrected query
config = {"configurable": {"thread_id": "task-abc-001"}}

# List all checkpoints for this thread
checkpoints = [c async for c in graph.aget_state_history(config)]
# checkpoints are ordered newest-first; pick the one before the bad node
target_checkpoint = checkpoints[2]  # 3rd most recent

# Patch the state at that checkpoint
await graph.aupdate_state(
    target_checkpoint.config,
    {"query": "corrected query text", "retry_count": 0},
    as_node="search",  # treat the update as if it came from this node
)

# Invoke from the patched checkpoint forward
result = await graph.ainvoke(None, config=target_checkpoint.config)
```

Multi-thread state isolation — each thread_id is a fully independent execution context. Two agent runs with different thread_id values never share state, even if they are running the same graph. This is the correct isolation primitive for multi-user or multi-job deployments. Never share a thread_id across unrelated tasks.
The related failure pattern is missing thread_id entirely — omitting configurable from the config dict falls back to an in-memory thread with no checkpoint persistence. Every agent invocation should explicitly set thread_id:
```python
import uuid

# Always generate a stable, traceable thread_id
thread_id = f"research-{job_id}-{uuid.uuid4().hex[:8]}"
config = {"configurable": {"thread_id": thread_id}}
```

State Size and Context Window Limits
The messages list is where state grows unbounded in production. A research agent that runs 20 tool calls before producing output may accumulate 40+ message entries. When those messages get injected into the LLM prompt, they eat context budget — and large state objects make every checkpoint write more expensive.
The fix is a summarization node that fires conditionally based on message count, not on every execution:
```python
def maybe_summarize(state: ResearchAgentState) -> dict:
    """
    Fires when the messages list exceeds a token threshold. Replaces older
    messages with a structured summary, keeping the last N messages verbatim.
    The raw history is stored externally if an audit trail is required.
    """
    messages = state["messages"]
    WINDOW_SIZE = 10        # keep last 10 messages verbatim
    TOKEN_THRESHOLD = 3000  # approximate token count trigger

    # Simple length proxy — replace with tiktoken for production accuracy
    approx_tokens = sum(len(m.get("content", "")) // 4 for m in messages)

    if approx_tokens <= TOKEN_THRESHOLD or len(messages) <= WINDOW_SIZE:
        return {}  # no change

    # Summarize everything except the last WINDOW_SIZE messages
    to_summarize = messages[:-WINDOW_SIZE]
    recent = messages[-WINDOW_SIZE:]

    summary_text = _call_llm_summarize(to_summarize)
    summary_entry = {
        "role": "system",
        "content": f"[Summary of {len(to_summarize)} earlier messages]: {summary_text}",
    }

    # Replace accumulated history with summary + recent messages
    # Note: returning a replacement list, not an append — avoid using operator.add here
    return {"messages": [summary_entry] + recent}
```

Wire this as a node in the graph that runs after high-accumulation nodes, with a conditional edge that skips it when the message count is below the threshold. Do not checkpoint after the summarization node unless the summarized state is the one you want to recover from — the raw history is gone after summarization unless you archived it externally first.
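A sketch of that conditional wiring, using a hypothetical `should_summarize` router (node names and the 20-message cutoff are assumptions, not fixed by the earlier code):

```python
def should_summarize(state: dict) -> str:
    # Route into the summarization node only when history is long enough;
    # a length check here is far cheaper than running maybe_summarize itself.
    return "summarize" if len(state["messages"]) > 20 else "skip"

# Wiring sketch, assuming the builder from earlier sections:
# builder.add_node("maybe_summarize", maybe_summarize)
# builder.add_conditional_edges(
#     "synthesize",
#     should_summarize,
#     {"summarize": "maybe_summarize", "skip": "review"},
# )
```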
Selective Checkpointing for Performance-Sensitive Workflows
LangGraph’s default behavior is to checkpoint after every node. For workflows with high-frequency nodes (token streaming, tight retry loops, sub-100ms tool calls), this adds measurable overhead per node execution.
The pattern for selective checkpointing is to compile a subgraph without a checkpointer for the high-frequency interior work, and compile the outer graph with the durable checkpointer for the decision boundaries that matter:
```python
# Inner subgraph: token-level streaming steps, no checkpointing
inner_builder = StateGraph(ResearchAgentState)
inner_builder.add_node("stream_tokens", token_streaming_node)
inner_builder.add_node("buffer_output", buffer_node)
inner_builder.set_entry_point("stream_tokens")
inner_builder.add_edge("stream_tokens", "buffer_output")
inner_builder.add_edge("buffer_output", END)

# compile WITHOUT checkpointer — inner nodes run fast, no storage I/O
inner_graph = inner_builder.compile()

# Outer graph: wraps the inner subgraph, checkpoints at coarse boundaries
outer_builder = StateGraph(ResearchAgentState)
outer_builder.add_node("prepare", prepare_node)
outer_builder.add_node("generate", inner_graph)  # subgraph as a node
outer_builder.add_node("review", review_node)
outer_builder.set_entry_point("prepare")
outer_builder.add_edge("prepare", "generate")
outer_builder.add_edge("generate", "review")
outer_builder.add_edge("review", END)

# compile WITH durable checkpointer — checkpoints only at outer node boundaries
outer_graph = outer_builder.compile(checkpointer=checkpointer)
```

The checkpoint granularity trade-off: coarser checkpointing means more re-execution after a failure (you replay from the last outer node boundary, not the last inner node). Finer checkpointing means more write overhead and larger checkpoint tables. For most production workflows, checkpointing at each major phase boundary (search → synthesize → review) rather than at every micro-step is the right balance.
Production Readiness Checklist
- Checkpointer backend matches deployment topology — MemorySaver only in tests; PostgresSaver or Redis for any multi-process or containerized deployment.
- State schema is JSON-serializable — no live DB connections, file handles, or `datetime` objects without ISO-8601 encoding.
- State size is bounded — external store IDs replace large payloads; summarization node fires before messages exceed 3k tokens.
- `thread_id` is explicit and traceable — includes job ID or correlation ID for log correlation; never omitted from config.
- Retry count is in state with a hard exit — conditional edge forces exit after N retries; no infinite loops possible.
- Checkpoint retention policy is defined — purge completed threads older than audit window; checkpoint table does not grow unbounded.
- Resume path is tested — integration test that kills the process mid-graph and verifies recovery to correct final state using the same thread_id.
State Management Across Long-Running Multi-Turn Agents
For agents that persist across multiple user sessions — a customer-facing assistant, a background research agent that runs over hours — the thread_id maps to a persistent conversation or job ID. The full checkpoint history accumulates in the checkpoint table, and you can page through it with aget_state_history() to reconstruct what the agent did at any point.
This creates two operational requirements that most teams implement late:
Checkpoint history pagination. aget_state_history() returns every checkpoint for a thread, newest first. For a thread with 500 node executions, that means 500 checkpoint records. Paginate (using the before parameter) so you avoid loading all 500 into memory to find the one you need.
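A sketch of that pagination loop, assuming aget_state_history accepts `limit` and `before` keyword arguments (verify against the signature in your installed LangGraph version; the helper itself is ours):

```python
async def find_checkpoint(graph, config, predicate, page_size=50):
    """Walk checkpoint history newest-first in pages until predicate matches.

    Avoids materializing the entire history for long-running threads.
    """
    before = None
    while True:
        page = [
            snap
            async for snap in graph.aget_state_history(
                config, limit=page_size, before=before
            )
        ]
        if not page:
            return None  # history exhausted without a match
        for snapshot in page:
            if predicate(snapshot):
                return snapshot
        # The last snapshot in the page is the oldest seen so far;
        # continue the scan strictly before it.
        before = page[-1].config
```

Usage: `await find_checkpoint(graph, config, lambda s: s.next == ("review",))` to locate the checkpoint just before the review node ran.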
Thread lifecycle management. When a job is complete, mark it done and stop writing new checkpoints to that thread. Implement a background process that identifies threads in terminal state (final node reached, no next nodes in snapshot) and archives or purges them on the checkpoint retention schedule. This prevents the langgraph_checkpoints table from becoming an unqueried archive of every agent run you have ever done.
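As a starting point for the purge step, a retention sweep might look like the following. The `checkpoints` table name matches what current langgraph-checkpoint-postgres releases create via setup(), but verify it against your deployed schema; the `completed_threads` bookkeeping table is an assumption, something your application must maintain when it marks a job done:

```python
# Retention sweep (sketch): purge checkpoint rows for threads that finished
# before the audit window. Table names are assumptions; check them against
# the schema your checkpointer's setup() actually created.
RETENTION_SQL = """
DELETE FROM checkpoints
WHERE thread_id IN (
    SELECT thread_id
    FROM completed_threads            -- application-maintained bookkeeping table
    WHERE completed_at < now() - interval '30 days'
);
-- repeat for checkpoint_writes and checkpoint_blobs if present in your schema
"""
```

Run it from a scheduled job (cron, pg_cron, or a worker) rather than inline in request handlers, so a large purge never competes with checkpoint writes.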
Related patterns for production state management: HITL interrupts and durable checkpointing, conditional approval and timeout handling beyond basic interrupts, self-correcting agent loops with shared state, recovery patterns when execution fails mid-graph, the foundational LangGraph stateful workflow patterns, and the five-layer reliability stack that checkpointing fits into.
Frequently Asked Questions
What is a LangGraph checkpointer and why is MemorySaver not suitable for production?
A LangGraph checkpointer serializes the full graph state after every node execution, enabling pause/resume and crash recovery. MemorySaver stores state in process memory — any restart, pod eviction, or deployment loses every in-flight thread. Production systems require a durable backend (SqliteSaver for single-process workloads, PostgresSaver or a Redis-backed saver for multi-process deployments) so state survives across the full process lifecycle.
What belongs in LangGraph state versus an external store?
State should hold routing data, agent reasoning artifacts, and the minimum context needed for the current execution path. Binaries, large documents, database query results, and data that exceeds a few kilobytes belong in an external store (S3, a vector database, Postgres). Reference them by ID in state. The test: if removing a field from state would not change any conditional edge decision, it probably does not belong there.
How do you resume a LangGraph agent from a checkpoint after a failure?
Call graph.ainvoke() or graph.invoke() with the same thread_id in the config. LangGraph loads the most recent checkpoint for that thread and continues execution from the node that was next in the queue. No additional parameters are needed — the checkpointer handles state rehydration. For modified-state replay, use graph.update_state() to patch the checkpoint before invoking.
How do you prevent LangGraph state from growing too large and hitting context window limits?
Use a summarization node that condenses the messages list when it exceeds a token threshold, replacing older messages with a structured summary. Store raw message history externally and keep only a rolling window in state. Apply this pattern as a conditional edge that fires when len(state['messages']) > N rather than on every node execution — checkpointing an oversized state object is expensive and compounds with each node.
Taking a LangGraph System to Production?
The gap between a working graph in development and one that survives a pod restart, handles concurrent threads without state collisions, and recovers cleanly from mid-execution failures is not a documentation problem. It is an architecture problem — checkpointer selection, state schema design, retry boundary placement, and retention policy all need to be decided before you are debugging them in production.
Our AI engineering team builds and audits production LangGraph systems: state schema reviews, checkpointer migration from MemorySaver to PostgresSaver, recovery path design, and observability wiring.
Earlier in the Process?
The LangGraph Architecture Assessment covers checkpointer selection, state schema review, and recovery gap analysis in a structured session before you commit to an implementation — so the decisions that are expensive to reverse get made correctly the first time.