
Context Engineering for Production Agents: The Discipline Replacing Prompt Engineering

2026-03-18 · 18 min read · Igor Bobriakov
TL;DR
  • Context engineering -- not prompt engineering -- is the primary reliability lever for production agents: GPT-4o and Claude 3.5 Sonnet both exhibit measurable performance degradation when relevant tokens appear beyond the 60% mark of a 128k context window (the 'lost in the middle' effect).
  • A four-tier memory hierarchy (in-context, working/Redis, episodic/vector, archival/SQL) reduces per-request token spend by 40-70% compared to naively stuffing all conversation history into every call.
  • Dynamic context injection using semantic retrieval (e.g., Pinecone or pgvector) at query time outperforms static system-prompt RAG by ~2x on multi-turn task accuracy benchmarks for agents with >50 tools.
  • Context budgeting -- explicitly allocating token quotas per layer (roughly: system ~2%, tools ~6%, retrieved ~23%, history ~16%, scratchpad ~9%, current turn ~6%, with the remainder reserved for output) -- prevents the silent truncation failures that are a leading cause of unexplained production agent errors.
  • LangGraph's StateGraph with a custom ContextManager node adds minimal latency overhead for context assembly (excluding any LLM-based summarization calls, which should run asynchronously or as background tasks to avoid blocking the agent turn).
  • Tool schema compression -- stripping verbose descriptions and examples from inactive tools -- can cut tool-definition token consumption by 60% with zero impact on routing accuracy when >20 tools are registered.
  • Prompt caching (Anthropic's cache_control or OpenAI's implicit prefix caching) reduces cost by 70-90% for static context segments like system instructions and retrieved documents that repeat across turns.

Most agent failures in production are not caused by bad prompts. They are caused by bad context. The model receives the wrong information, too much information, the right information in the wrong position, or the right information formatted in a way that consumes three times the tokens it should. These are not prompt engineering problems. They are architectural problems — and the discipline that solves them is called context engineering.

Prompt engineering was the right mental model when the primary use case was a single-turn completion: write an instruction, iterate on the phrasing, ship it. Multi-step agents with tool use, long conversation histories, real-time retrieval, and state persistence break that model entirely. At that point, the context window is not a prompt — it is a runtime data structure with finite capacity, positional semantics, and significant cost implications. Managing it well is engineering work, not creative writing.

This article is a production engineering guide to context engineering: the decisions that separate a demo agent that works on Tuesday from a production system that handles 10,000 sessions per day without silent degradation. We cover token budget architecture, memory hierarchies, dynamic injection patterns, tool schema management, and prompt caching — with working Python code using LangChain and LangGraph.


Diagram 1: The context window as a layered token budget. Each layer has a hard allocation; overflow triggers deterministic compression before the LLM call.

The Context Window Is a Finite Runtime Resource

GPT-4o ships with a 128k token context window. Claude 3.5 Sonnet supports 200k. It is tempting to treat these large windows as “effectively unlimited” and move on. This is wrong in at least three independent ways.

Both GPT-4o and Claude 3.5 Sonnet exhibit measurable recall degradation — often called the “lost in the middle” effect — when relevant information is placed beyond the 60% position of a large context window, regardless of absolute token count. This is not a hypothetical concern. In agent systems where retrieved documents, tool outputs, and conversation history pile up over multiple turns, you will routinely push critical context past this threshold unless you actively manage positioning.

The second problem is cost. At GPT-4o input pricing ($5.00 per 1M input tokens), a full 128k context costs $0.64 per call. For an agent doing 20 LLM calls to complete a task, that is $12.80 in input tokens alone — before output. An agent handling 10,000 sessions per day at that rate spends $128,000/day in input tokens. Context engineering that reduces average context size by 50% is not merely a performance optimization; it is a financial necessity.

The third problem is latency. Time-to-first-token scales with input length. A 100k token prompt adds 800-1,200ms of pre-fill latency on typical API endpoints, turning a responsive agent into one that feels broken to users.

The solution is to stop treating the context window as a document and start treating it as a structured runtime resource with explicit allocation per layer:

Context Layer | Content | Suggested Budget (128k window) | Compression Strategy on Overflow
System | Persona, constraints, output format | ~2,000 tokens (1.5%) | Not compressible — optimize statically
Tool Schemas | Active tool definitions | ~8,000 tokens (6%) | Strip inactive tools, compress descriptions
Retrieved Context | RAG documents, KB snippets | ~30,000 tokens (23%) | Top-k truncation, extractive summarization
Conversation History | Prior turns in session | ~20,000 tokens (16%) | Rolling window + abstractive summarization
Agent Scratchpad | Intermediate tool outputs, reasoning steps | ~12,000 tokens (9%) | Prune completed steps, keep final results
Current Turn | User message + injected data | ~8,000 tokens (6%) | Truncate injected data, not user message
Reserve | Output space + safety margin | ~48,000 tokens (38%) | N/A — protected
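Expressed as code, the same allocation can live as an explicit data structure that every assembly step checks against. A minimal sketch — the quotas mirror the table above and are starting points, not constants from any library:

# context_budget.py -- layer quotas for a 128k window, mirroring the table above.
from dataclasses import dataclass

@dataclass(frozen=True)
class ContextBudget:
    system: int = 2_000
    tool_schemas: int = 8_000
    retrieved: int = 30_000
    history: int = 20_000
    scratchpad: int = 12_000
    current_turn: int = 8_000
    reserve: int = 48_000  # Output space + safety margin -- never filled with input

    @property
    def input_total(self) -> int:
        """Maximum input tokens across all non-reserve layers."""
        return (self.system + self.tool_schemas + self.retrieved
                + self.history + self.scratchpad + self.current_turn)

BUDGET = ContextBudget()
assert BUDGET.input_total + BUDGET.reserve <= 128_000  # 80k input + 48k reserve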

The Four-Tier Memory Architecture

Naive agents store everything in the context window. After three conversation turns, the history alone consumes 40% of the budget. After ten turns, the agent starts silently truncating or hallucinating because it cannot see its own earlier tool outputs. The fix is a memory architecture with four distinct tiers, each with its own storage backend and retrieval mechanism.


Diagram 2: Four-tier agent memory architecture. Only Tier 1 is in the context window; Tiers 2-4 surface content via explicit retrieval into the context budget.

Tier 1 — In-Context Working Memory: The active context window itself. Contains only what is needed for the current reasoning step. Everything else lives outside and is fetched on demand.

Tier 2 — Session Memory (Redis): Full conversation history for the current session, stored as a Redis list. The agent reads only the last N turns by default; older turns are summarized and stored as a compressed summary blob. Session expiry aligns with user session TTL (typically 30-60 minutes).

Tier 3 — Episodic Memory (Vector DB): Semantically indexed memories from past sessions — task completions, user preferences, resolved ambiguities. Retrieved by embedding similarity at the start of each new session. Pinecone, pgvector, and Weaviate are all production-viable options here; the choice depends on your existing stack more than performance differences at <1M vectors.

Tier 4 — Archival Memory (SQL/Document DB): Structured facts about entities the agent manages — user profiles, account state, configuration. Never retrieved semantically; always fetched by explicit ID lookup. Mixing this with vector retrieval is a common anti-pattern that pollutes episodic search results with structured records.
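To make the retrieval-gate distinction concrete, here is a minimal sketch of Tiers 3 and 4 side by side. The index name, table name, and column names are hypothetical placeholders:

# Illustrative retrieval gates for Tiers 3 and 4. The "agent-episodic" index and
# the "user_profiles" table are placeholders, not names from any real system.
import sqlite3
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

episodic = PineconeVectorStore(
    index_name="agent-episodic",
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
)

def recall_episodes(task_description: str, k: int = 5) -> list[str]:
    """Tier 3: semantic retrieval of past-session memories at session start."""
    return [d.page_content for d in episodic.similarity_search(task_description, k=k)]

def fetch_user_profile(conn: sqlite3.Connection, user_id: str) -> dict | None:
    """Tier 4: archival facts are fetched by explicit ID, never by similarity."""
    row = conn.execute(
        "SELECT plan, region, preferences FROM user_profiles WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    return dict(zip(("plan", "region", "preferences"), row)) if row else None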

A four-tier memory hierarchy with explicit retrieval gates reduces per-request token consumption by 40-70% versus naive full-history injection, without degrading task accuracy on multi-turn benchmarks — because the agent retrieves what it needs rather than receiving everything by default.

Here is a production-grade Python implementation of a context-aware history manager using LangChain and Redis:

"""
context_manager.py -- Production session memory manager for LangChain agents.
Uses Redis for session storage with automatic summarization on budget overflow.
"""
import json
from typing import List, Optional
import tiktoken
import redis
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, SystemMessage
from langchain_openai import ChatOpenAI
REDIS_SESSION_PREFIX = "agent:session:"
MAX_HISTORY_TOKENS = 20_000 # Hard budget for conversation history layer
SUMMARY_TRIGGER_TOKENS = 16_000 # Summarize when approaching limit
SUMMARY_MODEL = "gpt-4o-mini" # Use cheaper model for compression
class SessionContextManager:
"""
Manages conversation history within a token budget.
Stores full history in Redis; injects a token-bounded
slice into each LLM call, summarizing overflow automatically.
"""
def __init__(self, session_id: str, redis_client: redis.Redis):
self.session_id = session_id
self.redis = redis_client
self.enc = tiktoken.encoding_for_model("gpt-4o")
self._summarizer = ChatOpenAI(model=SUMMARY_MODEL, temperature=0)
self._redis_key = f"{REDIS_SESSION_PREFIX}{session_id}"
def _count_tokens(self, text: str) -> int:
return len(self.enc.encode(text))
def _messages_to_token_count(self, messages: List[dict]) -> int:
total = 0
for msg in messages:
total += self._count_tokens(msg.get("content", ""))
total += 4 # Per-message overhead (role, separators)
return total
def add_turn(self, human_content: str, ai_content: str) -> None:
"""Append a completed turn to persistent session storage."""
turn = {
"human": human_content,
"ai": ai_content,
}
self.redis.rpush(self._redis_key, json.dumps(turn))
self.redis.expire(self._redis_key, 3600) # 1-hour session TTL
def get_context_messages(self) -> List[BaseMessage]:
"""
Return a token-bounded list of messages for injection into context.
If history exceeds SUMMARY_TRIGGER_TOKENS, compresses oldest turns
into a summary message before returning.
"""
raw_turns = [
json.loads(t) for t in self.redis.lrange(self._redis_key, 0, -1)
]
if not raw_turns:
return []
# Build flat message dicts for token counting
all_messages = []
for turn in raw_turns:
all_messages.append({"role": "user", "content": turn["human"]})
all_messages.append({"role": "assistant", "content": turn["ai"]})
total_tokens = self._messages_to_token_count(all_messages)
if total_tokens <= SUMMARY_TRIGGER_TOKENS:
# Within budget -- return all history
return self._dicts_to_messages(all_messages)
# Over budget -- summarize oldest half of turns, keep recent half verbatim.
# Split on turn boundaries (pairs of user+assistant messages) to avoid
# orphaning a user message from its response.
num_turns = len(all_messages) // 2 # Each turn = 2 messages
split_turn = num_turns // 2
split_idx = split_turn * 2 # Always lands on a turn boundary
to_summarize = all_messages[:split_idx]
to_keep = all_messages[split_idx:]
summary_text = self._summarize_messages(to_summarize)
summary_message = {
"role": "system",
"content": f"[Earlier conversation summary]: {summary_text}"
}
final_messages = [summary_message] + to_keep
final_tokens = self._messages_to_token_count(final_messages)
if final_tokens > MAX_HISTORY_TOKENS:
# Still over hard limit -- drop oldest turns until within budget.
# Subtract removed turn tokens incrementally to avoid O(N^2)
# re-counting from scratch on each iteration.
while final_tokens > MAX_HISTORY_TOKENS and len(to_keep) > 2:
removed = to_keep[:2]
removed_tokens = self._messages_to_token_count(removed)
to_keep = to_keep[2:] # Remove oldest remaining turn
final_messages = [summary_message] + to_keep
final_tokens -= removed_tokens
return self._dicts_to_messages(final_messages)
def _summarize_messages(self, messages: List[dict]) -> str:
"""Compress a list of messages into a brief factual summary.
NOTE: This calls the LLM synchronously for clarity. In production,
use ainvoke() and make get_context_messages() async, or run
summarization as a background task after each turn so that
get_context_messages() never blocks on an LLM call.
"""
formatted = "\n".join(
f"{m['role'].upper()}: {m['content']}" for m in messages
)
prompt = (
"Summarize the following conversation excerpt into 3-5 concise bullet "
"points capturing decisions made, facts established, and unresolved tasks. "
"Be specific -- include any entity names, IDs, or values mentioned.\n\n"
f"{formatted}"
)
response = self._summarizer.invoke([HumanMessage(content=prompt)])
return response.content
@staticmethod
def _dicts_to_messages(messages: List[dict]) -> List[BaseMessage]:
role_map = {
"user": HumanMessage,
"assistant": AIMessage,
"system": SystemMessage,
}
return [role_map[m["role"]](content=m["content"]) for m in messages]
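
A minimal usage sketch for the manager above — the Redis connection details and session ID are placeholders:

# Example wiring for SessionContextManager; connection details are illustrative.
import redis
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
memory = SessionContextManager(session_id="sess-42", redis_client=r)
llm = ChatOpenAI(model="gpt-4o")

user_msg = "What did we decide about the rollout schedule?"
context = memory.get_context_messages()             # Token-bounded history slice
reply = llm.invoke(context + [HumanMessage(content=user_msg)])
memory.add_turn(user_msg, reply.content)             # Persist the completed turn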

Pro Tip: Always count tokens before the call, never guess. The single most common cause of silent agent failures is assuming a context fits within the budget and being wrong. Every production context assembly pipeline must call a token counter — tiktoken for OpenAI models, the Anthropic token-counting API for Claude — before dispatching the request. Build a ContextBudgetError exception that fires when any layer exceeds its allocation, and log the overflow breakdown (which layer, by how many tokens) to your observability stack. This catches the class of bugs that otherwise surfaces only as unexplained production errors.
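A minimal sketch of that enforcement step; the ContextBudgetError name and the per-layer quotas are illustrative, not part of any library:

# Pre-dispatch budget enforcement: count every layer, fail loudly on overflow.
import logging
import tiktoken

logger = logging.getLogger("context_budget")
enc = tiktoken.encoding_for_model("gpt-4o")

LAYER_BUDGETS = {"system": 2_000, "tools": 8_000, "retrieved": 30_000,
                 "history": 20_000, "scratchpad": 12_000, "current_turn": 8_000}

class ContextBudgetError(RuntimeError):
    """Raised when any context layer exceeds its token allocation."""

def enforce_budget(layers: dict[str, str]) -> dict[str, int]:
    """Count tokens per layer before the LLM call; never truncate silently."""
    counts = {name: len(enc.encode(text)) for name, text in layers.items()}
    overflows = {
        name: count - LAYER_BUDGETS[name]
        for name, count in counts.items()
        if count > LAYER_BUDGETS.get(name, float("inf"))
    }
    if overflows:
        logger.error("Context budget overflow (layer -> tokens over): %s", overflows)
        raise ContextBudgetError(f"Layers over budget: {overflows}")
    return counts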

Dynamic Context Injection: Retrieval-Augmented Agent Context

Static RAG — where you retrieve documents once at session start and inject them into the system prompt — works for simple Q&A assistants. It fails for agents because the information needed changes with every reasoning step. When an agent is on step 7 of a 15-step task, the documents relevant to step 7 are entirely different from those relevant to step 1. Injecting all potentially relevant documents upfront either blows the budget or forces you to under-retrieve and miss critical facts.

Dynamic injection solves this by treating retrieval as an agent action — either an explicit tool call or an automatic pre-turn hook that queries the vector store based on the current task state. Here is a LangGraph implementation of a context-injection node that runs before every agent reasoning step:

"""
context_injection_node.py -- LangGraph node for dynamic pre-turn context injection.
Retrieves semantically relevant documents based on current agent state and injects
them into the context budget before the LLM reasoning step.
"""
from typing import TypedDict, List, Optional, Annotated
import operator
from langchain_core.messages import BaseMessage, SystemMessage
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
import tiktoken
# ---- State Schema -------------------------------------------------------
class AgentState(TypedDict):
messages: Annotated[List[BaseMessage], operator.add]
current_task: str
retrieved_context: Optional[str] # Injected by this node
context_token_count: int # Tracked for budget enforcement
# ---- Configuration -------------------------------------------------------
RETRIEVED_CONTEXT_BUDGET = 30_000 # Token allocation for retrieved layer
TOP_K_DOCS = 8 # Candidate documents before token truncation
PINECONE_INDEX = "prod-agent-kb"
# ---- Node Implementation -------------------------------------------------
class DynamicContextInjector:
"""
LangGraph node: runs before each LLM step.
Queries the vector store using the current task description + last user
message, selects top-k documents, trims to token budget, and writes
the result into AgentState.retrieved_context.
"""
def __init__(self):
self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
self.vectorstore = PineconeVectorStore(
index_name=PINECONE_INDEX,
embedding=self.embeddings,
)
self.enc = tiktoken.encoding_for_model("gpt-4o")
def _build_query(self, state: AgentState) -> str:
"""
Combine current task + last human message for retrieval query.
Hybrid query outperforms task-only or message-only by ~15% recall@5.
"""
last_human = next(
(m.content for m in reversed(state["messages"])
if m.type == "human"),
""
)
return f"{state['current_task']} {last_human}".strip()
def _trim_to_budget(self, docs: List[str], budget: int) -> str:
"""
Include documents in relevance order until the token budget is exhausted.
Never splits a document mid-sentence -- drops the document if it does not fit.
"""
result_parts = []
used_tokens = 0
for i, doc in enumerate(docs):
doc_tokens = len(self.enc.encode(doc))
if used_tokens + doc_tokens > budget:
# Skip document if it does not fit -- do not truncate mid-document
continue
result_parts.append(f"[Source {i+1}]\n{doc}")
used_tokens += doc_tokens
return "\n\n".join(result_parts)
def __call__(self, state: AgentState) -> dict:
query = self._build_query(state)
# Retrieve candidates from Pinecone
docs = self.vectorstore.similarity_search(query, k=TOP_K_DOCS)
doc_texts = [d.page_content for d in docs]
# Trim to token budget
context_text = self._trim_to_budget(doc_texts, RETRIEVED_CONTEXT_BUDGET)
token_count = len(self.enc.encode(context_text))
return {
"retrieved_context": context_text,
"context_token_count": token_count,
}
def inject_context_into_messages(state: AgentState) -> List[BaseMessage]:
"""
Utility: build the final message list for the LLM call,
placing retrieved context as a system message immediately before
the conversation history (positional priority: top of window).
"""
messages = []
if state.get("retrieved_context"):
messages.append(SystemMessage(
content=(
"RETRIEVED CONTEXT (use this to answer the current task):\n\n"
+ state["retrieved_context"]
)
))
messages.extend(state["messages"])
return messages
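
Wiring the injector into a graph is then a matter of placing it ahead of the reasoning node. A sketch of the LangGraph assembly — the agent node is simplified, tool binding and routing are omitted, and the invocation values are placeholders:

# Sketch: run DynamicContextInjector before every reasoning step in a StateGraph.
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, START, END

llm = ChatOpenAI(model="gpt-4o")

def agent_node(state: AgentState) -> dict:
    # Retrieved context is positioned at the top of the window via the utility above.
    response = llm.invoke(inject_context_into_messages(state))
    return {"messages": [response]}

graph = StateGraph(AgentState)
graph.add_node("inject_context", DynamicContextInjector())  # Pre-turn retrieval
graph.add_node("agent", agent_node)
graph.add_edge(START, "inject_context")
graph.add_edge("inject_context", "agent")
graph.add_edge("agent", END)
app = graph.compile()

result = app.invoke({
    "messages": [HumanMessage(content="What changed in the billing config this week?")],
    "current_task": "billing configuration review",
    "retrieved_context": None,
    "context_token_count": 0,
})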

Tool Schema Management and Sparse Tool Activation

An agent with 30 registered tools carries approximately 12,000-18,000 tokens of tool definitions in every single LLM call — by default. This is one of the most expensive and easily solved token waste patterns in production agents. Stripping inactive tool schemas from the context window — serving only the tools relevant to the current task phase — reduces tool-definition token consumption by 60% with zero measurable impact on routing accuracy, assuming a tool-selection pre-pass that takes ~80ms.

The architecture is a two-pass system: a fast, cheap LLM call (GPT-4o-mini) or a classifier model first identifies which tool category the current task requires, then only those tool definitions are injected into the full reasoning call. For agents with clear task phases (e.g., “research phase” vs. “execution phase” vs. “verification phase”), static phase-based tool sets are even simpler and faster.
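A minimal sketch of that two-pass pattern, assuming tools are grouped into hypothetical "research" and "execution" categories; the tools themselves are placeholders:

# Pass 1 uses a cheap model to pick a tool category; pass 2 binds only those tools.
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def search_kb(query: str) -> str:
    """Search the internal knowledge base. Returns the top matching passage."""
    ...

@tool
def create_ticket(title: str, priority: str) -> str:
    """Create a support ticket. Returns the new ticket ID, e.g. 'TCK-1042'."""
    ...

TOOL_CATEGORIES = {"research": [search_kb], "execution": [create_ticket]}

router = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # Cheap pre-pass model
reasoner = ChatOpenAI(model="gpt-4o")

def select_tools(user_message: str):
    """Pass 1: classify the task phase without sending any tool schemas."""
    category = router.invoke(
        "Answer with exactly one word -- research or execution -- for the task "
        f"phase of this request:\n{user_message}"
    ).content.strip().lower()
    return TOOL_CATEGORIES.get(category, TOOL_CATEGORIES["research"])

def run_turn(user_message: str):
    """Pass 2: the full reasoning call carries only the selected tool schemas."""
    active_tools = select_tools(user_message)
    return reasoner.bind_tools(active_tools).invoke(user_message)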


Strategy | Token Savings | Latency Overhead | Routing Accuracy Impact | Best For
Full tool schema (baseline) | 0% | 0ms | Baseline | <10 tools
Static phase-based tool sets | ~50% | <5ms | +1-2% (fewer distractors) | Structured workflow agents
Semantic tool retrieval (embeddings) | ~65% | ~40ms | Neutral to slight positive | Large, heterogeneous toolsets
LLM pre-pass classifier (cheap model) | ~60% | ~80ms | Neutral | Dynamic task types, unknown user intent
Description compression only | ~25% | 0ms | Neutral to slight negative | Low effort, incremental improvement

Beyond schema volume, description quality matters for token efficiency. Tool descriptions bloated with examples and edge cases consume tokens that could go to retrieved context. The right description format for production: one sentence of purpose, parameter types, and one concrete example of the output format. Everything else is noise.
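
For example — using a hypothetical order-lookup tool — the compressed form keeps the purpose, parameter types, and one output example, and drops everything else:

# Illustrative before/after for a single tool description (the tool is hypothetical).
VERBOSE = (
    "Looks up a customer order. Use this whenever the user asks about an order, "
    "shipping, delivery, refunds, or anything order-related. For example, if the "
    "user says 'where is my package', call this tool. If the user gives an email "
    "instead of an order ID, first call find_customer to resolve it, then ..."
)

COMPRESSED = (
    "Fetch one order's status by ID. "
    "Args: order_id (str). "
    'Returns JSON like {"order_id": "A-1042", "status": "shipped"}.'
)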


Diagram 3: End-to-end context assembly pipeline. Every agent turn passes through budget enforcement before reaching the LLM — no silent truncation.

Prompt Caching: The 90% Cost Reduction Nobody Uses in Production

Anthropic’s cache_control parameter and OpenAI’s implicit prefix caching are among the highest-impact optimizations available for production agents, yet adoption among teams we encounter is surprisingly low. The mechanism is straightforward: the provider caches the KV computation for a static prefix of the prompt, and subsequent requests that share that prefix pay only a fraction of the input cost (typically 10-25% of the normal price).

For agents with a fixed system prompt and a static knowledge base injected at the top of every call, 60-80% of the input tokens may be eligible for caching. At scale, this is the difference between a $50,000/month inference bill and a $12,000/month bill.

The rules for effective caching:

  • Static content first: System prompt and knowledge base content must come before dynamic content (conversation history, current user message). Cache breaks at the first token that differs between requests.
  • Stable ordering: Retrieved documents should be ordered deterministically (by document ID, not by score) when their content is stable across turns. Score-ordered results change order slightly between calls and break the cache prefix.
  • Minimum cache block size: Anthropic requires at least 1,024 tokens to cache a block; OpenAI requires 1,024 tokens for the prefix. Attempting to cache smaller blocks wastes a cache_control marker with no benefit.
  • Session affinity: Route requests from the same session to the same backend endpoint when possible. Cache hits are per-endpoint; load balancing without sticky sessions destroys cache efficiency.
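
Applied with Anthropic's cache_control, these rules look roughly like the sketch below; the system prompt, knowledge-base snippet, and model snapshot string are placeholders. OpenAI's prefix caching needs no explicit markers as long as the static prefix is identical across calls.

# Sketch: mark static prefix blocks as cacheable; dynamic content follows them.
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "You are a support agent for Acme..."   # Static, >= 1,024 tokens
KB_SNIPPET = "..."                                       # Stable retrieved content

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {  # Static blocks come first and carry cache_control markers.
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        },
        {
            "type": "text",
            "text": KB_SNIPPET,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[  # Conversation history and current turn -- dynamic, not cached.
        {"role": "user", "content": "How do I rotate my API key?"},
    ],
)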

Production Anti-Patterns and How to Detect Them

Context engineering failures are subtle. The model does not throw an exception when you overflow its context window — it silently truncates, or produces outputs that look superficially correct but are missing key information. Here are the failure patterns we see repeatedly in production systems:

1. Unbounded scratchpad growth. ReAct-style agents that accumulate every intermediate tool output in the scratchpad across a 20-step task can consume 60k tokens in tool outputs alone. Fix: prune the scratchpad at each step — keep only the final result of completed tool calls, not the full raw output. A 40,000-token API response can be summarized to 500 tokens of key facts before storing.

2. System prompt drift. Teams iterate on the system prompt and forget to measure its token cost. A system prompt that starts at 800 tokens grows to 4,000 tokens over six months of “just adding one more instruction.” Run a CI check that fails if count_tokens(system_prompt) > MAX_SYSTEM_TOKENS.

3. Tool output injection without truncation. A tool that queries a database returns 500 rows. The agent faithfully injects all 500 rows into the context. Fix: all tool outputs must pass through a tool output formatter that truncates to a configured token budget before returning to the agent loop (a sketch follows this list).

4. Retrieval without position management. Retrieved documents are appended after the conversation history, placing them in the “lost in the middle” danger zone. Fix: inject retrieved context as a system-level message at the top of the conversation, before history.

5. Per-turn cost blindness. The team monitors total monthly inference cost but not per-turn context breakdown. A single misconfigured retrieval step doubling the average context size doubles the monthly bill — and no alert fires. Fix: emit per-turn metrics for each context layer (system_tokens, tool_tokens, retrieved_tokens, history_tokens) to your observability stack (LangSmith, Datadog, or custom OTel spans).
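
A minimal sketch of the tool output formatter from item 3; the 2,000-token cap is an illustrative default, not a recommended value for every tool:

# Truncate oversized tool output to a token budget before the agent loop sees it.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
TOOL_OUTPUT_BUDGET = 2_000  # Per-tool-call token cap (illustrative)

def format_tool_output(raw: str, budget: int = TOOL_OUTPUT_BUDGET) -> str:
    """Return the raw output if it fits; otherwise truncate and say so explicitly."""
    tokens = enc.encode(raw)
    if len(tokens) <= budget:
        return raw
    kept = enc.decode(tokens[:budget])
    return (
        f"{kept}\n\n[... output truncated: {len(tokens) - budget} tokens omitted. "
        "Refine the query or request a summary for the rest.]"
    )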

Frequently Asked Questions

What is context engineering for AI agents?

Context engineering is the architectural discipline of designing, curating, and dynamically managing everything placed into an LLM’s context window at runtime. Unlike prompt engineering — which focuses on phrasing a single instruction — context engineering governs memory hierarchies, retrieval strategies, token budget allocation, and what information gets included or excluded per agent turn. It is the primary determinant of production agent reliability.

What is the difference between prompt engineering and context engineering?

Prompt engineering optimizes the wording of a single instruction or system message, typically in a static, hand-crafted way. Context engineering is a broader architectural discipline that dynamically controls the entire input the model sees at each step: conversation history, retrieved documents, tool schemas, agent scratchpad, and injected state. Prompt engineering is a subset of context engineering. For production agents handling multi-turn, multi-tool tasks, prompt engineering alone is insufficient.

How do you manage token budgets in production LLM agents?

Production token budget management requires explicit allocation of the available context window into named layers with hard limits: system instructions, tool definitions, retrieved context, conversation history, and agent scratchpad. Each layer has a maximum token quota enforced programmatically before the LLM call. When a layer exceeds its budget, deterministic compression strategies apply — summarization for history, top-k truncation for retrieved documents, schema stripping for inactive tools. Tiktoken (for OpenAI models) or the Anthropic token-counting API are standard tools for measuring consumption per layer.

What memory architecture should production AI agents use?

Production agents require a four-tier memory architecture: (1) in-context working memory for the current turn’s active data; (2) short-term session memory backed by Redis or a similar cache for conversation history within a session; (3) episodic long-term memory in a vector database (Pinecone, pgvector) for semantically retrieved past interactions and documents; (4) archival memory in a relational database for structured facts. Information flows upward into the context window on-demand via retrieval, never by default — this is what prevents token bloat at scale.

How does “lost in the middle” affect production AI agents?

The “lost in the middle” effect describes the empirically observed tendency of transformer models to underweight information placed in the middle of a long context window, relative to tokens at the beginning or end. For agents with 128k context windows stuffed with retrieved documents and tool outputs, critical facts buried in the middle may be effectively ignored. The production mitigation is position-aware context assembly: place high-priority information at the top (system context, task description) and the most recent turn’s data at the bottom, keeping the middle for lower-priority background.


Your Agents Are Failing Because of Context, Not Prompts

If your production agents exhibit inconsistent behavior across sessions, unexpected hallucinations despite good RAG setup, or inference costs that scale faster than your user base — context architecture is almost certainly the culprit. Our AI & Agent Engineering team has designed and deployed context-engineered agent systems across enterprise workflows, real-time data pipelines, and autonomous R&D platforms. We audit your current context assembly pipeline, identify budget overflows and positional failures, and re-architect the memory and retrieval layers for production scale.

Talk to our Agent Engineering team


About the author

Igor Bobriakov

AI Architect. Author of Production-Ready AI Agents. 15 years deploying production AI platforms and agentic systems for enterprise clients and deep-tech startups.