CrewAI multi-agent workflows fail in three distinct layers. Most debugging sessions stall because teams treat all failures as the same problem — a prompt to fix, a model to swap — instead of diagnosing which layer broke first.
The three layers are: agent failures (a single agent produced wrong output), orchestration failures (delegation routing misfired), and tool failures (a callable the agent invoked returned an error or wrong result). Each has different symptoms, different log signatures, and different fixes. Conflating them leads to debugging loops that change the wrong variable.
This is the diagnostic framework we use when a CrewAI workflow starts failing in ways that verbose output alone cannot explain.
| Symptom | Failure Layer | First Diagnostic Step |
|---|---|---|
| Agent returns empty or off-topic output for a known task | Agent failure | Check the task expected_output and whether the agent's tools match the task scope |
| Workflow runs longer than expected, costs climb, no result | Orchestration — delegation loop | Check manager stop condition and delegation depth counter in callback logs |
| Specialist agent attempts work outside its role description | Orchestration — role confusion | Inspect the manager's delegation message — it likely passed underspecified task context |
| Consistent failure at the same workflow step with an exception trace | Tool failure | Check tool call arguments logged by the callback handler — wrong argument type is the most common cause |
| Different agents produce different answers for the same sub-task across runs | Agent or orchestration | Check whether the same task is being routed to different agents; compare task descriptions in trace logs |
| Final output looks reasonable but upstream specialist output was wrong | Orchestration — context laundering | Reconstruct the delegation chain and check what the manager passed to synthesis; intermediate errors often vanish in the final output |
| System works in development but fails intermittently in production | Tool or agent | Check whether external tools (search, database, API) have different latency or rate limit behavior in production; log tool response time per call |
The Debugging Taxonomy: Three Failure Layers
Understanding which layer broke is not academic. The fix for a tool failure is different from the fix for a delegation loop, and applying the wrong remedy costs time and often masks the real issue.
Agent failures are isolated. One agent, given its assigned task and tool access, produces output that is wrong, incomplete, or misformatted. The failure does not require another agent to have misbehaved. Common causes: the task expected_output is too vague, the agent’s context window is too narrow, a required tool is missing from the agent’s tool list, or the model handling that role is not suited for the task type.
Orchestration failures are systemic. The delegation logic itself misfired: the wrong specialist received a task, a task was delegated when it should have been executed directly, a delegation loop has no exit condition, or context was corrupted during handoff. These failures often look like agent failures at first because the symptom appears at the specialist level — but the specialist is not the cause.
Tool failures are boundary failures. An agent called a tool correctly from a delegation standpoint, but the tool returned an error, timed out, returned a type the agent was not prepared for, or was not accessible from the delegated agent’s permission context.
Delegation Loop Failures: Why They Are Hard to Catch
Delegation loops are the most expensive CrewAI failure mode because they are not errors. The workflow continues running, consuming tokens, calling models, and returning intermediate outputs that look like progress. The signal that something is wrong is usually a bill, not an exception.
A delegation loop occurs when:
- the manager agent has no explicit stop condition beyond the model’s own judgment
- the expected_output for the root task is too vague for any specialist output to satisfy
- a specialist returns low-confidence output that the manager repeatedly judges as insufficient
- max_iter is set too high or not set at all, so the workflow keeps retrying
The standard CrewAI max_iter parameter is your first guard, but it operates at the agent level, not the workflow level. When an agent exhausts max_iter, CrewAI forces it to wrap up with a best-effort final answer rather than halting the workflow, and if multiple agents are looping in coordination, the per-agent limits may not trip before the cost damage is done.
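For reference, the per-agent guard is a one-line setting on the Agent; the role, goal, and backstory values in this sketch are illustrative:

```python
# max_iter and allow_delegation are standard CrewAI Agent parameters;
# the role/goal/backstory values are illustrative only.
from crewai import Agent

manager = Agent(
    role="Workflow Manager",
    goal="Decompose the research task and delegate to specialists",
    backstory="Coordinates specialists; does not do the research itself.",
    allow_delegation=True,
    max_iter=5,  # per-agent cap; does not bound total workflow delegations
)
```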
The structural fix is to track delegation depth explicitly in a callback handler and enforce a hard workflow-level limit. The Python example below shows how.
Structured Logging and Trace Correlation
verbose=True is useful during development. It is not a production debugging tool. It prints events to stdout in a format optimized for human reading during an active session, not for retrospective incident analysis or cross-run correlation.
Production debugging requires:
- a trace_id generated once per workflow run and attached to every event
- a span_id per agent execution, with a parent_span_id encoding the delegation chain
- structured log records that are machine-queryable after the fact
This is the Pydantic model we use for CrewAI delegation tracing:
```python
from __future__ import annotations

import logging
import time
import uuid
from typing import Any, Optional

from pydantic import BaseModel, Field


# --- Trace data models ---

class ToolCallRecord(BaseModel):
    tool_name: str
    arguments: dict[str, Any]
    response: Optional[str] = None
    error: Optional[str] = None
    latency_ms: float = 0.0


class DelegationEvent(BaseModel):
    trace_id: str
    span_id: str
    parent_span_id: Optional[str] = None
    delegating_agent: Optional[str] = None
    receiving_agent: str
    task_description: str
    depth: int = 0
    tool_calls: list[ToolCallRecord] = Field(default_factory=list)
    output: Optional[str] = None
    failure_layer: Optional[str] = None  # "agent" | "orchestration" | "tool"
    failure_reason: Optional[str] = None
    duration_ms: float = 0.0
    timestamp: float = Field(default_factory=time.time)


class WorkflowTrace(BaseModel):
    trace_id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    workflow_name: str
    events: list[DelegationEvent] = Field(default_factory=list)
    total_delegations: int = 0
    max_depth_reached: int = 0
    failure_layer: Optional[str] = None
    total_duration_ms: float = 0.0

    def record_event(self, event: DelegationEvent) -> None:
        self.events.append(event)
        self.total_delegations += 1
        if event.depth > self.max_depth_reached:
            self.max_depth_reached = event.depth

    def to_log_record(self) -> dict[str, Any]:
        return {
            "trace_id": self.trace_id,
            "workflow": self.workflow_name,
            "total_delegations": self.total_delegations,
            "max_depth": self.max_depth_reached,
            "failure_layer": self.failure_layer,
            "duration_ms": self.total_duration_ms,
            "events": [e.model_dump() for e in self.events],
        }


# --- Callback handler ---

class CrewAIDebugHandler:
    """
    Attach to a CrewAI Crew via step_callback and task_callback.
    Captures per-agent execution events and emits structured JSON logs.
    """

    MAX_DELEGATION_DEPTH = 6  # hard workflow-level guard

    def __init__(self, workflow_name: str):
        self.trace = WorkflowTrace(workflow_name=workflow_name)
        self._active_spans: dict[str, DelegationEvent] = {}
        self._depth_counter: dict[str, int] = {}
        self._logger = logging.getLogger("crewai.debug")

    def on_agent_start(
        self,
        agent_role: str,
        task_description: str,
        delegating_agent: Optional[str] = None,
        parent_span_id: Optional[str] = None,
    ) -> str:
        """Returns span_id for this agent execution."""
        depth = self._depth_counter.get(agent_role, 0)

        if depth > self.MAX_DELEGATION_DEPTH:
            raise RuntimeError(
                f"Delegation depth limit {self.MAX_DELEGATION_DEPTH} exceeded "
                f"for agent '{agent_role}'. Likely orchestration loop. "
                f"trace_id={self.trace.trace_id}"
            )

        span_id = str(uuid.uuid4())
        event = DelegationEvent(
            trace_id=self.trace.trace_id,
            span_id=span_id,
            parent_span_id=parent_span_id,
            delegating_agent=delegating_agent,
            receiving_agent=agent_role,
            task_description=task_description,
            depth=depth,
        )
        self._active_spans[span_id] = event
        self._depth_counter[agent_role] = depth + 1
        return span_id

    def on_tool_call(
        self,
        span_id: str,
        tool_name: str,
        arguments: dict[str, Any],
        response: Optional[str] = None,
        error: Optional[str] = None,
        latency_ms: float = 0.0,
    ) -> None:
        if span_id not in self._active_spans:
            return
        record = ToolCallRecord(
            tool_name=tool_name,
            arguments=arguments,
            response=response,
            error=error,
            latency_ms=latency_ms,
        )
        self._active_spans[span_id].tool_calls.append(record)

        # Classify tool failures immediately
        if error:
            self._active_spans[span_id].failure_layer = "tool"
            self._active_spans[span_id].failure_reason = (
                f"Tool '{tool_name}' returned error: {error[:200]}"
            )

    def on_agent_end(
        self,
        span_id: str,
        output: Optional[str],
        duration_ms: float = 0.0,
    ) -> None:
        if span_id not in self._active_spans:
            return
        event = self._active_spans.pop(span_id)
        event.output = output
        event.duration_ms = duration_ms

        # Classify agent-layer failure: output absent or suspiciously short
        if not output or len(output.strip()) < 20:
            if event.failure_layer is None:
                event.failure_layer = "agent"
                event.failure_reason = "Agent returned empty or near-empty output"

        self.trace.record_event(event)
        self._logger.info(
            "agent_span_complete",
            extra={"span": event.model_dump()},
        )

    def emit_trace(self) -> dict[str, Any]:
        record = self.trace.to_log_record()
        self._logger.info("workflow_trace_complete", extra={"trace": record})
        return record
```

The failure_layer field is set during execution, not post-hoc. That matters: when a tool error fires inside a delegated agent, the span immediately carries failure_layer="tool" — so even if the manager synthesizes a plausible final answer, the trace shows where the failure originated.
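Wiring the handler into a crew depends on your CrewAI version, since callback payloads have shifted across releases. The sketch below is a minimal integration that records one span per completed task through task_callback; the workflow name and agent definitions are illustrative, and the getattr guards reflect that the exact TaskOutput fields may vary.

```python
# Minimal wiring sketch, assuming the CrewAIDebugHandler defined above.
# Crew accepts task_callback (and step_callback); the TaskOutput fields
# accessed here (.agent, .description, .raw) exist in recent CrewAI
# versions but are guarded with getattr in case of version drift.
from crewai import Agent, Crew, Task

handler = CrewAIDebugHandler(workflow_name="research_pipeline")

analyst = Agent(
    role="Research Analyst",
    goal="Identify technical risks in source documents",
    backstory="A focused analyst who stays within the assigned scope.",
)

risk_task = Task(
    description="Identify the three main technical risks in the architecture document.",
    expected_output="A numbered list of exactly three risks, one sentence each.",
    agent=analyst,
)

def on_task_done(task_output) -> None:
    # Records one span per completed task; a real-time integration would
    # open the span when the task starts instead.
    span_id = handler.on_agent_start(
        agent_role=str(getattr(task_output, "agent", "unknown")),
        task_description=getattr(task_output, "description", ""),
    )
    handler.on_agent_end(span_id, output=getattr(task_output, "raw", None))

crew = Crew(agents=[analyst], tasks=[risk_task], task_callback=on_task_done)
crew.kickoff()
trace_record = handler.emit_trace()
```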
Role Confusion: How Orchestration Overwrites Agent Scope
Role confusion is an orchestration failure, not an agent failure. The specialist agent executes correctly according to the task it received — but the task it received was wrong.
This happens most often when:
- the manager agent passes an underspecified task description to a specialist (e.g., “Analyze this” rather than “Identify the three technical risks in this architecture document”)
- the specialist’s goal and backstory are broad enough that it attempts to fill the scope gap rather than reject the task
- allow_delegation=True on specialists lets them forward work they cannot handle, creating recursive scope drift
The diagnostic signal in your trace is a mismatch between the specialist’s role field and the task_description in the delegation event. If you are logging structured DelegationEvent records, this mismatch is queryable:
```python
# Post-run analysis: find role-task mismatches
def find_role_confusion(trace: WorkflowTrace) -> list[DelegationEvent]:
    suspicious = []
    stopwords = {"the", "a", "an", "and", "of", "to"}
    for event in trace.events:
        role_keywords = set(event.receiving_agent.lower().split())
        task_words = set(event.task_description.lower().split())
        # If the task description shares no significant terms with the agent
        # role name, flag it for manual inspection
        overlap = role_keywords & (task_words - stopwords)
        if not overlap and len(event.task_description) > 30:
            suspicious.append(event)
    return suspicious
```

This is a heuristic, not a complete fix. The real fix is making task descriptions in delegation messages explicit: what the specialist should produce, in what format, and what it should not attempt.
Tool Access Failures During Delegated Tasks
Tool failures in delegated tasks have a specific signature: the failure is not because the tool is broken, but because the tool is not available to the delegated agent or was called with wrong arguments constructed from a weak context handoff.
Three patterns account for most delegated tool failures:
Permission mismatch. The manager can access a tool; the specialist cannot. CrewAI does not automatically propagate tool access. If a specialist needs a database query tool, it must be explicitly included in that agent’s tools list.
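A minimal sketch of the explicit grant, using the @tool decorator from recent CrewAI releases (older versions ship it in the separate crewai_tools package); the query function is a stub for illustration:

```python
# Tool access is per-agent in CrewAI and is not inherited from the manager.
# query_sales_db is a stub standing in for your real tool implementation.
from crewai import Agent
from crewai.tools import tool

@tool("Sales DB query")
def query_sales_db(date: str) -> str:
    """Return total sales for an ISO date (YYYY-MM-DD). Stub for illustration."""
    return f"sales for {date}: 0"

specialist = Agent(
    role="Data Analyst",
    goal="Answer quantitative questions from the sales database",
    backstory="Runs database-backed queries; no web access.",
    tools=[query_sales_db],  # must be listed explicitly for the specialist
)
```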
Argument type errors. The manager passes a task description that includes a value the specialist extracts and passes to a tool — but the extraction is imprecise. A date string formatted as "May 11" when the tool expects "2026-05-11" fails silently if the tool returns an empty result rather than an exception.
Stale context. The specialist operates on context that was current when the manager composed the task but has since changed — particularly relevant in workflows that run over multiple minutes with external data sources.
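For the argument-type pattern in particular, one mitigation is validating extracted arguments at the tool boundary so a weak handoff fails loudly instead of silently. A minimal sketch in plain Python; the wrapped callable and the expected date format are assumptions for illustration:

```python
# Argument validation at the tool boundary: imprecise extractions like
# "May 11" fail loudly instead of producing a silently empty result.
# The wrapped callable and date format are illustrative assumptions.
from datetime import datetime
from typing import Any, Callable

def with_date_validation(
    tool_fn: Callable[..., Any],
    date_format: str = "%Y-%m-%d",
) -> Callable[..., Any]:
    def wrapper(date: str, **kwargs: Any) -> Any:
        try:
            datetime.strptime(date, date_format)
        except ValueError as exc:
            # Raising here lands in the trace as a tool-layer failure
            # instead of being silently absorbed by the agent.
            raise ValueError(
                f"Tool argument 'date' must match {date_format}, got {date!r}"
            ) from exc
        return tool_fn(date=date, **kwargs)
    return wrapper
```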
Production Debugging Infrastructure
The callback-based approach above is sufficient for most CrewAI deployments. For production systems processing significant workflow volume, three additional infrastructure decisions matter.
Log routing. The emit_trace() call should route to your existing structured log store, not just a file. If your stack uses Datadog, Grafana Loki, or CloudWatch Logs Insights, the WorkflowTrace.to_log_record() output is already a queryable JSON document. Index on trace_id and failure_layer as primary keys.
Delegation chain visualization. Once you have structured DelegationEvent records with span_id and parent_span_id, you have the data to build a delegation tree for any failing run. The parent-child span relationship encodes the full call graph — the same model used in distributed tracing. Tools like Jaeger or Zipkin can ingest this data directly if you emit events in OpenTelemetry format, though a simple recursive tree-printer over the JSON is often enough for debugging.
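A minimal sketch of such a tree-printer, built on the WorkflowTrace and DelegationEvent models defined earlier:

```python
# Recursive delegation-tree printer over the WorkflowTrace defined earlier.
from typing import Optional

def print_delegation_tree(trace: WorkflowTrace) -> None:
    # Group events by parent span to recover the call graph
    children: dict[Optional[str], list[DelegationEvent]] = {}
    for event in trace.events:
        children.setdefault(event.parent_span_id, []).append(event)

    def walk(parent_span: Optional[str], indent: int) -> None:
        for event in sorted(children.get(parent_span, []), key=lambda e: e.timestamp):
            marker = f" [{event.failure_layer}]" if event.failure_layer else ""
            print("  " * indent + f"{event.receiving_agent}{marker}: "
                  f"{event.task_description[:60]}")
            walk(event.span_id, indent + 1)

    walk(None, 0)  # root events have no parent_span_id
```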
Token attribution per agent. CrewAI’s built-in usage_metrics gives you aggregate token counts at the crew level. Per-agent token attribution requires wrapping model calls in the callback layer. Tracking cost per agent per run identifies the agents creating the most expense — usually manager agents in delegation loops, or retrieval-heavy specialists that call the same tool multiple times per execution.
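CrewAI does not expose this attribution out of the box, so the sketch below shows only the bookkeeping half: an accumulator keyed by agent role, which you would feed from whatever usage metadata your model client returns.

```python
# Per-agent token bookkeeping only; how you obtain the counts depends on
# your model client (e.g., usage fields on LLM responses).
from collections import defaultdict

class TokenLedger:
    def __init__(self) -> None:
        self._tokens: dict[str, int] = defaultdict(int)

    def add(self, agent_role: str, prompt_tokens: int, completion_tokens: int) -> None:
        self._tokens[agent_role] += prompt_tokens + completion_tokens

    def top_spenders(self, n: int = 3) -> list[tuple[str, int]]:
        # Highest-cost agents first; useful for spotting looping managers
        return sorted(self._tokens.items(), key=lambda kv: kv[1], reverse=True)[:n]
```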
For systems connected to the broader production observability stack, see the observability data model for production AI for how delegation trace records fit into the wider event schema.
Reading the Delegation Chain
A delegation chain is the ordered sequence of agent executions that produced the final output: which agent delegated to which, in what order, with what task descriptions at each handoff. Without it, you cannot answer the question that every production debugging session eventually asks: “Which agent’s output caused the final result to go wrong?”
Reading the chain from structured logs:
- Find all DelegationEvent records sharing the same trace_id.
- Sort by timestamp ascending.
- Reconstruct the tree using parent_span_id links.
- Look for events where failure_layer is non-null — those are the failure origin points.
- Check whether the failure propagated up the chain (the parent span’s output was also wrong) or was absorbed silently (the parent span synthesized a plausible answer from a broken input); the sketch after this list automates these last two checks.
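A sketch of those last two steps over the models defined earlier; the propagation flag is a heuristic, since a clean parent span above a failed child can also mean the parent legitimately recovered:

```python
# Find failure origin spans and flag whether each parent span carried the
# failure upward or sits clean above a failed child (absorption candidate).
from typing import Any, Optional

def classify_propagation(trace: WorkflowTrace) -> list[dict[str, Any]]:
    by_span: dict[str, DelegationEvent] = {e.span_id: e for e in trace.events}
    findings: list[dict[str, Any]] = []
    for event in sorted(trace.events, key=lambda e: e.timestamp):
        if event.failure_layer is None:
            continue
        parent: Optional[DelegationEvent] = by_span.get(event.parent_span_id or "")
        findings.append({
            "origin_span": event.span_id,
            "origin_agent": event.receiving_agent,
            "failure_layer": event.failure_layer,
            "parent_agent": parent.receiving_agent if parent else None,
            # A parent that also carries a failure_layer propagated the error;
            # a clean parent over a failed child suggests silent absorption.
            "propagated": bool(parent and parent.failure_layer is not None),
        })
    return findings
```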
Silent propagation — where the manager synthesizes plausible output from a failed specialist — is the hardest failure mode to catch in production. It passes every surface check and only manifests as downstream business logic errors or user-reported quality degradation. The only reliable detection is quality signals at the span level, not just at the workflow output level.
For teams evaluating whether their current CrewAI setup has the right observability foundation, the production readiness checklist for CrewAI and multi-agent systems covers observability as a readiness gate alongside orchestration and tool safety.
When Debugging Points to Architecture, Not Code
Sometimes the diagnostic process surfaces a structural problem that code changes cannot fix. A delegation loop that keeps re-firing is often a sign that the task decomposition is wrong, not that the stop condition needs tuning. Role confusion that persists after prompt improvements often means the agent roster has overlapping responsibilities that should be collapsed or re-scoped.
The clearest signal that debugging has reached an architecture limit:
- the same failure occurs across different prompt variations and model configurations
- fixing the failure in one workflow step moves it to another
- the delegation chain shows agents doing work that could be deterministic
That last point connects to the broader question of whether hierarchy is justified at all. Debugging friction is often diagnostic. If tracing the delegation chain reveals that the manager is mostly rephrasing the original task rather than making real decomposition decisions, the architecture may be over-delegated — a pattern we covered in depth in the hierarchical agents guide.
- Classify failures as agent, orchestration, or tool before changing any code — misidentifying the layer wastes debugging cycles.
- Attach a trace_id per workflow run and a span_id per agent execution; log both with every event for cross-run correlation.
- Log tool call arguments and responses — not just tool names — in the callback handler; argument-level errors are otherwise invisible.
- Set a delegation depth counter enforced at the workflow level, not just max_iter at the agent level.
- Check parent_span_id chains when the final output looks plausible but upstream quality is unknown — silent propagation is the hardest failure to detect from output alone.
- When the same failure survives prompt fixes and model swaps, inspect the delegation chain for structural overlap or over-delegation before changing more code.
FAQ
What causes infinite delegation loops in CrewAI?
Infinite delegation loops typically occur when a manager agent has no explicit stop condition, when task expected_output is too vague to satisfy, or when a specialist agent returns output that the manager judges insufficient by the same criteria every time. The fix is to encode a stop condition outside the prompt — either a max_iter guard, a structured output validator, or a delegation depth counter in your callback handler.
How do I tell the difference between an agent failure and an orchestration failure in CrewAI?
An agent failure is isolated: one agent produces wrong or empty output for a specific task. An orchestration failure propagates: the wrong agent runs, delegation fires when it should not, or the task sequence diverges from intent. Agent failures show up as bad tool call arguments or off-scope outputs. Orchestration failures show up as unexpected delegation patterns in your trace — agents running tasks they were not designed for, or the same task running multiple times.
What is the minimum logging setup for a debuggable CrewAI system?
At minimum, log a trace_id per workflow run, a span_id per agent execution, the delegating agent's name, the receiving agent's name, the task description, tool calls with arguments and responses, and the final task output. Without the tool call arguments, failures at tool boundaries are invisible. Without the delegation chain, orchestration failures are unattributable.
How do callback handlers improve CrewAI debugging compared to verbose=True?
verbose=True prints execution events to stdout in a human-readable format useful for development. Callback handlers give you structured, machine-queryable events you can route to a log store, correlate across runs with a shared trace_id, and query retrospectively during an incident. verbose=True does not persist; callback handlers do.
When the Workflow Is Telling You Something Is Wrong
The debugging patterns above catch failures that are already happening. The harder skill is reading the delegation chain and recognizing failure preconditions before they show up as incidents: managers that are busier than their specialists, tool call argument errors that return empty results instead of exceptions, specialists whose role fields bear no relationship to the tasks they actually receive.
For teams whose CrewAI workflows are already in production and producing results that are harder to audit than expected, what agent observability should trigger a production audit covers the signals that indicate a structured review is warranted.
Diagnose Your CrewAI Workflow Before the Next Incident
If your multi-agent system is producing results that are hard to attribute, running up token costs without clear explanation, or failing in ways that verbose output does not explain, we can instrument and trace the delegation chain with you.
Request CrewAI Engineering Support
If you want the audit framework first, start with the Production-Ready AI Agent Audit.