-
Notifications
You must be signed in to change notification settings - Fork 711
Open
Description
Summary
When a subagent makes an API call with large context (~83K tokens) and needs to generate large output (~30KB), the API call can hang indefinitely with zero events written to the JSONL transcript. There is no configurable timeout to detect or recover from this state.
Reproduction
- Launch a background agent (
run_in_background=True) with a task requiring large context - Agent accumulates ~80K+ tokens of context via multiple Read calls
- Agent makes a tool call that succeeds (e.g.,
mkdir) - Next API call (to generate a large Write) hangs — zero JSONL events for 4+ minutes
- Parent agent polls via TaskOutput — always sees
status: running - No mechanism to detect the hang or kill the subagent
Observed behavior
- JSONL transcript stops at the successful
tool_result— no new entries TaskOutputalways returnsrunning(can't distinguish "actively generating" from "API hung")KillShellfails ("not a local bash task")TaskStopkills the agent but destroys any in-progress work
Expected behavior
A configurable timeout (e.g., api_timeout_seconds in ClaudeAgentOptions) that:
- Detects when an API response hasn't produced any events for N seconds
- Marks the agent as
errorwith a clear timeout message - Writes the error to the agent's JSONL for downstream tracking
Workaround
We implemented a 20-minute wall-clock timeout in our stop hook that unblocks the parent task, but the hung agent process itself stays alive consuming resources.
Environment
- claude-agent-sdk: 0.1.48
- Claude Code CLI: 2.1.72 (bundled)
- Observed on: Python 3.12, Linux (GKE)
- Original incident: CLI 2.0.72, but the underlying issue (no API timeout) persists
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels