## Summary

When multiple Lambda functions cold-start simultaneously in `sst dev` (e.g., during a Step Functions parallel execution), some bridge Lambda instances never receive messages from the local dev process, even though the handler runs successfully locally and produces a correct response. The bridge times out and returns "sst dev is not running", or the Lambda sandbox times out.
## Reproduction
This reproduces consistently when a Step Functions execution triggers multiple Lambda invocations concurrently (e.g., a company model generation that fans out into parallel file model processing). 4+ bridge Lambda instances cold-start within ~2 seconds, each establishing new AppSync subscriptions. 1-2 of them silently fail to receive any messages.
SST version: Fork based on v4.2.0 (anomalyco/sst)
Region: us-east-1
## Evidence

### Case 1: Bridge receives Ping but not Response (120s Lambda timeout)
Execution: michael-CompanyModelMachine, nested CheckFileModelsMachine
Lambda: CheckFileModelsTask (timeout: 2 minutes)
Bridge CloudWatch logs show:
```
subscribe_success
START RequestId: f6bf5f3c
status 200 OK
got packet type=1 from=dev    ← Ping received
msg type=ka                   ← keepalive, then nothing
END RequestId: f6bf5f3c
REPORT Duration: 120000.00 ms Status: timeout
```
The local dev acknowledged the invocation (sent Ping), but no Response (type=3) was ever delivered. No local log files were written for this function, suggesting the worker may not have started — though we didn't have instrumented logging at this point, so we can't be certain.
### Case 2: Bridge receives nothing (16s bridge timeout)
Execution: michael-FileModelMachine:684bf104-...
Lambda: DecideUpdateAction (worker 565e30ac...)
Bridge CloudWatch logs show:
```
subscribe_success id=019cd9da15548b35ba73d5f6
START RequestId: 7c533deb
status 200 OK
timeout 7c533deb
END RequestId: 7c533deb
REPORT Duration: 16034.19 ms
```
Zero messages received on the subscription — no Ping, no Response, nothing. But the local SST log file proves the handler ran successfully:
```
# .sst/log/lambda/FileModelMachineDecideUpdateAction/1773181476-7c533deb-...log
invocation 7c533deb-a6e4-407a-9abd-f807c66fa158
{"companyId":"06e87c43-...","fileId":"02edf737-..."}
response 7c533deb-a6e4-407a-9abd-f807c66fa158
{"shouldSkipFileModel":false,"shouldRunTextractExtraction":true,...}
```
The handler ran, the response was produced, and the local dev presumably published it to AppSync on the bridge's subscription channel, but the bridge never received it. The three other bridge instances that cold-started at the same time worked fine.
### Timeline of Case 2
All 4 bridge Lambdas cold-started within ~2 seconds:
| Worker | Subscribed | Ping received? | Response received? | Duration |
|---|---|---|---|---|
| 655e3259... | 22:24:32 | Yes | Yes | 2533ms |
| 0f263086... | 22:24:32 | Yes | Yes | 2546ms |
| 565e30ac... | 22:24:34 | No | No | 16034ms (timeout) |
| 848897b5... | 22:24:34 | Yes | Yes | 2594ms |
## Hypothesis

This looks consistent with aws-appsync-community#405: AppSync subscriptions can silently fail to deliver messages even after `start_ack`/`subscribe_success`. The subscription isn't actually live yet despite AppSync confirming it, and under burst load with multiple simultaneous subscriptions this race window is wider.
## Unknowns / hedges
- We haven't confirmed that the local dev actually published the Ping/Response to AppSync. It's possible the Go-side publish failed silently or the message was routed incorrectly. Adding logging to `function.go` where it publishes `MessagePing` and `MessageResponse` would confirm this.
- Case 1 and Case 2 have different symptoms (Ping received vs. nothing received). They could be the same AppSync race manifesting differently, or two separate issues.
- We haven't tested whether this correlates strictly with concurrency (does it only happen with 3+ simultaneous cold starts?).
- We haven't ruled out issues in the `sorted()` packet reassembly logic in `bridge.go`, though it shouldn't block other messages.
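To make the first unknown concrete, here is a minimal sketch of the kind of logging wrapper that could surround the publish path. `publishRaw`, the channel name, and the message-type constants are illustrative stand-ins, not SST's actual identifiers:

```go
package main

import (
	"fmt"
	"log/slog"
)

type MessageType int

const (
	// Hypothetical values; type=1 matches the "got packet type=1" bridge log line.
	MessagePing     MessageType = 1
	MessageResponse MessageType = 3
)

// publishRaw stands in for the real AppSync publish call; here it always succeeds.
func publishRaw(channel string, msgType MessageType, payload []byte) error {
	return nil
}

// publishLogged wraps publishRaw so every publish attempt and its outcome
// show up in the local dev logs, confirming whether a publish was even tried.
func publishLogged(channel string, msgType MessageType, payload []byte) error {
	slog.Info("publishing", "channel", channel, "type", int(msgType), "bytes", len(payload))
	if err := publishRaw(channel, msgType, payload); err != nil {
		slog.Error("publish failed", "channel", channel, "type", int(msgType), "err", err)
		return err
	}
	slog.Info("publish ok", "channel", channel, "type", int(msgType))
	return nil
}

func main() {
	err := publishLogged("bridge/565e30ac", MessagePing, []byte("ping"))
	fmt.Println("ping publish error:", err) // prints: ping publish error: <nil>
}
```

With this in place, a missing "publishing"/"publish ok" pair for worker 565e30ac would rule the Go side in or out immediately.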
## Possible fixes
- Subscription confirmation round-trip: after subscribing, the bridge sends a message to the local dev on the shared `/in` channel and waits for the local dev to echo it back on the bridge's private channel. The bridge doesn't process invocations until the round-trip succeeds, confirming the subscription is actually live.
- Retry from local dev: if the local dev doesn't receive an ack after publishing a Ping/Response, it retries with backoff.
- Increase bridge initial timeout (`bridge.go:139`, currently 16s): this only helps if messages are delayed rather than permanently lost.
## Related issues
- sst dev is not running #5496 — "sst dev is not running" (bridge timeout, originally 8s → bumped to 16s)
- Local development with lambdas doesn't work in dev mode #5421 — AppSync connection timeout in dev mode
- Lambda functions exit prematurely #5541 — Lambda functions exit prematurely / "sst dev is not running"
- PR Configurable worker timeout #5718 — Configurable worker timeout (closed without merge)
## Disclosure
LLMs were used to help trace through the SST source code and diagnose the root cause of this issue. The CloudWatch logs, execution histories, and local log files are real production data. We checked the repo for policies on AI-assisted issue reports and didn't find any.