Live dev bridge silently loses responses under concurrent Lambda cold starts (AppSync subscription race?) #6575

@micthiesen

Description

Summary

When multiple Lambda functions cold-start simultaneously in sst dev (e.g., during a Step Functions parallel execution), some bridge Lambda instances never receive messages from the local dev process — even though the handler runs successfully locally and produces a correct response. The bridge times out and returns "sst dev is not running" or the Lambda sandbox times out.

Reproduction

This reproduces consistently when a Step Functions execution triggers multiple Lambda invocations concurrently (e.g., a company model generation that fans out into parallel file model processing). 4+ bridge Lambda instances cold-start within ~2 seconds, each establishing new AppSync subscriptions. 1-2 of them silently fail to receive any messages.

SST version: Fork based on v4.2.0 (anomalyco/sst)
Region: us-east-1

Evidence

Case 1: Bridge receives Ping but not Response (120s Lambda timeout)

Execution: michael-CompanyModelMachine, nested CheckFileModelsMachine
Lambda: CheckFileModelsTask (timeout: 2 minutes)

Bridge CloudWatch logs show:

subscribe_success
START RequestId: f6bf5f3c
status 200 OK
got packet type=1 from=dev    ← Ping received
msg type=ka                    ← keepalive, then nothing
END RequestId: f6bf5f3c
REPORT Duration: 120000.00 ms  Status: timeout

The local dev acknowledged the invocation (sent Ping), but no Response (type=3) was ever delivered. No local log files were written for this function, suggesting the worker may not have started — though we didn't have instrumented logging at this point, so we can't be certain.

Case 2: Bridge receives nothing (16s bridge timeout)

Execution: michael-FileModelMachine:684bf104-...
Lambda: DecideUpdateAction (worker 565e30ac...)

Bridge CloudWatch logs show:

subscribe_success id=019cd9da15548b35ba73d5f6
START RequestId: 7c533deb
status 200 OK
timeout 7c533deb
END RequestId: 7c533deb
REPORT Duration: 16034.19 ms

Zero messages received on the subscription — no Ping, no Response, nothing. But the local SST log file proves the handler ran successfully:

# .sst/log/lambda/FileModelMachineDecideUpdateAction/1773181476-7c533deb-...log
invocation 7c533deb-a6e4-407a-9abd-f807c66fa158
{"companyId":"06e87c43-...","fileId":"02edf737-..."}
response 7c533deb-a6e4-407a-9abd-f807c66fa158
{"shouldSkipFileModel":false,"shouldRunTextractExtraction":true,...}

The handler ran, the response was produced, the local dev presumably published it to AppSync on the bridge's subscription channel — but the bridge never received it. 3 other bridge instances that cold-started at the same time worked fine.

Timeline of Case 2

All 4 bridge Lambdas cold-started within ~2 seconds:

| Worker      | Subscribed | Ping received? | Response received? | Duration |
|-------------|------------|----------------|--------------------|----------|
| 655e3259... | 22:24:32   | Yes            | Yes                | 2533 ms  |
| 0f263086... | 22:24:32   | Yes            | Yes                | 2546 ms  |
| 565e30ac... | 22:24:34   | No             | No                 | 16034 ms (timeout) |
| 848897b5... | 22:24:34   | Yes            | Yes                | 2594 ms  |

Hypothesis

This looks consistent with aws-appsync-community#405 — AppSync subscriptions can silently fail to deliver messages even after start_ack / subscribe_success: AppSync confirms the subscription, but it isn't actually live yet. Under burst load with multiple simultaneous subscriptions, this race window appears to widen.

Unknowns / hedges

  • We haven't confirmed that the local dev actually published the Ping/Response to AppSync. It's possible the Go-side publish failed silently or the message was routed incorrectly. Adding logging to function.go where it publishes MessagePing and MessageResponse would confirm this.
  • Case 1 and Case 2 have different symptoms (Ping received vs. nothing received). They could be the same AppSync race manifesting differently, or two separate issues.
  • We haven't tested whether this correlates strictly with concurrency (does it only happen with 3+ simultaneous cold starts?).
  • We haven't ruled out issues in the sorted() packet reassembly logic in bridge.go, though it shouldn't block other messages.
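On the last point, we'd expect per-request reassembly to look roughly like this sketch (all names are hypothetical, not SST's actual code): fragments are sorted by index and concatenated, and since the sort is scoped to a single request's fragments, an incomplete set for one request shouldn't block delivery for others — which is why we consider this a less likely culprit.

```go
package main

import (
	"sort"
	"strings"
)

// fragment is a hypothetical stand-in for one chunk of a large payload
// that was split across multiple AppSync messages.
type fragment struct {
	Index int    // position within the original payload
	Data  string // chunk contents
}

// reassemble sorts one request's fragments by index and concatenates
// them. Because it only ever sees fragments for a single request ID,
// a missing fragment here can stall that request but not others.
func reassemble(frags []fragment) string {
	sort.Slice(frags, func(i, j int) bool {
		return frags[i].Index < frags[j].Index
	})
	var b strings.Builder
	for _, f := range frags {
		b.WriteString(f.Data)
	}
	return b.String()
}
```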

Possible fixes

  1. Subscription confirmation round-trip: After subscribing, the bridge sends a message to the local dev on the shared /in channel, and waits for the local dev to echo it back on the bridge's private channel. The bridge doesn't process invocations until the round-trip succeeds, confirming the subscription is actually live.
  2. Retry from local dev: If the local dev doesn't receive an ack after publishing a Ping/Response, it retries with backoff.
  3. Increase bridge initial timeout (bridge.go:139, currently 16s) — this only helps if messages are delayed rather than permanently lost.

Disclosure

LLMs were used to help trace through the SST source code and diagnose the root cause of this issue. The CloudWatch logs, execution histories, and local log files are real production data. We checked the repo for policies on AI-assisted issue reports and didn't find any.
