Live dev bridge silently loses responses under concurrent Lambda cold starts (AppSync subscription race?) #6575

@micthiesen

Description

Summary

When multiple Lambda functions cold-start simultaneously in sst dev (e.g., during a Step Functions parallel execution), some bridge Lambda instances never receive messages from the local dev process — even though the handler runs successfully locally and produces a correct response. The bridge times out and returns "sst dev is not running" or the Lambda sandbox times out.

Reproduction

This reproduces consistently when a Step Functions execution triggers multiple Lambda invocations concurrently (e.g., a company model generation that fans out into parallel file model processing). 4+ bridge Lambda instances cold-start within ~2 seconds, each establishing new AppSync subscriptions. 1-2 of them silently fail to receive any messages.

SST version: Fork based on v4.2.0 (anomalyco/sst)
Region: us-east-1

Evidence

Case 1: Bridge receives Ping but not Response (120s Lambda timeout)

Execution: michael-CompanyModelMachine, nested CheckFileModelsMachine
Lambda: CheckFileModelsTask (timeout: 2 minutes)

Bridge CloudWatch logs show:

subscribe_success
START RequestId: f6bf5f3c
status 200 OK
got packet type=1 from=dev    ← Ping received
msg type=ka                    ← keepalive, then nothing
END RequestId: f6bf5f3c
REPORT Duration: 120000.00 ms  Status: timeout

The local dev acknowledged the invocation (sent Ping), but no Response (type=3) was ever delivered. No local log files were written for this function, suggesting the worker may not have started — though we didn't have instrumented logging at this point, so we can't be certain.

Case 2: Bridge receives nothing (16s bridge timeout)

Execution: michael-FileModelMachine:684bf104-...
Lambda: DecideUpdateAction (worker 565e30ac...)

Bridge CloudWatch logs show:

subscribe_success id=019cd9da15548b35ba73d5f6
START RequestId: 7c533deb
status 200 OK
timeout 7c533deb
END RequestId: 7c533deb
REPORT Duration: 16034.19 ms

Zero messages received on the subscription — no Ping, no Response, nothing. But the local SST log file proves the handler ran successfully:

# .sst/log/lambda/FileModelMachineDecideUpdateAction/1773181476-7c533deb-...log
invocation 7c533deb-a6e4-407a-9abd-f807c66fa158
{"companyId":"06e87c43-...","fileId":"02edf737-..."}
response 7c533deb-a6e4-407a-9abd-f807c66fa158
{"shouldSkipFileModel":false,"shouldRunTextractExtraction":true,...}

The handler ran, the response was produced, the local dev presumably published it to AppSync on the bridge's subscription channel — but the bridge never received it. 3 other bridge instances that cold-started at the same time worked fine.

Timeline of Case 2

All 4 bridge Lambdas cold-started within ~2 seconds:

| Worker      | Subscribed | Ping received? | Response received? | Duration |
|-------------|------------|----------------|--------------------|----------|
| 655e3259... | 22:24:32   | Yes            | Yes                | 2533 ms  |
| 0f263086... | 22:24:32   | Yes            | Yes                | 2546 ms  |
| 565e30ac... | 22:24:34   | No             | No                 | 16034 ms (timeout) |
| 848897b5... | 22:24:34   | Yes            | Yes                | 2594 ms  |

Hypothesis

This looks consistent with aws-appsync-community#405 — AppSync subscriptions can silently fail to deliver messages even after start_ack / subscribe_success: AppSync confirms the subscription, but it isn't actually live yet. Under burst load with multiple simultaneous subscriptions, this race window appears to widen.

Unknowns / hedges

  • We haven't confirmed that the local dev actually published the Ping/Response to AppSync. It's possible the Go-side publish failed silently or the message was routed incorrectly. Adding logging to function.go where it publishes MessagePing and MessageResponse would confirm this.
  • Case 1 and Case 2 have different symptoms (Ping received vs. nothing received). They could be the same AppSync race manifesting differently, or two separate issues.
  • We haven't tested whether this correlates strictly with concurrency (does it only happen with 3+ simultaneous cold starts?).
  • We haven't ruled out issues in the sorted() packet reassembly logic in bridge.go, though it shouldn't block other messages.
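On the last point, we'd expect per-request reassembly to look roughly like this sketch (all names are hypothetical, not SST's actual code): fragments are sorted by index and concatenated, and since the sort is scoped to a single request's fragments, an incomplete set for one request shouldn't block delivery for others — which is why we consider this a less likely culprit.

```go
package main

import (
	"sort"
	"strings"
)

// fragment is a hypothetical stand-in for one chunk of a large payload
// that was split across multiple AppSync messages.
type fragment struct {
	Index int    // position within the original payload
	Data  string // chunk contents
}

// reassemble sorts one request's fragments by index and concatenates
// them. Because it only ever sees fragments for a single request ID,
// a missing fragment here can stall that request but not others.
func reassemble(frags []fragment) string {
	sort.Slice(frags, func(i, j int) bool {
		return frags[i].Index < frags[j].Index
	})
	var b strings.Builder
	for _, f := range frags {
		b.WriteString(f.Data)
	}
	return b.String()
}
```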

Possible fixes

  1. Subscription confirmation round-trip: After subscribing, the bridge sends a message to the local dev on the shared /in channel, and waits for the local dev to echo it back on the bridge's private channel. The bridge doesn't process invocations until the round-trip succeeds, confirming the subscription is actually live.
  2. Retry from local dev: If the local dev doesn't receive an ack after publishing a Ping/Response, it retries with backoff.
  3. Increase bridge initial timeout (bridge.go:139, currently 16s) — this only helps if messages are delayed rather than permanently lost.

Disclosure

LLMs were used to help trace through the SST source code and diagnose the root cause of this issue. The CloudWatch logs, execution histories, and local log files are real production data. We checked the repo for policies on AI-assisted issue reports and didn't find any.
