fix: supervise dashboard and terminal ws runtime with health checks#418

Open
AgentWrapper wants to merge 1 commit into main from feat/417

@AgentWrapper
Collaborator

Closes #417

Summary

This PR replaces the fragile, ad-hoc use of dashboard/ws dev processes with a supervised runtime model, and adds explicit health/status visibility for:

  • dashboard web server (3000)
  • terminal websocket server (14800)
  • direct terminal websocket server (14801)

It also explicitly addresses recurring Connecting... XDA stalls by supervising both websocket backends and exposing readiness checks for them.
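
For illustration, a readiness check for these three ports could be sketched as a plain TCP connect probe. This is a minimal sketch, not the code in this PR: the helper names (`probeTcp`, `probeAll`), the service ids, the loopback host, and the 1-second timeout are all assumptions; only the port numbers come from the summary above.

```typescript
import * as net from "node:net";

interface ProbeTarget {
  id: string; // illustrative label, not necessarily the id used by the CLI
  port: number;
}

// Ports taken from the PR summary.
const TARGETS: ProbeTarget[] = [
  { id: "dashboard", port: 3000 },
  { id: "terminal-ws", port: 14800 },
  { id: "direct-terminal-ws", port: 14801 },
];

/** Resolves true if a TCP connection to host:port succeeds within timeoutMs. */
function probeTcp(port: number, host = "127.0.0.1", timeoutMs = 1000): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = net.connect({ port, host });
    const finish = (ready: boolean): void => {
      socket.destroy();
      resolve(ready);
    };
    socket.setTimeout(timeoutMs, () => finish(false));
    socket.once("connect", () => finish(true));
    socket.once("error", () => finish(false));
  });
}

/** Probes every target in parallel and reports readiness per service id. */
async function probeAll(targets: ProbeTarget[] = TARGETS): Promise<Record<string, boolean>> {
  const results = await Promise.all(targets.map((t) => probeTcp(t.port)));
  const report: Record<string, boolean> = {};
  targets.forEach((t, i) => {
    report[t.id] = results[i];
  });
  return report;
}
```

A successful connect only shows that something is listening on the port; a stricter probe would also issue an HTTP request or a websocket handshake.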

What Changed

  1. Added new ao services command group:
    • ao services install
    • ao services start
    • ao services stop
    • ao services status (--strict / --json)
    • internal portable supervisor runner (ao services run-supervisor)
  2. Implemented supervised runtime management library:
    • Linux systemd user-service path (auto-restart + login startup via enable)
    • portable fallback supervisor path with watchdog restart loop
    • config-scoped service naming via config hash
    • health probes for dashboard + both ws backends with clear diagnostics
  3. Migrated normal startup path to supervision:
    • ao start now starts supervised services instead of launching ad-hoc pnpm dev process trees
    • ao stop now stops managed services (with legacy best-effort port cleanup fallback)
  4. Added production-oriented web scripts used by supervision:
    • start:dashboard
    • start:terminal
    • start:direct-terminal
  5. Added tests for health/status behavior:
    • packages/cli/__tests__/lib/services.test.ts
    • packages/cli/__tests__/commands/services.test.ts
    • updated packages/cli/__tests__/commands/start.test.ts for managed service integration
  6. Updated docs and migration/recovery guidance:
    • README.md
    • SETUP.md
    • TROUBLESHOOTING.md

Acceptance Criteria Coverage

  • Killing one process auto-recovers without manual restarts:
    • systemd path uses Restart=always
    • supervisor path restarts child processes with backoff
  • /ao/ and /sessions/<id> remain reachable under routine restarts:
    • managed services keep dashboard/ws stack alive; status command exposes readiness
  • Terminal connects reliably (no perpetual Connecting... XDA from missing ws backends):
    • both ws services are supervised and checked in ao services status
    • docs include explicit XDA stall troubleshooting and recovery
  • Tests added for health/status behavior:
    • added CLI service health/status tests as noted above
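
The "restarts child processes with backoff" behavior in the supervisor path can be illustrated with a small delay function. The description does not state the actual schedule, so the capped exponential backoff below, along with the name `restartDelayMs` and the 500 ms / 30 s defaults, is an assumption made for the sake of the sketch.

```typescript
/**
 * Delay before the next restart attempt.
 * Doubles with each consecutive failure and is capped at maxMs.
 * (Hypothetical helper; the PR's real schedule may differ.)
 */
function restartDelayMs(consecutiveFailures: number, baseMs = 500, maxMs = 30000): number {
  const exponent = Math.max(0, consecutiveFailures);
  return Math.min(maxMs, baseMs * 2 ** exponent);
}

// Print the delays for the first few consecutive crashes of one child process.
for (let failures = 0; failures < 5; failures++) {
  console.log(`restart #${failures + 1} after ${restartDelayMs(failures)} ms`);
}
```

A supervisor loop would typically reset the failure counter once a child has stayed up for some minimum time, so that a single crash after hours of uptime is restarted quickly.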

Design Trade-offs

  • Preferred backend: systemd user services on Linux for robust lifecycle/autostart semantics.
  • Fallback backend: portable detached supervisor for environments without usable systemd.
  • Recommendation: use ao services install for normal operation; keep ao dashboard as explicitly unsupervised dev mode.

Validation (exact commands run)

pnpm exec eslint packages/cli/src/commands/dashboard.ts packages/cli/src/commands/start.ts packages/cli/src/index.ts packages/cli/src/commands/services.ts packages/cli/src/lib/services.ts packages/cli/__tests__/commands/start.test.ts packages/cli/__tests__/commands/services.test.ts packages/cli/__tests__/lib/services.test.ts
# Result: pass

pnpm --filter @composio/ao-cli typecheck
# Result: pass

pnpm --filter @composio/ao-cli test
# Result: pass (16 files, 211 tests)

pnpm --filter @composio/ao-web typecheck
# Result: pass

pnpm --filter @composio/ao-web test
# Result: pass (14 files, 383 tests)

@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

const notes = svc.details;
console.log(
` ${svc.id.padEnd(22)}${processState.padEnd(12)}${port.padEnd(8)}${ready.padEnd(8)}${notes}`,
);

Chalk-colored strings break padEnd column alignment

Medium Severity

colorProcessState returns a chalk-formatted string containing invisible ANSI escape codes. Calling .padEnd(12) on this string counts those escape characters toward the length, so the padding will be far too short and the status table columns will be visibly misaligned. The ready column using chalk.green/chalk.red via .padEnd(8) has the same problem.
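
One possible fix is to pad by visible width rather than raw string length. The sketch below assumes the colored strings contain only SGR color sequences (the kind chalk emits); `visibleLength` and `padEndVisible` are hypothetical helper names, not functions from this PR.

```typescript
// Matches SGR color/style sequences such as "\u001b[32m".
const ANSI_SGR = /\u001b\[[0-9;]*m/g;

/** Length of the string as it appears on screen, ignoring ANSI color codes. */
function visibleLength(text: string): number {
  return text.replace(ANSI_SGR, "").length;
}

/** Like String.prototype.padEnd, but pads based on visible width. */
function padEndVisible(text: string, width: number): string {
  const missing = Math.max(0, width - visibleLength(text));
  return text + " ".repeat(missing);
}

// Example: a green "active" is padded to 12 visible columns.
const green = "\u001b[32mactive\u001b[39m";
console.log(`[${padEndVisible(green, 12)}]`);
```

An alternative with the same effect is to call `.padEnd()` on the plain text first and apply the color afterwards, so the escape codes never take part in the length calculation.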

const status = await getStatusForManager(config, "supervisor");
return { manager: "supervisor", stopped, status };
}
}

Auto-stop ignores running supervisor when systemd available

High Severity

When preference is "auto" and systemd is available, stopManagedServices only attempts stopSystemdServices and returns stopped: true on success. If services were actually started via the supervisor fallback (which happens when systemd is available but startSystemdServices failed at runtime), the supervisor and its child processes are never stopped. systemctl stop on inactive units succeeds silently, masking the issue.
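
A possible shape for the fix is to decide which managers to stop from the observed runtime state instead of from availability alone. The following is a sketch only: `managersToStop`, `RuntimeState`, and the `supervisorRunning` flag (for example, derived from a live supervisor process for this config) are hypothetical; the `"auto"` preference and the two manager names come from the comment above.

```typescript
type Manager = "systemd" | "supervisor";
type Preference = "auto" | Manager;

interface RuntimeState {
  systemdAvailable: boolean;
  // Assumed signal that a fallback supervisor is alive for this config.
  supervisorRunning: boolean;
}

/** Returns every manager that should receive a stop request. */
function managersToStop(preference: Preference, state: RuntimeState): Manager[] {
  // An explicit preference targets exactly that manager.
  if (preference !== "auto") return [preference];

  const targets: Manager[] = [];
  if (state.systemdAvailable) targets.push("systemd");
  // In auto mode, also stop a running supervisor even when systemd is available,
  // since start may have fallen back to it after a systemd failure.
  if (state.supervisorRunning || !state.systemdAvailable) targets.push("supervisor");
  return targets;
}
```

With this shape, the case described above (systemd available, services actually running under the supervisor) yields both managers, so the supervisor and its children are no longer left behind.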

}
} else {
await preflight.checkPort(port);
port = newPort;

Port-in-use preflight check removed for normal start

Medium Severity

The old code called preflight.checkPort(port) when opts?.autoPort was false (normal ao start with existing config), giving a clear "Port N is already in use" error. This check was removed entirely. Now, if the configured port is occupied, startManagedServices launches services that fail to bind, producing only a vague "readiness checks are still failing" message instead of an actionable diagnostic.
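
To restore an actionable diagnostic, a bind check could run before the managed services are started. The sketch below is a stand-in for the removed `preflight.checkPort` call, whose real implementation is not shown in this review; `isPortFree` and `assertPortFree` are hypothetical names, and binding on loopback is an assumption.

```typescript
import * as net from "node:net";

/** Resolves true if `port` can be bound on `host`, false if it is already in use. */
function isPortFree(port: number, host = "127.0.0.1"): Promise<boolean> {
  return new Promise((resolve, reject) => {
    const probe = net.createServer();
    probe.once("error", (err: NodeJS.ErrnoException) => {
      if (err.code === "EADDRINUSE") resolve(false);
      else reject(err); // e.g. EACCES: surface unexpected failures to the caller
    });
    probe.once("listening", () => probe.close(() => resolve(true)));
    probe.listen(port, host);
  });
}

/** Throws a clear error before any service is launched on an occupied port. */
async function assertPortFree(port: number): Promise<void> {
  if (!(await isPortFree(port))) {
    throw new Error(`Port ${port} is already in use`);
  }
}
```

Such a check would need to skip ports already held by the managed services themselves, otherwise a repeated ao start would report its own running dashboard as a conflict.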

Development

Successfully merging this pull request may close these issues.

fix: dashboard terminal endpoints die without supervision causing recurring /ao outages
