Fix: project navigation gets stuck after a long disconnect

Temporary planning doc — delete from docs/planning/ once the fix has shipped. Why-it-broke history belongs in the PR description; this file just guides the implementation.

Problem

When the browser tab is left open inside a project for a long time and the SignalR/API connection drops (e.g. the API container restarts, the access token expires, the laptop sleeps), the project transitions into a corrupted state:

  • The project nav collapses to only Project Overview and Studies (Studies has no children either).
  • The console shows a long retry loop of Failed to complete negotiation / CORS preflight failures on /notifications/negotiate.
  • Even after the connection is fully restored (sometimes hours later), the nav never recovers. The only workaround is to navigate to a different page entirely.

Real captured trace (from a pr-2467.syrf.org.uk session):

  • 00:47:04 — sign-in OK, WebSocket connected, full project loaded.
  • 03:48:04 — WebSocket abruptly closes with 1006 (no reason given).
  • 03:48:04+ — every /notifications/negotiate POST fails CORS preflight. Browser surfaces "No Access-Control-Allow-Origin header" because the API/ingress is responding with an error page that doesn't run CORS middleware.
  • Retry loop continues indefinitely; nav stays stuck at the two surviving items.

Root cause (high confidence)

There are at least two cooperating bugs that produce the observed symptom. The frontend bug is what makes the corruption sticky, and is what this PR will fix.

1. Reconnect path skips the project re-hydrate

In signal-r.service.ts:

  • Lines 297-302: hubReconnected$ (the SignalR built-in onreconnected event) feeds allReconnected$.
  • Lines 656-668: allReconnected$ is the trigger that re-dispatches loadProject after a reconnect, which re-populates the project entity, permissions, and stages, and with them the nav.
  • Lines 637-653: there is also a fallback 5-second timer that detects HubConnectionState.Disconnected and calls this._connect() to start the connection from scratch. This is the path that fires when SignalR's built-in withAutomaticReconnect() gives up, which is what happens during a multi-minute outage: with default settings it makes four attempts (after delays of 0, 2, 10, and 30 seconds) and then transitions the hub to Disconnected.
  • Lines 671-702: _connect(reconnect = false) only fires allReconnected$.next() when reconnect is true (lines 684-686).
  • Line 653: the timer subscriber calls this._connect() with no argument, so reconnect defaults to false.

Net effect: the recovery path that actually fires after a long disconnect never tells the rest of the app to re-hydrate the project. The hub reconnects silently, but the store is still in its degraded state.
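The gap can be reproduced in a dependency-free model. This is a sketch, not the real signal-r.service.ts: ReconnectModel, onAllReconnected, timerTick, and passReconnectFlag are hypothetical stand-ins, and the hub start is assumed to succeed. Only the reconnect = false default and the conditional emit mirror the behaviour described above.

```typescript
// Simplified, hypothetical model of the two reconnect paths described above.
// Names other than the reconnect flag are invented for illustration.
type Listener = () => void;

class ReconnectModel {
  private listeners: Listener[] = [];
  connected = false;

  // Stand-in for subscribing to allReconnected$ (e.g. the loadProject trigger).
  onAllReconnected(listener: Listener): void {
    this.listeners.push(listener);
  }

  // Stand-in for allReconnected$.next().
  private emitAllReconnected(): void {
    this.listeners.forEach((listener) => listener());
  }

  // Mirrors the documented behaviour: emits only when reconnect is true.
  connect(reconnect = false): void {
    this.connected = true; // assumption: the hub start succeeded
    if (reconnect) {
      this.emitAllReconnected();
    }
  }

  // Stand-in for one tick of the fallback timer finding the hub disconnected.
  timerTick(passReconnectFlag: boolean): void {
    if (this.connected) {
      return;
    }
    if (passReconnectFlag) {
      this.connect(true);
    } else {
      this.connect(); // current behaviour: reconnect defaults to false
    }
  }
}

// Counts how many project reloads a timer-driven reconnect would trigger.
function reloadsAfterTimerReconnect(passReconnectFlag: boolean): number {
  const model = new ReconnectModel();
  let reloads = 0;
  model.onAllReconnected(() => {
    reloads += 1;
  });
  model.timerTick(passReconnectFlag);
  return reloads;
}

const reloadsWithCurrentCode = reloadsAfterTimerReconnect(false); // 0
const reloadsWithFlagPassed = reloadsAfterTimerReconnect(true); // 1
```

In the model, the hub ends up connected in both cases; the only difference is whether any subscriber is told to re-hydrate.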

2. Nav reads from selectors that are empty when the project is degraded

In project-nav.component.ts:

  • "Project Overview" — visible: true unconditionally → always renders.
  • "Studies" header — visible: true, but its children are gated on projectPermissions?.viewSearches / projectPermissions?.viewStudies. If selectCurrentProjectPermissionMap returns null, the header still renders but the dropdown is empty.
  • "Stages" — built from selectStagesForCurrentProject. If the project entity has been cleared / never reloaded, this is [] → entire section disappears.
  • "Screening", "Data Export", "Settings" — all gated on projectPermissions?.design / exportData / editMemberships → all hidden when permissions are null.

So when the project entity is missing/stale, exactly the two surviving items in the screenshot are what's left. This isn't really a bug in nav itself — it's correct rendering of an incorrect store state. But it makes the user-visible failure mode dramatic.
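The gating can be illustrated with a reduced version of the nav-building logic. This is a sketch under assumptions, not the code in project-nav.component.ts: buildNav, NavItem, and the child labels "Searches" and "Studies" are hypothetical, and the pairing of design / exportData / editMemberships with Screening / Data Export / Settings is inferred from the order of the bullets above.

```typescript
// Hypothetical reduction of the nav gating described above.
// Permission field names come from the bullets; everything else is assumed.
interface ProjectPermissions {
  viewSearches: boolean;
  viewStudies: boolean;
  design: boolean;
  exportData: boolean;
  editMemberships: boolean;
}

interface NavItem {
  label: string;
  children: string[];
}

function buildNav(
  permissions: ProjectPermissions | null,
  stages: string[],
): NavItem[] {
  const items: NavItem[] = [];

  // Always rendered, regardless of store state.
  items.push({ label: "Project Overview", children: [] });

  // Header always rendered; children depend on permissions.
  const studyChildren: string[] = [];
  if (permissions?.viewSearches) studyChildren.push("Searches");
  if (permissions?.viewStudies) studyChildren.push("Studies");
  items.push({ label: "Studies", children: studyChildren });

  // Disappears entirely when the stage list is empty.
  if (stages.length > 0) items.push({ label: "Stages", children: stages });

  // Hidden whenever permissions are null.
  if (permissions?.design) items.push({ label: "Screening", children: [] });
  if (permissions?.exportData) items.push({ label: "Data Export", children: [] });
  if (permissions?.editMemberships) items.push({ label: "Settings", children: [] });

  return items;
}

// Degraded store: no permissions, no stages.
const degradedNav = buildNav(null, []);
const degradedLabels = degradedNav.map((item) => item.label);
// degradedLabels is ["Project Overview", "Studies"], and Studies has no children.
```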

3. Why CORS errors appear (not a frontend bug, but worth noting)

The repeated No 'Access-Control-Allow-Origin' header errors are a server/ingress symptom: when the API rejects the request before reaching the CORS middleware (most likely because the access token in ?access_token=… has expired or the API is briefly returning 5xx during a redeploy), the browser surfaces it as a CORS failure rather than a 401. This isn't something to fix on the web side; the frontend should just be resilient to it.

Plan

This PR will:

  1. Make the timer-based reconnect path also fire allReconnected$. The simplest fix: change line 653 to .subscribe(() => this._connect(true)) so the success branch dispatches allReconnected$.next(). Verify that the first connection is unaffected; it should not be, because the first connection runs through a different code path in _subHubSubs.
  2. Force a fresh access token on each negotiate retry. The accessTokenFactory already calls getAccessTokenSilently(), but only when SignalR asks for it — and SignalR caches the token across retries. Investigate whether we need to explicitly invalidate / refresh on the catch branch of _connect() so that a stale token isn't reused indefinitely.
  3. Keep nav structurally stable while permissions reload. Optional: render the nav with disabled/loading states instead of dropping items, so the user gets feedback that the app is recovering rather than thinking it's broken. To decide once #1 is fixed and we see how fast recovery feels.
  4. Add tests:
       • Unit test for _connect(true) firing allReconnected$.
       • Unit/integration test that simulates: hub disconnect → timer-driven reconnect → project reload action dispatched.
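For the optional nav-stability item, one possible shape is a three-state gate that distinguishes "permissions not loaded" from "permission denied". This is only a sketch: Gate and gateFor are hypothetical names, and it assumes a null permission map means "still loading", which matches the degraded state described earlier but is not confirmed for every code path.

```typescript
// Hypothetical three-state gate for a nav item.
// Assumption: a null permission map means "not loaded yet", not "denied".
type Gate = "visible" | "hidden" | "loading";

function gateFor(
  permissions: Record<string, boolean> | null,
  key: string,
): Gate {
  if (permissions === null) {
    return "loading"; // keep the item in place, rendered disabled
  }
  return permissions[key] ? "visible" : "hidden";
}

const whileReloading = gateFor(null, "design"); // "loading"
const whenGranted = gateFor({ design: true }, "design"); // "visible"
const whenDenied = gateFor({ design: false }, "design"); // "hidden"
```

With a gate like this, the nav would keep its structure during recovery and only drop items once permissions are known to deny them.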

Open questions

  • Should we proactively trigger _loadProjectRequest.dispatchRequest.loadProject on any transition from disconnected → connected (regardless of how long the gap was)? Cheap insurance.
  • Are there other selectors (e.g. data-export jobs, study lists) that need re-hydrating on reconnect, not just the project? Audit the saga/effects that subscribe to allReconnected$.
  • Is the long-tab token-refresh story working at all? The accessTokenFactory resolves once when SignalR first calls it; if getAccessTokenSilently() itself fails silently after the refresh token expires, we'll never reconnect. Worth a separate look.
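The first question (reload on any disconnected → connected transition) could be prototyped as a small state watcher. This is a sketch under assumptions: createRecoveryWatcher and onRecovered are hypothetical, the state names mirror SignalR's HubConnectionState values, and the watcher treats both Reconnecting and Disconnected as a gap. A real implementation would also need to avoid double-dispatching alongside the existing hubReconnected$ path.

```typescript
// Hypothetical watcher that fires once per recovery from a connection gap.
// State names mirror SignalR's HubConnectionState string values.
type HubState =
  | "Disconnected"
  | "Connecting"
  | "Connected"
  | "Disconnecting"
  | "Reconnecting";

function createRecoveryWatcher(
  onRecovered: () => void,
): (next: HubState) => void {
  let everConnected = false; // the very first connect is not a recovery
  let sawGap = false;

  return (next: HubState): void => {
    if (next === "Connected") {
      if (everConnected && sawGap) {
        onRecovered(); // e.g. dispatch the project reload here
      }
      everConnected = true;
      sawGap = false;
    } else if (next === "Disconnected" || next === "Reconnecting") {
      sawGap = true;
    }
  };
}

// Usage: first connect, a drop, the built-in retries giving up, then recovery.
let recoveries = 0;
const watch = createRecoveryWatcher(() => {
  recoveries += 1;
});
const observedStates: HubState[] = [
  "Connecting",
  "Connected",
  "Reconnecting",
  "Disconnected",
  "Connecting",
  "Connected",
];
observedStates.forEach(watch);
// recoveries is 1: the initial connect does not count, the recovery does.
```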

Out of scope

  • Backend CORS / negotiate behaviour — that's a separate investigation in the API/ingress.
  • Full offline-mode UX — this PR aims for "recover on its own once the network is back", not "work offline".