fix: thread-safe access to Organization.Sessions/Groups from background threads#328
Conversation
…ground threads Organization.Sessions and Organization.Groups are plain List<T> collections with a UI-thread-only design. However, several background threads read them without synchronization: - SaveActiveSessionsToDisk (timer callback) - RunCodespaceHealthCheckAsync (background task) - NotifyCodespaceConnected (health check callback) - ResumeCodespaceSessionsAsync (health check callback) - SaveOrganizationCore (timer callback serialization) Add _organizationLock + SnapshotSessionMetas()/SnapshotGroups() helpers that produce thread-safe copies for off-thread reads. Wrap serialization in SaveOrganizationCore with the lock to prevent tearing during writes. Replace the fragile try/catch(InvalidOperationException) pattern in RunCodespaceHealthCheckAsync with the proper SnapshotGroups() call. Fixes PureWeen#327 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
btessiau
left a comment
There was a problem hiding this comment.
PR #328 Review — Organization.Sessions/Groups Thread-Safety
CI Status:
Existing reviews: None
🔴 CRITICAL — Lock design is one-sided; race is NOT actually fixed
CopilotService.cs:269-280, CopilotService.Organization.cs:180-183
Flagged by 4/5 models
_organizationLock is acquired by readers only (snapshot helpers + SaveOrganizationCore). Every mutation site — Organization.Groups.Add(), Organization.Sessions.Add(), RemoveAll(), etc. — runs on the UI thread without holding the lock. A mutex only prevents races when all competing parties acquire it. A background thread calling SnapshotGroups() can still execute Organization.Groups.ToList() concurrently with a UI-thread Organization.Groups.Add() that is resizing the internal array. Under the C# memory model this is a data race with undefined behavior. The InvalidOperationException from foreach's _version check was avoided (.ToList() uses Array.Copy, not the enumerator), but a torn read remains possible.
More critically, this PR removed the existing try-catch fallback in RunCodespaceHealthCheckAsync:
// REMOVED — was the safety net for exactly this race:
catch (InvalidOperationException) { return; }The code is now less resilient than before.
Required fix: Either hold _organizationLock at every mutation site, or snapshot the full Organization object under the lock before serializing/reading. The simplest correct pattern is to wrap all Organization.Sessions.Add/Remove and Organization.Groups.Add/Remove calls in lock (_organizationLock) blocks.
🟡 MODERATE — StartCodespaceHealthCheck() line 323: bare Organization.Groups.Any() on background thread
CopilotService.Codespace.cs:323
Flagged by 4/5 models
private void StartCodespaceHealthCheck()
{
lock (_healthCheckLock)
{
if (!Organization.Groups.Any(g => g.IsCodespace)) return; // ← bare read, no snapshotStartCodespaceHealthCheck() is called from StartAndReconnectCodespaceAsync (line 854) after multiple await calls, meaning it runs on a ThreadPool continuation — not the UI thread. This is the same race the PR claims to fix in RunCodespaceHealthCheckAsync, but it was missed here. Must use SnapshotGroups().Any(...).
🟡 MODERATE — GetClientForGroup() line 42: bare Organization.Groups.FirstOrDefault() not migrated
CopilotService.Codespace.cs:42
Flagged by 4/5 models
private CopilotClient GetClientForGroup(string? groupId)
{
var group = Organization.Groups.FirstOrDefault(g => g.Id == groupId); // ← bareCalled from SendPromptAsync. Once any await in the call path runs on a ThreadPool continuation, this bare enumeration races with UI mutations. Should use SnapshotGroups().
🟢 MINOR — Lock held during full JsonSerializer.Serialize on UI thread
CopilotService.Organization.cs:180-183
Flagged by 2/5 models
SaveOrganizationCore holds _organizationLock for the entire serialization (potentially slow for large org state). Background snapshot callers block for this duration. Consider snapshotting under the lock and serializing outside it:
string json;
OrganizationState orgSnapshot;
lock (_organizationLock) { orgSnapshot = /* clone */ ; }
json = JsonSerializer.Serialize(orgSnapshot, ...);🟢 MINOR — No regression tests for thread-safety behavior
PolyPilot.Tests/
Flagged by 4/5 models
SnapshotSessionMetas() and SnapshotGroups() are internal and testable, but no tests verify (a) that they return independent copies of the live list, or (b) that concurrent read+write doesn't produce exceptions. The existing test infrastructure (SessionOrganizationTests.cs, SessionPersistenceTests.cs) provides a good pattern to follow.
Test Coverage Assessment
The PR introduces a new lock discipline but adds no tests. Suggested scenarios:
- Verify
SnapshotSessionMetas()returns a copy, not a live reference (mutating the result doesn't affect the original) - Concurrent background snapshot + UI mutation — confirm no
InvalidOperationException StartCodespaceHealthCheckcall from async continuation context — verify it uses snapshot after the fix
Previous Findings Status
This is the first review of PR #328. No prior findings to track.
Recommended Action: ⚠️ Request Changes
The lock design has a fundamental correctness issue: locking only the readers while leaving writers unlocked does not prevent the race. The original crash trigger was swapped from foreach/enumerator to Array.Copy (which rarely throws), but the underlying data race is unresolved and undefined behavior under the C# memory model. Three specific changes are needed before merge:
- Hold
_organizationLockat allOrganization.Sessions/Organization.Groupsmutation sites, OR switch to a lock-protected wrapper that enforces this consistently - Fix
StartCodespaceHealthCheck():323— useSnapshotGroups().Any(...) - Fix
GetClientForGroup():42— useSnapshotGroups()or confirm it's always UI-thread-only
Address review: _organizationLock was only acquired by snapshot readers, leaving writers unprotected — a one-sided lock that doesn't prevent races. Fix: Add locked mutation helpers (AddSessionMeta, RemoveSessionMetasWhere, AddGroup, InsertGroup, RemoveGroupsWhere, RemoveSessionMeta) that hold _organizationLock on every list mutation. All ~35 mutation sites across CopilotService.cs, Organization.cs, Codespace.cs, and Providers.cs now use these helpers. Additional fixes: - StartCodespaceHealthCheck: use SnapshotGroups().Any() (was bare read) - GetClientForGroup: use SnapshotGroups().FirstOrDefault() for safety - StartCodespaceAsync Task.Run catch: snapshot + marshal mutation to UI - SaveOrganizationCore: snapshot-then-serialize to minimize lock duration Tests: 10 new tests in OrganizationThreadSafetyTests verifying snapshot isolation (independent copies), helper correctness, and concurrent read+write safety (no exceptions under parallel stress). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
btessiau
left a comment
There was a problem hiding this comment.
Addressed all findings in e79c0b8
🔴 CRITICAL — Lock design is one-sided → Fixed
You're absolutely right — locking only readers while leaving writers unlocked doesn't prevent the race.
Fix: Added 6 locked mutation helpers (AddSessionMeta, RemoveSessionMetasWhere, RemoveSessionMeta, AddGroup, InsertGroup, RemoveGroupsWhere) that hold _organizationLock. Migrated all ~35 mutation sites across CopilotService.cs, Organization.cs, Codespace.cs, and Providers.cs to use these helpers. Now both readers (snapshots) AND writers (mutations) acquire the same lock.
🟡 MODERATE — StartCodespaceHealthCheck():323 bare read → Fixed
Changed to SnapshotGroups().Any(g => g.IsCodespace).
🟡 MODERATE — GetClientForGroup():42 bare read → Fixed
Changed to SnapshotGroups().FirstOrDefault(...). While it's likely always on the UI thread (MAUI SynchronizationContext), using the snapshot is defensive and consistent.
🟢 MINOR — Lock held during serialization → Fixed
SaveOrganizationCore now snapshots under lock, serializes outside:
OrganizationState snapshot;
lock (_organizationLock) {
snapshot = new OrganizationState { Groups = ...ToList(), Sessions = ...ToList(), ... };
}
json = JsonSerializer.Serialize(snapshot, ...);🟢 MINOR — No regression tests → Fixed
Added OrganizationThreadSafetyTests.cs with 10 tests:
- Snapshot isolation (independent copies for Sessions and Groups)
- All 6 mutation helpers work correctly
- Concurrent stress tests: parallel background reader + locked writer with no exceptions
Additional fix: StartCodespaceAsync Task.Run catch block
The Organization.Groups.FirstOrDefault() at line 299 was a bare off-thread read in a Task.Run. Fixed to use SnapshotGroups() for the existence check, then marshal the actual mutation to the UI thread via InvokeOnUI().
Build: 0 errors. Tests: 2287+ passed (10 new), 4 pre-existing failures (3 CliPath + 1 flaky ChatDatabase).
btessiau
left a comment
There was a problem hiding this comment.
PR #328 Re-Review — Multi-Model Consensus Report
CI Status:
Models used: claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex
Commits reviewed: f7550ec (original) + e79c0b8 (fixes)
Previous Findings — All FIXED ✅
| # | Finding | Status |
|---|---|---|
| 1 | 🔴 CRITICAL — Lock one-sided: ~35 mutation sites unlocked | ✅ FIXED — 8 locked helpers added; all sites migrated |
| 2 | 🟡 MODERATE — StartCodespaceHealthCheck() bare Groups.Any() on bg thread |
✅ FIXED — SnapshotGroups().Any() |
| 3 | 🟡 MODERATE — GetClientForGroup():42 bare Groups.FirstOrDefault() |
✅ FIXED — SnapshotGroups().FirstOrDefault() |
| 4 | 🟢 MINOR — SaveOrganizationCore held lock during JSON serialization |
✅ FIXED — snapshot under lock, serialize outside |
| 5 | 🟢 MINOR — No regression tests | ✅ FIXED — 10 tests in OrganizationThreadSafetyTests.cs |
New Finding — Consensus (4/5 models)
🟡 MODERATE — CopilotService.Persistence.cs:298,304,347,361 — Bare Organization.Groups reads in RestorePreviousSessionsAsync
Four sites still enumerate Organization.Groups directly without the lock:
// line 298
foreach (var group in Organization.Groups.Where(g => g.IsCodespace))
// line 304
foreach (var g in Organization.Groups)
// line 347
Organization.Groups.Any(g => g.Id == entry.GroupId && g.IsCodespace)
// line 361
Organization.Groups.FirstOrDefault(g => g.Id == entry.GroupId)The IsRestoring = true flag suppresses timer callbacks, but it does not gate WsBridge SyncRemoteSessions (background thread) or SDK event handlers — both of which call the locked mutation helpers (AddGroup, RemoveGroupsWhere) regardless of IsRestoring. During ReconnectAsync, a bridge reconnect or incoming SDK event can call AddGroup() concurrently with the foreach (var g in Organization.Groups) at line 304, throwing InvalidOperationException.
Fix: Take a single snapshot at the top of the restore block:
var orgGroups = SnapshotGroups();
// then use orgGroups throughout lines 298-361Not Meeting Consensus (noted but filtered)
- Shallow-copy mutation of
SessionGroupfields (ReconnectAttempts++on background thread withoutInvokeOnUI) — pre-existing, flagged by 1/5 models, out of scope for this PR. - Lock ordering (
_healthCheckLock → _organizationLock) — 3/5 models confirmed no deadlock risk; no reverse acquisition exists. - TOCTOU
Any()+AddSessionMeta()— 1/5 flagged; 1/5 explicitly confirmed these are UI-thread-only paths, not a race.
Recommendation: ⚠️ Request changes
One MODERATE finding remains (4/5 consensus): the 4 bare Organization.Groups reads in RestorePreviousSessionsAsync. The fix is a single SnapshotGroups() call at the top of the restore block. All other previous findings are resolved.
Replace 4 bare Organization.Groups reads (lines 298, 304, 347, 361) with a single SnapshotGroups() call at the top of the restore block. Bridge reconnects and SDK events can call AddGroup/RemoveGroupsWhere concurrently during restore, racing with the foreach enumeration. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
btessiau
left a comment
There was a problem hiding this comment.
Addressed the remaining finding in 4fad726:
🟡 MODERATE — Bare Organization.Groups reads in RestorePreviousSessionsAsync → Fixed
Replaced all 4 bare reads (lines 298, 304, 347, 361) with a single var restoreGroups = SnapshotGroups() at the top of the restore block. Bridge reconnects and SDK events can fire AddGroup/RemoveGroupsWhere concurrently during restore — the snapshot ensures the foreach enumeration is isolated.
Build: 0 errors. Tests: 10/10 thread safety tests pass.
btessiau
left a comment
There was a problem hiding this comment.
🔄 PR #328 Re-Review (Round 3) — Organization Thread-Safety Locking
CI Status:
Commits reviewed: f7550ec (original) → e79c0b8 (reader+writer locking) → 4fad726 (RestorePreviousSessionsAsync)
Previous Findings Status
| # | Severity | Finding | Status |
|---|---|---|---|
| 1 | 🔴 CRITICAL | Lock one-sided — mutations not locked | ✅ FIXED |
| 2 | 🟡 MODERATE | StartCodespaceHealthCheck bare Organization.Groups read |
✅ FIXED |
| 3 | 🟡 MODERATE | GetClientForGroup bare Organization.Groups read |
✅ FIXED |
| 4 | 🟢 MINOR | Lock held during JsonSerializer.Serialize |
✅ FIXED |
| 5 | 🟢 MINOR | No regression tests for thread-safety | ✅ FIXED |
| 6 | 🟡 MODERATE | RestorePreviousSessionsAsync — 4 bare Organization.Groups reads |
✅ FIXED |
Fix Verification
#1 — 8 locked mutation helpers added (AddSessionMeta, RemoveSessionMetasWhere, RemoveSessionMeta, AddGroup, InsertGroup, RemoveGroupsWhere) and all direct Organization.Sessions.Add/Remove/RemoveAll and Organization.Groups.Add/Insert/Remove call sites throughout CopilotService.cs, Codespace.cs, Organization.cs, and Providers.cs have been migrated to them. Zero remaining bare mutations outside the helpers.
#2, #3 — SnapshotGroups() now used in StartCodespaceHealthCheck guard and GetClientForGroup lookup. The old fragile try { } catch (InvalidOperationException) pattern in RunCodespaceHealthCheckAsync is gone, replaced with SnapshotGroups().
#4 — SaveOrganizationCore now snapshots under lock (ToList() on both collections), then serializes outside the lock. Lock hold time is now microseconds.
#5 — OrganizationThreadSafetyTests.cs added with 10 tests: snapshot isolation, all 6 mutation helpers, and two concurrent read+write stress tests (ConcurrentSnapshotAndMutation_DoesNotThrow, ConcurrentGroupSnapshotAndMutation_DoesNotThrow).
#6 — var restoreGroups = SnapshotGroups() taken once at the top of the restore block in RestorePreviousSessionsAsync. All 4 former bare reads (lines ~298, 304, 347, 361 in Persistence.cs) now use this snapshot.
New Findings
None. Spot-checked the remaining bare Organization.Groups/Sessions reads in the codebase:
RetryCodespaceConnectionAsync(line 601) andStartAndReconnectCodespaceAsync(lines 824, 852) — both are UI-button handlers that run on the Blazor renderer's SynchronizationContext; bare reads are safe there.if (!Organization.Sessions.Any(...)) AddSessionMeta(...)patterns throughoutCreateSessionAsync— theAnycheck andAddSessionMetacall are on the same UI-thread execution unit; no interleaving possible.- Background task in
AddStoppedCodespaceGroup— correctly usesSnapshotGroups()for its background read andInvokeOnUI(() => Organization.Groups.FirstOrDefault(...))for its UI-thread mutation.
Test Coverage
10 new tests cover snapshot isolation and all mutation helpers. The two stress tests exercise actual concurrent access and verify no exceptions. The only gap (pre-existing, not introduced here) is that the tests don't cover SaveOrganizationCore snapshot correctness under contention, but the lock+snapshot pattern is simple enough that the unit tests of the helpers are sufficient.
Verdict
✅ Approve — All 6 previous findings fully resolved. Lock design is now sound (both reads and writes protected). No new issues found. The PR is still in Draft; mark ready when you want to merge.
…nd threads (PureWeen#328) ## Problem `Organization.Sessions` and `Organization.Groups` are plain `List<T>` collections designed for UI-thread-only access. However, several background threads read them without synchronization: - `SaveActiveSessionsToDisk` — timer callback reads Sessions to snapshot GroupId - `RunCodespaceHealthCheckAsync` — background task enumerates Groups - `NotifyCodespaceConnected` — health check reads Sessions to find group's first session - `ResumeCodespaceSessionsAsync` — health check reads Sessions for group members - `SaveOrganizationCore` — timer callback serializes entire Organization (races with UI mutations) This causes intermittent `InvalidOperationException` ("Collection was modified during enumeration") and potential data corruption. ## Fix - Add `_organizationLock` + `SnapshotSessionMetas()`/`SnapshotGroups()` helpers that produce thread-safe copies for off-thread reads - Replace direct `Organization.Sessions` reads in the 5 off-thread sites with snapshot calls - Wrap `SaveOrganizationCore` serialization with the lock to prevent tearing - Replace the fragile `try/catch(InvalidOperationException)` pattern in health checks with proper `SnapshotGroups()` ## Scope 4 files, 38 additions, 11 deletions. Minimal surface area — only touches the off-thread access sites, not the 70+ UI-thread accesses. Fixes PureWeen#327 --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Problem
Organization.SessionsandOrganization.Groupsare plainList<T>collections designed for UI-thread-only access. However, several background threads read them without synchronization:SaveActiveSessionsToDisk— timer callback reads Sessions to snapshot GroupIdRunCodespaceHealthCheckAsync— background task enumerates GroupsNotifyCodespaceConnected— health check reads Sessions to find group's first sessionResumeCodespaceSessionsAsync— health check reads Sessions for group membersSaveOrganizationCore— timer callback serializes entire Organization (races with UI mutations)This causes intermittent
InvalidOperationException("Collection was modified during enumeration") and potential data corruption.Fix
_organizationLock+SnapshotSessionMetas()/SnapshotGroups()helpers that produce thread-safe copies for off-thread readsOrganization.Sessionsreads in the 5 off-thread sites with snapshot callsSaveOrganizationCoreserialization with the lock to prevent tearingtry/catch(InvalidOperationException)pattern in health checks with properSnapshotGroups()Scope
4 files, 38 additions, 11 deletions. Minimal surface area — only touches the off-thread access sites, not the 70+ UI-thread accesses.
Fixes #327