fix: prevent duplicate groups via race condition in GetOrCreateRepoGroup #638

Merged
PureWeen merged 2 commits into main from
fix/group-creation-race-condition
Apr 20, 2026

Conversation

@PureWeen
Owner

Problem

After the worktree fix (PR #527) and the previous session's cleanup script, duplicate "maui" groups kept reappearing on every app restart — 4 groups for the same repo.

Root Cause

GetOrCreateRepoGroup() and GetOrCreateLocalFolderGroup() had a check-then-create race condition. The pattern was:

  1. Check if group exists (no lock)
  2. If not, create new group

During startup, multiple threads call these concurrently:

  • Session restore runs on ThreadPool via Task.Run
  • ReconcileOrganization() calls GetOrCreateRepoGroup for each session and tracked repo

Both threads see "no existing group" and both create one → duplicates.

Fix

Wrap the entire check-then-create body in lock(_organizationLock). C# Monitor is reentrant, so the nested AddGroup() call (which also takes the lock) is safe.
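
As a rough illustration of the fix described above, the following is a minimal sketch and not the PolyPilot code. `Group`, `GroupRegistry`, `GetOrCreate`, and `CountAfterConcurrentCalls` are hypothetical names invented for this example; only the lock name `_organizationLock` and the nested `AddGroup()` call mirror the description.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for a session group; not the real PolyPilot type.
public sealed record Group(string Id, string RepoId);

public sealed class GroupRegistry
{
    private readonly object _organizationLock = new();
    private readonly List<Group> _groups = new();

    public int Count { get { lock (_organizationLock) return _groups.Count; } }

    // Takes the lock itself, like the AddGroup() mentioned in the description.
    private void AddGroup(Group group)
    {
        lock (_organizationLock) { _groups.Add(group); }
    }

    public Group GetOrCreate(string repoId)
    {
        // The check and the create happen under one lock, so no other
        // thread can slip in between "not found" and "add".
        lock (_organizationLock)
        {
            var existing = _groups.FirstOrDefault(g => g.RepoId == repoId);
            if (existing != null) return existing;

            var created = new Group(Guid.NewGuid().ToString("N"), repoId);
            AddGroup(created); // re-enters the same Monitor on the same thread
            return created;
        }
    }

    // Helper for the illustration: run many concurrent callers and
    // report how many groups exist afterwards.
    public static int CountAfterConcurrentCalls(int callers)
    {
        var registry = new GroupRegistry();
        var tasks = Enumerable.Range(0, callers)
            .Select(_ => Task.Run(() => registry.GetOrCreate("repo-1")))
            .ToArray();
        Task.WaitAll(tasks);
        return registry.Count;
    }
}
```

With the lock in place, `GroupRegistry.CountAfterConcurrentCalls(20)` should yield a single group. Removing the outer `lock` in `GetOrCreate` reopens the window in which two callers can both observe "no existing group".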

Testing

  • Build: ✅ 0 errors
  • All 19 SessionOrganizationTests pass
  • Data fix: also merged the 4 duplicate groups and removed stale dotnet-maui-local-66cccd41 repo entry

GetOrCreateRepoGroup and GetOrCreateLocalFolderGroup had a check-then-create
pattern without holding _organizationLock. During app startup, concurrent callers
(session restore on ThreadPool + ReconcileOrganization) could both see 'no existing
group' and create duplicates. This caused the '4 maui groups' bug where manual
cleanup was undone by every restart.

Fix: wrap the entire check-then-create in lock(_organizationLock). Monitor is
reentrant so nested AddGroup() calls are safe.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Contributor

Design-level concerns (outside diff)

🟡 MODERATE — PromoteOrCreateLocalFolderGroup has the same TOCTOU race, unprotected (3/3 reviewers)

File: PolyPilot/Services/CopilotService.Organization.cs, lines 1536–1588

The PR locks GetOrCreateLocalFolderGroup (the leaf), but its wrapper PromoteOrCreateLocalFolderGroup performs its own check-then-act entirely outside _organizationLock:

  1. Line 1542: reads Organization.Groups to find alreadyLocal (no lock)
  2. Line 1562: reads Organization.Groups again to find a URL-based candidate (no lock)
  3. Line 1568: mutates candidate.LocalPath (no lock)

Concrete failing scenario: Two concurrent callers (e.g., session restore + bridge sync via CopilotService.Bridge.cs:901) with the same repoId but potentially different localPath can race. Both find alreadyLocal == null, both find the same URL-based candidate, and both set candidate.LocalPath to different values. Last writer wins non-deterministically — the on-disk group ends up with a non-deterministic LocalPath.

Fix: Wrap the entire body of PromoteOrCreateLocalFolderGroup (after path normalization) in lock(_organizationLock). The nested GetOrCreateLocalFolderGroup call at line 1587 is safe due to Monitor reentrancy — same reasoning the PR already uses for AddGroup().


🟢 MINOR — No concurrency test for the race condition fix (3/3 reviewers)

All existing GetOrCreateRepoGroup and GetOrCreateLocalFolderGroup tests in SessionOrganizationTests.cs are single-threaded. Without a concurrent stress test, a future refactor could silently reintroduce the race.

Suggested test:

[Fact]
public async Task GetOrCreateRepoGroup_ConcurrentCalls_CreatesExactlyOneGroup()
{
    var svc = CreateService();
    var tasks = Enumerable.Range(0, 20).Select(_ =>
        Task.Run(() => svc.GetOrCreateRepoGroup("repo-1", "MyRepo")));
    var results = await Task.WhenAll(tasks);
    Assert.Single(svc.Organization.Groups.Where(g => g.RepoId == "repo-1" && !g.IsMultiAgent));
    Assert.All(results, g => Assert.Equal(results[0]!.Id, g!.Id));
}

Generated by Expert Code Review (auto) for issue #638

Contributor

@github-actions github-actions Bot left a comment

Multi-Model Review — PR #638: Fix duplicate groups via race condition

Summary

The core fix is correct and well-targeted: wrapping check-then-create in lock(_organizationLock) prevents the duplicate group race between ReconcileOrganization (ThreadPool) and session restore. Monitor reentrancy with nested AddGroup() is sound. The fix addresses the stated bug.

However, two MODERATE issues should be resolved before merge.

Findings (ranked by severity)

| # | Severity | Consensus | Location | Issue |
|---|----------|-----------|----------|-------|
| 1 | 🟡 MODERATE | 3/3 | Lines 1474–1477, 1522–1523 (in diff) | OnStateChanged?.Invoke() and SaveOrganization() called inside _organizationLock: breaks file-wide convention (all 4 Focus methods release lock before notifying), extends lock hold time with arbitrary subscriber code, and creates a latent deadlock surface |
| 2 | 🟡 MODERATE | 3/3 | Lines 1536–1588 (outside diff) | PromoteOrCreateLocalFolderGroup has the same TOCTOU pattern, entirely unprotected: concurrent callers can race on promotion, producing non-deterministic LocalPath |
| 3 | 🟢 MINOR | 3/3 | Tests (outside diff) | No concurrent stress test validates the fix; all existing tests are single-threaded |

CI Status

CI checks not available for validation.

Recommendation

REQUEST_CHANGES — The two MODERATE findings are straightforward to address:

  1. Move SaveOrganization() + OnStateChanged?.Invoke() outside the lock (matching the pattern used everywhere else in the file)
  2. Wrap PromoteOrCreateLocalFolderGroup body in lock(_organizationLock) for consistency

Both fixes are small, low-risk, and align with the PR author's own reasoning about Monitor reentrancy.

Generated by Expert Code Review (auto) for issue #638

Comment on lines +1474 to +1477
AddGroup(group);
SaveOrganization();
OnStateChanged?.Invoke();
return group;
Contributor

🟡 MODERATE — OnStateChanged?.Invoke() and SaveOrganization() called inside _organizationLock (3/3 reviewers)

Every other method in this file that holds _organizationLock follows the pattern of releasing the lock before notifying — see AddToFocus (line 930), RemoveFromFocus (949), PromoteFocusSession (964), DemoteFocusSession (983):

lock (_organizationLock) { /* mutate */ }
if (changed) { SaveOrganization(); OnStateChanged?.Invoke(); }

This PR inverts that established convention. OnStateChanged?.Invoke() fires synchronously on the calling thread while the lock is held, invoking all subscribers including SessionSidebar.RefreshSessions and Dashboard.RefreshState.

While there is no deadlock today (Monitor is reentrant on the same thread), this creates a latent deadlock surface: if a future OnStateChanged subscriber synchronously acquires a second lock L2 while another thread already holds L2 and is waiting for _organizationLock, the two threads deadlock.
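
The distinction between same-thread reentrancy and cross-thread blocking can be checked with a small probe. This is an illustrative sketch only; `LockProbe` and `Probe` are hypothetical names, and nothing here is taken from the PolyPilot code. It uses `Monitor.TryEnter` with a zero timeout so that nothing blocks.

```csharp
using System;
using System.Threading;

public static class LockProbe
{
    // Returns whether the holding thread can re-enter the lock, and
    // whether a second thread can take it while it is held.
    public static (bool SameThread, bool OtherThread) Probe()
    {
        var gate = new object();
        lock (gate)
        {
            // Same thread: Monitor is reentrant, so this succeeds.
            bool same = Monitor.TryEnter(gate, 0);
            if (same) Monitor.Exit(gate);

            // Different thread: the lock is held, so a non-blocking
            // attempt fails instead of waiting.
            bool other = false;
            var worker = new Thread(() =>
            {
                if (Monitor.TryEnter(gate, 0))
                {
                    other = true;
                    Monitor.Exit(gate);
                }
            });
            worker.Start();
            worker.Join();

            return (same, other);
        }
    }
}
```

The probe reports success for the holding thread and failure for the second thread. A subscriber that runs on the calling thread is therefore safe with respect to `_organizationLock` itself, while any other thread needing that lock waits until the subscriber returns, which is where a second lock taken in the opposite order could complete a deadlock cycle.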

Fix: Keep check+create+AddGroup() inside the lock, move SaveOrganization() and OnStateChanged?.Invoke() outside:

SessionGroup? group;
lock (_organizationLock)
{
    var existing = /* ... */;
    if (existing != null) return existing;
    // ... early-return guards ...
    Organization.DeletedRepoGroupRepoIds.Remove(repoId);
    group = new SessionGroup { /* ... */ };
    AddGroup(group);
}
SaveOrganization();
OnStateChanged?.Invoke();
return group;

Same fix applies to GetOrCreateLocalFolderGroup at line 1523.

@PureWeen
Owner Author

PureWeen commented Apr 20, 2026

Multi-Model Code Review — PR #638

Summary

The core fix is correct and necessary — wrapping the check-then-create pattern in lock(_organizationLock) eliminates the race between ReconcileOrganization (ThreadPool) and session restore that caused the "4 maui groups" bug. Monitor reentrancy with nested AddGroup() is sound.

CI Status: ✅ All checks passing (first commit); pending (second commit)


Findings (Initial Review)

| # | Severity | Consensus | Location | Issue |
|---|----------|-----------|----------|-------|
| 1 | 🟡 MODERATE | 3/3 | GetOrCreateRepoGroup + GetOrCreateLocalFolderGroup | SaveOrganization() + OnStateChanged?.Invoke() called inside _organizationLock |
| 2 | 🟡 MODERATE | 2/3 | GetOrCreateLocalFolderGroup "update existing" branch | Hot path now holds lock during side effects (previously lock-free) |
| 3 | 🟡 MODERATE | (prior review, 3/3) | PromoteOrCreateLocalFolderGroup lines 1536–1588 | Same TOCTOU race, entirely unprotected |
| 4 | 🟢 MINOR | (prior review, 3/3) | Tests | No concurrent stress test validates the fix |

Re-Review After Fixes (3/3 reviewers)

| # | Finding | Status |
|---|---------|--------|
| 1 | SaveOrganization/OnStateChanged inside lock | ✅ FIXED: All three methods now defer side effects outside the lock via notify flag pattern |
| 2 | Hot-path GetOrCreateLocalFolderGroup held lock during side effects | ✅ FIXED: Mutations inside lock, notify flag gates side effects after release |
| 3 | PromoteOrCreateLocalFolderGroup TOCTOU | ✅ FIXED: Full check-and-mutate sequence now atomic under lock; fallthrough to GetOrCreateLocalFolderGroup is safe (itself lock-protected) |
| 4 | No concurrent stress tests | ✅ FIXED: 3 stress tests with 20 parallel Task.Run callers each |
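
For readers unfamiliar with the "notify flag pattern" named in the table above, here is a minimal sketch of the general shape. It assumes a simplified store keyed by repo id; `FolderGroup`, `FolderGroupStore`, `GetOrCreate`, and `SaveCount` are hypothetical names for illustration, and the real methods in CopilotService.Organization.cs may differ in detail.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical stand-in type; not the PolyPilot SessionGroup.
public sealed class FolderGroup
{
    public string RepoId = "";
    public string LocalPath = "";
}

public sealed class FolderGroupStore
{
    private readonly object _organizationLock = new();
    private readonly Dictionary<string, FolderGroup> _byRepo = new();

    public event Action? OnStateChanged;
    public int SaveCount; // counts simulated saves, for the illustration only

    private void SaveOrganization() => Interlocked.Increment(ref SaveCount);

    public FolderGroup GetOrCreate(string repoId, string localPath)
    {
        FolderGroup group;
        bool notify = false;

        // All reads and mutations of shared state happen under the lock.
        lock (_organizationLock)
        {
            if (_byRepo.TryGetValue(repoId, out var existing))
            {
                group = existing;
                if (group.LocalPath != localPath)
                {
                    group.LocalPath = localPath; // "update existing" branch
                    notify = true;
                }
            }
            else
            {
                group = new FolderGroup { RepoId = repoId, LocalPath = localPath };
                _byRepo[repoId] = group;
                notify = true;
            }
        }

        // Side effects run only after the lock is released, and only
        // when something actually changed.
        if (notify)
        {
            SaveOrganization();
            OnStateChanged?.Invoke();
        }
        return group;
    }
}
```

In this sketch a call that finds an unchanged group neither saves nor raises the event, and subscribers never run while `_organizationLock` is held.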

New Findings: None

All 3 reviewers independently confirmed no new bugs, regressions, security issues, or data-loss risks.

One reviewer noted Organization.Groups.Max() could throw on an empty collection — pre-existing issue (not introduced by this PR), and practically unreachable since LoadOrganization() guarantees the default group exists. Discarded (1/3, same finding discarded in initial review).

Test Coverage

  • All 19 existing SessionOrganizationTests pass ✅
  • 3 new concurrent stress tests pass ✅
  • All three fixed methods have concurrent coverage

Recommendation

Approve — All previous findings addressed. No new issues found. The race-condition fix is correct, follows file-wide conventions, and has concurrent test coverage.

…erGroup

Address review findings:
1. Move SaveOrganization() + OnStateChanged?.Invoke() outside _organizationLock
   in GetOrCreateRepoGroup and GetOrCreateLocalFolderGroup — matches the
   file-wide convention (lock → mutate → release → notify) and eliminates
   latent deadlock risk from event subscribers.
2. Wrap PromoteOrCreateLocalFolderGroup in _organizationLock to fix the same
   TOCTOU race (concurrent callers could race on candidate promotion).
3. Add 3 concurrent stress tests validating that each method produces exactly
   one group under 20 parallel Task.Run callers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit ac8423d into main Apr 20, 2026
@PureWeen PureWeen deleted the fix/group-creation-race-condition branch April 20, 2026 03:09