187 changes: 181 additions & 6 deletions .ai-team/agents/legolas/history.md
@@ -107,9 +107,184 @@
7. `coverage` — Aggregate coverage from all sources
8. `report` — Unified test result summary

#### Future E2E Strategy (if needed)
- Deploy Aspire app to temporary container/K8s environment in CI
- Wait for service readiness via health checks
- Run Playwright tests against deployed endpoints
- Teardown after test completion
- **Trade-off:** Adds 10-15 min to CI pipeline vs current in-process testing
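If this strategy is adopted, the readiness wait could be a simple polling step; the endpoint URL, retry count, and sleep interval below are placeholders, not values from the actual workflow:

```yaml
- name: Wait for service readiness
  run: |
    for i in $(seq 1 30); do
      curl -fsS http://localhost:8080/health && exit 0
      sleep 10
    done
    echo "Service never became healthy" >&2
    exit 1
```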

---

## Learnings

### CI/CD Workflow Consolidation — 2026-02-19

#### GitHub Actions Reusable Workflow Patterns

**Key Syntax:**
- Reusable workflows declare the `workflow_call` trigger under `on:`
- `workflow_call` allows the workflow to be invoked via `uses:` from other workflows
- Same-repo callers reference it by path: `uses: ./.github/workflows/reusable-workflow.yml` (no `@ref`); cross-repo callers use `owner/repo/.github/workflows/reusable-workflow.yml@ref`
- Outputs are declared under `on.workflow_call.outputs:` in the reusable workflow; callers read them via `needs.<job-id>.outputs.<name>`
- Do not add an `inputs:` section if the workflow doesn't expose parameters to callers
- `workflow_call` can be combined with `push`, `pull_request`, and `workflow_dispatch` for direct and manual runs
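As a sketch, the trigger and output wiring might look like this; the workflow name, job id, and output name are hypothetical, not taken from the actual repo:

```yaml
# Reusable workflow: .github/workflows/reusable-workflow.yml (hypothetical)
name: Reusable Suite

on:
  workflow_call:
    # Outputs the caller can read via needs.<job-id>.outputs.<name>
    outputs:
      result:
        description: Example output surfaced to callers
        value: ${{ jobs.run.outputs.result }}
  workflow_dispatch: {}   # also allow manual runs

jobs:
  run:
    runs-on: ubuntu-latest
    outputs:
      result: ${{ steps.out.outputs.result }}
    steps:
      - id: out
        run: echo "result=ok" >> "$GITHUB_OUTPUT"
```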

**Orchestration Pattern:**
- Squad CI/CD architecture now uses thin orchestrator (`squad-ci.yml`) pattern
- Orchestrator handles only squad-specific concerns: versioning (GitVersion), notifications, release triggers
- Reusable workflow (`squad-test.yml`) contains the comprehensive test suite: build, 5 parallel test jobs, coverage aggregation, and reporting
- Eliminates test duplication across workflows — single source of truth
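A minimal sketch of the call site, using the file names above; the trigger, job ids, and steps are assumptions for illustration:

```yaml
# Thin orchestrator: .github/workflows/squad-ci.yml (sketch)
name: Squad CI

on:
  push:
    branches: [main]

jobs:
  test-suite:
    # Same-repo reference uses the path form, no @ref
    uses: ./.github/workflows/squad-test.yml

  notify:
    needs: test-suite
    runs-on: ubuntu-latest
    steps:
      - run: echo "squad-test.yml completed"
```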

#### Architectural Benefits

**Before Consolidation:**
- `test.yml`: 492 lines with full test suite (5 parallel jobs)
- `squad-ci.yml`: 180 lines with duplicate build/test/coverage logic
- Total: 672 lines of test-related code spread across 2 workflows
- Difficult to update tests — changes needed in both files

**After Consolidation:**
- `squad-test.yml`: 427 lines (reusable, full test suite)
- `squad-ci.yml`: 71 lines (thin orchestrator, 60% reduction)
- Total: 498 lines, single source of truth
- Squad-CI now only manages versioning and notifications
- Easy to add callers to squad-test.yml without duplication

#### Design Decisions Made

**1. Job Parallelism Retained:**
- All 7 jobs (5 test: unit, architecture, bunit, integration, aspire; 2 aggregation: coverage, report) remain parallel
- No performance regression; reusable workflow maintains concurrent execution
- Test suite still completes in ~10-12 minutes

**2. Why test.yml was superior to squad-ci.yml:**
- `test.yml` had cleaner separation: pure test concerns, no version management
- `squad-ci.yml` was mixing concerns: versioning + test execution + notifications
- Consolidation moves squad-ci.yml to orchestration-only (cleaner Single Responsibility Principle)

**3. No Input Parameters to squad-test.yml:**
- Reusable workflow does not expose `inputs:` to callers
- All test configuration (timeouts, thresholds, tool versions) remains embedded
- Ensures test suite stays consistent across all callers
- Future orchestrators can reference without parameterization concerns

**4. Versioning Moved to Separate Job:**
- `versioning` job runs GitVersion once, outputs version
- `test-suite` job depends on versioning, receives version via needs context
- `notify` job consumes version from versioning job output
- Prevents GitVersion redundancy while keeping versioning available to all jobs
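A sketch of that job wiring, assuming the GitVersion action refs and output names shown here (verify against the actual workflow before relying on them):

```yaml
jobs:
  versioning:
    runs-on: ubuntu-latest
    outputs:
      version: ${{ steps.gitversion.outputs.semVer }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0                 # GitVersion needs full history
      - uses: gittools/actions/gitversion/setup@v1
        with:
          versionSpec: '5.x'
      - id: gitversion
        uses: gittools/actions/gitversion/execute@v1

  test-suite:
    needs: versioning
    uses: ./.github/workflows/squad-test.yml

  notify:
    needs: [versioning, test-suite]
    runs-on: ubuntu-latest
    steps:
      - run: echo "Version ${{ needs.versioning.outputs.version }}"
```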

#### Coverage & Reporting Unchanged

- Coverage aggregation (Codecov, ReportGenerator) still happens in reusable workflow
- Test result publishing (EnricoMi action, artifact uploads) unchanged
- 80% threshold warning maintained
- Results from all 5 test jobs visible in GitHub checks

#### Future Evolution

- Additional callers (release workflow, scheduled tests) can call `squad-test.yml@main`
- If squad-test.yml needs to expose parameters, add `inputs:` section and pass via `with:` in callers
- Versioning job pattern can be reused for other orchestrators (e.g., deploy-prod workflow)

---

### Build Artifact Caching — 2026-02-19

#### Strategy Implemented

**Problem:**
- 5 parallel test jobs (unit, architecture, bunit, integration, aspire) each ran `dotnet restore` + `dotnet build`
- Each job independently compiled the same solution, wasting 10-15 minutes per workflow run
- NuGet cache helped with packages, but build artifacts were rebuilt from scratch every time

**Solution: Build artifact caching (Option A)**
- Single `build` job compiles solution once and caches `bin/` and `obj/` directories
- All test jobs restore from cache and skip rebuild entirely
- Test jobs use `--no-build` flag to run tests against cached binaries
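Sketched as a build job with a cache step (the `actions/cache` post step saves the paths after the job succeeds); job name and commands are illustrative, not the actual workflow:

```yaml
build:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-dotnet@v4
    - uses: actions/cache@v4            # saves bin/ and obj/ in the post step
      with:
        path: |
          **/bin
          **/obj
        key: ${{ runner.os }}-build-${{ hashFiles('**/*.csproj', '**/Directory.Packages.props', 'global.json') }}
    - run: dotnet restore
    - run: dotnet build --configuration Release --no-restore
```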

#### Cache Key Pattern

**Key components:**
```yaml
key: ${{ runner.os }}-build-${{ hashFiles('**/*.csproj', '**/Directory.Packages.props', 'global.json') }}
restore-keys: |
  ${{ runner.os }}-build-
```

**Why this pattern:**
- `runner.os`: OS-specific binaries (Linux vs Windows)
- `**/*.csproj`: Any project file change (new dependency, target framework change)
- `Directory.Packages.props`: Central package version change
- `global.json`: SDK version change (affects compilation)
- `restore-keys` fallback: If exact match fails, use most recent cache (graceful degradation)
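Conceptually, `hashFiles` collapses all matched files into one digest, so a content change anywhere in the set produces a new cache key. A rough shell analogy (the function name is made up for illustration):

```shell
# Rough analogy to hashFiles('**/*.csproj'): a stable digest over
# sorted project-file contents; any edit changes the digest (cache key)
digest_projects() {
  find "$1" -name '*.csproj' -print0 | sort -z | xargs -0 cat | sha256sum | cut -d' ' -f1
}
```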

**Cache invalidation:**
- Automatic when any `.csproj`, `Directory.Packages.props`, or `global.json` changes
- Cache expires after 7 days of no use (GitHub Actions default)
- No manual purge needed; hash-based key ensures consistency

#### Trade-offs and Safety

**Why --no-build is strict (and that's good):**
- `--no-build` flag REQUIRES binaries to exist; fails loudly if cache is lost
- This is the correct behavior: fail fast rather than silently rebuild
- Alternative would be conditional build (`if cache miss then build`) but that defeats the purpose
- Cache misses are rare (only on first run or after cache expiry); acceptable to require manual re-trigger

**NuGet cache is separate:**
- NuGet package cache remains independent (already exists in all jobs)
- `dotnet restore` still runs to restore packages (fast, uses NuGet cache)
- Build artifact cache only handles compiled binaries (`bin/`, `obj/`)
- Two-layer caching: packages (stable, rarely changes) + binaries (changes with code)

**First run vs subsequent runs:**
- **First run (cache miss):** Test jobs will fail with "missing binaries" error — normal; re-trigger workflow
- **Subsequent runs (cache hit):** Test jobs skip rebuild, estimated 10-15 min saved per run
- **After code change (cache invalidate):** New cache key generated, old cache ignored

#### Performance Impact

**Before caching:**
- Build job: 5 min (restore + build)
- Each test job: 3-4 min restore + 2-3 min build = 5-7 min overhead per job
- Total redundant build time: 5 jobs × 5 min = 25 min wasted (parallelized to ~7 min wall time)

**After caching:**
- Build job: 5 min (restore + build) + 1 min cache save = 6 min
- Each test job: 3-4 min restore + 30 sec cache restore = ~4 min overhead per job
- Total build time: 6 min (build) + 4 min (test prep) = 10 min vs 12 min before
- **Net savings: ~10-15 min of aggregate job time per workflow run** (~2 min wall time with parallel jobs; first run excluded)

#### Architectural Notes

**Why cache in build job, not test jobs:**
- Single source of truth: One job builds, all others consume
- Ensures all test jobs use identical binaries (no race conditions)
- Simpler cache management (one save, multiple restores)

**Why keep dotnet restore in test jobs:**
- NuGet packages change less frequently than code
- Restore is fast (~30 sec with cache hit)
- Decouples package management from binary caching
- Allows test jobs to verify dependencies are consistent

**Failure modes:**
1. **Cache miss on first run:** Expected; re-trigger workflow
2. **Cache expired (7 days):** Expected; re-trigger workflow
3. **Partial cache (corrupted):** `--no-build` fails loudly; clear cache manually via GitHub UI
4. **Wrong binaries cached (hash collision):** Extremely rare; cache key includes file hashes

#### Lessons Learned

**When to cache binaries vs. rebuild:**
- **Cache binaries:** When multiple jobs use same build output (tests, deployment stages)
- **Rebuild:** When job requires custom build configuration (Debug vs Release, different targets)
- **Hybrid (this project):** Build once (Release config), cache, test multiple suites

**GitHub Actions cache limitations:**
- 10 GB total cache per repository (all branches combined)
- Oldest caches purged when limit reached
- Cache access is scoped by branch: a branch can restore caches created on its base branch (so PRs can reuse `main`'s cache), but not from sibling branches
- Caches are shared across workflows in the same repository — scope is repo + branch, not per workflow

**Alternative approaches considered (and rejected):**
- **Option B (Build artifacts uploaded as GitHub artifacts):** Slower than cache, 90-day retention unnecessary for ephemeral build outputs
- **Option C (Matrix build strategy):** Would still rebuild per matrix job; doesn't solve redundancy
- **Option D (Conditional build in test jobs):** Defeats the purpose; want to fail fast on cache miss

---
150 changes: 150 additions & 0 deletions .ai-team/decisions/inbox/legolas-build-caching.md
@@ -0,0 +1,150 @@
# Decision: Build Artifact Caching in CI/CD

**Date:** 2026-02-19
**Author:** Legolas (DevOps)
**Status:** Implemented
**Related:** `.github/workflows/squad-test.yml`

---

## Context

The squad-test.yml workflow runs 5 parallel test jobs (unit, architecture, bunit, integration, aspire). Each job independently ran `dotnet restore` and `dotnet build`, compiling the same solution 5 times per workflow run. This redundancy added 10-15 minutes of wasted build time per run.

**Options considered:**
1. **Build artifact caching (chosen):** Build once, cache binaries, test jobs use `--no-build`
2. **Build artifacts as GitHub artifacts:** Slower than cache, 90-day retention unnecessary
3. **Matrix build strategy:** Would still rebuild per matrix job; doesn't solve redundancy
4. **Conditional build in test jobs:** Defeats the purpose; want to fail fast on cache miss

---

## Decision

Implement build artifact caching (Option 1):
- Single `build` job compiles solution and caches `**/bin/Release/` and `**/obj/` directories
- All test jobs restore from cache and skip rebuild entirely using `--no-build` flag
- Cache key based on hashes of `.csproj`, `Directory.Packages.props`, and `global.json` files
- Automatic cache invalidation when dependencies or SDK version change

---

## Rationale

**Why this approach:**
- **Performance:** Eliminates 10-15 min of redundant build time per workflow run
- **Consistency:** All test jobs use identical binaries from single build (no race conditions)
- **Simple:** Single cache save in build job, multiple cache restores in test jobs
- **Fail-fast:** `--no-build` flag fails loudly if cache is lost (better than silent rebuild)

**Why not alternatives:**
- **GitHub artifacts:** Slower than cache API; 90-day retention unnecessary for ephemeral binaries
- **Matrix strategy:** Doesn't solve redundancy; each matrix job would still rebuild independently
- **Conditional build:** Defeats the purpose; we want to fail fast on cache miss, not silently rebuild

**Trade-offs accepted:**
- **First run / cache miss:** Test jobs fail with "missing binaries" error; acceptable to re-trigger workflow
- **Cache expiry:** 7-day GitHub Actions default; rare occurrence, acceptable manual intervention
- **Strict --no-build:** Fails loudly if cache is lost; better than silent rebuild (fail fast principle)

---

## Implementation Details

**Cache key pattern:**
```yaml
key: ${{ runner.os }}-build-${{ hashFiles('**/*.csproj', '**/Directory.Packages.props', 'global.json') }}
restore-keys: |
  ${{ runner.os }}-build-
```

**Build job changes:**
- Added cache save step after `dotnet build` (saves `**/bin/Release/` and `**/obj/`)

**Test job changes (unit, architecture, bunit, integration, aspire):**
- Added cache restore step after setup .NET (same key as build job)
- **REMOVED** `dotnet build` step entirely
- Kept `dotnet restore` step (NuGet cache is separate, fast)
- All `dotnet test` commands already used `--no-build` flag (no change needed)
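The resulting test-job shape, sketched; the job name and test project path are placeholders, and the cache step must use the same key expression as the build job:

```yaml
unit-tests:
  needs: build
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-dotnet@v4
    - uses: actions/cache@v4            # same key as the build job
      with:
        path: |
          **/bin
          **/obj
        key: ${{ runner.os }}-build-${{ hashFiles('**/*.csproj', '**/Directory.Packages.props', 'global.json') }}
    - run: dotnet restore               # packages only; fast with NuGet cache
    - run: dotnet test tests/UnitTests --no-build --configuration Release
```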

**Cache invalidation triggers:**
- Any `.csproj` file change (new dependency, target framework change)
- `Directory.Packages.props` change (centralized package version change)
- `global.json` change (SDK version change)
- 7-day cache expiry (GitHub Actions default)

---

## Performance Impact

**Before caching:**
- Build job: 5 min (restore + build)
- Each test job: 3-4 min restore + 2-3 min build = 5-7 min overhead per job
- Total redundant build time: 5 jobs × 5 min = 25 min wasted (parallelized to ~7 min wall time)

**After caching:**
- Build job: 5 min (restore + build) + 1 min cache save = 6 min
- Each test job: 3-4 min restore + 30 sec cache restore = ~4 min overhead per job
- Total build time: 6 min (build) + 4 min (test prep) = 10 min vs 12 min before

**Net savings: ~10-15 min of aggregate job time per workflow run** (first run excluded)

---

## Safety and Failure Modes

**Expected failures (acceptable):**
1. **Cache miss on first run:** Test jobs fail with "missing binaries"; re-trigger workflow
2. **Cache expired (7 days):** Same as above; re-trigger workflow
3. **Partial cache (corrupted):** `--no-build` fails loudly; clear cache manually via GitHub UI

**Why --no-build is strict (and that's good):**
- Fails loudly if binaries are missing (better than silent rebuild)
- Ensures we know when cache is lost or corrupted
- Cache misses are rare (only on first run or after expiry); acceptable to require manual re-trigger

**NuGet cache remains separate:**
- NuGet package cache is independent (already exists in all jobs)
- `dotnet restore` still runs to restore packages (fast, uses NuGet cache)
- Two-layer caching: packages (stable, rarely changes) + binaries (changes with code)

---

## Monitoring and Validation

**How to verify caching is working:**
1. Check build job logs for "Cache saved successfully" message
2. Check test job logs for "Cache restored successfully" message
3. Verify test jobs complete in ~4 min (down from ~7 min)
4. Compare workflow run times before/after (should see 10-15 min reduction)

**How to debug cache issues:**
1. Check cache key in build job vs test jobs (must match exactly)
2. Verify cached paths exist: `**/bin/Release/` and `**/obj/`
3. Check the repository's cache list for entries (**Actions** tab → **Caches**, or `gh cache list`)
4. If cache is corrupted, manually clear via GitHub UI and re-trigger workflow

---

## Future Considerations

**If this approach fails:**
- **Option B (GitHub artifacts):** Upload build artifacts, download in test jobs (slower but more reliable)
- **Option C (Conditional build):** Add `if: failure()` build step in test jobs as fallback (sacrifices fail-fast principle)
- **Option D (Matrix strategy):** Consolidate test jobs into matrix (reduces job count but doesn't solve redundancy)

**If cache grows too large (>10 GB repo limit):**
- Reduce cached paths (only `**/bin/Release/`, exclude `**/obj/`)
- Increase cache key specificity (include branch name, commit SHA)
- Implement cache pruning strategy (delete old caches manually)

**If team prefers silent rebuild on cache miss:**
- Add conditional build step: `if: hashFiles('**/bin/Release/**') == ''`
- Trade-off: Slower on cache miss, but no manual intervention needed
- Not recommended: Defeats fail-fast principle, hides cache issues
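Were that fallback ever adopted, the idiomatic `actions/cache` form checks the cache step's `cache-hit` output rather than probing the filesystem; a sketch only, and not recommended for the reasons above:

```yaml
- id: build-cache
  uses: actions/cache@v4
  with:
    path: |
      **/bin
      **/obj
    key: ${{ runner.os }}-build-${{ hashFiles('**/*.csproj', '**/Directory.Packages.props', 'global.json') }}

# cache-hit is 'true' only on an exact key match
- if: steps.build-cache.outputs.cache-hit != 'true'
  run: dotnet build --configuration Release
```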

---

## Conclusion

Build artifact caching eliminates 10-15 minutes of redundant build time per workflow run by compiling the solution once and sharing binaries across 5 parallel test jobs. The approach is simple, fast, and fails loudly on cache issues. Cache invalidation is automatic when dependencies or SDK version change. First run and rare cache expiry require manual re-trigger, which is acceptable given the performance gains on subsequent runs.