fix(evidence): restore --cncf-submission behavioral evidence collection #321

Closed
yuanchen8911 wants to merge 1 commit into main from fix/restore-cncf-submission
Conversation

@yuanchen8911
Contributor

Summary

Restore the --cncf-submission behavioral evidence collection feature that was inadvertently removed by PR #290 (container-per-validator execution engine), plus fix several pre-existing bugs in the evidence collection script.

Motivation / Context

PR #290 refactored the validation engine and deleted the --cncf-submission flag, the --feature flag, the runCNCFSubmission() function, and pkg/evidence/collector.go, all of which were originally added in PR #214. The docs still referenced these flags, but the implementation was gone.

Fixes: N/A
Related: #290, #214

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: pkg/evidence

Implementation Notes

Restored files (taken from the latest state before PR #290, not from the original PR #214):

  • pkg/evidence/collector.go — behavioral evidence collector (shell script orchestrator)
  • pkg/evidence/collector_test.go — unit tests for feature resolution, script sections, collector options
  • pkg/evidence/scripts/collect-evidence.sh — 1237-line evidence collection script

Bug fixes in the script:

| Fix | Issue | Solution |
| --- | --- | --- |
| DCGM metrics empty | kubectl run curl pod raced against DNS; DCGM container too minimal for kubectl exec | Port-forward to DCGM service with retry loop |
| DCGM result false FAIL | Stale dcgm_pod variable reference after refactor to dcgm_svc | Fixed variable name |
| ASG details empty | Custom ASGs lack eks:nodegroup-name tag; multi-line None broke string comparison | Strip whitespace + instance ID fallback via describe-auto-scaling-instances |
| ELB hostname exposed | Public endpoint in evidence docs | Post-processing sed redaction |
| NO_CLEANUP broken | Skipped both pre-run and post-run cleanup | cleanup_ns takes pre/post phase; pre-run always cleans stale resources |
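
The fixes above live in the shell script, but the logic behind three of them is easy to show in isolation. The Go sketch below is illustrative only and is not taken from the PR: `isEmptyLookup`, `redactELB`, `shouldCleanup`, the regular expression, and the placeholder string are all assumptions. It shows whitespace-insensitive handling of a multi-line `None` result, hostname redaction as a post-processing step, and a cleanup decision that depends on the phase.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// isEmptyLookup reports whether a CLI text result carries no real value.
// Splitting on whitespace means "None", "None\nNone\n", and "" are all
// treated the same, which a plain string comparison would not do.
func isEmptyLookup(out string) bool {
	for _, field := range strings.Fields(out) {
		if field != "None" {
			return false
		}
	}
	return true
}

// elbHost is an assumed pattern covering hostnames such as
// name.region.elb.amazonaws.com and name.elb.region.amazonaws.com.
var elbHost = regexp.MustCompile(`\b[A-Za-z0-9][A-Za-z0-9.-]*\.elb\.(?:[a-z0-9-]+\.)?amazonaws\.com\b`)

// redactELB replaces anything that looks like an ELB hostname.
func redactELB(s string) string {
	return elbHost.ReplaceAllString(s, "<redacted-elb-hostname>")
}

type phase int

const (
	preRun phase = iota
	postRun
)

// shouldCleanup mirrors the described behavior: stale resources are always
// removed before a run, while post-run cleanup honors the no-cleanup flag.
func shouldCleanup(p phase, noCleanup bool) bool {
	if p == preRun {
		return true
	}
	return !noCleanup
}

func main() {
	fmt.Println(isEmptyLookup("None\nNone\n"))
	fmt.Println(redactELB("EXTERNAL-IP: a1b2-123.us-west-2.elb.amazonaws.com"))
	fmt.Println(shouldCleanup(preRun, true), shouldCleanup(postRun, true))
}
```

Keeping each rule as a small pure function makes the intended behavior testable without a cluster, which is harder to do for the same logic embedded in a long shell script.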

CLI additions:

  • --cncf-submission flag triggers behavioral evidence collection (bypasses normal validation)
  • --feature/-f flag for selective feature collection
  • --kubeconfig propagated to evidence script via KUBECONFIG env var
  • Flag validation: --cncf-submission requires --evidence-dir, --feature requires --cncf-submission
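
The flag dependencies in the last bullet can be sketched as a small validation function. This is a minimal illustration, not the code in pkg/cli: the `options` struct, its field names, the `validate` function, and the error messages are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// options holds the parsed values of the three related flags.
// The struct and its field names are assumptions for this sketch.
type options struct {
	cncfSubmission bool     // --cncf-submission
	evidenceDir    string   // --evidence-dir
	features       []string // --feature / -f (repeatable)
}

// validate enforces the two dependencies described in the PR:
// --feature requires --cncf-submission, and --cncf-submission
// requires --evidence-dir.
func validate(o options) error {
	if len(o.features) > 0 && !o.cncfSubmission {
		return errors.New("--feature requires --cncf-submission")
	}
	if o.cncfSubmission && o.evidenceDir == "" {
		return errors.New("--cncf-submission requires --evidence-dir")
	}
	return nil
}

func main() {
	// "feature-a" is a placeholder feature name.
	cases := []options{
		{cncfSubmission: true, evidenceDir: "./evidence", features: []string{"feature-a"}},
		{cncfSubmission: true},
		{features: []string{"feature-a"}},
	}
	for _, c := range cases {
		fmt.Printf("%+v -> %v\n", c, validate(c))
	}
}
```

Checking the dependencies before any cluster access lets the command fail fast with a clear message, and the function is simple to cover with table-driven regression tests.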

Testing

go test ./pkg/evidence/ ./pkg/cli/ -race -count=1
# ok  github.com/NVIDIA/aicr/pkg/evidence  1.628s
# ok  github.com/NVIDIA/aicr/pkg/cli       1.902s

Verified end-to-end on EKS cluster with H100 GPUs — all 8 evidence features collected successfully with all bug fixes confirmed.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: N/A — restores previously existing functionality

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

PR #290 (container-per-validator execution engine) inadvertently removed
the --cncf-submission behavioral evidence collection added in PR #214
during the validation refactor. This restores it on top of the new engine.

Restored:
- pkg/evidence/collector.go — behavioral evidence collector
- pkg/evidence/collector_test.go — unit tests
- pkg/evidence/scripts/collect-evidence.sh — evidence collection script

Bug fixes in the script:
- DCGM metrics: port-forward with retry loop instead of flaky kubectl run
- DCGM result: fixed stale variable reference causing false FAIL verdict
- ASG lookup: instance ID fallback when EKS nodegroup tags are absent
- ELB redaction: auto-redact public ELB hostnames from evidence output
- NO_CLEANUP: pre-run cleanup always runs, post-run respects the flag

CLI additions:
- --cncf-submission flag to trigger behavioral evidence collection
- --feature/-f flag for selective feature collection
- --kubeconfig propagated to evidence script via KUBECONFIG env
- Flag validation tests for regression prevention

Signed-off-by: yuanchen97@gmail.com
@yuanchen8911 yuanchen8911 requested a review from a team as a code owner March 10, 2026 01:39
@yuanchen8911 yuanchen8911 deleted the fix/restore-cncf-submission branch March 10, 2026 01:41
@github-actions

Coverage Report ✅

| Metric | Value |
| --- | --- |
| Coverage | 73.3% |
| Threshold | 70% |
| Status | Pass |
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-73.3%25-green)

No Go source files changed in this PR.
