Add retry logic for gtest binaries via GTEST_MAX_RETRIES (default 1) and pytest reruns via --reruns 2 --reruns-delay 5. Tests that fail then pass on retry are classified as flaky rather than failures. Add pytest-rerunfailures as a test dependency.
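The flaky-versus-failure rule described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `classify_attempts` is a hypothetical name, and it assumes each test's attempts are available as an ordered list of pass/fail results.

```python
def classify_attempts(outcomes):
    """Classify one test from its ordered attempt results (True = passed).

    Hypothetical helper: a test that fails first and passes on a later
    attempt is reported as flaky rather than as a failure.
    """
    if not outcomes:
        raise ValueError("at least one attempt is required")
    if outcomes[0]:
        return "passed"
    if any(outcomes[1:]):
        return "flaky"
    return "failed"
```

With `--reruns 2`, a pytest test has at most three attempts, so `classify_attempts([False, False, True])` would be reported as flaky.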
Add matrix-aware nightly test report generator that parses JUnit XML, classifies failures as new/recurring/flaky/stabilized, maintains per-matrix failure history on S3, and outputs Markdown, HTML, and JSON reports. Extract S3 helpers into shared module and shell helper to eliminate duplication across test scripts.
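A rough sketch of the parse-and-classify step, assuming the history is simply the set of tests that failed in the previous run for the same matrix entry. The function names and the `classname::name` key format are illustrative assumptions, and flaky detection (which relies on retry information) is left out.

```python
import xml.etree.ElementTree as ET


def failed_tests(junit_xml: str) -> set[str]:
    """Return 'classname::name' for every testcase with a failure or error."""
    root = ET.fromstring(junit_xml)
    failed = set()
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            failed.add(f"{case.get('classname', '')}::{case.get('name', '')}")
    return failed


def classify(current_failed: set[str], previously_failed: set[str]) -> dict:
    """Compare this run's failures with the previous run's (assumed history)."""
    return {
        "new": sorted(current_failed - previously_failed),
        "recurring": sorted(current_failed & previously_failed),
        "stabilized": sorted(previously_failed - current_failed),
    }
```

Here "stabilized" is taken to mean "failed before, passes now"; the actual report generator may define it differently.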
Add nightly report generation to cpp, python, wheel-python, wheel-server, and notebook test scripts using the shared helper. Wheel and notebook scripts also gain JUnit XML output and EXITCODE trap pattern for consistent error handling.
Add aggregate_nightly.py to merge per-matrix JSON summaries into a consolidated report with matrix grid. Add Slack notifiers for both per-job and consolidated messages with HTML file upload support. Add nightly_summary.sh wrapper for the post-test aggregation job. Add static HTML dashboard with matrix overview, failure drill-down, and trend charts reading from S3 index.json.
Pass Slack webhook secret to all test jobs. Add nightly-summary job that runs after all test jobs complete, aggregates results from S3, sends a consolidated Slack notification, and uploads the dashboard. Pass S3 and Slack secrets via container-options for the custom job.
Add pitfall entries for cross-cutting change discipline: full scope audits, code duplication, CI matrix parallelism, extensibility, and actionable reporting.
Apply ruff formatting to Python files, update copyright years to 2026 in shell scripts, regenerate conda environment files and pyproject.toml from dependencies.yaml, and remove hardcoded version from comment.
custom-job.yaml does not support secret references in container-options. Remove them and make nightly_summary.sh gracefully skip when CUOPT_DATASET_S3_URI is not available.
The shared workflows only support 3 secret slots. The Slack webhook is only needed by the nightly-summary aggregation job which uses secrets: inherit.
Tests that resolve and then fail again within 14 days are recognized as bouncing rather than as new failures. After 2+ bounces a test is automatically classified as cross-run flaky. A resolved test generates only one notification; subsequent passes are silent.
CUOPT_BOUNCE_WINDOW_DAYS (default 14) and CUOPT_BOUNCE_THRESHOLD (default 2) can now be set as environment variables to tune flaky test detection without code changes.
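One way the bounce rule and its two environment variables might fit together is sketched below. Only the variable names and defaults come from the description above; the function names, the stored `bounce_count`, and treating the window boundary as inclusive are assumptions.

```python
import os
from datetime import date


def bounce_settings(env=None):
    """Read the tuning knobs, falling back to the documented defaults."""
    env = os.environ if env is None else env
    window_days = int(env.get("CUOPT_BOUNCE_WINDOW_DAYS", "14"))
    threshold = int(env.get("CUOPT_BOUNCE_THRESHOLD", "2"))
    return window_days, threshold


def classify_refailure(resolved_on, failed_on, bounce_count, window_days, threshold):
    """Classify a failure of a test that was previously marked resolved.

    bounce_count is the number of earlier bounces recorded for this test
    (an assumed history field).
    """
    if (failed_on - resolved_on).days > window_days:
        return "new"  # outside the window: treat as a fresh failure
    if bounce_count + 1 >= threshold:
        return "flaky"  # enough bounces: cross-run flaky
    return "bouncing"
```

For example, with the defaults, a test resolved on 1 January that fails again on 5 January with no prior bounces would be classified as bouncing, and as flaky on its second bounce.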
The custom-job.yaml reusable workflow does not expose secrets as env vars. Convert nightly-summary to an inline job that directly sets all required secrets (S3, Slack) in the step environment.
In CI, aws-actions/configure-aws-credentials sets role-based tokens (AWS_ACCESS_KEY_ID + AWS_SESSION_TOKEN). The CUOPT_AWS_* overrides were replacing these with static keys that lack the session token, causing InvalidToken errors. Now only fall back to CUOPT_AWS_* when standard AWS credentials are not already set.
The cuOpt S3 bucket requires CUOPT_AWS_* static credentials. The role-based session token from aws-actions/configure-aws-credentials was causing InvalidToken errors. Always override with CUOPT_AWS_* and unset AWS_SESSION_TOKEN, matching the pattern in datasets/*.sh.
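The final credential handling can be illustrated with a small environment-mapping sketch. The exact variable names `CUOPT_AWS_ACCESS_KEY_ID` and `CUOPT_AWS_SECRET_ACCESS_KEY` are assumed expansions of `CUOPT_AWS_*`, and the guard that leaves the environment untouched when the static keys are missing is an addition for safety, not something the commit states.

```python
def aws_env_for_cuopt(env):
    """Return a copy of env with static cuOpt keys replacing role credentials."""
    out = dict(env)
    key_id = env.get("CUOPT_AWS_ACCESS_KEY_ID")          # assumed name
    secret = env.get("CUOPT_AWS_SECRET_ACCESS_KEY")      # assumed name
    if key_id and secret:
        out["AWS_ACCESS_KEY_ID"] = key_id
        out["AWS_SECRET_ACCESS_KEY"] = secret
        # A leftover role session token does not match the static keys
        # and leads to InvalidToken errors, so drop it.
        out.pop("AWS_SESSION_TOKEN", None)
    return out
```

The returned mapping could be passed as `env=` to a subprocess that runs the AWS CLI.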
Move the nightly-summary job out of test.yaml into its own nightly-summary.yaml reusable workflow. Runs in a python:3.12-slim container to avoid PEP 668 externally-managed-environment errors when installing awscli. Also adds workflow_dispatch trigger so the summary can be re-run manually against an earlier test run.
The python:3.12-slim image doesn't include curl, which is needed by send_consolidated_summary.sh for Slack webhook and file upload.
- Filter consolidated.json from S3 aggregation to fix "unknown" entry
- Migrate Slack file upload from deprecated files.upload to getUploadURLExternal + completeUploadExternal
- Chunk Slack messages into header/grid/details/links to stay within block and character limits
- Remove S3 link from Slack in favor of HTML file attachment
- Add --junitxml to Pyomo, CvxPy, and PuLP thirdparty test scripts so failures appear in nightly reports
- Export RAPIDS_TESTS_DIR from test_wheel_cuopt.sh for subprocesses
…Slack
- Generate presigned S3 URLs (7-day expiry) for consolidated HTML report and dashboard, linked in Slack messages
- Query GitHub API for workflow job statuses to surface CI-level failures (notebooks, JuMP, etc.) that don't produce JUnit XML
- Show only failed/flaky matrix entries in Slack instead of listing all passing ones, with a compact summary line for green runs
- Pass GITHUB_TOKEN and GITHUB_RUN_ID to nightly-summary container
- Remove temporary test workflow file
Post the main summary (status + links) as a top-level message via chat.postMessage, then post matrix details, failure breakdowns, and the HTML report as thread replies. Keeps the channel clean while preserving full detail in the thread. Falls back to webhook (no threading) if bot token is not available.
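The threading pattern can be sketched independently of the transport. In this illustration `post` is a hypothetical callable that sends one `chat.postMessage` payload and returns the parsed response; the `channel`, `text`, and `thread_ts` fields and the `ts` value in the response are part of the Slack Web API, while everything else is assumed.

```python
def post_threaded(post, channel, summary, replies):
    """Post a top-level summary, then each reply inside its thread.

    post(payload) must return the API response as a dict containing 'ts'.
    """
    parent = post({"channel": channel, "text": summary})
    thread_ts = parent["ts"]
    for reply in replies:
        post({"channel": channel, "text": reply, "thread_ts": thread_ts})
    return thread_ts
```

Injecting `post` keeps the sketch testable without a bot token; the webhook fallback described above would simply send the same texts without `thread_ts`.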
- Filter out per-matrix test jobs (conda-cpp-tests, conda-python-tests, wheel-tests-*) from workflow job status since they are already tracked by S3 summaries. Only surface untracked jobs like notebooks and JuMP.
- Move full CI job failure list to thread reply to avoid exceeding Slack's 3000-char block limit. Main message shows compact summary.
- Chunk CI job details into multiple blocks if needed.
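Chunking to respect the 3000-character block limit might look like the sketch below. It packs whole lines into chunks and wraps each chunk in a Block Kit section; the function names are hypothetical, and truncating a single overlong line is a simplification.

```python
def chunk_lines(lines, limit=3000):
    """Pack lines into newline-joined chunks of at most `limit` characters."""
    chunks, current = [], ""
    for line in lines:
        line = line[:limit]  # simplification: truncate a single overlong line
        candidate = line if not current else current + "\n" + line
        if len(candidate) > limit:
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def to_section_blocks(lines, limit=3000):
    """Wrap each chunk in a Slack section block with mrkdwn text."""
    return [
        {"type": "section", "text": {"type": "mrkdwn", "text": chunk}}
        for chunk in chunk_lines(lines, limit)
    ]
```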
- Include ALL workflow jobs in consolidated JSON (not just untracked) with has_test_details flag to distinguish tracked vs untracked
- Thread reply 1: CI Workflow Status showing every workflow group with pass/fail counts; new workflows are automatically visible
- Thread reply 2: Failing and flaky tests grouped by workflow so users see which workflow has which test issues
- Main message alerts only on untracked CI failures (notebooks, JuMP) since tracked failures already appear in the matrix test grid
- Map CUOPT_AWS_* to standard AWS env vars before aws s3 presign so the CLI has credentials in the container
- Log presign failures instead of swallowing them silently
- Reorder thread: test failures/details first, CI workflow overview last
- Main message now lists which workflows have failures by name (e.g., "Failures in: conda-notebook-tests, wheel-tests-cuopt") with per-workflow failure counts
- Add build-summary job to build.yaml that sends a Slack message after all builds complete, showing pass/fail per build job
- Build summary queries GitHub API for job statuses, grouped by workflow prefix (cpp-build, wheel-build-cuopt, docs-build, etc.)
Skips all test jobs when summary-only=true, so nightly-summary runs immediately without waiting for GPU runners.
- Embed index.json and consolidated data directly into dashboard HTML during aggregation so it works on private S3 buckets without runtime fetches (no more 403 errors)
- Dashboard falls back to S3 fetch if embedded data is absent
- Add summary-only input to test.yaml to skip all test jobs and run only nightly-summary (avoids waiting for GPU runners when testing)
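Embedding the data at aggregation time could be done roughly as follows. The placeholder comment, the `embedded-data` element id, and the function name are all hypothetical; the sketch only shows the general technique of inlining JSON in a non-executing script tag.

```python
import json

PLACEHOLDER = "<!-- EMBEDDED_DATA -->"  # hypothetical marker in the template


def embed_data(template_html, index, consolidated):
    """Inline index and consolidated data so the page needs no S3 fetch."""
    if PLACEHOLDER not in template_html:
        raise ValueError("template has no embed placeholder")
    payload = json.dumps({"index": index, "consolidated": consolidated})
    # Escape "</" so a string such as "</script>" cannot end the tag early;
    # "\/" is a valid JSON escape, so the data still parses unchanged.
    payload = payload.replace("</", "<\\/")
    script = f'<script id="embedded-data" type="application/json">{payload}</script>'
    return template_html.replace(PLACEHOLDER, script, 1)
```

The dashboard script would then read the element's text content, parse it as JSON, and fall back to fetching from S3 only when the element is missing.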
- Test totals: only show failed/flaky counts, skip passed/skipped/total
- CI Workflow Status thread: only list failing workflows, one-line summary for passing ones
Write GitHub API response to a temp file instead of passing it as a shell argument to Python. The jobs JSON for a full build matrix exceeds the OS argument length limit.
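The temp-file handoff can be illustrated with a self-contained sketch: the large JSON document goes to disk and only its path is passed on the command line. The helper name and the inline child script are illustrative assumptions.

```python
import json
import os
import subprocess
import sys
import tempfile


def count_jobs_via_file(jobs):
    """Hand a large jobs payload to a child Python process through a file."""
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as fh:
        json.dump({"jobs": jobs}, fh)
        path = fh.name
    try:
        # The child receives a short path, not the JSON itself, so the
        # argument list stays far below the OS length limit.
        child = "import json, sys; print(len(json.load(open(sys.argv[1]))['jobs']))"
        result = subprocess.run(
            [sys.executable, "-c", child, path],
            capture_output=True, text=True, check=True,
        )
        return int(result.stdout)
    finally:
        os.unlink(path)
```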
Skips all build/publish/test/image jobs when summary-only=true, so build-summary runs immediately without waiting for runners.
Use GitHub API job counts (e.g., "1/11 failed") instead of vague "failed" or "matrix job(s)" in the per-workflow failure summary.
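A sketch of how such counts might be derived from the workflow jobs list. The `name` and `conclusion` fields are part of the GitHub workflow jobs API response; grouping by the text before the first " / " in the job name is an assumption about how the matrix jobs are named.

```python
from collections import defaultdict


def failure_counts(jobs):
    """Return {workflow: 'failed/total failed'} for workflows with failures."""
    totals = defaultdict(lambda: [0, 0])  # workflow -> [failed, total]
    for job in jobs:
        workflow = job["name"].split(" / ")[0]  # assumed naming convention
        totals[workflow][1] += 1
        if job.get("conclusion") == "failure":
            totals[workflow][0] += 1
    return {
        workflow: f"{failed}/{total} failed"
        for workflow, (failed, total) in totals.items()
        if failed
    }
```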
When the dashboard has embedded data and no S3 access, show a friendly message instead of a 403 error when switching dates. The embedded dashboard always shows the latest run.
Replace dropdowns with static labels when dashboard has embedded data, since switching dates/branches requires S3 access.
- S3 summaries, reports, and history now include branch slug in path:
summaries/{date}/{branch}/, reports/{date}/{branch}/, history/{branch}/
- Each branch gets its own dashboard at dashboard/{branch}/index.html
- index.json entries keyed by date/branch instead of just date
- Dashboard date selector shows "date — branch" labels
- Trends filtered to current branch
- Prevents main and release/26.04 nightlies from overwriting each other
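The path layout listed above can be expressed as a small helper. The prefixes come from the list; the slug rule (replacing "/" with "-") and the function names are assumptions made for illustration.

```python
def branch_slug(branch):
    """Assumed slug rule: make the branch name safe as one path segment."""
    return branch.replace("/", "-")


def s3_paths(date, branch):
    """Build the branch-separated S3 keys for one nightly run."""
    slug = branch_slug(branch)
    return {
        "summaries": f"summaries/{date}/{slug}/",
        "reports": f"reports/{date}/{slug}/",
        "history": f"history/{slug}/",
        "dashboard": f"dashboard/{slug}/index.html",
    }
```

Under this rule, `main` and `release/26.04` map to distinct prefixes, so their nightlies no longer overwrite each other.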
Checks if the new branch-separated summaries path has data before aggregating. Falls back to the old flat path for backward compatibility with summaries uploaded before the branch separation.
The nightly-summary was hardcoding today's date, but summaries on S3 are keyed by the date they were created. When re-running against earlier data, the date must match. Now uses the date input from the workflow, falling back to today if not provided.
The aws s3 ls command in the legacy path fallback needs credentials to access the private bucket. Move the CUOPT_AWS_* → AWS_* mapping to the top of the script so all aws CLI calls have credentials.
Previous empty runs uploaded consolidated.json to the branch path, causing the fallback to think data exists. Now only counts actual per-matrix summary files when deciding whether to fall back.
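The "only count real per-matrix summaries" check might be sketched like this, assuming the S3 listing is available as a list of object keys. The function name is hypothetical.

```python
def per_matrix_summaries(keys):
    """Keep per-matrix JSON summaries, ignoring the aggregate output file."""
    return [
        key for key in keys
        if key.endswith(".json") and key.rsplit("/", 1)[-1] != "consolidated.json"
    ]
```

A prefix that contains only `consolidated.json` from an earlier empty run then counts as having no data.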
Branch-separated paths are now the only path structure. A full build+test run will populate the new paths.
Build summary should wait for and report on all jobs including the test trigger and image builds.
The build summary only needs to depend on tests, build-images, and docs-build, since they transitively depend on all upstream build/publish jobs.