Flaky test rerun and reporting by rgsl888prabhu · Pull Request #1098 · NVIDIA/cuopt

rgsl888prabhu · 2026-04-14T16:05:49Z

Description

Issue

Checklist

I am familiar with the Contributing Guidelines.
Testing
- New or existing tests cover these changes
- Added tests
- Created an issue to follow-up
- NA
Documentation
- The documentation is up to date with these changes
- Added new documentation
- NA

Add retry logic for gtest binaries via GTEST_MAX_RETRIES (default 1) and pytest reruns via --reruns 2 --reruns-delay 5. Tests that fail then pass on retry are classified as flaky rather than failures. Add pytest-rerunfailures as a test dependency.
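The pass/flaky/fail rule described above can be sketched in a few lines. This is a minimal illustration, not the PR's script: `classify_with_retries` and the `run_once` callable are hypothetical names, and the sketch assumes `GTEST_MAX_RETRIES` counts extra attempts after the first failure.

```python
import os


def classify_with_retries(run_once, max_retries=None):
    """Return 'passed', 'flaky', or 'failed' for one test binary.

    run_once is a hypothetical callable that runs the binary once and
    returns its exit code. max_retries is assumed to mean the number of
    extra attempts after the first failure.
    """
    if max_retries is None:
        max_retries = int(os.environ.get("GTEST_MAX_RETRIES", "1"))
    for attempt in range(1 + max_retries):
        if run_once() == 0:
            # A pass on the first attempt is clean; a later pass is flaky.
            return "passed" if attempt == 0 else "flaky"
    return "failed"
```

A binary that fails once and then passes is reported as flaky rather than failed, which is the distinction the nightly report relies on.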

Add matrix-aware nightly test report generator that parses JUnit XML, classifies failures as new/recurring/flaky/stabilized, maintains per-matrix failure history on S3, and outputs Markdown, HTML, and JSON reports. Extract S3 helpers into shared module and shell helper to eliminate duplication across test scripts.
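As a rough sketch of the classification step, the following parses a JUnit XML string and compares the failing tests against the set that was already failing in history. The function names and the exact definitions of "new", "recurring", and "stabilized" are assumptions for illustration; flaky detection is left out because it depends on rerun data.

```python
import xml.etree.ElementTree as ET


def failed_tests(junit_xml):
    """Collect 'classname::name' ids of testcases with a failure or error."""
    root = ET.fromstring(junit_xml)
    failed = set()
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            failed.add(f"{case.get('classname')}::{case.get('name')}")
    return failed


def classify(current_failed, previously_failing):
    """Assumed rules: new = fails now only, recurring = fails in both,
    stabilized = was failing and no longer fails."""
    return {
        "new": sorted(current_failed - previously_failing),
        "recurring": sorted(current_failed & previously_failing),
        "stabilized": sorted(previously_failing - current_failed),
    }
```

In this sketch the history is just a set of test ids; the PR keeps a per-matrix history on S3, whose schema is not shown here.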

Add nightly report generation to cpp, python, wheel-python, wheel-server, and notebook test scripts using the shared helper. Wheel and notebook scripts also gain JUnit XML output and EXITCODE trap pattern for consistent error handling.

Add aggregate_nightly.py to merge per-matrix JSON summaries into a consolidated report with matrix grid. Add Slack notifiers for both per-job and consolidated messages with HTML file upload support. Add nightly_summary.sh wrapper for the post-test aggregation job. Add static HTML dashboard with matrix overview, failure drill-down, and trend charts reading from S3 index.json.
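The merge into a matrix grid might look roughly like the sketch below. The summary schema (`matrix`, `failed`, `flaky` keys) and the `consolidate` function are hypothetical; the real aggregate_nightly.py format is not shown in this PR description.

```python
def consolidate(summaries):
    """Merge per-matrix summaries into totals plus a status grid.

    Each summary is assumed to be a dict with a 'matrix' label and
    optional integer 'failed' and 'flaky' counts (hypothetical schema).
    """
    grid = {}
    totals = {"failed": 0, "flaky": 0}
    for summary in summaries:
        failed = summary.get("failed", 0)
        flaky = summary.get("flaky", 0)
        totals["failed"] += failed
        totals["flaky"] += flaky
        if failed:
            status = "failed"
        elif flaky:
            status = "flaky"
        else:
            status = "passed"
        grid[summary["matrix"]] = status
    return {"totals": totals, "grid": grid}
```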

Pass Slack webhook secret to all test jobs. Add nightly-summary job that runs after all test jobs complete, aggregates results from S3, sends a consolidated Slack notification, and uploads the dashboard. Pass S3 and Slack secrets via container-options for the custom job.

Add pitfall entries for cross-cutting change discipline: full scope audits, code duplication, CI matrix parallelism, extensibility, and actionable reporting.

Apply ruff formatting to Python files, update copyright years to 2026 in shell scripts, regenerate conda environment files and pyproject.toml from dependencies.yaml, and remove hardcoded version from comment.

custom-job.yaml does not support secret references in container-options. Remove them and make nightly_summary.sh gracefully skip when CUOPT_DATASET_S3_URI is not available.

The shared workflows only support 3 secret slots. The Slack webhook is only needed by the nightly-summary aggregation job which uses secrets: inherit.

Tests that resolve then fail again within 14 days are recognized as bouncing rather than new failures. After 2+ bounces a test is automatically classified as cross-run flaky. Resolved tests only generate one notification — subsequent passes are silent.

CUOPT_BOUNCE_WINDOW_DAYS (default 14) and CUOPT_BOUNCE_THRESHOLD (default 2) can now be set as environment variables to tune flaky test detection without code changes.
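A minimal sketch of the bounce rule, assuming a history record that stores the date a test was last resolved and a running bounce count. `classify_refailure` is a hypothetical name, and treating the window as inclusive is an assumption; only the two environment variables and their defaults come from the commit messages.

```python
import os
from datetime import date


def classify_refailure(resolved_on, bounce_count, today, environ=None):
    """Classify a test that fails again after having been resolved.

    Returns (label, updated_bounce_count). The inclusive window and the
    choice to leave the count unchanged for 'new' are assumptions.
    """
    environ = os.environ if environ is None else environ
    window = int(environ.get("CUOPT_BOUNCE_WINDOW_DAYS", "14"))
    threshold = int(environ.get("CUOPT_BOUNCE_THRESHOLD", "2"))
    if (today - resolved_on).days > window:
        return "new", bounce_count
    bounce_count += 1
    if bounce_count >= threshold:
        return "flaky", bounce_count
    return "bouncing", bounce_count
```

For example, a test resolved on April 1 that fails again on April 10 counts as a first bounce; a second bounce inside the window promotes it to cross-run flaky under the default threshold.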

The custom-job.yaml reusable workflow does not expose secrets as env vars. Convert nightly-summary to an inline job that directly sets all required secrets (S3, Slack) in the step environment.

In CI, aws-actions/configure-aws-credentials sets role-based tokens (AWS_ACCESS_KEY_ID + AWS_SESSION_TOKEN). The CUOPT_AWS_* overrides were replacing these with static keys that lack the session token, causing InvalidToken errors. Now only fall back to CUOPT_AWS_* when standard AWS credentials are not already set.

The cuOpt S3 bucket requires CUOPT_AWS_* static credentials. The role-based session token from aws-actions/configure-aws-credentials was causing InvalidToken errors. Always override with CUOPT_AWS_* and unset AWS_SESSION_TOKEN, matching the pattern in datasets/*.sh.
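The final credential behavior can be illustrated on a plain dictionary. The specific variable names `CUOPT_AWS_ACCESS_KEY_ID` and `CUOPT_AWS_SECRET_ACCESS_KEY` are assumed expansions of `CUOPT_AWS_*`, and the guard that leaves the environment untouched when the static keys are absent is a choice made for this sketch, not something the commit states.

```python
def apply_static_credentials(environ):
    """Return a copy of environ where the static CUOPT_AWS_* keys replace
    the standard AWS_* keys and the session token is dropped.

    Variable names are assumptions; the input mapping is not mutated.
    """
    env = dict(environ)
    access = env.get("CUOPT_AWS_ACCESS_KEY_ID")
    secret = env.get("CUOPT_AWS_SECRET_ACCESS_KEY")
    if access and secret:
        env["AWS_ACCESS_KEY_ID"] = access
        env["AWS_SECRET_ACCESS_KEY"] = secret
        # A leftover role session token paired with static keys is what
        # produced the InvalidToken errors described above.
        env.pop("AWS_SESSION_TOKEN", None)
    return env
```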

copy-pr-bot · 2026-04-14T16:05:53Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Move the nightly-summary job out of test.yaml into its own nightly-summary.yaml reusable workflow. Runs in a python:3.12-slim container to avoid PEP 668 externally-managed-environment errors when installing awscli. Also adds workflow_dispatch trigger so the summary can be re-run manually against an earlier test run.

The python:3.12-slim image doesn't include curl, which is needed by send_consolidated_summary.sh for Slack webhook and file upload.

- Filter consolidated.json from S3 aggregation to fix "unknown" entry
- Migrate Slack file upload from deprecated files.upload to getUploadURLExternal + completeUploadExternal
- Chunk Slack messages into header/grid/details/links to stay within block and character limits
- Remove S3 link from Slack in favor of HTML file attachment
- Add --junitxml to Pyomo, CvxPy, and PuLP thirdparty test scripts so failures appear in nightly reports
- Export RAPIDS_TESTS_DIR from test_wheel_cuopt.sh for subprocesses

…Slack
- Generate presigned S3 URLs (7-day expiry) for consolidated HTML report and dashboard, linked in Slack messages
- Query GitHub API for workflow job statuses to surface CI-level failures (notebooks, JuMP, etc.) that don't produce JUnit XML
- Show only failed/flaky matrix entries in Slack instead of listing all passing ones, with a compact summary line for green runs
- Pass GITHUB_TOKEN and GITHUB_RUN_ID to nightly-summary container
- Remove temporary test workflow file

Post the main summary (status + links) as a top-level message via chat.postMessage, then post matrix details, failure breakdowns, and the HTML report as thread replies. Keeps the channel clean while preserving full detail in the thread. Falls back to webhook (no threading) if bot token is not available.

- Filter out per-matrix test jobs (conda-cpp-tests, conda-python-tests, wheel-tests-*) from workflow job status since they are already tracked by S3 summaries. Only surface untracked jobs like notebooks and JuMP.
- Move full CI job failure list to thread reply to avoid exceeding Slack's 3000-char block limit. Main message shows compact summary.
- Chunk CI job details into multiple blocks if needed.
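Chunking text so that each Slack section block stays under the 3000-character limit could be done along these lines. `chunk_lines` and `to_section_blocks` are hypothetical helpers; the PR's scripts may split on different boundaries.

```python
def chunk_lines(lines, limit=3000):
    """Group lines into newline-joined chunks of at most `limit` characters."""
    chunks, current = [], ""
    for line in lines:
        # Hard-split any single line that is longer than the limit.
        while len(line) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:limit])
            line = line[limit:]
        candidate = line if not current else current + "\n" + line
        if len(candidate) > limit:
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def to_section_blocks(lines, limit=3000):
    """Wrap each chunk in a Block Kit section block with mrkdwn text."""
    return [
        {"type": "section", "text": {"type": "mrkdwn", "text": chunk}}
        for chunk in chunk_lines(lines, limit)
    ]
```

Splitting on line boundaries keeps each job entry intact in the thread reply; only a single overlong line is cut mid-text.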

- Include ALL workflow jobs in consolidated JSON (not just untracked) with has_test_details flag to distinguish tracked vs untracked
- Thread reply 1: CI Workflow Status showing every workflow group with pass/fail counts, so new workflows are automatically visible
- Thread reply 2: Failing and flaky tests grouped by workflow so users see which workflow has which test issues
- Main message alerts only on untracked CI failures (notebooks, JuMP) since tracked failures already appear in the matrix test grid

- Map CUOPT_AWS_* to standard AWS env vars before aws s3 presign so the CLI has credentials in the container
- Log presign failures instead of swallowing them silently
- Reorder thread: test failures/details first, CI workflow overview last

- Main message now lists which workflows have failures by name (e.g., "Failures in: conda-notebook-tests, wheel-tests-cuopt") with per-workflow failure counts
- Add build-summary job to build.yaml that sends a Slack message after all builds complete, showing pass/fail per build job
- Build summary queries GitHub API for job statuses, grouped by workflow prefix (cpp-build, wheel-build-cuopt, docs-build, etc.)

Skips all test jobs when summary-only=true, so nightly-summary runs immediately without waiting for GPU runners.

- Embed index.json and consolidated data directly into dashboard HTML during aggregation so it works on private S3 buckets without runtime fetches (no more 403 errors)
- Dashboard falls back to S3 fetch if embedded data is absent
- Add summary-only input to test.yaml to skip all test jobs and run only nightly-summary (avoids waiting for GPU runners when testing)
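One way to embed the data at aggregation time is to substitute a marker in the HTML template with an inline script. The marker text, the `window.EMBEDDED` name, and `embed_data` are all hypothetical; this only illustrates the idea of shipping the JSON inside the page so no runtime fetch is needed.

```python
import json

PLACEHOLDER = "<!-- EMBEDDED_DATA -->"  # hypothetical marker in the template


def embed_data(template_html, index, consolidated):
    """Inline index and consolidated data into the dashboard HTML."""
    payload = json.dumps({"index": index, "consolidated": consolidated})
    # Escape "</" so a string value cannot close the script tag early.
    payload = payload.replace("</", "<\\/")
    script = f"<script>window.EMBEDDED = {payload};</script>"
    return template_html.replace(PLACEHOLDER, script, 1)
```

The dashboard script can then check for the embedded object first and only fall back to fetching from S3 when it is missing.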

- Test totals: only show failed/flaky counts, skip passed/skipped/total
- CI Workflow Status thread: only list failing workflows, one-line summary for passing ones

Write GitHub API response to a temp file instead of passing it as a shell argument to Python. The jobs JSON for a full build matrix exceeds the OS argument length limit.
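The same idea, expressed in Python rather than shell: write the JSON to a temporary file and pass only the path to the child process. `summarize_via_tempfile` and the inline child script are illustrative stand-ins for the real summary script.

```python
import json
import os
import subprocess
import sys
import tempfile


def summarize_via_tempfile(jobs):
    """Hand a large JSON document to a child process through a file path."""
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as fh:
        json.dump(jobs, fh)
        path = fh.name
    try:
        # The child receives a short path, never the JSON body itself, so
        # the OS argument length limit no longer applies to the payload.
        script = "import json, sys; print(len(json.load(open(sys.argv[1]))['jobs']))"
        out = subprocess.run(
            [sys.executable, "-c", script, path],
            check=True,
            capture_output=True,
            text=True,
        )
        return int(out.stdout)
    finally:
        os.unlink(path)
```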

Skips all build/publish/test/image jobs when summary-only=true, so build-summary runs immediately without waiting for runners.

Use GitHub API job counts (e.g., "1/11 failed") instead of vague "failed" or "matrix job(s)" in the per-workflow failure summary.

When the dashboard has embedded data and no S3 access, show a friendly message instead of a 403 error when switching dates. The embedded dashboard always shows the latest run.

Replace dropdowns with static labels when dashboard has embedded data, since switching dates/branches requires S3 access.

- S3 summaries, reports, and history now include branch slug in path: summaries/{date}/{branch}/, reports/{date}/{branch}/, history/{branch}/
- Each branch gets its own dashboard at dashboard/{branch}/index.html
- index.json entries keyed by date/branch instead of just date
- Dashboard date selector shows "date — branch" labels
- Trends filtered to current branch
- Prevents main and release/26.04 nightlies from overwriting each other
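The path layout listed above can be sketched as a small helper. The slug rule used here (replace `/` with `-`) is an assumption; the commit only says a branch slug is included in the path.

```python
def branch_slug(branch):
    """Assumed slug rule: make the branch name safe as one path segment."""
    return branch.replace("/", "-")


def s3_paths(date, branch):
    """Build the branch-separated key prefixes named in the commit message."""
    slug = branch_slug(branch)
    return {
        "summaries": f"summaries/{date}/{slug}/",
        "reports": f"reports/{date}/{slug}/",
        "history": f"history/{slug}/",
        "dashboard": f"dashboard/{slug}/index.html",
    }
```

With this layout, nightlies for main and release/26.04 on the same date write to different prefixes and cannot overwrite each other.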

Checks if the new branch-separated summaries path has data before aggregating. Falls back to the old flat path for backward compatibility with summaries uploaded before the branch separation.

The nightly-summary was hardcoding today's date, but summaries on S3 are keyed by the date they were created. When re-running against earlier data, the date must match. Now uses the date input from the workflow, falling back to today if not provided.

The aws s3 ls command in the legacy path fallback needs credentials to access the private bucket. Moving CUOPT_AWS_* → AWS_* mapping to the top of the script so all aws CLI calls have credentials.

Previous empty runs uploaded consolidated.json to the branch path, causing the fallback to think data exists. Now only counts actual per-matrix summary files when deciding whether to fall back.
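Counting only real per-matrix summaries amounts to filtering the listing before checking whether it is empty. A minimal sketch, with `per_matrix_summaries` as a hypothetical helper operating on a list of S3 key names:

```python
from pathlib import PurePosixPath


def per_matrix_summaries(keys):
    """Keep JSON summary keys, ignoring the aggregated consolidated.json."""
    return [
        key
        for key in keys
        if key.endswith(".json") and PurePosixPath(key).name != "consolidated.json"
    ]
```

A prefix that contains only consolidated.json then yields an empty list, so it is treated as having no data.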

Branch-separated paths are now the only path structure. A full build+test run will populate the new paths.

Build summary should wait for and report on all jobs including the test trigger and image builds.

Only need to depend on tests, build-images, and docs-build since they transitively depend on all upstream build/publish jobs.

rgsl888prabhu added 14 commits April 13, 2026 11:45

Instrument all test scripts with nightly reporting

1392691

Add nightly report generation to cpp, python, wheel-python, wheel-server, and notebook test scripts using the shared helper. Wheel and notebook scripts also gain JUnit XML output and EXITCODE trap pattern for consistent error handling.

Update developer skill with CI best practices

a226607

Add pitfall entries for cross-cutting change discipline: full scope audits, code duplication, CI matrix parallelism, extensibility, and actionable reporting.

Fix pre-commit: ruff format, copyright years, dependency files

b6d1303

Apply ruff formatting to Python files, update copyright years to 2026 in shell scripts, regenerate conda environment files and pyproject.toml from dependencies.yaml, and remove hardcoded version from comment.

Fix nightly-summary job: remove unsupported container-options secrets

6b7605b

custom-job.yaml does not support secret references in container-options. Remove them and make nightly_summary.sh gracefully skip when CUOPT_DATASET_S3_URI is not available.

Remove unsupported script-env-secret-4 from test workflow

84069f8

The shared workflows only support 3 secret slots. The Slack webhook is only needed by the nightly-summary aggregation job which uses secrets: inherit.

Make bounce window and threshold configurable via env vars

d34bf28

CUOPT_BOUNCE_WINDOW_DAYS (default 14) and CUOPT_BOUNCE_THRESHOLD (default 2) can now be set as environment variables to tune flaky test detection without code changes.

Convert nightly-summary to inline job for secret access

2eae30b

The custom-job.yaml reusable workflow does not expose secrets as env vars. Convert nightly-summary to an inline job that directly sets all required secrets (S3, Slack) in the step environment.

rgsl888prabhu added 15 commits April 14, 2026 11:26

Add curl to nightly-summary container for Slack notifications

15641c5

The python:3.12-slim image doesn't include curl, which is needed by send_consolidated_summary.sh for Slack webhook and file upload.

Remove @channel ping from Slack notifications

777a58d

Add summary-only flag to test.yaml for quick nightly-summary testing

4419b92

Skips all test jobs when summary-only=true, so nightly-summary runs immediately without waiting for GPU runners.

Show Failures tab first in dashboard instead of Matrix Grid

cc9246e

Trim verbose Slack output: compact stats, failures-only CI status

1b45e24

- Test totals: only show failed/flaky counts, skip passed/skipped/total
- CI Workflow Status thread: only list failing workflows, one-line summary for passing ones

Fix build summary argument list too long error

78c1e38

Write GitHub API response to a temp file instead of passing it as a shell argument to Python. The jobs JSON for a full build matrix exceeds the OS argument length limit.

rgsl888prabhu added 13 commits April 15, 2026 13:27

Add summary-only flag to build.yaml for quick build-summary testing

dd75886

Skips all build/publish/test/image jobs when summary-only=true, so build-summary runs immediately without waiting for runners.

Show CI job pass/fail counts in main Slack message

bd1ec6f

Use GitHub API job counts (e.g., "1/11 failed") instead of vague "failed" or "matrix job(s)" in the per-workflow failure summary.

Show failures before matrix overview in consolidated HTML report

609db51

Handle date switching gracefully in embedded dashboard

bcde49c

When the dashboard has embedded data and no S3 access, show a friendly message instead of a 403 error when switching dates. The embedded dashboard always shows the latest run.

Show branch and date as info labels in embedded dashboard

2499eb3

Replace dropdowns with static labels when dashboard has embedded data, since switching dates/branches requires S3 access.

Fall back to legacy S3 path when branch-separated path is empty

9f7cef5

Checks if the new branch-separated summaries path has data before aggregating. Falls back to the old flat path for backward compatibility with summaries uploaded before the branch separation.

Move AWS credential mapping before S3 fallback check

f30a4e2

The aws s3 ls command in the legacy path fallback needs credentials to access the private bucket. Moving CUOPT_AWS_* → AWS_* mapping to the top of the script so all aws CLI calls have credentials.

Exclude consolidated.json from fallback path check

404263c

Previous empty runs uploaded consolidated.json to the branch path, causing the fallback to think data exists. Now only counts actual per-matrix summary files when deciding whether to fall back.

Remove legacy S3 path fallback

c90e73c

Branch-separated paths are now the only path structure. A full build+test run will populate the new paths.

Add tests and build-images to build-summary needs

54d01fd

Build summary should wait for and report on all jobs including the test trigger and image builds.

Simplify build-summary needs to leaf jobs only

b7efe3b

Only need to depend on tests, build-images, and docs-build since they transitively depend on all upstream build/publish jobs.

rgsl888prabhu wants to merge 42 commits into main from flaky_test_reurun_and_reporting
