
Add flaky test detection, retry, crash isolation, and nightly reporting #1098

Open
rgsl888prabhu wants to merge 87 commits into main from
flaky_test_reurun_and_reporting

Conversation


@rgsl888prabhu rgsl888prabhu commented Apr 14, 2026

Summary

End-to-end CI pipeline for detecting, classifying, and reporting flaky tests across all test types (C++ gtests, Python pytest, server pytest).

Flaky detection and retry

  • pytest: --reruns 2 via pytest-rerunfailures for all builds (PR + nightly). Custom cuopt_rerun_xml plugin writes supplementary JUnit XML for rerun failures (pytest-rerunfailures v14+ only records final outcome)
  • gtest: per-test-case retry with separate XML files per attempt
  • Crash isolation (nightly only): on segfault/SIGABRT, collects test list, subtracts already-passed tests, retries remaining individually to identify the crashing test
  • Flaky validation tests (test_flaky_validation.py/.cpp) included for pipeline validation

Nightly reporting pipeline

  • Per-job report (nightly_report.py): classifies tests as passed/failed/flaky/error, tracks failure history on S3 (new vs recurring vs stabilized), generates Markdown + HTML + JSON
  • Consolidated report (aggregate_nightly.py): merges all matrix job summaries into a single view with matrix grid, failure details, and flaky test list
  • Slack notification (generate_slack_payloads.py): threaded messages grouped by workflow, mentions @ramakrishna_prabhu on new failures or new flaky tests
  • Dashboard (ci/dashboard/index.html): self-contained HTML with tabs for Failures, Flaky, Stabilized, Matrix Grid, and Trends
  • GitHub Step Summary: both per-job and consolidated summaries visible directly in Actions

History and cross-run tracking

  • Persistent failure history per branch on S3 (history/{branch_slug}/)
  • Bounce detection: tests that flip between passing and failing across runs are classified as cross-run flaky after 2+ bounces (configurable via CUOPT_BOUNCE_WINDOW_DAYS, CUOPT_BOUNCE_THRESHOLD)
  • History seeding: new branches (e.g., release/26.06) inherit failure history from main on first run, so known failures aren't re-reported as new
  • New flaky tracking: distinguishes first-time flaky (triggers notification) from recurring flaky
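
Bounce classification can be sketched as follows — a simplified model with an assumed history-record shape; the actual logic lives in the nightly report tooling:

```python
import os
from datetime import datetime, timedelta

# Tunables read from the environment, per the PR description.
BOUNCE_WINDOW_DAYS = int(os.environ.get("CUOPT_BOUNCE_WINDOW_DAYS", "14"))
BOUNCE_THRESHOLD = int(os.environ.get("CUOPT_BOUNCE_THRESHOLD", "2"))


def count_bounces(history, now):
    """Count pass->fail flips within the bounce window.

    `history` is a list of (timestamp, outcome) tuples, oldest first,
    where outcome is "passed" or "failed" (a simplified record shape).
    """
    cutoff = now - timedelta(days=BOUNCE_WINDOW_DAYS)
    recent = [outcome for ts, outcome in history if ts >= cutoff]
    return sum(
        1 for a, b in zip(recent, recent[1:]) if a == "passed" and b == "failed"
    )


def classify(history, now):
    # 2+ bounces within the window => cross-run flaky, not a new failure.
    if count_bounces(history, now) >= BOUNCE_THRESHOLD:
        return "cross-run-flaky"
    return "failure"
```

A test that passes, fails, passes, and fails again within 14 days accumulates two bounces and is flagged as cross-run flaky rather than re-reported as a new failure.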

S3 path isolation

  • Summaries scoped by GITHUB_RUN_ID (summaries/run-{id}/) — prevents cross-run pollution when RAPIDS_BRANCH defaults to "main" in rapidsai containers for feature branches
  • Branch-scoped copy also written (summaries/{date}/{branch}/) for manual browsing
  • Date-free run-scoped path prevents midnight UTC mismatches when jobs span days
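
The two destinations can be illustrated with a small path-building sketch (the function and base-URI handling are assumptions; only the path layout comes from the description above):

```python
import os


def summary_paths(s3_base, branch, date):
    """Build the two summary destinations (layout per the PR description).

    The run-scoped path is canonical and date-free; the branch-scoped
    copy exists only for manual browsing.
    """
    run_id = os.environ.get("GITHUB_RUN_ID", "local")
    base = s3_base.rstrip("/")
    run_scoped = f"{base}/summaries/run-{run_id}/"
    branch_scoped = f"{base}/summaries/{date}/{branch}/"
    return run_scoped, branch_scoped
```

Because the canonical path carries no date, a job that starts at 23:55 UTC and finishes after midnight still writes to the same prefix as its siblings.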

Code quality

  • Shared helpers: crash_helpers.sh (signal_name, was_signal_death, write_crash_xml, xml_escape, pytest_crash_isolate), junit_helpers.py (JUnit XML parsing for failed/passed/gtest-list)
  • Extracted inline scripts: generate_slack_payloads.py, generate_step_summary.py
  • XML escaping for crash markers (handles parameterized test names with <>&")
  • Fixed CUOPT_DATASET_S3_URI → CUOPT_S3_URI migration in PDLP and MIPLIB download scripts
  • Fixed nested f-string quote issue in build_summary.sh
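
The crash-marker escaping has to cover all five XML-special characters, with `&` replaced first so already-escaped entities aren't double-escaped. A Python sketch of what the shell `xml_escape` helper must handle:

```python
def xml_escape(text):
    """Escape XML-special characters so parameterized test names like
    'TestCase<int>&"x"' survive inside JUnit XML attributes.
    (A sketch of the behavior; the real helper lives in crash_helpers.sh.)
    """
    return (
        text.replace("&", "&amp;")  # must come first
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace('"', "&quot;")
        .replace("'", "&apos;")
    )
```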

Test plan

  • Flaky validation tests detected correctly in per-job reports (1 flaky per test type)
  • Consolidated report aggregates flaky tests from all matrix jobs
  • Slack notification shows new flaky tests with :new: :warning: tag
  • Dashboard Flaky tab shows error messages from failed attempts
  • GitHub Step Summary visible for both per-job and nightly-summary jobs
  • Dataset download succeeds via S3 (no HTTP fallback for nug08-3rd)
  • History seeding works for non-main branches (first run inherits from main)
  • Remove test_flaky_validation files before merging to main

Add retry logic for gtest binaries via GTEST_MAX_RETRIES (default 1)
and pytest reruns via --reruns 2 --reruns-delay 5. Tests that fail
then pass on retry are classified as flaky rather than failures.
Add pytest-rerunfailures as a test dependency.
Add matrix-aware nightly test report generator that parses JUnit XML,
classifies failures as new/recurring/flaky/stabilized, maintains
per-matrix failure history on S3, and outputs Markdown, HTML, and JSON
reports. Extract S3 helpers into shared module and shell helper to
eliminate duplication across test scripts.
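The aggregation step amounts to merging per-matrix JSON into one consolidated structure; a rough sketch (field names here are assumptions, not aggregate_nightly.py's real schema):

```python
import json


def aggregate(summary_paths):
    """Merge per-matrix summary JSON files into a single consolidated view.

    Each input file is assumed to carry a matrix identifier, an overall
    status, and lists of flaky/failed test names (illustrative schema).
    """
    consolidated = {"matrix": {}, "flaky": [], "failed": []}
    for path in summary_paths:
        with open(path) as f:
            summary = json.load(f)
        key = summary.get("matrix_id", "unknown")
        consolidated["matrix"][key] = summary.get("status", "unknown")
        consolidated["flaky"] += summary.get("flaky", [])
        consolidated["failed"] += summary.get("failed", [])
    return consolidated
```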
Add nightly report generation to cpp, python, wheel-python,
wheel-server, and notebook test scripts using the shared helper.
Wheel and notebook scripts also gain JUnit XML output and EXITCODE
trap pattern for consistent error handling.
Add aggregate_nightly.py to merge per-matrix JSON summaries into a
consolidated report with matrix grid. Add Slack notifiers for both
per-job and consolidated messages with HTML file upload support.
Add nightly_summary.sh wrapper for the post-test aggregation job.
Add static HTML dashboard with matrix overview, failure drill-down,
and trend charts reading from S3 index.json.
Pass Slack webhook secret to all test jobs. Add nightly-summary job
that runs after all test jobs complete, aggregates results from S3,
sends a consolidated Slack notification, and uploads the dashboard.
Pass S3 and Slack secrets via container-options for the custom job.
Add pitfall entries for cross-cutting change discipline: full scope
audits, code duplication, CI matrix parallelism, extensibility, and
actionable reporting.
Apply ruff formatting to Python files, update copyright years
to 2026 in shell scripts, regenerate conda environment files
and pyproject.toml from dependencies.yaml, and remove hardcoded
version from comment.
custom-job.yaml does not support secret references in container-options.
Remove them and make nightly_summary.sh gracefully skip when
CUOPT_DATASET_S3_URI is not available.
The shared workflows only support 3 secret slots. The Slack webhook
is only needed by the nightly-summary aggregation job which uses
secrets: inherit.
Tests that resolve then fail again within 14 days are recognized as
bouncing rather than new failures. After 2+ bounces a test is
automatically classified as cross-run flaky. Resolved tests only
generate one notification — subsequent passes are silent.
CUOPT_BOUNCE_WINDOW_DAYS (default 14) and CUOPT_BOUNCE_THRESHOLD
(default 2) can now be set as environment variables to tune flaky
test detection without code changes.
The custom-job.yaml reusable workflow does not expose secrets as env
vars. Convert nightly-summary to an inline job that directly sets
all required secrets (S3, Slack) in the step environment.
In CI, aws-actions/configure-aws-credentials sets role-based tokens
(AWS_ACCESS_KEY_ID + AWS_SESSION_TOKEN). The CUOPT_AWS_* overrides
were replacing these with static keys that lack the session token,
causing InvalidToken errors. Now only fall back to CUOPT_AWS_* when
standard AWS credentials are not already set.
The cuOpt S3 bucket requires CUOPT_AWS_* static credentials. The
role-based session token from aws-actions/configure-aws-credentials
was causing InvalidToken errors. Always override with CUOPT_AWS_*
and unset AWS_SESSION_TOKEN, matching the pattern in datasets/*.sh.
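That final credential rule can be sketched like this (the full CUOPT_AWS_* variable names beyond the prefix are assumptions):

```python
def force_cuopt_static_credentials(env):
    """Always prefer CUOPT_AWS_* static keys for the cuOpt bucket and
    drop any role-based session token, mirroring datasets/*.sh.

    `env` is a plain dict of environment variables (e.g. a copy of
    os.environ); variable names past the CUOPT_AWS_ prefix are assumed.
    """
    for src, dst in (
        ("CUOPT_AWS_ACCESS_KEY_ID", "AWS_ACCESS_KEY_ID"),
        ("CUOPT_AWS_SECRET_ACCESS_KEY", "AWS_SECRET_ACCESS_KEY"),
    ):
        if env.get(src):
            env[dst] = env[src]
    # A stale session token alongside static keys triggers InvalidToken.
    env.pop("AWS_SESSION_TOKEN", None)
    return env
```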

copy-pr-bot Bot commented Apr 14, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Move the nightly-summary job out of test.yaml into its own
nightly-summary.yaml reusable workflow. Runs in a python:3.12-slim
container to avoid PEP 668 externally-managed-environment errors
when installing awscli. Also adds workflow_dispatch trigger so
the summary can be re-run manually against an earlier test run.
The python:3.12-slim image doesn't include curl, which is needed
by send_consolidated_summary.sh for Slack webhook and file upload.
- Filter consolidated.json from S3 aggregation to fix "unknown" entry
- Migrate Slack file upload from deprecated files.upload to
  getUploadURLExternal + completeUploadExternal
- Chunk Slack messages into header/grid/details/links to stay within
  block and character limits
- Remove S3 link from Slack in favor of HTML file attachment
- Add --junitxml to Pyomo, CvxPy, and PuLP thirdparty test scripts
  so failures appear in nightly reports
- Export RAPIDS_TESTS_DIR from test_wheel_cuopt.sh for subprocesses
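The message chunking above is essentially greedy line-packing against Slack's 3000-character Block Kit text limit; a minimal sketch:

```python
MAX_BLOCK_CHARS = 3000  # Slack Block Kit text limit per section block


def chunk_lines(lines, limit=MAX_BLOCK_CHARS):
    """Greedily pack lines into chunks that each fit one Slack block."""
    chunks, current = [], ""
    for line in lines:
        candidate = current + ("\n" if current else "") + line
        if len(candidate) > limit and current:
            # Current chunk is full; start a new one with this line.
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then becomes one section block in the payload, so a long failure list spans several blocks instead of being rejected by the API.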
…Slack

- Generate presigned S3 URLs (7-day expiry) for consolidated HTML
  report and dashboard, linked in Slack messages
- Query GitHub API for workflow job statuses to surface CI-level
  failures (notebooks, JuMP, etc.) that don't produce JUnit XML
- Show only failed/flaky matrix entries in Slack instead of listing
  all passing ones — compact summary line for green runs
- Pass GITHUB_TOKEN and GITHUB_RUN_ID to nightly-summary container
- Remove temporary test workflow file
Post the main summary (status + links) as a top-level message via
chat.postMessage, then post matrix details, failure breakdowns,
and the HTML report as thread replies. Keeps the channel clean
while preserving full detail in the thread.

Falls back to webhook (no threading) if bot token is not available.
- Filter out per-matrix test jobs (conda-cpp-tests, conda-python-tests,
  wheel-tests-*) from workflow job status since they are already tracked
  by S3 summaries. Only surface untracked jobs like notebooks and JuMP.
- Move full CI job failure list to thread reply to avoid exceeding
  Slack's 3000-char block limit. Main message shows compact summary.
- Chunk CI job details into multiple blocks if needed.
- Include ALL workflow jobs in consolidated JSON (not just untracked)
  with has_test_details flag to distinguish tracked vs untracked
- Thread reply 1: CI Workflow Status showing every workflow group
  with pass/fail counts — new workflows automatically visible
- Thread reply 2: Failing and flaky tests grouped by workflow so
  users see which workflow has which test issues
- Main message alerts only on untracked CI failures (notebooks, JuMP)
  since tracked failures already appear in the matrix test grid
- Map CUOPT_AWS_* to standard AWS env vars before aws s3 presign
  so the CLI has credentials in the container
- Log presign failures instead of swallowing them silently
- Reorder thread: test failures/details first, CI workflow overview last
- Main message now lists which workflows have failures by name
  (e.g., "Failures in: conda-notebook-tests, wheel-tests-cuopt")
  with per-workflow failure counts
- Add build-summary job to build.yaml that sends a Slack message
  after all builds complete, showing pass/fail per build job
- Build summary queries GitHub API for job statuses, grouped by
  workflow prefix (cpp-build, wheel-build-cuopt, docs-build, etc.)
Skips all test jobs when summary-only=true, so nightly-summary runs
immediately without waiting for GPU runners.
- Embed index.json and consolidated data directly into dashboard HTML
  during aggregation so it works on private S3 buckets without
  runtime fetches (no more 403 errors)
- Dashboard falls back to S3 fetch if embedded data is absent
- Add summary-only input to test.yaml to skip all test jobs and run
  only nightly-summary (avoids waiting for GPU runners when testing)
- Test totals: only show failed/flaky counts, skip passed/skipped/total
- CI Workflow Status thread: only list failing workflows, one-line
  summary for passing ones
Write GitHub API response to a temp file instead of passing it as
a shell argument to Python. The jobs JSON for a full build matrix
exceeds the OS argument length limit.
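Passing a large JSON blob through a file rather than argv sidesteps the ARG_MAX limit; the pattern looks roughly like this (the inline child script stands in for the real Python consumer):

```python
import json
import subprocess
import sys
import tempfile


def run_with_json_file(jobs):
    """Write the jobs payload to a temp file and hand the *path* to the
    child process, since a full build-matrix jobs JSON can exceed the
    OS argument length limit if passed as a shell argument.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(jobs, f)
        path = f.name
    # Child reads the file instead of receiving JSON on the command line.
    result = subprocess.run(
        [
            sys.executable,
            "-c",
            "import json, sys; print(len(json.load(open(sys.argv[1]))['jobs']))",
            path,
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```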
Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>

coderabbitai Bot commented Apr 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds end-to-end nightly CI aggregation and reporting: new reusable workflows/jobs, S3-backed aggregation and index updates, Slack notifications, rerun/crash-isolation tooling, per-suite nightly report generation (JSON/HTML/Markdown), a static dashboard UI, and multiple CI script and dependency updates to emit and consume JUnit outputs.

Changes

Cohort / File(s) Summary
Workflows
.github/workflows/build.yaml, .github/workflows/test.yaml, .github/workflows/pr.yaml, .github/workflows/nightly-summary.yaml
Added nightly-summary reusable workflow and conditional build-summary/nightly-summary jobs; updated secret wiring to use CUOPT_S3_URI; added workflow inputs/secrets for AWS/S3 and optional Slack integration.
Top-level CI dispatch & summary scripts
ci/build_summary.sh, ci/nightly_summary.sh, ci/utils/send_consolidated_summary.sh
New scripts to collect run/job metadata, invoke aggregation, create presigned URLs, and post Slack notifications (threading and file upload).
Aggregation & dashboard UI
ci/utils/aggregate_nightly.py, ci/dashboard/index.html
New aggregator that merges per-matrix JSON into consolidated JSON/HTML, updates S3 index/pruning, optionally embeds data into a static dashboard; added interactive dashboard HTML.
Nightly report generation
ci/utils/nightly_report.py, ci/utils/nightly_report_helper.sh
New nightly report generator parsing JUnit XMLs, maintaining history, classifying new/recurring/flaky/resolved tests, emitting JSON/HTML/Markdown, and helper to compute S3/history paths.
Slack & step-summary utilities
ci/utils/generate_slack_payloads.py, ci/utils/generate_step_summary.py
New Python utilities to build Slack Block Kit payloads (main + threaded replies) and to render GitHub Step Summary Markdown from consolidated summaries.
JUnit / pytest rerun tooling
ci/utils/junit_helpers.py, ci/utils/cuopt_rerun_xml.py
Added JUnit parsing helpers and a pytest plugin that records rerun failure messages and emits supplemental rerun XML for aggregation.
Crash isolation helpers
ci/utils/crash_helpers.sh
New bash helpers to detect signal-based crashes, write crash-specific JUnit XML, and retry isolated pytest node IDs to classify consistent vs flaky crashes.
Test-run orchestration & retries
ci/run_ctests.sh, ci/run_cuopt_pytests.sh, ci/run_cuopt_server_pytests.sh, ci/test_cpp.sh, ci/test_python.sh, ci/test_notebooks.sh, ci/test_wheel_cuopt.sh, ci/test_wheel_cuopt_server.sh, ci/thirdparty-testing/run_*.sh
Updated runners to emit JUnit XML to RAPIDS_TESTS_DIR, wrap runs with retry/crash-isolation flows, accumulate exit codes, and call nightly reporting helpers.
S3 helpers
ci/utils/s3_helpers.py
New AWS CLI wrappers building env from CUOPT AWS vars and providing s3_download/s3_upload/s3_list with error handling.
Slack payload sender
ci/utils/send_consolidated_summary.sh
Bash notifier that delegates payload creation to Python, posts main message and threaded replies, and uploads consolidated HTML when present.
Flaky-test validation artifacts
cpp/tests/utilities/CMakeLists.txt, cpp/tests/utilities/test_flaky_validation.cpp, python/.../tests/test_flaky_validation.py, python/cuopt_server/.../tests/test_flaky_validation.py
Added intentionally flaky C++ and Python tests and registered CMake test target to validate rerun/crash-handling pipeline.
Datasets S3 env rename & paths
datasets/get_test_data.sh, datasets/.../download_*.sh
Replaced CUOPT_DATASET_S3_URI with CUOPT_S3_URI, updated dataset S3 prefixes to ci_datasets/..., and improved presence/warning messages.
Conda / project test dependencies
conda/environments/*, dependencies.yaml, python/*/pyproject.toml, python/.../linear_programming/pyproject.toml
Added pytest-rerunfailures to conda envs and project test extras to enable rerun behavior used by the new tooling.
C++ test registration / formatting
cpp/tests/utilities/CMakeLists.txt
Added FLAKY_VALIDATION_TEST test target and minor formatting change for existing CLI_TEST.
Misc guidance
skills/cuopt-developer/SKILL.md
Added CI "change discipline" guidance covering persistence, notifications, lifecycle validation, parallel-safety, shared helpers, and context-rich reporting.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 78.18% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Add flaky test detection, retry, crash isolation, and nightly reporting' accurately summarizes the main objective of the PR—implementing an end-to-end CI pipeline for flaky test management across multiple test frameworks.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description comprehensively relates to the changeset, detailing end-to-end flaky test detection, retry mechanisms, crash isolation, nightly reporting pipeline, history tracking, and supporting infrastructure changes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch flaky_test_reurun_and_reporting

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 13

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

🟡 Minor comments (8)
datasets/get_test_data.sh-38-38 (1)

38-38: ⚠️ Potential issue | 🟡 Minor

Normalize CUOPT_S3_URI before appending subpaths.

Line 38 assumes a trailing slash; CUOPT_S3_URI=s3://bucket produces s3://bucketci_datasets/routing/ and breaks sync.

Suggested fix
-    local s3_uri="${CUOPT_S3_URI}ci_datasets/routing/"
+    local s3_base="${CUOPT_S3_URI%/}"
+    local s3_uri="${s3_base}/ci_datasets/routing/"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datasets/get_test_data.sh` at line 38, The assignment creating s3_uri using
CUOPT_S3_URI doesn't ensure a trailing slash so concatenation can yield invalid
paths (e.g., s3://bucketci_datasets/...); update the logic that sets local
s3_uri to normalize CUOPT_S3_URI first (for example, append a "/" if
CUOPT_S3_URI does not already end with "/" or strip/ensure a single separator)
before adding "ci_datasets/routing/". Locate the assignment to s3_uri and adjust
the string handling for CUOPT_S3_URI so downstream sync operations get a
well-formed S3 URI.
datasets/linear_programming/download_pdlp_test_dataset.sh-57-58 (1)

57-58: ⚠️ Potential issue | 🟡 Minor

Normalize S3 URI joining to avoid malformed paths.

Line 58 directly concatenates CUOPT_S3_URI with a suffix. If the configured URI is missing a trailing slash, the resolved S3 path becomes invalid.

Suggested fix
-    local s3_uri="${CUOPT_S3_URI}ci_datasets/linear_programming/pdlp/"
+    local s3_uri="${CUOPT_S3_URI%/}/ci_datasets/linear_programming/pdlp/"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datasets/linear_programming/download_pdlp_test_dataset.sh` around lines 57 -
58, The s3_uri construction concatenates CUOPT_S3_URI and a suffix directly,
which can produce malformed paths if CUOPT_S3_URI lacks a trailing slash; update
the assignment for s3_uri to normalize joining by ensuring CUOPT_S3_URI ends
with a single slash (e.g., strip any trailing slashes then append one, or
conditionally add a slash when missing) before appending
"ci_datasets/linear_programming/pdlp/"; modify the s3_uri variable assignment
(referencing CUOPT_S3_URI and s3_uri) to perform this normalization so the
resulting S3 URI is always well-formed.
ci/utils/generate_slack_payloads.py-28-31 (1)

28-31: ⚠️ Potential issue | 🟡 Minor

Add CLI argument validation before indexing sys.argv.

Line 29 can throw IndexError when invoked without arguments; return a usage error instead.

Suggested fix
 def main():
+    if len(sys.argv) < 2:
+        print(
+            f"Usage: {sys.argv[0]} <summary.json> [presigned_report_url] [presigned_dashboard_url]",
+            file=sys.stderr,
+        )
+        sys.exit(1)
     summary_path = sys.argv[1]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/utils/generate_slack_payloads.py` around lines 28 - 31, The main()
function currently indexes sys.argv directly (summary_path = sys.argv[1],
presigned_report_url = sys.argv[2], presigned_dashboard_url = sys.argv[3]) which
can raise IndexError when invoked without arguments; add CLI argument validation
at the start of main() to check len(sys.argv) (e.g., require at least 2 elements
for summary_path), and if missing print a usage message (showing expected
arguments) and exit non-zero before attempting to access sys.argv[1], while
still using optional defaults for presigned_report_url and
presigned_dashboard_url when those indices are absent.
ci/utils/junit_helpers.py-82-85 (1)

82-85: ⚠️ Potential issue | 🟡 Minor

Reject --sep when no value is provided.

Lines 82-85 currently ignore --sep without a value, which silently falls back to ".". Fail fast with a usage error instead.

Suggested fix
         sep = "."
         for i, arg in enumerate(sys.argv[3:], 3):
-            if arg == "--sep" and i + 1 < len(sys.argv):
-                sep = sys.argv[i + 1]
+            if arg == "--sep":
+                if i + 1 >= len(sys.argv):
+                    print("--sep requires a value", file=sys.stderr)
+                    sys.exit(1)
+                sep = sys.argv[i + 1]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/utils/junit_helpers.py` around lines 82 - 85, The argument parsing loop
silently ignores a bare "--sep" and falls back to "."; change the parsing in the
sys.argv loop so that when arg == "--sep" and there is no next element (i + 1 >=
len(sys.argv)) the program fails fast with a usage error (e.g., write a clear
error to stderr and exit with non-zero or raise SystemExit) instead of leaving
sep unchanged; keep the rest of the logic (setting sep from sys.argv[i + 1] and
calling extract_tests(xml_path, status=cmd, sep=sep)) intact so behaviour only
changes when "--sep" is provided without a value.
ci/utils/junit_helpers.py-18-33 (1)

18-33: ⚠️ Potential issue | 🟡 Minor

Use defusedxml for hardened JUnit XML parsing.

Line 31 uses stdlib ElementTree.parse(), which Ruff S314 flags for potential XML parser vulnerabilities. While JUnit XML in CI comes from test harness outputs (trusted source), switching to defusedxml provides defensive hardening against parser attacks.

Suggested fix
 import sys
-from xml.etree import ElementTree
+from defusedxml import ElementTree
+from defusedxml.common import DefusedXmlException
@@
-    except (ElementTree.ParseError, FileNotFoundError, OSError):
+    except (
+        ElementTree.ParseError,
+        DefusedXmlException,
+        FileNotFoundError,
+        OSError,
+    ):
         return

Note: Verify defusedxml can be added as a dependency for this CI utility script.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/utils/junit_helpers.py` around lines 18 - 33, Replace the use of the
stdlib XML parser in extract_tests with defusedxml's safe parser: change the
import and call to use defusedxml.ElementTree.parse (or alias
defusedxml.ElementTree as ElementTree) instead of xml.etree.ElementTree, and
update the exception tuple to catch the parser's ParseError from defusedxml
(keep FileNotFoundError and OSError). Specifically, update the module import at
top and the parse call inside extract_tests so ElementTree.parse uses
defusedxml's implementation while preserving the existing error handling.
ci/test_wheel_cuopt_server.sh-45-55 (1)

45-55: ⚠️ Potential issue | 🟡 Minor

Doc test failures may not affect the exit code.

The ERR trap combined with set +e on line 47 means the trap is disabled for subsequent commands. If test_doc_examples.sh (line 54) fails, the ERR trap won't fire and EXITCODE will remain 0.

If doc test failures should also be captured:

🔧 Proposed fix to capture doc test failures
-# Run documentation tests
-./ci/test_doc_examples.sh
+# Run documentation tests
+./ci/test_doc_examples.sh || EXITCODE=1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/test_wheel_cuopt_server.sh` around lines 45 - 55, The current script
disables the ERR trap by using "set +e" so failures in test_doc_examples.sh
won't update EXITCODE; restore failure detection by either removing "set +e"
(use "set -e" or no change) or explicitly check test_doc_examples.sh's exit
status and set EXITCODE=1 on failure — e.g., after calling
./ci/test_doc_examples.sh, test "$?" -ne 0 && EXITCODE=1; keep references to
EXITCODE, the ERR trap, set +e, and the ./ci/test_doc_examples.sh invocation
when making the change.
ci/build_summary.sh-26-30 (1)

26-30: ⚠️ Potential issue | 🟡 Minor

GitHub API pagination may miss jobs.

The API call uses per_page=100 but doesn't handle pagination. Workflows with more than 100 jobs would have incomplete data.

🔧 Consider adding pagination or increasing awareness

For most workflows, 100 jobs is sufficient. If this becomes an issue, you'd need to loop through pages. For now, consider adding a comment acknowledging the limit:

     curl -s -L \
         -H "Authorization: Bearer ${GITHUB_TOKEN}" \
         -H "Accept: application/vnd.github+json" \
+        # Note: Limited to first 100 jobs. Add pagination if workflow grows larger.
         "https://api.github.com/repos/${GITHUB_REPOSITORY}/actions/runs/${GITHUB_RUN_ID}/jobs?per_page=100" \
         > "${JOBS_FILE}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/build_summary.sh` around lines 26 - 30, The curl call that writes to
"${JOBS_FILE}" for the Actions jobs endpoint uses per_page=100 but doesn't
handle pagination; update the script to either loop through paginated pages
(using the "page" query param and aggregating results before writing to
JOBS_FILE) or at minimum add a clear comment near the curl invocation
referencing the 100-item limit and when to implement pagination; locate the curl
invocation that references GITHUB_RUN_ID and GITHUB_REPOSITORY and modify it to
fetch subsequent pages until no more results (or document the limitation) so
workflows with >100 jobs aren't truncated.
.github/workflows/nightly-summary.yaml-57-57 (1)

57-57: ⚠️ Potential issue | 🟡 Minor

This introduces an actionlint error unless the custom runner label is configured.

linux-amd64-cpu4 is not a built-in label, so the workflow lint will keep failing unless the repo's actionlint.yaml declares it or this job switches to the standard self-hosted label set.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-summary.yaml at line 57, The workflow uses a
non-standard runner label 'runs-on: linux-amd64-cpu4' which triggers actionlint
unless declared; either replace that label with the standard self-hosted set
(e.g., 'runs-on: self-hosted' or 'runs-on: [self-hosted, linux, amd64, cpu4]')
in nightly-summary.yaml, or add the custom label 'linux-amd64-cpu4' to your
actionlint configuration (actionlint.yaml) so the linter recognizes it.
🧹 Nitpick comments (12)
python/cuopt/pyproject.toml (1)

50-50: Consider adding a minimum version constraint for pytest-rerunfailures.

The dependency is added without any version pin. While the comment indicates this list is auto-generated from dependencies.yaml, consider specifying at least a minimum version (e.g., pytest-rerunfailures>=10.0) in the source dependencies.yaml to ensure consistent behavior across CI environments and prevent potential breakage from future incompatible releases.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/pyproject.toml` at line 50, The pytest-rerunfailures entry
currently has no version constraint; update the source dependencies.yaml to pin
a minimum version (for example set pytest-rerunfailures>=10.0) and then
regenerate the auto-generated list so pyproject.toml's dependency line for
"pytest-rerunfailures" includes the minimum version; ensure the final
pyproject.toml contains the string pytest-rerunfailures>=10.0 (or your chosen
minimum) instead of the unpinned name.
python/cuopt_server/cuopt_server/tests/test_flaky_validation.py (1)

15-28: Consider using a pytest fixture for marker cleanup to ensure test isolation.

If the test crashes after creating the marker but before cleanup (e.g., test timeout, process kill), the marker persists and subsequent runs will pass immediately without exercising the flaky detection. A fixture-based approach ensures cleanup even on failures.

Additionally, consider including a unique identifier (e.g., pytest-rerunfailures run ID or timestamp) in the marker path to prevent interference on shared CI runners or parallel test execution.

♻️ Proposed refactor using pytest fixture
 import os
 import tempfile
+import pytest
 
 
-MARKER = os.path.join(tempfile.gettempdir(), "cuopt_flaky_validation_server")
+@pytest.fixture
+def flaky_marker():
+    marker = os.path.join(tempfile.gettempdir(), "cuopt_flaky_validation_server")
+    yield marker
+    # Cleanup on any exit (pass, fail, or error)
+    if os.path.exists(marker):
+        os.remove(marker)
 
 
-def test_flaky_fails_first_passes_on_retry():
-    if os.path.exists(MARKER):
+def test_flaky_fails_first_passes_on_retry(flaky_marker):
+    if os.path.exists(flaky_marker):
         # Second run — pass and clean up
-        os.remove(MARKER)
+        os.remove(flaky_marker)
         assert True, "Passed on retry (flaky validation working)"
     else:
         # First run — create marker and fail
-        with open(MARKER, "w") as f:
+        with open(flaky_marker, "w") as f:
             f.write("first_attempt")
         assert False, (
             "Intentional first-run failure for flaky detection validation"
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt_server/cuopt_server/tests/test_flaky_validation.py` around lines
15 - 28, The test creates a persistent MARKER and doesn't guarantee cleanup;
replace the global MARKER use with a pytest fixture that creates a unique marker
path (use tmp_path/tmp_path_factory or append a uuid/timestamp/pid) and yields
it, then always removes the marker in the fixture teardown to ensure cleanup
even on failures or timeouts; update test_flaky_fails_first_passes_on_retry to
accept that fixture and use the provided marker path instead of the global
MARKER so test isolation and parallel CI runs are safe.
ci/utils/cuopt_rerun_xml.py (1)

60-62: Consider logging when RAPIDS_TESTS_DIR is unset.

The plugin silently returns without writing output if RAPIDS_TESTS_DIR is not set. While this is intentional for non-CI runs, a debug message could help troubleshoot missing rerun XML in CI:

💡 Optional: Add debug output for missing env var
     output_dir = os.environ.get("RAPIDS_TESTS_DIR", "")
     if not output_dir:
+        # Silently skip when not in CI context (no RAPIDS_TESTS_DIR)
         return

Or if you want visibility:

     output_dir = os.environ.get("RAPIDS_TESTS_DIR", "")
     if not output_dir:
+        print("Note: RAPIDS_TESTS_DIR not set, skipping rerun XML output")
         return
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/utils/cuopt_rerun_xml.py` around lines 60 - 62, When RAPIDS_TESTS_DIR is
unset the function silently returns (output_dir =
os.environ.get("RAPIDS_TESTS_DIR", "") followed by if not output_dir: return);
add a debug log immediately before the early return to indicate the env var is
missing so CI runs are easier to troubleshoot. Use the module logger (e.g.,
logging.getLogger(__name__).debug(...)) or the existing logger used in this file
to emit a clear message referencing RAPIDS_TESTS_DIR and the fact that rerun XML
will not be written, placing the call just before the return that checks
output_dir.
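A minimal sketch of the module-logger approach (the function name is illustrative, not the plugin's actual code):

```python
import logging
import os

logger = logging.getLogger(__name__)


def resolve_output_dir():
    # Stand-in for the plugin's early-return check
    output_dir = os.environ.get("RAPIDS_TESTS_DIR", "")
    if not output_dir:
        # DEBUG level stays silent in normal local runs but surfaces
        # in CI when verbose logging is enabled
        logger.debug("RAPIDS_TESTS_DIR not set; rerun XML will not be written")
        return None
    return output_dir
```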
ci/utils/send_nightly_summary.sh (1)

153-158: Hardcoded channel name is ignored by Slack incoming webhooks.

The channel field on line 154 is typically ignored when using Slack incoming webhooks, as the webhook is bound to a specific channel. This is acceptable, but consider removing it to avoid confusion or making it configurable via environment variable if you need flexibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/utils/send_nightly_summary.sh` around lines 153 - 158, The payload object
currently hardcodes the Slack "channel" field which is ignored for incoming
webhooks; update the payload construction in send_nightly_summary.sh to either
remove the "channel" key from the payload (delete the "channel":
"cuopt-regression-testing" entry) or make it configurable via an environment
variable (e.g., read $SLACK_CHANNEL and include "channel" only when set) so the
payload variable reflects actual webhook behavior; adjust any related
documentation or default env setup if you choose the configurable approach.
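In Python terms, the configurable approach amounts to including the channel key only when it is set (the shell script would do the equivalent with jq; `build_payload` is a hypothetical name):

```python
import json
import os


def build_payload(text):
    payload = {"text": text}
    # Include "channel" only when explicitly configured; incoming
    # webhooks ignore it anyway, so omitting it avoids confusion
    channel = os.environ.get("SLACK_CHANNEL")
    if channel:
        payload["channel"] = channel
    return json.dumps(payload)
```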
ci/utils/generate_step_summary.py (1)

16-18: Add basic argument and file error handling.

The script will fail with unhelpful errors if called without arguments or with a non-existent file. Consider adding minimal validation:

🛡️ Proposed fix
 def main():
+    if len(sys.argv) < 2:
+        print("Usage: generate_step_summary.py <consolidated_summary.json>", file=sys.stderr)
+        sys.exit(1)
+    try:
+        with open(sys.argv[1]) as f:
+            d = json.load(f)
+    except (FileNotFoundError, json.JSONDecodeError) as e:
+        print(f"Error loading {sys.argv[1]}: {e}", file=sys.stderr)
+        sys.exit(1)
-    with open(sys.argv[1]) as f:
-        d = json.load(f)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/utils/generate_step_summary.py` around lines 16 - 18, main currently
assumes sys.argv[1] exists and the path is valid; add minimal argument and file
error handling by validating len(sys.argv) (or using argparse) and printing a
usage message and exiting with non-zero when missing, then wrap the
open/sys.argv[1] and json.load calls in try/except to catch FileNotFoundError
and json.JSONDecodeError and log or print a clear error message before exiting;
update the main function (and any helper functions that read the file) to return
a non-zero exit code on error rather than allowing an uncaught exception.
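An argparse variant of the same validation, as the prompt suggests (a sketch, not the script's actual interface):

```python
import argparse
import json
import sys


def load_summary(argv=None):
    parser = argparse.ArgumentParser(
        description="Render a consolidated nightly summary"
    )
    parser.add_argument("summary_json", help="path to consolidated_summary.json")
    args = parser.parse_args(argv)
    try:
        with open(args.summary_json) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError) as e:
        # sys.exit with a string prints to stderr and exits non-zero
        sys.exit(f"Error loading {args.summary_json}: {e}")
```

argparse handles the missing-argument case for free, printing a usage message and exiting with status 2.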
ci/run_ctests.sh (1)

80-85: Quote the ${rc} argument passed to signal_name.

The signal_name function receives an unquoted ${rc} which could cause issues if the variable is empty or contains unexpected characters. While unlikely given the context, quoting is safer and more consistent with the rest of the script.

🔧 Proposed fix
-        echo "CRASH: ${test_name} died from $(signal_name ${rc}) (exit code ${rc})"
+        echo "CRASH: ${test_name} died from $(signal_name "${rc}") (exit code ${rc})"

This same pattern appears at lines 91, 118, 158, and 160. Consider applying consistently.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/run_ctests.sh` around lines 80 - 85, Calls to signal_name use an unquoted
${rc} which can break if rc is empty or contains spaces; update every invocation
(e.g., in the block that calls was_signal_death and echo "CRASH: ${test_name}
died from $(signal_name ${rc}) ...") to pass "${rc}" (quoted) instead of ${rc}.
Search for other usages of signal_name ${rc} (the same pattern appears later in
the script) and change them to signal_name "${rc}" to ensure safe handling; keep
the rest of the logic (was_signal_death, test_name, OVERALL_RC) unchanged.
ci/run_cuopt_server_pytests.sh (2)

23-29: Same JUnit XML extraction pattern as run_cuopt_pytests.sh.

Consider using the more precise pattern --junitxml=* as suggested for the other file.

🔧 More precise pattern
 for arg in "$@"; do
-    if [[ "${arg}" == *"junitxml"*  ]]; then
+    if [[ "${arg}" == --junitxml=* ]]; then
         xml_file="${arg#*=}"
         break
     fi
 done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/run_cuopt_server_pytests.sh` around lines 23 - 29, The JUnit XML
extraction loop uses a loose pattern and should match the exact flag; update the
condition inside the for-loop that inspects args (the block setting xml_file and
breaking) to test for the precise pattern --junitxml=* instead of *"junitxml"*
so xml_file="${arg#*=}" still extracts the filename only when the argument
starts with --junitxml=; keep the same variable name xml_file and break
behavior.

34-49: Consider extracting shared pytest crash handling.

The crash handling logic (lines 34-49) is identical to ci/run_cuopt_pytests.sh. Per the coding guidelines in SKILL.md about avoiding duplication, consider extracting this to a shared helper in crash_helpers.sh:

# In crash_helpers.sh:
run_pytest_with_crash_isolation() {
    local test_dir="$1"
    shift
    # ... shared logic
}

This would reduce maintenance burden and ensure consistent behavior across pytest runners.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/run_cuopt_server_pytests.sh` around lines 34 - 49, Extract the duplicated
crash handling from ci/run_cuopt_server_pytests.sh into a shared helper (e.g.,
crash_helpers.sh) by implementing a function run_pytest_with_crash_isolation
that accepts the pytest arguments and any context (e.g., xml_file, test_dir) and
encapsulates the logic around running pytest, capturing rc, checking
signal_name, honoring IS_NIGHTLY, calling pytest_crash_isolate when needed, and
exiting with rc; then replace the block in run_cuopt_server_pytests.sh (the
current rc/pytest invocation and the if-checks that call signal_name,
pytest_crash_isolate, and exit) with a call to run_pytest_with_crash_isolation,
ensuring you pass xml_file and other required variables and keep references to
pytest_crash_isolate and signal_name intact so behavior is unchanged.
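For reference, the core of the crash-isolation step that such a shared helper would wrap (subtract already-passed tests, retry the rest individually to pinpoint the crasher) is simple set logic; `remaining_after_crash` is a hypothetical name:

```python
def remaining_after_crash(collected, passed):
    # After a crash, drop tests that already passed and retry the
    # remainder one at a time, preserving collection order
    passed_set = set(passed)
    return [t for t in collected if t not in passed_set]
```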
ci/utils/nightly_report_helper.sh (1)

46-55: Consider uname -m for broader compatibility.

The arch command is specific to GNU coreutils and may not be available in all CI environments (e.g., minimal containers). uname -m is more portable:

🔧 More portable architecture detection
     local cuda_tag="cuda${RAPIDS_CUDA_VERSION:-unknown}"
     local arch_tag
-    arch_tag="$(arch)"
+    arch_tag="$(uname -m)"
     local matrix_label="${cuda_tag}-${arch_tag}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/utils/nightly_report_helper.sh` around lines 46 - 55, The script uses the
arch command to set arch_tag which may be missing in some environments; change
the architecture detection in the build matrix block to use uname -m instead
(replace arch invocation used when setting arch_tag and in matrix_label
computation), e.g., set arch_tag="$(uname -m)" so matrix_label (and the py-tag
branch) remains correct and more portable across minimal CI containers.
ci/run_cuopt_pytests.sh (1)

23-30: JUnit XML path extraction could match unintended arguments.

The pattern *"junitxml"* would match any argument containing "junitxml" anywhere. Consider a more precise pattern:

🔧 More precise pattern
 for arg in "$@"; do
-    if [[ "${arg}" == *"junitxml"* ]]; then
+    if [[ "${arg}" == --junitxml=* ]]; then
         xml_file="${arg#*=}"
         break
     fi
 done

This ensures only the exact --junitxml=<path> argument is matched.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/run_cuopt_pytests.sh` around lines 23 - 30, The current argument scan in
the for loop (for arg in "$@") uses a broad pattern (*"junitxml"*) that can
match unintended args; change the conditional to match the exact flag pattern
(e.g., --junitxml=*) so only the intended `--junitxml=<path>` argument is
captured, and still set xml_file using the existing parameter expansion
(xml_file="${arg#*=}") inside the if block; update the if that checks arg to use
the precise `--junitxml=*` pattern so xml_file only receives the proper junitxml
path.
ci/build_summary.sh (1)

140-144: Add timeout to curl requests to prevent hangs.

The curl calls to Slack API have no timeout, which could cause the CI job to hang indefinitely if the Slack API is unresponsive.

🔧 Add timeouts
     RESPONSE=$(curl -s -X POST \
+        --connect-timeout 10 \
+        --max-time 30 \
         -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \
         -H "Content-Type: application/json" \
         --data "${BOT_PAYLOAD}" \
         "https://slack.com/api/chat.postMessage")

Apply similarly to the webhook curl calls at lines 150 and 155-158.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/build_summary.sh` around lines 140 - 144, The curl POST that populates
RESPONSE using BOT_PAYLOAD lacks timeouts and can hang; update that invocation
to include connection and overall timeouts (for example --connect-timeout 10 and
--max-time 30) so the CI job won't block, and apply the same timeout flags to
the other webhook curl calls near the RESPONSE invocation (the subsequent
webhook/Post calls around lines 150 and 155-158) to ensure all external HTTP
requests fail fast on unresponsive endpoints.
ci/dashboard/index.html (1)

241-248: Add timeout to fetch calls to prevent indefinite hangs.

The fetch() calls have no timeout, which could cause the dashboard to hang if S3 is unreachable. Consider using AbortController:

🔧 Add fetch timeout
async function fetchWithTimeout(url, timeoutMs = 10000) {
  const controller = new AbortController();
  const id = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetch(url, { signal: controller.signal });
    clearTimeout(id);
    return response;
  } catch (e) {
    clearTimeout(id);
    throw e;
  }
}

Then replace fetch(S.baseUrl + 'index.json') with fetchWithTimeout(...).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/dashboard/index.html` around lines 241 - 248, The fetch call that loads
index.json can hang indefinitely; add a reusable fetchWithTimeout helper that
uses AbortController (e.g., async function fetchWithTimeout(url, timeoutMs =
10000)) and replace the direct call to fetch(S.baseUrl + 'index.json') with
await fetchWithTimeout(S.baseUrl + 'index.json', timeoutMs). Ensure
fetchWithTimeout clears the timeout on success/failure and rethrows errors so
the existing try/catch around the code that sets S.index and calls showEmpty
still works; update any other fetch usages in this file to use fetchWithTimeout
as well.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ci/nightly_summary.sh`:
- Around line 112-121: The current conditional only runs the Slack send step
when CUOPT_SLACK_WEBHOOK_URL is set, skipping cases where only bot credentials
are provided; update the if test around the block that invokes bash
"${SCRIPT_DIR}/utils/send_consolidated_summary.sh" so it executes when
RAPIDS_BUILD_TYPE="nightly" AND (CUOPT_SLACK_WEBHOOK_URL is non-empty OR both
CUOPT_SLACK_BOT_TOKEN and CUOPT_SLACK_CHANNEL_ID are non-empty), preserving all
exported env vars (CONSOLIDATED_SUMMARY, CONSOLIDATED_HTML, SLACK_WEBHOOK_URL,
SLACK_BOT_TOKEN, SLACK_CHANNEL_ID, PRESIGNED_REPORT_URL,
PRESIGNED_DASHBOARD_URL) passed into the script; adjust the conditional
operators to check the trio CUOPT_SLACK_BOT_TOKEN and CUOPT_SLACK_CHANNEL_ID
together to allow bot-only setups to call send_consolidated_summary.sh.

In `@ci/utils/aggregate_nightly.py`:
- Around line 263-269: The current filter only picks jobs with conclusion ==
"failure"; change it to capture any completed non-success outcomes by selecting
jobs where j.get("conclusion") exists and j["conclusion"] != "success" (e.g., in
the comprehension that defines failed_ci_jobs). Then keep the existing
untracked_failed filter (which references has_test_details) but point it at the
updated failed_ci_jobs variable so timeouts, cancellations, and other
non-success conclusions are included.
- Around line 565-570: The current report_link anchors use raw s3:// URIs
(s3_reports_prefix + report_filename) which browsers can't open; change the
logic around s3_reports_prefix/report_filename to first detect s3:// prefixes
and either (a) convert them to browser-accessible HTTPS URLs (e.g. parse
"s3://bucket/path" into bucket and key and form
"https://{bucket}.s3.amazonaws.com/{key}{report_filename}" or preferably use an
S3 client to call generate_presigned_url for GetObject to produce a temporary
HTTPS link), or (b) if credentials/permissions for presigning are not available,
suppress the "View" column for entries where s3_reports_prefix is a raw s3://
URI; ensure the final link value is passed through _html_escape before
interpolation into report_link.
- Around line 809-819: The injected JSON (built from index_data and
consolidated) must be escaped to prevent sequences like "</script>" becoming
executable; update the inject_script creation in aggregate_nightly.py where
inject_script is composed (and before dashboard_html.replace) to serialize the
data with json.dumps and then sanitize the resulting JSON string(s) by replacing
occurrences of "</" (and optionally "<!--") with their safe escapes (e.g.,
"\\u003C/") or by escaping the forward slash (e.g., "<\\/") so that
window.__EMBEDDED_INDEX__ and window.__EMBEDDED_CONSOLIDATED__ receive safe,
non-executable JSON; ensure you apply this transformation to both index_data and
consolidated before embedding into inject_script.

In `@ci/utils/crash_helpers.sh`:
- Around line 76-82: The current block in crash_helpers.sh calls write_crash_xml
with "${xml_file}" when test_list is empty, which overwrites any partial JUnit
results; instead create a separate crash XML (e.g., derive a new path like
"${xml_file}.crash" or use a new variable CRASH_XML) and pass that to
write_crash_xml so the original "${xml_file}" is preserved; update the branch
where test_list is empty to write the synthetic "pytest-crash"/"PROCESS_CRASH"
entry into the separate crash file and leave "${xml_file}" untouched,
referencing write_crash_xml and the xml_file/test_list variables to locate the
change.
- Around line 66-75: The crash isolation helper pytest_crash_isolate currently
recollects tests from the hardcoded "tests" directory and ignores the original
pytest selection arguments forwarded by ci/run_cuopt_pytests.sh; modify
pytest_crash_isolate to accept and forward the original pytest selection/filter
args (e.g., "$@"/a named parameter like pytest_args) when calling pytest
--collect-only so the collected test_list matches the initial run's
markers/paths, and ensure subsequent retry invocations also use those same args;
update references inside the function (pytest_crash_isolate and the pytest
--collect-only invocation) to use the forwarded args instead of the literal
"tests".
- Around line 129-137: The crash branch only records crashes when attempt ==
PYTEST_MAX_CRASH_RETRIES, so intermediate crash attempts are lost; change the
logic in the block guarded by retry_rc > 128 (using symbols retry_rc, test_id,
attempt, PYTEST_MAX_CRASH_RETRIES) to persist every crash attempt by calling
write_crash_xml (using retry_xml, pytest-crash, test_id) and adding the test to
crash_tests for each crash occurrence (include attempt number in the
message/details so each crash attempt is distinct) instead of only on the final
retry; keep the existing "crashes consistently" behavior when attempt ==
PYTEST_MAX_CRASH_RETRIES but still record earlier crashes.

In `@ci/utils/nightly_report.py`:
- Around line 147-152: The grouping key using key =
f"{r['classname']}::{r['name']}" can collapse distinct tests from different XML
producers; update the grouping in the loop that builds test_groups to include a
stable producer identifier (for example a 'file', 'suite', or 'binary' field on
r) so each test is keyed by classname + name + producer (e.g.,
classname::name::producer_id). Modify the code that computes key in the for r in
results loop and any downstream logic that reads test_groups to use the new
composite key to avoid merging unrelated entries.

In `@ci/utils/send_consolidated_summary.sh`:
- Around line 28-34: The script currently forces SLACK_WEBHOOK_URL to be set
unconditionally which prevents running the bot-postMessage path; change the
variable validation so SLACK_WEBHOOK_URL is only required when the script will
use the webhook fallback. Specifically, remove or relax the unconditional
expansion check for SLACK_WEBHOOK_URL="${SLACK_WEBHOOK_URL:?…}" and instead
validate SLACK_WEBHOOK_URL only inside the fallback branch that posts via the
incoming webhook, leaving SLACK_BOT_TOKEN and SLACK_CHANNEL_ID validation (and
the chat.postMessage branch) intact; ensure the code still errors if neither bot
credentials (SLACK_BOT_TOKEN+SLACK_CHANNEL_ID) nor a webhook are available.
- Around line 61-65: The curl calls that assign RESPONSE using BOT_PAYLOAD must
not be allowed to abort the whole script under set -e; wrap each curl POST (the
lines that set RESPONSE from curl with BOT_PAYLOAD and the other similar curl
blocks) with a temporary disable of errexit (set +e), run curl, capture its exit
code ($?), then re-enable errexit (set -e) and treat failures as warnings: if rc
!= 0 or the JSON RESPONSE contains "ok":false, log a warning (stderr) including
the RESPONSE and continue instead of exiting. Update the blocks that set
RESPONSE and any similar curl invocations to follow this pattern so transient
Slack transport failures degrade to warnings rather than aborting the job.

In `@python/cuopt/cuopt/linear_programming/pyproject.toml`:
- Line 40: The pytest-rerunfailures dependency currently lacks a minimum
version; update the pyproject.toml entry for "pytest-rerunfailures" to
"pytest-rerunfailures>=14" to satisfy the CI script
ci/utils/cuopt_rerun_xml.py's requirement, and make the same change in
dependencies.yaml and python/cuopt_self_hosted/pyproject.toml so all manifests
consistently enforce pytest-rerunfailures v14+.

In `@python/cuopt/cuopt/tests/test_flaky_validation.py`:
- Around line 15-22: The global MARKER file causes cross-run contamination; make
the marker run-scoped by moving MARKER creation into the test and using pytest's
tmp_path (or tmp_path_factory) to create a unique temp file for
test_flaky_fails_first_passes_on_retry; replace references to the global MARKER
with a test-local marker (e.g., marker = tmp_path /
"cuopt_flaky_validation_python") so the test cleans up only its own file and
cannot be affected by other runs.
- Around line 18-29: The test test_flaky_fails_first_passes_on_retry currently
fails on first run unconditionally; modify it to require an explicit opt-in
before performing the fail-first behavior by checking an environment variable or
pytest marker (e.g., os.getenv("RUN_FLAKY_VALIDATION") == "1" or
pytest.config.getoption/marker) at the top of the test and call pytest.skip when
opt-in is not present, leaving the MARKER-based fail-first logic unchanged when
opt-in is enabled; reference test_flaky_fails_first_passes_on_retry and MARKER
to locate and update the test.
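The s3://-to-HTTPS conversion described in option (a) above can be sketched as follows; note that virtual-hosted-style URLs only work for publicly readable objects, otherwise a presigned URL is needed (`s3_to_https` is a hypothetical helper):

```python
from urllib.parse import urlparse


def s3_to_https(uri):
    # "s3://bucket/reports/" -> "https://bucket.s3.amazonaws.com/reports/"
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        return uri  # already a browser-accessible URL
    return f"https://{parsed.netloc}.s3.amazonaws.com{parsed.path}"
```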

---

Minor comments:
In @.github/workflows/nightly-summary.yaml:
- Line 57: The workflow uses a non-standard runner label 'runs-on:
linux-amd64-cpu4' which triggers actionlint unless declared; either replace that
label with the standard self-hosted set (e.g., 'runs-on: self-hosted' or
'runs-on: [self-hosted, linux, amd64, cpu4]') in nightly-summary.yaml, or add
the custom label 'linux-amd64-cpu4' to your actionlint configuration
(actionlint.yaml) so the linter recognizes it.

In `@ci/build_summary.sh`:
- Around line 26-30: The curl call that writes to "${JOBS_FILE}" for the Actions
jobs endpoint uses per_page=100 but doesn't handle pagination; update the script
to either loop through paginated pages (using the "page" query param and
aggregating results before writing to JOBS_FILE) or at minimum add a clear
comment near the curl invocation referencing the 100-item limit and when to
implement pagination; locate the curl invocation that references GITHUB_RUN_ID
and GITHUB_REPOSITORY and modify it to fetch subsequent pages until no more
results (or document the limitation) so workflows with >100 jobs aren't
truncated.

In `@ci/test_wheel_cuopt_server.sh`:
- Around line 45-55: The current script disables the ERR trap by using "set +e"
so failures in test_doc_examples.sh won't update EXITCODE; restore failure
detection by either removing "set +e" (use "set -e" or no change) or explicitly
check test_doc_examples.sh's exit status and set EXITCODE=1 on failure — e.g.,
after calling ./ci/test_doc_examples.sh, test "$?" -ne 0 && EXITCODE=1; keep
references to EXITCODE, the ERR trap, set +e, and the ./ci/test_doc_examples.sh
invocation when making the change.

In `@ci/utils/generate_slack_payloads.py`:
- Around line 28-31: The main() function currently indexes sys.argv directly
(summary_path = sys.argv[1], presigned_report_url = sys.argv[2],
presigned_dashboard_url = sys.argv[3]) which can raise IndexError when invoked
without arguments; add CLI argument validation at the start of main() to check
len(sys.argv) (e.g., require at least 2 elements for summary_path), and if
missing print a usage message (showing expected arguments) and exit non-zero
before attempting to access sys.argv[1], while still using optional defaults for
presigned_report_url and presigned_dashboard_url when those indices are absent.

In `@ci/utils/junit_helpers.py`:
- Around line 82-85: The argument parsing loop silently ignores a bare "--sep"
and falls back to "."; change the parsing in the sys.argv loop so that when arg
== "--sep" and there is no next element (i + 1 >= len(sys.argv)) the program
fails fast with a usage error (e.g., write a clear error to stderr and exit with
non-zero or raise SystemExit) instead of leaving sep unchanged; keep the rest of
the logic (setting sep from sys.argv[i + 1] and calling extract_tests(xml_path,
status=cmd, sep=sep)) intact so behaviour only changes when "--sep" is provided
without a value.
- Around line 18-33: Replace the use of the stdlib XML parser in extract_tests
with defusedxml's safe parser: change the import and call to use
defusedxml.ElementTree.parse (or alias defusedxml.ElementTree as ElementTree)
instead of xml.etree.ElementTree, and update the exception tuple to catch the
parser's ParseError from defusedxml (keep FileNotFoundError and OSError).
Specifically, update the module import at top and the parse call inside
extract_tests so ElementTree.parse uses defusedxml's implementation while
preserving the existing error handling.

In `@datasets/get_test_data.sh`:
- Line 38: The assignment creating s3_uri using CUOPT_S3_URI doesn't ensure a
trailing slash so concatenation can yield invalid paths (e.g.,
s3://bucketci_datasets/...); update the logic that sets local s3_uri to
normalize CUOPT_S3_URI first (for example, append a "/" if CUOPT_S3_URI does not
already end with "/" or strip/ensure a single separator) before adding
"ci_datasets/routing/". Locate the assignment to s3_uri and adjust the string
handling for CUOPT_S3_URI so downstream sync operations get a well-formed S3
URI.

In `@datasets/linear_programming/download_pdlp_test_dataset.sh`:
- Around line 57-58: The s3_uri construction concatenates CUOPT_S3_URI and a
suffix directly, which can produce malformed paths if CUOPT_S3_URI lacks a
trailing slash; update the assignment for s3_uri to normalize joining by
ensuring CUOPT_S3_URI ends with a single slash (e.g., strip any trailing slashes
then append one, or conditionally add a slash when missing) before appending
"ci_datasets/linear_programming/pdlp/"; modify the s3_uri variable assignment
(referencing CUOPT_S3_URI and s3_uri) to perform this normalization so the
resulting S3 URI is always well-formed.
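The trailing-slash normalization asked for in the last two comments comes down to joining with exactly one separator; a sketch using a hypothetical `join_s3` helper:

```python
def join_s3(base, suffix):
    # Strip any trailing "/" from base and leading "/" from suffix,
    # then join with exactly one separator
    return base.rstrip("/") + "/" + suffix.lstrip("/")
```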

---

Nitpick comments:

In `@python/cuopt/pyproject.toml`:
- Line 50: The pytest-rerunfailures entry currently has no version constraint;
update the source dependencies.yaml to pin a minimum version (for example set
pytest-rerunfailures>=10.0) and then regenerate the auto-generated list so
pyproject.toml's dependency line for "pytest-rerunfailures" includes the minimum
version; ensure the final pyproject.toml contains the string
pytest-rerunfailures>=10.0 (or your chosen minimum) instead of the unpinned
name.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: f34de122-bc54-4d2c-aa14-45f9b5db7d6e

📥 Commits

Reviewing files that changed from the base of the PR and between 7323006 and 7b8a8dc.

📒 Files selected for processing (46)
  • .github/workflows/build.yaml
  • .github/workflows/nightly-summary.yaml
  • .github/workflows/pr.yaml
  • .github/workflows/test.yaml
  • ci/build_summary.sh
  • ci/dashboard/index.html
  • ci/nightly_summary.sh
  • ci/run_ctests.sh
  • ci/run_cuopt_pytests.sh
  • ci/run_cuopt_server_pytests.sh
  • ci/test_cpp.sh
  • ci/test_notebooks.sh
  • ci/test_python.sh
  • ci/test_wheel_cuopt.sh
  • ci/test_wheel_cuopt_server.sh
  • ci/thirdparty-testing/run_cvxpy_tests.sh
  • ci/thirdparty-testing/run_pulp_tests.sh
  • ci/thirdparty-testing/run_pyomo_tests.sh
  • ci/utils/aggregate_nightly.py
  • ci/utils/crash_helpers.sh
  • ci/utils/cuopt_rerun_xml.py
  • ci/utils/generate_slack_payloads.py
  • ci/utils/generate_step_summary.py
  • ci/utils/junit_helpers.py
  • ci/utils/nightly_report.py
  • ci/utils/nightly_report_helper.sh
  • ci/utils/s3_helpers.py
  • ci/utils/send_consolidated_summary.sh
  • ci/utils/send_nightly_summary.sh
  • conda/environments/all_cuda-129_arch-aarch64.yaml
  • conda/environments/all_cuda-129_arch-x86_64.yaml
  • conda/environments/all_cuda-131_arch-aarch64.yaml
  • conda/environments/all_cuda-131_arch-x86_64.yaml
  • cpp/tests/utilities/CMakeLists.txt
  • cpp/tests/utilities/test_flaky_validation.cpp
  • datasets/get_test_data.sh
  • datasets/linear_programming/download_pdlp_test_dataset.sh
  • datasets/mip/download_miplib_test_dataset.sh
  • dependencies.yaml
  • python/cuopt/cuopt/linear_programming/pyproject.toml
  • python/cuopt/cuopt/tests/test_flaky_validation.py
  • python/cuopt/pyproject.toml
  • python/cuopt_self_hosted/pyproject.toml
  • python/cuopt_server/cuopt_server/tests/test_flaky_validation.py
  • python/cuopt_server/pyproject.toml
  • skills/cuopt-developer/SKILL.md

Comment thread ci/nightly_summary.sh Outdated
Comment thread ci/utils/aggregate_nightly.py
Comment thread ci/utils/aggregate_nightly.py
Comment thread ci/utils/aggregate_nightly.py
Comment thread ci/utils/crash_helpers.sh
Comment thread ci/utils/send_consolidated_summary.sh
Comment thread ci/utils/send_consolidated_summary.sh Outdated
Comment thread python/cuopt/cuopt/linear_programming/pyproject.toml
Comment thread python/cuopt/cuopt/tests/test_flaky_validation.py Outdated
Comment thread python/cuopt/cuopt/tests/test_flaky_validation.py Outdated
- Escape </script> in dashboard embedded JSON to prevent XSS (#4 critical)
- Make Slack curl non-fatal with timeout + fallback so nightly pipeline
  isn't killed by transient Slack failures (#10)
- Don't overwrite partial JUnit XML when test collection fails after
  crash — write crash marker to a separate file (#6)
- Make flaky validation test markers PID-scoped to prevent cross-run
  contamination (#12)

Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
@rgsl888prabhu
Collaborator Author

@coderabbitai review

- Replace hardcoded @ramakrishna_prabhu with CUOPT_SLACK_MENTION_ID
  env var (accepts Slack user ID or handle, empty disables mentions)
- Fix aws s3 presign stderr warnings polluting the presigned URL
  (2>&1 → 2>/dev/null)

Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
@coderabbitai

coderabbitai Bot commented Apr 22, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (4)
ci/utils/aggregate_nightly.py (1)

50-55: Sort S3 summary URIs before processing for deterministic outputs.

Processing s3_list(...) results in source order can make consolidated failure/flaky section ordering vary run-to-run. A stable sort here avoids noisy report diffs.

♻️ Proposed patch
-    json_uris = [
+    json_uris = sorted([
         u
         for u in uris
         if u.endswith(".json") and not u.endswith("/consolidated.json")
-    ]
+    ])
@@
-        json_uris = [
+        json_uris = sorted([
             u
             for u in uris
             if u.endswith(".json") and not u.endswith("/consolidated.json")
-        ]
+        ])

Also applies to: 67-71

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/utils/aggregate_nightly.py` around lines 50 - 55, The S3 URIs returned by
s3_list(s3_prefix) must be sorted to ensure deterministic consolidated output;
update the code that builds json_uris (the block using uris, s3_list, s3_prefix
and filtering out "/consolidated.json") to sort the uris (e.g., call sorted() /
.sort()) before filtering, and apply the same change to the other similar block
that processes S3 summary URIs (the later block that also builds json_uris from
s3_list) so both places produce a stable, deterministic order.
ci/utils/crash_helpers.sh (2)

72-75: Make the collection cap configurable to avoid silent truncation.

The fixed head -500 can miss the crashing test when the suite grows. Consider a tunable limit with a safe default.

♻️ Proposed change
-    test_list=$(pytest --collect-only -q tests 2>/dev/null | grep "::" | head -500 || echo "")
+    local collect_limit="${PYTEST_CRASH_COLLECT_LIMIT:-500}"
+    test_list=$(pytest --collect-only -q tests 2>/dev/null | grep "::" | head -n "${collect_limit}" || echo "")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/utils/crash_helpers.sh` around lines 72 - 75, The current fixed cap using
head -500 truncates collected tests; make the cap configurable by reading an
environment variable (e.g., TEST_COLLECTION_LIMIT) with a safe default of 500,
validate/fallback to the default if empty or non-numeric, and use that variable
in the pytest collection pipeline that sets test_list (the command around
test_list=$(pytest --collect-only ... | head -500 ...)). Update crash_helpers.sh
to read and validate TEST_COLLECTION_LIMIT before calling head so larger suites
aren't silently truncated while preserving the 500 default.

124-124: Consider keeping per-test retry stderr as an artifact for faster crash triage.

2>/dev/null drops useful native crash breadcrumbs (stack traces/assert output). Redirecting to per-test logs in ${RAPIDS_TESTS_DIR} can improve debugging without changing pass/fail behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/utils/crash_helpers.sh` at line 124, The pytest invocation currently
discards stderr with "2>/dev/null"; preserve crash output by redirecting stderr
to a per-test artifact file under ${RAPIDS_TESTS_DIR} (e.g.,
"${RAPIDS_TESTS_DIR}/${test_id}.stderr") while keeping the same exit-code
capture into retry_rc and junit xml into retry_xml; ensure ${RAPIDS_TESTS_DIR}
exists/writable before running the pytest command and update the pytest line
(the invocation that sets retry_rc) to redirect stderr to that per-test file
instead of /dev/null so pass/fail behavior is unchanged but native crash
breadcrumbs are retained.
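The redirection described above can be sketched as follows; the test_id value, the pytest arguments, and the filename sanitization are illustrative assumptions, not the actual retry-loop code in crash_helpers.sh:

```shell
# Preserve native crash output (stack traces, assert messages) per test
# instead of discarding it with 2>/dev/null.
RAPIDS_TESTS_DIR="${RAPIDS_TESTS_DIR:-/tmp/rapids-tests}"
mkdir -p "${RAPIDS_TESTS_DIR}"

test_id="tests/test_example.py::test_case"   # illustrative
# Sanitize '/' and ':' so the test id is usable as a filename.
stderr_log="${RAPIDS_TESTS_DIR}/$(echo "${test_id}" | tr '/:' '__').stderr"

retry_rc=0
pytest "${test_id}" --junitxml="${RAPIDS_TESTS_DIR}/retry.xml" \
    2>"${stderr_log}" || retry_rc=$?
echo "retry_rc=${retry_rc}, stderr saved to ${stderr_log}"
```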
.github/workflows/nightly-summary.yaml (1)

53-80: Set explicit workflow permissions for GITHUB_TOKEN.

This reusable workflow relies on github.token, but permission scope is currently implicit. Declare minimal required permissions (e.g., actions: read, contents: read) to avoid permission drift and tighten security posture.

Suggested patch
 name: nightly-summary
 
 on:
@@
       CUOPT_SLACK_CHANNEL_ID:
         required: false
 
+permissions:
+  actions: read
+  contents: read
+
 jobs:
   nightly-summary:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-summary.yaml around lines 53 - 80, Add an explicit
top-level permissions block to the reusable workflow to scope the GITHUB_TOKEN
to the minimum required rights (e.g., permissions: actions: read, contents:
read). Insert a permissions: mapping at the workflow root (above jobs) so the
GITHUB_TOKEN used in the nightly-summary job and the Run nightly summary step
has only the declared permissions instead of implicit full access.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/build.yaml:
- Line 251: The workflow uses a custom runner label runs-on: linux-amd64-cpu4
for the build-summary job; confirm that a self-hosted runner with the exact
label "linux-amd64-cpu4" is registered in your repo/org or replace this label
with a supported runner (e.g., ubuntu-latest) or your actual self-hosted label.
Update the same label used in .github/workflows/nightly-summary.yaml (the job
referencing the same runs-on value) to match the registered runner if you change
it. Ensure both workflows reference the identical, registered runner label so
the jobs will be scheduled.

In `@ci/build_summary.sh`:
- Around line 25-29: The curl command that writes to "${JOBS_FILE}" can hang
because it lacks timeout and fallback handling; update the invocation in
build_summary.sh to include connection and overall timeouts (e.g.
--connect-timeout and --max-time), enable --fail and retries (e.g. --retry and
--retry-delay) so transient errors don't block, and after the curl run check its
exit code and, on failure, write a safe fallback (like an empty JSON array or
cached minimal structure) to "${JOBS_FILE}" and emit a clear error/log message
so downstream steps can continue deterministically.

In `@ci/utils/aggregate_nightly.py`:
- Around line 565-570: The report link construction assumes s3_reports_prefix
ends with a slash, causing malformed URLs when it doesn't; update the logic that
builds report_link (which uses s3_reports_prefix, report_filename and
_html_escape) to ensure a single slash separates prefix and filename (e.g.,
append '/' to s3_reports_prefix only if it doesn't already end with one) before
escaping and concatenating so the generated View URL is always valid.
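The normalization that comment asks for can be sketched as a small helper; the function name is illustrative, and aggregate_nightly.py may well inline the expression instead:

```python
def join_report_link(s3_reports_prefix: str, report_filename: str) -> str:
    """Join prefix and filename with exactly one slash between them."""
    return s3_reports_prefix.rstrip("/") + "/" + report_filename

# Both prefix styles yield the same well-formed URL.
print(join_report_link("s3://bucket/reports", "nightly.html"))
print(join_report_link("s3://bucket/reports/", "nightly.html"))
```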

---

Nitpick comments:
In @.github/workflows/nightly-summary.yaml:
- Around line 53-80: Add an explicit top-level permissions block to the reusable
workflow to scope the GITHUB_TOKEN to the minimum required rights (e.g.,
permissions: actions: read, contents: read). Insert a permissions: mapping at
the workflow root (above jobs) so the GITHUB_TOKEN used in the nightly-summary
job and the Run nightly summary step has only the declared permissions instead
of implicit full access.

In `@ci/utils/aggregate_nightly.py`:
- Around line 50-55: The S3 URIs returned by s3_list(s3_prefix) must be sorted
to ensure deterministic consolidated output; update the code that builds
json_uris (the block using uris, s3_list, s3_prefix and filtering out
"/consolidated.json") to sort the uris (e.g., call sorted() / .sort()) before
filtering, and apply the same change to the other similar block that processes
S3 summary URIs (the later block that also builds json_uris from s3_list) so
both places produce a stable, deterministic order.

In `@ci/utils/crash_helpers.sh`:
- Around line 72-75: The current fixed cap using head -500 truncates collected
tests; make the cap configurable by reading an environment variable (e.g.,
TEST_COLLECTION_LIMIT) with a safe default of 500, validate/fallback to the
default if empty or non-numeric, and use that variable in the pytest collection
pipeline that sets test_list (the command around test_list=$(pytest
--collect-only ... | head -500 ...)). Update crash_helpers.sh to read and
validate TEST_COLLECTION_LIMIT before calling head so larger suites aren't
silently truncated while preserving the 500 default.
- Line 124: The pytest invocation currently discards stderr with "2>/dev/null";
preserve crash output by redirecting stderr to a per-test artifact file under
${RAPIDS_TESTS_DIR} (e.g., "${RAPIDS_TESTS_DIR}/${test_id}.stderr") while
keeping the same exit-code capture into retry_rc and junit xml into retry_xml;
ensure ${RAPIDS_TESTS_DIR} exists/writable before running the pytest command and
update the pytest line (the invocation that sets retry_rc) to redirect stderr to
that per-test file instead of /dev/null so pass/fail behavior is unchanged but
native crash breadcrumbs are retained.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: acd206f7-6c7d-48df-b928-08a54a50d71b

📥 Commits

Reviewing files that changed from the base of the PR and between 7b8a8dc and cc52edc.

📒 Files selected for processing (12)
  • .github/workflows/build.yaml
  • .github/workflows/nightly-summary.yaml
  • .github/workflows/test.yaml
  • ci/build_summary.sh
  • ci/nightly_summary.sh
  • ci/utils/aggregate_nightly.py
  • ci/utils/crash_helpers.sh
  • ci/utils/generate_slack_payloads.py
  • ci/utils/send_consolidated_summary.sh
  • cpp/tests/utilities/test_flaky_validation.cpp
  • python/cuopt/cuopt/tests/test_flaky_validation.py
  • python/cuopt_server/cuopt_server/tests/test_flaky_validation.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • python/cuopt/cuopt/tests/test_flaky_validation.py
  • cpp/tests/utilities/test_flaky_validation.cpp
  • ci/utils/send_consolidated_summary.sh
  • ci/nightly_summary.sh
  • .github/workflows/test.yaml
  • ci/utils/generate_slack_payloads.py

Comment thread .github/workflows/build.yaml
Comment thread ci/build_summary.sh Outdated
Comment thread ci/utils/aggregate_nightly.py
- Add --max-time 30 to GitHub API curl calls in build_summary.sh and
  nightly_summary.sh to prevent job hangs on network issues
- Add fallback to empty JSON on curl failure in build_summary.sh
- Normalize trailing slash in report link prefix to prevent broken URLs

Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
@rgsl888prabhu
Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
ci/build_summary.sh (1)

138-142: ⚠️ Potential issue | 🟠 Major

Add timeout + fallback to Slack chat.postMessage call.

At Line 138, this curl has no timeout and no fallback. Under set -e, transient network failures can terminate the script before your structured ok check runs.

Suggested patch
-RESPONSE=$(curl -s -X POST \
+RESPONSE=$(curl -s --max-time 30 -X POST \
     -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \
     -H "Content-Type: application/json" \
     --data "${BOT_PAYLOAD}" \
-    "https://slack.com/api/chat.postMessage")
+    "https://slack.com/api/chat.postMessage" || echo '{"ok":false,"error":"curl_failed"}')
#!/bin/bash
# Verify Slack post curl resilience settings in ci/build_summary.sh
# Expected after fix: RESPONSE curl line includes --max-time and curl_failed fallback.
rg -n -C3 'RESPONSE=\$\(curl|chat\.postMessage|--max-time|curl_failed' ci/build_summary.sh
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/build_summary.sh` around lines 138 - 142, The curl POST that sets RESPONSE
(using BOT_PAYLOAD and chat.postMessage) has no timeout and can cause the script
to exit under set -e on transient failures; add a curl timeout flag (e.g.
--max-time 30) and prevent set -e from aborting by capturing curl's exit code
(run curl ... || true or assign to a temp var), then if the exit code is
non-zero set RESPONSE to a deterministic fallback JSON (e.g.
{"ok":false,"error":"curl_failed"}) so downstream checks of RESPONSE always run;
update the invocation that assigns RESPONSE to reference the new timeout flag
and add the exit-code check/fallback logic around that assignment.
🧹 Nitpick comments (1)
ci/utils/aggregate_nightly.py (1)

421-442: Consider caching SHA lookups for larger runs.

The nested loop to find sha from matrix_grid is O(n×m). For typical CI runs this is fine, but if the failure lists grow large, a precomputed dict would be more efficient.

♻️ Optional optimization
# Build lookup once at the start of generate_consolidated_html
sha_by_matrix = {
    (g["test_type"], g["matrix_label"]): g.get("sha", "unknown")
    for g in agg["matrix_grid"]
}

def _test_name_html(entry):
    name_escaped = _html_escape(entry["name"])
    suite = entry.get("suite", "")
    sha = sha_by_matrix.get(
        (entry.get("test_type"), entry.get("matrix_label")), "unknown"
    )
    # ... rest unchanged
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/utils/aggregate_nightly.py` around lines 421 - 442, The current
_test_name_html function does an O(n×m) lookup over agg["matrix_grid"] to find
sha for each entry; precompute a mapping (e.g. sha_by_matrix keyed by
(g["test_type"], g["matrix_label"])) once at the start of
generate_consolidated_html and then have _test_name_html read sha =
sha_by_matrix.get((entry.get("test_type"), entry.get("matrix_label")),
"unknown"); keep the remaining logic (suite checks and URL construction)
unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@ci/build_summary.sh`:
- Around line 138-142: The curl POST that sets RESPONSE (using BOT_PAYLOAD and
chat.postMessage) has no timeout and can cause the script to exit under set -e
on transient failures; add a curl timeout flag (e.g. --max-time 30) and prevent
set -e from aborting by capturing curl's exit code (run curl ... || true or
assign to a temp var), then if the exit code is non-zero set RESPONSE to a
deterministic fallback JSON (e.g. {"ok":false,"error":"curl_failed"}) so
downstream checks of RESPONSE always run; update the invocation that assigns
RESPONSE to reference the new timeout flag and add the exit-code check/fallback
logic around that assignment.

---

Nitpick comments:
In `@ci/utils/aggregate_nightly.py`:
- Around line 421-442: The current _test_name_html function does an O(n×m)
lookup over agg["matrix_grid"] to find sha for each entry; precompute a mapping
(e.g. sha_by_matrix keyed by (g["test_type"], g["matrix_label"])) once at the
start of generate_consolidated_html and then have _test_name_html read sha =
sha_by_matrix.get((entry.get("test_type"), entry.get("matrix_label")),
"unknown"); keep the remaining logic (suite checks and URL construction)
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: fe4d177c-afb3-4b1d-86d9-96bb1124b906

📥 Commits

Reviewing files that changed from the base of the PR and between cc52edc and 6788736.

📒 Files selected for processing (3)
  • ci/build_summary.sh
  • ci/nightly_summary.sh
  • ci/utils/aggregate_nightly.py
✅ Files skipped from review due to trivial changes (1)
  • ci/nightly_summary.sh

The Slack notification said "NEW failures" when the only failed jobs
were untracked CI jobs (e.g., conda-notebook-tests) with no per-matrix
test reports. These are not new TEST failures — they're CI job failures
without test details.

Split the condition so:
- has_new (actual new test failures) → "NEW failures"
- untracked CI jobs only → "Recurring failures + N CI job(s) failed"
Both still mention the configured user for attention.

Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
@rgsl888prabhu
Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
ci/build_summary.sh (2)

144-146: String comparison with Python boolean is fragile.

The check "${OK}" != "True" relies on Python printing the boolean as "True". This works but could be more explicit.

🔧 Suggested improvement for robustness
-OK=$(echo "${RESPONSE}" | python3 -c "import json,sys; print(json.load(sys.stdin).get('ok',''))" 2>/dev/null || echo "")
-if [ "${OK}" != "True" ]; then
+OK=$(echo "${RESPONSE}" | python3 -c "import json,sys; print('true' if json.load(sys.stdin).get('ok') else 'false')" 2>/dev/null || echo "false")
+if [ "${OK}" != "true" ]; then
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/build_summary.sh` around lines 144 - 146, The current check uses OK=$(echo
"${RESPONSE}" | python3 -c "import json,sys;
print(json.load(sys.stdin).get('ok',''))" ...) and then compares if [ "${OK}" !=
"True" ]; make the comparison robust by extracting the JSON boolean explicitly
and normalizing it before comparing: update the python extraction to emit a
deterministic lowercase string (e.g.,
print(str(json.load(...).get('ok')).lower()) ) or use a JSON tool to output
true/false, then compare OK to "true" (or use a shell test that checks for
non-empty true) so the check on RESPONSE reliably detects a boolean false
regardless of Python capitalization; adjust the references to OK and RESPONSE
accordingly.

145-149: Consider whether Slack failures should fail the job.

Currently the script logs an error but exits successfully (implicit exit 0) when the Slack API call fails. This is likely intentional to avoid failing CI for notification issues, but worth confirming this is the desired behavior.

If Slack failures should fail the build-summary job:

🔧 Optional: Exit with error on Slack failure
 if [ "${OK}" != "True" ]; then
     echo "ERROR: chat.postMessage failed: ${RESPONSE}" >&2
+    exit 1
 else
     echo "Build summary posted to Slack."
 fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/build_summary.sh` around lines 145 - 149, The current block logs Slack API
failures (when OK != "True") but still exits successfully; decide whether Slack
failures should fail the job and implement it: either append an explicit
non-zero exit (e.g., add exit 1) after the existing error echo that
uses ${RESPONSE} to fail the job, or make the behavior configurable by checking
an env var (e.g., SLACK_FAIL_JOB) and call exit 1 only when SLACK_FAIL_JOB is
set; update the branch handling the chat.postMessage response (variables OK and
RESPONSE) accordingly so the script's exit status reflects the chosen policy.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ci/utils/generate_slack_payloads.py`:
- Around line 57-60: The code uses direct dict indexing (e.g., computing
passed_ci_count with j["conclusion"]) which can raise KeyError when fields are
missing; update these to safe access patterns (use j.get("conclusion") ==
"success" or j.get("field", default)) and add sensible defaults where
counts/strings are computed so payload generation never aborts; apply the same
defensive changes to other occurrences noted (places that compute totals or read
summary fields around lines referenced: the blocks computing
total_ci_jobs/passed_ci_count and the other access sites) and ensure any
boolean/number conversions use safe defaults (0 or empty string) so Slack
payload creation proceeds even with partial schemas.
- Around line 145-146: The Slack mrkdwn f-strings are interpolating unescaped
dynamic fields (e.g., name, message, matrix_label, job names) which can break
formatting or create unintended links; add a helper function escape_mrkdwn(text)
in generate_slack_payloads.py that replaces &, <, > and backticks with their
safe equivalents (e.g., &amp;, &lt;, &gt;, and escaped backtick) and use it
wherever dynamic values are injected (for example when building the "text"
f-string and the other mrkdwn lines referenced), i.e., wrap variables like name,
message, matrix_label, job_name with escape_mrkdwn(...) before interpolation so
all mrkdwn fields are safe.
- Around line 115-117: The message "all {passed_ci_count} CI jobs succeeded" can
be misleading when some jobs were cancelled/skipped; update the logic in
generate_slack_payloads.py to only claim "all ... succeeded" when
passed_ci_count equals the number of completed jobs (exclude cancelled/skipped),
e.g. compute completed_jobs_count from the job list (or total_ci_jobs minus
skipped/cancelled) and replace the condition that uses total_ci_jobs with a
check against completed_jobs_count, and change the text to say "all
{passed_ci_count} completed CI jobs succeeded" (update the variables
total_ci_jobs, passed_ci_count, and text accordingly).
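The escape_mrkdwn helper described above can be sketched like this (a later commit mentions an _esc() helper; this version is an illustrative guess, not the actual implementation, and the backtick handling in particular is an assumption since Slack defines no official escape for it):

```python
def escape_mrkdwn(text: str) -> str:
    """Escape characters Slack mrkdwn treats specially.

    Slack's formatting rules require &, <, > to be HTML-escaped; the
    ampersand must be replaced first so the other entities are not
    double-escaped. Backticks are swapped for apostrophes here as an
    assumption, to keep dynamic values from opening inline code spans.
    """
    escaped = (
        text.replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
    )
    return escaped.replace("`", "'")

print(escape_mrkdwn("job <main> & `flaky`"))
```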

---

Nitpick comments:
In `@ci/build_summary.sh`:
- Around line 144-146: The current check uses OK=$(echo "${RESPONSE}" | python3
-c "import json,sys; print(json.load(sys.stdin).get('ok',''))" ...) and then
compares if [ "${OK}" != "True" ]; make the comparison robust by extracting the
JSON boolean explicitly and normalizing it before comparing: update the python
extraction to emit a deterministic lowercase string (e.g.,
print(str(json.load(...).get('ok')).lower()) ) or use a JSON tool to output
true/false, then compare OK to "true" (or use a shell test that checks for
non-empty true) so the check on RESPONSE reliably detects a boolean false
regardless of Python capitalization; adjust the references to OK and RESPONSE
accordingly.
- Around line 145-149: The current block logs Slack API failures (when OK !=
"True") but still exits successfully; decide whether Slack failures should fail
the job and implement it: either append an explicit non-zero exit (e.g., add
exit 1) after the processLogger-styled error echo that uses ${RESPONSE} to fail
the job, or make the behavior configurable by checking an env var (e.g.,
SLACK_FAIL_JOB) and call exit 1 only when SLACK_FAIL_JOB is set; update the
branch handling the chat.postMessage response (variables OK and RESPONSE)
accordingly so the script's exit status reflects the chosen policy.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 92a7ad7d-2de2-4763-8dd8-63cfd1e1b1aa

📥 Commits

Reviewing files that changed from the base of the PR and between 6788736 and 3e2b921.

📒 Files selected for processing (2)
  • ci/build_summary.sh
  • ci/utils/generate_slack_payloads.py

Comment thread ci/utils/generate_slack_payloads.py
Comment on lines +115 to +117
if total_ci_jobs > 0:
    text += f", all {passed_ci_count} CI jobs succeeded"
mention = ""


⚠️ Potential issue | 🟡 Minor

Fix misleading “all CI jobs succeeded” wording.

Line 116 can report an incorrect “all … succeeded” message when some jobs are cancelled/skipped (non-success but not failure).

✏️ Proposed fix
-        if total_ci_jobs > 0:
-            text += f", all {passed_ci_count} CI jobs succeeded"
+        if total_ci_jobs > 0:
+            if passed_ci_count == total_ci_jobs:
+                text += f", all {total_ci_jobs} CI jobs succeeded"
+            else:
+                text += f", {passed_ci_count}/{total_ci_jobs} CI jobs succeeded"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/utils/generate_slack_payloads.py` around lines 115 - 117, The message "all
{passed_ci_count} CI jobs succeeded" can be misleading when some jobs were
cancelled/skipped; update the logic in generate_slack_payloads.py to only claim
"all ... succeeded" when passed_ci_count equals the number of completed jobs
(exclude cancelled/skipped), e.g. compute completed_jobs_count from the job list
(or total_ci_jobs minus skipped/cancelled) and replace the condition that uses
total_ci_jobs with a check against completed_jobs_count, and change the text to
say "all {passed_ci_count} completed CI jobs succeeded" (update the variables
total_ci_jobs, passed_ci_count, and text accordingly).

Comment thread ci/utils/generate_slack_payloads.py Outdated
Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
- Add _esc() helper to escape &, <, > in all dynamic mrkdwn fields
- Add _job_prefix() helper for safe job name extraction
- Replace all j["field"] with j.get("field", default) to prevent
  KeyError on malformed input
- Fix "all CI jobs succeeded" to show N/M when some are cancelled

Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
@rgsl888prabhu
Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
ci/run_ctests.sh (1)

149-149: Edge case: Test names with glob pattern characters.

If a test name contains *, ?, or other glob pattern characters, --gtest_filter may not match exactly since it uses glob patterns. However, this is an extremely rare edge case in practice.

Potential fix if needed

Escape special characters before passing to --gtest_filter:

-"${gt}" --gtest_filter="${tc}" --gtest_output="xml:${retry_xml}" "$@" || retry_rc=$?
+local tc_escaped
+tc_escaped=$(printf '%s' "${tc}" | sed 's/[*?:]/\\&/g')
+"${gt}" --gtest_filter="${tc_escaped}" --gtest_output="xml:${retry_xml}" "$@" || retry_rc=$?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/run_ctests.sh` at line 149, The current invocation passes the test name
variable tc directly to --gtest_filter which treats *, ?, [] as glob patterns;
to avoid accidental pattern matching, pre-escape glob characters in tc (e.g., *,
?, [, ], and possibly \) before the line that runs "${gt}"
--gtest_filter="${tc}" --gtest_output="xml:${retry_xml}" "$@" || retry_rc=$?;
implement a small helper (e.g., escape_gtest_filter) that transforms tc into an
escaped variable (like escaped_tc) by prefixing those characters with
backslashes and then pass escaped_tc to --gtest_filter so the filter matches the
literal test name.
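The escaping the prompt describes can be sketched as a standalone helper. The function name is hypothetical and the transform follows the reviewer's suggested `sed` expression; note that GoogleTest documents `*` and `?` as wildcards and `:` as the pattern separator in `--gtest_filter`, so whether a given gtest version honors backslash escapes should be verified before relying on this.

```shell
#!/usr/bin/env bash
# Hypothetical helper following the reviewer's suggested sed transform:
# backslash-escape the characters --gtest_filter treats specially (* ? :)
# so a test name containing them is passed as a literal pattern.
escape_gtest_filter() {
  printf '%s' "$1" | sed 's/[*?:]/\\&/g'
}
```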
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ci/run_ctests.sh`:
- Line 44: Rename the misleading variable IS_NIGHTLY to BUILD_TYPE and propagate
that rename where it's set from RAPIDS_BUILD_TYPE and where it's compared to
"nightly"; specifically change the assignment
IS_NIGHTLY="${RAPIDS_BUILD_TYPE:-}" to BUILD_TYPE="${RAPIDS_BUILD_TYPE:-}" and
update any later comparison that checks IS_NIGHTLY == "nightly" to compare
BUILD_TYPE == "nightly" instead (ensure all occurrences of IS_NIGHTLY in this
script are updated).

---

Nitpick comments:
In `@ci/run_ctests.sh`:
- Line 149: The current invocation passes the test name variable tc directly to
--gtest_filter which treats *, ?, [] as glob patterns; to avoid accidental
pattern matching, pre-escape glob characters in tc (e.g., *, ?, [, ], and
possibly \) before the line that runs "${gt}" --gtest_filter="${tc}"
--gtest_output="xml:${retry_xml}" "$@" || retry_rc=$?; implement a small helper
(e.g., escape_gtest_filter) that transforms tc into an escaped variable (like
escaped_tc) by prefixing those characters with backslashes and then pass
escaped_tc to --gtest_filter so the filter matches the literal test name.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: a4db18df-cc30-463c-8cf3-fbb554977f47

📥 Commits

Reviewing files that changed from the base of the PR and between 3e2b921 and b3082d2.

📒 Files selected for processing (3)
  • ci/run_ctests.sh
  • ci/utils/crash_helpers.sh
  • ci/utils/generate_slack_payloads.py
✅ Files skipped from review due to trivial changes (1)
  • ci/utils/generate_slack_payloads.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • ci/utils/crash_helpers.sh

Comment thread ci/run_ctests.sh
Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
@rgsl888prabhu rgsl888prabhu marked this pull request as ready for review April 23, 2026 20:00
@rgsl888prabhu rgsl888prabhu requested review from a team as code owners April 23, 2026 20:00
rgsl888prabhu and others added 2 commits April 24, 2026 01:32
PR builds now fail immediately on test failures with no retries,
so authors see exactly what broke. Nightly builds retain:
- pytest --reruns 2 + rerun XML plugin
- gtest per-test retry
- crash isolation
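The PR/nightly split this commit describes can be sketched as a build-type switch. This is a sketch, not the actual script: the helper name is hypothetical, the branching assumes the `RAPIDS_BUILD_TYPE` variable mentioned earlier in the review, and the flag spelling follows pytest-rerunfailures.

```shell
#!/usr/bin/env bash
# Sketch: choose pytest retry flags by build type. PR builds fail fast
# with no retries; nightly builds pass pytest-rerunfailures' --reruns
# flag so flakes are retried and recorded for the report pipeline.
pytest_retry_args() {
  if [[ "${1:-}" == "nightly" ]]; then
    echo "--reruns 2"   # nightly: tolerate flakes, record reruns
  else
    echo ""             # PR: first failure is final, authors see what broke
  fi
}
```

In a real script the returned flags would be spliced into the pytest invocation, e.g. `pytest $(pytest_retry_args "${RAPIDS_BUILD_TYPE:-}") ./tests`.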

Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>

Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change
