Add flaky test detection, retry, crash isolation, and nightly reporting #1098
rgsl888prabhu wants to merge 87 commits into main from
Conversation
Add retry logic for gtest binaries via `GTEST_MAX_RETRIES` (default 1) and pytest reruns via `--reruns 2 --reruns-delay 5`. Tests that fail and then pass on retry are classified as flaky rather than as failures. Add `pytest-rerunfailures` as a test dependency.
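The fail-then-pass classification can be sketched as follows (a minimal illustration; `classify_attempts` and the `"pass"`/`"fail"` encoding are hypothetical, not the PR's actual API):

```python
def classify_attempts(attempts):
    """Classify a test from its ordered attempt outcomes ("pass"/"fail")."""
    if not attempts:
        return "skipped"
    if attempts[-1] == "fail":
        # Still failing after all retries: a real failure.
        return "failed"
    if "fail" in attempts:
        # Failed at least once but passed on a retry: flaky, not failed.
        return "flaky"
    return "passed"
```

With `--reruns 2`, a test that fails twice and passes on the third attempt would yield `["fail", "fail", "pass"]` and be reported as flaky.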
Add matrix-aware nightly test report generator that parses JUnit XML, classifies failures as new/recurring/flaky/stabilized, maintains per-matrix failure history on S3, and outputs Markdown, HTML, and JSON reports. Extract S3 helpers into shared module and shell helper to eliminate duplication across test scripts.
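The four states can be derived from the current run plus the stored per-matrix history; a minimal sketch, assuming a simple boolean interface (the function name and parameters are illustrative, not the generator's actual module):

```python
def classify_test(failed_now, failed_before, passed_on_retry=False):
    """Classify one test against per-matrix failure history.

    failed_before: whether this test id appears in the stored history.
    passed_on_retry: whether the current run only passed after a rerun.
    """
    if passed_on_retry:
        return "flaky"
    if failed_now and failed_before:
        return "recurring"
    if failed_now:
        return "new"
    if failed_before:
        # Was failing in history, passes now.
        return "stabilized"
    return "passing"
```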
Add nightly report generation to cpp, python, wheel-python, wheel-server, and notebook test scripts using the shared helper. Wheel and notebook scripts also gain JUnit XML output and EXITCODE trap pattern for consistent error handling.
Add aggregate_nightly.py to merge per-matrix JSON summaries into a consolidated report with matrix grid. Add Slack notifiers for both per-job and consolidated messages with HTML file upload support. Add nightly_summary.sh wrapper for the post-test aggregation job. Add static HTML dashboard with matrix overview, failure drill-down, and trend charts reading from S3 index.json.
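Merging per-matrix JSON summaries into a consolidated report with a matrix grid might look like this (the summary dict shape is an assumption, not aggregate_nightly.py's actual schema):

```python
def aggregate_matrices(summaries):
    """Combine per-matrix summaries into one report with a matrix grid."""
    grid = {}
    totals = {"passed": 0, "failed": 0, "flaky": 0}
    for s in summaries:
        counts = {k: s.get(k, 0) for k in totals}
        grid[s["matrix"]] = counts          # one grid cell per matrix entry
        for k in totals:
            totals[k] += counts[k]          # roll up into consolidated totals
    return {"matrix_grid": grid, "totals": totals}
```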
Pass Slack webhook secret to all test jobs. Add nightly-summary job that runs after all test jobs complete, aggregates results from S3, sends a consolidated Slack notification, and uploads the dashboard. Pass S3 and Slack secrets via container-options for the custom job.
Add pitfall entries for cross-cutting change discipline: full scope audits, code duplication, CI matrix parallelism, extensibility, and actionable reporting.
Apply ruff formatting to Python files, update copyright years to 2026 in shell scripts, regenerate conda environment files and pyproject.toml from dependencies.yaml, and remove hardcoded version from comment.
custom-job.yaml does not support secret references in container-options. Remove them and make nightly_summary.sh gracefully skip when CUOPT_DATASET_S3_URI is not available.
The shared workflows only support three secret slots. The Slack webhook is only needed by the nightly-summary aggregation job, which uses `secrets: inherit`.
Tests that resolve and then fail again within 14 days are recognized as bouncing rather than as new failures. After 2+ bounces a test is automatically classified as cross-run flaky. Resolved tests generate only one notification; subsequent passes are silent.
CUOPT_BOUNCE_WINDOW_DAYS (default 14) and CUOPT_BOUNCE_THRESHOLD (default 2) can now be set as environment variables to tune flaky test detection without code changes.
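A sketch of how these knobs could drive the bounce logic (the function name and history shape are illustrative assumptions):

```python
import os
from datetime import datetime, timedelta

def bounce_status(resolved_at, refailed_at, prior_bounces):
    """Decide whether a re-failure within the window counts as a bounce,
    and whether accumulated bounces make the test cross-run flaky."""
    window = timedelta(days=int(os.environ.get("CUOPT_BOUNCE_WINDOW_DAYS", "14")))
    threshold = int(os.environ.get("CUOPT_BOUNCE_THRESHOLD", "2"))
    if refailed_at - resolved_at <= window:
        prior_bounces += 1
        status = "cross-run-flaky" if prior_bounces >= threshold else "bouncing"
    else:
        # Outside the window: treat as an unrelated new failure.
        status = "new"
    return status, prior_bounces
```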
The custom-job.yaml reusable workflow does not expose secrets as env vars. Convert nightly-summary to an inline job that directly sets all required secrets (S3, Slack) in the step environment.
In CI, `aws-actions/configure-aws-credentials` sets role-based tokens (`AWS_ACCESS_KEY_ID` + `AWS_SESSION_TOKEN`). The `CUOPT_AWS_*` overrides were replacing these with static keys that lack the session token, causing InvalidToken errors. Now only fall back to `CUOPT_AWS_*` when standard AWS credentials are not already set.

The cuOpt S3 bucket requires `CUOPT_AWS_*` static credentials. The role-based session token from `aws-actions/configure-aws-credentials` was causing InvalidToken errors. Always override with `CUOPT_AWS_*` and unset `AWS_SESSION_TOKEN`, matching the pattern in `datasets/*.sh`.
Move the nightly-summary job out of test.yaml into its own nightly-summary.yaml reusable workflow. Runs in a python:3.12-slim container to avoid PEP 668 externally-managed-environment errors when installing awscli. Also adds workflow_dispatch trigger so the summary can be re-run manually against an earlier test run.
The python:3.12-slim image doesn't include curl, which is needed by send_consolidated_summary.sh for Slack webhook and file upload.
- Filter consolidated.json from S3 aggregation to fix the "unknown" entry
- Migrate Slack file upload from the deprecated `files.upload` to `getUploadURLExternal` + `completeUploadExternal`
- Chunk Slack messages into header/grid/details/links to stay within block and character limits
- Remove the S3 link from Slack in favor of an HTML file attachment
- Add `--junitxml` to Pyomo, CvxPy, and PuLP thirdparty test scripts so failures appear in nightly reports
- Export `RAPIDS_TESTS_DIR` from `test_wheel_cuopt.sh` for subprocesses
…Slack
- Generate presigned S3 URLs (7-day expiry) for the consolidated HTML report and dashboard, linked in Slack messages
- Query the GitHub API for workflow job statuses to surface CI-level failures (notebooks, JuMP, etc.) that don't produce JUnit XML
- Show only failed/flaky matrix entries in Slack instead of listing all passing ones; compact summary line for green runs
- Pass GITHUB_TOKEN and GITHUB_RUN_ID to the nightly-summary container
- Remove temporary test workflow file
Post the main summary (status + links) as a top-level message via chat.postMessage, then post matrix details, failure breakdowns, and the HTML report as thread replies. Keeps the channel clean while preserving full detail in the thread. Falls back to webhook (no threading) if bot token is not available.
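The threading flow can be sketched with the poster injected, so the Slack call (`chat.postMessage`, which returns a `ts` timestamp that replies reference as `thread_ts`) is easy to stub; the helper name is hypothetical:

```python
def post_threaded_summary(post, channel, main_blocks, reply_groups):
    """post(payload) -> Slack-style response dict containing "ts".

    Posts the main summary, then each group of blocks as a thread reply
    anchored to the main message's timestamp."""
    parent = post({"channel": channel, "blocks": main_blocks})
    ts = parent["ts"]
    for blocks in reply_groups:
        post({"channel": channel, "blocks": blocks, "thread_ts": ts})
    return ts
```

In CI, `post` would wrap an HTTPS call with the bot token; with only a webhook, replies cannot be threaded, hence the fallback described above.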
- Filter out per-matrix test jobs (conda-cpp-tests, conda-python-tests, wheel-tests-*) from workflow job status since they are already tracked by S3 summaries. Only surface untracked jobs like notebooks and JuMP.
- Move the full CI job failure list to a thread reply to avoid exceeding Slack's 3000-char block limit. Main message shows a compact summary.
- Chunk CI job details into multiple blocks if needed.
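Chunking details to respect Slack's 3000-character section-block limit can be done greedily; a minimal sketch (the helper name is illustrative):

```python
def chunk_lines(lines, limit=3000):
    """Pack lines into text chunks whose joined length stays within limit."""
    chunks, current = [], ""
    for line in lines:
        candidate = f"{current}\n{line}" if current else line
        if len(candidate) > limit and current:
            # Current chunk is full; start a new one with this line.
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Note that a single line longer than the limit stays as one oversized chunk in this sketch; real input would need truncation as well.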
- Include ALL workflow jobs in the consolidated JSON (not just untracked) with a has_test_details flag to distinguish tracked vs untracked
- Thread reply 1: CI Workflow Status showing every workflow group with pass/fail counts, so new workflows are automatically visible
- Thread reply 2: failing and flaky tests grouped by workflow so users see which workflow has which test issues
- Main message alerts only on untracked CI failures (notebooks, JuMP) since tracked failures already appear in the matrix test grid
- Map CUOPT_AWS_* to standard AWS env vars before `aws s3 presign` so the CLI has credentials in the container
- Log presign failures instead of swallowing them silently
- Reorder thread: test failures/details first, CI workflow overview last
- Main message now lists which workflows have failures by name (e.g., "Failures in: conda-notebook-tests, wheel-tests-cuopt") with per-workflow failure counts
- Add a build-summary job to build.yaml that sends a Slack message after all builds complete, showing pass/fail per build job
- Build summary queries the GitHub API for job statuses, grouped by workflow prefix (cpp-build, wheel-build-cuopt, docs-build, etc.)
Skips all test jobs when summary-only=true, so nightly-summary runs immediately without waiting for GPU runners.
- Embed index.json and consolidated data directly into the dashboard HTML during aggregation so it works on private S3 buckets without runtime fetches (no more 403 errors)
- Dashboard falls back to S3 fetch if embedded data is absent
- Add a summary-only input to test.yaml to skip all test jobs and run only nightly-summary (avoids waiting for GPU runners when testing)
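Embedding the data at aggregation time amounts to replacing a placeholder in the HTML template with an inline `<script>`; a sketch (the `<!--EMBEDDED_DATA-->` marker and `window.EMBEDDED_INDEX` name are assumptions about the template, not the dashboard's actual contract):

```python
import json

def embed_index(html, index_data):
    """Inline index.json into the dashboard HTML so no runtime S3 fetch is needed."""
    marker = "<!--EMBEDDED_DATA-->"
    script = f"<script>window.EMBEDDED_INDEX = {json.dumps(index_data)};</script>"
    return html.replace(marker, script)
```

The page can then check for the embedded global first and fall back to fetching index.json from S3 when it is absent.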
- Test totals: only show failed/flaky counts, skip passed/skipped/total
- CI Workflow Status thread: only list failing workflows, one-line summary for passing ones
Write GitHub API response to a temp file instead of passing it as a shell argument to Python. The jobs JSON for a full build matrix exceeds the OS argument length limit.
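The temp-file hand-off avoids the OS `ARG_MAX` limit on argv size; sketched in Python (function names are illustrative):

```python
import json
import os
import tempfile

def write_jobs_file(jobs_json_text):
    """Persist a large API response to disk and return the path,
    instead of passing the JSON as a command-line argument."""
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w") as f:
        f.write(jobs_json_text)
    return path

def read_jobs(path):
    """Consumer side: load the jobs JSON from the file path argument."""
    with open(path) as f:
        return json.load(f)
```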
Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Adds end-to-end nightly CI aggregation and reporting: new reusable workflows/jobs, S3-backed aggregation and index updates, Slack notifications, rerun/crash-isolation tooling, per-suite nightly report generation (JSON/HTML/Markdown), a static dashboard UI, and multiple CI script and dependency updates to emit and consume JUnit outputs.
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 13
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
🟡 Minor comments (8)
`datasets/get_test_data.sh` (1)

38-38: ⚠️ Potential issue | 🟡 Minor

Normalize `CUOPT_S3_URI` before appending subpaths.

Line 38 assumes a trailing slash; `CUOPT_S3_URI=s3://bucket` produces `s3://bucketci_datasets/routing/` and breaks sync.

Suggested fix:

```diff
- local s3_uri="${CUOPT_S3_URI}ci_datasets/routing/"
+ local s3_base="${CUOPT_S3_URI%/}"
+ local s3_uri="${s3_base}/ci_datasets/routing/"
```
`datasets/linear_programming/download_pdlp_test_dataset.sh` (1)
57-58: ⚠️ Potential issue | 🟡 Minor

Normalize S3 URI joining to avoid malformed paths.

Line 58 directly concatenates `CUOPT_S3_URI` with a suffix. If the configured URI is missing a trailing slash, the resolved S3 path becomes invalid.

Suggested fix:

```diff
- local s3_uri="${CUOPT_S3_URI}ci_datasets/linear_programming/pdlp/"
+ local s3_uri="${CUOPT_S3_URI%/}/ci_datasets/linear_programming/pdlp/"
```
`ci/utils/generate_slack_payloads.py` (1)
28-31: ⚠️ Potential issue | 🟡 Minor

Add CLI argument validation before indexing `sys.argv`.

Line 29 can throw `IndexError` when invoked without arguments; return a usage error instead.

Suggested fix:

```diff
 def main():
+    if len(sys.argv) < 2:
+        print(
+            f"Usage: {sys.argv[0]} <summary.json> [presigned_report_url] [presigned_dashboard_url]",
+            file=sys.stderr,
+        )
+        sys.exit(1)
     summary_path = sys.argv[1]
```
`ci/utils/junit_helpers.py` (1)
82-85: ⚠️ Potential issue | 🟡 Minor

Reject `--sep` when no value is provided.

Lines 82-85 currently ignore `--sep` without a value, which silently falls back to `"."`. Fail fast with a usage error instead.

Suggested fix:

```diff
     sep = "."
     for i, arg in enumerate(sys.argv[3:], 3):
-        if arg == "--sep" and i + 1 < len(sys.argv):
-            sep = sys.argv[i + 1]
+        if arg == "--sep":
+            if i + 1 >= len(sys.argv):
+                print("--sep requires a value", file=sys.stderr)
+                sys.exit(1)
+            sep = sys.argv[i + 1]
```
`ci/utils/junit_helpers.py` (1)
18-33: ⚠️ Potential issue | 🟡 Minor

Use `defusedxml` for hardened JUnit XML parsing.

Line 31 uses stdlib `ElementTree.parse()`, which Ruff S314 flags for potential XML parser vulnerabilities. While JUnit XML in CI comes from test harness outputs (trusted source), switching to `defusedxml` provides defensive hardening against parser attacks.

Suggested fix:

```diff
 import sys
-from xml.etree import ElementTree
+from defusedxml import ElementTree
+from defusedxml.common import DefusedXmlException
@@
-    except (ElementTree.ParseError, FileNotFoundError, OSError):
+    except (
+        ElementTree.ParseError,
+        DefusedXmlException,
+        FileNotFoundError,
+        OSError,
+    ):
         return
```

Note: Verify `defusedxml` can be added as a dependency for this CI utility script.
`ci/test_wheel_cuopt_server.sh` (1)
45-55: ⚠️ Potential issue | 🟡 Minor

Doc test failures may not affect the exit code.

The `ERR` trap combined with `set +e` on line 47 means the trap is disabled for subsequent commands. If `test_doc_examples.sh` (line 54) fails, the `ERR` trap won't fire and `EXITCODE` will remain 0.

If doc test failures should also be captured:

🔧 Proposed fix to capture doc test failures

```diff
-# Run documentation tests
-./ci/test_doc_examples.sh
+# Run documentation tests
+./ci/test_doc_examples.sh || EXITCODE=1
```
`ci/build_summary.sh` (1)
26-30: ⚠️ Potential issue | 🟡 Minor

GitHub API pagination may miss jobs.

The API call uses `per_page=100` but doesn't handle pagination. Workflows with more than 100 jobs would have incomplete data.

For most workflows, 100 jobs is sufficient. If this becomes an issue, you'd need to loop through pages. For now, consider adding a comment acknowledging the limit (placed above the command, since a comment between backslash-continued lines would break it):

```diff
+# Note: Limited to first 100 jobs. Add pagination if workflow grows larger.
 curl -s -L \
   -H "Authorization: Bearer ${GITHUB_TOKEN}" \
   -H "Accept: application/vnd.github+json" \
   "https://api.github.com/repos/${GITHUB_REPOSITORY}/actions/runs/${GITHUB_RUN_ID}/jobs?per_page=100" \
   > "${JOBS_FILE}"
```
`.github/workflows/nightly-summary.yaml` (1)
57-57: ⚠️ Potential issue | 🟡 Minor

This introduces an actionlint error unless the custom runner label is configured.

`linux-amd64-cpu4` is not a built-in label, so the workflow lint will keep failing unless the repo's `actionlint.yaml` declares it or this job switches to the standard self-hosted label set.
🧹 Nitpick comments (12)
python/cuopt/pyproject.toml (1)
50-50: Consider adding a minimum version constraint for `pytest-rerunfailures`.

The dependency is added without any version pin. While the comment indicates this list is auto-generated from `dependencies.yaml`, consider specifying at least a minimum version (e.g., `pytest-rerunfailures>=10.0`) in the source `dependencies.yaml` to ensure consistent behavior across CI environments and prevent potential breakage from future incompatible releases.
`python/cuopt_server/cuopt_server/tests/test_flaky_validation.py` (1)
15-28: Consider using a pytest fixture for marker cleanup to ensure test isolation.

If the test crashes after creating the marker but before cleanup (e.g., test timeout, process kill), the marker persists and subsequent runs will pass immediately without exercising the flaky detection. A fixture-based approach ensures cleanup even on failures.

Additionally, consider including a unique identifier (e.g., `pytest-rerunfailures` run ID or timestamp) in the marker path to prevent interference on shared CI runners or parallel test execution.

♻️ Proposed refactor using pytest fixture

```diff
 import os
 import tempfile
+import pytest

-MARKER = os.path.join(tempfile.gettempdir(), "cuopt_flaky_validation_server")
+@pytest.fixture
+def flaky_marker():
+    marker = os.path.join(tempfile.gettempdir(), "cuopt_flaky_validation_server")
+    yield marker
+    # Cleanup on any exit (pass, fail, or error)
+    if os.path.exists(marker):
+        os.remove(marker)

-def test_flaky_fails_first_passes_on_retry():
-    if os.path.exists(MARKER):
+def test_flaky_fails_first_passes_on_retry(flaky_marker):
+    if os.path.exists(flaky_marker):
         # Second run — pass and clean up
-        os.remove(MARKER)
+        os.remove(flaky_marker)
         assert True, "Passed on retry (flaky validation working)"
     else:
         # First run — create marker and fail
-        with open(MARKER, "w") as f:
+        with open(flaky_marker, "w") as f:
             f.write("first_attempt")
         assert False, (
             "Intentional first-run failure for flaky detection validation"
         )
```
`ci/utils/cuopt_rerun_xml.py` (1)
60-62: Consider logging when `RAPIDS_TESTS_DIR` is unset.

The plugin silently returns without writing output if `RAPIDS_TESTS_DIR` is not set. While this is intentional for non-CI runs, a debug message could help troubleshoot missing rerun XML in CI.

💡 Optional: add debug output for the missing env var

```diff
     output_dir = os.environ.get("RAPIDS_TESTS_DIR", "")
     if not output_dir:
+        # Silently skip when not in CI context (no RAPIDS_TESTS_DIR)
         return
```

Or if you want visibility:

```diff
     output_dir = os.environ.get("RAPIDS_TESTS_DIR", "")
     if not output_dir:
+        print("Note: RAPIDS_TESTS_DIR not set, skipping rerun XML output")
         return
```
`ci/utils/send_nightly_summary.sh` (1)
153-158: Hardcoded channel name is overridden by webhook configuration.

The `channel` field on line 154 is typically ignored when using Slack incoming webhooks, as the webhook is bound to a specific channel. This is acceptable, but consider removing it to avoid confusion, or making it configurable via environment variable if you need flexibility.
`ci/utils/generate_step_summary.py` (1)
16-18: Add basic argument and file error handling.

The script will fail with unhelpful errors if called without arguments or with a non-existent file. Consider adding minimal validation:

🛡️ Proposed fix

```diff
 def main():
-    with open(sys.argv[1]) as f:
-        d = json.load(f)
+    if len(sys.argv) < 2:
+        print("Usage: generate_step_summary.py <consolidated_summary.json>", file=sys.stderr)
+        sys.exit(1)
+    try:
+        with open(sys.argv[1]) as f:
+            d = json.load(f)
+    except (FileNotFoundError, json.JSONDecodeError) as e:
+        print(f"Error loading {sys.argv[1]}: {e}", file=sys.stderr)
+        sys.exit(1)
```
`ci/run_ctests.sh` (1)
80-85: Quote the variable passed to `signal_name`.

The `signal_name` function receives an unquoted `${rc}` which could cause issues if the variable is empty or contains unexpected characters. While unlikely given the context, quoting is safer and more consistent with the rest of the script.

🔧 Proposed fix

```diff
-  echo "CRASH: ${test_name} died from $(signal_name ${rc}) (exit code ${rc})"
+  echo "CRASH: ${test_name} died from $(signal_name "${rc}") (exit code ${rc})"
```

This same pattern appears at lines 91, 118, 158, and 160. Consider applying consistently.
`ci/run_cuopt_server_pytests.sh` (2)
23-29: Same JUnit XML extraction pattern as `run_cuopt_pytests.sh`.

Consider using the more precise pattern `--junitxml=*` as suggested for the other file.

🔧 More precise pattern

```diff
 for arg in "$@"; do
-  if [[ "${arg}" == *"junitxml"* ]]; then
+  if [[ "${arg}" == --junitxml=* ]]; then
     xml_file="${arg#*=}"
     break
   fi
 done
```
34-49: Consider extracting shared pytest crash handling.

The crash handling logic (lines 34-49) is identical to `ci/run_cuopt_pytests.sh`. Per the coding guidelines in SKILL.md about avoiding duplication, consider extracting this to a shared helper in `crash_helpers.sh`:

```bash
# In crash_helpers.sh:
run_pytest_with_crash_isolation() {
  local test_dir="$1"
  shift
  # ... shared logic
}
```

This would reduce maintenance burden and ensure consistent behavior across pytest runners.
`ci/utils/nightly_report_helper.sh` (1)
46-55: Consider `uname -m` for broader compatibility.

The `arch` command is specific to GNU coreutils and may not be available in all CI environments (e.g., minimal containers). `uname -m` is more portable:

🔧 More portable architecture detection

```diff
 local cuda_tag="cuda${RAPIDS_CUDA_VERSION:-unknown}"
 local arch_tag
-arch_tag="$(arch)"
+arch_tag="$(uname -m)"
 local matrix_label="${cuda_tag}-${arch_tag}"
```
`ci/run_cuopt_pytests.sh` (1)
23-30: JUnit XML path extraction could match unintended arguments.

The pattern `*"junitxml"*` would match any argument containing "junitxml" anywhere. Consider a more precise pattern:

🔧 More precise pattern

```diff
 for arg in "$@"; do
-  if [[ "${arg}" == *"junitxml"* ]]; then
+  if [[ "${arg}" == --junitxml=* ]]; then
     xml_file="${arg#*=}"
     break
   fi
 done
```

This ensures only the exact `--junitxml=<path>` argument is matched.
`ci/build_summary.sh` (1)
140-144: Add timeout to curl requests to prevent hangs.

The curl calls to the Slack API have no timeout, which could cause the CI job to hang indefinitely if the Slack API is unresponsive.

🔧 Add timeouts

```diff
 RESPONSE=$(curl -s -X POST \
+  --connect-timeout 10 \
+  --max-time 30 \
   -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \
   -H "Content-Type: application/json" \
   --data "${BOT_PAYLOAD}" \
   "https://slack.com/api/chat.postMessage")
```

Apply similarly to the webhook curl calls at lines 150 and 155-158.
`ci/dashboard/index.html` (1)
241-248: Add timeout to fetch calls to prevent indefinite hangs. The `fetch()` calls have no timeout, which could cause the dashboard to hang if S3 is unreachable. Consider using `AbortController`:
🔧 Add fetch timeout

```js
async function fetchWithTimeout(url, timeoutMs = 10000) {
  const controller = new AbortController();
  const id = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetch(url, { signal: controller.signal });
    clearTimeout(id);
    return response;
  } catch (e) {
    clearTimeout(id);
    throw e;
  }
}
```

Then replace `fetch(S.baseUrl + 'index.json')` with `fetchWithTimeout(...)`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ci/dashboard/index.html` around lines 241 - 248, The fetch call that loads index.json can hang indefinitely; add a reusable fetchWithTimeout helper that uses AbortController (e.g., async function fetchWithTimeout(url, timeoutMs = 10000)) and replace the direct call to fetch(S.baseUrl + 'index.json') with await fetchWithTimeout(S.baseUrl + 'index.json', timeoutMs). Ensure fetchWithTimeout clears the timeout on success/failure and rethrows errors so the existing try/catch around the code that sets S.index and calls showEmpty still works; update any other fetch usages in this file to use fetchWithTimeout as well.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ci/nightly_summary.sh`:
- Around line 112-121: The current conditional only runs the Slack send step
when CUOPT_SLACK_WEBHOOK_URL is set, skipping cases where only bot credentials
are provided; update the if test around the block that invokes bash
"${SCRIPT_DIR}/utils/send_consolidated_summary.sh" so it executes when
RAPIDS_BUILD_TYPE="nightly" AND (CUOPT_SLACK_WEBHOOK_URL is non-empty OR both
CUOPT_SLACK_BOT_TOKEN and CUOPT_SLACK_CHANNEL_ID are non-empty), preserving all
exported env vars (CONSOLIDATED_SUMMARY, CONSOLIDATED_HTML, SLACK_WEBHOOK_URL,
SLACK_BOT_TOKEN, SLACK_CHANNEL_ID, PRESIGNED_REPORT_URL,
PRESIGNED_DASHBOARD_URL) passed into the script; adjust the conditional
operators to check the trio CUOPT_SLACK_BOT_TOKEN and CUOPT_SLACK_CHANNEL_ID
together to allow bot-only setups to call send_consolidated_summary.sh.
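The relaxed conditional described above can be sketched as a standalone shell check. The function name `validate_slack_config` and the example values are hypothetical; only the environment variable names come from the comment:

```shell
# Accept either a webhook URL or the bot credential pair (token + channel);
# fail only when neither is configured.
validate_slack_config() {
  if [ -n "${CUOPT_SLACK_WEBHOOK_URL:-}" ]; then
    return 0
  fi
  if [ -n "${CUOPT_SLACK_BOT_TOKEN:-}" ] && [ -n "${CUOPT_SLACK_CHANNEL_ID:-}" ]; then
    return 0
  fi
  echo "ERROR: set CUOPT_SLACK_WEBHOOK_URL or both CUOPT_SLACK_BOT_TOKEN and CUOPT_SLACK_CHANNEL_ID" >&2
  return 1
}

# Bot-only setup (no webhook) is now accepted:
CUOPT_SLACK_BOT_TOKEN="xoxb-example"
CUOPT_SLACK_CHANNEL_ID="C0123456"
validate_slack_config && echo "slack config ok"
```

The same check can gate the call to send_consolidated_summary.sh alongside the existing `RAPIDS_BUILD_TYPE="nightly"` test.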
In `@ci/utils/aggregate_nightly.py`:
- Around line 263-269: The current filter only picks jobs with conclusion ==
"failure"; change it to capture any completed non-success outcomes by selecting
jobs where j.get("conclusion") exists and j["conclusion"] != "success" (e.g., in
the comprehension that defines failed_ci_jobs). Then keep the existing
untracked_failed filter (which references has_test_details) but point it at the
updated failed_ci_jobs variable so timeouts, cancellations, and other
non-success conclusions are included.
- Around line 565-570: The current report_link anchors use raw s3:// URIs
(s3_reports_prefix + report_filename) which browsers can't open; change the
logic around s3_reports_prefix/report_filename to first detect s3:// prefixes
and either (a) convert them to browser-accessible HTTPS URLs (e.g. parse
"s3://bucket/path" into bucket and key and form
"https://{bucket}.s3.amazonaws.com/{key}{report_filename}" or preferably use an
S3 client to call generate_presigned_url for GetObject to produce a temporary
HTTPS link), or (b) if credentials/permissions for presigning are not available,
suppress the "View" column for entries where s3_reports_prefix is a raw s3://
URI; ensure the final link value is passed through _html_escape before
interpolation into report_link.
- Around line 809-819: The injected JSON (built from index_data and
consolidated) must be escaped to prevent sequences like "</script>" becoming
executable; update the inject_script creation in aggregate_nightly.py where
inject_script is composed (and before dashboard_html.replace) to serialize the
data with json.dumps and then sanitize the resulting JSON string(s) by replacing
occurrences of "</" (and optionally "<!--") with their safe escapes (e.g.,
"\\u003C/") or by escaping the forward slash (e.g., "<\\/") so that
window.__EMBEDDED_INDEX__ and window.__EMBEDDED_CONSOLIDATED__ receive safe,
non-executable JSON; ensure you apply this transformation to both index_data and
consolidated before embedding into inject_script.
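The escaping idea above — making `</script>` inert inside embedded JSON — can be sketched in shell with `sed` (the real code would apply the same transformation in Python to the `json.dumps` output; the sample payload here is made up). Escaping the forward slash (`</` → `<\/`) is still valid JSON but can no longer terminate the `<script>` tag:

```shell
# A hostile payload that would otherwise close the script tag early.
raw_json='{"note":"</script><script>alert(1)</script>"}'

# Replace every "</" with "<\/" before embedding into the page.
safe_json=$(printf '%s' "${raw_json}" | sed 's,</,<\\/,g')

inject_script="window.__EMBEDDED_INDEX__ = ${safe_json};"
echo "${inject_script}"
```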
In `@ci/utils/crash_helpers.sh`:
- Around line 76-82: The current block in crash_helpers.sh calls write_crash_xml
with "${xml_file}" when test_list is empty, which overwrites any partial JUnit
results; instead create a separate crash XML (e.g., derive a new path like
"${xml_file}.crash" or use a new variable CRASH_XML) and pass that to
write_crash_xml so the original "${xml_file}" is preserved; update the branch
where test_list is empty to write the synthetic "pytest-crash"/"PROCESS_CRASH"
entry into the separate crash file and leave "${xml_file}" untouched,
referencing write_crash_xml and the xml_file/test_list variables to locate the
change.
- Around line 66-75: The crash isolation helper pytest_crash_isolate currently
recollects tests from the hardcoded "tests" directory and ignores the original
pytest selection arguments forwarded by ci/run_cuopt_pytests.sh; modify
pytest_crash_isolate to accept and forward the original pytest selection/filter
args (e.g., "$@"/a named parameter like pytest_args) when calling pytest
--collect-only so the collected test_list matches the initial run's
markers/paths, and ensure subsequent retry invocations also use those same args;
update references inside the function (pytest_crash_isolate and the pytest
--collect-only invocation) to use the forwarded args instead of the literal
"tests".
- Around line 129-137: The crash branch only records crashes when attempt ==
PYTEST_MAX_CRASH_RETRIES, so intermediate crash attempts are lost; change the
logic in the block guarded by retry_rc > 128 (using symbols retry_rc, test_id,
attempt, PYTEST_MAX_CRASH_RETRIES) to persist every crash attempt by calling
write_crash_xml (using retry_xml, pytest-crash, test_id) and adding the test to
crash_tests for each crash occurrence (include attempt number in the
message/details so each crash attempt is distinct) instead of only on the final
retry; keep the existing "crashes consistently" behavior when attempt ==
PYTEST_MAX_CRASH_RETRIES but still record earlier crashes.
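A minimal sketch of the "separate crash file" suggestion above; the paths and the inline JUnit skeleton are illustrative stand-ins for the repo's `write_crash_xml` helper:

```shell
# Derive a sibling crash file so the partial JUnit XML from the interrupted
# run is preserved rather than overwritten.
xml_file="results/pytest.xml"
crash_xml="${xml_file%.xml}.crash.xml"   # -> results/pytest.crash.xml

mkdir -p "$(dirname "${crash_xml}")"
cat > "${crash_xml}" <<EOF
<?xml version="1.0"?>
<testsuite name="pytest-crash" tests="1" failures="1">
  <testcase classname="pytest-crash" name="PROCESS_CRASH">
    <failure message="test process crashed before producing results"/>
  </testcase>
</testsuite>
EOF
echo "crash record: ${crash_xml} (original ${xml_file} left untouched)"
```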
In `@ci/utils/nightly_report.py`:
- Around line 147-152: The grouping key using key =
f"{r['classname']}::{r['name']}” can collapse distinct tests from different XML
producers; update the grouping in the loop that builds test_groups to include a
stable producer identifier (for example a 'file', 'suite', or 'binary' field on
r) so each test is keyed by classname + name + producer (e.g.,
classname::name::producer_id). Modify the code that computes key in the for r in
results loop and any downstream logic that reads test_groups to use the new
composite key to avoid merging unrelated entries.
In `@ci/utils/send_consolidated_summary.sh`:
- Around line 28-34: The script currently forces SLACK_WEBHOOK_URL to be set
unconditionally which prevents running the bot-postMessage path; change the
variable validation so SLACK_WEBHOOK_URL is only required when the script will
use the webhook fallback. Specifically, remove or relax the unconditional
expansion check for SLACK_WEBHOOK_URL="${SLACK_WEBHOOK_URL:?…}" and instead
validate SLACK_WEBHOOK_URL only inside the fallback branch that posts via the
incoming webhook, leaving SLACK_BOT_TOKEN and SLACK_CHANNEL_ID validation (and
the chat.postMessage branch) intact; ensure the code still errors if neither bot
credentials (SLACK_BOT_TOKEN+SLACK_CHANNEL_ID) nor a webhook are available.
- Around line 61-65: The curl calls that assign RESPONSE using BOT_PAYLOAD must
not be allowed to abort the whole script under set -e; wrap each curl POST (the
lines that set RESPONSE from curl with BOT_PAYLOAD and the other similar curl
blocks) with a temporary disable of errexit (set +e), run curl, capture its exit
code ($?), then re-enable errexit (set -e) and treat failures as warnings: if rc
!= 0 or the JSON RESPONSE contains "ok":false, log a warning (stderr) including
the RESPONSE and continue instead of exiting. Update the blocks that set
RESPONSE and any similar curl invocations to follow this pattern so transient
Slack transport failures degrade to warnings rather than aborting the job.
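The `set +e` wrapping pattern above, combined with timeouts, might look like the following self-contained sketch; the function name is invented, and the webhook URL is a deliberately unreachable TEST-NET address so the warning path is exercised:

```shell
set -e   # errexit on, as in the real script

post_to_slack() {
  local url="$1" payload="$2" rc
  set +e                                  # keep curl failures from aborting the job
  RESPONSE=$(curl -s -X POST \
    --connect-timeout 2 --max-time 5 \
    -H "Content-Type: application/json" \
    --data "${payload}" \
    "${url}")
  rc=$?
  set -e
  if [ "${rc}" -ne 0 ] || printf '%s' "${RESPONSE}" | grep -q '"ok":false'; then
    echo "WARNING: Slack post failed (rc=${rc}): ${RESPONSE}" >&2
    return 0                              # degrade to a warning, don't fail the job
  fi
}

post_to_slack "http://203.0.113.1/webhook" '{"text":"nightly summary"}'
echo "pipeline continued"
```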
In `@python/cuopt/cuopt/linear_programming/pyproject.toml`:
- Line 40: The pytest-rerunfailures dependency currently lacks a minimum
version; update the pyproject.toml entry for "pytest-rerunfailures" to
"pytest-rerunfailures>=14" to satisfy the CI script
ci/utils/cuopt_rerun_xml.py's requirement, and make the same change in
dependencies.yaml and python/cuopt_self_hosted/pyproject.toml so all manifests
consistently enforce pytest-rerunfailures v14+.
In `@python/cuopt/cuopt/tests/test_flaky_validation.py`:
- Around line 15-22: The global MARKER file causes cross-run contamination; make
the marker run-scoped by moving MARKER creation into the test and using pytest's
tmp_path (or tmp_path_factory) to create a unique temp file for
test_flaky_fails_first_passes_on_retry; replace references to the global MARKER
with a test-local marker (e.g., marker = tmp_path /
"cuopt_flaky_validation_python") so the test cleans up only its own file and
cannot be affected by other runs.
- Around line 18-29: The test test_flaky_fails_first_passes_on_retry currently
fails on first run unconditionally; modify it to require an explicit opt-in
before performing the fail-first behavior by checking an environment variable or
pytest marker (e.g., os.getenv("RUN_FLAKY_VALIDATION") == "1" or
pytest.config.getoption/marker) at the top of the test and call pytest.skip when
opt-in is not present, leaving the MARKER-based fail-first logic unchanged when
opt-in is enabled; reference test_flaky_fails_first_passes_on_retry and MARKER
to locate and update the test.
---
Minor comments:
In @.github/workflows/nightly-summary.yaml:
- Line 57: The workflow uses a non-standard runner label 'runs-on:
linux-amd64-cpu4' which triggers actionlint unless declared; either replace that
label with the standard self-hosted set (e.g., 'runs-on: self-hosted' or
'runs-on: [self-hosted, linux, amd64, cpu4]') in nightly-summary.yaml, or add
the custom label 'linux-amd64-cpu4' to your actionlint configuration
(actionlint.yaml) so the linter recognizes it.
In `@ci/build_summary.sh`:
- Around line 26-30: The curl call that writes to "${JOBS_FILE}" for the Actions
jobs endpoint uses per_page=100 but doesn't handle pagination; update the script
to either loop through paginated pages (using the "page" query param and
aggregating results before writing to JOBS_FILE) or at minimum add a clear
comment near the curl invocation referencing the 100-item limit and when to
implement pagination; locate the curl invocation that references GITHUB_RUN_ID
and GITHUB_REPOSITORY and modify it to fetch subsequent pages until no more
results (or document the limitation) so workflows with >100 jobs aren't
truncated.
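The pagination loop can be sketched with a mock in place of the real GitHub API call; `fetch_jobs_page` and its payloads are invented for illustration:

```shell
# Mock of the paginated Actions jobs endpoint: two pages, then an empty page.
fetch_jobs_page() {
  case "$1" in
    1) echo '[{"name":"test-cpp"},{"name":"test-python"}]' ;;
    2) echo '[{"name":"nightly-summary"}]' ;;
    *) echo '[]' ;;
  esac
}

JOBS_FILE="$(mktemp)"
page=1
all_jobs=""
while :; do
  page_json="$(fetch_jobs_page "${page}")"
  [ "${page_json}" = "[]" ] && break      # no more results
  all_jobs="${all_jobs}${page_json}"
  page=$((page + 1))
done
printf '%s\n' "${all_jobs}" > "${JOBS_FILE}"
```

In the real script the mock would be a curl call with `page=${page}&per_page=100` appended to the jobs URL.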
In `@ci/test_wheel_cuopt_server.sh`:
- Around line 45-55: The current script disables the ERR trap by using "set +e"
so failures in test_doc_examples.sh won't update EXITCODE; restore failure
detection by either removing "set +e" (use "set -e" or no change) or explicitly
check test_doc_examples.sh's exit status and set EXITCODE=1 on failure — e.g.,
after calling ./ci/test_doc_examples.sh, test "$?" -ne 0 && EXITCODE=1; keep
references to EXITCODE, the ERR trap, set +e, and the ./ci/test_doc_examples.sh
invocation when making the change.
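The explicit exit-status check suggested above can be sketched with a hypothetical `run_step` standing in for `./ci/test_doc_examples.sh`:

```shell
# Under `set +e` the ERR trap no longer fires, so record failures explicitly.
set +e
run_step() { return "$1"; }       # stand-in for ./ci/test_doc_examples.sh

EXITCODE=0
run_step 0 || EXITCODE=1          # success: EXITCODE stays 0
run_step 2 || EXITCODE=1          # failure: recorded instead of silently lost
echo "EXITCODE=${EXITCODE}"
```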
In `@ci/utils/generate_slack_payloads.py`:
- Around line 28-31: The main() function currently indexes sys.argv directly
(summary_path = sys.argv[1], presigned_report_url = sys.argv[2],
presigned_dashboard_url = sys.argv[3]) which can raise IndexError when invoked
without arguments; add CLI argument validation at the start of main() to check
len(sys.argv) (e.g., require at least 2 elements for summary_path), and if
missing print a usage message (showing expected arguments) and exit non-zero
before attempting to access sys.argv[1], while still using optional defaults for
presigned_report_url and presigned_dashboard_url when those indices are absent.
In `@ci/utils/junit_helpers.py`:
- Around line 82-85: The argument parsing loop silently ignores a bare "--sep"
and falls back to "."; change the parsing in the sys.argv loop so that when arg
== "--sep" and there is no next element (i + 1 >= len(sys.argv)) the program
fails fast with a usage error (e.g., write a clear error to stderr and exit with
non-zero or raise SystemExit) instead of leaving sep unchanged; keep the rest of
the logic (setting sep from sys.argv[i + 1] and calling extract_tests(xml_path,
status=cmd, sep=sep)) intact so behaviour only changes when "--sep" is provided
without a value.
- Around line 18-33: Replace the use of the stdlib XML parser in extract_tests
with defusedxml's safe parser: change the import and call to use
defusedxml.ElementTree.parse (or alias defusedxml.ElementTree as ElementTree)
instead of xml.etree.ElementTree, and update the exception tuple to catch the
parser's ParseError from defusedxml (keep FileNotFoundError and OSError).
Specifically, update the module import at top and the parse call inside
extract_tests so ElementTree.parse uses defusedxml's implementation while
preserving the existing error handling.
In `@datasets/get_test_data.sh`:
- Line 38: The assignment creating s3_uri using CUOPT_S3_URI doesn't ensure a
trailing slash so concatenation can yield invalid paths (e.g.,
s3://bucketci_datasets/...); update the logic that sets local s3_uri to
normalize CUOPT_S3_URI first (for example, append a "/" if CUOPT_S3_URI does not
already end with "/" or strip/ensure a single separator) before adding
"ci_datasets/routing/". Locate the assignment to s3_uri and adjust the string
handling for CUOPT_S3_URI so downstream sync operations get a well-formed S3
URI.
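One way to normalize the join, sketched as a small helper; `join_s3_path` is an invented name, and only `CUOPT_S3_URI` comes from the script:

```shell
# Ensure exactly one '/' separates the base URI and the suffix, regardless of
# whether the configured value ends with a slash.
join_s3_path() {
  local base="$1" suffix="$2"
  base="${base%/}"                 # strip any trailing slash
  echo "${base}/${suffix}"
}

CUOPT_S3_URI="s3://bucket"         # value without trailing slash
s3_uri="$(join_s3_path "${CUOPT_S3_URI}" "ci_datasets/routing/")"
echo "${s3_uri}"
```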
In `@datasets/linear_programming/download_pdlp_test_dataset.sh`:
- Around line 57-58: The s3_uri construction concatenates CUOPT_S3_URI and a
suffix directly, which can produce malformed paths if CUOPT_S3_URI lacks a
trailing slash; update the assignment for s3_uri to normalize joining by
ensuring CUOPT_S3_URI ends with a single slash (e.g., strip any trailing slashes
then append one, or conditionally add a slash when missing) before appending
"ci_datasets/linear_programming/pdlp/"; modify the s3_uri variable assignment
(referencing CUOPT_S3_URI and s3_uri) to perform this normalization so the
resulting S3 URI is always well-formed.
---
Nitpick comments:
In `@ci/build_summary.sh`:
- Around line 140-144: The curl POST that populates RESPONSE using BOT_PAYLOAD
lacks timeouts and can hang; update that invocation to include connection and
overall timeouts (for example --connect-timeout 10 and --max-time 30) so the CI
job won't block, and apply the same timeout flags to the other webhook curl
calls near the RESPONSE invocation (the subsequent webhook/Post calls around
lines 150 and 155-158) to ensure all external HTTP requests fail fast on
unresponsive endpoints.
In `@ci/dashboard/index.html`:
- Around line 241-248: The fetch call that loads index.json can hang
indefinitely; add a reusable fetchWithTimeout helper that uses AbortController
(e.g., async function fetchWithTimeout(url, timeoutMs = 10000)) and replace the
direct call to fetch(S.baseUrl + 'index.json') with await
fetchWithTimeout(S.baseUrl + 'index.json', timeoutMs). Ensure fetchWithTimeout
clears the timeout on success/failure and rethrows errors so the existing
try/catch around the code that sets S.index and calls showEmpty still works;
update any other fetch usages in this file to use fetchWithTimeout as well.
In `@ci/run_ctests.sh`:
- Around line 80-85: Calls to signal_name use an unquoted ${rc} which can break
if rc is empty or contains spaces; update every invocation (e.g., in the block
that calls was_signal_death and echo "CRASH: ${test_name} died from
$(signal_name ${rc}) ...") to pass "${rc}" (quoted) instead of ${rc}. Search for
other usages of signal_name ${rc} (the same pattern appears later in the script)
and change them to signal_name "${rc}" to ensure safe handling; keep the rest of
the logic (was_signal_death, test_name, OVERALL_RC) unchanged.
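The quoted call, together with the exit-code-to-signal decoding it protects, can be sketched as follows; these helpers are minimal stand-ins for the ones in `ci/run_ctests.sh`:

```shell
# Exit codes above 128 conventionally mean "killed by signal (rc - 128)".
was_signal_death() { [ "${1:-0}" -gt 128 ]; }

# Minimal stand-in for the script's signal_name helper.
signal_name() {
  kill -l "$(( ${1:-0} - 128 ))" 2>/dev/null || echo "UNKNOWN"
}

rc=139   # what a SIGSEGV crash typically returns
if was_signal_death "${rc}"; then
  # Always quote "${rc}" so an empty value can't collapse the argument list.
  echo "CRASH: test died from SIG$(signal_name "${rc}")"
fi
```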
In `@ci/run_cuopt_pytests.sh`:
- Around line 23-30: The current argument scan in the for loop (for arg in "$@")
uses a broad pattern (*"junitxml"*) that can match unintended args; change the
conditional to match the exact flag pattern (e.g., --junitxml=*) so only the
intended `--junitxml=<path>` argument is captured, and still set xml_file using
the existing parameter expansion (xml_file="${arg#*=}") inside the if block;
update the if that checks arg to use the precise `--junitxml=*` pattern so
xml_file only receives the proper junitxml path.
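A self-contained sketch of the tightened match; the argument values are made up. Note that an argument merely containing "junitxml" no longer triggers the capture:

```shell
# Simulate a pytest command line: one decoy containing "junitxml", then the
# real flag. Only --junitxml=<path> should set xml_file.
xml_file=""
set -- -v --color=junitxml-ish --junitxml=results/report.xml tests/
for arg in "$@"; do
  if [[ "${arg}" == --junitxml=* ]]; then
    xml_file="${arg#*=}"   # strip everything through the first '='
    break
  fi
done
echo "captured: ${xml_file}"
```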
In `@ci/run_cuopt_server_pytests.sh`:
- Around line 23-29: The JUnit XML extraction loop uses a loose pattern and
should match the exact flag; update the condition inside the for-loop that
inspects args (the block setting xml_file and breaking) to test for the precise
pattern --junitxml=* instead of *"junitxml"* so xml_file="${arg#*=}" still
extracts the filename only when the argument starts with --junitxml=; keep the
same variable name xml_file and break behavior.
- Around line 34-49: Extract the duplicated crash handling from
ci/run_cuopt_server_pytests.sh into a shared helper (e.g., crash_helpers.sh) by
implementing a function run_pytest_with_crash_isolation that accepts the pytest
arguments and any context (e.g., xml_file, test_dir) and encapsulates the logic
around running pytest, capturing rc, checking signal_name, honoring IS_NIGHTLY,
calling pytest_crash_isolate when needed, and exiting with rc; then replace the
block in run_cuopt_server_pytests.sh (the current rc/pytest invocation and the
if-checks that call signal_name, pytest_crash_isolate, and exit) with a call to
run_pytest_with_crash_isolation, ensuring you pass xml_file and other required
variables and keep references to pytest_crash_isolate and signal_name intact so
behavior is unchanged.
In `@ci/utils/cuopt_rerun_xml.py`:
- Around line 60-62: When RAPIDS_TESTS_DIR is unset the function silently
returns (output_dir = os.environ.get("RAPIDS_TESTS_DIR", "") followed by if not
output_dir: return); add a debug log immediately before the early return to
indicate the env var is missing so CI runs are easier to troubleshoot. Use the
module logger (e.g., logging.getLogger(__name__).debug(...)) or the existing
logger used in this file to emit a clear message referencing RAPIDS_TESTS_DIR
and the fact that rerun XML will not be written, placing the call just before
the return that checks output_dir.
In `@ci/utils/generate_step_summary.py`:
- Around line 16-18: main currently assumes sys.argv[1] exists and the path is
valid; add minimal argument and file error handling by validating len(sys.argv)
(or using argparse) and printing a usage message and exiting with non-zero when
missing, then wrap the open/sys.argv[1] and json.load calls in try/except to
catch FileNotFoundError and json.JSONDecodeError and log or print a clear error
message before exiting; update the main function (and any helper functions that
read the file) to return a non-zero exit code on error rather than allowing an
uncaught exception.
In `@ci/utils/nightly_report_helper.sh`:
- Around line 46-55: The script uses the arch command to set arch_tag which may
be missing in some environments; change the architecture detection in the build
matrix block to use uname -m instead (replace arch invocation used when setting
arch_tag and in matrix_label computation), e.g., set arch_tag="$(uname -m)" so
matrix_label (and the py-tag branch) remains correct and more portable across
minimal CI containers.
In `@ci/utils/send_nightly_summary.sh`:
- Around line 153-158: The payload object currently hardcodes the Slack
"channel" field which is ignored for incoming webhooks; update the payload
construction in send_nightly_summary.sh to either remove the "channel" key from
the payload (delete the "channel": "cuopt-regression-testing" entry) or make it
configurable via an environment variable (e.g., read $SLACK_CHANNEL and include
"channel" only when set) so the payload variable reflects actual webhook
behavior; adjust any related documentation or default env setup if you choose
the configurable approach.
In `@python/cuopt_server/cuopt_server/tests/test_flaky_validation.py`:
- Around line 15-28: The test creates a persistent MARKER and doesn't guarantee
cleanup; replace the global MARKER use with a pytest fixture that creates a
unique marker path (use tmp_path/tmp_path_factory or append a
uuid/timestamp/pid) and yields it, then always removes the marker in the fixture
teardown to ensure cleanup even on failures or timeouts; update
test_flaky_fails_first_passes_on_retry to accept that fixture and use the
provided marker path instead of the global MARKER so test isolation and parallel
CI runs are safe.
In `@python/cuopt/pyproject.toml`:
- Line 50: The pytest-rerunfailures entry currently has no version constraint;
update the source dependencies.yaml to pin a minimum version (for example set
pytest-rerunfailures>=10.0) and then regenerate the auto-generated list so
pyproject.toml's dependency line for "pytest-rerunfailures" includes the minimum
version; ensure the final pyproject.toml contains the string
pytest-rerunfailures>=10.0 (or your chosen minimum) instead of the unpinned
name.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: f34de122-bc54-4d2c-aa14-45f9b5db7d6e
📒 Files selected for processing (46)
- .github/workflows/build.yaml
- .github/workflows/nightly-summary.yaml
- .github/workflows/pr.yaml
- .github/workflows/test.yaml
- ci/build_summary.sh
- ci/dashboard/index.html
- ci/nightly_summary.sh
- ci/run_ctests.sh
- ci/run_cuopt_pytests.sh
- ci/run_cuopt_server_pytests.sh
- ci/test_cpp.sh
- ci/test_notebooks.sh
- ci/test_python.sh
- ci/test_wheel_cuopt.sh
- ci/test_wheel_cuopt_server.sh
- ci/thirdparty-testing/run_cvxpy_tests.sh
- ci/thirdparty-testing/run_pulp_tests.sh
- ci/thirdparty-testing/run_pyomo_tests.sh
- ci/utils/aggregate_nightly.py
- ci/utils/crash_helpers.sh
- ci/utils/cuopt_rerun_xml.py
- ci/utils/generate_slack_payloads.py
- ci/utils/generate_step_summary.py
- ci/utils/junit_helpers.py
- ci/utils/nightly_report.py
- ci/utils/nightly_report_helper.sh
- ci/utils/s3_helpers.py
- ci/utils/send_consolidated_summary.sh
- ci/utils/send_nightly_summary.sh
- conda/environments/all_cuda-129_arch-aarch64.yaml
- conda/environments/all_cuda-129_arch-x86_64.yaml
- conda/environments/all_cuda-131_arch-aarch64.yaml
- conda/environments/all_cuda-131_arch-x86_64.yaml
- cpp/tests/utilities/CMakeLists.txt
- cpp/tests/utilities/test_flaky_validation.cpp
- datasets/get_test_data.sh
- datasets/linear_programming/download_pdlp_test_dataset.sh
- datasets/mip/download_miplib_test_dataset.sh
- dependencies.yaml
- python/cuopt/cuopt/linear_programming/pyproject.toml
- python/cuopt/cuopt/tests/test_flaky_validation.py
- python/cuopt/pyproject.toml
- python/cuopt_self_hosted/pyproject.toml
- python/cuopt_server/cuopt_server/tests/test_flaky_validation.py
- python/cuopt_server/pyproject.toml
- skills/cuopt-developer/SKILL.md
- Escape </script> in dashboard embedded JSON to prevent XSS (#4 critical) - Make Slack curl non-fatal with timeout + fallback so nightly pipeline isn't killed by transient Slack failures (#10) - Don't overwrite partial JUnit XML when test collection fails after crash — write crash marker to a separate file (#6) - Make flaky validation test markers PID-scoped to prevent cross-run contamination (#12) Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
@coderabbitai review
- Replace hardcoded @ramakrishna_prabhu with CUOPT_SLACK_MENTION_ID env var (accepts Slack user ID or handle, empty disables mentions) - Fix aws s3 presign stderr warnings polluting the presigned URL (2>&1 → 2>/dev/null) Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
✅ Actions performed
Review triggered.
Actionable comments posted: 3
🧹 Nitpick comments (4)
ci/utils/aggregate_nightly.py (1)
50-55: Sort S3 summary URIs before processing for deterministic outputs. Processing `s3_list(...)` results in source order can make consolidated failure/flaky section ordering vary run-to-run. A stable sort here avoids noisy report diffs.
♻️ Proposed patch

```diff
-    json_uris = [
+    json_uris = sorted([
         u for u in uris
         if u.endswith(".json") and not u.endswith("/consolidated.json")
-    ]
+    ])
@@
-    json_uris = [
+    json_uris = sorted([
         u for u in uris
         if u.endswith(".json") and not u.endswith("/consolidated.json")
-    ]
+    ])
```

Also applies to: 67-71
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ci/utils/aggregate_nightly.py` around lines 50 - 55, The S3 URIs returned by s3_list(s3_prefix) must be sorted to ensure deterministic consolidated output; update the code that builds json_uris (the block using uris, s3_list, s3_prefix and filtering out "/consolidated.json") to sort the uris (e.g., call sorted() / .sort()) before filtering, and apply the same change to the other similar block that processes S3 summary URIs (the later block that also builds json_uris from s3_list) so both places produce a stable, deterministic order.
ci/utils/crash_helpers.sh (2)
72-75: Make the collection cap configurable to avoid silent truncation. The fixed `head -500` can miss the crashing test when the suite grows. Consider a tunable limit with a safe default.
♻️ Proposed change

```diff
-  test_list=$(pytest --collect-only -q tests 2>/dev/null | grep "::" | head -500 || echo "")
+  local collect_limit="${PYTEST_CRASH_COLLECT_LIMIT:-500}"
+  test_list=$(pytest --collect-only -q tests 2>/dev/null | grep "::" | head -n "${collect_limit}" || echo "")
```
Verify each finding against the current code and only fix it if needed. In `@ci/utils/crash_helpers.sh` around lines 72 - 75, The current fixed cap using head -500 truncates collected tests; make the cap configurable by reading an environment variable (e.g., TEST_COLLECTION_LIMIT) with a safe default of 500, validate/fallback to the default if empty or non-numeric, and use that variable in the pytest collection pipeline that sets test_list (the command around test_list=$(pytest --collect-only ... | head -500 ...)). Update crash_helpers.sh to read and validate TEST_COLLECTION_LIMIT before calling head so larger suites aren't silently truncated while preserving the 500 default.
124-124: Consider keeping per-test retry stderr as an artifact for faster crash triage. `2>/dev/null` drops useful native crash breadcrumbs (stack traces/assert output). Redirecting to per-test logs in `${RAPIDS_TESTS_DIR}` can improve debugging without changing pass/fail behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ci/utils/crash_helpers.sh` at line 124, The pytest invocation currently discards stderr with "2>/dev/null"; preserve crash output by redirecting stderr to a per-test artifact file under ${RAPIDS_TESTS_DIR} (e.g., "${RAPIDS_TESTS_DIR}/${test_id}.stderr") while keeping the same exit-code capture into retry_rc and junit xml into retry_xml; ensure ${RAPIDS_TESTS_DIR} exists/writable before running the pytest command and update the pytest line (the invocation that sets retry_rc) to redirect stderr to that per-test file instead of /dev/null so pass/fail behavior is unchanged but native crash breadcrumbs are retained.
.github/workflows/nightly-summary.yaml (1)
53-80: Set explicit workflow permissions for `GITHUB_TOKEN`. This reusable workflow relies on `github.token`, but permission scope is currently implicit. Declare minimal required permissions (e.g., `actions: read`, `contents: read`) to avoid permission drift and tighten security posture.
Suggested patch

```diff
 name: nightly-summary

 on:
@@
       CUOPT_SLACK_CHANNEL_ID:
         required: false

+permissions:
+  actions: read
+  contents: read
+
 jobs:
   nightly-summary:
```
Verify each finding against the current code and only fix it if needed. In @.github/workflows/nightly-summary.yaml around lines 53 - 80, Add an explicit top-level permissions block to the reusable workflow to scope the GITHUB_TOKEN to the minimum required rights (e.g., permissions: actions: read, contents: read). Insert a permissions: mapping at the workflow root (above jobs) so the GITHUB_TOKEN used in the nightly-summary job and the Run nightly summary step has only the declared permissions instead of implicit full access.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/build.yaml:
- Line 251: The workflow uses a custom runner label runs-on: linux-amd64-cpu4
for the build-summary job; confirm that a self-hosted runner with the exact
label "linux-amd64-cpu4" is registered in your repo/org or replace this label
with a supported runner (e.g., ubuntu-latest) or your actual self-hosted label.
Update the same label used in .github/workflows/nightly-summary.yaml (the job
referencing the same runs-on value) to match the registered runner if you change
it. Ensure both workflows reference the identical, registered runner label so
the jobs will be scheduled.
In `@ci/build_summary.sh`:
- Around line 25-29: The curl command that writes to "${JOBS_FILE}" can hang
because it lacks timeout and fallback handling; update the invocation in
build_summary.sh to include connection and overall timeouts (e.g.
--connect-timeout and --max-time), enable --fail and retries (e.g. --retry and
--retry-delay) so transient errors don't block, and after the curl run check its
exit code and, on failure, write a safe fallback (like an empty JSON array or
cached minimal structure) to "${JOBS_FILE}" and emit a clear error/log message
so downstream steps can continue deterministically.
In `@ci/utils/aggregate_nightly.py`:
- Around line 565-570: The report link construction assumes s3_reports_prefix
ends with a slash, causing malformed URLs when it doesn't; update the logic that
builds report_link (which uses s3_reports_prefix, report_filename and
_html_escape) to ensure a single slash separates prefix and filename (e.g.,
append '/' to s3_reports_prefix only if it doesn't already end with one) before
escaping and concatenating so the generated View URL is always valid.
---
Nitpick comments:
In @.github/workflows/nightly-summary.yaml:
- Around line 53-80: Add an explicit top-level permissions block to the reusable
workflow to scope the GITHUB_TOKEN to the minimum required rights (e.g.,
permissions: actions: read, contents: read). Insert a permissions: mapping at
the workflow root (above jobs) so the GITHUB_TOKEN used in the nightly-summary
job and the Run nightly summary step has only the declared permissions instead
of implicit full access.
In `@ci/utils/aggregate_nightly.py`:
- Around line 50-55: The S3 URIs returned by s3_list(s3_prefix) must be sorted
to ensure deterministic consolidated output; update the code that builds
json_uris (the block using uris, s3_list, s3_prefix and filtering out
"/consolidated.json") to sort the uris (e.g., call sorted() / .sort()) before
filtering, and apply the same change to the other similar block that processes
S3 summary URIs (the later block that also builds json_uris from s3_list) so
both places produce a stable, deterministic order.
In `@ci/utils/crash_helpers.sh`:
- Around line 72-75: The current fixed cap using head -500 truncates collected
tests; make the cap configurable by reading an environment variable (e.g.,
TEST_COLLECTION_LIMIT) with a safe default of 500, validate/fallback to the
default if empty or non-numeric, and use that variable in the pytest collection
pipeline that sets test_list (the command around test_list=$(pytest
--collect-only ... | head -500 ...)). Update crash_helpers.sh to read and
validate TEST_COLLECTION_LIMIT before calling head so larger suites aren't
silently truncated while preserving the 500 default.
- Line 124: The pytest invocation currently discards stderr with "2>/dev/null";
preserve crash output by redirecting stderr to a per-test artifact file under
${RAPIDS_TESTS_DIR} (e.g., "${RAPIDS_TESTS_DIR}/${test_id}.stderr") while
keeping the same exit-code capture into retry_rc and junit xml into retry_xml;
ensure ${RAPIDS_TESTS_DIR} exists/writable before running the pytest command and
update the pytest line (the invocation that sets retry_rc) to redirect stderr to
that per-test file instead of /dev/null so pass/fail behavior is unchanged but
native crash breadcrumbs are retained.
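The `TEST_COLLECTION_LIMIT` validation suggested above could look like this (shown in Python for clarity; the actual helper is shell):

```python
import os

def collection_limit(default: int = 500) -> int:
    # Fall back to the default when TEST_COLLECTION_LIMIT is unset,
    # empty, or not a positive integer.
    raw = os.environ.get("TEST_COLLECTION_LIMIT", "").strip()
    return int(raw) if raw.isdigit() and int(raw) > 0 else default
```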
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: acd206f7-6c7d-48df-b928-08a54a50d71b
📒 Files selected for processing (12)
- .github/workflows/build.yaml
- .github/workflows/nightly-summary.yaml
- .github/workflows/test.yaml
- ci/build_summary.sh
- ci/nightly_summary.sh
- ci/utils/aggregate_nightly.py
- ci/utils/crash_helpers.sh
- ci/utils/generate_slack_payloads.py
- ci/utils/send_consolidated_summary.sh
- cpp/tests/utilities/test_flaky_validation.cpp
- python/cuopt/cuopt/tests/test_flaky_validation.py
- python/cuopt_server/cuopt_server/tests/test_flaky_validation.py
🚧 Files skipped from review as they are similar to previous changes (6)
- python/cuopt/cuopt/tests/test_flaky_validation.py
- cpp/tests/utilities/test_flaky_validation.cpp
- ci/utils/send_consolidated_summary.sh
- ci/nightly_summary.sh
- .github/workflows/test.yaml
- ci/utils/generate_slack_payloads.py
- Add --max-time 30 to GitHub API curl calls in build_summary.sh and nightly_summary.sh to prevent job hangs on network issues
- Add fallback to empty JSON on curl failure in build_summary.sh
- Normalize trailing slash in report link prefix to prevent broken URLs

Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
@coderabbitai review
✅ Actions performed: Review triggered.
♻️ Duplicate comments (1)
ci/build_summary.sh (1)
138-142: ⚠️ Potential issue | 🟠 Major
Add timeout + fallback to the Slack `chat.postMessage` call.
At line 138, this `curl` has no timeout and no fallback. Under `set -e`, transient network failures can terminate the script before your structured `ok` check runs.

Suggested patch:

```diff
-RESPONSE=$(curl -s -X POST \
+RESPONSE=$(curl -s --max-time 30 -X POST \
   -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \
   -H "Content-Type: application/json" \
   --data "${BOT_PAYLOAD}" \
-  "https://slack.com/api/chat.postMessage")
+  "https://slack.com/api/chat.postMessage" || echo '{"ok":false,"error":"curl_failed"}')
```

Verification script:

```bash
#!/bin/bash
# Verify Slack post curl resilience settings in ci/build_summary.sh
# Expected after fix: RESPONSE curl line includes --max-time and curl_failed fallback.
rg -n -C3 'RESPONSE=\$\(curl|chat\.postMessage|--max-time|curl_failed' ci/build_summary.sh
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ci/build_summary.sh` around lines 138 - 142, The curl POST that sets RESPONSE (using BOT_PAYLOAD and chat.postMessage) has no timeout and can cause the script to exit under set -e on transient failures; add a curl timeout flag (e.g. --max-time 30) and prevent set -e from aborting by capturing curl's exit code (run curl ... || true or assign to a temp var), then if the exit code is non-zero set RESPONSE to a deterministic fallback JSON (e.g. {"ok":false,"error":"curl_failed"}) so downstream checks of RESPONSE always run; update the invocation that assigns RESPONSE to reference the new timeout flag and add the exit-code check/fallback logic around that assignment.
🧹 Nitpick comments (1)
ci/utils/aggregate_nightly.py (1)
421-442: Consider caching SHA lookups for larger runs.
The nested loop to find `sha` from `matrix_grid` is O(n×m). For typical CI runs this is fine, but if the failure lists grow large, a precomputed dict would be more efficient.

♻️ Optional optimization:

```python
# Build lookup once at the start of generate_consolidated_html
sha_by_matrix = {
    (g["test_type"], g["matrix_label"]): g.get("sha", "unknown")
    for g in agg["matrix_grid"]
}

def _test_name_html(entry):
    name_escaped = _html_escape(entry["name"])
    suite = entry.get("suite", "")
    sha = sha_by_matrix.get(
        (entry.get("test_type"), entry.get("matrix_label")), "unknown"
    )
    # ... rest unchanged
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ci/utils/aggregate_nightly.py` around lines 421 - 442, The current _test_name_html function does an O(n×m) lookup over agg["matrix_grid"] to find sha for each entry; precompute a mapping (e.g. sha_by_matrix keyed by (g["test_type"], g["matrix_label"])) once at the start of generate_consolidated_html and then have _test_name_html read sha = sha_by_matrix.get((entry.get("test_type"), entry.get("matrix_label")), "unknown"); keep the remaining logic (suite checks and URL construction) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@ci/build_summary.sh`:
- Around line 138-142: The curl POST that sets RESPONSE (using BOT_PAYLOAD and
chat.postMessage) has no timeout and can cause the script to exit under set -e
on transient failures; add a curl timeout flag (e.g. --max-time 30) and prevent
set -e from aborting by capturing curl's exit code (run curl ... || true or
assign to a temp var), then if the exit code is non-zero set RESPONSE to a
deterministic fallback JSON (e.g. {"ok":false,"error":"curl_failed"}) so
downstream checks of RESPONSE always run; update the invocation that assigns
RESPONSE to reference the new timeout flag and add the exit-code check/fallback
logic around that assignment.
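The same timeout-plus-fallback pattern, sketched in Python rather than curl (names and the fallback payload are illustrative, not the script's actual code):

```python
import json
import urllib.error
import urllib.request

def post_message(url, payload, token, timeout=30):
    # Post JSON with an explicit timeout; on any network or parse error
    # return a deterministic fallback so downstream "ok" checks always run.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read().decode())
    except (urllib.error.URLError, OSError, json.JSONDecodeError):
        return {"ok": False, "error": "request_failed"}
```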
---
Nitpick comments:
In `@ci/utils/aggregate_nightly.py`:
- Around line 421-442: The current _test_name_html function does an O(n×m)
lookup over agg["matrix_grid"] to find sha for each entry; precompute a mapping
(e.g. sha_by_matrix keyed by (g["test_type"], g["matrix_label"])) once at the
start of generate_consolidated_html and then have _test_name_html read sha =
sha_by_matrix.get((entry.get("test_type"), entry.get("matrix_label")),
"unknown"); keep the remaining logic (suite checks and URL construction)
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: fe4d177c-afb3-4b1d-86d9-96bb1124b906
📒 Files selected for processing (3)
- ci/build_summary.sh
- ci/nightly_summary.sh
- ci/utils/aggregate_nightly.py
✅ Files skipped from review due to trivial changes (1)
- ci/nightly_summary.sh
The Slack notification said "NEW failures" when the only failed jobs were untracked CI jobs (e.g., conda-notebook-tests) with no per-matrix test reports. These are not new TEST failures — they're CI job failures without test details. Split the condition so:
- has_new (actual new test failures) → "NEW failures"
- untracked CI jobs only → "Recurring failures + N CI job(s) failed"

Both still mention the configured user for attention.

Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 3
🧹 Nitpick comments (2)
ci/build_summary.sh (2)
144-146: String comparison with Python boolean is fragile.
The check `"${OK}" != "True"` relies on Python printing the boolean as `"True"`. This works but could be more explicit.

🔧 Suggested improvement for robustness:

```diff
-OK=$(echo "${RESPONSE}" | python3 -c "import json,sys; print(json.load(sys.stdin).get('ok',''))" 2>/dev/null || echo "")
-if [ "${OK}" != "True" ]; then
+OK=$(echo "${RESPONSE}" | python3 -c "import json,sys; print('true' if json.load(sys.stdin).get('ok') else 'false')" 2>/dev/null || echo "false")
+if [ "${OK}" != "true" ]; then
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ci/build_summary.sh` around lines 144 - 146, The current check uses OK=$(echo "${RESPONSE}" | python3 -c "import json,sys; print(json.load(sys.stdin).get('ok',''))" ...) and then compares if [ "${OK}" != "True" ]; make the comparison robust by extracting the JSON boolean explicitly and normalizing it before comparing: update the python extraction to emit a deterministic lowercase string (e.g., print(str(json.load(...).get('ok')).lower()) ) or use a JSON tool to output true/false, then compare OK to "true" (or use a shell test that checks for non-empty true) so the check on RESPONSE reliably detects a boolean false regardless of Python capitalization; adjust the references to OK and RESPONSE accordingly.
145-149: Consider whether Slack failures should fail the job.
Currently the script logs an error but exits successfully (implicit `exit 0`) when the Slack API call fails. This is likely intentional to avoid failing CI for notification issues, but worth confirming this is the desired behavior. If Slack failures should fail the build-summary job:

🔧 Optional: Exit with error on Slack failure

```diff
 if [ "${OK}" != "True" ]; then
   echo "ERROR: chat.postMessage failed: ${RESPONSE}" >&2
+  exit 1
 else
   echo "Build summary posted to Slack."
 fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ci/build_summary.sh` around lines 145 - 149, The current block logs Slack API failures (when OK != "True") but still exits successfully; decide whether Slack failures should fail the job and implement it: either append an explicit non-zero exit (e.g., add exit 1) after the processLogger-styled error echo that uses ${RESPONSE} to fail the job, or make the behavior configurable by checking an env var (e.g., SLACK_FAIL_JOB) and call exit 1 only when SLACK_FAIL_JOB is set; update the branch handling the chat.postMessage response (variables OK and RESPONSE) accordingly so the script's exit status reflects the chosen policy.
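The lowercase normalization suggested for the `ok` check could be expressed as follows (a sketch of the logic, with a helper name not present in the script):

```python
import json

def slack_ok(response_text: str) -> str:
    # Emit a deterministic lowercase "true"/"false" for the Slack "ok"
    # field, treating any parse error as failure.
    try:
        return "true" if json.loads(response_text).get("ok") else "false"
    except (json.JSONDecodeError, AttributeError):
        return "false"
```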
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ci/utils/generate_slack_payloads.py`:
- Around line 57-60: The code uses direct dict indexing (e.g., computing
passed_ci_count with j["conclusion"]) which can raise KeyError when fields are
missing; update these to safe access patterns (use j.get("conclusion") ==
"success" or j.get("field", default)) and add sensible defaults where
counts/strings are computed so payload generation never aborts; apply the same
defensive changes to other occurrences noted (places that compute totals or read
summary fields around lines referenced: the blocks computing
total_ci_jobs/passed_ci_count and the other access sites) and ensure any
boolean/number conversions use safe defaults (0 or empty string) so Slack
payload creation proceeds even with partial schemas.
- Around line 145-146: The Slack mrkdwn f-strings are interpolating unescaped
dynamic fields (e.g., name, message, matrix_label, job names) which can break
formatting or create unintended links; add a helper function escape_mrkdwn(text)
in generate_slack_payloads.py that replaces &, <, > and backticks with their
safe equivalents (e.g., &amp;, &lt;, &gt;, and an escaped backtick) and use it
wherever dynamic values are injected (for example when building the "text"
f-string and the other mrkdwn lines referenced), i.e., wrap variables like name,
message, matrix_label, job_name with escape_mrkdwn(...) before interpolation so
all mrkdwn fields are safe.
- Around line 115-117: The message "all {passed_ci_count} CI jobs succeeded" can
be misleading when some jobs were cancelled/skipped; update the logic in
generate_slack_payloads.py to only claim "all ... succeeded" when
passed_ci_count equals the number of completed jobs (exclude cancelled/skipped),
e.g. compute completed_jobs_count from the job list (or total_ci_jobs minus
skipped/cancelled) and replace the condition that uses total_ci_jobs with a
check against completed_jobs_count, and change the text to say "all
{passed_ci_count} completed CI jobs succeeded" (update the variables
total_ci_jobs, passed_ci_count, and text accordingly).
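The escaping helper described in the second comment above might look like this sketch (it covers the three characters Slack's documented mrkdwn rules require; backtick handling would be an additional choice):

```python
def escape_mrkdwn(text) -> str:
    # Escape the characters Slack mrkdwn treats specially; "&" must be
    # replaced first so the later entities aren't double-escaped.
    return (
        str(text)
        .replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
    )
```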
---
Nitpick comments:
In `@ci/build_summary.sh`:
- Around line 144-146: The current check uses OK=$(echo "${RESPONSE}" | python3
-c "import json,sys; print(json.load(sys.stdin).get('ok',''))" ...) and then
compares if [ "${OK}" != "True" ]; make the comparison robust by extracting the
JSON boolean explicitly and normalizing it before comparing: update the python
extraction to emit a deterministic lowercase string (e.g.,
print(str(json.load(...).get('ok')).lower()) ) or use a JSON tool to output
true/false, then compare OK to "true" (or use a shell test that checks for
non-empty true) so the check on RESPONSE reliably detects a boolean false
regardless of Python capitalization; adjust the references to OK and RESPONSE
accordingly.
- Around line 145-149: The current block logs Slack API failures (when OK !=
"True") but still exits successfully; decide whether Slack failures should fail
the job and implement it: either append an explicit non-zero exit (e.g., add
exit 1) after the processLogger-styled error echo that uses ${RESPONSE} to fail
the job, or make the behavior configurable by checking an env var (e.g.,
SLACK_FAIL_JOB) and call exit 1 only when SLACK_FAIL_JOB is set; update the
branch handling the chat.postMessage response (variables OK and RESPONSE)
accordingly so the script's exit status reflects the chosen policy.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 92a7ad7d-2de2-4763-8dd8-63cfd1e1b1aa
📒 Files selected for processing (2)
- ci/build_summary.sh
- ci/utils/generate_slack_payloads.py
```python
if total_ci_jobs > 0:
    text += f", all {passed_ci_count} CI jobs succeeded"
mention = ""
```
Fix misleading “all CI jobs succeeded” wording.
Line 116 can report an incorrect “all … succeeded” message when some jobs are cancelled/skipped (non-success but not failure).
✏️ Proposed fix

```diff
-    if total_ci_jobs > 0:
-        text += f", all {passed_ci_count} CI jobs succeeded"
+    if total_ci_jobs > 0:
+        if passed_ci_count == total_ci_jobs:
+            text += f", all {total_ci_jobs} CI jobs succeeded"
+        else:
+            text += f", {passed_ci_count}/{total_ci_jobs} CI jobs succeeded"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ci/utils/generate_slack_payloads.py` around lines 115 - 117, The message "all
{passed_ci_count} CI jobs succeeded" can be misleading when some jobs were
cancelled/skipped; update the logic in generate_slack_payloads.py to only claim
"all ... succeeded" when passed_ci_count equals the number of completed jobs
(exclude cancelled/skipped), e.g. compute completed_jobs_count from the job list
(or total_ci_jobs minus skipped/cancelled) and replace the condition that uses
total_ci_jobs with a check against completed_jobs_count, and change the text to
say "all {passed_ci_count} completed CI jobs succeeded" (update the variables
total_ci_jobs, passed_ci_count, and text accordingly).
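The completed-jobs counting described above can be sketched as follows (helper name hypothetical, not the script's actual function):

```python
def ci_jobs_text(jobs):
    # Exclude cancelled/skipped jobs so the message only claims
    # "all ... succeeded" when every completed job passed.
    completed = [
        j for j in jobs
        if j.get("conclusion") not in ("cancelled", "skipped", None)
    ]
    passed = sum(1 for j in completed if j.get("conclusion") == "success")
    if completed and passed == len(completed):
        return f"all {passed} completed CI jobs succeeded"
    return f"{passed}/{len(completed)} completed CI jobs succeeded"
```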
Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
- Add _esc() helper to escape &, <, > in all dynamic mrkdwn fields
- Add _job_prefix() helper for safe job name extraction
- Replace all j["field"] with j.get("field", default) to prevent
KeyError on malformed input
- Fix "all CI jobs succeeded" to show N/M when some are cancelled
Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 1
🧹 Nitpick comments (1)
ci/run_ctests.sh (1)
149-149: Edge case: Test names with glob pattern characters.
If a test name contains `*`, `?`, or other glob pattern characters, `--gtest_filter` may not match exactly since it uses glob patterns. However, this is an extremely rare edge case in practice.

Potential fix if needed — escape special characters before passing to `--gtest_filter`:

```diff
-"${gt}" --gtest_filter="${tc}" --gtest_output="xml:${retry_xml}" "$@" || retry_rc=$?
+local tc_escaped
+tc_escaped=$(printf '%s' "${tc}" | sed 's/[*?:]/\\&/g')
+"${gt}" --gtest_filter="${tc_escaped}" --gtest_output="xml:${retry_xml}" "$@" || retry_rc=$?
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ci/run_ctests.sh` at line 149, The current invocation passes the test name variable tc directly to --gtest_filter which treats *, ?, [] as glob patterns; to avoid accidental pattern matching, pre-escape glob characters in tc (e.g., *, ?, [, ], and possibly \) before the line that runs "${gt}" --gtest_filter="${tc}" --gtest_output="xml:${retry_xml}" "$@" || retry_rc=$?; implement a small helper (e.g., escape_gtest_filter) that transforms tc into an escaped variable (like escaped_tc) by prefixing those characters with backslashes and then pass escaped_tc to --gtest_filter so the filter matches the literal test name.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ci/run_ctests.sh`:
- Line 44: Rename the misleading variable IS_NIGHTLY to BUILD_TYPE and propagate
that rename where it's set from RAPIDS_BUILD_TYPE and where it's compared to
"nightly"; specifically change the assignment
IS_NIGHTLY="${RAPIDS_BUILD_TYPE:-}" to BUILD_TYPE="${RAPIDS_BUILD_TYPE:-}" and
update any later comparison that checks IS_NIGHTLY == "nightly" to compare
BUILD_TYPE == "nightly" instead (ensure all occurrences of IS_NIGHTLY in this
script are updated).
---
Nitpick comments:
In `@ci/run_ctests.sh`:
- Line 149: The current invocation passes the test name variable tc directly to
--gtest_filter which treats *, ?, [] as glob patterns; to avoid accidental
pattern matching, pre-escape glob characters in tc (e.g., *, ?, [, ], and
possibly \) before the line that runs "${gt}" --gtest_filter="${tc}"
--gtest_output="xml:${retry_xml}" "$@" || retry_rc=$?; implement a small helper
(e.g., escape_gtest_filter) that transforms tc into an escaped variable (like
escaped_tc) by prefixing those characters with backslashes and then pass
escaped_tc to --gtest_filter so the filter matches the literal test name.
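The escaping transformation described above, sketched in Python for clarity (the actual helper would be shell; the character set mirrors the `sed 's/[*?:]/\\&/g'` suggestion):

```python
import re

def escape_gtest_filter(test_name: str) -> str:
    # Backslash-escape the characters --gtest_filter treats specially
    # (wildcards and the pattern separator).
    return re.sub(r"([*?:])", r"\\\1", test_name)
```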
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: a4db18df-cc30-463c-8cf3-fbb554977f47
📒 Files selected for processing (3)
- ci/run_ctests.sh
- ci/utils/crash_helpers.sh
- ci/utils/generate_slack_payloads.py
✅ Files skipped from review due to trivial changes (1)
- ci/utils/generate_slack_payloads.py
🚧 Files skipped from review as they are similar to previous changes (1)
- ci/utils/crash_helpers.sh
Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
PR builds now fail immediately on test failures with no retries, so authors see exactly what broke. Nightly builds retain:
- pytest --reruns 2 + rerun XML plugin
- gtest per-test retry
- crash isolation

Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
Summary
End-to-end CI pipeline for detecting, classifying, and reporting flaky tests across all test types (C++ gtests, Python pytest, server pytest).
Flaky detection and retry
- `--reruns 2` via pytest-rerunfailures for all builds (PR + nightly). Custom `cuopt_rerun_xml` plugin writes supplementary JUnit XML for rerun failures (pytest-rerunfailures v14+ only records final outcome)
- Validation tests (`test_flaky_validation.py`/`.cpp`) included for pipeline validation

Nightly reporting pipeline
- Per-matrix report generator (`nightly_report.py`): classifies tests as passed/failed/flaky/error, tracks failure history on S3 (new vs recurring vs stabilized), generates Markdown + HTML + JSON
- Aggregator (`aggregate_nightly.py`): merges all matrix job summaries into a single view with matrix grid, failure details, and flaky test list
- Slack notifier (`generate_slack_payloads.py`): threaded messages grouped by workflow, mentions @ramakrishna_prabhu on new failures or new flaky tests
- Dashboard (`ci/dashboard/index.html`): self-contained HTML with tabs for Failures, Flaky, Stabilized, Matrix Grid, and Trends

History and cross-run tracking
- Per-branch failure history on S3 (`history/{branch_slug}/`)
- Configurable flaky "bounce" window and threshold (`CUOPT_BOUNCE_WINDOW_DAYS`, `CUOPT_BOUNCE_THRESHOLD`)
- Release branches (e.g., `release/26.06`) inherit failure history from main on first run, so known failures aren't re-reported as new

S3 path isolation
- Summaries keyed by `GITHUB_RUN_ID` (`summaries/run-{id}/`) — prevents cross-run pollution when `RAPIDS_BRANCH` defaults to "main" in rapidsai containers for feature branches
- Date/branch copies (`summaries/{date}/{branch}/`) for manual browsing

Code quality
- Shared helpers extracted: `crash_helpers.sh` (signal_name, was_signal_death, write_crash_xml, xml_escape, pytest_crash_isolate), `junit_helpers.py` (JUnit XML parsing for failed/passed/gtest-list)
- `generate_slack_payloads.py`, `generate_step_summary.py` escape dynamic fields (`<>&"`)
- `CUOPT_DATASET_S3_URI` → `CUOPT_S3_URI` migration in PDLP and MIPLIB download scripts
- `build_summary.sh`

Test plan
- Verify new failures carry the `:new:` `:warning:` tag
- Remove the `test_flaky_validation` files before merging to main
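As a rough illustration of the new/recurring/stabilized classification the reporting pipeline performs against the stored failure history (a sketch, not the actual `nightly_report.py` logic):

```python
def classify_failures(current_failures, history_failures):
    # A failure already present in history is recurring, a brand-new
    # one is new, and a historical failure that no longer fails has
    # stabilized.
    current, history = set(current_failures), set(history_failures)
    return {
        "new": sorted(current - history),
        "recurring": sorted(current & history),
        "stabilized": sorted(history - current),
    }
```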