| 1 | +# Agentic Test Iteration Architecture |
| 2 | + |
| 3 | +Autonomous multi-agent system for iterating on Cypress test robustness, with visual feedback (screenshots + videos), CI result ingestion, and flakiness detection. |
| 4 | + |
| 5 | +## Goals |
| 6 | + |
| 7 | +| Phase | Objective | |
| 8 | +|-------|-----------| |
| 9 | +| **Phase 1** (current) | Make incident detection tests robust — fix selectors, timing, fixtures, page object gaps | |
| 10 | +| **Phase 2** (future) | Refactor frontend code using tests as a behavioral contract / safety net | |
| 11 | + |
| 12 | +## Architecture Overview |
| 13 | + |
| 14 | +``` |
| 15 | +User: /cypress:test-iteration:iterate-incident-tests target=regression max-iterations=3 |
| 16 | + |
| 17 | +Coordinator (main Claude Code session) |
| 18 | + | |
| 19 | + |-- [CI Analysis] /cypress:test-iteration:analyze-ci-results (optional first step) |
| 20 | + | Fetches CI artifacts, classifies infra vs test/code failures |
| 21 | + | Correlates failures with git commits for context |
| 22 | + | If all INFRA -> report to user and STOP |
| 23 | + | |
| 24 | + |-- Create branch: test/incident-robustness-<date> |
| 25 | + | |
| 26 | + |-- [Runner] Cypress headless via Bash (inline, not separate terminal) |
| 27 | + | Sources export-env.sh, produces mochawesome JSON + screenshots + videos |
| 28 | + | |
| 29 | + |-- [Parser] Extract failures from mochawesome JSON reports |
| 30 | + | Per failure: test name, error message, stack trace, screenshot path, video path |
| 31 | + | |
| 32 | + |-- For each failure (parallelizable): |
| 33 | + | | |
| 34 | + | |-- [Diagnosis Agent] (Explore-type sub-agent) |
| 35 | + | | Reads: screenshot (multimodal) + error + test code + fixture + page object |
| 36 | + | | Returns: root cause classification + recommended fix |
| 37 | + | | |
| 38 | + | |-- [Fix Agent] (general-purpose sub-agent) |
| 39 | + | | Makes targeted edits based on diagnosis |
| 40 | + | | Returns: diff summary |
| 41 | + | | |
| 42 | + | |-- [Validation] Re-run the specific failing test |
| 43 | + | Pass -> stage fix |
| 44 | + | Fail -> re-diagnose (max 2 retries per test) |
| 45 | + | |
| 46 | + |-- Commit batch of related fixes |
| 47 | + |-- Repeat if failures remain (up to max-iterations) |
| 48 | + |-- [Flakiness Probe] Run full suite 3x even if green |
| 49 | + |-- Report final state to user |
| 50 | +``` |
| 51 | + |
| 52 | +## Agent Roles |
| 53 | + |
| 54 | +### 1. Coordinator (main session) |
| 55 | + |
| 56 | +Owns the iteration loop, branch management, and commit strategy. |
| 57 | + |
| 58 | +Responsibilities: |
| 59 | +- Create and manage the working branch |
| 60 | +- Run Cypress tests inline via Bash |
| 61 | +- Parse mochawesome JSON reports |
| 62 | +- Dispatch Diagnosis and Fix agents |
| 63 | +- Track cumulative pass/fail state across iterations |
| 64 | +- Commit fixes in batches (threshold: **5 commits** before notifying user) |
| 65 | +- Run flakiness probes (multiple runs even when green) |
| 66 | +- Decide when to stop: all green + flakiness probe passed, max iterations, or needs human input |
| 67 | + |
| 68 | +### 2. Diagnosis Agent (Explore-type sub-agent) |
| 69 | + |
| 70 | +Input per failure: |
| 71 | +- Error message and stack trace from mochawesome JSON |
| 72 | +- Screenshot path (read with multimodal Read tool) |
| 73 | +- Video path (reference for user, not directly parseable by agent) |
| 74 | +- Test file path + relevant line numbers |
| 75 | +- Associated fixture YAML |
| 76 | +- Page object methods used |
| 77 | + |
| 78 | +Output — one of these classifications: |
| 79 | + |
| 80 | +| Classification | Description | Action | |
| 81 | +|---------------|-------------|--------| |
| 82 | +| `TEST_BUG` | Wrong selector, incorrect assertion, timing/race condition | Auto-fix | |
| 83 | +| `FIXTURE_ISSUE` | Missing data, wrong structure, edge case not covered | Auto-fix | |
| 84 | +| `PAGE_OBJECT_GAP` | Missing method, stale selector, outdated DOM reference | Auto-fix | |
| 85 | +| `MOCK_ISSUE` | Intercept not matching, response shape wrong | Auto-fix | |
| 86 | +| `REAL_REGRESSION` | Actual UI/code bug — not a test problem | **STOP and report to user** | |
| 87 | +| `INFRA_ISSUE` | Cluster down, cert expired, operator not installed | **STOP and report to user** | |
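
The classification table can be read as a lookup from label to next action. The following is a minimal sketch of that lookup, not the project's implementation: the `Action` names, the `Diagnosis` shape, and `nextAction` are hypothetical; only the six classification labels come from the table.

```typescript
// Classification labels copied from the table above.
type Classification =
  | "TEST_BUG"
  | "FIXTURE_ISSUE"
  | "PAGE_OBJECT_GAP"
  | "MOCK_ISSUE"
  | "REAL_REGRESSION"
  | "INFRA_ISSUE";

// Hypothetical action names: the table only distinguishes
// "Auto-fix" from "STOP and report to user".
type Action = "auto-fix" | "stop-and-report";

const ACTION_BY_CLASSIFICATION: Record<Classification, Action> = {
  TEST_BUG: "auto-fix",
  FIXTURE_ISSUE: "auto-fix",
  PAGE_OBJECT_GAP: "auto-fix",
  MOCK_ISSUE: "auto-fix",
  REAL_REGRESSION: "stop-and-report",
  INFRA_ISSUE: "stop-and-report",
};

// Hypothetical shape of what the Diagnosis Agent returns.
interface Diagnosis {
  testTitle: string;
  classification: Classification;
  recommendedFix: string;
}

// Decide whether the coordinator dispatches a Fix Agent or stops.
function nextAction(diagnosis: Diagnosis): Action {
  return ACTION_BY_CLASSIFICATION[diagnosis.classification];
}
```

Typing the map as `Record<Classification, Action>` means a newly added classification fails to compile until it is given an action.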
| 88 | + |
| 89 | +The agent should **read the screenshot first**, then the code: visual state often reveals the root cause faster than stack traces. |
| 90 | + |
| 91 | +### 3. Fix Agent (general-purpose sub-agent) |
| 92 | + |
| 93 | +Input: |
| 94 | +- Diagnosis classification and details |
| 95 | +- Specific file references and what to change |
| 96 | + |
| 97 | +Scope — may only edit: |
| 98 | +- `cypress/e2e/incidents/**/*.cy.ts` (test files) |
| 99 | +- `cypress/fixtures/incident-scenarios/*.yaml` (fixtures) |
| 100 | +- `cypress/views/incidents-page.ts` (page object) |
| 101 | +- `cypress/support/incidents_prometheus_query_mocks/**` (mock layer) |
| 102 | + |
| 103 | +Must NOT edit: |
| 104 | +- Source code (`src/`) — that's Phase 2 |
| 105 | +- Non-incident test files |
| 106 | +- Cypress config or support infrastructure |
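
One way to enforce the edit scope is a path check before any Fix Agent edit is accepted. This is a sketch under assumptions: `isEditable` is a hypothetical helper, the globs are hand-translated into regular expressions rather than evaluated by a glob library, and paths are assumed to be normalized (no `..` segments) and relative to the directory that contains `cypress/`.

```typescript
// Hand-written regex equivalents of the four allowed globs listed above.
const EDITABLE_PATHS: RegExp[] = [
  // cypress/e2e/incidents/**/*.cy.ts
  /^cypress\/e2e\/incidents\/(?:.+\/)?[^/]+\.cy\.ts$/,
  // cypress/fixtures/incident-scenarios/*.yaml (single directory level)
  /^cypress\/fixtures\/incident-scenarios\/[^/]+\.yaml$/,
  // cypress/views/incidents-page.ts
  /^cypress\/views\/incidents-page\.ts$/,
  // cypress/support/incidents_prometheus_query_mocks/**
  /^cypress\/support\/incidents_prometheus_query_mocks\/.+$/,
];

// Returns true only when the path falls inside the Fix Agent's scope.
function isEditable(path: string): boolean {
  return EDITABLE_PATHS.some((pattern) => pattern.test(path));
}
```

An allowlist keeps the default safe: anything not explicitly matched, including `src/` and non-incident tests, is rejected.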
| 107 | + |
| 108 | +### 4. Validation Agent |
| 109 | + |
| 110 | +Re-runs the specific failing test after a fix is applied: |
| 111 | +```bash |
| 112 | +source cypress/export-env.sh && npx cypress run --env grep="<test name>" --spec "<spec file>" |
| 113 | +``` |
| 114 | + |
| 115 | +Reports pass/fail. If still failing, feeds back to Diagnosis Agent (max 2 retries per test). |
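
The retry rule can be sketched as a bounded loop. The callbacks `diagnoseAndFix` and `rerunTest` are hypothetical stand-ins for the agents and the Cypress re-run, and reading "max 2 retries" as one initial attempt plus two more is an interpretation, not something the document spells out.

```typescript
// From the document: max 2 retries per test.
const MAX_RETRIES = 2;

interface RetryOutcome {
  fixed: boolean;
  attempts: number;
}

// Run diagnose -> fix -> validate until the test passes or retries run out.
// Both callbacks are hypothetical placeholders for the real agents.
function fixWithRetries(
  diagnoseAndFix: (attempt: number) => void,
  rerunTest: () => boolean,
): RetryOutcome {
  const maxAttempts = 1 + MAX_RETRIES; // initial attempt plus retries
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    diagnoseAndFix(attempt);
    if (rerunTest()) {
      return { fixed: true, attempts: attempt };
    }
  }
  // Still failing after all retries: hand the test back for human input.
  return { fixed: false, attempts: maxAttempts };
}
```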
| 116 | + |
| 117 | +## Flakiness Detection |
| 118 | + |
| 119 | +Even if the first run is all green, the coordinator runs a **flakiness probe**: |
| 120 | + |
| 121 | +1. Run the full incident test suite 3 times consecutively |
| 122 | +2. Track per-test results across runs |
| 123 | +3. Flag any test that fails in any run as `FLAKY` |
| 124 | +4. For flaky tests: attempt to diagnose the timing/race condition and fix |
| 125 | +5. Report flakiness statistics: `test_name: 2/3 passed` etc. |
| 126 | + |
| 127 | +This catches intermittent failures that a single run would miss. |
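
Steps 2, 3, and 5 amount to tallying per-test results across runs. A minimal sketch, assuming each run has already been reduced to a map from test title to state; `RunResult`, `summarizeFlakiness`, and `formatEntry` are hypothetical names.

```typescript
// One probe run: test title -> final state (hypothetical reduced form).
type RunResult = Record<string, "passed" | "failed">;

interface FlakinessEntry {
  test: string;
  passed: number;
  total: number;
  flaky: boolean;
}

function summarizeFlakiness(runs: RunResult[]): FlakinessEntry[] {
  // Collect every test title seen in any run, in first-seen order.
  const titles: string[] = [];
  for (const run of runs) {
    for (const title of Object.keys(run)) {
      if (titles.indexOf(title) === -1) titles.push(title);
    }
  }
  return titles.map((test) => {
    // Only count runs in which the test actually appeared.
    const present = runs.filter((run) =>
      Object.prototype.hasOwnProperty.call(run, test),
    );
    const passed = present.filter((run) => run[test] === "passed").length;
    // Per the rule above: any failure in any run flags the test.
    return { test, passed, total: present.length, flaky: passed < present.length };
  });
}

// Matches the report format shown in step 5.
function formatEntry(entry: FlakinessEntry): string {
  return `${entry.test}: ${entry.passed}/${entry.total} passed`;
}
```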
| 128 | + |
| 129 | +## CI Result Ingestion |
| 130 | + |
| 131 | +CI analysis is handled by the dedicated `/cypress:test-iteration:analyze-ci-results` skill (`.claude/commands/cypress:test-iteration:analyze-ci-results.md`). |
| 132 | + |
| 133 | +The skill fetches artifacts from OpenShift CI (Prow) runs on GCS, classifies failures as infrastructure vs test/code issues, reads failure screenshots with multimodal vision, and correlates failures with the git commits that triggered them. |
| 134 | + |
| 135 | +### Key Capabilities |
| 136 | + |
| 137 | +- **URL normalization**: Accepts gcsweb or Prow UI URLs at any level of the artifact tree |
| 138 | +- **Structured metadata**: Extracts PR number, author, branch, commit SHAs from `started.json` / `finished.json` / `prowjob.json` |
| 139 | +- **Build log parsing**: Parses Cypress console output from `build-log.txt` for per-spec pass/fail/skip counts and error details |
| 140 | +- **Visual diagnosis**: Fetches and reads failure screenshots (multimodal) to understand UI state at failure time |
| 141 | +- **Failure classification**: Categorizes each failure as `INFRA_*` (cluster, operator, plugin, auth, CI) or test/code (`TEST_BUG`, `FIXTURE_ISSUE`, `PAGE_OBJECT_GAP`, `MOCK_ISSUE`, `CODE_REGRESSION`) |
| 142 | +- **Commit correlation**: Maps failures to specific file changes in the PR using `git diff {base}..{pr_head}` |
| 143 | + |
| 144 | +### Integration with Orchestrator |
| 145 | + |
| 146 | +The orchestrator uses `/cypress:test-iteration:analyze-ci-results` as an optional first step: |
| 147 | + |
| 148 | +1. If all failures are `INFRA_*` -> report to user and STOP (no test changes will help) |
| 149 | +2. If mixed -> report infra issues, proceed with test/code fixes only |
| 150 | +3. If all test/code -> proceed with full iteration loop |
| 151 | +4. Commit correlation tells the orchestrator whether to fix tests or investigate source changes |
| 152 | +5. CI screenshots give the Diagnosis Agent a head start before local reproduction |
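
Rules 1 to 3 reduce to counting how many classifications carry the `INFRA_` prefix. The sketch below assumes the skill's output has been reduced to a list of classification strings; the `NextStep` names and the concrete `INFRA_CLUSTER` label in the usage line are hypothetical, and the zero-failure case is an added assumption the list does not cover.

```typescript
// Hypothetical names for the three outcomes described above,
// plus a fourth for the case with no failures at all.
type NextStep =
  | "stop-infra-only"
  | "fix-tests-report-infra"
  | "full-iteration"
  | "nothing-to-do";

function isInfra(classification: string): boolean {
  return classification.indexOf("INFRA_") === 0;
}

function decideNextStep(classifications: string[]): NextStep {
  if (classifications.length === 0) return "nothing-to-do";
  const infraCount = classifications.filter(isInfra).length;
  if (infraCount === classifications.length) return "stop-infra-only";
  if (infraCount > 0) return "fix-tests-report-infra";
  return "full-iteration";
}
```

For example, `decideNextStep(["INFRA_CLUSTER", "TEST_BUG"])` falls under rule 2: infra issues are reported and only the test failure is iterated on.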
| 153 | + |
| 154 | +### Usage |
| 155 | + |
| 156 | +``` |
| 157 | +/cypress:test-iteration:analyze-ci-results ci-url=https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/.../{RUN_ID}/ |
| 158 | +/cypress:test-iteration:analyze-ci-results ci-url=https://prow.ci.openshift.org/view/gs/.../{RUN_ID} focus=regression |
| 159 | +``` |
| 160 | + |
| 161 | +## Commit Strategy |
| 162 | + |
| 163 | +- **Branch naming**: `test/incident-robustness-YYYY-MM-DD` |
| 164 | +- **Commit granularity**: Group related fixes (e.g., "fix 3 selector issues in filtering tests") |
| 165 | +- **Review threshold**: Notify user after **5 commits** for review |
| 166 | +- **Never force-push**; always additive commits |
| 167 | +- User merges when ready, or continues iteration |
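
The naming and review rules are simple enough to sketch as two helpers. `branchName` and `shouldNotifyUser` are hypothetical names, and using the UTC date is an assumption; the document does not say which time zone the date comes from.

```typescript
// From the document: notify the user after 5 commits.
const REVIEW_THRESHOLD = 5;

// Builds test/incident-robustness-YYYY-MM-DD from the UTC date.
function branchName(date: Date): string {
  const isoDay = date.toISOString().slice(0, 10);
  return `test/incident-robustness-${isoDay}`;
}

// True once enough commits have accumulated since the last review.
function shouldNotifyUser(commitsSinceReview: number): boolean {
  return commitsSinceReview >= REVIEW_THRESHOLD;
}
```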
| 168 | + |
| 169 | +## Test Execution (Inline) |
| 170 | + |
| 171 | +Tests run inline via Bash, not in a separate terminal: |
| 172 | + |
| 173 | +```bash |
| 174 | +cd web && source cypress/export-env.sh && \ |
| 175 | + npx cypress run \ |
| 176 | + --spec "cypress/e2e/incidents/regression/**/*.cy.ts" \ |
| 177 | + --env grepTags="@incidents --@e2e-real --@flaky" \ |
| 178 | + --reporter ./node_modules/cypress-multi-reporters \ |
| 179 | + --reporter-options configFile=reporter-config.json |
| 180 | +``` |
| 181 | + |
| 182 | +Results are collected from: |
| 183 | +- **Exit code**: 0 = all passed, non-zero = failures |
| 184 | +- **Mochawesome JSON**: `screenshots/cypress_report_*.json` — per-test details |
| 185 | +- **Screenshots**: `cypress/screenshots/{spec}/` — failure screenshots |
| 186 | +- **Videos**: `cypress/videos/{spec}.mp4` — kept on failure |
| 187 | + |
| 188 | +### Report Parsing |
| 189 | + |
| 190 | +Mochawesome JSON structure (per report file): |
| 191 | +```json |
| 192 | +{ |
| 193 | + "stats": { "passes": N, "failures": N, "skipped": N }, |
| 194 | + "results": [{ |
| 195 | + "suites": [{ |
| 196 | + "title": "Suite Name", |
| 197 | + "tests": [{ |
| 198 | + "title": "test description", |
| 199 | + "fullTitle": "Suite -- test description", |
| 200 | + "state": "failed|passed|skipped", |
| 201 | + "err": { |
| 202 | + "message": "error description", |
| 203 | + "estack": "full stack trace" |
| 204 | + } |
| 205 | + }] |
| 206 | + }] |
| 207 | + }] |
| 208 | +} |
| 209 | +``` |
| 210 | + |
| 211 | +Use `npx mochawesome-merge screenshots/cypress_report_*.json > merged-report.json` to combine per-spec reports. |
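
Extracting failures from the structure above is a walk over `results[].suites[].tests[]`. The sketch below works on an already-parsed report object; the interface names and `extractFailures` are hypothetical, the optional nested `suites` field is an assumption to cover nested `describe` blocks, and `err` fields are treated as optional because passing tests may carry an empty `err`.

```typescript
interface MochaErr {
  message?: string;
  estack?: string;
}
interface MochaTest {
  title: string;
  fullTitle: string;
  state: string;
  err: MochaErr;
}
interface MochaSuite {
  title: string;
  tests: MochaTest[];
  suites?: MochaSuite[]; // assumption: nested describe blocks
}
interface MochaReport {
  stats: { passes: number; failures: number; skipped: number };
  results: { suites: MochaSuite[] }[];
}

// The per-failure fields the Diagnosis Agent needs from the report.
interface Failure {
  fullTitle: string;
  message: string;
  stack: string;
}

// Depth-first walk that appends every failed test to `out`.
function collectFailures(suites: MochaSuite[], out: Failure[]): void {
  for (const suite of suites) {
    for (const test of suite.tests) {
      if (test.state === "failed") {
        out.push({
          fullTitle: test.fullTitle,
          message: test.err.message ?? "",
          stack: test.err.estack ?? "",
        });
      }
    }
    collectFailures(suite.suites ?? [], out);
  }
}

function extractFailures(report: MochaReport): Failure[] {
  const failures: Failure[] = [];
  for (const result of report.results) {
    collectFailures(result.suites, failures);
  }
  return failures;
}
```

Screenshot and video paths are not part of the structure shown, so this sketch leaves them to be resolved separately from the spec and test names.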
| 212 | + |
| 213 | +## Skills |
| 214 | + |
| 215 | +| Skill | Purpose | Invoked by | |
| 216 | +|-------|---------|------------| |
| 217 | +| `/cypress:test-iteration:iterate-incident-tests` | Main orchestrator — local iteration loop, dispatches agents, manages commits | User | |
| 218 | +| `/cypress:test-iteration:iterate-ci-flaky` | CI-based iteration — push fixes, trigger Prow jobs, wait, analyze, repeat | User | |
| 219 | +| `/cypress:test-iteration:diagnose-test-failure` | Classifies a single test failure using screenshots + code analysis | Orchestrator (as sub-agent prompt) | |
| 220 | +| `/cypress:test-iteration:analyze-ci-results` | Fetches and analyzes OpenShift CI artifacts, classifies infra vs test/code | User or orchestrator | |
| 221 | + |
| 222 | +Skills are defined in `.claude/commands/` and can be invoked as slash commands. |
| 223 | + |
| 224 | +## Existing Infrastructure Leveraged |
| 225 | + |
| 226 | +| Asset | How the agent uses it | |
| 227 | +|-------|----------------------| |
| 228 | +| mochawesome JSON reporter | Structured test results parsing | |
| 229 | +| `@cypress/grep` | Run individual tests by name or tag | |
| 230 | +| `export-env.sh` | Source env vars for inline execution | |
| 231 | +| YAML fixture system | Edit fixtures to fix data issues | |
| 232 | +| Page object (`incidents-page.ts`) | Fix selectors and add missing methods | |
| 233 | +| Mock layer (`incidents_prometheus_query_mocks/`) | Fix intercept patterns | |
| 234 | +| `/cypress:test-development:generate-incident-fixture` skill | Generate new fixtures when needed | |
| 235 | +| `/cypress:test-development:validate-incident-fixtures` skill | Validate fixture edits | |
| 236 | + |
| 237 | +## Phase 2: Frontend Refactoring (Future) |
| 238 | + |
| 239 | +### Concept |
| 240 | + |
| 241 | +Tests become the behavioral contract. The agent refactors frontend code while using the test suite as a safety net. |
| 242 | + |
| 243 | +### Additional Agent Roles |
| 244 | + |
| 245 | +| Agent | Role | |
| 246 | +|-------|------| |
| 247 | +| **Refactor Planner** | Analyzes frontend code, proposes refactoring steps | |
| 248 | +| **Refactor Agent** | Makes code changes (replaces Fix Agent) | |
| 249 | +| **Contract Validator** | Runs tests, classifies failures as regression vs test-coupling | |
| 250 | +| **Test Adapter** | Updates tests that assert implementation details instead of behavior | |
| 251 | + |
| 252 | +### Key Principle |
| 253 | + |
| 254 | +If a test breaks due to refactoring but behavior is preserved, the test needs updating — it was too coupled to implementation. Phase 1 (robustness) reduces this coupling, making Phase 2 more effective. |
| 255 | + |
| 256 | +### Additional Classification |
| 257 | + |
| 258 | +The Diagnosis Agent gains `TEST_TOO_COUPLED` — the test asserts implementation details (specific DOM structure, internal state) rather than observable behavior. The Test Adapter agent rewrites it to be implementation-agnostic. |