Commit c8c6b38

DavidRajnoha and claude committed
docs: add agentic test iteration architecture and roadmap
Add architecture documentation for the agentic test iteration system and a roadmap with future improvement ideas including Slack notifications, cloud execution options, and interaction models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 0234fa4 commit c8c6b38

2 files changed

Lines changed: 722 additions & 0 deletions

Lines changed: 258 additions & 0 deletions
@@ -0,0 +1,258 @@
# Agentic Test Iteration Architecture

Autonomous multi-agent system for iterating on Cypress test robustness, with visual feedback (screenshots + videos), CI result ingestion, and flakiness detection.

## Goals

| Phase | Objective |
|-------|-----------|
| **Phase 1** (current) | Make incident detection tests robust: fix selectors, timing, fixtures, page object gaps |
| **Phase 2** (future) | Refactor frontend code using tests as a behavioral contract / safety net |

## Architecture Overview

```
User: /cypress:test-iteration:iterate-incident-tests target=regression max-iterations=3

Coordinator (main Claude Code session)
  |
  |-- [CI Analysis] /cypress:test-iteration:analyze-ci-results (optional first step)
  |     Fetches CI artifacts, classifies infra vs test/code failures
  |     Correlates failures with git commits for context
  |     If all INFRA -> report to user and STOP
  |
  |-- Create branch: test/incident-robustness-<date>
  |
  |-- [Runner] Cypress headless via Bash (inline, not separate terminal)
  |     Sources export-env.sh, produces mochawesome JSON + screenshots + videos
  |
  |-- [Parser] Extract failures from mochawesome JSON reports
  |     Per failure: test name, error message, stack trace, screenshot path, video path
  |
  |-- For each failure (parallelizable):
  |     |
  |     |-- [Diagnosis Agent] (Explore-type sub-agent)
  |     |     Reads: screenshot (multimodal) + error + test code + fixture + page object
  |     |     Returns: root cause classification + recommended fix
  |     |
  |     |-- [Fix Agent] (general-purpose sub-agent)
  |     |     Makes targeted edits based on diagnosis
  |     |     Returns: diff summary
  |     |
  |     |-- [Validation] Re-run the specific failing test
  |           Pass -> stage fix
  |           Fail -> re-diagnose (max 2 retries per test)
  |
  |-- Commit batch of related fixes
  |-- Repeat if failures remain (up to max-iterations)
  |-- [Flakiness Probe] Run full suite 3x even if green
  |-- Report final state to user
```

## Agent Roles

### 1. Coordinator (main session)

Owns the iteration loop, branch management, and commit strategy.

Responsibilities:
- Create and manage the working branch
- Run Cypress tests inline via Bash
- Parse mochawesome JSON reports
- Dispatch Diagnosis and Fix agents
- Track cumulative pass/fail state across iterations
- Commit fixes in batches (threshold: **5 commits** before notifying user)
- Run flakiness probes (multiple runs even when green)
- Decide when to stop: all green + flakiness probe passed, max iterations, or needs human input

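The stop conditions in the last responsibility can be sketched as a small pure function. This is only an illustration under stated assumptions: the `IterationState` shape and the `decideNextStep` name are hypothetical and do not come from the skill definitions.

```typescript
// Hypothetical state the coordinator might track between iterations.
interface IterationState {
  iteration: number;             // iterations completed so far
  maxIterations: number;         // from the max-iterations argument
  failingTests: string[];        // tests still failing after the latest run
  flakinessProbePassed: boolean; // true once the multi-run probe came back clean
  needsHumanInput: boolean;      // e.g. a REAL_REGRESSION or INFRA_ISSUE diagnosis
}

type NextStep =
  | { action: "stop"; reason: string }
  | { action: "run-flakiness-probe" }
  | { action: "iterate" };

// Maps the documented stop conditions onto a single decision.
function decideNextStep(s: IterationState): NextStep {
  if (s.needsHumanInput) {
    return { action: "stop", reason: "needs human input" };
  }
  if (s.failingTests.length === 0) {
    // Green is not enough: the flakiness probe must also pass.
    return s.flakinessProbePassed
      ? { action: "stop", reason: "all green and flakiness probe passed" }
      : { action: "run-flakiness-probe" };
  }
  if (s.iteration >= s.maxIterations) {
    return { action: "stop", reason: "max iterations reached" };
  }
  return { action: "iterate" };
}
```

Checking `needsHumanInput` first reflects the rule that regressions and infrastructure problems halt the loop regardless of how many iterations remain.
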
### 2. Diagnosis Agent (Explore-type sub-agent)

Input per failure:
- Error message and stack trace from mochawesome JSON
- Screenshot path (read with multimodal Read tool)
- Video path (reference for user, not directly parseable by agent)
- Test file path + relevant line numbers
- Associated fixture YAML
- Page object methods used

Output is one of these classifications:

| Classification | Description | Action |
|---------------|-------------|--------|
| `TEST_BUG` | Wrong selector, incorrect assertion, timing/race condition | Auto-fix |
| `FIXTURE_ISSUE` | Missing data, wrong structure, edge case not covered | Auto-fix |
| `PAGE_OBJECT_GAP` | Missing method, stale selector, outdated DOM reference | Auto-fix |
| `MOCK_ISSUE` | Intercept not matching, response shape wrong | Auto-fix |
| `REAL_REGRESSION` | Actual UI/code bug, not a test problem | **STOP and report to user** |
| `INFRA_ISSUE` | Cluster down, cert expired, operator not installed | **STOP and report to user** |

The agent should **read the screenshot first** before looking at code, because visual state often reveals the root cause faster than stack traces.

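The classification table can be encoded as a lookup so the coordinator cannot accidentally auto-fix a stop-class diagnosis. The sketch below is a minimal illustration; the `Diagnosis` shape and the `canAutoFix` helper are hypothetical names, while the classification labels and actions are taken from the table.

```typescript
type Classification =
  | "TEST_BUG"
  | "FIXTURE_ISSUE"
  | "PAGE_OBJECT_GAP"
  | "MOCK_ISSUE"
  | "REAL_REGRESSION"
  | "INFRA_ISSUE";

type Action = "auto-fix" | "stop-and-report";

// One entry per table row; the compiler rejects a missing classification.
const ACTION_BY_CLASSIFICATION: Record<Classification, Action> = {
  TEST_BUG: "auto-fix",
  FIXTURE_ISSUE: "auto-fix",
  PAGE_OBJECT_GAP: "auto-fix",
  MOCK_ISSUE: "auto-fix",
  REAL_REGRESSION: "stop-and-report",
  INFRA_ISSUE: "stop-and-report",
};

// Hypothetical shape of what the Diagnosis Agent returns.
interface Diagnosis {
  classification: Classification;
  rootCause: string;
  recommendedFix: string;
}

function canAutoFix(d: Diagnosis): boolean {
  return ACTION_BY_CLASSIFICATION[d.classification] === "auto-fix";
}
```

Using a `Record` keyed by the union type means that adding a new classification later forces a matching action to be declared.
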
### 3. Fix Agent (general-purpose sub-agent)

Input:
- Diagnosis classification and details
- Specific file references and what to change

Scope (may only edit):
- `cypress/e2e/incidents/**/*.cy.ts` (test files)
- `cypress/fixtures/incident-scenarios/*.yaml` (fixtures)
- `cypress/views/incidents-page.ts` (page object)
- `cypress/support/incidents_prometheus_query_mocks/**` (mock layer)

Must NOT edit:
- Source code (`src/`): that's Phase 2
- Non-incident test files
- Cypress config or support infrastructure

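One way to enforce this scope mechanically is a default-deny allowlist check over the four globs above. The following is a minimal sketch, not part of the documented skills: `globToRegExp` and `isEditAllowed` are hypothetical helpers, the glob handling covers only `*` and `**`, and paths are assumed to be POSIX-style and relative to the directory that contains `cypress/`.

```typescript
// The four editable globs listed in the Fix Agent scope.
const ALLOWED_GLOBS: string[] = [
  "cypress/e2e/incidents/**/*.cy.ts",
  "cypress/fixtures/incident-scenarios/*.yaml",
  "cypress/views/incidents-page.ts",
  "cypress/support/incidents_prometheus_query_mocks/**",
];

// Minimal glob translation: "**/" spans zero or more directories,
// "**" spans anything, "*" stays within one path segment.
function globToRegExp(glob: string): RegExp {
  let re = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.slice(i, i + 3) === "**/") {
      re += "(?:.*/)?";
      i += 3;
    } else if (glob.slice(i, i + 2) === "**") {
      re += ".*";
      i += 2;
    } else if (glob.charAt(i) === "*") {
      re += "[^/]*";
      i += 1;
    } else {
      // Escape regex metacharacters in literal path characters.
      re += glob.charAt(i).replace(/[.+?^${}()|[\]\\]/g, "\\$&");
      i += 1;
    }
  }
  return new RegExp("^" + re + "$");
}

const ALLOWED_PATTERNS: RegExp[] = ALLOWED_GLOBS.map((g) => globToRegExp(g));

// Default-deny: anything not matching an allowed glob is rejected,
// which covers src/, non-incident tests, and Cypress config.
function isEditAllowed(path: string): boolean {
  return ALLOWED_PATTERNS.some((p) => p.test(path));
}
```

Because the check is an allowlist, the "Must NOT edit" entries need no separate rules; they are rejected simply by not matching.
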
### 4. Validation Agent

Re-runs the specific failing test after a fix is applied:
```bash
source cypress/export-env.sh && npx cypress run --env grep="<test name>" --spec "<spec file>"
```

Reports pass/fail. If still failing, feeds back to Diagnosis Agent (max 2 retries per test).

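The fix, validate, and re-diagnose cycle can be sketched as a bounded loop. This is an illustration only: `RetryDeps`, `diagnoseAndFix`, and `rerunTest` are hypothetical stand-ins for dispatching the agents and running Cypress, and the sketch assumes "max 2 retries" means one initial attempt plus up to two more.

```typescript
type RunResult = "pass" | "fail";

// Hypothetical hooks; a real coordinator would dispatch sub-agents and Cypress here.
interface RetryDeps {
  diagnoseAndFix: (attempt: number) => void; // Diagnosis Agent + Fix Agent
  rerunTest: () => RunResult;                // re-run of the single failing test
}

const MAX_RETRIES = 2; // from "max 2 retries per test"

function fixWithRetries(deps: RetryDeps): { fixed: boolean; attempts: number } {
  const maxAttempts = 1 + MAX_RETRIES; // assumption: initial attempt + retries
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    deps.diagnoseAndFix(attempt);
    if (deps.rerunTest() === "pass") {
      return { fixed: true, attempts: attempt }; // caller stages the fix
    }
  }
  return { fixed: false, attempts: maxAttempts }; // caller escalates to the user
}
```

Injecting the two hooks keeps the retry budget testable without launching a browser.
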
## Flakiness Detection

Even if the first run is all green, the coordinator runs a **flakiness probe**:

1. Run the full incident test suite 3 times consecutively
2. Track per-test results across runs
3. Flag any test that fails in any run as `FLAKY`
4. For flaky tests: attempt to diagnose the timing/race condition and fix
5. Report flakiness statistics: `test_name: 2/3 passed` etc.

This catches intermittent failures that a single run would miss.

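Steps 2, 3, and 5 amount to a per-test tally across runs. The sketch below shows one possible shape for that bookkeeping; `RunOutcome`, `summarizeRuns`, and `formatStat` are hypothetical names, and each run is assumed to have already been reduced to a map of test name to pass/fail.

```typescript
// One probe run: test name -> true if it passed in that run.
type RunOutcome = { [testName: string]: boolean };

interface FlakinessStat {
  test: string;
  passed: number; // runs in which the test passed
  runs: number;   // runs in which the test was observed
  flaky: boolean; // failed in at least one run, per step 3
}

function summarizeRuns(runs: RunOutcome[]): FlakinessStat[] {
  const passed: { [test: string]: number } = {};
  const seen: { [test: string]: number } = {};
  for (const run of runs) {
    for (const test of Object.keys(run)) {
      seen[test] = (seen[test] || 0) + 1;
      passed[test] = (passed[test] || 0) + (run[test] ? 1 : 0);
    }
  }
  return Object.keys(seen).sort().map((test) => {
    const p = passed[test] || 0;
    const n = seen[test] || 0;
    return { test: test, passed: p, runs: n, flaky: p < n };
  });
}

// Renders the "test_name: 2/3 passed" format used in the report.
function formatStat(s: FlakinessStat): string {
  return s.test + ": " + s.passed + "/" + s.runs + " passed";
}
```

Note that this follows the rule as written: a test that fails in every run is also flagged, so a consumer may want to separate consistent failures from intermittent ones.
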
## CI Result Ingestion

CI analysis is handled by the dedicated `/cypress:test-iteration:analyze-ci-results` skill (`.claude/commands/cypress:test-iteration:analyze-ci-results.md`).

The skill fetches artifacts from OpenShift CI (Prow) runs on GCS, classifies failures as infrastructure vs test/code issues, reads failure screenshots with multimodal vision, and correlates failures with the git commits that triggered them.

### Key Capabilities

- **URL normalization**: Accepts gcsweb or Prow UI URLs at any level of the artifact tree
- **Structured metadata**: Extracts PR number, author, branch, commit SHAs from `started.json` / `finished.json` / `prowjob.json`
- **Build log parsing**: Parses Cypress console output from `build-log.txt` for per-spec pass/fail/skip counts and error details
- **Visual diagnosis**: Fetches and reads failure screenshots (multimodal) to understand UI state at failure time
- **Failure classification**: Categorizes each failure as `INFRA_*` (cluster, operator, plugin, auth, CI) or test/code (`TEST_BUG`, `FIXTURE_ISSUE`, `PAGE_OBJECT_GAP`, `MOCK_ISSUE`, `CODE_REGRESSION`)
- **Commit correlation**: Maps failures to specific file changes in the PR using `git diff {base}..{pr_head}`

### Integration with Orchestrator

The orchestrator (the Coordinator session running `/cypress:test-iteration:iterate-incident-tests`) uses `/cypress:test-iteration:analyze-ci-results` as an optional first step:

1. If all failures are `INFRA_*` -> report to user and STOP (no test changes will help)
2. If mixed -> report infra issues, proceed with test/code fixes only
3. If all test/code -> proceed with full iteration loop
4. Commit correlation tells the orchestrator whether to fix tests or investigate source changes
5. CI screenshots give the Diagnosis Agent a head start before local reproduction

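Rules 1 to 3 reduce to partitioning the classifications by the `INFRA_` prefix. The sketch below is a hedged illustration: `triageCiFailures` and the mode names are hypothetical, and the concrete `INFRA_CLUSTER` / `INFRA_AUTH` labels in the usage comment are made-up examples of the `INFRA_*` family, since the exact label names are not listed here.

```typescript
type TriageMode = "stop-infra-only" | "fix-test-code-only" | "full-iteration-loop";

interface Triage {
  mode: TriageMode;
  infra: string[];   // reported to the user
  fixable: string[]; // handed to the iteration loop
}

function isInfra(classification: string): boolean {
  return classification.indexOf("INFRA_") === 0;
}

function triageCiFailures(classifications: string[]): Triage {
  const infra = classifications.filter(isInfra);
  const fixable = classifications.filter((c) => !isInfra(c));
  let mode: TriageMode;
  if (infra.length === 0) {
    mode = "full-iteration-loop"; // rule 3 (also covers "no failures")
  } else if (fixable.length === 0) {
    mode = "stop-infra-only";     // rule 1: no test change will help
  } else {
    mode = "fix-test-code-only";  // rule 2: report infra, fix the rest
  }
  return { mode: mode, infra: infra, fixable: fixable };
}

// Usage with hypothetical labels:
// triageCiFailures(["INFRA_CLUSTER", "TEST_BUG"]).mode is "fix-test-code-only"
```

Checking for the absence of infra failures first avoids treating an empty failure list as "all infra".
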
### Usage

```
/cypress:test-iteration:analyze-ci-results ci-url=https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/.../{RUN_ID}/
/cypress:test-iteration:analyze-ci-results ci-url=https://prow.ci.openshift.org/view/gs/.../{RUN_ID} focus=regression
```

## Commit Strategy

- **Branch naming**: `test/incident-robustness-YYYY-MM-DD`
- **Commit granularity**: Group related fixes (e.g., "fix 3 selector issues in filtering tests")
- **Review threshold**: Notify user after **5 commits** for review
- **Never force-push**; always additive commits
- User merges when ready, or continues iteration

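The branch naming and review threshold are simple enough to express directly. This is a small sketch with hypothetical helper names (`branchName`, `shouldNotifyUser`); it assumes the date is taken in UTC, which the strategy above does not specify.

```typescript
const REVIEW_THRESHOLD = 5; // "Notify user after 5 commits"

// Builds test/incident-robustness-YYYY-MM-DD from a date (UTC is an assumption).
function branchName(date: Date): string {
  const yyyy = date.getUTCFullYear();
  const mm = ("0" + (date.getUTCMonth() + 1)).slice(-2);
  const dd = ("0" + date.getUTCDate()).slice(-2);
  return "test/incident-robustness-" + yyyy + "-" + mm + "-" + dd;
}

// True once enough commits have accumulated since the last review.
function shouldNotifyUser(commitsSinceReview: number): boolean {
  return commitsSinceReview >= REVIEW_THRESHOLD;
}

// Example: branchName(new Date(Date.UTC(2025, 0, 7)))
// returns "test/incident-robustness-2025-01-07"
```
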
## Test Execution (Inline)

Tests run inline via Bash, not in a separate terminal:

```bash
cd web && source cypress/export-env.sh && \
  npx cypress run \
    --spec "cypress/e2e/incidents/regression/**/*.cy.ts" \
    --env grepTags="@incidents --@e2e-real --@flaky" \
    --reporter ./node_modules/cypress-multi-reporters \
    --reporter-options configFile=reporter-config.json
```

Results are collected from:
- **Exit code**: 0 = all passed, non-zero = failures
- **Mochawesome JSON**: `screenshots/cypress_report_*.json` (per-test details)
- **Screenshots**: `cypress/screenshots/{spec}/` (failure screenshots)
- **Videos**: `cypress/videos/{spec}.mp4` (kept on failure)

### Report Parsing

Mochawesome JSON structure (per report file):
```json
{
  "stats": { "passes": N, "failures": N, "skipped": N },
  "results": [{
    "suites": [{
      "title": "Suite Name",
      "tests": [{
        "title": "test description",
        "fullTitle": "Suite -- test description",
        "state": "failed|passed|skipped",
        "err": {
          "message": "error description",
          "estack": "full stack trace"
        }
      }]
    }]
  }]
}
```

Use `npx mochawesome-merge screenshots/cypress_report_*.json > merged-report.json` to combine per-spec reports.

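Given the structure above, failure extraction is a recursive walk over `results` and their `suites`. The sketch below works on an already-parsed report object and uses only the fields shown in the sample; the type names and `extractFailures` are hypothetical, and descending into nested `suites` at every level is an assumption based on how Mocha nests `describe` blocks.

```typescript
interface MochaErr { message?: string; estack?: string }
interface MochaTest { title: string; fullTitle: string; state: string; err: MochaErr }
interface MochaSuite { title?: string; tests?: MochaTest[]; suites?: MochaSuite[] }
interface MochaReport { results: MochaSuite[] }

// What the Parser step hands to the Diagnosis Agent for each failure.
interface Failure { fullTitle: string; message: string; stack: string }

function collectFailures(suite: MochaSuite, out: Failure[]): void {
  for (const test of suite.tests || []) {
    if (test.state === "failed") {
      out.push({
        fullTitle: test.fullTitle,
        message: test.err.message || "",
        stack: test.err.estack || "",
      });
    }
  }
  // Assumption: suites may nest arbitrarily deep.
  for (const child of suite.suites || []) {
    collectFailures(child, out);
  }
}

function extractFailures(report: MochaReport): Failure[] {
  const out: Failure[] = [];
  for (const root of report.results) {
    collectFailures(root, out);
  }
  return out;
}
```

In practice the input would come from `JSON.parse` on a per-spec report or on the merged report; screenshot and video paths are not part of this structure and would need to be resolved separately.
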
## Skills

| Skill | Purpose | Invoked by |
|-------|---------|------------|
| `/cypress:test-iteration:iterate-incident-tests` | Main orchestrator: local iteration loop, dispatches agents, manages commits | User |
| `/cypress:test-iteration:iterate-ci-flaky` | CI-based iteration: push fixes, trigger Prow jobs, wait, analyze, repeat | User |
| `/cypress:test-iteration:diagnose-test-failure` | Classifies a single test failure using screenshots + code analysis | Orchestrator (as sub-agent prompt) |
| `/cypress:test-iteration:analyze-ci-results` | Fetches and analyzes OpenShift CI artifacts, classifies infra vs test/code | User or orchestrator |

Skills are defined in `.claude/commands/` and can be invoked as slash commands.

## Existing Infrastructure Leveraged

| Asset | How the agent uses it |
|-------|----------------------|
| mochawesome JSON reporter | Structured test results parsing |
| `@cypress/grep` | Run individual tests by name or tag |
| `export-env.sh` | Source env vars for inline execution |
| YAML fixture system | Edit fixtures to fix data issues |
| Page object (`incidents-page.ts`) | Fix selectors and add missing methods |
| Mock layer (`incidents_prometheus_query_mocks/`) | Fix intercept patterns |
| `/cypress:test-development:generate-incident-fixture` skill | Generate new fixtures when needed |
| `/cypress:test-development:validate-incident-fixtures` skill | Validate fixture edits |

## Phase 2: Frontend Refactoring (Future)

### Concept

Tests become the behavioral contract. The agent refactors frontend code while using the test suite as a safety net.

### Additional Agent Roles

| Agent | Role |
|-------|------|
| **Refactor Planner** | Analyzes frontend code, proposes refactoring steps |
| **Refactor Agent** | Makes code changes (replaces Fix Agent) |
| **Contract Validator** | Runs tests, classifies failures as regression vs test-coupling |
| **Test Adapter** | Updates tests that assert implementation details instead of behavior |

### Key Principle

If a test breaks due to refactoring but behavior is preserved, the test needs updating: it was too coupled to implementation. Phase 1 (robustness) reduces this coupling, making Phase 2 more effective.

### Additional Classification

The Diagnosis Agent gains `TEST_TOO_COUPLED`: the test asserts implementation details (specific DOM structure, internal state) rather than observable behavior. The Test Adapter agent rewrites it to be implementation-agnostic.
