| 1 | +# AGENTS.md |
| 2 | + |
| 3 | +## Core Rules |
| 4 | + |
| 5 | +- Docker-first and Docker-only unless user asks otherwise. |
| 6 | +- Keep repo focused: stable Botasaurus scrape API wrapper, not generic framework. |
| 7 | + |
| 8 | +## Contract (Do Not Break) |
| 9 | + |
| 10 | +- Endpoints: `GET /health`, `POST /scrape`. |
| 11 | +- Stable legacy `/scrape` fields: `url`, `final_url`, `status_code`, `headers`, `html`, `error`, `metadata_error`. |
| 12 | +- Additive diagnostics fields (current contract): `request_id`, `attempts`, `strategy_used`, `render_ms`, `blocked_detected`, `challenge_detected`, `error_category`. |
| 13 | +- Request options (current contract): `navigation_mode`, `max_retries`, `wait_for_selector`, `wait_timeout_seconds`, `block_images`. |
| 14 | +- Error codes: |
| 15 | + - `400` validation/resolution failure |
| 16 | + - `403` SSRF guardrail block |
| 17 | + - `422` request schema validation |
| 18 | + - `502` scrape execution failure |
| 19 | + - `504` timeout |
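
A minimal sketch of how the error codes above could be kept in one place. The category names and the `status_for` helper are hypothetical (the contract lists only the status codes and an `error_category` field, not its values), so treat this as an illustration rather than the wrapper's actual code.

```python
from enum import Enum


class ErrorCategory(str, Enum):
    # Hypothetical category names; the contract does not list the real values.
    VALIDATION = "validation"        # validation/resolution failure
    SSRF_BLOCKED = "ssrf_blocked"    # SSRF guardrail block
    SCHEMA = "schema"                # request schema validation
    SCRAPE_FAILED = "scrape_failed"  # scrape execution failure
    TIMEOUT = "timeout"              # timeout


# Status codes copied from the contract list above.
STATUS_BY_CATEGORY = {
    ErrorCategory.VALIDATION: 400,
    ErrorCategory.SSRF_BLOCKED: 403,
    ErrorCategory.SCHEMA: 422,
    ErrorCategory.SCRAPE_FAILED: 502,
    ErrorCategory.TIMEOUT: 504,
}


def status_for(category: ErrorCategory) -> int:
    """Return the HTTP status code for a failure category."""
    return STATUS_BY_CATEGORY[category]
```

Keeping the mapping in a single table makes an accidental contract change show up as a one-line diff.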
| 20 | + |
| 21 | +## Runtime + Browser Constraints |
| 22 | + |
| 23 | +- `POST /scrape` is an async endpoint that runs synchronous browser work in a threadpool. |
| 24 | +- Each scrape request must use isolated runtime state: |
| 25 | + - request-scoped runtime dir `/tmp/scrape/<request_id>` |
| 26 | + - request-scoped browser profile |
| 27 | + - no cache/profile/driver reuse across requests |
| 28 | +- Cleanup is mandatory in `finally`: |
| 29 | + - close browser driver |
| 30 | + - delete request runtime dir |
| 31 | + - remove in-memory active request id |
| 32 | +- Keep request-id collision/invariant guard (`_active_request_ids`) intact. |
| 33 | +- `driver.requests.get` metadata is best-effort; metadata failure must not fail HTML success. |
| 34 | +- Keep strategy engine behavior: |
| 35 | + - `auto` mode attempt order: `google_get` -> `google_get_bypass` -> `get` |
| 36 | + - do not alter retry semantics without docs/tests update |
| 37 | +- Multi-arch image required: |
| 38 | + - install Chromium on all architectures |
| 39 | + - keep `/usr/bin/google-chrome` symlink to Chromium for compatibility |
| 40 | +- If browser install logic changes, re-verify binary path and Botasaurus startup. |
| 41 | + |
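
A rough sketch of the `auto` attempt order described above. The strategy names, `navigation_mode`, `strategy_used`, and `attempts` come from this file; `scrape_with_fallback` and the `strategies` mapping are hypothetical, and `max_retries` handling is left out because its semantics are not specified here.

```python
AUTO_ORDER = ("google_get", "google_get_bypass", "get")


def scrape_with_fallback(url, strategies, navigation_mode="auto"):
    """Try strategies in order and return the first successful result.

    strategies maps a strategy name to a callable(url) -> html (hypothetical).
    """
    # An explicit mode overrides the auto order with a single strategy.
    order = AUTO_ORDER if navigation_mode == "auto" else (navigation_mode,)
    last_error = None
    attempts = 0
    for name in order:
        attempts += 1
        try:
            html = strategies[name](url)
        except Exception as exc:
            last_error = exc  # remember the failure and try the next strategy
            continue
        return {"html": html, "strategy_used": name, "attempts": attempts}
    raise RuntimeError(f"all strategies failed for {url}") from last_error
```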
| 42 | +## Safety |
| 43 | + |
| 44 | +- Keep SSRF guardrails: localhost/domain checks and blocked IP classes (loopback/private/link-local/multicast/reserved/unspecified). |
| 45 | +- Do not weaken URL validation without explicit request plus docs/tests updates. |
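
For illustration, the blocked IP classes listed above map directly onto attributes of Python's standard `ipaddress` module. The helper name `is_blocked_ip` is hypothetical, and a real guard would also need the localhost/domain checks and would have to test every address a hostname resolves to; this sketch covers only the IP classification step.

```python
import ipaddress


def is_blocked_ip(value: str) -> bool:
    """Return True if the address falls in any blocked class."""
    ip = ipaddress.ip_address(value)  # raises ValueError on malformed input
    return (
        ip.is_loopback
        or ip.is_private
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )
```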
| 46 | + |
| 47 | +## Done Criteria |
| 48 | + |
| 49 | +- Run `make smoke` before finishing. |
| 50 | +- `make smoke` must cover build, boot, `/health`, `/scrape` happy path, strategy override, retry path, isolation check, localhost guardrail. |
| 51 | +- If API contract, Docker behavior, or error semantics changed, update README in same change. |
| 52 | +- Keep commits scoped (infra vs API vs docs). |