
Commit db56e0c

Align scrape and wait timeout defaults with html2rss pipeline

1 parent (3fe981f) · 3 files changed · 7 additions, 5 deletions

README.md (3 additions, 3 deletions)

````diff
@@ -8,7 +8,7 @@ Docker-only FastAPI service that uses [Botasaurus](https://github.com/omkarcloud
 - `GET /health`
 - `POST /scrape`
 - Intended usage: run and test through Docker only.
-- Runtime boundary: async FastAPI handler delegates sync browser work to a bounded threadpool (`SCRAPE_MAX_WORKERS`), with a per-request timeout (`SCRAPE_TIMEOUT_SECONDS`).
+- Runtime boundary: async FastAPI handler delegates sync browser work to a bounded threadpool (`SCRAPE_MAX_WORKERS`, default `4`), with a per-request timeout (`SCRAPE_TIMEOUT_SECONDS`, default `25`).
 - On-demand isolation-first runtime: every scrape request runs with an ephemeral browser profile and request-scoped runtime dir, then gets fully cleaned up.

 ## Prerequisites
@@ -125,7 +125,7 @@ Request options (contract):
 - `google_get_bypass`: only `google_get(bypass_cloudflare=true)`
 - `max_retries`: `0..3`, default `2` (attempts = `1 + max_retries`, with `auto` capped by 3 strategy steps).
 - `wait_for_selector`: if set, response waits for selector before capture.
-- `wait_timeout_seconds`: selector wait timeout (capped by service timeout).
+- `wait_timeout_seconds`: selector wait timeout (default `15`, capped by service timeout).
 - `block_images`: pass image blocking to driver. Default `true`.

 Currently accepted passthrough options (implemented, not part of stable request-options contract):
@@ -255,7 +255,7 @@ curl -s -X POST http://localhost:4010/scrape \
     "url":"https://truthsocial.com/@realDonaldTrump",
     "navigation_mode":"auto",
     "max_retries":2,
-    "wait_timeout_seconds":20,
+    "wait_timeout_seconds":15,
     "headless":false
   }'
 ```
````
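The tightened `wait_timeout_seconds` contract above (default `15`, range `1..` service timeout) can be sketched in plain Python. `validate_wait_timeout` is a hypothetical stand-in for the pydantic `Field(ge=1, le=...)` constraints in `app/main.py`, not code from the repo:

```python
SCRAPE_TIMEOUT_SECONDS = 25  # new service-timeout default from this commit

def validate_wait_timeout(value=None):
    """Hypothetical stand-in for the Field(ge=1, le=...) constraints."""
    if value is None:
        # Default is 15, but never above the service timeout.
        return min(15, SCRAPE_TIMEOUT_SECONDS)
    if not 1 <= value <= SCRAPE_TIMEOUT_SECONDS:
        raise ValueError(
            f"wait_timeout_seconds must be in 1..{SCRAPE_TIMEOUT_SECONDS}"
        )
    return value

print(validate_wait_timeout())    # 15 (the new default)
print(validate_wait_timeout(20))  # 20 (within the cap)
```

In the real service an out-of-range value would surface as a pydantic validation error on the request model rather than a bare `ValueError`.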

app/main.py (3 additions, 2 deletions)

```diff
@@ -18,7 +18,8 @@
 from fastapi.responses import JSONResponse
 from pydantic import BaseModel, Field, HttpUrl, field_validator

-DEFAULT_SCRAPE_TIMEOUT_SECONDS = int(os.getenv("SCRAPE_TIMEOUT_SECONDS", "60"))
+DEFAULT_SCRAPE_TIMEOUT_SECONDS = int(os.getenv("SCRAPE_TIMEOUT_SECONDS", "25"))
+DEFAULT_WAIT_TIMEOUT_SECONDS = min(15, DEFAULT_SCRAPE_TIMEOUT_SECONDS)
 _MAX_WORKERS = int(os.getenv("SCRAPE_MAX_WORKERS", "4"))
 _RUNTIME_ROOT = Path("/tmp/scrape")

@@ -60,7 +61,7 @@ class ScrapeRequest(BaseModel):
     max_retries: int = Field(default=2, ge=0, le=3)
     wait_for_selector: Optional[str] = None
     wait_timeout_seconds: int = Field(
-        default=DEFAULT_SCRAPE_TIMEOUT_SECONDS,
+        default=DEFAULT_WAIT_TIMEOUT_SECONDS,
         ge=1,
         le=DEFAULT_SCRAPE_TIMEOUT_SECONDS,
     )
```
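A minimal sketch of how the two new module-level defaults interact: the selector-wait default follows the service timeout down whenever `SCRAPE_TIMEOUT_SECONDS` is set below 15. `resolve_defaults` is a hypothetical helper for illustration only; the module itself computes these values at import time from `os.getenv`:

```python
def resolve_defaults(env):
    """Hypothetical helper mirroring the module-level defaults above."""
    scrape_timeout = int(env.get("SCRAPE_TIMEOUT_SECONDS", "25"))
    # DEFAULT_WAIT_TIMEOUT_SECONDS is clamped by the service timeout.
    wait_timeout = min(15, scrape_timeout)
    return scrape_timeout, wait_timeout

print(resolve_defaults({}))                                # (25, 15)
print(resolve_defaults({"SCRAPE_TIMEOUT_SECONDS": "10"}))  # (10, 10)
```

The clamp keeps the model's `default <= le` invariant intact: without it, a low `SCRAPE_TIMEOUT_SECONDS` would make the field's default exceed its own upper bound.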

tests/test_api_contract.py (1 addition, 0 deletions)

```diff
@@ -59,6 +59,7 @@ def test_request_defaults(self):
         payload = main.ScrapeRequest(url="https://example.com")
         self.assertEqual(payload.navigation_mode, "auto")
         self.assertEqual(payload.max_retries, 2)
+        self.assertEqual(payload.wait_timeout_seconds, 15)
         self.assertTrue(payload.block_images)
         self.assertFalse(payload.block_images_and_css)
         self.assertTrue(payload.wait_for_complete_page_load)
```
