Commit 092a216

feat: set baseline scrape defaults for reliability and integration (#1)
* Set scrape baseline to block images by default
* Align scrape and wait timeout defaults with html2rss pipeline
1 parent eaec3cc commit 092a216

3 files changed

Lines changed: 11 additions & 9 deletions

README.md

Lines changed: 5 additions & 5 deletions
@@ -8,7 +8,7 @@ Docker-only FastAPI service that uses [Botasaurus](https://github.com/omkarcloud
 - `GET /health`
 - `POST /scrape`
 - Intended usage: run and test through Docker only.
-- Runtime boundary: async FastAPI handler delegates sync browser work to a bounded threadpool (`SCRAPE_MAX_WORKERS`), with a per-request timeout (`SCRAPE_TIMEOUT_SECONDS`).
+- Runtime boundary: async FastAPI handler delegates sync browser work to a bounded threadpool (`SCRAPE_MAX_WORKERS`, default `4`), with a per-request timeout (`SCRAPE_TIMEOUT_SECONDS`, default `25`).
 - On-demand isolation-first runtime: every scrape request runs with an ephemeral browser profile and request-scoped runtime dir, then gets fully cleaned up.

 ## Prerequisites
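
Review note: the "runtime boundary" bullet in the hunk above describes an async handler that hands blocking browser work to a bounded threadpool and enforces a per-request timeout. The following is a minimal sketch of that pattern using only the standard library. It is not the service's actual handler: `blocking_scrape`, `scrape_with_timeout`, and the `delay` parameter are hypothetical names introduced for illustration, and the constants simply mirror the documented defaults (`4` workers, `25` seconds).

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4         # mirrors the documented SCRAPE_MAX_WORKERS default
TIMEOUT_SECONDS = 25.0  # mirrors the documented SCRAPE_TIMEOUT_SECONDS default

# Bounded pool: at most MAX_WORKERS blocking jobs run at the same time.
_executor = ThreadPoolExecutor(max_workers=MAX_WORKERS)


def blocking_scrape(url: str, delay: float = 0.0) -> str:
    """Hypothetical stand-in for the synchronous browser work."""
    time.sleep(delay)
    return f"<html>{url}</html>"


async def scrape_with_timeout(
    url: str, timeout: float = TIMEOUT_SECONDS, delay: float = 0.0
) -> str:
    """Run the blocking job in the pool and stop waiting after `timeout` seconds."""
    loop = asyncio.get_running_loop()
    future = loop.run_in_executor(_executor, blocking_scrape, url, delay)
    # wait_for raises asyncio.TimeoutError once the budget is exhausted.
    # The worker thread itself is not interrupted; it runs to completion.
    return await asyncio.wait_for(future, timeout=timeout)
```

One consequence worth keeping in mind with this pattern: a timed-out request frees the caller but not the worker thread, so a lower timeout default shortens response latency without necessarily freeing pool capacity at the same moment.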
@@ -105,7 +105,7 @@ Request body (full options):
 "max_retries": 2,
 "wait_for_selector": "h1",
 "wait_timeout_seconds": 15,
-"block_images": false,
+"block_images": true,
 "block_images_and_css": false,
 "wait_for_complete_page_load": true,
 "user_agent": "Mozilla/5.0 ...",
@@ -125,8 +125,8 @@ Request options (contract):
 - `google_get_bypass`: only `google_get(bypass_cloudflare=true)`
 - `max_retries`: `0..3`, default `2` (attempts = `1 + max_retries`, with `auto` capped by 3 strategy steps).
 - `wait_for_selector`: if set, response waits for selector before capture.
-- `wait_timeout_seconds`: selector wait timeout (capped by service timeout).
-- `block_images`: pass image blocking to driver.
+- `wait_timeout_seconds`: selector wait timeout (default `15`, capped by service timeout).
+- `block_images`: pass image blocking to driver. Default `true`.

 Currently accepted passthrough options (implemented, not part of stable request-options contract):

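
Review note: the `max_retries` bullet in the hunk above states that attempts equal `1 + max_retries`, with `auto` capped by 3 strategy steps. One possible reading of that arithmetic is sketched below. The function name `attempt_budget` and the constant `AUTO_STRATEGY_STEPS` are hypothetical, and the assumption that the cap applies as a simple `min` in `auto` mode is an interpretation of the README wording, not something the diff confirms.

```python
AUTO_STRATEGY_STEPS = 3  # assumption: "auto" cycles through three strategies


def attempt_budget(navigation_mode: str, max_retries: int = 2) -> int:
    """Return how many attempts a request may make under the stated contract."""
    if not 0 <= max_retries <= 3:
        raise ValueError("max_retries must be in 0..3")
    attempts = 1 + max_retries
    if navigation_mode == "auto":
        # Assumed reading: auto mode cannot exceed its number of strategy steps.
        attempts = min(attempts, AUTO_STRATEGY_STEPS)
    return attempts
```

Under this reading, the default request (`auto`, `max_retries=2`) gets three attempts, and raising `max_retries` to `3` in `auto` mode changes nothing.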
@@ -255,7 +255,7 @@ curl -s -X POST http://localhost:4010/scrape \
 "url":"https://truthsocial.com/@realDonaldTrump",
 "navigation_mode":"auto",
 "max_retries":2,
-"wait_timeout_seconds":20,
+"wait_timeout_seconds":15,
 "headless":false
 }'
 ```

app/main.py

Lines changed: 4 additions & 3 deletions
@@ -18,7 +18,8 @@
 from fastapi.responses import JSONResponse
 from pydantic import BaseModel, Field, HttpUrl, field_validator

-DEFAULT_SCRAPE_TIMEOUT_SECONDS = int(os.getenv("SCRAPE_TIMEOUT_SECONDS", "60"))
+DEFAULT_SCRAPE_TIMEOUT_SECONDS = int(os.getenv("SCRAPE_TIMEOUT_SECONDS", "25"))
+DEFAULT_WAIT_TIMEOUT_SECONDS = min(15, DEFAULT_SCRAPE_TIMEOUT_SECONDS)
 _MAX_WORKERS = int(os.getenv("SCRAPE_MAX_WORKERS", "4"))
 _RUNTIME_ROOT = Path("/tmp/scrape")

@@ -60,11 +61,11 @@ class ScrapeRequest(BaseModel):
     max_retries: int = Field(default=2, ge=0, le=3)
     wait_for_selector: Optional[str] = None
     wait_timeout_seconds: int = Field(
-        default=DEFAULT_SCRAPE_TIMEOUT_SECONDS,
+        default=DEFAULT_WAIT_TIMEOUT_SECONDS,
         ge=1,
         le=DEFAULT_SCRAPE_TIMEOUT_SECONDS,
     )
-    block_images: bool = False
+    block_images: bool = True
     block_images_and_css: bool = False
     wait_for_complete_page_load: bool = True
     user_agent: Optional[str] = None
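
Review note: the two `app/main.py` hunks above change how the timeout defaults relate. Previously the selector wait defaulted to the full scrape timeout; now it defaults to `min(15, DEFAULT_SCRAPE_TIMEOUT_SECONDS)` while still being bounded above by the scrape timeout. The sketch below reproduces that arithmetic without pydantic so the interaction with the environment variable is easy to see. `resolve_timeout_defaults` and `validate_wait_timeout` are hypothetical helpers: the real module reads `os.getenv` at import time and relies on `Field(ge=1, le=...)` for the bounds.

```python
from typing import Mapping, Tuple


def resolve_timeout_defaults(env: Mapping[str, str]) -> Tuple[int, int]:
    """Return (scrape_timeout, wait_timeout) defaults for a given environment."""
    scrape_timeout = int(env.get("SCRAPE_TIMEOUT_SECONDS", "25"))
    # The wait default never exceeds the overall scrape budget.
    wait_timeout = min(15, scrape_timeout)
    return scrape_timeout, wait_timeout


def validate_wait_timeout(value: int, scrape_timeout: int) -> int:
    """Plain-Python stand-in for the ge=1 / le=scrape_timeout field bounds."""
    if not 1 <= value <= scrape_timeout:
        raise ValueError(
            f"wait_timeout_seconds must be in 1..{scrape_timeout}, got {value}"
        )
    return value
```

With no override, the defaults resolve to `25` and `15`. Setting `SCRAPE_TIMEOUT_SECONDS=10` lowers both to `10`, and a request that asks for a wait longer than the scrape timeout is rejected rather than silently shortened.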

tests/test_api_contract.py

Lines changed: 2 additions & 1 deletion
@@ -59,7 +59,8 @@ def test_request_defaults(self):
         payload = main.ScrapeRequest(url="https://example.com")
         self.assertEqual(payload.navigation_mode, "auto")
         self.assertEqual(payload.max_retries, 2)
-        self.assertFalse(payload.block_images)
+        self.assertEqual(payload.wait_timeout_seconds, 15)
+        self.assertTrue(payload.block_images)
         self.assertFalse(payload.block_images_and_css)
         self.assertTrue(payload.wait_for_complete_page_load)
         self.assertIsNone(payload.user_agent)
