Commit 5fc67c3

docs: align botasaurus + auto fallback + web release model (#1136)
* docs: align strategy docs with botasaurus-first auto fallback
* docs: tighten strategy UX wording for end users
* docs: document versioned web image releases
* docs: reduce feed directory repetition
* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* docs: add botasaurus-scrape-api to docker-compose examples

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
1 parent 0c0e723 commit 5fc67c3

20 files changed

Lines changed: 243 additions & 194 deletions

examples/deployment/docker-compose.yml

Lines changed: 7 additions & 1 deletion
@@ -1,12 +1,17 @@
 services:
   html2rss-web:
-    image: html2rss/web:latest
+    image: html2rss/web:1
     restart: unless-stopped
     env_file:
       - path: .env
         required: false
     environment:
       PORT: 4000
+      BOTASAURUS_SCRAPER_URL: http://botasaurus:4010
+
+  botasaurus:
+    image: html2rss/botasaurus-scrape-api:latest
+    restart: unless-stopped
 
   caddy:
     image: caddy:2-alpine
@@ -30,6 +35,7 @@ services:
     depends_on:
       - html2rss-web
       - caddy
+      - botasaurus
     command:
       - --cleanup
       - --interval

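Read together, the two hunks above leave the relevant part of the example looking roughly like the sketch below. It is reconstructed from the diff only: the `caddy` service and the service that owns `depends_on` are omitted, and the indentation is assumed rather than copied from the file.

```yaml
# Sketch reconstructed from the diff above; not the complete file.
services:
  html2rss-web:
    image: html2rss/web:1 # versioned tag instead of :latest
    restart: unless-stopped
    env_file:
      - path: .env
        required: false
    environment:
      PORT: 4000
      # Host name "botasaurus" is the Compose service name defined below;
      # port 4010 is taken from the diff.
      BOTASAURUS_SCRAPER_URL: http://botasaurus:4010

  botasaurus:
    image: html2rss/botasaurus-scrape-api:latest
    restart: unless-stopped
```

Because Compose resolves service names on the default network, the web container can reach the scraper at `http://botasaurus:4010` without publishing that port on the host.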
src/components/docs/DockerComposeSnippet.astro

Lines changed: 17 additions & 8 deletions
@@ -1,6 +1,6 @@
 ---
 import { Code } from "@astrojs/starlight/components";
-import { browserlessImage, caddyImage, watchtowerImage, webImage } from "../../data/docker";
+import { botasaurusImage, browserlessImage, caddyImage, watchtowerImage, webImage } from "../../data/docker";
 
 interface Props {
   variant: "minimal" | "productionCaddy" | "secure" | "watchtower" | "resourceGuardrails";
@@ -21,13 +21,16 @@ const snippets: Record<Props["variant"], string> = {
     environment:
       RACK_ENV: production
       PORT: 4000
-      BUILD_TAG: \${BUILD_TAG:-local}
-      GIT_SHA: \${GIT_SHA:-local}
       HTML2RSS_SECRET_KEY: \${HTML2RSS_SECRET_KEY:?set HTML2RSS_SECRET_KEY}
       HEALTH_CHECK_TOKEN: \${HEALTH_CHECK_TOKEN:?set HEALTH_CHECK_TOKEN}
       SENTRY_DSN: \${SENTRY_DSN:-}
       BROWSERLESS_IO_WEBSOCKET_URL: ws://browserless:4002
       BROWSERLESS_IO_API_TOKEN: \${BROWSERLESS_IO_API_TOKEN:?set BROWSERLESS_IO_API_TOKEN}
+      BOTASAURUS_SCRAPER_URL: http://botasaurus:4010
+
+  botasaurus:
+    image: ${botasaurusImage}
+    restart: unless-stopped
 
   browserless:
     image: "${browserlessImage}"
@@ -64,13 +67,16 @@ const snippets: Record<Props["variant"], string> = {
     environment:
       RACK_ENV: production
       PORT: 4000
-      BUILD_TAG: \${BUILD_TAG:-local}
-      GIT_SHA: \${GIT_SHA:-local}
       HTML2RSS_SECRET_KEY: \${HTML2RSS_SECRET_KEY:?set HTML2RSS_SECRET_KEY}
       HEALTH_CHECK_TOKEN: \${HEALTH_CHECK_TOKEN:?set HEALTH_CHECK_TOKEN}
       SENTRY_DSN: \${SENTRY_DSN:-}
       BROWSERLESS_IO_WEBSOCKET_URL: ws://browserless:4002
       BROWSERLESS_IO_API_TOKEN: \${BROWSERLESS_IO_API_TOKEN:?set BROWSERLESS_IO_API_TOKEN}
+      BOTASAURUS_SCRAPER_URL: http://botasaurus:4010
+
+  botasaurus:
+    image: ${botasaurusImage}
+    restart: unless-stopped
 
   browserless:
     image: "${browserlessImage}"
@@ -92,13 +98,16 @@ volumes:
     environment:
       RACK_ENV: production
       PORT: 4000
-      BUILD_TAG: \${BUILD_TAG:-local}
-      GIT_SHA: \${GIT_SHA:-local}
       HTML2RSS_SECRET_KEY: \${HTML2RSS_SECRET_KEY:?set HTML2RSS_SECRET_KEY}
       HEALTH_CHECK_TOKEN: \${HEALTH_CHECK_TOKEN:?set HEALTH_CHECK_TOKEN}
       SENTRY_DSN: \${SENTRY_DSN:-}
       BROWSERLESS_IO_WEBSOCKET_URL: ws://browserless:4002
       BROWSERLESS_IO_API_TOKEN: \${BROWSERLESS_IO_API_TOKEN:?set BROWSERLESS_IO_API_TOKEN}
+      BOTASAURUS_SCRAPER_URL: http://botasaurus:4010
+
+  botasaurus:
+    image: ${botasaurusImage}
+    restart: unless-stopped
 
   browserless:
     image: "${browserlessImage}"
@@ -115,7 +124,7 @@ volumes:
       - /var/run/docker.sock:/var/run/docker.sock:ro
       # Optional for private registries only:
       # - "\${HOME}/.docker/config.json:/config.json:ro"
-    command: --cleanup --interval 7200 html2rss-web browserless caddy`,
+    command: --cleanup --interval 7200 html2rss-web botasaurus browserless caddy`,
   resourceGuardrails: `services:
   html2rss-web:
     image: ${webImage}
Lines changed: 16 additions & 34 deletions
@@ -1,51 +1,47 @@
 ---
 title: "Common Use Cases"
-description: "See how people use html2rss to stay updated with their favorite websites. Real examples for personal and business use cases."
+description: "Use html2rss for common tracking and monitoring workflows."
 ---
 
-Discover how people are using html2rss to take control of their web content consumption. These real-world examples show the power and flexibility of creating custom RSS feeds.
-
----
+Use html2rss when you want updates in a reader instead of checking websites by hand.
 
 ## Personal Use Cases
 
 ### Following Your Favorite Bloggers
 
-Many bloggers don't offer RSS feeds, but you can create them with html2rss. Follow writers you love without relying on social media algorithms.
+Many blogs and creator sites do not publish feeds.
 
-**Example:** Create a feed for a personal blog that only posts to social media.
+**Example:** Follow a newsroom, company blog, or publication section from your own `html2rss-web` deployment.
 
 ### Job Hunting
 
 Track job postings from multiple company websites in one place. Never miss an opportunity again.
 
-**Example:** Follow job boards, company career pages, and industry-specific job sites.
+**Example:** Track a company careers page or a narrower role-specific listing.
 
 ### Local News
 
 Follow your local newspaper or community website to stay informed about your neighborhood.
 
-**Example:** Create feeds for local news sites, community forums, and city government updates.
+**Example:** Subscribe to local news sites, community forums, and city government updates from one reader.
 
 ### Academic Research
 
 Follow new papers and research in your field from multiple sources.
 
-**Example:** Track arXiv submissions, journal publications, and conference proceedings.
+**Example:** Track publication pages, research blogs, and conference updates.
 
 ### Product Updates
 
 Get notified when software you use releases updates, new features, or security patches.
 
-**Example:** Follow product blogs, changelog pages, and release notes.
+**Example:** Track release notes, changelog pages, and product blogs.
 
 ### Hobby Communities
 
 Follow forums, communities, and websites related to your hobbies and interests.
 
-**Example:** Track gaming forums, photography communities, or cooking blogs.
-
----
+**Example:** Track gaming forums, photography communities, or cooking blogs without manually checking each site.
 
 ## Business Use Cases
 
@@ -59,21 +55,19 @@ Track what your competitors are posting about - new products, features, or annou
 
 Follow multiple industry publications in one feed to stay ahead of trends.
 
-**Example:** Aggregate news from industry blogs, trade publications, and thought leaders.
+**Example:** Aggregate trade publications, company blogs, and research updates in one reader.
 
 ### Customer Support
 
 Monitor customer feedback and support requests across different platforms.
 
-**Example:** Track support forums, review sites, and social media mentions.
+**Example:** Track support forums, review sites, and product-update pages that affect your users.
 
 ### Content Marketing
 
 Follow industry influencers and competitors for content inspiration.
 
-**Example:** Track competitor blogs, industry newsletters, and thought leadership content.
-
----
+**Example:** Track competitor blogs, industry newsletters, and thought leadership content in one place.
 
 ## Technical Use Cases
 
@@ -95,20 +89,8 @@ Follow multiple open source projects and their updates.
 
 **Example:** Track project blogs, release notes, and community discussions.
 
----
-
-## Getting Started with Your Use Case
-
-1. **Identify the websites** you want to follow
-2. **Check our [Feed Directory](/feed-directory/)** to see if feeds already exist
-3. **Try the [Web App](/web-application/getting-started)** to create feeds easily
-4. **Learn advanced techniques** with our [Config Guide](/creating-custom-feeds/)
-
----
-
-## Need Help?
+## Next Steps
 
-- **Can't find what you're looking for?** [Browse our Feed Directory](/feed-directory/)
-- **Want to create custom feeds?** [Try the Web App](/web-application/getting-started)
-- **Need advanced features?** [Check our Ruby Gem docs](/ruby-gem/)
-- **Have questions?** [Join our community discussions](https://github.com/orgs/html2rss/discussions)
+- **[Run html2rss-web with Docker](/web-application/getting-started)** to verify your own instance.
+- **[Use automatic feed generation](/web-application/how-to/use-automatic-feed-generation/)** when you want direct page-URL conversion.
+- **[Create custom feeds](/creating-custom-feeds/)** when you need stable, reviewable extraction rules.

src/content/docs/creating-custom-feeds.mdx

Lines changed: 6 additions & 20 deletions
@@ -7,22 +7,13 @@ sidebar:
 
 import { Aside, Code } from "@astrojs/starlight/components";
 
-When auto-sourcing isn't enough, you can write your own configuration files to create custom RSS feeds for any website. This guide shows you how to take full control with YAML configs.
+When existing feeds or auto-sourcing are not enough, write a YAML config for the site you want to follow.
 
 **Prerequisites:** You should be familiar with the [Getting Started](/getting-started) guide before diving into custom configurations.
 
-<Aside type="note" title="Release note">
-This guide tracks the current documentation tree and may describe features that have not yet shipped in the
-latest released `html2rss` gem. If you want the newest integrated behavior, prefer running
-[`html2rss-web`](/web-application/getting-started) via Docker. The web application ships as a rolling
-release and usually reflects the latest development state of the gem first. See [Versioning and
-releases](/web-application/reference/versioning-and-releases/) for details.
-</Aside>
-
 <Aside type="tip" title="Use this guide when you need more control">
-Start with included feeds first. If your site is not covered, try [automatic feed
-generation](/web-application/how-to/use-automatic-feed-generation/) next. Reach for a custom config when you
-need a stable, reviewable setup or the generated feed misses important content.
+Reach for a custom config when you need stable, reviewable extraction rules or generated output misses
+important content.
 </Aside>
 
 ---
@@ -37,18 +28,14 @@ When auto-sourcing isn't enough, you can write your own configuration files to c
 - **The website has complex structure** that requires custom selectors
 - **You want to combine data** from multiple sources
 
-**Don't need custom configs?** Check the [Feed Directory](/feed-directory/) first - there might already be a working feed for your website.
-
----
-
 ## Recommended Workflow
 
 1. **Inspect the live page** in your browser developer tools
 2. **Write the smallest useful config** that extracts items, titles, and links
 3. **Validate the config** with `html2rss validate your-config.yml`
 4. **Render the feed** with `html2rss feed your-config.yml`
 5. **Add it to `html2rss-web`** so you can use it through your normal instance
-6. **Escalate to `browserless`** if the content is rendered by JavaScript
+6. **Escalate request strategy when needed**: use a browser-based rendering strategy only when troubleshooting requires it
 
 This order keeps iteration fast and makes it easier to see whether the problem is the page structure, your
 selectors, or the fetch strategy.
@@ -210,7 +197,7 @@ there.
 - **No items found?** Check your selectors with browser tools (F12) - the `items.selector` might not match the page structure
 - **Invalid YAML?** Use spaces, not tabs, and ensure proper indentation
 - **Website not loading?** Check the URL and try accessing it in your browser
-- **Missing content?** Some websites load content with JavaScript - you may need to use the `browserless` strategy
+- **Missing content?** Try a browser-based rendering strategy during troubleshooting
 - **Wrong data extracted?** Verify your selectors are pointing to the right elements
 
 **Need more help?** See our [comprehensive troubleshooting guide](/troubleshooting/troubleshooting) or ask in [GitHub Discussions](https://github.com/orgs/html2rss/discussions).
@@ -225,7 +212,6 @@ there.
 
 **For Beginners:**
 
-- **[Browse the Feed Directory](/feed-directory/)** - See real-world examples
 - **[Run html2rss-web with Docker](/web-application/getting-started)** - Use the newest integrated behavior
 - **[Learn more about selectors](/ruby-gem/reference/selectors/)** - Master CSS selectors
 - **[Submit your config via GitHub Web](https://github.com/html2rss/html2rss-configs)** - No Git knowledge required!
@@ -234,5 +220,5 @@ there.
 
 - **[Browse existing configs](https://github.com/html2rss/html2rss-configs/tree/master/lib/html2rss/configs)** - See real examples
 - **[Join discussions](https://github.com/orgs/html2rss/discussions)** - Connect with other users
-- **[Learn about strategies](/ruby-gem/reference/strategy/)** - Decide when to use `browserless`
+- **[Learn about strategies](/ruby-gem/reference/strategy/)** - Decide when to use static vs JavaScript/browser-based extraction
 - **[Learn advanced features](/ruby-gem/how-to/advanced-features/)** - Take your configs to the next level

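Step 2 of the Recommended Workflow in the diff above ("write the smallest useful config that extracts items, titles, and links") might produce something like the sketch below. The key names are assumed from the html2rss config format and should be checked against the Selectors reference; the URL and CSS selectors are hypothetical placeholders, not taken from this commit.

```yaml
# Minimal sketch; key names assumed, URL and selectors are placeholders.
channel:
  url: https://example.com/blog
selectors:
  items:
    selector: "article" # one match per feed item
  title:
    selector: "h2"
  url:
    selector: "a"
    extractor: "href" # read the link target instead of the link text
```

Saved as `your-config.yml`, a file like this can then go through steps 3 and 4 of the workflow: `html2rss validate your-config.yml` followed by `html2rss feed your-config.yml`.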
src/content/docs/getting-started.mdx

Lines changed: 4 additions & 3 deletions
@@ -1,6 +1,6 @@
 ---
 title: "Getting Started"
-description: "Start html2rss-web locally, verify a working included feed from your self-hosted instance, and decide when to enable automatic generation or move to custom configs."
+description: "Start html2rss-web locally, verify one feed, and decide when to enable automatic generation or move to custom configs."
 sidebar:
   order: 1
 ---
@@ -17,13 +17,12 @@ That guide is the canonical setup flow for:
 
 - running `html2rss-web` locally
 - confirming the interface is working
-- opening a first included feed URL
+- opening a known feed URL
 - deciding when to use automatic generation or custom configs
 
 ## Quick Shortcuts
 
 - **[Run html2rss-web with Docker](/web-application/getting-started)**: recommended first step
-- **[Browse working feed examples](/feed-directory/)**: see what successful outputs look like
 - **[Use automatic feed generation](/web-application/how-to/use-automatic-feed-generation/)**: enable direct feed creation from a page URL when you want that workflow
 - **[Create Custom Feeds](/creating-custom-feeds)**: write configs when you need more control
 - **[Troubleshooting Guide](/troubleshooting/troubleshooting)**: fix startup or extraction problems
@@ -34,6 +33,8 @@ If you are working directly with the gem instead of `html2rss-web`, start with:
 
 <Code code={`html2rss auto https://example.com/blog`} lang="bash" />
 
+For strategy behavior and manual overrides, see the [Strategy reference](/ruby-gem/reference/strategy).
+
 If the target site is unusually redirect-heavy or needs extra follow-up requests, the CLI also supports:
 
 <Code code={`html2rss auto https://example.com/blog --max-redirects 10 --max-requests 5`} lang="bash" />

src/content/docs/index.mdx

Lines changed: 7 additions & 14 deletions
@@ -1,9 +1,9 @@
 ---
 title: "Turn Any Website Into an RSS Feed"
-description: "Run html2rss-web with Docker, verify a working included feed from your self-hosted instance, then consciously enable automatic generation or move to custom configs when you need more control."
+description: "Run html2rss-web with Docker, verify one feed, then enable automatic generation or move to custom configs when you need more control."
 ---
 
-Run `html2rss-web` with Docker, verify a working included feed from your self-hosted instance, and only then decide whether to enable automatic generation or move to custom configs.
+Run `html2rss-web` with Docker, verify one feed from your own instance, then decide whether you need automatic generation or custom configs.
 
 ## Start Here
 
@@ -13,14 +13,8 @@ That guide is the canonical onboarding flow for:
 
 - starting a local instance
 - verifying the web interface
-- opening a first included feed URL
-- deciding when to consciously enable automatic generation or move to custom configs
-
-## How It Works
-
-1. **Run your own local instance** with Docker
-2. **Open a built-in feed URL** from your own instance
-3. **Copy the feed URL into your reader**
+- opening a known feed URL
+- choosing the next path
 
 ## What is html2rss?
 
@@ -36,14 +30,13 @@ Most people should start with the web application:
 ### I want a working instance first
 
 1. **[Run html2rss-web with Docker](/web-application/getting-started)**: recommended starting path
-2. **[Use the included configs](/web-application/how-to/use-included-configs/)**: use real embedded feeds from your own instance
-3. **[Browse working feed examples](/feed-directory/)**: see what working outputs look like
+2. **[Use the included configs](/web-application/how-to/use-included-configs/)**: optional guide for the embedded feed set
 
 ### I need more control
 
 1. **[Creating Custom Feeds](/creating-custom-feeds)**: write and test your own configs
 2. **[Selectors Reference](/ruby-gem/reference/selectors/)**: learn the matching rules
-3. **[Strategy Reference](/ruby-gem/reference/strategy/)**: decide when `browserless` is justified
+3. **[Strategy Reference](/ruby-gem/reference/strategy/)**: choose the right extraction strategy for static vs JavaScript-heavy pages
 
 ### I'm building or integrating
 
@@ -62,7 +55,7 @@ Most people should start with the web application:
 ## Practical Notes
 
 - Start with Docker, not a public instance.
-- Use an included feed to verify the deployment first.
+- Verify the deployment with one known feed first.
 - Enable automatic generation only when you want the direct page-URL workflow and are ready to allow it on your self-hosted instance.
 - Move to custom configs when you need a stable, reviewable setup.
 

src/content/docs/ruby-gem/how-to/advanced-features.mdx

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ html2rss uses parallel processing in auto-source discovery. This happens automat
 1. **Use appropriate selectors:** More specific selectors reduce processing time
 2. **Limit items when possible:** Use CSS selectors that target only the content you need
 3. **Cache responses:** The web application caches responses automatically
-4. **Choose the right strategy:** Use `faraday` for static content, `browserless` only when JavaScript is required
+4. **Choose the right strategy:** Use static HTTP fetching for simple pages, and move to a JavaScript/browser-based extraction strategy when rendering or anti-bot handling is required
 
 ## Memory Optimization
 

src/content/docs/ruby-gem/how-to/custom-http-requests.mdx

Lines changed: 2 additions & 1 deletion
@@ -11,7 +11,7 @@ Keep this structure in mind:
 
 - `headers` stays top-level
 - `strategy` stays top-level
-- request-specific controls such as budgets and Browserless options live under `request`
+- request-specific controls such as budgets and strategy-specific options live under `request`
 
 ## When You Need Custom Headers
 
@@ -74,6 +74,7 @@ Request budgets are configured under `request`, not as top-level keys:
 - `request.max_redirects` limits redirect hops
 - `request.max_requests` limits the total request budget for the feed build
 - `request.browserless.*` is reserved for Browserless-only behavior such as preload actions
+- `request.botasaurus.*` is reserved for Botasaurus-only behavior such as navigation mode and retries
 
 ## Common Use Cases

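The layout described in the custom-http-requests diff above (top-level `headers` and `strategy`, budgets under `request`) can be sketched as follows. The `faraday` strategy name and the budget keys come from the diff; the header value, URL, selector, and the `channel`/`selectors` key names are illustrative assumptions. No `request.botasaurus.*` keys are shown because the diff does not name them.

```yaml
# Sketch of the key layout only; values are placeholders.
strategy: faraday # top-level, not under request
headers: # top-level, not under request
  User-Agent: "ExampleReader/1.0" # hypothetical value
request:
  max_redirects: 10 # limits redirect hops
  max_requests: 5 # total request budget for the feed build
channel:
  url: https://example.com/blog
selectors:
  items:
    selector: "article"
```

The budget values mirror the `--max-redirects 10 --max-requests 5` CLI example in the getting-started diff.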