| title | Strategy |
|---|---|
| description | Learn how html2rss chooses request strategies by default with auto fallback, and when to override with faraday, botasaurus, or browserless. |
import { Code } from "@astrojs/starlight/components";
The strategy key defines how html2rss fetches a website's content.
- `auto` (default): Tries concrete strategies in order: `faraday` -> `botasaurus` -> `browserless`.
- `faraday`: Makes a direct HTTP request. It is fast but does not execute JavaScript.
- `browserless`: Renders the website in a headless Chrome browser, which is necessary for JavaScript-heavy sites.
- `botasaurus`: Delegates fetching to a Botasaurus scrape API. This is opt-in and requires `BOTASAURUS_SCRAPER_URL`.

`strategy` is a top-level config key. Request-specific controls live under `request`.

`auto` falls back to the next strategy when the current attempt errors or extracts zero items. Use explicit `--strategy ...` only when you need to force a specific transport for troubleshooting or reproducibility.
The default strategy chain is:
`faraday -> botasaurus -> browserless`
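The fallback rule can be sketched in a few lines of Python. This only illustrates the behavior described above (move on when an attempt errors or extracts zero items); it is not html2rss's actual implementation, and `fetch_with_fallback` and the stub fetchers are hypothetical names.

```python
from typing import Callable, Iterable

# Hypothetical type: a fetcher takes a URL and returns the extracted items.
Fetcher = Callable[[str], list]


def fetch_with_fallback(url: str, chain: Iterable[tuple[str, Fetcher]]) -> tuple[str, list]:
    """Try each strategy in order; move on when one raises or yields no items."""
    errors = []
    for name, fetch in chain:
        try:
            items = fetch(url)
        except Exception as exc:  # the attempt errored: try the next strategy
            errors.append(f"{name}: {exc}")
            continue
        if items:  # a non-empty result stops the chain
            return name, items
        errors.append(f"{name}: extracted zero items")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))


# Stub fetchers standing in for the real transports (assumptions for the demo).
def faraday(url: str) -> list:
    return []  # e.g. the page needs JavaScript, so nothing is extracted


def botasaurus(url: str) -> list:
    raise ConnectionError("scraper endpoint not reachable")


def browserless(url: str) -> list:
    return ["Article A", "Article B"]


CHAIN = [("faraday", faraday), ("botasaurus", botasaurus), ("browserless", browserless)]
used, items = fetch_with_fallback("https://example.com/app", CHAIN)
print(used, items)
```

In this run the first attempt returns no items and the second raises, so the result comes from the last strategy in the chain.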
To use the browserless strategy, you need a running instance of Browserless.io.
You can run a local Browserless.io instance using Docker:
<Code
code={`docker run \\
  --rm \\
  -p 3000:3000 \\
  -e "CONCURRENT=10" \\
  -e "TOKEN=6R0W53R135510" \\
  ghcr.io/browserless/chromium`}
lang="sh"
/>
Set the strategy at the top level of your feed configuration and put request controls under request:
<Code
code={`strategy: browserless
request:
  max_redirects: 5
  max_requests: 6
channel:
  url: "https://example.com/app"
selectors:
  items:
    selector: ".article"
  title:
    selector: "h2"
  url:
    selector: "a"
    extractor: "href"`}
lang="yml"
/>
Use this split consistently:

- `strategy`: selects `auto`, `faraday`, `browserless`, or `botasaurus`
- `headers`: top-level headers shared by all strategies
- `request.max_redirects`: redirect limit for the request session
- `request.max_requests`: total request budget for the whole feed build
- `request.browserless.*`: Browserless-only options
- `request.botasaurus.*`: Botasaurus-only options

Example:
<Code
code={`strategy: browserless
headers:
  User-Agent: "Mozilla/5.0 (compatible; html2rss/1.0)"
request:
  max_redirects: 5
  max_requests: 6
  browserless:
    preload:
      wait_after_ms: 5000
channel:
  url: "https://example.com/app"
selectors:
  items:
    selector: ".article"
  title:
    selector: "h2"
  url:
    selector: "a"
    extractor: "href"`}
lang="yml"
/>
Browserless can interact with the page before html2rss captures the final HTML. Configure preload steps under `request.browserless.preload`.
<Code
code={`strategy: browserless
request:
  browserless:
    preload:
      wait_after_ms: 5000
      click_selectors:
        - selector: ".load-more"
          max_clicks: 3
          wait_after_ms: 250
      scroll_down:
        iterations: 5
        wait_after_ms: 200`}
lang="yml"
/>
- `wait_after_ms`: inserts a fixed wait before or after preload steps
- `click_selectors`: clicks matching elements until they disappear or `max_clicks` is reached
- `scroll_down`: scrolls until the page height stops growing or `iterations` is reached

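The two stop conditions can be illustrated with a small simulation. This is a sketch of the termination logic only, under the assumption that each loop ends on whichever condition is met first; `FakePage`, `click_until_gone`, and `scroll_until_stable` are hypothetical names, not part of html2rss or Browserless, and the `wait_after_ms` pauses are left out.

```python
class FakePage:
    """Hypothetical stand-in for a rendered page; not a Browserless API."""

    def __init__(self, load_more_clicks_left: int, heights: list[int]):
        self.load_more_clicks_left = load_more_clicks_left
        self._heights = heights  # height reported after 0, 1, 2, ... scrolls
        self._scrolls = 0

    def has(self, selector: str) -> bool:
        # The button stays on the page while there is more content to load.
        return self.load_more_clicks_left > 0

    def click(self, selector: str) -> None:
        self.load_more_clicks_left -= 1

    def height(self) -> int:
        return self._heights[min(self._scrolls, len(self._heights) - 1)]

    def scroll_to_bottom(self) -> None:
        self._scrolls += 1


def click_until_gone(page, selector: str, max_clicks: int) -> int:
    """Click while the element exists, at most max_clicks times."""
    clicks = 0
    while clicks < max_clicks and page.has(selector):
        page.click(selector)
        clicks += 1
    return clicks


def scroll_until_stable(page, iterations: int) -> int:
    """Scroll until the height stops growing, at most `iterations` times."""
    scrolls = 0
    last = page.height()
    while scrolls < iterations:
        page.scroll_to_bottom()
        scrolls += 1
        current = page.height()
        if current <= last:  # no growth: the page has nothing more to load
            break
        last = current
    return scrolls


page = FakePage(load_more_clicks_left=5, heights=[1000, 1800, 1800])
clicks = click_until_gone(page, ".load-more", max_clicks=3)
scrolls = scroll_until_stable(page, iterations=5)
print(clicks, scrolls)  # prints "3 2"
```

Here the click loop stops at the `max_clicks` cap while the button is still present, and the scroll loop stops early because the second scroll adds no height.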
If preload triggers a real navigation or redirect, html2rss keeps the final document metadata. Relative links and follow-up pagination therefore resolve against the page that was actually rendered after preload completed.
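To see why the final document matters, the short Python sketch below resolves the same relative link against two base URLs with the standard library's `urljoin`. The redirect target `https://news.example.com/archive/2024/` is an assumption made up for the illustration.

```python
from urllib.parse import urljoin

requested_url = "https://example.com/app"
# Assumed for illustration: preload navigated the browser to this page.
final_url = "https://news.example.com/archive/2024/"

href = "page/2"  # a relative pagination link found in the rendered HTML

resolved_against_request = urljoin(requested_url, href)
resolved_against_final = urljoin(final_url, href)

print(resolved_against_request)  # https://example.com/page/2
print(resolved_against_final)    # https://news.example.com/archive/2024/page/2
```

Only the second result points at a page that exists on the site that was actually rendered, which is why the final document's URL is the one used as the base.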
You can also specify the strategy on the command line:
<Code code={`
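# Export so each html2rss call below can reach Browserless
export BROWSERLESS_IO_WEBSOCKET_URL="ws://127.0.0.1:3000"
export BROWSERLESS_IO_API_TOKEN="6R0W53R135510"
# Force the Browserless strategy
html2rss feed my_config.yml --strategy browserless
# Override the request budgets
html2rss feed my_config.yml --max-redirects 5 --max-requests 6
# Use the configured strategy (auto when unset)
html2rss feed my_config.yml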
`}
lang="sh"
/>
If Browserless cannot connect, html2rss surfaces a `Browserless connection failed (...)` error with endpoint/token hints.
Check these first:
- `BROWSERLESS_IO_WEBSOCKET_URL` is reachable from where html2rss runs
- `BROWSERLESS_IO_API_TOKEN` matches your Browserless `TOKEN`
- your Browserless service is running and accepting connections

For custom Browserless websocket endpoints, `BROWSERLESS_IO_API_TOKEN` is mandatory. The local default endpoint (`ws://127.0.0.1:3000`) can use the default local token `6R0W53R135510`.
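A preflight check along these lines can rule out the most common causes before running html2rss. This is a minimal Python sketch that mirrors the token rule and the reachability check described above; it is not how html2rss itself resolves these values, and `resolve_browserless` and `is_reachable` are hypothetical helper names.

```python
import socket
from urllib.parse import urlparse

DEFAULT_WS_URL = "ws://127.0.0.1:3000"
DEFAULT_LOCAL_TOKEN = "6R0W53R135510"


def resolve_browserless(env: dict) -> tuple[str, str]:
    """Return (websocket_url, token); raise if a custom endpoint has no token."""
    ws_url = env.get("BROWSERLESS_IO_WEBSOCKET_URL", DEFAULT_WS_URL)
    token = env.get("BROWSERLESS_IO_API_TOKEN")
    if token:
        return ws_url, token
    if ws_url == DEFAULT_WS_URL:
        # Only the local default endpoint may fall back to the default token.
        return ws_url, DEFAULT_LOCAL_TOKEN
    raise ValueError("BROWSERLESS_IO_API_TOKEN is required for custom endpoints")


def is_reachable(ws_url: str, timeout: float = 2.0) -> bool:
    """Check that something accepts TCP connections at the endpoint's host and port."""
    parsed = urlparse(ws_url)
    port = parsed.port or (443 if parsed.scheme == "wss" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


# With no variables set, the local defaults apply.
ws_url, token = resolve_browserless({})
print(ws_url, token)
print("reachable:", is_reachable(ws_url))
```

Passing `dict(os.environ)` instead of `{}` checks the values html2rss would see in the current shell. A successful TCP connection does not prove the token is accepted, so a token mismatch still has to be checked against the Browserless `TOKEN` setting.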
`botasaurus` delegates page fetching to a Botasaurus scrape API endpoint. This strategy is explicit opt-in and requires:

- `strategy: botasaurus`
- `BOTASAURUS_SCRAPER_URL` set to your Botasaurus scrape API base URL (for example `http://localhost:4010`)

<Code
code={`strategy: botasaurus
request:
  max_redirects: 5
  max_requests: 6
  botasaurus:
    navigation_mode: auto
    max_retries: 2
    headless: false
channel:
  url: "https://example.com/protected-listing"
auto_source: {}`}
lang="yml"
/>
Supported `request.botasaurus` options:

- `navigation_mode` (`auto`, `get`, `google_get`, `google_get_bypass`)
- `max_retries` (`0..3`)
- `wait_for_selector`
- `wait_timeout_seconds`
- `block_images`
- `block_images_and_css`
- `wait_for_complete_page_load`
- `headless`
- `proxy`
- `user_agent`
- `window_size` (two integers, for example `[1920, 1080]`)
- `lang`

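The constraints in this list are easy to check before a feed build. The Python sketch below validates only what the list states (the allowed `navigation_mode` values, the `max_retries` range, and the shape of `window_size`) and flags option names that are not listed. `validate_botasaurus_options` is a hypothetical helper, and html2rss may validate differently or more strictly.

```python
NAVIGATION_MODES = {"auto", "get", "google_get", "google_get_bypass"}
KNOWN_OPTIONS = {
    "navigation_mode", "max_retries", "wait_for_selector", "wait_timeout_seconds",
    "block_images", "block_images_and_css", "wait_for_complete_page_load",
    "headless", "proxy", "user_agent", "window_size", "lang",
}


def _is_int(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def validate_botasaurus_options(options: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for key in sorted(set(options) - KNOWN_OPTIONS):
        problems.append(f"unknown option: {key}")

    mode = options.get("navigation_mode")
    if mode is not None and mode not in NAVIGATION_MODES:
        problems.append(f"navigation_mode must be one of {sorted(NAVIGATION_MODES)}")

    retries = options.get("max_retries")
    if retries is not None and not (_is_int(retries) and 0 <= retries <= 3):
        problems.append("max_retries must be an integer from 0 to 3")

    size = options.get("window_size")
    if size is not None and not (
        isinstance(size, (list, tuple)) and len(size) == 2 and all(_is_int(n) for n in size)
    ):
        problems.append("window_size must be two integers, for example [1920, 1080]")
    return problems


print(validate_botasaurus_options({"navigation_mode": "auto", "max_retries": 2, "headless": False}))
print(validate_botasaurus_options({"max_retries": 5, "window_size": [1920]}))
```

The first call reports no problems; the second reports the out-of-range retry count and the incomplete window size.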
<Code
code={`# Export once so both commands see the scrape API endpoint
export BOTASAURUS_SCRAPER_URL="http://localhost:4010"
html2rss auto https://example.com/updates --strategy botasaurus
html2rss feed my_config.yml --strategy botasaurus`}
lang="sh"
/>
For detailed documentation on the Ruby API, see the official YARD documentation.