From 7dedfbbb3abbcc291a431f59778401626d5ed7cf Mon Sep 17 00:00:00 2001 From: Gil Desmarais Date: Fri, 17 Apr 2026 17:59:08 +0200 Subject: [PATCH 1/2] docs: align strategy docs with botasaurus-first auto fallback --- src/content/docs/creating-custom-feeds.mdx | 6 +- src/content/docs/index.mdx | 2 +- .../ruby-gem/how-to/advanced-features.mdx | 2 +- .../ruby-gem/how-to/custom-http-requests.mdx | 3 +- .../how-to/handling-dynamic-content.mdx | 12 ++-- .../docs/ruby-gem/reference/cli-reference.mdx | 24 +++++++ .../docs/ruby-gem/reference/strategy.mdx | 62 ++++++++++++++++++- .../docs/troubleshooting/troubleshooting.mdx | 10 ++- 8 files changed, 103 insertions(+), 18 deletions(-) diff --git a/src/content/docs/creating-custom-feeds.mdx b/src/content/docs/creating-custom-feeds.mdx index 24a38fed..4d456f58 100644 --- a/src/content/docs/creating-custom-feeds.mdx +++ b/src/content/docs/creating-custom-feeds.mdx @@ -48,7 +48,7 @@ When auto-sourcing isn't enough, you can write your own configuration files to c 3. **Validate the config** with `html2rss validate your-config.yml` 4. **Render the feed** with `html2rss feed your-config.yml` 5. **Add it to `html2rss-web`** so you can use it through your normal instance -6. **Escalate to `browserless`** if the content is rendered by JavaScript +6. **Escalate strategy when needed**: if static fetching is insufficient, switch to a JavaScript/browser-based extraction strategy This order keeps iteration fast and makes it easier to see whether the problem is the page structure, your selectors, or the fetch strategy. @@ -210,7 +210,7 @@ there. 
- **No items found?** Check your selectors with browser tools (F12) - the `items.selector` might not match the page structure - **Invalid YAML?** Use spaces, not tabs, and ensure proper indentation - **Website not loading?** Check the URL and try accessing it in your browser -- **Missing content?** Some websites load content with JavaScript - you may need to use the `browserless` strategy +- **Missing content?** Some websites load content with JavaScript - you may need a JavaScript/browser-based extraction strategy instead of plain HTTP fetching - **Wrong data extracted?** Verify your selectors are pointing to the right elements **Need more help?** See our [comprehensive troubleshooting guide](/troubleshooting/troubleshooting) or ask in [GitHub Discussions](https://github.com/orgs/html2rss/discussions). @@ -234,5 +234,5 @@ there. - **[Browse existing configs](https://github.com/html2rss/html2rss-configs/tree/master/lib/html2rss/configs)** - See real examples - **[Join discussions](https://github.com/orgs/html2rss/discussions)** - Connect with other users -- **[Learn about strategies](/ruby-gem/reference/strategy/)** - Decide when to use `browserless` +- **[Learn about strategies](/ruby-gem/reference/strategy/)** - Decide when to use static vs JavaScript/browser-based extraction - **[Learn advanced features](/ruby-gem/how-to/advanced-features/)** - Take your configs to the next level diff --git a/src/content/docs/index.mdx b/src/content/docs/index.mdx index 8374aca4..bf9e7eca 100644 --- a/src/content/docs/index.mdx +++ b/src/content/docs/index.mdx @@ -43,7 +43,7 @@ Most people should start with the web application: 1. **[Creating Custom Feeds](/creating-custom-feeds)**: write and test your own configs 2. **[Selectors Reference](/ruby-gem/reference/selectors/)**: learn the matching rules -3. **[Strategy Reference](/ruby-gem/reference/strategy/)**: decide when `browserless` is justified +3. 
**[Strategy Reference](/ruby-gem/reference/strategy/)**: choose the right extraction strategy for static vs JavaScript-heavy pages ### I'm building or integrating diff --git a/src/content/docs/ruby-gem/how-to/advanced-features.mdx b/src/content/docs/ruby-gem/how-to/advanced-features.mdx index ab429c29..f89cc46c 100644 --- a/src/content/docs/ruby-gem/how-to/advanced-features.mdx +++ b/src/content/docs/ruby-gem/how-to/advanced-features.mdx @@ -16,7 +16,7 @@ html2rss uses parallel processing in auto-source discovery. This happens automat 1. **Use appropriate selectors:** More specific selectors reduce processing time 2. **Limit items when possible:** Use CSS selectors that target only the content you need 3. **Cache responses:** The web application caches responses automatically -4. **Choose the right strategy:** Use `faraday` for static content, `browserless` only when JavaScript is required +4. **Choose the right strategy:** Use static HTTP fetching for simple pages, and move to a JavaScript/browser-based extraction strategy when rendering or anti-bot handling is required ## Memory Optimization diff --git a/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx b/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx index e361f410..365ef6ca 100644 --- a/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx +++ b/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx @@ -11,7 +11,7 @@ Keep this structure in mind: - `headers` stays top-level - `strategy` stays top-level -- request-specific controls such as budgets and Browserless options live under `request` +- request-specific controls such as budgets and strategy-specific options live under `request` ## When You Need Custom Headers @@ -74,6 +74,7 @@ Request budgets are configured under `request`, not as top-level keys: - `request.max_redirects` limits redirect hops - `request.max_requests` limits the total request budget for the feed build - `request.browserless.*` is reserved for Browserless-only 
behavior such as preload actions +- `request.botasaurus.*` is reserved for Botasaurus-only behavior such as navigation mode and retries ## Common Use Cases diff --git a/src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx b/src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx index f7b739ad..fc998ce9 100644 --- a/src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx +++ b/src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx @@ -1,6 +1,6 @@ --- title: Handling Dynamic Content -description: "Learn how to handle JavaScript-heavy websites and dynamic content with html2rss. Use browserless strategy for sites that load content dynamically." +description: "Learn how to handle JavaScript-heavy websites and dynamic content with html2rss using browser-based extraction strategies." --- import { Code } from "@astrojs/starlight/components"; @@ -9,7 +9,7 @@ Some websites load their content dynamically using JavaScript. The default `html ## Solution -Use the [`browserless` strategy](/ruby-gem/reference/strategy) to render JavaScript-heavy websites with a headless browser. +Use a [browser-based extraction strategy](/ruby-gem/reference/strategy) when JavaScript-heavy pages do not work with default static fetching. 
Keep the strategy at the top level and put request-specific options under `request`: @@ -36,9 +36,9 @@ Keep the strategy at the top level and put request-specific options under `reque lang="yaml" /> -## When to Use Browserless +## When to Use Browser-Based Extraction -The `browserless` strategy is necessary when: +A browser-based extraction strategy is necessary when: - **Content loads after page load** - JavaScript fetches data from APIs - **Single Page Applications (SPAs)** - React, Vue, Angular apps @@ -100,13 +100,13 @@ These preload steps can be combined in a single config when a site needs several ## Performance Considerations -The `browserless` strategy is slower than the default `faraday` strategy because it: +Browser-based extraction is slower than the default static HTTP fetching because it: - Launches a headless Chrome browser - Renders the full page with JavaScript - Takes more memory and CPU resources -**Use `faraday` for static content** and only switch to `browserless` when necessary. +**Use static HTTP fetching for static pages** and switch to browser-based extraction when needed. See the [Strategy Reference](/ruby-gem/reference/strategy) for concrete transports, defaults, and environment requirements. 
## Related Topics diff --git a/src/content/docs/ruby-gem/reference/cli-reference.mdx b/src/content/docs/ruby-gem/reference/cli-reference.mdx index 8e4ba22f..9e94d3d1 100644 --- a/src/content/docs/ruby-gem/reference/cli-reference.mdx +++ b/src/content/docs/ruby-gem/reference/cli-reference.mdx @@ -23,6 +23,7 @@ Automatically discovers items from a page and prints the generated RSS feed to s code={` html2rss auto https://example.com/articles ; \ html2rss auto https://example.com/app --strategy browserless ; \ + BOTASAURUS_SCRAPER_URL="http://localhost:4010" html2rss auto https://example.com/protected --strategy botasaurus ; \ html2rss auto https://example.com/app --strategy browserless --max-redirects 5 --max-requests 6 ; \ html2rss auto https://example.com/articles --items_selector ".post-card" `} @@ -31,6 +32,8 @@ Automatically discovers items from a page and prints the generated RSS feed to s Command: `html2rss auto URL` +Default behavior uses `--strategy auto`, which tries `faraday` then `botasaurus` then `browserless`. + #### URL Surface Guidance For `auto` `auto` works best when the input URL already exposes a server-rendered list of entries. @@ -63,6 +66,8 @@ When no extractable items are found, `auto` now classifies likely causes instead Known anti-bot interstitial responses (for example Cloudflare challenge pages) are surfaced explicitly as blocked-surface errors. +If you run with the default `--strategy auto`, no manual strategy override is required for fallback ordering. + #### Browserless Setup And Diagnostics (CLI) `browserless` is opt-in for CLI usage. @@ -97,6 +102,24 @@ If you see `Browserless connection failed`, check: For custom Browserless endpoints, `BROWSERLESS_IO_API_TOKEN` is required. 
+#### Botasaurus Environment Requirement (CLI) + +`botasaurus` is opt-in for CLI usage and requires `BOTASAURUS_SCRAPER_URL`: + + + +If you see a Botasaurus configuration error, check: + +- `BOTASAURUS_SCRAPER_URL` is set +- `BOTASAURUS_SCRAPER_URL` is a valid URL +- the Botasaurus scrape API is reachable from the shell environment running `html2rss` + ### Feed Loads a YAML config, builds the feed, and prints the RSS XML to stdout. @@ -106,6 +129,7 @@ Loads a YAML config, builds the feed, and prints the RSS XML to stdout. html2rss feed single.yml ; \ html2rss feed feeds.yml my-first-feed ; \ html2rss feed single.yml --strategy browserless ; \ + BOTASAURUS_SCRAPER_URL="http://localhost:4010" html2rss feed single.yml --strategy botasaurus ; \ html2rss feed single.yml --max-redirects 5 --max-requests 6 ; \ html2rss feed single.yml --params id:42 foo:bar `} diff --git a/src/content/docs/ruby-gem/reference/strategy.mdx b/src/content/docs/ruby-gem/reference/strategy.mdx index f35f383b..241b03a4 100644 --- a/src/content/docs/ruby-gem/reference/strategy.mdx +++ b/src/content/docs/ruby-gem/reference/strategy.mdx @@ -1,6 +1,6 @@ --- title: Strategy -description: "Learn about different strategies for fetching website content with html2rss. Choose between faraday and browserless strategies for optimal performance." +description: "Learn about different strategies for fetching website content with html2rss. Choose between faraday, browserless, and botasaurus strategies for optimal performance." --- import { Code } from "@astrojs/starlight/components"; @@ -9,10 +9,13 @@ The `strategy` key defines how `html2rss` fetches a website's content. - **`faraday`** (default): Makes a direct HTTP request. It is fast but does not execute JavaScript. - **`browserless`**: Renders the website in a headless Chrome browser, which is necessary for JavaScript-heavy sites. +- **`botasaurus`**: Delegates fetching to a Botasaurus scrape API. This is opt-in and requires `BOTASAURUS_SCRAPER_URL`. 
`strategy` is a top-level config key. Request-specific controls live under `request`. -Use `faraday` first for direct newsroom/listing/changelog pages. Prefer `browserless` when the target is client-rendered, protected by anti-bot checks, or otherwise requires JavaScript to expose article links. +If you use CLI `--strategy auto` (default), html2rss tries `faraday` then `botasaurus` then `browserless`. + +Use `faraday` first for direct newsroom/listing/changelog pages. Prefer `botasaurus` as the first explicit browser-based strategy when you have a Botasaurus scrape API. Use `browserless` when you specifically need Browserless preload actions. ## `browserless` @@ -62,11 +65,12 @@ Set the `strategy` at the top level of your feed configuration and put request c Use this split consistently: -- `strategy`: selects `faraday` or `browserless` +- `strategy`: selects `faraday`, `browserless`, or `botasaurus` - `headers`: top-level headers shared by all strategies - `request.max_redirects`: redirect limit for the request session - `request.max_requests`: total request budget for the whole feed build - `request.browserless.*`: Browserless-only options +- `request.botasaurus.*`: Botasaurus-only options Example: @@ -153,6 +157,58 @@ Check these first: For custom Browserless websocket endpoints, `BROWSERLESS_IO_API_TOKEN` is mandatory. The local default endpoint (`ws://127.0.0.1:3000`) can use the default local token `6R0W53R135510`. +## `botasaurus` + +`botasaurus` delegates page fetching to a Botasaurus scrape API endpoint. 
This strategy is explicitly opt-in and requires: + +- `strategy: botasaurus` +- `BOTASAURUS_SCRAPER_URL` set to your Botasaurus scrape API base URL (for example `http://localhost:4010`) + +### Configuration + + + +Supported `request.botasaurus` options: + +- `navigation_mode` (`auto`, `get`, `google_get`, `google_get_bypass`) +- `max_retries` (`0..3`) +- `wait_for_selector` +- `wait_timeout_seconds` +- `block_images` +- `block_images_and_css` +- `wait_for_complete_page_load` +- `headless` +- `proxy` +- `user_agent` +- `window_size` (two integers, for example `[1920, 1080]`) +- `lang` + +### Command-Line Usage + + + --- For detailed documentation on the Ruby API, see the [official YARD documentation](https://www.rubydoc.info/gems/html2rss). diff --git a/src/content/docs/troubleshooting/troubleshooting.mdx b/src/content/docs/troubleshooting/troubleshooting.mdx index a02a7290..3beb0dbc 100644 --- a/src/content/docs/troubleshooting/troubleshooting.mdx +++ b/src/content/docs/troubleshooting/troubleshooting.mdx @@ -32,6 +32,8 @@ The `auto` flow is URL-surface sensitive. If extraction quality is poor, switch to a more specific listing/update URL before tuning selectors. +If you use CLI defaults, `--strategy auto` is already active and attempts `faraday` then `botasaurus` then `browserless`. + ### Empty Feeds If your feed is empty, check the following: @@ -39,7 +41,7 @@ If your feed is empty, check the following: - **URL:** Ensure the `url` in your configuration is correct and accessible. - **`items.selector`:** Verify that the `items.selector` matches the elements on the page. - **Website Changes:** Websites change their HTML structure frequently. Your selectors may be outdated. -- **JavaScript Content:** If the content is loaded via JavaScript, use the `browserless` strategy instead of `faraday`. 
+- **JavaScript Content:** If the content is loaded via JavaScript, move from `faraday` to a rendering strategy such as `browserless` (or `botasaurus` when you use a Botasaurus scrape API). - **Authentication:** Some sites require authentication — check if you need to add headers or use a different strategy. ### `No scrapers found` Failure Taxonomy (`auto`) @@ -91,7 +93,9 @@ For custom websocket endpoints, `BROWSERLESS_IO_API_TOKEN` is required. Common configuration-related errors: - **`UnsupportedResponseContentType`:** The website returned content that html2rss can't parse (not HTML or JSON). -- **`UnsupportedStrategy`:** The specified strategy is not available. Use `faraday` or `browserless`. +- **`UnsupportedStrategy`:** The specified strategy is not available. Use `faraday`, `browserless`, or `botasaurus`. +- **`BOTASAURUS_SCRAPER_URL is required for strategy=botasaurus.`:** Set `BOTASAURUS_SCRAPER_URL` to your Botasaurus scrape API base URL when using `--strategy botasaurus`. +- **`BOTASAURUS_SCRAPER_URL is invalid`:** Fix the URL format and retry. - **`Configuration must include at least 'selectors' or 'auto_source'`:** You need to specify either manual selectors or enable auto-source. - **`stylesheet.type invalid`:** Only `text/css` and `text/xsl` are supported for stylesheets. @@ -101,7 +105,7 @@ If parts of your items (e.g., title, link) are missing, check the following: - **Selector:** Ensure the selector for the missing part is correct and relative to the `items.selector`. - **Extractor:** Verify that you are using the correct `extractor` (e.g., `text`, `href`, `attribute`). -- **Dynamic Content:** `faraday` does not render JavaScript. If content loads dynamically, run with `--strategy browserless` (with the Browserless service available) so the page can be rendered before extraction. +- **Dynamic Content:** `faraday` does not render JavaScript. 
If content loads dynamically, run with `--strategy browserless` (with Browserless available) or `--strategy botasaurus` (with `BOTASAURUS_SCRAPER_URL` configured) so the page can be rendered before extraction. ### Date/Time Parsing Errors From ce3b3f768ec59de3ca5d075e2612c72a4fa2651f Mon Sep 17 00:00:00 2001 From: Gil Desmarais Date: Mon, 20 Apr 2026 18:48:04 +0200 Subject: [PATCH 2/2] docs: tighten strategy UX wording for end users --- src/content/docs/creating-custom-feeds.mdx | 12 ++-------- src/content/docs/getting-started.mdx | 2 ++ .../how-to/handling-dynamic-content.mdx | 4 +++- .../docs/ruby-gem/reference/cli-reference.mdx | 22 +++++++++++-------- .../docs/ruby-gem/reference/strategy.mdx | 15 ++++++++----- .../docs/troubleshooting/troubleshooting.mdx | 12 +++++----- .../how-to/use-automatic-feed-generation.mdx | 20 ++++++++--------- 7 files changed, 45 insertions(+), 42 deletions(-) diff --git a/src/content/docs/creating-custom-feeds.mdx b/src/content/docs/creating-custom-feeds.mdx index 4d456f58..028edbaa 100644 --- a/src/content/docs/creating-custom-feeds.mdx +++ b/src/content/docs/creating-custom-feeds.mdx @@ -11,14 +11,6 @@ When auto-sourcing isn't enough, you can write your own configuration files to c **Prerequisites:** You should be familiar with the [Getting Started](/getting-started) guide before diving into custom configurations. - -