docs: clarify Ruby request budgets and auto_source (#1100)

gildesmarais · web-flow · commit bd9e5e82bf47 · 2026-03-21T13:31:34.000+01:00
diff --git a/src/content/docs/creating-custom-feeds.mdx b/src/content/docs/creating-custom-feeds.mdx
@@ -6,6 +6,7 @@ sidebar:
 ---
 
 import { Aside } from "@astrojs/starlight/components";
+import Code from "astro/components/Code.astro";
 
 When auto-sourcing isn't enough, you can write your own configuration files to create custom RSS feeds for any website. This guide shows you how to take full control with YAML configs.
 
@@ -160,6 +161,22 @@ html2rss supports many configuration options:
 
 4. **Check the output:** Make sure all items have titles, links, and descriptions
 
+### Useful CLI flags when a site is difficult
+
+Some sites need a little more request budget than the defaults.
+
+- Use `--max-redirects` when the site bounces through several canonicalization or tracking redirects before the real page loads.
+- Use `--max-requests` when your config needs more than one request, for example pagination or other follow-up fetches.
+
+<Code
+  code={`html2rss feed your-config.yml --max-redirects 10
+html2rss feed your-config.yml --max-requests 5
+html2rss auto https://example.com/blog --max-redirects 10 --max-requests 5`}
+  lang="bash"
+/>
+
+Keep these values tight. Raise them only when the site proves it needs more.
+
 ## Add It To html2rss-web
 
 Once the config works locally, add it to your `feeds.yml` or shared config repository and restart your
diff --git a/src/content/docs/getting-started.mdx b/src/content/docs/getting-started.mdx
@@ -5,6 +5,8 @@ sidebar:
   order: 1
 ---
 
+import Code from "astro/components/Code.astro";
+
 This page points to the main onboarding flow.
 
 ## Start Here
@@ -23,3 +25,15 @@ That guide is the canonical setup flow for:
 - **[Browse working feed examples](/feed-directory/)** - See what success looks like
 - **[Create Custom Feeds](/creating-custom-feeds)** - Write configs when you need more control
 - **[Troubleshooting Guide](/troubleshooting/troubleshooting)** - Fix startup or extraction problems
+
+## Using the Ruby CLI
+
+If you are working directly with the gem instead of `html2rss-web`, start with:
+
+<Code code={`html2rss auto https://example.com/blog`} lang="bash" />
+
+If the target site is unusually redirect-heavy or needs extra follow-up requests, the CLI also supports:
+
+<Code code={`html2rss auto https://example.com/blog --max-redirects 10 --max-requests 5`} lang="bash" />
+
+For config-driven runs, the same flags are available on `html2rss feed`.
diff --git a/src/content/docs/ruby-gem/how-to/advanced-features.mdx b/src/content/docs/ruby-gem/how-to/advanced-features.mdx
@@ -7,13 +7,7 @@ This guide covers advanced features and performance optimizations for html2rss.
 
 ## Parallel Processing
 
-html2rss uses parallel processing to improve performance when scraping multiple items. This happens automatically and doesn't require any configuration.
-
-### How It Works
-
-- **Auto-source scraping:** Multiple scrapers run in parallel to analyze the page
-- **Item processing:** Each scraped item is processed in parallel
-- **Performance benefit:** Significantly faster when dealing with many items
+html2rss uses parallel processing in auto-source discovery. This happens automatically and doesn't require any configuration.
 
 ### Performance Tips
 
@@ -88,7 +82,7 @@ LOG_LEVEL=debug html2rss feed config.yml
 Use the health check endpoint to monitor feed generation:
 
 ```bash
-curl -u username:password http://localhost:3000/health_check.txt
+curl -u username:password http://localhost:4000/health_check.txt
 ```
 
 ## Article Validation
diff --git a/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx b/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx
@@ -3,7 +3,15 @@ title: "Custom HTTP Requests"
 description: "Learn how to customize HTTP requests with custom headers, authentication, and API interactions for html2rss."
 ---
 
-Some websites require custom HTTP headers, authentication, or other request settings to access their content. `html2rss` lets you customize requests for those cases.
+import Code from "astro/components/Code.astro";
+
+Some sites only work when requests carry the headers, tokens, or cookies your browser uses. `html2rss` supports those cases without changing the rest of your feed workflow.
+
+Keep this structure in mind:
+
+- `headers` stays top-level
+- `strategy` stays top-level
+- request-specific controls such as budgets and Browserless options live under `request`
 
 ## When You Need Custom Headers
 
@@ -19,8 +27,8 @@ You might need custom HTTP requests when:
 
 Add a `headers` section to your feed configuration. This example is a complete, valid config:
 
-```yaml
-headers:
+<Code
+  code={`headers:
   User-Agent: "Mozilla/5.0 (compatible; html2rss/1.0)"
   Authorization: "Bearer YOUR_API_TOKEN"
   Accept: "application/json"
@@ -32,8 +40,36 @@ selectors:
   title:
     selector: "title"
   url:
-    selector: "url"
-```
+    selector: "url"`}
+  lang="yaml"
+/>
+
+## Request Controls
+
+Request budgets are configured under `request`, not as top-level keys:
+
+<Code
+  code={`headers:
+  User-Agent: "Mozilla/5.0 (compatible; html2rss/1.0)"
+request:
+  max_redirects: 5
+  max_requests: 6
+channel:
+  url: https://example.com/articles
+selectors:
+  items:
+    selector: article
+  title:
+    selector: h2
+  url:
+    selector: a
+    extractor: href`}
+  lang="yaml"
+/>
+
+- `request.max_redirects` limits redirect hops
+- `request.max_requests` limits the total request budget for the feed build
+- `request.browserless.*` is reserved for Browserless-only behavior such as preload actions
 
 ## Common Use Cases
 
diff --git a/src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx b/src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx
@@ -3,12 +3,38 @@ title: Handling Dynamic Content
 description: "Learn how to handle JavaScript-heavy websites and dynamic content with html2rss. Use browserless strategy for sites that load content dynamically."
 ---
 
+import Code from "astro/components/Code.astro";
+
 Some websites load their content dynamically using JavaScript. The default `html2rss` strategy might not see this content.
 
 ## Solution
 
 Use the [`browserless` strategy](/ruby-gem/reference/strategy) to render JavaScript-heavy websites with a headless browser.
 
+Keep the strategy at the top level and put request-specific options under `request`:
+
+<Code
+  code={`strategy: browserless
+request:
+  max_redirects: 5
+  max_requests: 6
+  browserless:
+    preload:
+      wait_for_network_idle:
+        timeout_ms: 5000
+channel:
+  url: https://example.com/app
+selectors:
+  items:
+    selector: .article
+  title:
+    selector: h2
+  url:
+    selector: a
+    extractor: href`}
+  lang="yaml"
+/>
+
 ## When to Use Browserless
 
 The `browserless` strategy is necessary when:
@@ -18,6 +44,53 @@ The `browserless` strategy is necessary when:
 - **Infinite scroll** - Content loads as you scroll
 - **Dynamic forms** - Content changes based on user interaction
 
+## Preload Actions
+
+For dynamic sites, rendering once is often not enough. Use `request.browserless.preload` to wait, click, or scroll before the
+HTML snapshot is taken.
+
+### Wait for JavaScript Requests
+
+```yaml
+strategy: browserless
+request:
+  browserless:
+    preload:
+      wait_for_network_idle:
+        timeout_ms: 4000
+```
+
+### Click "Load More" Buttons
+
+```yaml
+strategy: browserless
+request:
+  browserless:
+    preload:
+      click_selectors:
+        - selector: ".load-more"
+          max_clicks: 3
+          delay_ms: 250
+          wait_for_network_idle:
+            timeout_ms: 3000
+```
+
+### Scroll Infinite Lists
+
+```yaml
+strategy: browserless
+request:
+  browserless:
+    preload:
+      scroll_down:
+        iterations: 5
+        delay_ms: 200
+        wait_for_network_idle:
+          timeout_ms: 2500
+```
+
+These preload steps can be combined in a single config when a site needs several interactions before all items appear.
+
 ## Performance Considerations
 
 The `browserless` strategy is slower than the default `faraday` strategy because it:
diff --git a/src/content/docs/ruby-gem/reference/auto-source.mdx b/src/content/docs/ruby-gem/reference/auto-source.mdx
@@ -17,16 +17,19 @@ auto_source: {}
 
 `auto_source` uses the following strategies to find content:
 
-1.  **`schema`:** Parses `<script type="json/ld">` tags containing structured data (e.g., [Schema.org](https://schema.org/)).
-2.  **`semantic_html`:** Searches for semantic HTML5 tags like `<article>`, `<main>`, and `<section>`.
-3.  **`html`:** Analyzes the HTML structure to find frequently occurring selectors that are likely to contain the main content.
-4.  **json_state:** Single-page applications often stash pre-rendered article data in `<script type="application/json">` tags or global variables
+1.  **`wordpress_api`:** Detects the `<link rel="https://api.w.org/">` tag used by WordPress and pulls posts from the REST API without parsing article HTML. See [WordPress API](/ruby-gem/reference/wordpress-api/).
+2.  **`schema`:** Parses `<script type="application/ld+json">` tags containing structured data (e.g., [Schema.org](https://schema.org/)).
+3.  **`semantic_html`:** Searches for semantic HTML5 tags like `<article>`, `<main>`, and `<section>`.
+4.  **`html`:** Analyzes the HTML structure to find frequently occurring selectors that are likely to contain the main content.
+5.  **`json_state`:** Single-page applications often stash pre-rendered article data in `<script type="application/json">` tags or global variables
     such as `window.__NEXT_DATA__`, `window.__NUXT__`, or `window.STATE`. The JSON-state scraper walks those blobs, finds arrays with
     `title`/`url` pairs, and converts them into the same hashes produced by `HtmlExtractor`.
 
 **`json_state` Limitations:** the scraper requires discoverable arrays of hashes containing clear `title` and `url` fields. Minified or
 obfuscated state objects, heavily encoded values, or blobs that require executing embedded functions are ignored.
 
+**`wordpress_api` Limitations:** this scraper depends on the page exposing a public WordPress REST API root. The current implementation fetches post records directly, but it does not yet resolve category names or featured media metadata.
+
 ## Fine-Tuning
 
 You can customize `auto_source` to improve its accuracy.
@@ -40,6 +43,8 @@ channel:
   url: https://example.com
 auto_source:
   scraper:
+    wordpress_api:
+      enabled: false # default: true
     schema:
       enabled: false # default: true
     semantic_html:
diff --git a/src/content/docs/ruby-gem/reference/cli-reference.mdx b/src/content/docs/ruby-gem/reference/cli-reference.mdx
@@ -24,6 +24,9 @@ html2rss auto https://example.com/articles
 # Force browserless for JavaScript-heavy pages
 html2rss auto https://example.com/app --strategy browserless
 
+# Set custom request budgets
+html2rss auto https://example.com/app --strategy browserless --max-redirects 5 --max-requests 6
+
 # Hint the item selector while keeping auto enhancement
 html2rss auto https://example.com/articles --items_selector ".post-card"
 ```
@@ -44,12 +47,17 @@ html2rss feed feeds.yml my-first-feed
 # Override the request strategy at runtime
 html2rss feed single.yml --strategy browserless
 
+# Override request budgets at runtime
+html2rss feed single.yml --max-redirects 5 --max-requests 6
+
 # Pass dynamic parameters into %<param>s placeholders
 html2rss feed single.yml --params id:42 foo:bar
 ```
 
 Command: `html2rss feed YAML_FILE [feed_name]`
 
+The CLI keeps `strategy` as a top-level override and writes runtime request limits into the generated config under `request`.
+
 ### Schema
 
 Prints the exported JSON Schema for the current gem version.
diff --git a/src/content/docs/ruby-gem/reference/selectors.mdx b/src/content/docs/ruby-gem/reference/selectors.mdx
@@ -70,7 +70,9 @@ selectors:
 Behavior:
 
 - `max_pages` is the total page budget for the item selector chain, including the initial page.
+- `max_pages` is capped by the system request ceiling of 10 pages per feed build.
 - Pagination follows strict `link[rel~="next"]` or `a[rel~="next"]` targets only.
+- Follow-up pages use the current page's effective origin after redirects.
 - Pagination stops when there is no next link, a page repeats, or the shared request budget is exhausted.
 - The same request safeguards apply to pagination and Browserless navigation, including timeout limits, redirect limits, response-size guards, and private-network denial.
 
@@ -120,10 +122,10 @@ Post-processors manipulate the extracted value.
 - `html_to_markdown`: Converts HTML to Markdown.
 - `markdown_to_html`: Converts Markdown to HTML.
 - `parse_time`: Parses a string into a `Time` object.
-- `parse_uri`: Parses a string into a `URI` object.
+- `parse_uri`: Resolves a relative URL against `channel.url` and returns the normalized URL string.
 - `sanitize_html`: Sanitizes HTML to prevent security vulnerabilities.
 - `substring`: Extracts a substring from a string.
-- `template`: Creates a new string from a template and other selector values.
+- `template`: Creates a new string from a template and other selector values. Use `%{self}` for the current selector value.
 
 > Always use the `sanitize_html` post-processor for any HTML content to prevent security risks.
 
diff --git a/src/content/docs/ruby-gem/reference/strategy.mdx b/src/content/docs/ruby-gem/reference/strategy.mdx
diff --git a/src/content/docs/ruby-gem/reference/wordpress-api.mdx b/src/content/docs/ruby-gem/reference/wordpress-api.mdx