docs: add wordpress-api and update gem docs

gildesmarais · gildesmarais · commit ab7296c56460 · 2026-03-18T11:54:15.000+01:00
diff --git a/src/content/docs/creating-custom-feeds.mdx b/src/content/docs/creating-custom-feeds.mdx
@@ -160,6 +160,21 @@ html2rss supports many configuration options:
 
 4. **Check the output:** Make sure all items have titles, links, and descriptions
 
+### Useful CLI flags when a site is difficult
+
+Some sites need a larger request budget than the defaults allow.
+
+- Use `--max-redirects` when the site bounces through several canonicalization or tracking redirects before the real page loads.
+- Use `--max-requests` when your config needs more than one request, for example pagination or other follow-up fetches.
+
+```bash
+html2rss feed your-config.yml --max-redirects 10
+html2rss feed your-config.yml --max-requests 5
+html2rss auto https://example.com/blog --max-redirects 10 --max-requests 5
+```
+
+Keep these values as low as possible. If a site only needs one extra redirect, prefer `--max-redirects 4` over a much larger number.
+
 ## Add It To html2rss-web
 
 Once the config works locally, add it to your `feeds.yml` or shared config repository and restart your
diff --git a/src/content/docs/getting-started.mdx b/src/content/docs/getting-started.mdx
@@ -25,3 +25,19 @@ That guide is the canonical setup flow for:
 - **[Browse working feed examples](/feed-directory/)**: see what successful outputs look like
 - **[Create Custom Feeds](/creating-custom-feeds)**: write configs when you need more control
 - **[Troubleshooting Guide](/troubleshooting/troubleshooting)**: fix startup or extraction problems
+
+## Using the Ruby CLI
+
+If you are working directly with the gem instead of `html2rss-web`, start with:
+
+```bash
+html2rss auto https://example.com/blog
+```
+
+If the target site is unusually redirect-heavy or needs extra follow-up requests, the CLI also supports:
+
+```bash
+html2rss auto https://example.com/blog --max-redirects 10 --max-requests 5
+```
+
+For config-driven runs, the same flags are available on `html2rss feed`.
diff --git a/src/content/docs/ruby-gem/how-to/advanced-features.mdx b/src/content/docs/ruby-gem/how-to/advanced-features.mdx
@@ -7,13 +7,13 @@ This guide covers advanced features and performance optimizations for html2rss.
 
 ## Parallel Processing
 
-html2rss uses parallel processing to improve performance when scraping multiple items. This happens automatically and doesn't require any configuration.
+html2rss uses parallel processing in auto-source discovery to improve performance when multiple scrapers inspect the same page. This happens automatically and doesn't require any configuration.
 
 ### How It Works
 
-- **Auto-source scraping:** Multiple scrapers run in parallel to analyze the page
-- **Item processing:** Each scraped item is processed in parallel
-- **Performance benefit:** Significantly faster when dealing with many items
+- **Auto-source scraping:** Multiple scrapers run in parallel to analyze the same response body
+- **Selectors and pagination:** Selector extraction and `rel="next"` pagination stay sequential and share the same request budget
+- **Performance benefit:** Faster auto-discovery without changing selector semantics
 
 ### Performance Tips
 
@@ -75,6 +75,8 @@ selectors:
     extractor: "href"
 ```
 
+When you use the Browserless strategy, Chromium rejects transport-level headers such as `Host`, `Connection`, `Content-Length`, and `Transfer-Encoding`. html2rss filters those headers before navigation and logs the filtered header names at `info` level.
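+
+As a rough illustration of the filtering described above, the sketch below sends custom headers through the Browserless strategy. It assumes `strategy` and `headers` are top-level config keys; the header values are placeholders, and `Host` is included only to show a header that would be filtered.
+
+```yml
+# Sketch only: assumes top-level `strategy` and `headers` keys.
+strategy: browserless
+headers:
+  Accept-Language: "en-US,en;q=0.9" # not in the filtered list above
+  Host: "example.com" # transport-level header, filtered before navigation
+channel:
+  url: https://example.com/blog
+auto_source: {}
+```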
+
 ## Monitoring and Debugging
 
 ### Enable Debug Logging
diff --git a/src/content/docs/ruby-gem/reference/auto-source.mdx b/src/content/docs/ruby-gem/reference/auto-source.mdx
@@ -17,16 +17,19 @@ auto_source: {}
 
 `auto_source` uses the following strategies to find content:
 
-1.  **`schema`:** Parses `<script type="json/ld">` tags containing structured data (e.g., [Schema.org](https://schema.org/)).
-2.  **`semantic_html`:** Searches for semantic HTML5 tags like `<article>`, `<main>`, and `<section>`.
-3.  **`html`:** Analyzes the HTML structure to find frequently occurring selectors that are likely to contain the main content.
-4.  **json_state:** Single-page applications often stash pre-rendered article data in `<script type="application/json">` tags or global variables
+1.  **`wordpress_api`:** Detects the `<link rel="https://api.w.org/">` tag used by WordPress and pulls posts from the REST API without parsing article HTML. See [WordPress API](/ruby-gem/reference/wordpress-api/).
+2.  **`schema`:** Parses `<script type="application/ld+json">` tags containing structured data (e.g., [Schema.org](https://schema.org/)).
+3.  **`semantic_html`:** Searches for semantic HTML5 tags like `<article>`, `<main>`, and `<section>`.
+4.  **`html`:** Analyzes the HTML structure to find frequently occurring selectors that are likely to contain the main content.
+5.  **`json_state`:** Single-page applications often stash pre-rendered article data in `<script type="application/json">` tags or global variables
     such as `window.__NEXT_DATA__`, `window.__NUXT__`, or `window.STATE`. The JSON-state scraper walks those blobs, finds arrays with
     `title`/`url` pairs, and converts them into the same hashes produced by `HtmlExtractor`.
 
 **`json_state` Limitations:** the scraper requires discoverable arrays of hashes containing clear `title` and `url` fields. Minified or
 obfuscated state objects, heavily encoded values, or blobs that require executing embedded functions are ignored.
 
+**`wordpress_api` Limitations:** this scraper depends on the page exposing a public WordPress REST API root. The current implementation fetches post records directly, but it does not yet resolve category names or featured media metadata.
+
 ## Fine-Tuning
 
 You can customize `auto_source` to improve its accuracy.
@@ -40,6 +43,8 @@ channel:
   url: https://example.com
 auto_source:
   scraper:
+    wordpress_api:
+      enabled: false # default: true
     schema:
       enabled: false # default: true
     semantic_html:
diff --git a/src/content/docs/ruby-gem/reference/selectors.mdx b/src/content/docs/ruby-gem/reference/selectors.mdx
@@ -70,7 +70,9 @@ selectors:
 Behavior:
 
 - `max_pages` is the total page budget for the item selector chain, including the initial page.
+- `max_pages` is capped by the system request ceiling of 10 pages per feed build.
 - Pagination follows strict `link[rel~="next"]` or `a[rel~="next"]` targets only.
+- Follow-up pages use the current page's effective origin after redirects.
 - Pagination stops when there is no next link, a page repeats, or the shared request budget is exhausted.
 - The same request safeguards apply to pagination and Browserless navigation, including timeout limits, redirect limits, response-size guards, and private-network denial.
 
@@ -120,10 +122,10 @@ Post-processors manipulate the extracted value.
 - `html_to_markdown`: Converts HTML to Markdown.
 - `markdown_to_html`: Converts Markdown to HTML.
 - `parse_time`: Parses a string into a `Time` object.
-- `parse_uri`: Parses a string into a `URI` object.
+- `parse_uri`: Resolves a relative URL against `channel.url` and returns the normalized URL string.
 - `sanitize_html`: Sanitizes HTML to prevent security vulnerabilities.
 - `substring`: Extracts a substring from a string.
-- `template`: Creates a new string from a template and other selector values.
+- `template`: Creates a new string from a template and other selector values. Use `%{self}` for the current selector value.
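+
+A short sketch of `template` and `parse_uri` in a config, assuming the `post_process` list syntax and the `string` option of `template`. The selector names (`url`, `author`) and the CSS selectors (`h2`, `.author`, `a`) are placeholders for whatever the target page and your config actually use.
+
+```yml
+selectors:
+  items:
+    selector: "article"
+  title:
+    selector: "h2"
+    post_process:
+      - name: template
+        string: "%{self} (by %{author})" # %{self} is this selector's own value
+  author:
+    selector: ".author"
+  url:
+    selector: "a"
+    extractor: "href"
+    post_process:
+      - name: parse_uri # resolves a relative href against channel.url
+```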
 
 > Always use the `sanitize_html` post-processor for any HTML content to prevent security risks.
 
diff --git a/src/content/docs/ruby-gem/reference/wordpress-api.mdx b/src/content/docs/ruby-gem/reference/wordpress-api.mdx
@@ -0,0 +1,91 @@
+---
+title: "WordPress API"
+description: "Use html2rss auto_source to read WordPress sites through their REST API instead of scraping article HTML."
+---
+
+The `wordpress_api` scraper is part of `auto_source`. It detects WordPress sites that advertise a REST API in the page `<head>` and then fetches structured post data directly from that API.
+
+This is usually more reliable than HTML scraping because the response already contains fields such as title, content, excerpt, permalink, publish date, and category IDs.
+
+## Detection
+
+The scraper activates when the page contains:
+
+```html
+<link rel="https://api.w.org/" href="https://example.com/wp-json/" />
+```
+
+When that tag is present, `html2rss` resolves the API root and requests:
+
+```text
+wp/v2/posts?per_page=100&_fields=id,title,excerpt,content,link,date,categories
+```
+
+## Basic Usage
+
+Enable `auto_source` as usual:
+
+```yml
+channel:
+  url: "https://example.com/blog"
+auto_source: {}
+```
+
+If the target is a standard WordPress site with a public API, no selector configuration is required.
+
+## Configure The Scraper
+
+You can disable the WordPress scraper while keeping the rest of `auto_source` enabled:
+
+```yml
+channel:
+  url: "https://example.com/blog"
+auto_source:
+  scraper:
+    wordpress_api:
+      enabled: false
+```
+
+This is useful if a site exposes the API link but you prefer another auto-source strategy.
+
+## What Gets Extracted
+
+The current scraper maps the WordPress post payload into `html2rss` article fields like this:
+
+| WordPress field    | html2rss article field |
+| ------------------ | ---------------------- |
+| `id`               | `id`                   |
+| `title.rendered`   | `title`                |
+| `content.rendered` | `description`          |
+| `link`             | `url`                  |
+| `date`             | `published_at`         |
+| `categories`       | `categories`           |
+
+If `content.rendered` is blank, the scraper falls back to `excerpt.rendered`.
+
+## Behavior Notes
+
+- The scraper uses the shared request session, so it participates in the same request safety model as the rest of the feed build.
+- It resolves relative API links against `channel.url`.
+- It currently stores WordPress category IDs as strings because category-name resolution is not implemented yet.
+- It currently does not resolve `featured_media` into an image URL.
+
+## When To Use It
+
+Prefer `wordpress_api` when:
+
+- The page is clearly powered by WordPress
+- The REST API is public
+- You want more stable extraction than CSS selectors or heuristic HTML scraping
+
+Prefer manual selectors when:
+
+- The site blocks the API or customizes it heavily
+- You need fields that are not exposed by the post endpoint
+- You want complete control over item filtering or presentation
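+
+For the manual-selector route, a config might look like the sketch below. The CSS selectors assume a typical WordPress theme (`article.post`, `.entry-title`, `.entry-content`) and will differ per site. The `url` and `description` selector names are also assumptions to check against the Selectors reference.
+
+```yml
+channel:
+  url: "https://example.com/blog"
+selectors:
+  items:
+    selector: "article.post"
+  title:
+    selector: ".entry-title"
+  url:
+    selector: ".entry-title a"
+    extractor: "href"
+  description:
+    selector: ".entry-content"
+    post_process:
+      - name: sanitize_html # sanitize any extracted HTML
+```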
+
+## Related Docs
+
+- [Auto Source](/ruby-gem/reference/auto-source/)
+- [Selectors](/ruby-gem/reference/selectors/)
+- [Scraping JSON Responses](/ruby-gem/how-to/scraping-json/)