Commit ab7296c

docs: add wordpress-api and update gem docs
1 parent 3fbbaf8 commit ab7296c

6 files changed

Lines changed: 141 additions & 10 deletions

src/content/docs/creating-custom-feeds.mdx

Lines changed: 15 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -160,6 +160,21 @@ html2rss supports many configuration options:
160160

161161
4. **Check the output:** Make sure all items have titles, links, and descriptions
162162

163+
### Useful CLI flags when a site is difficult
164+
165+
Some sites need a larger request budget than the defaults allow.
166+
167+
- Use `--max-redirects` when the site bounces through several canonicalization or tracking redirects before the real page loads.
168+
- Use `--max-requests` when your config needs more than one request, for example pagination or other follow-up fetches.
169+
170+
```bash
171+
html2rss feed your-config.yml --max-redirects 10
172+
html2rss feed your-config.yml --max-requests 5
173+
html2rss auto https://example.com/blog --max-redirects 10 --max-requests 5
174+
```
175+
176+
Keep these values as low as possible. If a site only needs one extra redirect, prefer `--max-redirects 4` over a much larger number.
177+
163178
## Add It To html2rss-web
164179

165180
Once the config works locally, add it to your `feeds.yml` or shared config repository and restart your

src/content/docs/getting-started.mdx

Lines changed: 16 additions & 0 deletions
@@ -25,3 +25,19 @@ That guide is the canonical setup flow for:
2525
- **[Browse working feed examples](/feed-directory/)**: see what successful outputs look like
2626
- **[Create Custom Feeds](/creating-custom-feeds)**: write configs when you need more control
2727
- **[Troubleshooting Guide](/troubleshooting/troubleshooting)**: fix startup or extraction problems
28+
29+
## Using the Ruby CLI
30+
31+
If you are working directly with the gem instead of `html2rss-web`, start with:
32+
33+
```bash
34+
html2rss auto https://example.com/blog
35+
```
36+
37+
If the target site is unusually redirect-heavy or needs extra follow-up requests, the CLI also supports:
38+
39+
```bash
40+
html2rss auto https://example.com/blog --max-redirects 10 --max-requests 5
41+
```
42+
43+
For config-driven runs, the same flags are available on `html2rss feed`.

src/content/docs/ruby-gem/how-to/advanced-features.mdx

Lines changed: 6 additions & 4 deletions
@@ -7,13 +7,13 @@ This guide covers advanced features and performance optimizations for html2rss.
77

88
## Parallel Processing
99

10-
html2rss uses parallel processing to improve performance when scraping multiple items. This happens automatically and doesn't require any configuration.
10+
html2rss uses parallel processing in auto-source discovery to improve performance when multiple scrapers inspect the same page. This happens automatically and doesn't require any configuration.
1111

1212
### How It Works
1313

14-
- **Auto-source scraping:** Multiple scrapers run in parallel to analyze the page
15-
- **Item processing:** Each scraped item is processed in parallel
16-
- **Performance benefit:** Significantly faster when dealing with many items
14+
- **Auto-source scraping:** Multiple scrapers run in parallel to analyze the same response body
15+
- **Selectors and pagination:** Selector extraction and `rel="next"` pagination stay sequential and share the same request budget
16+
- **Performance benefit:** Faster auto-discovery without changing selector semantics
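
As a rough illustration of the fan-out described above, the sketch below runs two stand-in scrapers against the same response body on separate threads. The scraper lambdas, the `run_scrapers` helper, and the use of plain `Thread` are assumptions made for this example; they are not html2rss internals.

```ruby
# Hypothetical stand-ins for auto-source scrapers: each one inspects the same body.
SCRAPERS = {
  link_count: ->(body) { body.scan(/<a\b/).size },
  article_count: ->(body) { body.scan(/<article\b/).size }
}.freeze

# Runs every scraper on its own thread and collects the results by name.
def run_scrapers(body, scrapers = SCRAPERS)
  threads = scrapers.map { |name, scraper| [name, Thread.new { scraper.call(body) }] }
  threads.to_h { |name, thread| [name, thread.value] }
end

body = '<article><a href="/1">One</a></article><article><a href="/2">Two</a></article>'
puts run_scrapers(body).inspect
```

Only the analysis step is parallel here. The body is fetched once and shared, which matches the note that selectors and pagination stay sequential.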
1717

1818
### Performance Tips
1919

@@ -75,6 +75,8 @@ selectors:
7575
extractor: "href"
7676
```
7777

78+
When you use the Browserless strategy, Chromium rejects transport-level headers such as `Host`, `Connection`, `Content-Length`, and `Transfer-Encoding`. html2rss filters those headers before navigation and logs the filtered header names at `info` level.
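
A minimal sketch of that kind of filtering, assuming a deny-list made up of only the four header names mentioned above. The constant and the `filter_transport_headers` helper are hypothetical and do not reflect the gem's actual implementation.

```ruby
# Assumed deny-list, taken from the header names mentioned above; the real list may be longer.
TRANSPORT_HEADERS = %w[host connection content-length transfer-encoding].freeze

# Splits request headers into those safe to forward and the names that were dropped.
def filter_transport_headers(headers)
  dropped, kept = headers.partition { |name, _value| TRANSPORT_HEADERS.include?(name.to_s.downcase) }
  [kept.to_h, dropped.map(&:first)]
end

kept, dropped = filter_transport_headers({
  "Host" => "example.com",
  "User-Agent" => "html2rss",
  "Content-Length" => "0"
})
puts "forwarding: #{kept.keys.join(', ')}"
puts "filtered: #{dropped.join(', ')}"
```

Matching is case-insensitive because HTTP header names are case-insensitive.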
79+
7880
## Monitoring and Debugging
7981

8082
### Enable Debug Logging

src/content/docs/ruby-gem/reference/auto-source.mdx

Lines changed: 9 additions & 4 deletions
@@ -17,16 +17,19 @@ auto_source: {}
1717
1818
`auto_source` uses the following strategies to find content:
1919

20-
1. **`schema`:** Parses `<script type="json/ld">` tags containing structured data (e.g., [Schema.org](https://schema.org/)).
21-
2. **`semantic_html`:** Searches for semantic HTML5 tags like `<article>`, `<main>`, and `<section>`.
22-
3. **`html`:** Analyzes the HTML structure to find frequently occurring selectors that are likely to contain the main content.
23-
4. **json_state:** Single-page applications often stash pre-rendered article data in `<script type="application/json">` tags or global variables
20+
1. **`wordpress_api`:** Detects the `<link rel="https://api.w.org/">` tag used by WordPress and pulls posts from the REST API without parsing article HTML. See [WordPress API](/ruby-gem/reference/wordpress-api/).
21+
2. **`schema`:** Parses `<script type="application/ld+json">` tags containing structured data (e.g., [Schema.org](https://schema.org/)).
22+
3. **`semantic_html`:** Searches for semantic HTML5 tags like `<article>`, `<main>`, and `<section>`.
23+
4. **`html`:** Analyzes the HTML structure to find frequently occurring selectors that are likely to contain the main content.
24+
5. **`json_state`:** Single-page applications often stash pre-rendered article data in `<script type="application/json">` tags or global variables
2425
such as `window.__NEXT_DATA__`, `window.__NUXT__`, or `window.STATE`. The JSON-state scraper walks those blobs, finds arrays with
2526
`title`/`url` pairs, and converts them into the same hashes produced by `HtmlExtractor`.
2627

2728
**`json_state` Limitations:** the scraper requires discoverable arrays of hashes containing clear `title` and `url` fields. Minified or
2829
obfuscated state objects, heavily encoded values, or blobs that require executing embedded functions are ignored.
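
The walk described for `json_state` can be pictured as a recursive search for arrays whose entries all carry `title` and `url` keys. The helper below is a hypothetical sketch of that idea, with an invented sample state blob; it is not the scraper's real code.

```ruby
require "json"

# Recursively collects arrays whose entries are all hashes with "title" and "url".
def article_arrays(node, found = [])
  case node
  when Array
    if !node.empty? && node.all? { |entry| entry.is_a?(Hash) && entry.key?("title") && entry.key?("url") }
      found << node
    else
      node.each { |child| article_arrays(child, found) }
    end
  when Hash
    node.each_value { |child| article_arrays(child, found) }
  end
  found
end

# Invented example of a pre-rendered state blob.
state = JSON.parse(<<~JSON)
  {"props": {"pageProps": {"posts": [
    {"title": "First", "url": "/first"},
    {"title": "Second", "url": "/second"}
  ], "tags": ["ruby", "rss"]}}}
JSON

article_arrays(state).each do |articles|
  articles.each { |article| puts "#{article['title']} -> #{article['url']}" }
end
```

Arrays such as `tags` are skipped because their entries are not hashes with both keys, which mirrors the limitation stated above.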
2930

31+
**`wordpress_api` Limitations:** this scraper depends on the page exposing a public WordPress REST API root. The current implementation fetches post records directly, but it does not yet resolve category names or featured media metadata.
32+
3033
## Fine-Tuning
3134

3235
You can customize `auto_source` to improve its accuracy.
@@ -40,6 +43,8 @@ channel:
4043
url: https://example.com
4144
auto_source:
4245
scraper:
46+
wordpress_api:
47+
enabled: false # default: true
4348
schema:
4449
enabled: false # default: true
4550
semantic_html:

src/content/docs/ruby-gem/reference/selectors.mdx

Lines changed: 4 additions & 2 deletions
@@ -70,7 +70,9 @@ selectors:
7070
Behavior:
7171

7272
- `max_pages` is the total page budget for the item selector chain, including the initial page.
73+
- `max_pages` is capped by the system request ceiling of 10 pages per feed build.
7374
- Pagination follows strict `link[rel~="next"]` or `a[rel~="next"]` targets only.
75+
- Follow-up pages use the current page's effective origin after redirects.
7476
- Pagination stops when there is no next link, a page repeats, or the shared request budget is exhausted.
7577
- The same request safeguards apply to pagination and Browserless navigation, including timeout limits, redirect limits, response-size guards, and private-network denial.
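
The stop conditions above can be sketched as a small loop. Everything in this example is an assumption for illustration: the `collect_pages` helper, the shape of the fetched page hash, and the in-memory `FAKE_SITE` that replaces real HTTP requests.

```ruby
require "set"
require "uri"

SYSTEM_PAGE_CEILING = 10 # ceiling stated in the list above

# fetch is any callable returning { effective_url:, next_href: } for a URL.
def collect_pages(start_url, max_pages:, fetch:)
  limit = [max_pages, SYSTEM_PAGE_CEILING].min
  visited = Set.new
  url = start_url
  # Stop when there is no next link, the budget is used up, or a page repeats.
  while url && visited.size < limit && visited.add?(url)
    page = fetch.call(url)
    href = page[:next_href]
    # Resolve the next link against the URL the request ended up on after redirects.
    url = href && URI.join(page[:effective_url], href).to_s
  end
  visited.to_a
end

# Invented site: the first page redirects to the www host, and page 3 links back to page 2.
FAKE_SITE = {
  "https://example.com/blog" => { effective_url: "https://www.example.com/blog", next_href: "/blog/page/2" },
  "https://www.example.com/blog/page/2" => { effective_url: "https://www.example.com/blog/page/2", next_href: "/blog/page/3" },
  "https://www.example.com/blog/page/3" => { effective_url: "https://www.example.com/blog/page/3", next_href: "/blog/page/2" }
}.freeze

puts collect_pages("https://example.com/blog", max_pages: 5, fetch: ->(url) { FAKE_SITE.fetch(url) })
```

The initial page counts toward `max_pages`, so `max_pages: 1` fetches only the starting URL.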
7678

@@ -120,10 +122,10 @@ Post-processors manipulate the extracted value.
120122
- `html_to_markdown`: Converts HTML to Markdown.
121123
- `markdown_to_html`: Converts Markdown to HTML.
122124
- `parse_time`: Parses a string into a `Time` object.
123-
- `parse_uri`: Parses a string into a `URI` object.
125+
- `parse_uri`: Resolves a relative URL against `channel.url` and returns the normalized URL string.
124126
- `sanitize_html`: Sanitizes HTML to prevent security vulnerabilities.
125127
- `substring`: Extracts a substring from a string.
126-
- `template`: Creates a new string from a template and other selector values.
128+
- `template`: Creates a new string from a template and other selector values. Use `%{self}` for the current selector value.
127129

128130
> Always use the `sanitize_html` post-processor for any HTML content to prevent security risks.
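
To make the `parse_uri` and `template` descriptions concrete, here is a plain-Ruby sketch of the two behaviors. It relies only on the standard library (`URI.join` and `Kernel#format`, whose `%{name}` placeholders look like the template syntax); the helper names are hypothetical, and whether html2rss is implemented this way is an assumption.

```ruby
require "uri"

# Resolves a possibly relative value against the channel URL and returns a String.
def resolve_url(channel_url, value)
  URI.join(channel_url, value).normalize.to_s
end

# Fills a template from named values; :self stands for the current selector's value.
def render_template(template, current_value, other_values = {})
  format(template, other_values.merge(self: current_value))
end

puts resolve_url("https://example.com/blog/", "posts/1")
puts render_template("%{self} (by %{author})", "Hello", { author: "Ada" })
```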
129131

Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
1+
---
2+
title: "WordPress API"
3+
description: "Use html2rss auto_source to read WordPress sites through their REST API instead of scraping article HTML."
4+
---
5+
6+
The `wordpress_api` scraper is part of `auto_source`. It detects WordPress sites that advertise a REST API in the page `<head>` and then fetches structured post data directly from that API.
7+
8+
This is usually more reliable than HTML scraping because the response already contains fields such as title, content, excerpt, permalink, publish date, and category IDs.
9+
10+
## Detection
11+
12+
The scraper activates when the page contains:
13+
14+
```html
15+
<link rel="https://api.w.org/" href="https://example.com/wp-json/" />
16+
```
17+
18+
When that tag is present, `html2rss` resolves the API root and requests:
19+
20+
```text
21+
wp/v2/posts?per_page=100&_fields=id,title,excerpt,content,link,date,categories
22+
```
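
A sketch of how that request URL could be assembled from the `<link>` tag's `href`. The `wordpress_posts_url` helper is hypothetical, and the sketch assumes a path-style API root such as `/wp-json/`; roots expressed as a query string are not handled here.

```ruby
require "uri"

POSTS_QUERY = "per_page=100&_fields=id,title,excerpt,content,link,date,categories"

# Hypothetical helper: resolves the API root against the page URL, then appends the posts path.
def wordpress_posts_url(page_url, api_href)
  api_root = URI.join(page_url, api_href).to_s
  api_root += "/" unless api_root.end_with?("/")
  posts = URI.join(api_root, "wp/v2/posts")
  posts.query = POSTS_QUERY
  posts.to_s
end

puts wordpress_posts_url("https://example.com/blog", "/wp-json/")
```

The trailing slash matters: without it, `URI.join` would replace the last path segment of the API root instead of appending to it.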
23+
24+
## Basic Usage
25+
26+
Enable `auto_source` as usual:
27+
28+
```yml
29+
channel:
30+
url: "https://example.com/blog"
31+
auto_source: {}
32+
```
33+
34+
If the target is a standard WordPress site with a public API, no selector configuration is required.
35+
36+
## Configure The Scraper
37+
38+
You can disable the WordPress scraper while keeping the rest of `auto_source` enabled:
39+
40+
```yml
41+
channel:
42+
url: "https://example.com/blog"
43+
auto_source:
44+
scraper:
45+
wordpress_api:
46+
enabled: false
47+
```
48+
49+
This is useful if a site exposes the API link but you prefer another auto-source strategy.
50+
51+
## What Gets Extracted
52+
53+
The current scraper maps the WordPress post payload into `html2rss` article fields like this:
54+
55+
| WordPress field | html2rss article field |
56+
| ------------------ | ---------------------- |
57+
| `id` | `id` |
58+
| `title.rendered` | `title` |
59+
| `content.rendered` | `description` |
60+
| `link` | `url` |
61+
| `date` | `published_at` |
62+
| `categories` | `categories` |
63+
64+
If `content.rendered` is blank, the scraper falls back to `excerpt.rendered`.
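
The mapping and fallback can be sketched as follows. The `article_from_post` helper and the sample payload are illustrative assumptions that follow the table above; they are not the scraper's actual code.

```ruby
require "json"

# Hypothetical mapper following the table above.
def article_from_post(post)
  content = post.dig("content", "rendered").to_s
  excerpt = post.dig("excerpt", "rendered").to_s
  {
    id: post["id"],
    title: post.dig("title", "rendered"),
    # Fall back to the excerpt when the rendered content is blank.
    description: content.strip.empty? ? excerpt : content,
    url: post["link"],
    published_at: post["date"], # left as the raw string in this sketch
    categories: Array(post["categories"]).map(&:to_s) # IDs kept as strings
  }
end

# Invented sample payload with blank content.
post = JSON.parse(<<~JSON)
  {"id": 7, "title": {"rendered": "Hello"}, "content": {"rendered": ""},
   "excerpt": {"rendered": "<p>Short summary</p>"},
   "link": "https://example.com/hello/", "date": "2024-01-02T03:04:05",
   "categories": [3, 12]}
JSON

article = article_from_post(post)
puts article[:description]
puts article[:categories].inspect
```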
65+
66+
## Behavior Notes
67+
68+
- The scraper uses the shared request session, so it participates in the same request safety model as the rest of the feed build.
69+
- It resolves relative API links against `channel.url`.
70+
- It currently stores WordPress category IDs as strings because category-name resolution is not implemented yet.
71+
- It currently does not resolve `featured_media` into an image URL.
72+
73+
## When To Use It
74+
75+
Prefer `wordpress_api` when:
76+
77+
- The page is clearly powered by WordPress
78+
- The REST API is public
79+
- You want more stable extraction than CSS selectors or heuristic HTML scraping
80+
81+
Prefer manual selectors when:
82+
83+
- The site blocks the API or customizes it heavily
84+
- You need fields that are not exposed by the post endpoint
85+
- You want complete control over item filtering or presentation
86+
87+
## Related Docs
88+
89+
- [Auto Source](/ruby-gem/reference/auto-source/)
90+
- [Selectors](/ruby-gem/reference/selectors/)
91+
- [Scraping JSON Responses](/ruby-gem/how-to/scraping-json/)
