Commit bd9e5e8

docs: clarify Ruby request budgets and auto_source (#1100)
1 parent 55b81e9 commit bd9e5e8

10 files changed

Lines changed: 313 additions & 20 deletions

src/content/docs/creating-custom-feeds.mdx

Lines changed: 17 additions & 0 deletions
@@ -6,6 +6,7 @@ sidebar:
 ---

 import { Aside } from "@astrojs/starlight/components";
+import Code from "astro/components/Code.astro";

 When auto-sourcing isn't enough, you can write your own configuration files to create custom RSS feeds for any website. This guide shows you how to take full control with YAML configs.

@@ -160,6 +161,22 @@ html2rss supports many configuration options:

 4. **Check the output:** Make sure all items have titles, links, and descriptions

+### Useful CLI flags when a site is difficult
+
+Some sites need a little more request budget than the defaults.
+
+- Use `--max-redirects` when the site bounces through several canonicalization or tracking redirects before the real page loads.
+- Use `--max-requests` when your config needs more than one request, for example pagination or other follow-up fetches.
+
+<Code
+code={`html2rss feed your-config.yml --max-redirects 10
+html2rss feed your-config.yml --max-requests 5
+html2rss auto https://example.com/blog --max-redirects 10 --max-requests 5`}
+lang="bash"
+/>
+
+Keep these values tight. Raise them only when the site proves it needs more.
+
 ## Add It To html2rss-web

 Once the config works locally, add it to your `feeds.yml` or shared config repository and restart your

src/content/docs/getting-started.mdx

Lines changed: 14 additions & 0 deletions
@@ -5,6 +5,8 @@ sidebar:
 order: 1
 ---

+import Code from "astro/components/Code.astro";
+
 This page points to the main onboarding flow.

 ## Start Here
@@ -23,3 +25,15 @@ That guide is the canonical setup flow for:
 - **[Browse working feed examples](/feed-directory/)** - See what success looks like
 - **[Create Custom Feeds](/creating-custom-feeds)** - Write configs when you need more control
 - **[Troubleshooting Guide](/troubleshooting/troubleshooting)** - Fix startup or extraction problems
+
+## Using the Ruby CLI
+
+If you are working directly with the gem instead of `html2rss-web`, start with:
+
+<Code code={`html2rss auto https://example.com/blog`} lang="bash" />
+
+If the target site is unusually redirect-heavy or needs extra follow-up requests, the CLI also supports:
+
+<Code code={`html2rss auto https://example.com/blog --max-redirects 10 --max-requests 5`} lang="bash" />
+
+For config-driven runs, the same flags are available on `html2rss feed`.
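
Note on what `--max-redirects` and `--max-requests` bound: the following is a minimal Ruby sketch, not the gem's implementation. It assumes (the commit does not confirm this) that every fetch, including each redirect hop, is charged against one shared request budget. `RequestBudget`, `TooManyRedirects`, and `resolve` are hypothetical names, and the redirect table stands in for real HTTP responses.

```ruby
# Hypothetical illustration only; html2rss's internal classes may differ.
class RequestBudget
  class Exhausted < StandardError; end

  attr_reader :remaining

  def initialize(max_requests:)
    raise ArgumentError, 'max_requests must be positive' unless max_requests.positive?

    @remaining = max_requests
  end

  # Spend one request from the budget, raising once nothing is left.
  def consume!
    raise Exhausted, 'request budget exhausted' if @remaining.zero?

    @remaining -= 1
  end
end

TooManyRedirects = Class.new(StandardError)

# Follow redirects through an in-memory table (url => redirect target or nil).
# Assumption: every hop costs one request and the hop count is capped separately.
def resolve(url, redirects:, budget:, max_redirects:)
  hops = 0
  loop do
    budget.consume!
    target = redirects[url]
    return url if target.nil?

    hops += 1
    raise TooManyRedirects, "more than #{max_redirects} redirects" if hops > max_redirects

    url = target
  end
end
```

Under these assumptions a site that redirects twice before serving the page spends three requests, which is why a tight `--max-requests` can fail on redirect-heavy sites even when `--max-redirects` is generous.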

src/content/docs/ruby-gem/how-to/advanced-features.mdx

Lines changed: 2 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -7,13 +7,7 @@ This guide covers advanced features and performance optimizations for html2rss.

 ## Parallel Processing

-html2rss uses parallel processing to improve performance when scraping multiple items. This happens automatically and doesn't require any configuration.
-
-### How It Works
-
-- **Auto-source scraping:** Multiple scrapers run in parallel to analyze the page
-- **Item processing:** Each scraped item is processed in parallel
-- **Performance benefit:** Significantly faster when dealing with many items
+html2rss uses parallel processing in auto-source discovery. This happens automatically and doesn't require any configuration.

 ### Performance Tips

@@ -88,7 +82,7 @@ LOG_LEVEL=debug html2rss feed config.yml
 Use the health check endpoint to monitor feed generation:

 ```bash
-curl -u username:password http://localhost:3000/health_check.txt
+curl -u username:password http://localhost:4000/health_check.txt
 ```

 ## Article Validation

src/content/docs/ruby-gem/how-to/custom-http-requests.mdx

Lines changed: 41 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -3,7 +3,15 @@ title: "Custom HTTP Requests"
 description: "Learn how to customize HTTP requests with custom headers, authentication, and API interactions for html2rss."
 ---

-Some websites require custom HTTP headers, authentication, or other request settings to access their content. `html2rss` lets you customize requests for those cases.
+import Code from "astro/components/Code.astro";
+
+Some sites only work when requests carry the headers, tokens, or cookies your browser uses. `html2rss` supports those cases without changing the rest of your feed workflow.
+
+Keep this structure in mind:
+
+- `headers` stays top-level
+- `strategy` stays top-level
+- request-specific controls such as budgets and Browserless options live under `request`

 ## When You Need Custom Headers

@@ -19,8 +27,8 @@ You might need custom HTTP requests when:

 Add a `headers` section to your feed configuration. This example is a complete, valid config:

-```yaml
-headers:
+<Code
+code={`headers:
 User-Agent: "Mozilla/5.0 (compatible; html2rss/1.0)"
 Authorization: "Bearer YOUR_API_TOKEN"
 Accept: "application/json"
@@ -32,8 +40,36 @@ selectors:
 title:
 selector: "title"
 url:
-selector: "url"
-```
+selector: "url"`}
+lang="yaml"
+/>
+
+## Request Controls
+
+Request budgets are configured under `request`, not as top-level keys:
+
+<Code
+code={`headers:
+User-Agent: "Mozilla/5.0 (compatible; html2rss/1.0)"
+request:
+max_redirects: 5
+max_requests: 6
+channel:
+url: https://example.com/articles
+selectors:
+items:
+selector: article
+title:
+selector: h2
+url:
+selector: a
+extractor: href`}
+lang="yaml"
+/>
+
+- `request.max_redirects` limits redirect hops
+- `request.max_requests` limits the total request budget for the feed build
+- `request.browserless.*` is reserved for Browserless-only behavior such as preload actions

 ## Common Use Cases

src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx

Lines changed: 73 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3,12 +3,38 @@ title: Handling Dynamic Content
 description: "Learn how to handle JavaScript-heavy websites and dynamic content with html2rss. Use browserless strategy for sites that load content dynamically."
 ---

+import Code from "astro/components/Code.astro";
+
 Some websites load their content dynamically using JavaScript. The default `html2rss` strategy might not see this content.

 ## Solution

 Use the [`browserless` strategy](/ruby-gem/reference/strategy) to render JavaScript-heavy websites with a headless browser.

+Keep the strategy at the top level and put request-specific options under `request`:
+
+<Code
+code={`strategy: browserless
+request:
+max_redirects: 5
+max_requests: 6
+browserless:
+preload:
+wait_for_network_idle:
+timeout_ms: 5000
+channel:
+url: https://example.com/app
+selectors:
+items:
+selector: .article
+title:
+selector: h2
+url:
+selector: a
+extractor: href`}
+lang="yaml"
+/>
+
 ## When to Use Browserless

 The `browserless` strategy is necessary when:
@@ -18,6 +44,53 @@ The `browserless` strategy is necessary when:
 - **Infinite scroll** - Content loads as you scroll
 - **Dynamic forms** - Content changes based on user interaction

+## Preload Actions
+
+For dynamic sites, rendering once is often not enough. Use `request.browserless.preload` to wait, click, or scroll before the
+HTML snapshot is taken.
+
+### Wait for JavaScript Requests
+
+```yaml
+strategy: browserless
+request:
+browserless:
+preload:
+wait_for_network_idle:
+timeout_ms: 4000
+```
+
+### Click "Load More" Buttons
+
+```yaml
+strategy: browserless
+request:
+browserless:
+preload:
+click_selectors:
+- selector: ".load-more"
+max_clicks: 3
+delay_ms: 250
+wait_for_network_idle:
+timeout_ms: 3000
+```
+
+### Scroll Infinite Lists
+
+```yaml
+strategy: browserless
+request:
+browserless:
+preload:
+scroll_down:
+iterations: 5
+delay_ms: 200
+wait_for_network_idle:
+timeout_ms: 2500
+```
+
+These preload steps can be combined in a single config when a site needs several interactions before all items appear.
+
 ## Performance Considerations

 The `browserless` strategy is slower than the default `faraday` strategy because it:

src/content/docs/ruby-gem/reference/auto-source.mdx

Lines changed: 9 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -17,16 +17,19 @@ auto_source: {}

 `auto_source` uses the following strategies to find content:

-1. **`schema`:** Parses `<script type="json/ld">` tags containing structured data (e.g., [Schema.org](https://schema.org/)).
-2. **`semantic_html`:** Searches for semantic HTML5 tags like `<article>`, `<main>`, and `<section>`.
-3. **`html`:** Analyzes the HTML structure to find frequently occurring selectors that are likely to contain the main content.
-4. **json_state:** Single-page applications often stash pre-rendered article data in `<script type="application/json">` tags or global variables
+1. **`wordpress_api`:** Detects the `<link rel="https://api.w.org/">` tag used by WordPress and pulls posts from the REST API without parsing article HTML. See [WordPress API](/ruby-gem/reference/wordpress-api/).
+2. **`schema`:** Parses `<script type="json/ld">` tags containing structured data (e.g., [Schema.org](https://schema.org/)).
+3. **`semantic_html`:** Searches for semantic HTML5 tags like `<article>`, `<main>`, and `<section>`.
+4. **`html`:** Analyzes the HTML structure to find frequently occurring selectors that are likely to contain the main content.
+5. **json_state:** Single-page applications often stash pre-rendered article data in `<script type="application/json">` tags or global variables
 such as `window.__NEXT_DATA__`, `window.__NUXT__`, or `window.STATE`. The JSON-state scraper walks those blobs, finds arrays with
 `title`/`url` pairs, and converts them into the same hashes produced by `HtmlExtractor`.

 **`json_state` Limitations:** the scraper requires discoverable arrays of hashes containing clear `title` and `url` fields. Minified or
 obfuscated state objects, heavily encoded values, or blobs that require executing embedded functions are ignored.

+**`wordpress_api` Limitations:** this scraper depends on the page exposing a public WordPress REST API root. The current implementation fetches post records directly, but it does not yet resolve category names or featured media metadata.
+
 ## Fine-Tuning

 You can customize `auto_source` to improve its accuracy.
@@ -40,6 +43,8 @@ channel:
 url: https://example.com
 auto_source:
 scraper:
+wordpress_api:
+enabled: false # default: true
 schema:
 enabled: false # default: true
 semantic_html:

src/content/docs/ruby-gem/reference/cli-reference.mdx

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,9 @@ html2rss auto https://example.com/articles
 # Force browserless for JavaScript-heavy pages
 html2rss auto https://example.com/app --strategy browserless

+# Set custom request budgets
+html2rss auto https://example.com/app --strategy browserless --max-redirects 5 --max-requests 6
+
 # Hint the item selector while keeping auto enhancement
 html2rss auto https://example.com/articles --items_selector ".post-card"
 ```
@@ -44,12 +47,17 @@ html2rss feed feeds.yml my-first-feed
 # Override the request strategy at runtime
 html2rss feed single.yml --strategy browserless

+# Override request budgets at runtime
+html2rss feed single.yml --max-redirects 5 --max-requests 6
+
 # Pass dynamic parameters into %<param>s placeholders
 html2rss feed single.yml --params id:42 foo:bar
 ```

 Command: `html2rss feed YAML_FILE [feed_name]`

+The CLI keeps `strategy` as a top-level override and writes runtime request limits into the generated config under `request`.
+
 ### Schema

 Prints the exported JSON Schema for the current gem version.
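
Note on the override behavior added in the hunk above (`strategy` stays top-level, runtime limits land under `request`): a minimal Ruby sketch of that merge, for illustration only. `apply_runtime_overrides` and its keyword arguments are hypothetical names, not part of the html2rss API, and the config is assumed to be a plain string-keyed hash as loaded from YAML.

```ruby
# Hypothetical helper; the name and signature are illustrative, not html2rss's API.
# Returns a new config hash and leaves the loaded config untouched.
def apply_runtime_overrides(config, strategy: nil, max_redirects: nil, max_requests: nil)
  merged = config.dup
  merged['strategy'] = strategy if strategy

  # Only flags that were actually passed become overrides.
  limits = { 'max_redirects' => max_redirects, 'max_requests' => max_requests }.compact
  merged['request'] = (config['request'] || {}).merge(limits) unless limits.empty?

  merged
end
```

In this sketch, existing keys under `request` (for example `browserless`) survive the merge, and a flag that was not passed leaves the configured value in place.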

src/content/docs/ruby-gem/reference/selectors.mdx

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -70,7 +70,9 @@ selectors:
 Behavior:

 - `max_pages` is the total page budget for the item selector chain, including the initial page.
+- `max_pages` is capped by the system request ceiling of 10 pages per feed build.
 - Pagination follows strict `link[rel~="next"]` or `a[rel~="next"]` targets only.
+- Follow-up pages use the current page's effective origin after redirects.
 - Pagination stops when there is no next link, a page repeats, or the shared request budget is exhausted.
 - The same request safeguards apply to pagination and Browserless navigation, including timeout limits, redirect limits, response-size guards, and private-network denial.

@@ -120,10 +122,10 @@ Post-processors manipulate the extracted value.
 - `html_to_markdown`: Converts HTML to Markdown.
 - `markdown_to_html`: Converts Markdown to HTML.
 - `parse_time`: Parses a string into a `Time` object.
-- `parse_uri`: Parses a string into a `URI` object.
+- `parse_uri`: Resolves a relative URL against `channel.url` and returns the normalized URL string.
 - `sanitize_html`: Sanitizes HTML to prevent security vulnerabilities.
 - `substring`: Extracts a substring from a string.
-- `template`: Creates a new string from a template and other selector values.
+- `template`: Creates a new string from a template and other selector values. Use `%{self}` for the current selector value.

 > Always use the `sanitize_html` post-processor for any HTML content to prevent security risks.
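
Note on the pagination rules in the first `selectors.mdx` hunk: the stop conditions listed there can be sketched in Ruby as below. This is an illustration under stated assumptions, not the gem's code. `paginate` and `SYSTEM_PAGE_CEILING` are hypothetical names, the `pages` hash stands in for fetching a page and reading its `rel="next"` target, and each page is assumed to cost exactly one request from the shared budget.

```ruby
require 'set'

# The 10-page ceiling comes from the docs change; the constant name is made up.
SYSTEM_PAGE_CEILING = 10

# pages: url => next url (nil when the page has no rel="next" link).
# Returns the urls that would be fetched, in order.
def paginate(start_url, pages:, max_pages:, request_budget:)
  limit = [max_pages, SYSTEM_PAGE_CEILING].min
  visited = Set.new
  url = start_url

  while url && visited.size < limit && request_budget.positive?
    break if visited.include?(url) # a page repeats

    visited << url
    request_budget -= 1 # each page spends one request
    url = pages[url]    # nil ends the loop: no next link
  end

  visited.to_a
end
```

The initial page counts toward `max_pages` here, matching the "including the initial page" wording in the hunk.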