When auto-sourcing isn't enough, you can write your own configuration files to create custom RSS feeds for any website. This guide shows you how to take full control with YAML configs.
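As a minimal sketch of the shape such a config takes (the URL and selectors are placeholders, not from a real site):

```yaml
channel:
  url: https://example.com/blog
selectors:
  items:
    selector: article
  title:
    selector: h2
  url:
    selector: a
    extractor: href
```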
html2rss supports many configuration options:
4. **Check the output:** Make sure all items have titles, links, and descriptions
### Useful CLI flags when a site is difficult
Some sites need a little more request budget than the defaults.
- Use `--max-redirects` when the site bounces through several canonicalization or tracking redirects before the real page loads.
- Use `--max-requests` when your config needs more than one request, for example pagination or other follow-up fetches.
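For example (the config filename and flag values are illustrative):

```shell
html2rss feed my-feed.yml --max-redirects 5 --max-requests 6
```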
**src/content/docs/ruby-gem/how-to/advanced-features.mdx** (+2 −8)
This guide covers advanced features and performance optimizations for html2rss.
## Parallel Processing
html2rss uses parallel processing in auto-source discovery. This happens automatically and doesn't require any configuration.
**src/content/docs/ruby-gem/how-to/custom-http-requests.mdx** (+41 −5)
title: "Custom HTTP Requests"
description: "Learn how to customize HTTP requests with custom headers, authentication, and API interactions for html2rss."
---
import Code from "astro/components/Code.astro";

Some sites only work when requests carry the headers, tokens, or cookies your browser uses. `html2rss` supports those cases without changing the rest of your feed workflow.
Keep this structure in mind:
- `headers` stays top-level
- `strategy` stays top-level
- request-specific controls such as budgets and Browserless options live under `request`
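A sketch of that layout (the header name and all values are placeholders):

```yaml
strategy: browserless
headers:
  User-Agent: "Mozilla/5.0 (compatible; html2rss)"
request:
  max_redirects: 5
  browserless:
    preload:
      wait_for_network_idle:
        timeout_ms: 3000
channel:
  url: https://example.com
selectors:
  items:
    selector: article
  title:
    selector: h2
```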
## When You Need Custom Headers
You might need custom HTTP requests when:
Add a `headers` section to your feed configuration. This example is a complete, valid config:
**src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx** (+73 −0)
title: Handling Dynamic Content
description: "Learn how to handle JavaScript-heavy websites and dynamic content with html2rss. Use browserless strategy for sites that load content dynamically."
---
5
5
6
+
import Code from "astro/components/Code.astro";
Some websites load their content dynamically using JavaScript. The default `html2rss` strategy might not see this content.
## Solution
Use the [`browserless` strategy](/ruby-gem/reference/strategy) to render JavaScript-heavy websites with a headless browser.
Keep the strategy at the top level and put request-specific options under `request`:
<Code
  code={`strategy: browserless
request:
  max_redirects: 5
  max_requests: 6
  browserless:
    preload:
      wait_for_network_idle:
        timeout_ms: 5000
channel:
  url: https://example.com/app
selectors:
  items:
    selector: .article
  title:
    selector: h2
  url:
    selector: a
    extractor: href`}
  lang="yaml"
/>
## When to Use Browserless
The `browserless` strategy is necessary when:
- **Infinite scroll** - Content loads as you scroll
- **Dynamic forms** - Content changes based on user interaction
## Preload Actions
For dynamic sites, rendering once is often not enough. Use `request.browserless.preload` to wait, click, or scroll before the HTML snapshot is taken.
### Wait for JavaScript Requests
```yaml
strategy: browserless
request:
  browserless:
    preload:
      wait_for_network_idle:
        timeout_ms: 4000
```
### Click "Load More" Buttons
```yaml
strategy: browserless
request:
  browserless:
    preload:
      click_selectors:
        - selector: ".load-more"
          max_clicks: 3
          delay_ms: 250
      wait_for_network_idle:
        timeout_ms: 3000
```
### Scroll Infinite Lists
```yaml
strategy: browserless
request:
  browserless:
    preload:
      scroll_down:
        iterations: 5
        delay_ms: 200
      wait_for_network_idle:
        timeout_ms: 2500
```
These preload steps can be combined in a single config when a site needs several interactions before all items appear.
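A combined sketch, assuming the preload keys shown above can sit side by side (selector and timing values are illustrative):

```yaml
strategy: browserless
request:
  browserless:
    preload:
      click_selectors:
        - selector: ".load-more"
          max_clicks: 2
          delay_ms: 250
      scroll_down:
        iterations: 3
        delay_ms: 200
      wait_for_network_idle:
        timeout_ms: 3000
```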
## Performance Considerations
The `browserless` strategy is slower than the default `faraday` strategy because it:
1. **`wordpress_api`:** Detects the `<link rel="https://api.w.org/">` tag used by WordPress and pulls posts from the REST API without parsing article HTML. See [WordPress API](/ruby-gem/reference/wordpress-api/).
3. **`semantic_html`:** Searches for semantic HTML5 tags like `<article>`, `<main>`, and `<section>`.
4. **`html`:** Analyzes the HTML structure to find frequently occurring selectors that are likely to contain the main content.
5. **`json_state`:** Single-page applications often stash pre-rendered article data in `<script type="application/json">` tags or global variables such as `window.__NEXT_DATA__`, `window.__NUXT__`, or `window.STATE`. The JSON-state scraper walks those blobs, finds arrays with `title`/`url` pairs, and converts them into the same hashes produced by `HtmlExtractor`.
**`json_state` Limitations:** the scraper requires discoverable arrays of hashes containing clear `title` and `url` fields. Minified or obfuscated state objects, heavily encoded values, or blobs that require executing embedded functions are ignored.
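To illustrate the idea, here is a simplified sketch (not the gem's actual scraper) of recursively walking a parsed state blob for arrays of title/url hashes:

```ruby
require "json"

# Simplified illustration of the json_state approach: recursively walk a
# parsed state object and collect arrays of hashes exposing "title" and "url".
def find_article_arrays(node, found = [])
  case node
  when Array
    if !node.empty? && node.all? { |e| e.is_a?(Hash) && e.key?("title") && e.key?("url") }
      found << node
    else
      node.each { |e| find_article_arrays(e, found) }
    end
  when Hash
    node.each_value { |v| find_article_arrays(v, found) }
  end
  found
end

# Example blob shaped like window.__NEXT_DATA__ content (made up for the demo).
state = JSON.parse('{"props":{"posts":[{"title":"Hello","url":"/hello"}],"ui":{"theme":"dark"}}}')
articles = find_article_arrays(state).flatten(1)
```

The real scraper applies the same discovery principle to script tags and global variables found in the rendered page.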
**`wordpress_api` Limitations:** this scraper depends on the page exposing a public WordPress REST API root. The current implementation fetches post records directly, but it does not yet resolve category names or featured media metadata.
## Fine-Tuning
You can customize `auto_source` to improve its accuracy.
**src/content/docs/ruby-gem/reference/selectors.mdx** (+4 −2)
Behavior:
- `max_pages` is the total page budget for the item selector chain, including the initial page.
- `max_pages` is capped by the system request ceiling of 10 pages per feed build.
- Pagination follows strict `link[rel~="next"]` or `a[rel~="next"]` targets only.
- Follow-up pages use the current page's effective origin after redirects.
- Pagination stops when there is no next link, a page repeats, or the shared request budget is exhausted.
- The same request safeguards apply to pagination and Browserless navigation, including timeout limits, redirect limits, response-size guards, and private-network denial.
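A sketch, assuming `max_pages` sits on the items selector as this reference implies (selector names and the value are illustrative):

```yaml
selectors:
  items:
    selector: article
    max_pages: 3
  title:
    selector: h2
```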
Post-processors manipulate the extracted value.
- `html_to_markdown`: Converts HTML to Markdown.
- `markdown_to_html`: Converts Markdown to HTML.
- `parse_time`: Parses a string into a `Time` object.
- `parse_uri`: Resolves a relative URL against `channel.url` and returns the normalized URL string.
- `sanitize_html`: Sanitizes HTML to prevent security vulnerabilities.
- `substring`: Extracts a substring from a string.
- `template`: Creates a new string from a template and other selector values. Use `%{self}` for the current selector value.
> Always use the `sanitize_html` post-processor for any HTML content to prevent security risks.
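A sketch combining `template` and `sanitize_html` (selector names assume the conventional `post_process` array form; the `.byline` and `.summary` selectors are placeholders):

```yaml
selectors:
  author:
    selector: .byline
  title:
    selector: h2
    post_process:
      - name: template
        string: "%{self} by %{author}"
  description:
    selector: .summary
    post_process:
      - name: sanitize_html
```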