Merged
23 changes: 22 additions & 1 deletion src/content/docs/ruby-gem/how-to/advanced-features.mdx
@@ -35,12 +35,16 @@ html2rss is designed to be memory-efficient:
For websites with many items:

```yaml
# Use specific selectors to limit items
channel:
url: "https://example.com/articles"
selectors:
items:
selector: ".article:not(.advertisement)" # Exclude ads
title:
selector: "h2" # More specific than generic selectors
url:
selector: "a"
extractor: "href"
```

## Error Recovery
@@ -59,6 +63,16 @@ Optimize requests with appropriate headers:
headers:
Accept: "text/html,application/xhtml+xml" # Avoid JSON if not needed
Accept-Encoding: "gzip, deflate" # Enable compression
channel:
url: "https://example.com/articles"
selectors:
items:
selector: "article"
title:
selector: "h2"
url:
selector: "a"
extractor: "href"
```

## Monitoring and Debugging
@@ -98,13 +112,20 @@ Invalid articles are automatically filtered out to prevent empty or broken feed
You can add custom validation by using post-processors:

```yaml
channel:
url: "https://example.com/articles"
selectors:
items:
selector: "article"
title:
selector: "h2"
post_process:
- name: "gsub"
pattern: "^\\s*$"
replacement: "Untitled"
url:
selector: "a"
extractor: "href"
```
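
To see what this `gsub` rule does to a title, the substitution can be reproduced in plain Ruby. This is a minimal sketch, assuming html2rss compiles the `pattern` string into a Ruby `Regexp`; `apply_title_fallback` is a hypothetical helper and not part of the gem.

```ruby
# Assumption: the YAML string "^\\s*$" ends up as the Ruby regexp /^\s*$/.
TITLE_PATTERN = Regexp.new("^\\s*$")
TITLE_REPLACEMENT = "Untitled"

# Hypothetical helper that mimics the gsub post-processor for one title.
def apply_title_fallback(title)
  title.gsub(TITLE_PATTERN, TITLE_REPLACEMENT)
end

# Empty and whitespace-only titles are replaced; other titles are untouched.
puts apply_title_fallback("")
puts apply_title_fallback("   ")
puts apply_title_fallback("Release notes")
```

Empty and whitespace-only titles become `Untitled`, while non-blank titles pass through unchanged.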

## Best Practices
84 changes: 75 additions & 9 deletions src/content/docs/ruby-gem/how-to/custom-http-requests.mdx
@@ -3,7 +3,7 @@ title: "Custom HTTP Requests"
description: "Learn how to customize HTTP requests with custom headers, authentication, and API interactions for html2rss."
---

Some websites require custom HTTP headers, authentication, or specific request configurations to access their content. html2rss makes it easy to customize your requests to handle these scenarios.
Some websites require custom HTTP headers, authentication, or other request settings to access their content. `html2rss` lets you customize requests for those cases.

## When You Need Custom Headers

@@ -17,7 +17,7 @@ You might need custom HTTP requests when:

## Basic Configuration

Add a `headers` section to your feed configuration:
Add a `headers` section to your feed configuration. This example is a complete, valid config:

```yaml
headers:
@@ -28,9 +28,11 @@ channel:
url: https://api.example.com/posts
selectors:
items:
selector: ".post"
selector: "array > object"
title:
selector: "h2"
selector: "title"
url:
selector: "url"
```

## Common Use Cases
@@ -43,6 +45,15 @@ Many APIs require authentication tokens:
headers:
Authorization: "Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."
X-API-Key: "your-api-key-here"
channel:
url: "https://api.example.com/posts"
selectors:
items:
selector: "array > object"
title:
selector: "title"
url:
selector: "url"
```

### User Agent Spoofing
@@ -55,6 +66,16 @@ headers:
Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
Accept-Language: "en-US,en;q=0.5"
Accept-Encoding: "gzip, deflate"
channel:
url: "https://example.com/articles"
selectors:
items:
selector: "article"
title:
selector: "h2"
url:
selector: "a"
extractor: "href"
```

### Content Type Negotiation
@@ -63,9 +84,16 @@ Request specific content types:

```yaml
headers:
Accept: "application/json" # For JSON APIs
Accept: "text/html" # For HTML content
Accept: "application/rss+xml" # For RSS feeds
Accept: "application/json"
channel:
url: "https://api.example.com/posts"
selectors:
items:
selector: "array > object"
title:
selector: "title"
url:
selector: "url"
```

### Custom API Headers
@@ -77,6 +105,15 @@ headers:
X-Requested-With: "XMLHttpRequest"
X-Custom-Header: "your-value"
Content-Type: "application/json"
channel:
url: "https://api.example.com/posts"
selectors:
items:
selector: "array > object"
title:
selector: "title"
url:
selector: "url"
```

## Dynamic Headers
@@ -85,12 +122,27 @@ You can use dynamic parameters in headers for runtime values:

```yaml
headers:
Authorization: "Bearer {{api_token}}"
X-User-ID: "{{user_id}}"
Authorization: "Bearer %<api_token>s"
X-User-ID: "%<user_id>s"
channel:
url: "https://api.example.com/users/%<user_id>s/posts"
selectors:
items:
selector: "array > object"
title:
selector: "title"
url:
selector: "url"
```
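
The `%<name>s` placeholders use the same syntax as Ruby's named format references, so the substitution can be previewed with `Kernel#format`. This is a sketch under the assumption that html2rss resolves them the same way; `resolve_template` and the sample values are hypothetical.

```ruby
# Header and URL templates copied from the YAML example above.
HEADER_TEMPLATES = {
  "Authorization" => "Bearer %<api_token>s",
  "X-User-ID" => "%<user_id>s"
}.freeze
POSTS_URL_TEMPLATE = "https://api.example.com/users/%<user_id>s/posts"

# Hypothetical helper: fills named references from a params hash.
def resolve_template(template, params)
  format(template, params)
end

# Placeholder values only; real tokens should come from the caller at runtime.
params = { api_token: "example-token", user_id: 42 }

HEADER_TEMPLATES.each do |name, template|
  puts "#{name}: #{resolve_template(template, params)}"
end
puts resolve_template(POSTS_URL_TEMPLATE, params)
```

The same parameter can appear in both a header and the URL, as `user_id` does here, and each occurrence receives the same value.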

See our [Dynamic Parameters guide](/ruby-gem/how-to/dynamic-parameters) for more details.

## Notes

- Header examples that target third-party APIs are illustrative. Authentication requirements, header names, and response shapes can change independently of `html2rss`.
- For JSON APIs, validate the response structure before assuming selectors like `array > object` or `html_url` will match.
- If you document or share a config for reuse, prefer placeholder values and parameterized headers over embedding real tokens.
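
One way to follow the second note is to check a saved API response with Ruby's standard `json` library before writing selectors. The sketch below assumes the response is a top-level array of objects, which is the shape `array > object` targets; `missing_keys` and the sample body are hypothetical.

```ruby
require "json"

# Hypothetical sample response; a real API may return a different shape.
SAMPLE_BODY = <<~JSON
  [
    {"title": "First issue", "html_url": "https://example.com/issues/1"},
    {"title": "Second issue"}
  ]
JSON

# Hypothetical helper: maps item index => required keys that item lacks.
def missing_keys(body, required_keys)
  data = JSON.parse(body)
  raise ArgumentError, "expected a top-level JSON array" unless data.is_a?(Array)

  report = {}
  data.each_with_index do |item, index|
    absent = item.is_a?(Hash) ? required_keys - item.keys : required_keys
    report[index] = absent unless absent.empty?
  end
  report
end

p missing_keys(SAMPLE_BODY, %w[title html_url])
```

An empty result suggests every item carries the keys the selectors expect; a non-empty one points at the items that would produce incomplete entries.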

## Testing Your Headers

Test your configuration to ensure headers work correctly:
@@ -130,6 +182,13 @@ headers:
User-Agent: "html2rss/1.0"
channel:
url: https://api.github.com/repos/owner/repo/issues
selectors:
items:
selector: "array > object"
title:
selector: "title"
url:
selector: "html_url"
```

### Reddit API
@@ -140,6 +199,13 @@ headers:
Accept: "application/json"
channel:
url: https://www.reddit.com/r/programming.json
selectors:
items:
selector: "data > children > object > data"
title:
selector: "title"
url:
selector: "url"
```

## Related Topics
18 changes: 16 additions & 2 deletions src/content/docs/ruby-gem/how-to/dynamic-parameters.mdx
@@ -3,17 +3,25 @@ title: Dynamic Parameters
description: "Learn how to use dynamic parameters in URLs and headers for creating reusable feed configurations. Pass runtime values to customize feeds."
---

For websites with similar structures but varying content based on a parameter in the URL or headers, you can use dynamic parameters.
Use dynamic parameters when websites share the same structure but vary by URL or header values.

## Solution

You can add dynamic parameters to the `channel` and `headers` values. This is useful for creating feeds from structurally similar pages with different URLs.

```yaml
channel:
url: "http://domainname.tld/whatever/%<id>s.html"
url: "https://domainname.tld/whatever/%<id>s.html"
headers:
X-Something: "%<foo>s"
selectors:
items:
selector: "article"
title:
selector: "h2"
url:
selector: "a"
extractor: "href"
```

You can then pass the values for these parameters when you run `html2rss`:
@@ -30,6 +38,12 @@ html2rss feed the_feed_config.yml --params id:42 foo:bar
- You provide the actual values for these parameters at runtime using the `--params` option.
- This allows you to reuse the same feed configuration for multiple similar pages or APIs.

## Notes

- Dynamic substitution applies to `channel` and `headers`. Selector definitions are not parameterized by this feature.
- If a config references `%<param>s` and you do not provide a value, feed generation fails unless the caller supplies a fallback.
- For shared config repositories such as `html2rss-configs`, it is common to store default parameter values alongside the config so examples, validation, and tests have concrete inputs.
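
The failure mode and the role of stored defaults can be illustrated with `Kernel#format`, which raises `KeyError` when a named reference has no value. This is a sketch assuming html2rss behaves comparably; `build_url` and the default value are hypothetical.

```ruby
# URL template copied from the config example in this guide.
FEED_URL_TEMPLATE = "https://domainname.tld/whatever/%<id>s.html"

# Hypothetical helper: runtime params override stored defaults.
def build_url(template, params, defaults = {})
  format(template, defaults.merge(params))
end

# Without a value for :id, format raises KeyError.
begin
  build_url(FEED_URL_TEMPLATE, {})
rescue KeyError => e
  warn "missing parameter: #{e.message}"
end

# With a stored default, the template resolves even when no params are passed.
puts build_url(FEED_URL_TEMPLATE, {}, { id: "example" })

# A runtime value takes precedence over the default.
puts build_url(FEED_URL_TEMPLATE, { id: 42 }, { id: "example" })
```

Merging defaults first and runtime params second keeps shared configs testable while still letting callers override every value.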

## Related Topics

- **[Custom HTTP Requests](/ruby-gem/how-to/custom-http-requests/)** - Using dynamic parameters in headers
16 changes: 14 additions & 2 deletions src/content/docs/ruby-gem/how-to/managing-feed-configs.mdx
@@ -21,12 +21,24 @@ feeds:
channel:
url: "https://example.com/blog"
selectors:
# ...
items:
selector: ".post"
title:
selector: "h2"
url:
selector: "a"
extractor: "href"
my-second-feed:
channel:
url: "https://example.com/news"
selectors:
# ...
items:
selector: ".news-item"
title:
selector: "h2"
url:
selector: "a"
extractor: "href"
```
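
To check that a multi-feed file is structured as intended, it can be loaded with Ruby's standard `yaml` library and each feed's URL listed. This is a sketch only: the first feed's name (`my-first-feed`) is assumed because this hunk does not show it, the indentation is assumed from the usual `feeds` > feed name > `channel`/`selectors` nesting, and `feed_urls` is a hypothetical helper.

```ruby
require "yaml"

# Illustrative multi-feed config; feed names and nesting are assumptions.
FEEDS_CONFIG = <<~YAML
  feeds:
    my-first-feed:
      channel:
        url: "https://example.com/blog"
      selectors:
        items:
          selector: ".post"
        title:
          selector: "h2"
        url:
          selector: "a"
          extractor: "href"
    my-second-feed:
      channel:
        url: "https://example.com/news"
      selectors:
        items:
          selector: ".news-item"
        title:
          selector: "h2"
        url:
          selector: "a"
          extractor: "href"
YAML

# Hypothetical helper: maps each feed name to its channel URL.
def feed_urls(yaml_text)
  YAML.safe_load(yaml_text).fetch("feeds").transform_values do |feed|
    feed.dig("channel", "url")
  end
end

feed_urls(FEEDS_CONFIG).each { |name, url| puts "#{name}: #{url}" }
```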

## Building Feeds from a YAML File
12 changes: 8 additions & 4 deletions src/content/docs/ruby-gem/how-to/scraping-json.mdx
@@ -68,10 +68,12 @@ Html2rss.feed(
Accept: 'application/json'
},
channel: {
url: 'http://domainname.tld/whatever.json'
url: 'https://domainname.tld/whatever.json'
},
selectors: {
title: { selector: 'foo' }
items: { selector: 'array > object' },
title: { selector: 'title' },
url: { selector: 'url' }
}
)
```
@@ -82,10 +84,12 @@ Html2rss.feed(
headers:
Accept: application/json
channel:
url: "http://domainname.tld/whatever.json"
url: "https://domainname.tld/whatever.json"
selectors:
items:
selector: "array > object"
title:
selector: "foo"
selector: "title"
url:
selector: "url"
```
4 changes: 4 additions & 0 deletions src/content/docs/ruby-gem/reference/auto-source.mdx
@@ -36,6 +36,8 @@ You can customize `auto_source` to improve its accuracy.
Enable or disable specific scrapers and adjust their settings:

```yaml
channel:
url: https://example.com
auto_source:
scraper:
schema:
@@ -55,6 +57,8 @@ auto_source:
Remove unwanted items from the results:

```yaml
channel:
url: https://example.com
auto_source:
cleanup:
keep_different_domain: false # default: true
20 changes: 18 additions & 2 deletions src/content/docs/ruby-gem/reference/channel.mdx
@@ -3,7 +3,9 @@ title: Channel
description: "Learn about the channel configuration block for RSS feed metadata. Configure feed title, description, author, and other RSS channel properties."
---

The `channel` configuration block defines the metadata for your RSS feed.
The `channel` configuration block defines your feed metadata.

This example is a complete feed config so you can see the `channel` block in context:

```yaml
channel:
@@ -12,8 +14,16 @@ channel:
description: "A feed of the latest news from Example.com"
author: "jane.doe@example.com (Jane Doe)"
ttl: 60
language: "en-us"
language: "en"
time_zone: "Europe/Berlin"
selectors:
items:
selector: "article"
title:
selector: "h2"
url:
selector: "a"
extractor: "href"
```

## Options
@@ -28,6 +38,12 @@ channel:
| `language` | Optional | The language of the feed. Defaults to the `lang` attribute of the `<html>` tag. |
| `time_zone` | Optional | The time zone for parsing dates. See the [list of tz database time zones](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones). |

## Notes

- `language` is runtime-validated. Use a valid language code such as `en`, not an arbitrary string.
- `author` should follow the RSS-style `email (Name)` format when you set it explicitly.
- `time_zone` must be a known TZ database identifier such as `UTC` or `Europe/Berlin`.
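
A quick pre-check for the `author` format could look like the sketch below. The regular expression is an assumption about what "`email (Name)`" means in practice and is not the validation html2rss performs; `rss_author?` is a hypothetical helper.

```ruby
# Assumed shape: local@domain, one space, then a name in parentheses.
AUTHOR_FORMAT = /\A[^\s@]+@[^\s@]+ \([^()]+\)\z/

# Hypothetical helper: true when the value looks like "email (Name)".
def rss_author?(value)
  AUTHOR_FORMAT.match?(value)
end

puts rss_author?("jane.doe@example.com (Jane Doe)")
puts rss_author?("Jane Doe")
```

The check is deliberately loose about the email part; it only catches values that omit the address or the parenthesized name.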

---

For detailed documentation on the Ruby API, see the [official YARD documentation](https://www.rubydoc.info/gems/html2rss).