From 6043ee61d7f4c9dfbf38ae30ee9e5d2dee82b34a Mon Sep 17 00:00:00 2001 From: Gil Desmarais Date: Tue, 10 Mar 2026 21:50:09 +0100 Subject: [PATCH 1/3] docs(ruby-gem): align reference pages with current config surface --- .../docs/ruby-gem/reference/auto-source.mdx | 6 +- .../docs/ruby-gem/reference/channel.mdx | 20 ++++- .../docs/ruby-gem/reference/cli-reference.mdx | 82 ++++++++++++++----- .../docs/ruby-gem/reference/headers.mdx | 11 ++- .../docs/ruby-gem/reference/selectors.mdx | 26 ++++++ .../docs/ruby-gem/reference/strategy.mdx | 21 +++-- .../docs/ruby-gem/reference/stylesheets.mdx | 12 ++- 7 files changed, 148 insertions(+), 30 deletions(-) diff --git a/src/content/docs/ruby-gem/reference/auto-source.mdx b/src/content/docs/ruby-gem/reference/auto-source.mdx index ee0f2bd8..17146b03 100644 --- a/src/content/docs/ruby-gem/reference/auto-source.mdx +++ b/src/content/docs/ruby-gem/reference/auto-source.mdx @@ -33,9 +33,11 @@ You can customize `auto_source` to improve its accuracy. ### Scraper Options -Enable or disable specific scrapers and adjust their settings: +Enable or disable specific scrapers and adjust their settings in a complete feed config: ```yaml +channel: + url: https://example.com auto_source: scraper: schema: @@ -55,6 +57,8 @@ auto_source: Remove unwanted items from the results: ```yaml +channel: + url: https://example.com auto_source: cleanup: keep_different_domain: false # default: true diff --git a/src/content/docs/ruby-gem/reference/channel.mdx b/src/content/docs/ruby-gem/reference/channel.mdx index 73f2c06b..ccc12948 100644 --- a/src/content/docs/ruby-gem/reference/channel.mdx +++ b/src/content/docs/ruby-gem/reference/channel.mdx @@ -3,7 +3,9 @@ title: Channel description: "Learn about the channel configuration block for RSS feed metadata. Configure feed title, description, author, and other RSS channel properties." --- -The `channel` configuration block defines the metadata for your RSS feed. 
+The `channel` configuration block defines your feed metadata. + +This example is a complete feed config so you can see the `channel` block in context: ```yaml channel: @@ -12,8 +14,16 @@ channel: description: "A feed of the latest news from Example.com" author: "jane.doe@example.com (Jane Doe)" ttl: 60 - language: "en-us" + language: "en" time_zone: "Europe/Berlin" +selectors: + items: + selector: "article" + title: + selector: "h2" + url: + selector: "a" + extractor: "href" ``` ## Options @@ -28,6 +38,12 @@ channel: | `language` | Optional | The language of the feed. Defaults to the `lang` attribute of the `` tag. | | `time_zone` | Optional | The time zone for parsing dates. See the [list of tz database time zones](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones). | +## Notes + +- `language` is runtime-validated. Use a valid language code such as `en`, not an arbitrary string. +- `author` should follow the RSS-style `email (Name)` format when you set it explicitly. +- `time_zone` must be a known TZ database identifier such as `UTC` or `Europe/Berlin`. + --- For detailed documentation on the Ruby API, see the [official YARD documentation](https://www.rubydoc.info/gems/html2rss). diff --git a/src/content/docs/ruby-gem/reference/cli-reference.mdx b/src/content/docs/ruby-gem/reference/cli-reference.mdx index b5d84ad3..fb74e3a9 100644 --- a/src/content/docs/ruby-gem/reference/cli-reference.mdx +++ b/src/content/docs/ruby-gem/reference/cli-reference.mdx @@ -3,50 +3,92 @@ title: CLI Reference description: Complete reference for the html2rss command-line interface --- -This section provides a reference for the `html2rss` command-line interface (CLI). +This page documents the `html2rss` command-line interface (CLI). For detailed documentation on the Ruby API, please refer to the official YARD documentation. 
[**📚 View the Ruby API Docs on rubydoc.info**](https://www.rubydoc.info/gems/html2rss) ---- +## Commands + +The `html2rss` executable is the primary way to interact with the gem from your terminal. + +### Auto -### Command-Line Interface (CLI) +Automatically discovers items from a page and prints the generated RSS feed to stdout. + +```bash +# Use the default faraday strategy +html2rss auto https://example.com/articles -The `html2rss` executable provides the primary way to interact with the tool from your terminal. +# Force browserless for JavaScript-heavy pages +html2rss auto https://example.com/app --strategy browserless -#### `html2rss auto ` +# Hint the item selector while keeping auto enhancement +html2rss auto https://example.com/articles --items_selector ".post-card" +``` -Automatically generates an RSS feed from the provided URL. +Command: `html2rss auto URL` -- `` (Required): The URL of the website to generate a feed from. +### Feed -**Example:** +Loads a YAML config, builds the feed, and prints the RSS XML to stdout. ```bash -html2rss auto https://unmatchedstyle.com/ +# Single-feed config +html2rss feed single.yml + +# Multi-feed config under the `feeds:` key +html2rss feed feeds.yml my-first-feed + +# Override the request strategy at runtime +html2rss feed single.yml --strategy browserless + +# Pass dynamic parameters into %s placeholders +html2rss feed single.yml --params id:42 foo:bar ``` -#### `html2rss feed ` +Command: `html2rss feed YAML_FILE [feed_name]` -Generates an RSS feed based on the provided YAML configuration file. +### Schema -- `` (Required): Path to your YAML configuration file. +Prints the exported JSON Schema for the current gem version. 
-**Examples:** +```bash +# Pretty-printed JSON (default) +html2rss schema + +# Compact JSON +html2rss schema --no-pretty + +# Write the schema to a file +html2rss schema --write tmp/html2rss-config.schema.json +``` + +Command: `html2rss schema` + +### Validate + +Validates a config with the runtime validator without generating a feed. ```bash -# Generate and print to console -html2rss feed my_feed.yml +# Validate a single-feed file +html2rss validate single.yml -# Generate and save to an XML file -html2rss feed my_feed.yml > my_feed.xml +# Validate one feed from a multi-feed file +html2rss validate feeds.yml my-first-feed ``` -#### `html2rss help` +Command: `html2rss validate YAML_FILE [feed_name]` + +### Help Displays the help message with available commands and options. -#### `html2rss --version` +Command: `html2rss help` + +### Version + +Displays the installed version of `html2rss`. -Displays the currently installed version of `html2rss`. +Command: `html2rss --version` diff --git a/src/content/docs/ruby-gem/reference/headers.mdx b/src/content/docs/ruby-gem/reference/headers.mdx index 58de18a2..c2ff34ed 100644 --- a/src/content/docs/ruby-gem/reference/headers.mdx +++ b/src/content/docs/ruby-gem/reference/headers.mdx @@ -7,13 +7,22 @@ The `headers` key allows you to set custom HTTP headers for your requests. This ## Configuration -You can add any number of headers to your configuration: +You can add any number of headers to your configuration. 
This example is a complete, valid feed config: ```yaml headers: User-Agent: "Mozilla/5.0 (compatible; html2rss/1.0)" Authorization: "Bearer YOUR_TOKEN" Accept: "application/json" +channel: + url: "https://api.example.com/posts" +selectors: + items: + selector: "array > object" + title: + selector: "title" + url: + selector: "url" ``` ## Dynamic Parameters diff --git a/src/content/docs/ruby-gem/reference/selectors.mdx b/src/content/docs/ruby-gem/reference/selectors.mdx index db68d043..2c50a8d5 100644 --- a/src/content/docs/ruby-gem/reference/selectors.mdx +++ b/src/content/docs/ruby-gem/reference/selectors.mdx @@ -48,6 +48,32 @@ Available options: - `"reverse"`: Reverses the order of items (useful when the website shows oldest items first) - Default: Items appear in the order they are found on the page +## Paginated Feeds + +`html2rss` can follow a single `rel="next"` pagination chain when you configure `selectors.items.pagination.max_pages`. + +```yml +channel: + url: "https://example.com/news" +selectors: + items: + selector: "article" + pagination: + max_pages: 3 + title: + selector: "h1" + url: + selector: "a" + extractor: "href" +``` + +Behavior: + +- `max_pages` is the total page budget for the item selector chain, including the initial page. +- Pagination follows strict `link[rel~="next"]` or `a[rel~="next"]` targets only. +- Pagination stops when there is no next link, a page repeats, or the shared request budget is exhausted. +- The same request safeguards apply to pagination and Browserless navigation, including timeout limits, redirect limits, response-size guards, and private-network denial. 
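
The stopping rules in the list above can be sketched in plain Ruby. This is only an illustration under stated assumptions, not the gem's implementation: `PAGES` and `collect_items` are hypothetical names, the "site" is an in-memory hash instead of real HTTP responses, and the shared request budget and network safeguards are not modeled.

```ruby
# Hypothetical in-memory site (assumption): URL => items on that page plus
# the target of its rel="next" link, or nil when there is no next link.
PAGES = {
  "https://example.com/news" =>
    { items: %w[a b], next_url: "https://example.com/news?page=2" },
  "https://example.com/news?page=2" =>
    { items: %w[c], next_url: "https://example.com/news?page=3" },
  # The last page links back to page 2 to simulate a repeating chain.
  "https://example.com/news?page=3" =>
    { items: %w[d], next_url: "https://example.com/news?page=2" }
}.freeze

# Follows the next-link chain from start_url and returns all collected items.
# max_pages counts the initial page, mirroring the documented page budget.
def collect_items(start_url, max_pages:, pages: PAGES)
  visited = []
  items = []
  url = start_url

  while url && visited.size < max_pages
    break if visited.include?(url) # stop when a page repeats

    page = pages[url]
    break unless page # unknown URL: treat as a failed fetch and stop

    visited << url
    items.concat(page[:items])
    url = page[:next_url] # nil ends the loop: no next link
  end

  items
end
```

With `max_pages: 1` only the initial page is read. A larger budget does not loop forever, because the walk stops as soon as page 2 comes up a second time.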
+ ## RSS 2.0 Selectors While you can define any named selector, only the following are used in the final RSS feed: diff --git a/src/content/docs/ruby-gem/reference/strategy.mdx b/src/content/docs/ruby-gem/reference/strategy.mdx index e10cfd3a..01300178 100644 --- a/src/content/docs/ruby-gem/reference/strategy.mdx +++ b/src/content/docs/ruby-gem/reference/strategy.mdx @@ -27,10 +27,20 @@ docker run \ ### Configuration -Set the `strategy` to `browserless` in your feed configuration: +Set the `strategy` at the top level of your feed configuration: ```yml strategy: browserless +channel: + url: "https://example.com/app" +selectors: + items: + selector: ".article" + title: + selector: "h2" + url: + selector: "a" + extractor: "href" ``` ### Command-Line Usage @@ -39,11 +49,12 @@ You can also specify the strategy on the command line: ```sh # Set environment variables for your Browserless.io instance -BROWSERLESS_IO_WEBSOCKET_URL="ws://127.0.0.1:3000" -BROWSERLESS_IO_API_TOKEN="6R0W53R135510" +BROWSERLESS_IO_WEBSOCKET_URL="ws://127.0.0.1:3000" \ +BROWSERLESS_IO_API_TOKEN="6R0W53R135510" \ + html2rss feed my_config.yml --strategy browserless -# Use the browserless strategy -html2rss feed --strategy=browserless my_config.yml +# Or rely on the strategy stored in the YAML config +html2rss feed my_config.yml ``` --- diff --git a/src/content/docs/ruby-gem/reference/stylesheets.mdx b/src/content/docs/ruby-gem/reference/stylesheets.mdx index 6c62e91c..6110cf10 100644 --- a/src/content/docs/ruby-gem/reference/stylesheets.mdx +++ b/src/content/docs/ruby-gem/reference/stylesheets.mdx @@ -16,7 +16,7 @@ Styling your RSS feed provides several benefits: ## Configuration -You can add multiple stylesheets to your configuration: +You can add multiple stylesheets to a normal feed configuration: ```yaml stylesheets: @@ -26,6 +26,16 @@ stylesheets: - href: "https://example.com/rss.css" media: "all" type: "text/css" +channel: + url: "https://example.com/articles" +selectors: + items: + selector: 
"article" + title: + selector: "h2" + url: + selector: "a" + extractor: "href" ``` ## Further Reading From ae5a6f08378d494f3d3f2ab7ed0c42447d8b0e52 Mon Sep 17 00:00:00 2001 From: Gil Desmarais Date: Tue, 10 Mar 2026 21:51:05 +0100 Subject: [PATCH 2/3] docs(ruby-gem): make guides and tutorials use valid configs --- .../ruby-gem/how-to/advanced-features.mdx | 23 ++++- .../ruby-gem/how-to/custom-http-requests.mdx | 84 +++++++++++++++++-- .../ruby-gem/how-to/dynamic-parameters.mdx | 18 +++- .../ruby-gem/how-to/managing-feed-configs.mdx | 16 +++- .../docs/ruby-gem/how-to/scraping-json.mdx | 12 ++- .../ruby-gem/tutorials/your-first-feed.mdx | 24 +++--- 6 files changed, 148 insertions(+), 29 deletions(-) diff --git a/src/content/docs/ruby-gem/how-to/advanced-features.mdx b/src/content/docs/ruby-gem/how-to/advanced-features.mdx index 3be3a575..703bd9e9 100644 --- a/src/content/docs/ruby-gem/how-to/advanced-features.mdx +++ b/src/content/docs/ruby-gem/how-to/advanced-features.mdx @@ -35,12 +35,16 @@ html2rss is designed to be memory-efficient: For websites with many items: ```yaml -# Use specific selectors to limit items +channel: + url: "https://example.com/articles" selectors: items: selector: ".article:not(.advertisement)" # Exclude ads title: selector: "h2" # More specific than generic selectors + url: + selector: "a" + extractor: "href" ``` ## Error Recovery @@ -59,6 +63,16 @@ Optimize requests with appropriate headers: headers: Accept: "text/html,application/xhtml+xml" # Avoid JSON if not needed Accept-Encoding: "gzip, deflate" # Enable compression +channel: + url: "https://example.com/articles" +selectors: + items: + selector: "article" + title: + selector: "h2" + url: + selector: "a" + extractor: "href" ``` ## Monitoring and Debugging @@ -98,13 +112,20 @@ Invalid articles are automatically filtered out to prevent empty or broken feed You can add custom validation by using post-processors: ```yaml +channel: + url: "https://example.com/articles" selectors: + items: + 
selector: "article" title: selector: "h2" post_process: - name: "gsub" pattern: "^\\s*$" replacement: "Untitled" + url: + selector: "a" + extractor: "href" ``` ## Best Practices diff --git a/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx b/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx index 231f8860..33b6cca3 100644 --- a/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx +++ b/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx @@ -3,7 +3,7 @@ title: "Custom HTTP Requests" description: "Learn how to customize HTTP requests with custom headers, authentication, and API interactions for html2rss." --- -Some websites require custom HTTP headers, authentication, or specific request configurations to access their content. html2rss makes it easy to customize your requests to handle these scenarios. +Some websites require custom HTTP headers, authentication, or other request settings to access their content. `html2rss` lets you customize requests for those cases. ## When You Need Custom Headers @@ -17,7 +17,7 @@ You might need custom HTTP requests when: ## Basic Configuration -Add a `headers` section to your feed configuration: +Add a `headers` section to your feed configuration. This example is a complete, valid config: ```yaml headers: @@ -28,9 +28,11 @@ channel: url: https://api.example.com/posts selectors: items: - selector: ".post" + selector: "array > object" title: - selector: "h2" + selector: "title" + url: + selector: "url" ``` ## Common Use Cases @@ -43,6 +45,15 @@ Many APIs require authentication tokens: headers: Authorization: "Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..." 
X-API-Key: "your-api-key-here" +channel: + url: "https://api.example.com/posts" +selectors: + items: + selector: "array > object" + title: + selector: "title" + url: + selector: "url" ``` ### User Agent Spoofing @@ -55,6 +66,16 @@ headers: Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" Accept-Language: "en-US,en;q=0.5" Accept-Encoding: "gzip, deflate" +channel: + url: "https://example.com/articles" +selectors: + items: + selector: "article" + title: + selector: "h2" + url: + selector: "a" + extractor: "href" ``` ### Content Type Negotiation @@ -63,9 +84,16 @@ Request specific content types: ```yaml headers: - Accept: "application/json" # For JSON APIs - Accept: "text/html" # For HTML content - Accept: "application/rss+xml" # For RSS feeds + Accept: "application/json" +channel: + url: "https://api.example.com/posts" +selectors: + items: + selector: "array > object" + title: + selector: "title" + url: + selector: "url" ``` ### Custom API Headers @@ -77,6 +105,15 @@ headers: X-Requested-With: "XMLHttpRequest" X-Custom-Header: "your-value" Content-Type: "application/json" +channel: + url: "https://api.example.com/posts" +selectors: + items: + selector: "array > object" + title: + selector: "title" + url: + selector: "url" ``` ## Dynamic Headers @@ -85,12 +122,27 @@ You can use dynamic parameters in headers for runtime values: ```yaml headers: - Authorization: "Bearer {{api_token}}" - X-User-ID: "{{user_id}}" + Authorization: "Bearer %s" + X-User-ID: "%s" +channel: + url: "https://api.example.com/users/%s/posts" +selectors: + items: + selector: "array > object" + title: + selector: "title" + url: + selector: "url" ``` See our [Dynamic Parameters guide](/ruby-gem/how-to/dynamic-parameters) for more details. +## Notes + +- Header examples that target third-party APIs are illustrative. Authentication requirements, header names, and response shapes can change independently of `html2rss`. 
+- For JSON APIs, validate the response structure before assuming selectors like `array > object` or `html_url` will match. +- If you document or share a config for reuse, prefer placeholder values and parameterized headers over embedding real tokens. + ## Testing Your Headers Test your configuration to ensure headers work correctly: @@ -130,6 +182,13 @@ headers: User-Agent: "html2rss/1.0" channel: url: https://api.github.com/repos/owner/repo/issues +selectors: + items: + selector: "array > object" + title: + selector: "title" + url: + selector: "html_url" ``` ### Reddit API @@ -140,6 +199,13 @@ headers: Accept: "application/json" channel: url: https://www.reddit.com/r/programming.json +selectors: + items: + selector: "data > children > object > data" + title: + selector: "title" + url: + selector: "url" ``` ## Related Topics diff --git a/src/content/docs/ruby-gem/how-to/dynamic-parameters.mdx b/src/content/docs/ruby-gem/how-to/dynamic-parameters.mdx index a1a63d9a..ad74ac31 100644 --- a/src/content/docs/ruby-gem/how-to/dynamic-parameters.mdx +++ b/src/content/docs/ruby-gem/how-to/dynamic-parameters.mdx @@ -3,7 +3,7 @@ title: Dynamic Parameters description: "Learn how to use dynamic parameters in URLs and headers for creating reusable feed configurations. Pass runtime values to customize feeds." --- -For websites with similar structures but varying content based on a parameter in the URL or headers, you can use dynamic parameters. +Use dynamic parameters when websites share the same structure but vary by URL or header values. ## Solution @@ -11,9 +11,17 @@ You can add dynamic parameters to the `channel` and `headers` values. 
This is us ```yaml channel: - url: "http://domainname.tld/whatever/%s.html" + url: "https://domainname.tld/whatever/%s.html" headers: X-Something: "%s" +selectors: + items: + selector: "article" + title: + selector: "h2" + url: + selector: "a" + extractor: "href" ``` You can then pass the values for these parameters when you run `html2rss`: @@ -30,6 +38,12 @@ html2rss feed the_feed_config.yml --params id:42 foo:bar - You provide the actual values for these parameters at runtime using the `--params` option. - This allows you to reuse the same feed configuration for multiple similar pages or APIs. +## Notes + +- Dynamic substitution applies to `channel` and `headers`. Selector definitions are not parameterized by this feature. +- If a config references `%s` and you do not provide a value, feed generation fails unless the caller supplies a fallback. +- For shared config repositories such as `html2rss-configs`, it is common to store default parameter values alongside the config so examples, validation, and tests have concrete inputs. + ## Related Topics - **[Custom HTTP Requests](/ruby-gem/how-to/custom-http-requests/)** - Using dynamic parameters in headers diff --git a/src/content/docs/ruby-gem/how-to/managing-feed-configs.mdx b/src/content/docs/ruby-gem/how-to/managing-feed-configs.mdx index 61278ce5..89441f6a 100644 --- a/src/content/docs/ruby-gem/how-to/managing-feed-configs.mdx +++ b/src/content/docs/ruby-gem/how-to/managing-feed-configs.mdx @@ -21,12 +21,24 @@ feeds: channel: url: "https://example.com/blog" selectors: - # ... + items: + selector: ".post" + title: + selector: "h2" + url: + selector: "a" + extractor: "href" my-second-feed: channel: url: "https://example.com/news" selectors: - # ... 
+ items: + selector: ".news-item" + title: + selector: "h2" + url: + selector: "a" + extractor: "href" ``` ## Building Feeds from a YAML File diff --git a/src/content/docs/ruby-gem/how-to/scraping-json.mdx b/src/content/docs/ruby-gem/how-to/scraping-json.mdx index 758b75f5..34a88f2a 100644 --- a/src/content/docs/ruby-gem/how-to/scraping-json.mdx +++ b/src/content/docs/ruby-gem/how-to/scraping-json.mdx @@ -68,10 +68,12 @@ Html2rss.feed( Accept: 'application/json' }, channel: { - url: 'http://domainname.tld/whatever.json' + url: 'https://domainname.tld/whatever.json' }, selectors: { - title: { selector: 'foo' } + items: { selector: 'array > object' }, + title: { selector: 'title' }, + url: { selector: 'url' } } ) ``` @@ -82,10 +84,12 @@ Html2rss.feed( headers: Accept: application/json channel: - url: "http://domainname.tld/whatever.json" + url: "https://domainname.tld/whatever.json" selectors: items: selector: "array > object" title: - selector: "foo" + selector: "title" + url: + selector: "url" ``` diff --git a/src/content/docs/ruby-gem/tutorials/your-first-feed.mdx b/src/content/docs/ruby-gem/tutorials/your-first-feed.mdx index 55474896..da94010f 100644 --- a/src/content/docs/ruby-gem/tutorials/your-first-feed.mdx +++ b/src/content/docs/ruby-gem/tutorials/your-first-feed.mdx @@ -27,35 +27,37 @@ html2rss auto https://unmatchedstyle.com/ When you need to extract content with precision, the `selectors` scraper is the tool for the job. This method gives you complete control over what content is included in your feed by using CSS selectors. -Let's create a feed for Stack Overflow's "Hot Network Questions". +Let's create a feed for a simple article listing page. -1. **Create a file** named `stackoverflow.yml`. +1. **Create a file** named `example.yml`. 2. 
**Add the following content:** ```yaml channel: - url: https://stackoverflow.com/questions + url: https://example.com/articles selectors: items: - selector: "#hot-network-questions > ul > li" + selector: ".article-card" title: - selector: "a" + selector: "h2 a" url: - selector: "a" + selector: "h2 a" extractor: "href" + description: + selector: ".summary" ``` 3. **Run the `feed` command:** ```bash - html2rss feed stackoverflow.yml + html2rss feed example.yml ``` This configuration tells `html2rss`: -- The main container for all items is `