---
title: CLI Reference
description: Complete reference for the html2rss command-line interface
---

import { Code } from "@astrojs/starlight/components";

This page documents the html2rss command-line interface (CLI).

For detailed documentation on the Ruby API, please refer to the official YARD documentation.

📚 View the Ruby API Docs on rubydoc.info

## Commands

The html2rss executable is the primary way to interact with the gem from your terminal.

### Auto

Automatically discovers items from a page and prints the generated RSS feed to stdout.

<Code code={`html2rss auto https://example.com/articles

html2rss auto https://example.com/app --strategy browserless --max-redirects 5 --max-requests 6

BOTASAURUS_SCRAPER_URL="http://localhost:4010" \\
  html2rss auto https://example.com/protected --strategy botasaurus

html2rss auto https://example.com/articles --items_selector ".post-card"`} lang="bash" />

Command: html2rss auto URL

The default is `--strategy auto`, which tries `faraday`, then `botasaurus`, then `browserless`.

#### URL Surface Guidance For auto

auto works best when the input URL already exposes a server-rendered list of entries.

- High-success surfaces:
  - newsroom or press listing pages
  - blog/category/tag listing pages
  - changelog/release notes/update listing pages
  - paginated archive/list views
- Low-success surfaces:
  - generic homepages with heavy promo/navigation chrome
  - search results pages
  - client-rendered app shells (`#app`, `#root`, `#__next`, etc.)

When possible, pass a direct listing/update URL instead of a top-level homepage or app entrypoint.
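The app-shell case can be spotted before running `auto` at all. The check below is a heuristic sketch, not part of html2rss: it greps fetched HTML for the common SPA mount-point ids listed above.

```bash
# Heuristic sketch (assumption, not an html2rss feature): flag pages whose
# markup is a client-rendered app shell rather than a server-rendered list.
looks_like_app_shell() {
  # Matches common SPA mount points: #app, #root, #__next
  printf '%s' "$1" | grep -Eq 'id="(app|root|__next)"'
}

spa_html='<body><div id="__next"></div></body>'
listing_html='<body><ul class="posts"><li><a href="/p/1">Post</a></li></ul></body>'

looks_like_app_shell "$spa_html" && echo "prefer a direct listing/update URL"
looks_like_app_shell "$listing_html" || echo "listing markup present"
```

In practice you would feed this the output of `curl -s URL`; if it fires, look for a listing or changelog URL on the same site instead.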

#### Failure Outcomes You Should Expect

When no extractable items are found, auto classifies likely causes instead of only returning a generic message:

- blocked surface likely (anti-bot or interstitial):
  - try a more specific public listing URL
- app-shell surface detected:
  - switch to a direct listing/update URL
- unsupported extraction surface for auto mode:
  - switch to listing/changelog/category URLs
  - use explicit selectors in a feed config

Known anti-bot interstitial responses (for example Cloudflare challenge pages) are surfaced explicitly as blocked-surface errors.

If all fallback tiers run but still extract zero items, html2rss raises:

- `No RSS feed items extracted after auto fallback ...`

If failures continue after URL/surface fixes, retry with an explicit browser-based override (--strategy browserless), or --strategy botasaurus when BOTASAURUS_SCRAPER_URL is configured.

Start by changing the input URL to a direct listing/update page, then move to explicit selectors if needed.
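The retry escalation described above can be sketched as a loop. This is an illustration only: `run_feed` is a hypothetical stand-in for `html2rss auto "$url" --strategy "$strategy"`, stubbed here so the sketch is self-contained.

```bash
# Fallback sketch (assumption): try strategies in order until one produces
# a feed. `run_feed` stands in for an html2rss invocation.
run_with_fallback() {
  for strategy in "$@"; do
    if run_feed "$strategy"; then
      echo "succeeded with --strategy $strategy"
      return 0
    fi
  done
  echo "all strategies failed" >&2
  return 1
}

# Demo stub: pretend only the browser-based strategy extracts items.
run_feed() { [ "$1" = "browserless" ]; }

run_with_fallback auto browserless
# → succeeded with --strategy browserless
```

With a real `run_feed`, the same loop would capture stdout to a file and stop at the first strategy that exits zero.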

#### Browserless Setup And Diagnostics (CLI)

browserless is an explicit override for CLI usage.

<Code code={`# 1) Start Browserless in the background
docker run -d --rm --name html2rss-browserless \\
  -p 3000:3000 \\
  -e "CONCURRENT=10" \\
  -e "TOKEN=6R0W53R135510" \\
  ghcr.io/browserless/chromium

# 2) Run html2rss against Browserless
BROWSERLESS_IO_WEBSOCKET_URL="ws://127.0.0.1:3000" \\
BROWSERLESS_IO_API_TOKEN="6R0W53R135510" \\
  html2rss auto https://example.com/updates --strategy browserless

# 3) Stop Browserless when done
docker stop html2rss-browserless`} lang="bash" />

If you see Browserless connection failed, check:

- `BROWSERLESS_IO_WEBSOCKET_URL` points to a reachable Browserless endpoint
- `BROWSERLESS_IO_API_TOKEN` matches the Browserless `TOKEN`
- the Browserless service is running and reachable from your shell environment

For custom Browserless endpoints, BROWSERLESS_IO_API_TOKEN is required.
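The first two checks can be run as a pre-flight step before invoking html2rss. This is a sketch under the assumptions above, not a built-in command:

```bash
# Pre-flight sketch (assumption): fail fast when the Browserless env vars
# are missing or the websocket URL has the wrong scheme.
check_browserless_env() {
  case "${BROWSERLESS_IO_WEBSOCKET_URL:-}" in
    ws://?*|wss://?*) ;;  # must be a ws:// or wss:// endpoint
    *) echo "BROWSERLESS_IO_WEBSOCKET_URL must be a ws:// or wss:// URL" >&2
       return 1 ;;
  esac
  [ -n "${BROWSERLESS_IO_API_TOKEN:-}" ] || {
    echo "BROWSERLESS_IO_API_TOKEN is required" >&2
    return 1
  }
}

export BROWSERLESS_IO_WEBSOCKET_URL="ws://127.0.0.1:3000"
export BROWSERLESS_IO_API_TOKEN="6R0W53R135510"
check_browserless_env && echo "Browserless env ok"
```

It cannot verify the third point (that the service is actually up); for that, run the `docker` commands above and retry.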

#### Botasaurus Environment Requirement (CLI)

botasaurus is an explicit override for CLI usage and requires BOTASAURUS_SCRAPER_URL:

<Code code={`BOTASAURUS_SCRAPER_URL="http://localhost:4010" \\
  html2rss auto https://example.com/updates --strategy botasaurus`} lang="bash" />

If you see a Botasaurus configuration error, check:

- `BOTASAURUS_SCRAPER_URL` is set
- `BOTASAURUS_SCRAPER_URL` is a valid URL
- the Botasaurus scrape API is reachable from the shell environment running html2rss

### Feed

Loads a YAML config, builds the feed, and prints the RSS XML to stdout.

<Code code={`html2rss feed single.yml

html2rss feed feeds.yml my-first-feed

html2rss feed single.yml --strategy auto

html2rss feed single.yml --strategy browserless

BOTASAURUS_SCRAPER_URL="http://localhost:4010" \\
  html2rss feed single.yml --strategy botasaurus

html2rss feed single.yml --max-redirects 5 --max-requests 6

html2rss feed single.yml --params id:42 foo:bar`} lang="bash" />

Command: html2rss feed YAML_FILE [feed_name]

The CLI keeps `strategy` as a top-level override and writes the runtime request limits (`--max-redirects`, `--max-requests`) into the generated config under `request`.
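For orientation, a minimal `single.yml` might look like the sketch below. The URL and selector values are illustrative, assuming the gem's standard `channel`/`selectors` config shape; consult the configuration reference for the authoritative schema.

```yaml
# Illustrative config sketch; values are placeholders.
channel:
  url: https://example.com/articles
  title: Example Articles
selectors:
  items:
    selector: ".post-card"
  title:
    selector: "h2"
  link:
    selector: "a"
    extractor: "href"
```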

### Schema

Prints the exported JSON Schema for the current gem version.

<Code code={`html2rss schema

html2rss schema --no-pretty

html2rss schema --write tmp/html2rss-config.schema.json`} lang="bash" />

Command: html2rss schema

### Validate

Validates a config with the runtime validator without generating a feed.

<Code code={`html2rss validate single.yml

html2rss validate feeds.yml my-first-feed`} lang="bash" />

Command: html2rss validate YAML_FILE [feed_name]

### Help

Displays the help message with available commands and options.

Command: html2rss help

### Version

Displays the installed version of html2rss.

Command: html2rss --version