title Troubleshooting
description This guide provides solutions to common issues encountered when using html2rss.

import { Code } from "@astrojs/starlight/components";

This guide provides solutions to common issues encountered when using html2rss.

Essential Tools

Your browser's developer tools are essential for troubleshooting. Use them to inspect the HTML structure of a webpage and find the correct CSS selectors.

  • To open: Right-click an element on a webpage and select "Inspect" or "Inspect Element."

Common Issues (Ruby Gem / CLI)

auto Picks The Wrong Surface Or Finds No Items

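The auto flow is sensitive to the kind of page (the "surface") the URL points to: listing-style pages extract far more reliably than homepages or app shells.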

  • Higher success inputs:
    • newsroom/press listing URLs
    • category/tag/listing/archive URLs
    • changelog/release/update listing URLs
  • Lower success inputs:
    • generic homepages
    • search result pages
    • client-rendered app-shell entrypoints

If extraction quality is poor, switch to a more specific listing/update URL before tuning selectors.

Empty Feeds

If your feed is empty, check the following:

  • URL: Ensure the url in your configuration is correct and accessible.
  • items.selector: Verify that the items.selector matches the elements on the page.
  • Website Changes: Websites change their HTML structure frequently. Your selectors may be outdated.
  • JavaScript Content: If the content is loaded via JavaScript, use a browser-based rendering strategy (--strategy browserless, or --strategy botasaurus when BOTASAURUS_SCRAPER_URL is configured).
  • Authentication: Some sites require authentication. Check whether you need to add headers or use a different strategy.
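
The first two checks in the list above can be partly automated before you open the browser. The following is a minimal Ruby sketch, not part of html2rss: the helper name basic_config_problems is hypothetical, and the key layout (channel.url, selectors.items.selector) is an assumption about how your config file is structured. It only inspects the config; it does not fetch the page or evaluate the selector.

```ruby
require "yaml"
require "uri"

# Hypothetical config; the key layout (channel.url, selectors.items.selector)
# is an assumption, adjust it to match your own file.
CONFIG_YAML = <<~YAML
  channel:
    url: https://example.com/updates
  selectors:
    items:
      selector: "article.post"
    title:
      selector: "h2"
YAML

# Returns a list of human-readable problems. An empty list means the basic
# checks passed; it says nothing about whether the selector matches the page.
def basic_config_problems(config)
  problems = []

  url = config.dig("channel", "url").to_s
  begin
    uri = URI.parse(url)
    problems << "channel.url is not an http(s) URL" unless uri.is_a?(URI::HTTP) && uri.host
  rescue URI::InvalidURIError
    problems << "channel.url is not a valid URL"
  end

  selector = config.dig("selectors", "items", "selector").to_s.strip
  problems << "selectors.items.selector is missing or empty" if selector.empty?

  problems
end

config = YAML.safe_load(CONFIG_YAML)
puts basic_config_problems(config).inspect
```

If the list comes back empty and the feed is still empty, the remaining causes (site changes, JavaScript rendering, authentication) need to be checked against the live page.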

No scrapers found Failure Taxonomy (auto)

auto classifies no-scraper failures with actionable hints:

  • Blocked surface likely (anti-bot or interstitial):
    • try a more specific public listing URL
  • App-shell surface detected:
    • target a direct listing/update page instead of homepage/shell entrypoint
  • Unsupported extraction surface for auto mode:
    • switch to listing/changelog/category URLs
    • or use explicit selectors in YAML config

Known anti-bot interstitial patterns (for example Cloudflare challenge pages) are surfaced as blocked-surface errors instead of silent empty extraction results.

When all auto fallback tiers complete but still extract zero items, html2rss raises No RSS feed items extracted after auto fallback ....

If failures continue after URL/surface fixes, retry with an explicit browser-based override (--strategy browserless), or --strategy botasaurus when BOTASAURUS_SCRAPER_URL is configured.

Browserless Connection / Setup Failures

If you receive Browserless connection failed (...):

  1. Confirm Browserless is running and reachable from the machine running html2rss.
  2. Confirm BROWSERLESS_IO_WEBSOCKET_URL points at that running service.
  3. Confirm BROWSERLESS_IO_API_TOKEN matches the Browserless TOKEN.

Example local startup:

<Code code={docker run --rm -p 3000:3000 -e "CONCURRENT=10" -e "TOKEN=6R0W53R135510" ghcr.io/browserless/chromium} lang="bash" />

Then run with:

<Code code={BROWSERLESS_IO_WEBSOCKET_URL="ws://127.0.0.1:3000" BROWSERLESS_IO_API_TOKEN="6R0W53R135510" html2rss auto https://example.com/updates --strategy browserless} lang="bash" />

For custom websocket endpoints, BROWSERLESS_IO_API_TOKEN is required.
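
To separate "Browserless is not reachable" from "the token is wrong", it can help to test the TCP connection on its own. The sketch below is an illustration using only Ruby's standard library; the helper name endpoint_reachable? is hypothetical and not part of html2rss. It proves only that the host and port in BROWSERLESS_IO_WEBSOCKET_URL accept connections, not that the token is accepted.

```ruby
require "socket"
require "uri"

# Returns true when a TCP connection to the websocket URL's host and port
# succeeds within the timeout. This only shows that the port is open.
def endpoint_reachable?(websocket_url, timeout: 2)
  uri = URI.parse(websocket_url)
  return false unless %w[ws wss].include?(uri.scheme) && uri.host

  # Older Ruby versions do not know a default port for ws/wss URLs.
  port = uri.port || (uri.scheme == "wss" ? 443 : 80)
  Socket.tcp(uri.host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, URI::InvalidURIError
  false
end

url = ENV.fetch("BROWSERLESS_IO_WEBSOCKET_URL", "ws://127.0.0.1:3000")
puts "#{url} reachable: #{endpoint_reachable?(url)}"
```

If this reports false, fix the service or the URL first (steps 1 and 2); if it reports true and the error persists, compare the token values (step 3).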

Configuration Errors

Common configuration-related errors:

  • UnsupportedResponseContentType: The website returned content that html2rss can't parse (not HTML or JSON).
  • UnsupportedStrategy: The specified strategy is not available. Use auto, faraday, browserless, or botasaurus.
  • BOTASAURUS_SCRAPER_URL is required for strategy=botasaurus.: Set BOTASAURUS_SCRAPER_URL to your Botasaurus scrape API base URL when using --strategy botasaurus.
  • BOTASAURUS_SCRAPER_URL is invalid: Fix the URL format and retry.
  • Configuration must include at least 'selectors' or 'auto_source': You need to specify either manual selectors or enable auto-source.
  • stylesheet.type invalid: Only text/css and text/xsl are supported for stylesheets.
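
For the two BOTASAURUS_SCRAPER_URL errors, a quick pre-flight check of the environment variable can save a round trip. This is a sketch under assumptions: plausible_base_url? is a hypothetical helper, and html2rss performs its own validation, which may accept or reject different inputs than this check.

```ruby
require "uri"

# Hypothetical pre-flight check: true when the value parses as an http(s)
# URL with a host. html2rss's own validation may be stricter or looser.
def plausible_base_url?(value)
  uri = URI.parse(value.to_s.strip)
  uri.is_a?(URI::HTTP) && !uri.host.to_s.empty?
rescue URI::InvalidURIError
  false
end

value = ENV["BOTASAURUS_SCRAPER_URL"]
if value.nil? || value.strip.empty?
  puts "BOTASAURUS_SCRAPER_URL is not set"
elsif plausible_base_url?(value)
  puts "BOTASAURUS_SCRAPER_URL looks well-formed"
else
  puts "BOTASAURUS_SCRAPER_URL does not look like an http(s) URL"
end
```

A common slip this catches is a value without a scheme, such as localhost:8000 instead of http://localhost:8000.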

Missing Item Parts

If parts of your items (e.g., title, link) are missing, check the following:

  • Selector: Ensure the selector for the missing part is correct and relative to the items.selector.
  • Extractor: Verify that you are using the correct extractor (e.g., text, href, attribute).
  • Dynamic Content: faraday does not render JavaScript. If content loads dynamically, run with --strategy browserless (with Browserless available) or --strategy botasaurus (with BOTASAURUS_SCRAPER_URL configured) so the page can be rendered before extraction.

Date/Time Parsing Errors

If you are having issues with date/time parsing, check the following:

  • Date Format: The parse_time post-processor automatically detects common date formats using Ruby's Time.parse. Ensure your date strings are in a recognizable format.
  • time_zone: Specify the correct time_zone if the website uses a specific time zone.
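
To see whether a date string from the page is recognizable, you can feed it to Time.parse directly. The sketch below uses only Ruby's standard library; try_parse_time is a hypothetical helper and the sample strings are illustrative, not taken from any particular site.

```ruby
require "time"

# Returns the parsed Time, or nil when Time.parse finds no date or time
# information in the string.
def try_parse_time(text)
  Time.parse(text)
rescue ArgumentError
  nil
end

samples = [
  "2024-03-05T14:30:00Z",             # ISO 8601
  "Tue, 05 Mar 2024 14:30:00 +0100",  # RFC 2822
  "not a date"
]

samples.each do |sample|
  parsed = try_parse_time(sample)
  puts "#{sample.inspect} -> #{parsed ? parsed.utc.iso8601 : 'unparseable'}"
end
```

Strings that carry no offset or zone are interpreted by Time.parse in the local time zone of the machine, which is the situation where setting time_zone matters.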

html2rss Command Not Found

If you are getting a "command not found" error, try the following:

  • Re-install: Run gem install html2rss again to confirm the gem and its executable are installed.
  • Check PATH: Ensure that the directory where Ruby gems are installed is in your system's PATH.
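
The PATH check can be done from Ruby itself, since RubyGems knows where it places executables. A minimal sketch, assuming a Unix-like shell; dir_on_path? is a hypothetical helper, while Gem.bindir and Gem.user_dir are part of RubyGems.

```ruby
require "rubygems"

# True when `dir` appears among the entries of a PATH-style string.
def dir_on_path?(dir, path_env = ENV.fetch("PATH", ""))
  wanted = File.expand_path(dir)
  path_env.split(File::PATH_SEPARATOR).any? do |entry|
    !entry.empty? && File.expand_path(entry) == wanted
  end
end

# Gem.bindir is where regular gem installs put executables; user installs
# (gem install --user-install) put them under Gem.user_dir instead.
[Gem.bindir, File.join(Gem.user_dir, "bin")].each do |dir|
  status = dir_on_path?(dir) ? "on PATH" : "NOT on PATH"
  puts "#{dir}: #{status}"
end
```

If the directory that contains the html2rss executable is reported as not on PATH, add it in your shell profile and open a new terminal.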

Web Application Issues (html2rss-web)

Instance Won’t Start

  • Verify Docker is installed and running: <Code code={docker --version} lang="bash" />
  • Check logs for errors: <Code code={docker compose logs} lang="bash" />
  • Ensure the app port (default compose binding: 4000) isn’t already in use: <Code code={lsof -i :4000} lang="bash" />
  • If the app exits immediately in production, check that HTML2RSS_SECRET_KEY is set.

Can’t Access the Web Interface

  • Confirm your firewall allows traffic on port 4000 or your reverse-proxy ports.
  • Try accessing via the server’s IP instead of a domain name.
  • Double-check that containers are running: <Code code={docker compose ps} lang="bash" />

Authentication Errors

  • 401 Unauthorized when creating feeds: The create-feed API expects a bearer token. Re-enter a valid access token in the UI or send Authorization: Bearer ... to POST /api/v1/feeds.
  • 403 Forbidden when creating feeds: Automatic feed generation may be disabled (AUTO_SOURCE_ENABLED=false) or the requested URL may not be allowed for the authenticated account.
  • 500 Internal Server Error: Check the application logs for detailed error information.
  • Health endpoint failures: Use GET /api/v1/health/live, GET /api/v1/health/ready, or authenticated GET /api/v1/health depending on which probe you are testing.
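
The health probes can be scripted with Ruby's standard library. The sketch below is illustrative: health_request and probe are hypothetical helpers, http://localhost:4000 assumes the default compose port binding, and sending the token as a bearer header to the authenticated /api/v1/health endpoint is an assumption based on how the create-feed API authenticates.

```ruby
require "net/http"
require "uri"

# Builds (but does not send) a GET request for one of the health endpoints.
# The bearer header for /api/v1/health is an assumption, see the note above.
def health_request(base_url, path, token: nil)
  uri = URI.join(base_url, path)
  request = Net::HTTP::Get.new(uri)
  request["Authorization"] = "Bearer #{token}" if token
  request
end

# Sends the request and returns the HTTP status code as a string.
def probe(request)
  uri = request.uri
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https",
                                      open_timeout: 2, read_timeout: 5) do |http|
    http.request(request).code
  end
end

# Usage (requires a running instance; the token value is a placeholder):
#   base = "http://localhost:4000"
#   puts probe(health_request(base, "/api/v1/health/live"))
#   puts probe(health_request(base, "/api/v1/health/ready"))
#   puts probe(health_request(base, "/api/v1/health", token: "YOUR_TOKEN"))
```

Probing the live and ready endpoints separately from the authenticated one helps tell a stopped instance apart from a token problem.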

Feed Problems

  • Some sites require JavaScript rendering; ensure the Browserless service is running.
  • Check the feed configuration in feeds.yml for typos or invalid selectors.
  • Look for parsing errors in the logs: <Code code={docker compose logs html2rss-web} lang="bash" />

Tips & Tricks

  • Mobile Redirects: Check that the channel URL does not redirect to a mobile page with a different markup structure.
  • curl and pup: For static sites, pipe the page through pup to test selectors quickly: curl -s URL | pup 'SELECTOR'.
  • CSS Selectors: For a comprehensive overview of CSS selectors, see the W3C documentation.

Still Stuck?