html2rss.github.io/src/content/docs/ruby-gem/how-to/advanced-features.mdx at ae5a6f08378d494f3d3f2ab7ed0c42447d8b0e52 · html2rss/html2rss.github.io

title	Advanced Features
description	Advanced features and performance optimizations for html2rss.

This guide covers advanced features and performance optimizations for html2rss.

Parallel Processing

html2rss uses parallel processing to improve performance when scraping multiple items. This happens automatically and doesn't require any configuration.

How It Works

Auto-source scraping: Multiple scrapers run in parallel to analyze the page
Item processing: Each scraped item is processed in parallel
Performance benefit: Significantly faster when dealing with many items
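
The item-processing step above can be pictured with a minimal Ruby sketch. It is not html2rss's actual implementation: `process_item` and `process_in_parallel` are hypothetical names, and plain threads from the standard library stand in for whatever mechanism the gem uses internally.

```ruby
# Hypothetical stand-in for the per-item work (extraction, post-processing).
def process_item(raw)
  { title: raw[:title].strip, url: raw[:url] }
end

# Process every item in its own thread and return results in input order.
def process_in_parallel(raw_items)
  threads = raw_items.map do |raw|
    Thread.new { process_item(raw) }
  end
  # Thread#value waits for the thread and returns the block's result.
  threads.map(&:value)
end

raw_items = [
  { title: "  First post ", url: "https://example.com/1" },
  { title: "Second post\n", url: "https://example.com/2" }
]

process_in_parallel(raw_items).each do |article|
  puts "#{article[:title]} -> #{article[:url]}"
end
```

Because the threads are collected in the same order they were started, the output order matches the input order even when individual items finish at different times.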

Performance Tips

Use appropriate selectors: More specific selectors reduce processing time
Limit items when possible: Use CSS selectors that target only the content you need
Cache responses: The web application caches responses automatically
Choose the right strategy: Use faraday for static content, browserless only when JavaScript is required

Memory Optimization

html2rss is designed to be memory-efficient:

Frozen objects: Parsed content is frozen to prevent accidental modifications
Efficient data structures: Uses Set instead of Array for lookups
Minimal allocations: Prefers bang methods to avoid unnecessary memory allocations
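
The three techniques listed above are general Ruby idioms. The sketch below illustrates each one in isolation; the method names are hypothetical and the code is not taken from the html2rss source.

```ruby
require "set"

# Frozen objects: attempts to mutate raise FrozenError instead of silently
# changing data that other code may share.
def try_to_mutate(string)
  string << "!"
  :mutated
rescue FrozenError
  :rejected
end

# Set lookups are hash-based, so membership checks stay fast as the
# collection grows; Array#include? scans element by element.
def unique_urls(urls)
  seen = Set.new
  urls.select { |url| seen.add?(url) } # Set#add? returns nil for duplicates
end

# Bang methods mutate the receiver in place; map! reuses the existing Array
# rather than building a second one.
def strip_titles!(titles)
  titles.map!(&:strip)
end

puts try_to_mutate("Breaking News".freeze)
puts unique_urls(["https://example.com/a", "https://example.com/a"]).inspect
puts strip_titles!(["  alpha ", " beta"]).inspect
```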

Large Feed Handling

For websites with many items:

channel:
  url: "https://example.com/articles"
selectors:
  items:
    selector: ".article:not(.advertisement)" # Exclude ads
  title:
    selector: "h2" # More specific than generic selectors
  url:
    selector: "a"
    extractor: "href"

Error Recovery

html2rss includes built-in error handling:

Graceful degradation: If one scraper fails, others continue
Detailed logging: Set LOG_LEVEL=debug for detailed information
Validation: Configuration is validated before processing
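
Graceful degradation can be sketched as follows. This is an illustration under stated assumptions, not the gem's code: `build_logger` and `run_scrapers` are hypothetical names, each scraper is modeled as a lambda that returns an array of items, and the `warn` default for `LOG_LEVEL` is an assumption.

```ruby
require "logger"

# Build a logger whose level comes from LOG_LEVEL (the default is assumed).
def build_logger(io = $stderr)
  logger = Logger.new(io)
  logger.level = ENV.fetch("LOG_LEVEL", "warn")
  logger
end

# Run every scraper; a failure is logged and contributes no items,
# while the remaining scrapers still run.
def run_scrapers(scrapers, logger:)
  scrapers.flat_map do |name, scraper|
    begin
      scraper.call
    rescue StandardError => e
      logger.warn("#{name} failed: #{e.message}")
      []
    end
  end
end

scrapers = {
  semantic_html: -> { ["article from semantic HTML"] },
  broken:        -> { raise "unexpected markup" },
  json_ld:       -> { ["article from JSON-LD"] }
}

puts run_scrapers(scrapers, logger: build_logger).inspect
```

Rescuing inside the loop, rather than around it, is what keeps one failing scraper from discarding the results of the others.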

Custom Headers for Performance

Optimize requests with appropriate headers:

headers:
  Accept: "text/html,application/xhtml+xml" # Avoid JSON if not needed
  Accept-Encoding: "gzip, deflate" # Enable compression
channel:
  url: "https://example.com/articles"
selectors:
  items:
    selector: "article"
  title:
    selector: "h2"
  url:
    selector: "a"
    extractor: "href"

Monitoring and Debugging

Enable Debug Logging

LOG_LEVEL=debug html2rss feed config.yml

Web Application Health Checks

Use the health check endpoint to monitor feed generation:

curl -u username:password http://localhost:3000/health_check.txt
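
For scripted monitoring, the same request can be made from Ruby's standard library. The sketch below mirrors the curl call above (same path, same basic-auth credentials); the helper names and the five-second timeouts are assumptions, and it treats any 2xx response as healthy without inspecting the response body.

```ruby
require "net/http"
require "uri"

# Build the GET request for the health check endpoint with basic auth.
def build_health_check_request(base_url, username, password)
  uri = URI.join(base_url, "/health_check.txt")
  request = Net::HTTP::Get.new(uri)
  request.basic_auth(username, password)
  [uri, request]
end

# Returns true for a 2xx response, false for anything else,
# including connection errors and timeouts.
def healthy?(base_url, username, password)
  uri, request = build_health_check_request(base_url, username, password)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == "https",
                             open_timeout: 5, read_timeout: 5) do |http|
    http.request(request)
  end
  response.is_a?(Net::HTTPSuccess)
rescue StandardError
  false
end

# Example call (performs a real network request):
#   healthy?("http://localhost:3000", "username", "password")
```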

Article Validation

html2rss includes built-in validation for articles to ensure feed quality.

Validation Rules

Articles are considered valid if they have:

A non-empty URL
A title, a description, or both
A unique ID

Invalid Articles

Invalid articles are automatically filtered out to prevent empty or broken feed items.
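
The rules and the filtering step can be expressed as a small predicate. This is a sketch, not the gem's implementation: the `Article` struct and the helper names are hypothetical, and "unique ID" is interpreted here as a non-empty ID that has not already appeared in the feed.

```ruby
# Hypothetical article record with the four fields the rules mention.
Article = Struct.new(:id, :url, :title, :description, keyword_init: true)

def blank?(value)
  value.nil? || value.to_s.strip.empty?
end

# Valid: non-empty URL, non-empty ID, and a title or a description (or both).
def valid_article?(article)
  !blank?(article.url) && !blank?(article.id) &&
    (!blank?(article.title) || !blank?(article.description))
end

# Keep valid articles only, dropping any whose ID was already seen.
def filter_valid(articles)
  seen_ids = {}
  articles.select do |article|
    next false unless valid_article?(article)
    next false if seen_ids.key?(article.id)

    seen_ids[article.id] = true
  end
end

articles = [
  Article.new(id: "1", url: "https://example.com/1", title: "Kept"),
  Article.new(id: "2", url: "", title: "Dropped: empty URL"),
  Article.new(id: "3", url: "https://example.com/3")
]

puts filter_valid(articles).map(&:title).inspect
```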

Custom Validation

You can guard against empty values with post-processors. The following configuration replaces an empty or whitespace-only title with "Untitled":

channel:
  url: "https://example.com/articles"
selectors:
  items:
    selector: "article"
  title:
    selector: "h2"
    post_process:
      - name: "gsub"
        pattern: "^\\s*$"
        replacement: "Untitled"
  url:
    selector: "a"
    extractor: "href"
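
To see what the pattern in this configuration matches, the substitution can be tried in plain Ruby. The sketch assumes the `pattern` string is compiled to an ordinary Ruby regular expression; `fallback_title` is a hypothetical helper, not part of html2rss.

```ruby
# The YAML string "^\\s*$" unescapes to the regular expression /^\s*$/.
BLANK_TITLE = /^\s*$/

# Replace an empty or whitespace-only title with a placeholder.
def fallback_title(title)
  title.gsub(BLANK_TITLE, "Untitled")
end

puts fallback_title("").inspect
puts fallback_title("   ").inspect
puts fallback_title("Release notes").inspect
```

In Ruby, `^` and `$` match at line boundaries, so a multi-line title that contains a blank line would also match. Anchoring with `\A` and `\z` restricts the match to titles that are blank as a whole.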

Best Practices

Test configurations: Run each configuration and inspect the generated feed before deploying it
Monitor performance: Use health checks to detect issues early
Keep selectors simple: Complex selectors are harder to maintain
Use auto-source when possible: It's often more reliable than manual selectors
Handle errors gracefully: Implement proper error handling in your applications
Validate your data: Ensure your selectors return valid content
