title Advanced Features
description Advanced features and performance optimizations for html2rss.

This guide covers advanced features and performance optimizations for html2rss.

Parallel Processing

html2rss uses parallel processing in auto-source discovery. This happens automatically and doesn't require any configuration.
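
As a rough illustration of the idea only, the sketch below runs several independent scrapers concurrently with Ruby's standard `Thread` class. The scraper names and lambdas are hypothetical stand-ins, and html2rss's internal implementation may differ.

```ruby
# Hypothetical stand-ins for auto-source scrapers; not html2rss's real classes.
SCRAPERS = {
  schema:   ->(html) { html.scan(/<article>(.*?)<\/article>/m).flatten },
  headings: ->(html) { html.scan(/<h2>(.*?)<\/h2>/m).flatten }
}.freeze

# Runs every scraper in its own thread and collects the results by name.
def scrape_in_parallel(html, scrapers = SCRAPERS)
  threads = scrapers.map do |name, scraper|
    Thread.new { [name, scraper.call(html)] }
  end
  # Thread#value waits for the thread to finish and returns the block's result.
  threads.map(&:value).to_h
end

html = "<article><h2>First</h2></article><article><h2>Second</h2></article>"
results = scrape_in_parallel(html)
puts results[:headings].inspect
```

Each scraper only reads the shared HTML string, so no locking is needed in this sketch.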

Performance Tips

  1. Use specific selectors: A narrowly scoped selector matches fewer nodes, which reduces processing time
  2. Limit items when possible: Use CSS selectors that target only the content you need
  3. Cache responses: The web application caches responses automatically
  4. Choose the right strategy: Use faraday for static content, and browserless only when the page requires JavaScript

Memory Optimization

html2rss is designed to be memory-efficient:

  • Frozen objects: Parsed content is frozen to prevent accidental modifications
  • Efficient data structures: Uses Set instead of Array for lookups
  • Minimal allocations: Prefers bang methods to avoid unnecessary memory allocations
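
The three techniques above are general Ruby idioms. A minimal sketch of each, using made-up values rather than html2rss internals:

```ruby
require "set"

# Frozen objects: mutation attempts raise instead of silently changing shared data.
title = "Breaking news".freeze
begin
  title << "!"
rescue FrozenError => e
  puts "Refused to modify: #{e.class}"
end

# Set lookups: membership checks are hash-based instead of scanning a whole Array.
seen_urls = Set.new(["https://example.com/a", "https://example.com/b"])
duplicate = seen_urls.include?("https://example.com/a")

# Bang methods: mutate the receiver in place rather than allocating a new Array.
words = ["  alpha ", nil, " beta"]
words.compact!       # removes nil in place
words.map!(&:strip)  # replaces each element with its stripped version
```
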

Large Feed Handling

For websites with many items, narrow the items selector so that only relevant elements are processed:

channel:
  url: "https://example.com/articles"
selectors:
  items:
    selector: ".article:not(.advertisement)" # Exclude ads
  title:
    selector: "h2" # More specific than generic selectors
  url:
    selector: "a"
    extractor: "href"

Error Recovery

html2rss includes built-in error handling:

  • Graceful degradation: If one scraper fails, others continue
  • Detailed logging: Set LOG_LEVEL=debug for detailed information
  • Validation: Configuration is validated before processing
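
Graceful degradation can be pictured as isolating each scraper's failure so the others still contribute items. The following is a sketch under that assumption; the scraper lambdas and the `collect_items` helper are hypothetical and not part of the html2rss API.

```ruby
require "logger"

# Hypothetical scrapers: each is a callable returning a list of item titles.
scrapers = {
  working: -> { ["Item A", "Item B"] },
  broken:  -> { raise ArgumentError, "selector matched nothing" }
}

# Calls every scraper, logs failures, and keeps the results of those that succeed.
def collect_items(scrapers, logger)
  scrapers.flat_map do |name, scraper|
    scraper.call
  rescue StandardError => e
    logger.warn("#{name} failed: #{e.message}")
    [] # a failed scraper contributes no items
  end
end

logger = Logger.new($stderr, level: Logger::WARN)
items = collect_items(scrapers, logger)
puts items.inspect
```
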

Custom Headers for Performance

Optimize requests with appropriate headers:

headers:
  Accept: "text/html,application/xhtml+xml" # Avoid JSON if not needed
  Accept-Encoding: "gzip, deflate" # Enable compression
channel:
  url: "https://example.com/articles"
selectors:
  items:
    selector: "article"
  title:
    selector: "h2"
  url:
    selector: "a"
    extractor: "href"

Monitoring and Debugging

Enable Debug Logging

LOG_LEVEL=debug html2rss feed config.yml

Web Application Health Checks

Use the health check endpoint to monitor feed generation:

curl -u username:password http://localhost:4000/health_check.txt
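
For scripted monitoring, the same request can be made from Ruby's standard library. This is a sketch that assumes the endpoint and basic-auth credentials from the curl command above; the helper names are hypothetical, and it treats any 2xx status as healthy without inspecting the response body.

```ruby
require "net/http"
require "uri"

# Builds the authenticated GET request for the health check endpoint.
def health_check_request(base_url, username, password)
  uri = URI.join(base_url, "/health_check.txt")
  request = Net::HTTP::Get.new(uri)
  request.basic_auth(username, password)
  [uri, request]
end

# Sends the request and reports whether the server answered with a 2xx status.
def healthy?(base_url, username, password)
  uri, request = health_check_request(base_url, username, password)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
  response.is_a?(Net::HTTPSuccess)
rescue SystemCallError, Timeout::Error, SocketError
  false # connection refused, timeout, or DNS failure counts as unhealthy
end
```

Calling `healthy?("http://localhost:4000", "username", "password")` from a cron job or monitoring agent returns `true` or `false`.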

Article Validation

html2rss validates every article before adding it to the feed to ensure feed quality.

Validation Rules

Articles are considered valid if they have:

  • A non-empty URL
  • Either a title OR description (or both)
  • A unique ID

Invalid Articles

Invalid articles are automatically filtered out to prevent empty or broken feed items.
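
The rules above can be expressed as a small predicate. This is a sketch only: the `Article` struct and helper names are hypothetical, and treating "a unique ID" as "drop later articles that repeat an ID" is an assumption rather than documented html2rss behavior.

```ruby
# Hypothetical article structure; html2rss's own article class may differ.
Article = Struct.new(:id, :url, :title, :description, keyword_init: true)

# True when the value contains something other than whitespace.
def filled?(value)
  !value.to_s.strip.empty?
end

# Valid when the article has a URL, an ID, and at least one of title or description.
def valid_article?(article)
  filled?(article.url) && filled?(article.id) &&
    (filled?(article.title) || filled?(article.description))
end

# Keeps only valid articles and drops later ones that repeat an ID (assumption).
def filter_articles(articles)
  articles.select { |article| valid_article?(article) }.uniq(&:id)
end

articles = [
  Article.new(id: "1", url: "https://example.com/1", title: "Hello"),
  Article.new(id: "2", url: "", title: "No URL"),
  Article.new(id: "3", url: "https://example.com/3"),
  Article.new(id: "1", url: "https://example.com/1", description: "Duplicate ID")
]
valid = filter_articles(articles)
puts valid.map(&:id).inspect
```
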

Custom Validation

You can approximate custom validation with post-processors. The following configuration replaces blank or whitespace-only titles with a placeholder:

channel:
  url: "https://example.com/articles"
selectors:
  items:
    selector: "article"
  title:
    selector: "h2"
    post_process:
      - name: "gsub"
        pattern: "^\\s*$"
        replacement: "Untitled"
  url:
    selector: "a"
    extractor: "href"
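
To see what this pattern matches, it can be tried directly in Ruby. The sketch assumes the gsub post-processor applies the pattern as an ordinary Ruby regular expression; `normalize_title` is a hypothetical helper, not an html2rss method.

```ruby
# Same pattern as in the YAML above; "\\s" in double-quoted YAML yields \s.
BLANK_TITLE = /^\s*$/

# Replaces an empty or whitespace-only title with a placeholder.
def normalize_title(raw)
  raw.to_s.gsub(BLANK_TITLE, "Untitled")
end

empty      = normalize_title("")
whitespace = normalize_title("   ")
real       = normalize_title("Real title")
puts [empty, whitespace, real].inspect
```

Because `^` and `$` match at line boundaries in Ruby, a title containing an empty line in the middle would also have that line replaced; anchoring with `\A` and `\z` avoids this if multi-line titles are possible.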

Best Practices

  1. Test configurations: Always test your configurations before deploying
  2. Monitor performance: Use health checks to detect issues early
  3. Keep selectors simple: Complex selectors are harder to maintain
  4. Use auto-source when possible: It's often more reliable than manual selectors
  5. Handle errors gracefully: Implement proper error handling in your applications
  6. Validate your data: Ensure your selectors return valid content