---
title: Advanced Features
description: Advanced features and performance optimizations for html2rss.
---

import { Code } from "@astrojs/starlight/components";

This guide covers advanced features and performance optimizations for html2rss.

Parallel Processing

html2rss runs auto-source discovery in parallel. This happens automatically and requires no configuration.
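To illustrate the general idea only, the sketch below runs several independent scrapers in threads and merges their results in order. The `run_in_parallel` helper and the `EXAMPLE_SCRAPERS` lambdas are hypothetical names invented for this example; they are not html2rss internals, and the real implementation may differ.

```ruby
# Hypothetical sketch: run independent scrapers concurrently and merge results.
# None of these names come from html2rss itself.

# Each "scraper" is any callable that returns an array of item hashes.
def run_in_parallel(scrapers)
  threads = scrapers.map do |scraper|
    Thread.new { scraper.call }
  end

  # Thread#value waits for the thread to finish and returns the block's result.
  threads.flat_map(&:value)
end

# Stand-ins for real scrapers; a real one would fetch and parse a page.
EXAMPLE_SCRAPERS = [
  -> { [{ title: "Item from scraper A" }] },
  -> { [{ title: "Item from scraper B" }, { title: "Second item from scraper B" }] }
].freeze

items = run_in_parallel(EXAMPLE_SCRAPERS)
puts items.map { |item| item[:title] }
```

Threads suit this kind of I/O-bound work, because each scraper spends most of its time waiting on the network rather than on the CPU.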

Performance Tips

  1. Use appropriate selectors: More specific selectors reduce processing time
  2. Limit items when possible: Use CSS selectors that target only the content you need
  3. Cache responses: The web application caches responses automatically
  4. Choose the right strategy: Use static HTTP fetching for simple pages, and switch to a JavaScript/browser-based extraction strategy only when the page requires rendering or anti-bot handling

Memory Optimization

html2rss is designed to be memory-efficient:

  • Frozen objects: Parsed content is frozen to prevent accidental modifications
  • Efficient data structures: Uses Set instead of Array for lookups
  • Minimal allocations: Prefers bang methods to avoid unnecessary memory allocations
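The three techniques above are plain Ruby idioms. The following sketch shows each one in isolation, using only the standard library; the constant and method names are made up for illustration and do not mirror html2rss source code.

```ruby
require "set"

# Frozen objects: an attempted mutation raises instead of silently changing data.
PARSED = { title: "Hello".freeze }.freeze

def frozen_write_rejected?(hash)
  hash[:title] = "changed"
  false
rescue FrozenError
  true
end

# Set lookups: membership checks hash the key instead of scanning every element.
SEEN_URLS = Set.new(["https://example.com/a", "https://example.com/b"])

def seen?(url)
  SEEN_URLS.include?(url)
end

# Bang methods: mutate the receiver in place rather than allocating a copy.
def normalize_title!(title)
  title.strip!        # returns nil when nothing changed, so do not chain it
  title.squeeze!(" ") # collapse runs of spaces in place
  title
end

puts frozen_write_rejected?(PARSED)
puts seen?("https://example.com/a")
puts normalize_title!(+"  Hello   world  ")
```

Note that bang methods such as `strip!` return `nil` when they make no change, which is why the helper returns `title` explicitly instead of chaining the calls.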

Large Feed Handling

For websites with many items:

<Code
  code={`channel:
  url: "https://example.com/articles"
selectors:
  items:
    selector: ".article:not(.advertisement)" # Exclude ads
  title:
    selector: "h2" # More specific than generic selectors
  url:
    selector: "a"
    extractor: "href"`}
  lang="yaml"
/>

Error Recovery

html2rss includes built-in error handling:

  • Graceful degradation: If one scraper fails, others continue
  • Detailed logging: Set LOG_LEVEL=debug for detailed information
  • Validation: Configuration is validated before processing
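Graceful degradation can be pictured as rescuing errors per scraper, so that one failure contributes nothing while the rest still run. The sketch below is a hypothetical illustration, not html2rss's actual code: `collect_items` and the scraper names are invented, and reading `LOG_LEVEL` from the environment this way is an assumption about how such a setting could be wired to Ruby's standard `Logger`.

```ruby
require "logger"

# Hypothetical sketch of graceful degradation; not html2rss's actual code.
# Assumption: LOG_LEVEL holds a standard severity name such as "debug" or "warn".
LOGGER = Logger.new($stderr, level: ENV.fetch("LOG_LEVEL", "warn"))

# scrapers: a Hash of name => callable returning an array of item hashes.
def collect_items(scrapers, logger: LOGGER)
  scrapers.flat_map do |name, scraper|
    scraper.call
  rescue StandardError => e
    logger.debug { "#{name} failed: #{e.class}: #{e.message}" }
    [] # a failed scraper contributes nothing; the others still run
  end
end
```

With `LOG_LEVEL=debug` set, the rescued error is written to standard error; at the default level in this sketch it is swallowed silently.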

Custom Headers for Performance

Optimize requests with appropriate headers:

<Code
  code={`headers:
  Accept: "text/html,application/xhtml+xml" # Avoid JSON if not needed
  Accept-Encoding: "gzip, deflate" # Enable compression
channel:
  url: "https://example.com/articles"
selectors:
  items:
    selector: "article"
  title:
    selector: "h2"
  url:
    selector: "a"
    extractor: "href"`}
  lang="yaml"
/>

Monitoring and Debugging

Enable Debug Logging

<Code code={`LOG_LEVEL=debug html2rss feed config.yml`} lang="bash" />

Web Application Health Checks

Use the authenticated health endpoint to monitor the web application. If you do not use an auth token, use the liveness/readiness endpoints instead. The authenticated check looks like this:

<Code
  code={`curl -H "Authorization: Bearer YOUR_HEALTH_CHECK_TOKEN" \
  http://localhost:4000/api/v1/health`}
  lang="bash"
/>
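For monitoring from a script, the same request can be issued with Ruby's standard `Net::HTTP`. This is a minimal sketch that reuses the endpoint and header from the curl example above. The helper names are hypothetical, and treating any 2xx response as healthy is an assumption, since the response body format is not described here.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: builds the same request as the curl example.
def build_health_request(base_url, token)
  uri = URI.join(base_url, "/api/v1/health")
  request = Net::HTTP::Get.new(uri)
  request["Authorization"] = "Bearer #{token}"
  [uri, request]
end

# Assumption: any 2xx status means healthy; the body is not inspected.
def healthy?(base_url, token)
  uri, request = build_health_request(base_url, token)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == "https",
                             open_timeout: 2, read_timeout: 5) do |http|
    http.request(request)
  end
  response.is_a?(Net::HTTPSuccess)
rescue SystemCallError, SocketError, Net::OpenTimeout, Net::ReadTimeout
  false # an unreachable or unresponsive server counts as unhealthy
end
```

A scheduled job could call `healthy?("http://localhost:4000", ENV.fetch("HEALTH_CHECK_TOKEN"))` and alert when it returns `false`; the environment variable name is likewise only an example.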

Article Validation

html2rss validates articles before adding them to a feed, so that every feed item is usable.

Validation Rules

Articles are considered valid if they have:

  • A non-empty URL
  • A title or a description (or both)
  • A unique ID

Invalid Articles

Invalid articles are automatically filtered out to prevent empty or broken feed items.
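The validation rules and the filtering step can be sketched in a few lines of Ruby. Everything here is hypothetical: the `Article` struct, `valid_article?`, and `filter_valid` are illustrative stand-ins rather than html2rss classes, and enforcing the "unique ID" rule by keeping the first article per ID is an assumption.

```ruby
# Hypothetical stand-in for an extracted article; not html2rss's own class.
Article = Struct.new(:id, :url, :title, :description, keyword_init: true)

def blank?(value)
  value.nil? || value.to_s.strip.empty?
end

# Valid means: non-empty URL, an ID, and at least one of title or description.
def valid_article?(article)
  return false if blank?(article.url)
  return false if blank?(article.id)

  !(blank?(article.title) && blank?(article.description))
end

# Drops invalid articles, then keeps the first article for each ID
# (assumption: this is one way to enforce uniqueness).
def filter_valid(articles)
  articles.select { |article| valid_article?(article) }.uniq(&:id)
end
```

Given a list that contains an article with an empty URL, one with neither title nor description, and two sharing an ID, `filter_valid` returns only the first valid article for each distinct ID.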

Custom Validation

You can add custom validation by using post-processors. The following configuration uses the gsub post-processor to replace a blank title with "Untitled":

<Code
  code={`channel:
  url: "https://example.com/articles"
selectors:
  items:
    selector: "article"
  title:
    selector: "h2"
    post_process:
      - name: "gsub"
        pattern: "^\\s*$"
        replacement: "Untitled"
  url:
    selector: "a"
    extractor: "href"`}
  lang="yaml"
/>

Best Practices

  1. Test configurations: Always test your configurations before deploying
  2. Monitor performance: Use health checks to detect issues early
  3. Keep selectors simple: Complex selectors are harder to maintain
  4. Use auto-source when possible: It's often more reliable than manual selectors
  5. Handle errors gracefully: Implement proper error handling in your applications
  6. Validate your data: Ensure your selectors return valid content