| title | Advanced Features |
|---|---|
| description | Advanced features and performance optimizations for html2rss. |
import { Code } from "@astrojs/starlight/components";
This guide covers advanced features and performance optimizations for html2rss.
html2rss uses parallel processing in auto-source discovery. This happens automatically and doesn't require any configuration.
- Use appropriate selectors: More specific selectors reduce processing time
- Limit items when possible: Use CSS selectors that target only the content you need
- Cache responses: The web application caches responses automatically
- Choose the right strategy: Use static HTTP fetching for simple pages, and move to a JavaScript/browser-based extraction strategy when rendering or anti-bot handling is required
html2rss is designed to be memory-efficient:
- Frozen objects: Parsed content is frozen to prevent accidental modifications
- Efficient data structures: Uses `Set` instead of `Array` for lookups
- Minimal allocations: Prefers bang methods to avoid unnecessary memory allocations
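The three techniques in this list can be illustrated with plain Ruby. This is a minimal sketch, not html2rss's internal code: the `item` hash, the URLs, and the variable names are all hypothetical.

```ruby
require "set"

# Hypothetical parsed item; frozen so later code cannot mutate it by accident.
item = { title: "Hello", url: "https://example.com/a" }.freeze

mutation_blocked =
  begin
    item[:title] = "Changed"
    false
  rescue FrozenError
    true
  end

# Set membership is a hash lookup; Array#include? scans every element.
seen_urls = Set.new(["https://example.com/a", "https://example.com/b"])
already_seen = seen_urls.include?(item[:url])

# Bang methods modify the receiver in place instead of allocating a copy.
titles = ["  First ", "Second  "]
original_id = titles.object_id
titles.map!(&:strip)
same_object = titles.object_id == original_id
```

Freezing turns an accidental write into an immediate `FrozenError`, which is usually easier to debug than silently changed data.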
For websites with many items:
<Code
code={`channel:
  url: "https://example.com/articles"
selectors:
  items:
    selector: ".article:not(.advertisement)" # Exclude ads
  title:
    selector: "h2" # More specific than generic selectors
  url:
    selector: "a"
    extractor: "href"`}
lang="yaml"
/>
html2rss includes built-in error handling:
- Graceful degradation: If one scraper fails, others continue
- Detailed logging: Set `LOG_LEVEL=debug` for detailed information
- Validation: Configuration is validated before processing
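Graceful degradation combined with parallel execution might look roughly like the sketch below. It is an illustration only and does not reproduce how html2rss implements its scrapers: the scraper names, the lambdas, and the `errors` hash are hypothetical.

```ruby
# Hypothetical scrapers: each is a callable returning an array of item titles.
scrapers = {
  semantic_html: -> { ["Article A", "Article B"] },
  json_state: -> { raise ArgumentError, "no embedded JSON found" },
  schema: -> { ["Article C"] }
}

errors = {}

# Run each scraper in its own thread; a failure is recorded, not re-raised.
threads = scrapers.map do |name, scraper|
  Thread.new do
    scraper.call
  rescue StandardError => e
    errors[name] = e.message
    []
  end
end

# Thread#value waits for the thread and returns its block's result.
items = threads.flat_map(&:value)
```

Because a failing scraper contributes an empty array, the results from the remaining scrapers still make it into `items`, and `errors` keeps the failure message for logging.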
Optimize requests with appropriate headers:
<Code
code={`headers:
  Accept: "text/html,application/xhtml+xml" # Avoid JSON if not needed
  Accept-Encoding: "gzip, deflate" # Enable compression
channel:
  url: "https://example.com/articles"
selectors:
  items:
    selector: "article"
  title:
    selector: "h2"
  url:
    selector: "a"
    extractor: "href"`}
lang="yaml"
/>
To troubleshoot a feed, run the CLI with debug logging enabled:
<Code code={`LOG_LEVEL=debug html2rss feed config.yml`} lang="bash" />
Use the authenticated health endpoint to monitor the web application, or use liveness/readiness endpoints when you do not use an auth token:
<Code
code={`curl -H "Authorization: Bearer YOUR_HEALTH_CHECK_TOKEN" http://localhost:4000/api/v1/health`}
lang="bash"
/>
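The same check can be scripted from Ruby with the standard library. This is a sketch under assumptions: the `HEALTH_CHECK_TOKEN` environment variable name and the `healthy?` helper are hypothetical, and treating any 2xx response as healthy is a simplification.

```ruby
require "net/http"
require "uri"

uri = URI("http://localhost:4000/api/v1/health")

# Build the authenticated request; the env var name is an assumption.
request = Net::HTTP::Get.new(uri)
request["Authorization"] = "Bearer #{ENV.fetch('HEALTH_CHECK_TOKEN', 'YOUR_HEALTH_CHECK_TOKEN')}"

# Returns true for a 2xx response, false on connection errors or timeouts.
def healthy?(uri, request)
  response = Net::HTTP.start(uri.hostname, uri.port, open_timeout: 2, read_timeout: 2) do |http|
    http.request(request)
  end
  response.is_a?(Net::HTTPSuccess)
rescue SystemCallError, SocketError, Net::OpenTimeout, Net::ReadTimeout
  false
end
```

Calling `healthy?(uri, request)` from a cron job or monitoring script gives a simple boolean to alert on.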
html2rss includes built-in validation for articles to ensure feed quality:
Articles are considered valid if they have:
- A non-empty URL
- Either a title OR description (or both)
- A unique ID
Invalid articles are automatically filtered out to prevent empty or broken feed items.
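The rules above can be expressed as a small predicate. This is a minimal sketch and not the library's actual implementation: the `Article` struct and both helper methods are hypothetical, and "unique ID" is interpreted here as "no two articles in one feed share an ID".

```ruby
# Hypothetical article shape; html2rss's internal class may differ.
Article = Struct.new(:id, :url, :title, :description, keyword_init: true)

# True when the value is neither nil nor blank.
def present?(value)
  !value.to_s.strip.empty?
end

# URL and ID are required, plus at least one of title or description.
def valid_article?(article)
  present?(article.url) && present?(article.id) &&
    (present?(article.title) || present?(article.description))
end

candidates = [
  Article.new(id: "1", url: "https://example.com/1", title: "First"),
  Article.new(id: "2", url: "https://example.com/2", description: "Only a summary"),
  Article.new(id: "3", url: "", title: "Missing URL"),
  Article.new(id: "4", url: "https://example.com/4"),
  Article.new(id: "1", url: "https://example.com/1-dup", title: "Duplicate ID")
]

# Keep valid articles, then drop repeated IDs so each ID appears once.
valid = candidates.select { |a| valid_article?(a) }.uniq(&:id)
valid_ids = valid.map(&:id)
```

Here only the first two candidates survive: the third has no URL, the fourth has neither title nor description, and the fifth repeats an ID already seen.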
You can add custom validation by using post-processors:
<Code
code={`channel:
  url: "https://example.com/articles"
selectors:
  items:
    selector: "article"
  title:
    selector: "h2"
    post_process:
      - name: "gsub"
        pattern: "^\\s*$"
        replacement: "Untitled"
  url:
    selector: "a"
    extractor: "href"`}
lang="yaml"
/>
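To see what this pattern does to a title, the substitution can be tried in plain Ruby. The sketch assumes the `gsub` post-processor behaves like `String#gsub` with the pattern compiled to a regular expression. The `default_title` helper is hypothetical.

```ruby
# Same pattern as in the configuration above, written as a Ruby regexp literal.
BLANK = /^\s*$/

# Replace an empty or whitespace-only title with a placeholder.
def default_title(raw)
  raw.gsub(BLANK, "Untitled")
end

blank_result = default_title("   ")
empty_result = default_title("")
kept_result  = default_title("Release notes")
```

Blank and empty titles become `"Untitled"`, while a real title passes through unchanged, so the item still satisfies the "title or description" rule.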
- Test configurations: Always test your configurations before deploying
- Monitor performance: Use health checks to detect issues early
- Keep selectors simple: Complex selectors are harder to maintain
- Use auto-source when possible: It's often more reliable than manual selectors
- Handle errors gracefully: Implement proper error handling in your applications
- Validate your data: Ensure your selectors return valid content