Commit 4c5d59d

docs: align heavy-usage auto quality guidance and web retry semantics
1 parent 07af6fa commit 4c5d59d

4 files changed

Lines changed: 172 additions & 2 deletions

src/content/docs/ruby-gem/reference/cli-reference.mdx

Lines changed: 54 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -33,6 +33,60 @@ html2rss auto https://example.com/articles --items_selector ".post-card"
3333

3434
Command: `html2rss auto URL`
3535

36+
#### URL Surface Guidance For `auto`
37+
38+
`auto` works best when the input URL already exposes a server-rendered list of entries.
39+
40+
- High-success surfaces:
41+
- newsroom or press listing pages
42+
- blog/category/tag listing pages
43+
- changelog/release notes/update listing pages
44+
- paginated archive/list views
45+
- Low-success surfaces:
46+
- generic homepages with heavy promo/navigation chrome
47+
- search results pages
48+
- client-rendered app shells (`#app`, `#root`, `#__next`, etc.)
49+
50+
When possible, pass a direct listing/update URL instead of a top-level homepage or app entrypoint.
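
The app-shell case in the list above can be roughly pre-checked before running `auto`. The following is a minimal sketch, assuming you already have the page's raw HTML as a string. `app_shell_candidate?` and `APP_SHELL_IDS` are hypothetical names used only for illustration; they are not part of html2rss, and its own detection may work differently.

```ruby
# Hypothetical heuristic, NOT html2rss's detection logic.
# Mount-point ids taken from the guidance above (#app, #root, #__next).
APP_SHELL_IDS = %w[app root __next].freeze

# Returns true when the HTML contains a mount-point element whose body is
# empty (or whitespace only), which suggests content is rendered client-side.
def app_shell_candidate?(html)
  APP_SHELL_IDS.any? do |id|
    pattern = /<(\w+)[^>]*\sid=["']#{Regexp.escape(id)}["'][^>]*>\s*<\/\1>/i
    html.match?(pattern)
  end
end

shell  = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
server = '<html><body><div id="root"><ul><li><a href="/a">A</a></li></ul></div></body></html>'

puts app_shell_candidate?(shell)
puts app_shell_candidate?(server)
```

A regular expression keeps the sketch dependency-free; it only looks for an empty mount point and will miss shells that ship placeholder markup.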
51+
52+
#### Failure Outcomes You Should Expect
53+
54+
When no extractable items are found, `auto` now classifies likely causes instead of only returning a generic message:
55+
56+
- `blocked surface likely (anti-bot or interstitial)`:
57+
- retry with `--strategy browserless`
58+
- try a more specific public listing URL
59+
- `app-shell surface detected`:
60+
- retry with `--strategy browserless`
61+
- switch to a direct listing/update URL
62+
- `unsupported extraction surface for auto mode`:
63+
- switch to listing/changelog/category URLs
64+
- use explicit selectors in a feed config
65+
66+
Known anti-bot interstitial responses (for example Cloudflare challenge pages) are surfaced explicitly as blocked-surface errors.
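
If you wrap the CLI in a script, the classifications above can be turned into a lookup table. This is a sketch only: it assumes the classification phrase appears verbatim somewhere in the captured error output, which this page does not guarantee. `NEXT_STEPS` and `next_steps_for` are hypothetical names, not html2rss APIs.

```ruby
# Failure phrases and remedies copied from the list above.
# Assumption: the phrase appears verbatim in the CLI's error output.
NEXT_STEPS = {
  "blocked surface likely (anti-bot or interstitial)" => [
    "retry with --strategy browserless",
    "try a more specific public listing URL"
  ],
  "app-shell surface detected" => [
    "retry with --strategy browserless",
    "switch to a direct listing/update URL"
  ],
  "unsupported extraction surface for auto mode" => [
    "switch to listing/changelog/category URLs",
    "use explicit selectors in a feed config"
  ]
}.freeze

# Returns the suggested next steps for the first known phrase found in the
# error output, or an empty array when nothing matches.
def next_steps_for(error_output)
  NEXT_STEPS.each do |phrase, steps|
    return steps if error_output.include?(phrase)
  end
  []
end

puts next_steps_for("No scrapers found: app-shell surface detected")
```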
67+
68+
#### Browserless Setup And Diagnostics (CLI)
69+
70+
`browserless` is opt-in for CLI usage.
71+
72+
```bash
73+
# Start a local Browserless container (default local token)
74+
docker run --rm -p 3000:3000 -e "CONCURRENT=10" -e "TOKEN=6R0W53R135510" ghcr.io/browserless/chromium
75+
76+
# Run auto with Browserless
77+
BROWSERLESS_IO_WEBSOCKET_URL="ws://127.0.0.1:3000" \
78+
BROWSERLESS_IO_API_TOKEN="6R0W53R135510" \
79+
html2rss auto https://example.com/updates --strategy browserless
80+
```
81+
82+
If you see `Browserless connection failed`, check:
83+
84+
- `BROWSERLESS_IO_WEBSOCKET_URL` points to a reachable Browserless endpoint
85+
- `BROWSERLESS_IO_API_TOKEN` matches the Browserless `TOKEN`
86+
- the Browserless service is running and reachable from your shell environment
87+
88+
For custom Browserless endpoints, `BROWSERLESS_IO_API_TOKEN` is required.
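
The checklist above can be partly automated. The sketch below uses only the Ruby standard library to report which of the two environment variables are unset and whether the endpoint accepts TCP connections. `missing_browserless_vars` and `endpoint_reachable?` are hypothetical helper names, not html2rss functions, and a successful TCP connection does not prove that the token is accepted.

```ruby
require "socket"
require "uri"

BROWSERLESS_VARS = %w[BROWSERLESS_IO_WEBSOCKET_URL BROWSERLESS_IO_API_TOKEN].freeze

# Lists the Browserless variables that are unset or empty in the given
# environment (anything responding to #[], such as ENV or a Hash).
def missing_browserless_vars(env = ENV)
  BROWSERLESS_VARS.select { |name| env[name].to_s.empty? }
end

# Returns true when a TCP connection to the endpoint's host and port succeeds
# within the timeout. This checks reachability only, not authentication.
def endpoint_reachable?(url, timeout: 2)
  uri = URI.parse(url)
  return false if uri.host.nil?

  Socket.tcp(uri.host, uri.port || 80, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, URI::InvalidURIError
  false
end

puts missing_browserless_vars.inspect
url = ENV["BROWSERLESS_IO_WEBSOCKET_URL"].to_s
puts endpoint_reachable?(url) unless url.empty?
```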
89+
3690
### Feed
3791

3892
Loads a YAML config, builds the feed, and prints the RSS XML to stdout.

src/content/docs/ruby-gem/reference/strategy.mdx

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -10,6 +10,8 @@ The `strategy` key defines how `html2rss` fetches a website's content.
1010

1111
`strategy` is a top-level config key. Request-specific controls live under `request`.
1212

13+
Use `faraday` first for direct newsroom/listing/changelog pages. Prefer `browserless` when the target is client-rendered, protected by anti-bot checks, or otherwise requires JavaScript to expose article links.
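
For scripted CLI usage, the "`faraday` first, then `browserless`" order could look like the sketch below. It assumes the `html2rss` executable is on your `PATH` and that it exits with a non-zero status on failure; neither is confirmed on this page. `auto_with_fallback` and `SHELL_RUNNER` are hypothetical names, and the gem itself does not provide this wrapper.

```ruby
require "open3"

# Default runner: executes the command and returns [success?, stdout].
# Assumption: the CLI signals failure through a non-zero exit status.
SHELL_RUNNER = lambda do |cmd|
  stdout, _stderr, status = Open3.capture3(*cmd)
  [status.success?, stdout]
end

# Tries the default strategy first, then the opt-in browserless strategy.
# Returns the output of the first successful attempt, or nil if both fail.
def auto_with_fallback(url, runner: SHELL_RUNNER)
  attempts = [
    ["html2rss", "auto", url],
    ["html2rss", "auto", url, "--strategy", "browserless"]
  ]
  attempts.each do |cmd|
    ok, output = runner.call(cmd)
    return output if ok
  end
  nil
end
```

The `runner:` parameter exists so the ordering can be exercised without invoking the real CLI, for example `auto_with_fallback(url, runner: ->(cmd) { [true, cmd.join(" ")] })`.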
14+
1315
## `browserless`
1416

1517
To use the `browserless` strategy, you need a running instance of [Browserless.io](https://www.browserless.io/).
@@ -126,6 +128,18 @@ html2rss feed my_config.yml --max-redirects 5 --max-requests 6
126128
html2rss feed my_config.yml
127129
```
128130

131+
### Browserless Troubleshooting
132+
133+
If Browserless cannot connect, html2rss surfaces a `Browserless connection failed (...)` error with endpoint/token hints.
134+
135+
Check these first:
136+
137+
- `BROWSERLESS_IO_WEBSOCKET_URL` is reachable from where html2rss runs
138+
- `BROWSERLESS_IO_API_TOKEN` matches your Browserless `TOKEN`
139+
- your Browserless service is running and accepting connections
140+
141+
For custom Browserless websocket endpoints, `BROWSERLESS_IO_API_TOKEN` is mandatory. The local default endpoint (`ws://127.0.0.1:3000`) can use the default local token `6R0W53R135510`.
142+
129143
---
130144

131145
For detailed documentation on the Ruby API, see the [official YARD documentation](https://www.rubydoc.info/gems/html2rss).

src/content/docs/troubleshooting/troubleshooting.mdx

Lines changed: 55 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,21 @@ Your browser's developer tools are essential for troubleshooting. Use them to in
1515

1616
## Common Issues (Ruby Gem / CLI)
1717

18+
### `auto` Picks The Wrong Surface Or Finds No Items
19+
20+
The `auto` flow is sensitive to the kind of page (the URL surface) that the input URL points to.
21+
22+
- Higher success inputs:
23+
- newsroom/press listing URLs
24+
- category/tag/listing/archive URLs
25+
- changelog/release/update listing URLs
26+
- Lower success inputs:
27+
- generic homepages
28+
- search result pages
29+
- client-rendered app-shell entrypoints
30+
31+
If extraction quality is poor, switch to a more specific listing/update URL before tuning selectors.
32+
1833
### Empty Feeds
1934

2035
If your feed is empty, check the following:
@@ -25,6 +40,46 @@ If your feed is empty, check the following:
2540
- **JavaScript Content:** If the content is loaded via JavaScript, use the `browserless` strategy instead of `faraday`.
2641
- **Authentication:** Some sites require authentication — check if you need to add headers or use a different strategy.
2742

43+
### `No scrapers found` Failure Taxonomy (`auto`)
44+
45+
`auto` classifies no-scraper failures with actionable hints:
46+
47+
- **Blocked surface likely (anti-bot or interstitial):**
48+
- retry with `--strategy browserless`
49+
- try a more specific public listing URL
50+
- **App-shell surface detected:**
51+
- retry with `--strategy browserless`
52+
- target a direct listing/update page instead of homepage/shell entrypoint
53+
- **Unsupported extraction surface for auto mode:**
54+
- switch to listing/changelog/category URLs
55+
- or use explicit selectors in YAML config
56+
57+
Known anti-bot interstitial patterns (for example Cloudflare challenge pages) are surfaced as blocked-surface errors instead of silent empty extraction results.
58+
59+
### Browserless Connection / Setup Failures
60+
61+
If you receive `Browserless connection failed (...)`:
62+
63+
1. Confirm Browserless is running and reachable from the machine running `html2rss`.
64+
2. Confirm `BROWSERLESS_IO_WEBSOCKET_URL` points at that running service.
65+
3. Confirm `BROWSERLESS_IO_API_TOKEN` matches the Browserless `TOKEN`.
66+
67+
Example local startup:
68+
69+
```bash
70+
docker run --rm -p 3000:3000 -e "CONCURRENT=10" -e "TOKEN=6R0W53R135510" ghcr.io/browserless/chromium
71+
```
72+
73+
Then run with:
74+
75+
```bash
76+
BROWSERLESS_IO_WEBSOCKET_URL="ws://127.0.0.1:3000" \
77+
BROWSERLESS_IO_API_TOKEN="6R0W53R135510" \
78+
html2rss auto https://example.com/updates --strategy browserless
79+
```
80+
81+
For custom websocket endpoints, `BROWSERLESS_IO_API_TOKEN` is required.
82+
2883
### Configuration Errors
2984

3085
Common configuration-related errors:

src/content/docs/web-application/how-to/use-automatic-feed-generation.mdx

Lines changed: 49 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -57,8 +57,55 @@ That is enough to confirm the self-hosted flow is working.
5757
## Strategy Behavior
5858

5959
- `faraday` is the default strategy and should be your first try for most pages.
60-
- The web UI automatically retries once with `browserless` after a `faraday` failure when the error looks retryable.
61-
- If `browserless` also fails, or if the first error is clearly about auth, URL validation, or unsupported strategy, the UI stops and shows the failure instead of looping.
60+
- During the feed-creation API request (`POST /api/v1/feeds`) from the web UI, a `faraday` submission may be retried once with `browserless` when the first failure looks retryable.
61+
- If that fallback attempt fails, or if the first failure is clearly caused by authentication, URL validation, or an unsupported strategy, the UI stops and shows an error.
62+
- This retry behavior is scoped to feed creation. It is not a general retry layer for later feed rendering (`GET /api/v1/feeds/:token`) or preview loading.
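
The retry rules above can be summarized as a single predicate. This is an illustrative sketch, not the html2rss-web implementation: the symbols in `NON_RETRYABLE`, the `phase` values, and the function name are all hypothetical, and treating every other failure as retryable is a simplifying assumption, since the page does not define what "looks retryable" means.

```ruby
# Failure kinds that the text above says stop immediately (hypothetical symbols).
NON_RETRYABLE = %i[auth url_validation unsupported_strategy].freeze

# Decides whether a failed attempt should be retried once with browserless.
# Only feed creation with the faraday strategy qualifies; a failed browserless
# attempt, feed rendering, and preview loading never retry.
def retry_with_browserless?(phase:, strategy:, failure:)
  phase == :create && strategy == "faraday" && !NON_RETRYABLE.include?(failure)
end

puts retry_with_browserless?(phase: :create, strategy: "faraday", failure: :timeout)
puts retry_with_browserless?(phase: :create, strategy: "faraday", failure: :auth)
puts retry_with_browserless?(phase: :render, strategy: "faraday", failure: :timeout)
```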
63+
64+
## Input URL Guidance (Quality First)
65+
66+
Automatic generation is most successful when the input URL is already a listing/update surface.
67+
68+
- Higher-success inputs:
69+
- newsroom/press listing pages
70+
- category/tag/archive/listing pages
71+
- changelog/release/update pages
72+
- Lower-success inputs:
73+
- generic homepages
74+
- search pages
75+
- app-shell entrypoints (client-rendered shells)
76+
77+
If output quality is poor, switch the input to a direct listing/update URL before assuming the feature is broken.
78+
79+
## Failure Meanings You May See
80+
81+
The backend runtime classifies common extraction failures into the following categories:
82+
83+
- blocked/interstitial surface likely
84+
- app-shell surface likely
85+
- unsupported extraction surface for auto mode
86+
87+
In the current web product flow, these categories are mostly internal/operator-level signals (runtime/logging). They are not guaranteed to appear as labeled categories in the UI.
88+
89+
What users typically see today:
90+
91+
- feed-creation API errors (for example auth/URL/unsupported strategy)
92+
- preview-level fallback text such as `Preview unavailable right now.`
93+
- feed render error payloads when opening feed URLs directly
94+
95+
## Browserless Troubleshooting In `html2rss-web`
96+
97+
If Browserless-backed attempts fail:
98+
99+
- verify the Browserless container/service is running
100+
- verify `BROWSERLESS_IO_WEBSOCKET_URL` is reachable from the web container
101+
- verify `BROWSERLESS_IO_API_TOKEN` matches the Browserless `TOKEN`
102+
103+
For local Compose-based setups, check container health/logs with:
104+
105+
```bash
106+
docker compose ps browserless
107+
docker compose logs browserless
108+
```
62109

63110
## When to Stop and Switch
64111
