| 1 | +--- |
| 2 | +title: "WordPress API" |
| 3 | +description: "Use html2rss auto_source to read WordPress sites through their REST API instead of scraping article HTML." |
| 4 | +--- |
| 5 | + |
| 6 | +The `wordpress_api` scraper is part of `auto_source`. It detects WordPress sites that advertise a REST API in the page `<head>` and then fetches structured post data directly from that API. |
| 7 | + |
| 8 | +This is usually more reliable than HTML scraping because the response already contains fields such as title, content, excerpt, permalink, publish date, and category IDs. |
| 9 | + |
| 10 | +## Detection |
| 11 | + |
| 12 | +The scraper activates when the page contains: |
| 13 | + |
| 14 | +```html |
| 15 | +<link rel="https://api.w.org/" href="https://example.com/wp-json/" /> |
| 16 | +``` |
| 17 | + |
| 18 | +When that tag is present, `html2rss` resolves the API root and requests: |
| 19 | + |
| 20 | +```text |
| 21 | +wp/v2/posts?per_page=100&_fields=id,title,excerpt,content,link,date,categories |
| 22 | +``` |
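
The detection and request steps above can be approximated in plain Ruby. This is a minimal sketch under stated assumptions, not the gem's implementation: `api_root_from` and `posts_url` are hypothetical helper names, the regex only handles `rel` appearing before `href`, and the `href` is assumed to be absolute, as in the tag shown above. A real HTML parser would be more robust.

```ruby
# Hypothetical pattern: the REST API link WordPress advertises in <head>.
# Assumes `rel` precedes `href` inside the same <link> tag.
API_LINK = %r{<link[^>]*rel=["']https://api\.w\.org/["'][^>]*href=["']([^"']+)["']}i

# Hypothetical helper: return the advertised API root, or nil when the tag is absent.
def api_root_from(html)
  html[API_LINK, 1]
end

# Hypothetical helper: build the posts request described above.
def posts_url(api_root)
  # Make sure the root ends with a slash before appending the route.
  root = api_root.end_with?("/") ? api_root : "#{api_root}/"
  fields = %w[id title excerpt content link date categories].join(",")
  "#{root}wp/v2/posts?per_page=100&_fields=#{fields}"
end

html = '<head><link rel="https://api.w.org/" href="https://example.com/wp-json/" /></head>'
root = api_root_from(html)
puts posts_url(root) if root
```

A page without the `<link>` tag yields `nil` from `api_root_from`, which corresponds to the scraper not activating.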
| 23 | + |
| 24 | +## Basic Usage |
| 25 | + |
| 26 | +Enable `auto_source` as usual: |
| 27 | + |
| 28 | +```yml |
| 29 | +channel: |
| 30 | + url: "https://example.com/blog" |
| 31 | +auto_source: {} |
| 32 | +``` |
| 33 | + |
| 34 | +If the target is a standard WordPress site with a public API, no selector configuration is required. |
| 35 | + |
| 36 | +## Configure The Scraper |
| 37 | + |
| 38 | +You can disable the WordPress scraper while keeping the rest of `auto_source` enabled: |
| 39 | + |
| 40 | +```yml |
| 41 | +channel: |
| 42 | + url: "https://example.com/blog" |
| 43 | +auto_source: |
| 44 | + scraper: |
| 45 | + wordpress_api: |
| 46 | + enabled: false |
| 47 | +``` |
| 48 | + |
| 49 | +This is useful if a site exposes the API link but you prefer another auto-source strategy. |
| 50 | + |
| 51 | +## What Gets Extracted |
| 52 | + |
| 53 | +The current scraper maps the WordPress post payload into `html2rss` article fields like this: |
| 54 | + |
| 55 | +| WordPress field | html2rss article field | |
| 56 | +| ------------------ | ---------------------- | |
| 57 | +| `id` | `id` | |
| 58 | +| `title.rendered` | `title` | |
| 59 | +| `content.rendered` | `description` | |
| 60 | +| `link` | `url` | |
| 61 | +| `date` | `published_at` | |
| 62 | +| `categories` | `categories` | |
| 63 | + |
| 64 | +If `content.rendered` is blank, the scraper falls back to `excerpt.rendered`. |
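
The mapping and the excerpt fallback can be illustrated with a small Ruby sketch. It is an approximation, not the gem's code: `article_from` is a hypothetical helper, the sample payload is made up for illustration, "blank" is assumed to mean empty or whitespace-only, and `published_at` is left as the raw date string rather than a parsed time.

```ruby
require "json"

# Hypothetical helper: map one decoded WordPress post (a Hash with string keys)
# to the article fields listed in the table above.
def article_from(post)
  content = post.dig("content", "rendered").to_s
  excerpt = post.dig("excerpt", "rendered").to_s
  # Fall back to the excerpt when the rendered content is blank.
  description = content.strip.empty? ? excerpt : content

  {
    id: post["id"],
    title: post.dig("title", "rendered"),
    description: description,
    url: post["link"],
    published_at: post["date"],
    # Category IDs are kept as strings; names are not resolved.
    categories: Array(post["categories"]).map(&:to_s)
  }
end

# Made-up sample response with a blank `content.rendered`.
payload = <<~JSON
  [{"id": 7, "title": {"rendered": "Hello"}, "content": {"rendered": ""},
    "excerpt": {"rendered": "<p>Short summary</p>"},
    "link": "https://example.com/hello/", "date": "2024-01-02T03:04:05",
    "categories": [3, 12]}]
JSON

articles = JSON.parse(payload).map { |post| article_from(post) }
puts articles.first[:description]
```

Because the sample post has an empty `content.rendered`, the printed description comes from `excerpt.rendered`.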
| 65 | + |
| 66 | +## Behavior Notes |
| 67 | + |
| 68 | +- The scraper uses the shared request session, so it participates in the same request safety model as the rest of the feed build. |
| 69 | +- It resolves relative API links against `channel.url`. |
| 70 | +- It currently stores WordPress category IDs as strings because category-name resolution is not implemented yet. |
| 71 | +- It currently does not resolve `featured_media` into an image URL. |
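
To make the relative-link note concrete, here is a short Ruby sketch of how a relative API `href` resolves against a channel URL using the standard library's `URI.join`. The helper name `resolve_api_root` and the example URLs are assumptions for illustration; the gem may resolve links differently internally.

```ruby
require "uri"

# Hypothetical helper: resolve an API link from the page against the channel URL.
def resolve_api_root(channel_url, href)
  URI.join(channel_url, href).to_s
end

channel_url = "https://example.com/blog/archive"

# Root-relative href: the path replaces the channel URL's path.
puts resolve_api_root(channel_url, "/wp-json/")
# → https://example.com/wp-json/

# Path-relative href: resolved against the channel URL's directory.
puts resolve_api_root(channel_url, "wp-json/")
# → https://example.com/blog/wp-json/

# Absolute href: returned unchanged.
puts resolve_api_root(channel_url, "https://example.org/wp-json/")
# → https://example.org/wp-json/
```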
| 72 | + |
| 73 | +## When To Use It |
| 74 | + |
| 75 | +Prefer `wordpress_api` when: |
| 76 | + |
| 77 | +- The page is clearly powered by WordPress |
| 78 | +- The REST API is public |
| 79 | +- You want more stable extraction than CSS selectors or heuristic HTML scraping |
| 80 | + |
| 81 | +Prefer manual selectors when: |
| 82 | + |
| 83 | +- The site blocks the API or customizes it heavily |
| 84 | +- You need fields that are not exposed by the post endpoint |
| 85 | +- You want complete control over item filtering or presentation |
| 86 | + |
| 87 | +## Related Docs |
| 88 | + |
| 89 | +- [Auto Source](/ruby-gem/reference/auto-source/) |
| 90 | +- [Selectors](/ruby-gem/reference/selectors/) |
| 91 | +- [Scraping JSON Responses](/ruby-gem/how-to/scraping-json/) |