docs: add llms.txt generator script and update root llms.txt by bloxster · Pull Request #21000 · erigontech/erigon

bloxster · 2026-05-05T13:29:37Z

Summary

Adds docs/site/scripts/generate-llms.py — a pure Python (stdlib only, zero npm deps) script that generates LLM-friendly content exports from the Docusaurus source files directly
Generates docs/site/static/llms.txt (page index, 71 pages) and docs/site/static/llms-full.txt (full clean markdown, ~351 KB), served at docs.erigon.tech/llms.txt and docs.erigon.tech/llms-full.txt
Updates the repo-root llms.txt, which was pointing to the deleted docs/gitbook/ folder — now mirrors the Docusaurus-generated index with live docs.erigon.tech URLs
Adds a CI guard in .github/workflows/docs-deploy.yml that runs generate-llms.py --check and the unit tests before the npm build, blocking drift between any of the four committed files (root + static/)
Adds a unit test suite (docs/site/scripts/test_generate_llms.py, 25 tests) covering placeholder preservation, fence transparency, JSX stripping, multi-line expr blocks, frontmatter parsing, and landing-page synthesis

Why a custom script instead of `docusaurus-plugin-llms-txt` (replaces #20993)

PR #20993 used the docusaurus-plugin-llms-txt@0.1.3 npm package. After review, we decided against it:

Wrong approach: the plugin works on compiled HTML output and converts it back to markdown — a lossy round-trip. Our source is already markdown.
Supply chain risk: the package has no declared source repo, is maintained by a personal Gmail address, and has not been updated in 16 months.
Unnecessary dependency: a Python stdlib script does the same job with no external dependencies, no build-time coupling, and cleaner output.

The custom script reads .md/.mdx files directly, strips MDX-specific syntax (imports, JSX components, HTML tags, expressions), extracts frontmatter titles and descriptions, and maps file paths to their deployed docs.erigon.tech URLs. Both Docusaurus plugin instances (main docs and help-center) are supported. Card-grid landing pages (e.g. docs/index.mdx) are detected via the lp-card JSX pattern and synthesized into structured "## Sections" + bullet lists rather than collapsing into a soup of title/desc fragments.
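The path-to-URL mapping mentioned above can be pictured with a minimal sketch. The function name `doc_url`, the `route_prefix` parameter, and the rule that index files collapse to their directory URL are assumptions made for illustration; the real script may resolve routes differently.

```python
from pathlib import PurePosixPath

BASE_URL = "https://docs.erigon.tech"  # live site named in the PR description


def doc_url(rel_path: str, route_prefix: str = "") -> str:
    """Map a source path (relative to a docs instance root) to a site URL.

    Assumptions, not taken from the real script: index files collapse to
    their directory URL, and `route_prefix` is the plugin's route base.
    """
    path = PurePosixPath(rel_path).with_suffix("")  # drop .md / .mdx
    parts = [p for p in path.parts if p != "index"]
    segments = [s for s in (route_prefix.strip("/"), *parts) if s]
    return BASE_URL + "/" + "/".join(segments)
```

Under these assumptions `doc_url("index.mdx")` yields the site root, which matches the Introduction entry pointing at `https://docs.erigon.tech/`.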

How to update

Re-run the script whenever doc content changes:

python3 docs/site/scripts/generate-llms.py

To verify on-disk files match what the script would generate (used by CI):

python3 docs/site/scripts/generate-llms.py --check

The CI guard in docs-deploy.yml runs --check and the unittest suite on every push touching docs/site/**, so a forgotten regeneration after a docs edit will fail the build before deploy.
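For readers unfamiliar with this kind of drift check, here is a rough sketch of what a `--check` mode could do. The function name `check_outputs` and its dict-based interface are hypothetical; only the behaviour (no writes, non-zero exit on drift) comes from this PR.

```python
import sys
from pathlib import Path


def check_outputs(expected: dict) -> int:
    """Return 0 if every file already holds its expected text, else 1.

    `expected` maps an output path to freshly generated content.
    Nothing is written, so the check is safe to run in CI.
    """
    stale = []
    for path, text in expected.items():
        p = Path(path)
        on_disk = p.read_text(encoding="utf-8") if p.exists() else None
        if on_disk != text:
            stale.append(str(p))
    for name in stale:
        print(f"DRIFT: {name} is out of date", file=sys.stderr)
    return 1 if stale else 0
```

A caller would pass the return value to `sys.exit()` so that the workflow step fails on any stale or missing file.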

Updates after review (commit `05a81fcd`)

Addressing yperbasis CHANGES_REQUESTED + Copilot follow-ups:

Blockers

✅ Preserve {ERIGON_VERSION} and other ALL_CAPS identifier placeholders in prose and table cells. The brace-strip regex now skips pure-uppercase identifier braces, mirroring the existing <IP>/<PID> angle-tag guard. Verified against the install-instructions table cell (erigon_{ERIGON_VERSION}_amd64.deb) and the version selector prose ((e.g., v{ERIGON_VERSION})) the reviewer flagged.
✅ Test-plan H1 assertion replaced — the prior ^# count incorrectly counted shell comments inside bash fences (e.g. # Reduce disk latency impact). Now uses ^URL: (one synthetic URL line per page = 71).
✅ Drift guard via CI rather than prebuild (catches drift in all 4 files, no Python coupling in the npm build path).
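The placeholder guard from the first blocker can be sketched with a negative lookahead. The base pattern `\{[^}]{0,120}\}` is quoted elsewhere in this thread; the lookahead and the names `EXPR_RE` and `strip_expressions` are assumptions of this sketch, not the script's actual code.

```python
import re

# Single-line MDX expression, except a pure ALL_CAPS identifier such as
# {ERIGON_VERSION}. The lookahead is this sketch's guess at the guard.
EXPR_RE = re.compile(r"\{(?![A-Z][A-Z0-9_]*\})[^}]{0,120}\}")


def strip_expressions(line: str) -> str:
    """Remove JSX-style {expr} spans but keep {IDENTIFIER} placeholders."""
    return EXPR_RE.sub("", line)
```

With this pattern the table cell `erigon_{ERIGON_VERSION}_amd64.deb` passes through unchanged, while something like `{props.name}` is still removed.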

Non-blocking review items

✅ Singleton "## Erigon Docs" header dropped — the Introduction bullet sits directly under the preamble now.
✅ Landing-page MDX synthesis (no more title/desc soup for docs/index.mdx, staking/index.mdx, help-center/index.mdx, etc.).
✅ parse_frontmatter hardened: skip indented YAML continuations, _safe_int wrapper for sidebar_position.
✅ Nested _category_.json honored via ancestor_positions() for sort tie-breaking.
✅ --check flag for CI.
✅ first_description tightened — only skip lines that look like JSX leaks (^<tag, ^{, arrow-fn) instead of skipping any line that mentions those tokens mid-sentence.
✅ # Requires: Python 3.8+ documented at the top.

Test plan

Deployment

llms.txt renders correctly at docs.erigon.tech/llms.txt after deploy
llms-full.txt renders at docs.erigon.tech/llms-full.txt
Root llms.txt no longer references deleted docs/gitbook/ paths
Re-running the script produces identical output (--check returns OK)

Output quality — run after regenerating

Page index (llms.txt)

Every section header (## Get Started, ## Fundamentals, etc.) appears exactly once
No singleton section header (the Introduction bullet should sit directly under the preamble, no ## Erigon Docs line above it)
Index pages (e.g. get-started/index.mdx) appear before their siblings within each section
No entry has a blank or missing title
No entry description contains raw JSX (<Component, {props., import )

Full export (llms-full.txt)

No page has back-to-back duplicate H1 headings (synthetic title + body's own H1)
Fenced code blocks are intact — content between fences is unchanged, including shell export VAR=… lines
Inline code placeholders survive — {ERIGON_VERSION}, <YOUR_ADDRESS> style tokens are preserved both inside backtick spans and in bare prose / table cells
No truncated shell commands — curl, docker run, erigon invocations with {…} args are complete
Nested list indentation is preserved — sublists appear indented, not flush-left
No raw HTML/JSX tags leak into prose (<Link, <Tabs, <div, <section)
No raw MDX imports/exports leak (import Link from, export const)
Landing pages (docs/index.mdx, help-center/index.mdx, etc.) emit a ## Sections heading + bullet list, not unstructured title/desc fragments

Sanity checks (quick greps)

# Page count — synthetic URL line per page (should equal 71)
grep -c '^URL: ' docs/site/static/llms-full.txt

# Real JSX component leaks — uppercase-then-lowercase tag pattern (should be 0)
grep -cE '<[A-Z][a-z][a-zA-Z]+' docs/site/static/llms-full.txt

# MDX imports/exports leaked outside fences (should be 0)
grep -cE '^(import|export const|export function|export default)' docs/site/static/llms-full.txt

# Identifier placeholders preserved — should be > 0 if source uses any
grep -c '{ERIGON_VERSION}' docs/site/static/llms-full.txt

# Shell `export VAR=` lines preserved inside ```bash fences — should be > 0
grep -c '^export ' docs/site/static/llms-full.txt

Current values (regenerated, commit 05a81fcd): URL 71, JSX leaks 0, MDX imports/exports 0, {ERIGON_VERSION} 15, ^export 9.

Tests

python3 -m unittest discover docs/site/scripts -v
# Ran 25 tests in 0.001s — OK

🤖 Generated with Claude Code

Adds docs/site/scripts/generate-llms.py, a pure Python stdlib script (no npm deps) that reads all .md/.mdx source files from both Docusaurus plugin instances (docs/ and help-center/) and generates:

- docs/site/static/llms.txt: page index with titles/descriptions
- docs/site/static/llms-full.txt: full clean markdown for long-context LLMs

The script is run once and its outputs are committed; re-run whenever docs content changes (or hook it into the pre-build step).

Also updates the repo-root llms.txt, which was pointing to the now-deleted docs/gitbook/ folder. It now mirrors the Docusaurus-generated index with live docs.erigon.tech URLs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

LLMs struggle with files over ~2 MB. Print a warning at build time if the generated llms-full.txt crosses the 1.5 MB threshold so the operator knows to prune content before it becomes a problem.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
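A size guard of this kind might look like the sketch below. The constant `WARN_BYTES`, the function name `size_warning`, and the choice of decimal megabytes are assumptions; only the 1.5 MB threshold comes from the commit message.

```python
WARN_BYTES = 1_500_000  # 1.5 MB threshold (decimal MB assumed)


def size_warning(name: str, size_bytes: int, limit: int = WARN_BYTES):
    """Return a warning string if `size_bytes` exceeds `limit`, else None."""
    if size_bytes > limit:
        return f"WARNING: {name} is {size_bytes:,} bytes (limit {limit:,})"
    return None
```

In a generator script the size would typically come from `Path(name).stat().st_size` after the file is written.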

…ript

The script now writes to two locations in one pass:

- docs/site/static/: served at docs.erigon.tech/llms{,-full}.txt
- repo root: for LLMs/tools that read the GitHub repo directly

Both pairs are identical; no more manual sync needed. Also adds llms-full.txt at the repo root (was missing) as the expected companion to llms.txt per the llms.txt standard.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…full.txt

After removing JSX/HTML tags, text content inside those tags was left indented (e.g. card titles and descriptions on the landing page). Strip leading whitespace from all non-code lines so the output reads as clean prose. Regenerate both static and root copies.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Adds a stdlib-only Python generator to export the Docusaurus docs/help-center MD/MDX sources into LLM-friendly llms.txt (index) and llms-full.txt (concatenated markdown), and updates the repo-root llms.txt to point at the live docs.erigon.tech URLs (instead of the removed docs/gitbook/ paths).

Changes:

Add docs/site/scripts/generate-llms.py to generate/overwrite llms.txt and llms-full.txt in both docs/site/static/ and repo root.
Add generated docs/site/static/llms.txt (+ llms-full.txt) for publication by Docusaurus at the site root.
Update repo-root llms.txt to mirror the generated index with live URLs.

Reviewed changes

Copilot reviewed 3 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `docs/site/scripts/generate-llms.py` | New generator script: walks docs/help-center sources, strips MDX/JSX, builds sorted index + full export, and writes to static + repo root. |
| `docs/site/static/llms.txt` | Generated LLM routing index intended to be served at `docs.erigon.tech/llms.txt`. |
| `docs/site/static/llms-full.txt` | Generated concatenated markdown intended to be served at `docs.erigon.tech/llms-full.txt`. |
| `llms.txt` | Root index updated from deleted GitBook paths to Docusaurus/live `docs.erigon.tech` URLs. |
| `llms-full.txt` | Root full export copy written for repo consumers (mirrors site output). |

Four fixes:

1. Preserve indentation inside fenced code blocks during MDX stripping: track fence state and skip lstrip() while inside a ``` fence so YAML/Python examples with meaningful whitespace aren't corrupted.
2. Fix section sort order: replace the broken is_index heuristic with a depth-first sort key: (section_pos, depth_from_base, is_not_index, position). Section overview pages (index.mdx at depth 1) now reliably lead their section; subsections follow at depth 2+; leaf pages last.
3. Remove unused `import os`.
4. Remove stray space before `:` in loop header.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
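The sort key from fix 2 can be illustrated as follows. The tuple layout is taken from the commit message; the function signature, the way depth is measured, and the sample pages are assumptions of this sketch.

```python
from pathlib import PurePosixPath


def sort_key(rel_path: str, section_pos: int, position: int):
    """Build (section_pos, depth_from_base, is_not_index, position) for a page.

    `rel_path` is relative to the docs root; `section_pos` and `position`
    would come from _category_.json and frontmatter in a real generator.
    """
    path = PurePosixPath(rel_path)
    depth_from_base = len(path.parts) - 1   # number of directories above the file
    is_not_index = path.stem != "index"     # False sorts before True
    return (section_pos, depth_from_base, is_not_index, position)


# Hypothetical pages: (relative path, section position, sidebar position)
pages = [
    ("get-started/installation.md", 1, 2),
    ("get-started/index.mdx", 1, 50),
    ("get-started/easy-nodes/index.mdx", 1, 1),
]
ordered = sorted(pages, key=lambda p: sort_key(*p))
```

Because `False < True` in Python, the section's own `index.mdx` leads its siblings at the same depth even when its `position` value is larger.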

Copilot

Pull request overview

Copilot reviewed 3 out of 5 changed files in this pull request and generated 4 comments.

bloxster · 2026-05-06T06:08:08Z

@copilot apply changes based on the comments in this thread

Agent-Logs-Url: https://github.com/erigontech/erigon/sessions/29b3c04f-e946-4009-88bd-7e494572ffcb Co-authored-by: bloxster <40316187+bloxster@users.noreply.github.com>

Copilot · 2026-05-06T06:11:08Z

@copilot apply changes based on the comments in this thread

Applied all changes from that review thread in commit dfa060c:

Code-fence preservation (strip_mdx): Restructured the function to do a single line-by-line pass that tracks fenced code state. HTML/JSX tag stripping (<...>) and JSX expression stripping ({...}) now only apply outside fenced code blocks — <placeholder> tokens and JSON-RPC {"jsonrpc":"2.0",...} payloads inside code fences are preserved unchanged.
UTF-8 encoding: Added encoding="utf-8" to both _category_.json reads in get_category_label() and get_category_position().

Regenerated llms.txt and llms-full.txt with the fixed script.
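A single line-by-line pass that tracks fence state could look roughly like this. The function name and the deliberately simplistic `TAG_RE` are assumptions for illustration; the real `strip_mdx` applies more passes and guards than shown here.

```python
import re

# Simplistic tag pattern used only for this sketch.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def strip_tags_outside_fences(text: str) -> str:
    """Remove tags from prose lines while copying fenced code through untouched."""
    out, in_fence = [], False
    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence   # toggle on opening and closing markers
            out.append(line)
        elif in_fence:
            out.append(line)          # code lines are never modified
        else:
            out.append(TAG_RE.sub("", line))
    return "\n".join(out)
```

The key property is that substitution only ever sees lines outside a fence, so `<placeholder>` tokens in code examples cannot be touched.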

Copilot

Pull request overview

Copilot reviewed 3 out of 5 changed files in this pull request and generated 3 comments.

Three fixes:

1. Backtick-aware substitution: add _sub_outside_backticks() helper and use it for tag and {expr} stripping so inline code spans like `erigon:v{ERIGON_VERSION}` and `<YOUR_ADDRESS>` are preserved.
2. Preserve indentation: remove lstrip() from non-fence lines to avoid corrupting nested lists and indented continuations; JSX-orphaned whitespace is harmless to LLM consumers.
3. Deduplicate H1 in llms-full.txt: strip the leading H1 from each page body before appending, since many docs pages open with an H1 matching the frontmatter title, causing back-to-back duplicate headings.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
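The backtick-aware substitution from fix 1 can be sketched with `re.split` and a capturing group. This is an illustrative reimplementation, not the PR's `_sub_outside_backticks()`; it only handles single-backtick spans.

```python
import re


def sub_outside_backticks(pattern: str, repl: str, line: str) -> str:
    """Apply a regex substitution only to text outside `inline code` spans."""
    # The capturing group keeps backtick spans in the split result, so
    # odd-indexed pieces are code and even-indexed pieces are prose.
    pieces = re.split(r"(`[^`]*`)", line)
    for i in range(0, len(pieces), 2):
        pieces[i] = re.sub(pattern, repl, pieces[i])
    return "".join(pieces)
```

Splitting first means the stripping regexes never need to know about backticks themselves.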

bloxster · 2026-05-06T09:15:17Z

Copilot review — addressed in `ae6ade4`

Reviewed all 11 Copilot comments. Several were already resolved in the previous update (unused os import, extra space before colon, _category_.json encoding, tag/expression stripping inside fenced code blocks, sort key logic). Three genuine issues remained and are fixed in this commit:

1. Inline code placeholders were being stripped ({...} and <...> outside fences)
Added _sub_outside_backticks() helper that splits on backtick spans before applying regex substitution. Inline code like `erigon:v{ERIGON_VERSION}` and `<YOUR_ADDRESS>` is now preserved correctly.

2. lstrip() was removing meaningful indentation
Removed the blanket lstrip() on non-fence lines. It was intended to clean up orphaned JSX indentation but also corrupted nested list continuations and indented paragraphs. Residual 2–4 space indent from stripped JSX wrappers is harmless to LLM consumers.

3. Duplicate H1 headings in llms-full.txt
Many doc pages open with an H1 matching the frontmatter title. The synthetic # {title} header added during export produced back-to-back duplicates. Now strips the leading H1 from each page body before appending.

The regenerated llms-full.txt and root llms-full.txt are included in the commit.
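For the third item, stripping a leading H1 might be done along these lines. The function name `strip_leading_h1` is hypothetical, and the sketch assumes the H1 is the first non-blank line of the page body.

```python
def strip_leading_h1(body: str) -> str:
    """Drop the first non-blank line if it is an H1 ("# Title")."""
    lines = body.splitlines()
    for i, line in enumerate(lines):
        if not line.strip():
            continue                  # skip leading blank lines
        if line.startswith("# "):
            return "\n".join(lines[i + 1:]).lstrip("\n")
        break                         # first real line is not an H1
    return body
```

Later `# ` lines are left alone, so shell comments or misused H1s deeper in the page are not affected.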

Add _strip_multiline_expr_blocks(), a brace-depth-aware pass that discards {[...].map(...)} and similar inline JSX expression blocks that span multiple lines. These contain nested braces that defeat the single-line \{[^}]{0,120}\} pattern, causing artifacts like '<div key= style={{' and orphaned CSS-property lines to leak into llms-full.txt.

The pass runs after import/export/comment stripping and before the line-by-line tag/expression cleanup. Code fences are tracked to avoid stripping brace-heavy content (JSON examples, shell scripts).

Sanity checks now pass: 0 JSX component tags, 0 MDX imports/exports, 0 div/span leaks, {ERIGON_VERSION} placeholders preserved (9).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
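A brace-depth-aware pass of this sort can be approximated as below. This is a simplified sketch under the assumption that a block starts on a line beginning with `{` that leaves braces unbalanced; it is not the PR's implementation and, like the original, counts braces naively.

```python
def strip_multiline_expr_blocks(text: str) -> str:
    """Drop JSX expression blocks that open with "{" and close on a later line."""
    out, in_fence, depth = [], False, 0
    for line in text.splitlines():
        if depth > 0:
            # Inside a skipped block: keep counting braces, emit nothing.
            depth = max(depth + line.count("{") - line.count("}"), 0)
            continue
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            out.append(line)
            continue
        if not in_fence and line.lstrip().startswith("{"):
            balance = line.count("{") - line.count("}")
            if balance > 0:
                depth = balance       # block continues on later lines
                continue
        out.append(line)
    return "\n".join(out)
```

Balanced single-line expressions are left in place here on purpose, on the assumption that a separate single-line pass handles them.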

…-block stripper

Three fixes addressing Zabaniya review feedback:

1. PascalCase tag pre-pass now requires [A-Z][a-z] (lowercase second letter) so ALL_CAPS placeholders like <IP>, <PID>, <DOWNLOADED_FILE_NAME>, <CHAIN>, <EL_RPC_URL>, <VALIDATOR_INFO_JSON> are no longer stripped when they appear inside code fences.
2. Update comment to accurately describe the restriction (was misleadingly claiming PLACEHOLDER tokens survived under the old [A-Z][a-zA-Z]* regex).
3. _strip_multiline_expr_blocks: fence markers (```) encountered while skip_depth > 0 are now discarded rather than appended to output, so fence markers that are part of a skipped JSX block don't leak through.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
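The effect of requiring a lowercase second letter is easy to check in isolation. The pattern below is the one quoted in the review discussion in this thread; the sample strings and variable names are made up for the demonstration.

```python
import re

# An uppercase first letter followed by a lowercase second letter is
# treated as a JSX component tag; ALL_CAPS placeholders do not match.
COMPONENT_TAG_RE = re.compile(r"<[A-Z][a-z][a-zA-Z]*[^>]*>")

samples = ["<Tabs groupId='os'>", "<Link to='/x'>", "<IP>", "<DOWNLOADED_FILE_NAME>"]
stripped = [COMPONENT_TAG_RE.sub("", s) for s in samples]
```

Component tags collapse to empty strings while the shell-style placeholders are returned untouched.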

Copilot

Pull request overview

Copilot reviewed 3 out of 5 changed files in this pull request and generated 1 comment.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

**export_block_end_re matches }; but not }**
Multi-line export blocks that close with bare `}` (no semicolon) kept `in_export_block = True` forever, silently swallowing all subsequent page content. Make the semicolon optional: `};?`.

**Pre-fence passes not fence-aware**
`{/* comment */}` removal and `<Component>` tag removal ran as raw `re.sub` across the entire text before any fence-tracking. Code examples showing JSX comments or capitalized component tags inside fenced blocks would be corrupted. Add `_split_fenced` / `_apply_nonfenced` helpers and route these passes through them.

**first_description artifact filter too broad**
Pattern `(const|let|var|function|...)` matched ordinary English words: "This function returns..." and "The var keyword..." were incorrectly classified as code artifacts and skipped. Replace with unambiguous patterns only: `=>`, `{`, `}`, `<[a-z][a-z]` (2-char lowercase tags).

**first_description skips sentences starting with [**
`startswith("[")` filtered reference-style link definitions correctly but also dropped valid opening sentences like "[Erigon](url) is a high-performance client." Replace blanket skip with a targeted check for reference definitions (`[label]: url` pattern).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
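Two of these fixes are small enough to sketch together. The `};?` idea and the `[label]: url` check come from the commit message; the anchoring of both patterns, the block-start test, and the function name `strip_export_blocks` are assumptions, and the sketch ignores code fences for brevity.

```python
import re

# "}" optionally followed by ";" on its own line closes an export block.
EXPORT_BLOCK_END_RE = re.compile(r"^\s*\};?\s*$")
# Reference-style link definition: "[label]: url"
REF_DEF_RE = re.compile(r"^\[[^\]]+\]:\s+\S+")


def strip_export_blocks(text: str) -> str:
    """Drop multi-line `export const ... {` blocks up to their closing brace."""
    out, in_export_block = [], False
    for line in text.splitlines():
        if in_export_block:
            if EXPORT_BLOCK_END_RE.match(line):
                in_export_block = False
            continue
        if line.startswith("export const") and line.rstrip().endswith("{"):
            in_export_block = True
            continue
        out.append(line)
    return "\n".join(out)
```

With the optional semicolon, a block closed by a bare `}` no longer swallows the rest of the page, and `REF_DEF_RE` distinguishes link definitions from sentences that merely open with a link.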

yperbasis

Overview

Replaces the abandoned npm-based approach (#20993) with a 459-line Python stdlib script that reads Docusaurus MDX/MD source directly and emits two artifacts (llms.txt index + llms-full.txt corpus) in two locations (docs/site/static/ for the live site, repo root for raw GitHub readers). The repo-root llms.txt previously linked into the deleted docs/gitbook/ tree, so this also fixes a broken pointer.

The reasoning in the PR description (avoiding a stale, single-maintainer npm package; preferring source markdown over HTML round-trip) is sound and well-argued.

What works well

Right approach for the source: parsing markdown directly avoids the lossy compile→HTML→markdown round-trip the npm plugin does.
Fence-awareness is layered correctly: _apply_nonfenced for multi-line, then a per-line in_fence pass, with _sub_outside_backticks protecting inline code spans. Verified empirically: {ERIGON_VERSION} and <YOUR_ADDRESS> survive inside backticks/fences (grep -c '{ERIGON_VERSION}' docs/site/static/llms-full.txt → 8 hits, all inside ```sh blocks).
Smart placeholder guard: the JSX-stripping regex <[A-Z][a-z][a-zA-Z]*[^>]*> requires a lowercase second char, so ALL-CAPS shell placeholders like <DOWNLOADED_FILE_NAME>, <IP>, <PID>, <CHAIN> survive in prose. Verified — 8 hits remain, all legitimate placeholders, not JSX leaks.
Earlier Copilot feedback already addressed: unused os import removed, _category_.json reads have explicit encoding="utf-8", sort key correctly puts index files first (is_not_index = stem != "index" — False sorts before True), leading H1 stripped to avoid duplicates.
Idempotency: re-running the script on unchanged source produces identical output (deterministic ordering, no timestamps).

Concerns

1. Bare `{ERIGON_VERSION}` in prose and tables gets stripped (correctness, medium)

The \{[^}]{0,120}\} pass strips MDX expression placeholders even when they reference frontmatter constants the reader needs. Visible in the generated llms-full.txt:

"select the latest stable version (e.g., v) or whichever version you prefer."
                                        ^ was: v{ERIGON_VERSION}

| 64-bit Intel/AMD | Debian Package (.deb) | erigon__amd64.deb |
                                              ^ was: erigon_{ERIGON_VERSION}_amd64.deb

The version disappears entirely from prose and table cells. Two ways to address:

Source-side: ask docs authors to wrap version placeholders in backticks (`v{ERIGON_VERSION}`) — preserved by the existing pipeline.
Script-side: extend the placeholder heuristic — keep {IDENTIFIER} (uppercase identifier-only) the way <IP> is kept. Replace the strip regex with one that excludes pure-identifier braces, or substitute with a typeset form like [ERIGON_VERSION].

I'd lean script-side because table cells and inline prose both lose information silently and authors won't catch this in review.

2. Test-plan assertion is wrong (PR description, low)

grep -c '^# ' docs/site/static/llms-full.txt — One H1 per page — should equal page count (71)

Actual: 115, not 71. Some doc bodies use # for what should be ## subsections (e.g. # Register with Claude Code in one command inside why-using-erigon, # Mainnet instance / # Sepolia instance inside multiple-instances). Either:

Fix the source docs to use ## for sub-sections, or
Soften the assertion to ≥71.

Otherwise reviewers running the test plan will see a "failure" that's noise.

3. Two pairs of identical files committed (~700 KB; drift risk)

SHA(llms-full.txt) == SHA(docs/site/static/llms-full.txt) and same for llms.txt — verified via API. Pros: raw GitHub readers can find them at the repo root. Cons: doubles repo storage and creates drift potential if anyone hand-edits one.

The script writes both, so drift only happens if a human edits without rerunning. Two mitigations to consider:

CI guard: add a workflow that runs python3 docs/site/scripts/generate-llms.py and then git diff --exit-code on docs/site/** paths. This catches both stale generated files and accidental hand-edits. The existing .github/workflows/docs-deploy.yml (paths: docs/site/**) is a natural place: currently it only runs npm run build, so adding a regen step is one block.
Or drop one copy: keep the root files only and have Docusaurus serve them from static/ via a tiny copy step in package.json's prebuild. Symlinks won't work cross-platform; a one-line copy will.

4. MDX landing pages are noise in `llms-full.txt` (content, low)

The two heavy MDX index files (docs/index.mdx and get-started/index.mdx) collapse to a flat stream of card titles + descriptions:

Erigon Client Documentation
  An efficient, modular Ethereum execution client built for performance, ...

  Get Started
  Hardware requirements, installation options, and first-run guides ...

  Easy Nodes
  ...

— no headings, no list structure, just whitespace-indented title/desc fragments because all the <Link>/<div>/<svg> carriers were stripped. The content is technically still there, just structureless.

Since these landing pages duplicate child page content that's also present in the corpus, this isn't blocking — but worth noting as a known limitation. A cleaner option would be to emit a synthesized H2 + bullet list from the card title/desc pairs when a page has no prose, but that's scope creep.

5. Singleton "Erigon Docs" section header (cosmetic)

llms.txt lines 7-9:

## Erigon Docs

- [Introduction](https://docs.erigon.tech/): Official documentation for Erigon — ...

The fallback "Erigon Docs" if instance == "docs" else "Help Center" is only used by root index pages, producing a 1-item section that's redundant with the # Erigon H1 at the top. Either inline the introduction into the preamble or merge the root index into the first real section.

6. Frontmatter parser fragility (low)

parse_frontmatter is fine for the current corpus but won't handle:

YAML lists (tags:\n - a) — list value lost; subsequent indented items skipped.
Multi-line strings (description: >) — only the first line kept, indented continuation lost.
Embedded colons in unquoted values — handled (uses partition).
Escaped quotes — outer strip catches them, inner \" stays literal.

Also position = int(meta.get("sidebar_position", 50)) will raise on a non-numeric value. Wrap in try/except for resilience.
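A defensive wrapper along the lines suggested here could be as small as the following. The name `safe_int` and the default of 50 (taken from the `meta.get("sidebar_position", 50)` call quoted above) are illustrative.

```python
def safe_int(value, default: int = 50) -> int:
    """Convert a frontmatter value to int, falling back to `default`."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default
```

Usage would then be `position = safe_int(meta.get("sidebar_position"))`, which tolerates both a missing key and a non-numeric value.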

These don't bite today (current frontmatter is simple), but if the docs ever add tags, keywords:, or admonition descriptions, surprises follow. Consider switching to a stricter line-handling subset or just being explicit about the supported subset in the docstring.

7. Nested `_category_.json` ordering ignored

fundamentals/configuring-erigon/_category_.json and fundamentals/modules/_category_.json are not consulted — only the top-level section's category file is read. Within a section, ordering at depth >1 falls back to sidebar_position plus a depth tiebreaker. The current output happens to read sensibly but doesn't match Docusaurus's own resolution. Probably fine for now; flag as a TODO when categories grow.

Suggestions / nits

Pin Python version: shebang says #!/usr/bin/env python3 but uses f"…{var:>8,}" formatting + Path.rglob + re.split patterns that work back to 3.8. A # Requires: Python 3.8+ comment near the top would prevent surprises.
Stale-output detection: the script prints a ⚠ WARNING if llms-full.txt > 1.5 MB — nice. Consider adding a --check flag that exits non-zero if the freshly-generated output differs from the on-disk version. Lets CI use it without committing.
first_description skip rules over-aggressive: lines containing <[a-z][a-z] or { are skipped — fine for JSX leaks, but a prose line that legitimately mentions <...> syntax (e.g. "<...> denotes a placeholder") is also skipped. Low priority, since the H1 + frontmatter description usually win first.
_strip_multiline_expr_blocks brace counting is naive: { / } inside string literals or backticks are also counted. Inside fences it's OK (skipped by in_fence check), but a stray prose line starting with { and containing {...} strings in attributes could mis-count. Hasn't manifested in the current output; flagging for future-proofing.
No tests: a 459-line MDX parser with regex-layered fence handling is exactly the kind of thing that benefits from a few short unit tests pinning each behavior (placeholder preservation, fence transparency, multi-line tag stripping, leading-H1 dedupe). Worth ~30 lines of test for the next person who needs to extend it.

Security considerations

None — the script reads files only, no network or shell-out, no user input. int(...) could raise on malformed input, but that's a crash, not a vuln. Stdlib-only minimizes supply chain surface.

Recommendation

Requesting changes for the following before merge:

Fix or document the {ERIGON_VERSION} stripping (concern #1) — either source side or add a placeholder-identifier exclusion. This actively breaks information in tables.
Correct the test-plan H1 assertion (concern #2) — drop "should equal 71" or fix the docs so it's true.
Add a CI regen check (concern #3) — either the existing docs-deploy.yml or a small standalone job to fail when docs/site/** changes without regenerated llms*.txt. Otherwise the files will go stale within a release cycle.

Concerns #4–#7 are acceptable as follow-ups. Overall the script is well-structured, the design choice (Python stdlib over npm plugin) is correct, and the bulk of earlier review feedback has been addressed.

…hesize landing pages, harden generator

Addresses yperbasis review (CHANGES_REQUESTED) and Copilot follow-ups on PR #21000.

Blockers fixed:

- Preserve `{ERIGON_VERSION}` and other ALL_CAPS identifier placeholders in prose and table cells. The brace-strip regex now skips pure-uppercase identifier braces, mirroring the existing `<IP>`/`<PID>` angle-tag guard. Affected: install instructions, .deb table cells, version refs throughout.
- Add CI guard: `.github/workflows/docs-deploy.yml` runs `generate-llms.py --check` and the new unittest suite before the npm build, catching drift across the four committed llms files (root + static/).

Non-blocking review items:

- Synthesize landing-page index pages (docs/index.mdx, get-started/index.mdx, staking/index.mdx, etc.) into structured "## Sections" + bullet lists extracted from the lp-card JSX, instead of collapsing them into structureless title/desc fragments.
- Drop the singleton "## Erigon Docs" section header: the Introduction bullet now sits directly under the preamble, no redundant 1-item header.
- Harden parse_frontmatter: skip indented YAML continuations (so `tags:` lists no longer pollute keys), `_safe_int` wrapper around `sidebar_position` to tolerate non-numeric values.
- Walk nested `_category_.json` files via `ancestor_positions()` so deeper sections honor their own category position in the sort, not just the top-level dir.
- Add `--check` flag for CI use (exits non-zero on drift, no writes).
- Tighten `first_description` to skip lines that LOOK LIKE JSX leaks (`^<tag`, `^{`, arrow-fn) instead of skipping any line containing those tokens; this preserves prose paragraphs that mention them mid-sentence.
- Add `# Requires: Python 3.8+` documentation at the top.

Tests:

- New `docs/site/scripts/test_generate_llms.py`: 25 tests covering placeholder preservation (inline, prose, table cells), JSX expression stripping, fence transparency for `export VAR=` and Python imports, multi-line expr blocks, landing-page synthesis, frontmatter parsing edge cases, and `_safe_int`.
- Run with `python3 -m unittest discover docs/site/scripts -v`.

Verification on regenerated corpus (71 pages, 351 KB):

- `grep -c '{ERIGON_VERSION}' llms-full.txt` → 15 (was 8; +7 in prose/tables)
- `grep -c '^export ' llms-full.txt` → 9 (was 0; fence preservation working)
- Real JSX leaks (`<[A-Z][a-z]`) → 0
- MDX imports/exports → 0
- URL lines → 71 (one per page)
- `--check` exits 0 on fresh output, 1 on tampered output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…esize landings, regenerate

Brings #21013 in line with the latest state of #21000:

- generate-llms.py: preserve {ERIGON_VERSION}/identifier braces in prose + table cells; multi-line JSX-expr block stripping; landing-page synthesis (## Sections + bullets) for card-grid index pages; singleton section header drop; nested _category_.json honored via ancestor_positions; hardened parse_frontmatter (skip indented continuations, _safe_int); tightened first_description (only skip JSX-leak-shaped lines); --check flag for drift detection; argparse main; Python 3.8+.
- test_generate_llms.py: 25-test unittest suite covering placeholder preservation, fence transparency, JSX/MDX stripping, frontmatter parsing, and landing-page synthesis.
- Regenerated llms.txt + llms-full.txt (root + static/) from current source, picking up sync-to-main's "four prune modes" / typo / yaml exec-form fixes.

When a PR touches docs/site/**, also verify:

- generate-llms.py --check (regenerated outputs match committed files)
- unittest discover (25-test suite stays green)

Mirrors the guard in #21000's docs-deploy.yml so PRs to main get the same llms.txt drift protection that release/3.4 gets at deploy time.

bloxster requested review from AskAlexSharov, Giulio2002 and yperbasis as code owners May 5, 2026 13:29

bloxster mentioned this pull request May 5, 2026

docs: add LLM-friendly content export (llms.txt / llms-full.txt) #20993

Closed

4 tasks

Bloxster and others added 2 commits May 5, 2026 15:30

bloxster marked this pull request as draft May 5, 2026 13:32

bloxster requested a review from Copilot May 5, 2026 13:38

bloxster added docs Caplin Caplin: Consensus Layer, Beacon API labels May 5, 2026

Copilot started reviewing on behalf of bloxster May 5, 2026 13:39 View session

bloxster removed the Caplin Caplin: Consensus Layer, Beacon API label May 5, 2026

Copilot AI reviewed May 5, 2026

bloxster marked this pull request as ready for review May 5, 2026 14:14

bloxster requested a review from Copilot May 5, 2026 14:52

Copilot started reviewing on behalf of bloxster May 5, 2026 14:53 View session

Copilot AI reviewed May 5, 2026

Copilot started work on behalf of bloxster May 6, 2026 06:08 View session

docs: fix strip_mdx code-fence preservation, add utf-8 encoding

dfa060c

Agent-Logs-Url: https://github.com/erigontech/erigon/sessions/29b3c04f-e946-4009-88bd-7e494572ffcb Co-authored-by: bloxster <40316187+bloxster@users.noreply.github.com>

Copilot finished work on behalf of bloxster May 6, 2026 06:11

bloxster requested a review from Copilot May 6, 2026 08:45

Copilot started reviewing on behalf of bloxster May 6, 2026 08:46 View session

Copilot AI reviewed May 6, 2026

Bloxster and others added 2 commits May 6, 2026 11:53

bloxster mentioned this pull request May 6, 2026

docs: sync Docusaurus site to main #21013

Draft

5 tasks

Merge branch 'release/3.4' into docs/llms-generator

efd7b11

bloxster requested a review from Copilot May 7, 2026 07:51

Copilot started reviewing on behalf of bloxster May 7, 2026 07:51 View session

Copilot AI reviewed May 7, 2026

bloxster mentioned this pull request May 7, 2026

ci: skip Go jobs and add docs-site build for docs-only PRs #21028

Open

3 tasks

bloxster and others added 2 commits May 7, 2026 10:14

Potential fix for pull request finding

56807fe

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

yperbasis requested changes May 7, 2026

bloxster requested a review from mriccobene as a code owner May 7, 2026 17:05

bloxster mentioned this pull request May 7, 2026

ci: skip Go jobs and add docs-site build for docs-only PRs #21045

Open

4 tasks
