docs: add llms.txt generator script and update root llms.txt #21000
bloxster wants to merge 13 commits into release/3.4
Adds docs/site/scripts/generate-llms.py, a pure Python stdlib script (no npm deps) that reads all .md/.mdx source files from both Docusaurus plugin instances (docs/ and help-center/) and generates:
- docs/site/static/llms.txt: page index with titles/descriptions
- docs/site/static/llms-full.txt: full clean markdown for long-context LLMs
The script is run once and its outputs are committed; re-run whenever docs content changes (or hook it into the pre-build step).
Also updates the repo-root llms.txt, which was pointing to the now-deleted docs/gitbook/ folder. It now mirrors the Docusaurus-generated index with live docs.erigon.tech URLs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
LLMs struggle with files over ~2 MB. Print a warning at build time if the generated llms-full.txt crosses the 1.5 MB threshold so the operator knows to prune content before it becomes a problem.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
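The size warning described above could look roughly like the following. This is a minimal sketch, not the script's actual code: the names `WARN_BYTES` and `size_warning` are hypothetical, and treating 1.5 MB as 1.5 × 1024 × 1024 bytes is an assumption.

```python
import sys

# Assumption: "1.5 MB" is interpreted as 1.5 * 1024 * 1024 bytes.
WARN_BYTES = int(1.5 * 1024 * 1024)


def size_warning(text, limit=WARN_BYTES):
    """Return a warning string if the UTF-8 size of `text` exceeds `limit`, else None."""
    size = len(text.encode("utf-8"))  # measure bytes, not characters
    if size > limit:
        return f"WARNING: llms-full.txt is {size:,} bytes (threshold {limit:,})"
    return None


if __name__ == "__main__":
    message = size_warning("x" * 10)
    if message:
        # Warn only; the build itself is not failed.
        print(message, file=sys.stderr)
```

Measuring encoded bytes rather than `len(text)` matters for docs with non-ASCII characters, since the file size on disk is what a consumer actually downloads.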
…ript
The script now writes to two locations in one pass:
- docs/site/static/ — served at docs.erigon.tech/llms{,-full}.txt
- repo root — for LLMs/tools that read the GitHub repo directly
Both pairs are identical; no more manual sync needed. Also adds
llms-full.txt at the repo root (was missing) as the expected companion
to llms.txt per the llms.txt standard.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…full.txt
After removing JSX/HTML tags, text content inside those tags was left indented (e.g. card titles and descriptions on the landing page). Strip leading whitespace from all non-code lines so the output reads as clean prose. Regenerate both static and root copies.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
Adds a stdlib-only Python generator to export the Docusaurus docs/help-center MD/MDX sources into LLM-friendly llms.txt (index) and llms-full.txt (concatenated markdown), and updates the repo-root llms.txt to point at the live docs.erigon.tech URLs (instead of the removed docs/gitbook/ paths).
Changes:
- Add `docs/site/scripts/generate-llms.py` to generate/overwrite `llms.txt` and `llms-full.txt` in both `docs/site/static/` and repo root.
- Add generated `docs/site/static/llms.txt` (+ `llms-full.txt`) for publication by Docusaurus at the site root.
- Update repo-root `llms.txt` to mirror the generated index with live URLs.
Reviewed changes
Copilot reviewed 3 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| docs/site/scripts/generate-llms.py | New generator script: walks docs/help-center sources, strips MDX/JSX, builds sorted index + full export, and writes to static + repo root. |
| docs/site/static/llms.txt | Generated LLM routing index intended to be served at docs.erigon.tech/llms.txt. |
| docs/site/static/llms-full.txt | Generated concatenated markdown intended to be served at docs.erigon.tech/llms-full.txt. |
| llms.txt | Root index updated from deleted GitBook paths to Docusaurus/live docs.erigon.tech URLs. |
| llms-full.txt | Root full export copy written for repo consumers (mirrors site output). |
Four fixes:
1. Preserve indentation inside fenced code blocks during MDX stripping: track fence state and skip lstrip() while inside a ``` fence so YAML/Python examples with meaningful whitespace aren't corrupted.
2. Fix section sort order: replace the broken is_index heuristic with a depth-first sort key: (section_pos, depth_from_base, is_not_index, position). Section overview pages (index.mdx at depth 1) now reliably lead their section; subsections follow at depth 2+; leaf pages last.
3. Remove unused `import os`.
4. Remove stray space before `:` in loop header.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
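To illustrate the sort key from fix 2, here is a small sketch of how the tuple `(section_pos, depth_from_base, is_not_index, position)` orders pages. The function name `sort_key` and the way depth is computed (number of parent directories) are assumptions made for this example; the real script may derive them differently.

```python
from pathlib import PurePosixPath


def sort_key(rel_path, section_pos, position):
    """Build the (section_pos, depth_from_base, is_not_index, position) tuple."""
    path = PurePosixPath(rel_path)
    depth_from_base = len(path.parts) - 1  # assumption: count parent directories
    is_not_index = path.stem != "index"    # False sorts before True
    return (section_pos, depth_from_base, is_not_index, position)


# Hypothetical pages: (relative path, section position, sidebar position).
pages = [
    ("get-started/install.md", 1, 2),
    ("get-started/index.mdx", 1, 50),
    ("get-started/easy-nodes/index.mdx", 1, 1),
]
ordered = sorted(pages, key=lambda page: sort_key(*page))
```

Because `False < True` in Python, the section's own `index.mdx` leads even though its sidebar position (50) is larger than its sibling's.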
@copilot apply changes based on the comments in this thread
Agent-Logs-Url: https://github.com/erigontech/erigon/sessions/29b3c04f-e946-4009-88bd-7e494572ffcb Co-authored-by: bloxster <40316187+bloxster@users.noreply.github.com>
Applied all changes from that review thread in commit
Regenerated |
Three fixes:
1. Backtick-aware substitution — add _sub_outside_backticks() helper and
use it for tag and {expr} stripping so inline code spans like
`erigon:v{ERIGON_VERSION}` and `<YOUR_ADDRESS>` are preserved.
2. Preserve indentation — remove lstrip() from non-fence lines to avoid
corrupting nested lists and indented continuations; JSX-orphaned
whitespace is harmless to LLM consumers.
3. Deduplicate H1 in llms-full.txt — strip the leading H1 from each
page body before appending, since many docs pages open with an H1
matching the frontmatter title, causing back-to-back duplicate headings.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
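The backtick-aware substitution from fix 1 can be sketched as follows. This is an illustration under assumptions, not the script's `_sub_outside_backticks()` itself: the name `sub_outside_backticks` is hypothetical, and only single-backtick spans on one line are handled.

```python
import re


def sub_outside_backticks(pattern, repl, line):
    """Apply re.sub only to the parts of `line` outside `inline code` spans."""
    # Splitting on a capturing group keeps the code spans at odd indices.
    parts = re.split(r"(`[^`]*`)", line)
    for i in range(0, len(parts), 2):  # even indices are plain text
        parts[i] = re.sub(pattern, repl, parts[i])
    return "".join(parts)


# The single-line expression pattern quoted elsewhere in this thread.
EXPR = r"\{[^}]{0,120}\}"
cleaned = sub_outside_backticks(EXPR, "", "Run `erigon:v{ERIGON_VERSION}` with {props.x} now")
```

Here `{props.x}` is removed while the placeholder inside the code span is left intact.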
Copilot review: addressed in ae6ade4
Reviewed all 11 Copilot comments. Several were already resolved in the previous update (unused 1. Inline code placeholders were being stripped ( 2. 3. Duplicate H1 headings in The regenerated |
Add _strip_multiline_expr_blocks() — a brace-depth-aware pass that
discards {[...].map(...)} and similar inline JSX expression blocks that
span multiple lines. These contain nested braces that defeat the
single-line \{[^}]{0,120}\} pattern, causing artifacts like
'<div key= style={{' and orphaned CSS-property lines to leak into
llms-full.txt.
The pass runs after import/export/comment stripping and before the
line-by-line tag/expression cleanup. Code fences are tracked to avoid
stripping brace-heavy content (JSON examples, shell scripts).
Sanity checks now pass: 0 JSX component tags, 0 MDX imports/exports,
0 div/span leaks, {ERIGON_VERSION} placeholders preserved (9).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
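A brace-depth-aware pass of the kind described in this commit might look like the sketch below. It is a simplified illustration, not the script's `_strip_multiline_expr_blocks()`: the function name is hypothetical, and it assumes a block to skip starts on a line whose first non-blank character is `{` and that leaves braces unbalanced.

```python
def strip_multiline_expr_blocks(text):
    """Drop multi-line {...} JSX expression blocks that sit outside code fences."""
    out = []
    in_fence = False
    skip_depth = 0  # > 0 while inside a block being discarded
    for line in text.split("\n"):
        if skip_depth > 0:
            # Everything inside the block is dropped, including fence markers.
            skip_depth = max(0, skip_depth + line.count("{") - line.count("}"))
            continue
        stripped = line.strip()
        if stripped.startswith("```"):
            in_fence = not in_fence
            out.append(line)
            continue
        if not in_fence and stripped.startswith("{"):
            depth = line.count("{") - line.count("}")
            if depth > 0:  # block continues on later lines
                skip_depth = depth
                continue
        out.append(line)
    return "\n".join(out)
```

Balanced single-line expressions such as `{x}` are left for the later single-line pattern, and brace-heavy content inside fences (JSON, shell) passes through untouched.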
…-block stripper
Three fixes addressing Zabaniya review feedback:
1. PascalCase tag pre-pass now requires [A-Z][a-z] (lowercase second letter) so ALL_CAPS placeholders like <IP>, <PID>, <DOWNLOADED_FILE_NAME>, <CHAIN>, <EL_RPC_URL>, <VALIDATOR_INFO_JSON> are no longer stripped when they appear inside code fences.
2. Update comment to accurately describe the restriction (was misleadingly claiming PLACEHOLDER tokens survived under the old [A-Z][a-zA-Z]* regex).
3. _strip_multiline_expr_blocks: fence markers (```) encountered while skip_depth > 0 are now discarded rather than appended to output, so fence markers that are part of a skipped JSX block don't leak through.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
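The effect of requiring a lowercase second letter can be checked with a short sketch. The pattern body `[A-Z][a-z][a-zA-Z]*[^>]*` is the one quoted in this thread; the optional leading `/` for closing tags and the names `COMPONENT_TAG_RE` and `strip_component_tags` are assumptions added for the example.

```python
import re

# Assumption: "/?" is added here so closing tags like </Tabs> are covered too.
COMPONENT_TAG_RE = re.compile(r"</?[A-Z][a-z][a-zA-Z]*[^>]*>")


def strip_component_tags(text):
    """Remove PascalCase JSX tags while leaving ALL_CAPS placeholders alone."""
    return COMPONENT_TAG_RE.sub("", text)


result = strip_component_tags('<Tabs groupId="os">text</Tabs> <IP> <DOWNLOADED_FILE_NAME>')
```

`<IP>` fails to match because `P` is not lowercase, so the placeholder survives while `<Tabs ...>` and `</Tabs>` are removed.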
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
**export_block_end_re matches }; but not }**
Multi-line export blocks that close with bare `}` (no semicolon) kept
`in_export_block = True` forever, silently swallowing all subsequent
page content. Make the semicolon optional: `};?`.
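A sketch of the failure mode and the fix, assuming a simple line-based state machine: the names `EXPORT_START_RE`, `EXPORT_END_RE`, and `strip_export_blocks` are hypothetical, and this version is deliberately not fence-aware.

```python
import re

EXPORT_START_RE = re.compile(r"^export\s+(const|let|var|function|default)\b")
# The fix: semicolon optional, so a bare "}" also closes the block.
EXPORT_END_RE = re.compile(r"^\};?\s*$")


def strip_export_blocks(lines):
    """Drop MDX export statements, including multi-line blocks."""
    out = []
    in_export_block = False
    for line in lines:
        if in_export_block:
            if EXPORT_END_RE.match(line):
                in_export_block = False
            continue  # swallow block contents and the closing line
        if EXPORT_START_RE.match(line):
            # Only enter block mode when braces are left open on this line.
            if line.count("{") > line.count("}"):
                in_export_block = True
            continue
        out.append(line)
    return out
```

With the stricter `^\};\s*$` pattern, the bare `}` would never match and every following line would be swallowed.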
**Pre-fence passes not fence-aware**
`{/* comment */}` removal and `<Component>` tag removal ran as raw
`re.sub` across the entire text before any fence-tracking. Code
examples showing JSX comments or capitalized component tags inside
fenced blocks would be corrupted. Add `_split_fenced` / `_apply_nonfenced`
helpers and route these passes through them.
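One way to route a whole-text pass around fenced blocks is sketched below. The helper names mirror the ones in the commit message but the bodies are assumptions: fences are detected only by a leading triple backtick, and nested or tilde fences are not handled.

```python
import re


def split_fenced(text):
    """Split text into (is_fenced, chunk) segments, fences included in their chunk."""
    segments, buf, in_fence = [], [], False
    for line in text.splitlines(keepends=True):
        if line.lstrip().startswith("```"):
            if in_fence:
                buf.append(line)  # closing fence belongs to the fenced chunk
                segments.append((True, "".join(buf)))
                buf, in_fence = [], False
            else:
                if buf:
                    segments.append((False, "".join(buf)))
                buf, in_fence = [line], True
        else:
            buf.append(line)
    if buf:
        segments.append((in_fence, "".join(buf)))
    return segments


def apply_nonfenced(text, fn):
    """Apply fn to every non-fenced segment and leave fenced segments untouched."""
    return "".join(chunk if fenced else fn(chunk) for fenced, chunk in split_fenced(text))


def drop_jsx_comments(chunk):
    return re.sub(r"\{/\*.*?\*/\}", "", chunk, flags=re.S)
```

Calling `apply_nonfenced(text, drop_jsx_comments)` removes `{/* ... */}` comments from prose while a JSX example inside a fence keeps its comment.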
**first_description artifact filter too broad**
Pattern `(const|let|var|function|...)` matched ordinary English words:
"This function returns..." and "The var keyword..." were incorrectly
classified as code artifacts and skipped. Replace with unambiguous
patterns only: `=>`, `{`, `}`, `<[a-z][a-z]` (2-char lowercase tags).
**first_description skips sentences starting with [**
`startswith("[")` filtered reference-style link definitions correctly
but also dropped valid opening sentences like "[Erigon](url) is a
high-performance client." Replace blanket skip with a targeted check
for reference definitions (`[label]: url` pattern).
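The two `first_description` changes can be sketched together. This is an approximation of the behaviour described at this commit, with hypothetical names (`REF_DEF_RE`, `ARTIFACT_RE`) and an assumed rule that heading lines are skipped.

```python
import re

# Reference-style link definition: "[label]: url"
REF_DEF_RE = re.compile(r"^\[[^\]]+\]:\s+\S+")
# Unambiguous code artifacts only: arrow functions, braces, lowercase tags.
ARTIFACT_RE = re.compile(r"=>|[{}]|<[a-z][a-z]")


def first_description(lines):
    """Return the first line that looks like prose, or an empty string."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # assumption: blank lines and headings are skipped
        if REF_DEF_RE.match(stripped) or ARTIFACT_RE.search(stripped):
            continue
        return stripped
    return ""
```

A sentence opening with an inline link such as `[Erigon](url) is ...` passes, because `REF_DEF_RE` requires a colon directly after the closing bracket.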
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
yperbasis left a comment
Overview
Replaces the abandoned npm-based approach (#20993) with a 459-line Python stdlib script that reads Docusaurus MDX/MD source directly and emits two artifacts (llms.txt index + llms-full.txt corpus) in two locations (docs/site/static/ for the live site, repo root for raw GitHub readers). The repo-root llms.txt previously linked into the deleted docs/gitbook/ tree, so this also fixes a broken pointer.
The reasoning in the PR description (avoiding a stale, single-maintainer npm package; preferring source markdown over HTML round-trip) is sound and well-argued.
What works well
- Right approach for the source: parsing markdown directly avoids the lossy compile→HTML→markdown round-trip the npm plugin does.
- Fence-awareness is layered correctly: `_apply_nonfenced` for multi-line, then a per-line `in_fence` pass, with `_sub_outside_backticks` protecting inline code spans. Verified empirically: `{ERIGON_VERSION}` and `<YOUR_ADDRESS>` survive inside backticks/fences (`grep -c '{ERIGON_VERSION}' docs/site/static/llms-full.txt` → 8 hits, all inside sh-fenced blocks).
- Smart placeholder guard: the JSX-stripping regex `<[A-Z][a-z][a-zA-Z]*[^>]*>` requires a lowercase second char, so ALL-CAPS shell placeholders like `<DOWNLOADED_FILE_NAME>`, `<IP>`, `<PID>`, `<CHAIN>` survive in prose. Verified: 8 hits remain, all legitimate placeholders, not JSX leaks.
- Earlier Copilot feedback already addressed: unused `os` import removed, `_category_.json` reads have explicit `encoding="utf-8"`, sort key correctly puts index files first (`is_not_index = stem != "index"`; `False` sorts before `True`), leading H1 stripped to avoid duplicates.
- Idempotency: re-running the script on unchanged source produces identical output (deterministic ordering, no timestamps).
Concerns
1. Bare {ERIGON_VERSION} in prose and tables gets stripped (correctness, medium)
The \{[^}]{0,120}\} pass strips MDX expression placeholders even when they reference frontmatter constants the reader needs. Visible in the generated llms-full.txt:
"select the latest stable version (e.g., v) or whichever version you prefer."
^ was: v{ERIGON_VERSION}
| 64-bit Intel/AMD | Debian Package (.deb) | erigon__amd64.deb |
^ was: erigon_v{ERIGON_VERSION}_amd64.deb
The version disappears entirely from prose and table cells. Two ways to address:
- Source-side: ask docs authors to wrap version placeholders in backticks (`v{ERIGON_VERSION}`), which the existing pipeline preserves.
- Script-side: extend the placeholder heuristic to keep `{IDENTIFIER}` (uppercase identifier-only) the way `<IP>` is kept. Replace the strip regex with one that excludes pure-identifier braces, or substitute a typeset form like `[ERIGON_VERSION]`.
I'd lean script-side because table cells and inline prose both lose information silently and authors won't catch this in review.
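The script-side option could be sketched with a negative lookahead added to the existing strip pattern. The exact regex and the name `strip_expressions` are assumptions for illustration; what counts as a "pure identifier" here is uppercase letters, digits, and underscores.

```python
import re

# Strip {expr} unless the braces wrap a pure uppercase identifier.
BRACE_STRIP_RE = re.compile(r"\{(?![A-Z][A-Z0-9_]*\})[^}]{0,120}\}")


def strip_expressions(line):
    """Remove MDX expressions but keep {IDENTIFIER}-style placeholders."""
    return BRACE_STRIP_RE.sub("", line)


prose = strip_expressions("latest stable version (e.g., v{ERIGON_VERSION}) {props.title}")
cell = strip_expressions("| erigon_v{ERIGON_VERSION}_amd64.deb |")
```

The lookahead rejects a match only when the whole brace content is an uppercase identifier, so `{props.title}` is still removed.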
2. Test-plan assertion is wrong (PR description, low)
`grep -c '^# ' docs/site/static/llms-full.txt`: One H1 per page, should equal page count (71)
Actual: 115, not 71. Some doc bodies use `#` for what should be `##` subsections (e.g. `# Register with Claude Code in one command` inside why-using-erigon, `# Mainnet instance` / `# Sepolia instance` inside multiple-instances). Either:
- Fix the source docs to use `##` for sub-sections, or
- Soften the assertion to `≥71`.
Otherwise reviewers running the test plan will see a "failure" that's noise.
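When checking a count like this, it can help to separate real H1 headings from `# ` lines that are comments inside code fences, which a plain `grep -c '^# '` cannot do. The helper below is a hypothetical sketch, not part of the script or the test plan.

```python
def count_h1_outside_fences(text):
    """Count lines starting with '# ' that are not inside a ``` fence."""
    count = 0
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        elif not in_fence and line.startswith("# "):
            count += 1
    return count


sample = "# Title\n```bash\n# Reduce disk latency impact\n```\n# Mainnet instance\n"
h1_count = count_h1_outside_fences(sample)
```

In the sample, the shell comment inside the bash fence is ignored and the two headings outside it are counted.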
3. Two pairs of identical files committed (~700 KB; drift risk)
SHA(llms-full.txt) == SHA(docs/site/static/llms-full.txt) and same for llms.txt — verified via API. Pros: raw GitHub readers can find them at the repo root. Cons: doubles repo storage and creates drift potential if anyone hand-edits one.
The script writes both, so drift only happens if a human edits without rerunning. Two mitigations to consider:
- CI guard: add a workflow that runs `python3 docs/site/scripts/generate-llms.py` and `git diff --exit-code` on `docs/site/**` paths. Catches both stale generated files and accidental hand-edits. The existing `.github/workflows/docs-deploy.yml` (`paths: docs/site/**`) is a natural place; currently it only runs `npm run build`, so adding a regen step is one block.
- Or drop one copy: keep the root files only and have Docusaurus serve them from `static/` via a tiny copy step in `package.json`'s `prebuild`. Symlinks won't work cross-platform; a one-line copy will.
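A drift check in the spirit of the CI guard can also be done in Python without `git diff`, by comparing freshly generated content against the committed files. The function names `find_drift` and `check_exit_code` are hypothetical, and the mapping of output paths to generated text is assumed to come from the generator.

```python
from pathlib import Path


def find_drift(expected):
    """Return the paths whose on-disk content is missing or differs.

    `expected` maps each output Path to the freshly generated text.
    """
    stale = []
    for path, content in expected.items():
        if not path.exists() or path.read_text(encoding="utf-8") != content:
            stale.append(path)
    return stale


def check_exit_code(expected):
    """Exit status for CI: 0 when everything matches, 1 on any drift."""
    return 1 if find_drift(expected) else 0
```

Passing all four outputs (root and `static/` copies) in one mapping covers both stale regeneration and hand-edits to a single copy.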
4. MDX landing pages are noise in llms-full.txt (content, low)
The two heavy MDX index files (docs/index.mdx and get-started/index.mdx) collapse to a flat stream of card titles + descriptions:
Erigon Client Documentation
An efficient, modular Ethereum execution client built for performance, ...
Get Started
Hardware requirements, installation options, and first-run guides ...
Easy Nodes
...
— no headings, no list structure, just whitespace-indented title/desc fragments because all the <Link>/<div>/<svg> carriers were stripped. The content is technically still there, just structureless.
Since these landing pages duplicate child page content that's also present in the corpus, this isn't blocking — but worth noting as a known limitation. A cleaner option would be to emit a synthesized H2 + bullet list from the card title/desc pairs when a page has no prose, but that's scope creep.
5. Singleton "Erigon Docs" section header (cosmetic)
llms.txt line 7-9:
## Erigon Docs
- [Introduction](https://docs.erigon.tech/): Official documentation for Erigon — ...
The fallback `"Erigon Docs" if instance == "docs" else "Help Center"` is only used by root index pages, producing a 1-item section that's redundant with the `# Erigon` H1 at the top. Either inline the introduction into the preamble or merge the root index into the first real section.
6. Frontmatter parser fragility (low)
`parse_frontmatter` is fine for the current corpus but won't handle:
- YAML lists (`tags:\n - a`): list value lost; subsequent indented items skipped.
- Multi-line strings (`description: >`): only the first line kept, indented continuation lost.
- Embedded colons in unquoted values: handled (uses `partition`).
- Escaped quotes: outer strip catches them, inner `\"` stays literal.
Also `position = int(meta.get("sidebar_position", 50))` will raise on a non-numeric value. Wrap in try/except for resilience.
These don't bite today (current frontmatter is simple), but if the docs ever add tags, keywords:, or admonition descriptions, surprises follow. Consider switching to a stricter line-handling subset or just being explicit about the supported subset in the docstring.
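A minimal sketch of the "explicit supported subset" idea: a flat `key: value` parser that skips indented continuations, plus a tolerant integer conversion. The names `safe_int` and `parse_frontmatter` follow the discussion, but the bodies are assumptions and not the script's code.

```python
def safe_int(value, default=50):
    """Convert to int, falling back to `default` on non-numeric input."""
    try:
        return int(str(value).strip())
    except (TypeError, ValueError):
        return default


def parse_frontmatter(text):
    """Parse flat `key: value` frontmatter; lists and folded strings are ignored."""
    meta = {}
    lines = text.split("\n")
    if not lines or lines[0].strip() != "---":
        return meta
    for line in lines[1:]:
        if line.strip() == "---":
            break
        # Skip blank lines and indented continuations (YAML list items, folded text).
        if not line.strip() or line[0] in " \t":
            continue
        key, sep, value = line.partition(":")  # only the first colon splits
        if sep:
            meta[key.strip()] = value.strip().strip("\"'")
    return meta
```

Using `partition` keeps colons inside values such as `"Install: Linux"`, and `safe_int(meta.get("sidebar_position", 50))` no longer raises on a value like `abc`.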
7. Nested _category_.json ordering ignored
fundamentals/configuring-erigon/_category_.json and fundamentals/modules/_category_.json are not consulted — only the top-level section's category file is read. Within a section, ordering at depth >1 falls back to sidebar_position plus a depth tiebreaker. The current output happens to read sensibly but doesn't match Docusaurus's own resolution. Probably fine for now; flag as a TODO when categories grow.
Suggestions / nits
- Pin Python version: shebang says `#!/usr/bin/env python3` but uses `f"…{var:>8,}"` formatting + `Path.rglob` + `re.split` patterns that work back to 3.8. A `# Requires: Python 3.8+` comment near the top would prevent surprises.
- Stale-output detection: the script prints a `⚠ WARNING` if `llms-full.txt > 1.5 MB`, which is nice. Consider adding a `--check` flag that exits non-zero if the freshly-generated output differs from the on-disk version. Lets CI use it without committing.
- `first_description` skip rules over-aggressive: lines containing `<[a-z][a-z]` or `{` are skipped, which is fine for JSX leaks, but a prose line that legitimately mentions `<...>` syntax (e.g. "`<...>` denotes a placeholder") is also skipped. Low priority, since the H1 + frontmatter description usually win first.
- `_strip_multiline_expr_blocks` brace counting is naive: `{`/`}` inside string literals or backticks are also counted. Inside fences it's OK (skipped by `in_fence` check), but a stray prose line starting with `{` and containing `{...}` strings in attributes could mis-count. Hasn't manifested in the current output; flagging for future-proofing.
- No tests: a 459-line MDX parser with regex layered fence-handling is exactly the kind of thing that benefits from a few dozen-line `unittest`s pinning each behavior (placeholder preservation, fence transparency, multi-line tag stripping, leading-H1 dedupe). Worth ~30 lines of test for the next person who needs to extend it.
Security considerations
None — the script reads files only, no network or shell-out, no user input. int(...) could raise on malformed input, but that's a crash, not a vuln. Stdlib-only minimizes supply chain surface.
Recommendation
Requesting changes for the following before merge:
- Fix or document the `{ERIGON_VERSION}` stripping (concern #1): either source side or add a placeholder-identifier exclusion. This actively breaks information in tables.
- Correct the test-plan H1 assertion (concern #2): drop "should equal 71" or fix the docs so it's true.
- Add a CI regen check (concern #3): either the existing `docs-deploy.yml` or a small standalone job to fail when `docs/site/**` changes without regenerated `llms*.txt`. Otherwise the files will go stale within a release cycle.
Concerns #4–#7 are acceptable as follow-ups. Overall the script is well-structured, the design choice (Python stdlib over npm plugin) is correct, and the bulk of earlier review feedback has been addressed.
…hesize landing pages, harden generator
Addresses yperbasis review (CHANGES_REQUESTED) and Copilot follow-ups on PR #21000.
Blockers fixed:
- Preserve `{ERIGON_VERSION}` and other ALL_CAPS identifier placeholders in prose and table cells. The brace-strip regex now skips pure-uppercase identifier braces, mirroring the existing `<IP>`/`<PID>` angle-tag guard. Affected: install instructions, .deb table cells, version refs throughout.
- Add CI guard: `.github/workflows/docs-deploy.yml` runs `generate-llms.py --check` and the new unittest suite before the npm build, catching drift across the four committed llms files (root + static/).
Non-blocking review items:
- Synthesize landing-page index pages (docs/index.mdx, get-started/index.mdx, staking/index.mdx, etc.) into structured "## Sections" + bullet lists extracted from the lp-card JSX, instead of collapsing them into structureless title/desc fragments.
- Drop the singleton "## Erigon Docs" section header: the Introduction bullet now sits directly under the preamble, no redundant 1-item header.
- Harden parse_frontmatter: skip indented YAML continuations (so `tags:` lists no longer pollute keys), `_safe_int` wrapper around `sidebar_position` to tolerate non-numeric values.
- Walk nested `_category_.json` files via `ancestor_positions()` so deeper sections honor their own category position in the sort, not just the top-level dir.
- Add `--check` flag for CI use (exits non-zero on drift, no writes).
- Tighten `first_description` to skip lines that LOOK LIKE JSX leaks (`^<tag`, `^{`, arrow-fn) instead of skipping any line containing those tokens, which preserves prose paragraphs that mention them mid-sentence.
- Add `# Requires: Python 3.8+` documentation at the top.
Tests:
- New `docs/site/scripts/test_generate_llms.py`: 25 tests covering placeholder preservation (inline, prose, table cells), JSX expression stripping, fence transparency for `export VAR=` and Python imports, multi-line expr blocks, landing-page synthesis, frontmatter parsing edge cases, and `_safe_int`.
- Run with `python3 -m unittest discover docs/site/scripts -v`.
Verification on regenerated corpus (71 pages, 351 KB):
- `grep -c '{ERIGON_VERSION}' llms-full.txt` → 15 (was 8; +7 in prose/tables)
- `grep -c '^export ' llms-full.txt` → 9 (was 0; fence preservation working)
- Real JSX leaks (`<[A-Z][a-z]`) → 0
- MDX imports/exports → 0
- URL lines → 71 (one per page)
- `--check` exits 0 on fresh output, 1 on tampered output.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…esize landings, regenerate
Brings #21013 in line with the latest state of #21000:
- generate-llms.py: preserve {ERIGON_VERSION}/identifier braces in prose + table cells; multi-line JSX-expr block stripping; landing-page synthesis (## Sections + bullets) for card-grid index pages; singleton section header drop; nested _category_.json honored via ancestor_positions; hardened parse_frontmatter (skip indented continuations, _safe_int); tightened first_description (only skip JSX-leak-shaped lines); --check flag for drift detection; argparse main; Python 3.8+.
- test_generate_llms.py: 25-test unittest suite covering placeholder preservation, fence transparency, JSX/MDX stripping, frontmatter parsing, and landing-page synthesis.
- Regenerated llms.txt + llms-full.txt (root + static/) from current source, picking up sync-to-main's "four prune modes" / typo / yaml exec-form fixes.
When a PR touches docs/site/**, also verify:
- generate-llms.py --check (regenerated outputs match committed files)
- unittest discover (25-test suite stays green)
Mirrors the guard in #21000's docs-deploy.yml so PRs to main get the same llms.txt drift protection that release/3.4 gets at deploy time.
Summary
- `docs/site/scripts/generate-llms.py`: a pure Python (stdlib only, zero npm deps) script that generates LLM-friendly content exports from the Docusaurus source files directly
- `docs/site/static/llms.txt` (page index, 71 pages) and `docs/site/static/llms-full.txt` (full clean markdown, ~351 KB), served at `docs.erigon.tech/llms.txt` and `docs.erigon.tech/llms-full.txt`
- … `llms.txt`, which was pointing to the deleted `docs/gitbook/` folder, now mirrors the Docusaurus-generated index with live `docs.erigon.tech` URLs
- … `.github/workflows/docs-deploy.yml` that runs `generate-llms.py --check` and the unit tests before the npm build, blocking drift between any of the four committed files (root + `static/`)
- … (`docs/site/scripts/test_generate_llms.py`, 25 tests) covering placeholder preservation, fence transparency, JSX stripping, multi-line expr blocks, frontmatter parsing, and landing-page synthesis
Why a custom script instead of `docusaurus-plugin-llms-txt` (replaces #20993)
PR #20993 used the `docusaurus-plugin-llms-txt@0.1.3` npm package. After review, we decided against it:
The custom script reads `.md`/`.mdx` files directly, strips MDX-specific syntax (imports, JSX components, HTML tags, expressions), extracts frontmatter titles and descriptions, and maps file paths to their deployed `docs.erigon.tech` URLs. Both Docusaurus plugin instances (main docs and help-center) are supported. Card-grid landing pages (e.g. `docs/index.mdx`) are detected via the `lp-card` JSX pattern and synthesized into structured "## Sections" + bullet lists rather than collapsing into a soup of title/desc fragments.
How to update
Re-run the script whenever doc content changes: `python3 docs/site/scripts/generate-llms.py`
To verify on-disk files match what the script would generate (used by CI): `python3 docs/site/scripts/generate-llms.py --check`
The CI guard in `docs-deploy.yml` runs `--check` and the unittest suite on every push touching `docs/site/**`, so a forgotten regeneration after a docs edit will fail the build before deploy.
Updates after review (commit `05a81fcd`)
Addressing yperbasis CHANGES_REQUESTED + Copilot follow-ups:
Blockers
- … `{ERIGON_VERSION}` and other ALL_CAPS identifier placeholders in prose and table cells. The brace-strip regex now skips pure-uppercase identifier braces, mirroring the existing `<IP>`/`<PID>` angle-tag guard. Verified against the install-instructions table cell (`erigon_{ERIGON_VERSION}_amd64.deb`) and the version selector prose (`(e.g., v{ERIGON_VERSION})`) the reviewer flagged.
- … `^#` count incorrectly counted shell comments inside `bash` fences (e.g. `# Reduce disk latency impact`). Now uses `^URL:` (one synthetic URL line per page = 71).
- … `prebuild` (catches drift in all 4 files, no Python coupling in the npm build path).
Non-blocking review items
- … `docs/index.mdx`, `staking/index.mdx`, `help-center/index.mdx`, etc.).
- `parse_frontmatter` hardened: skip indented YAML continuations, `_safe_int` wrapper for `sidebar_position`.
- … `_category_.json` honored via `ancestor_positions()` for sort tie-breaking.
- `--check` flag for CI.
- `first_description` tightened: only skip lines that look like JSX leaks (`^<tag`, `^{`, arrow-fn) instead of skipping any line that mentions those tokens mid-sentence.
- `# Requires: Python 3.8+` documented at the top.
Test plan
Deployment
- `llms.txt` renders correctly at `docs.erigon.tech/llms.txt` after deploy
- `llms-full.txt` renders at `docs.erigon.tech/llms-full.txt`
- … `llms.txt` no longer references deleted `docs/gitbook/` paths
- … `--check` returns OK)
Output quality (run after regenerating)
Page index (`llms.txt`)
- … `## Get Started`, `## Fundamentals`, etc.) appears exactly once
- … `## Erigon Docs` line above it)
- … `get-started/index.mdx`) appear before their siblings within each section
- … `<Component`, `{props.`, `import`)
Full export (`llms-full.txt`)
- … `export VAR=…` lines
- `{ERIGON_VERSION}`, `<YOUR_ADDRESS>` style tokens are preserved both inside backtick spans and in bare prose / table cells
- … `curl`, `docker run`, `erigon` invocations with `{…}` args are complete
- … `<Link`, `<Tabs`, `<div`, `<section`)
- … `import Link from`, `export const`)
- … `docs/index.mdx`, `help-center/index.mdx`, etc.) emit a `## Sections` heading + bullet list, not unstructured title/desc fragments
Sanity checks (quick greps)
Current values (regenerated, commit `05a81fcd`): URL 71, JSX leaks 0, MDX imports/exports 0, `{ERIGON_VERSION}` 15, `^export` 9.
Tests
`python3 -m unittest discover docs/site/scripts -v` # Ran 25 tests in 0.001s (OK)
🤖 Generated with Claude Code