docs: add llms.txt generator script and update root llms.txt #21000

Open
bloxster wants to merge 13 commits into release/3.4 from docs/llms-generator

Conversation

@bloxster
Collaborator

@bloxster bloxster commented May 5, 2026

Summary

  • Adds docs/site/scripts/generate-llms.py — a pure Python (stdlib only, zero npm deps) script that generates LLM-friendly content exports from the Docusaurus source files directly
  • Generates docs/site/static/llms.txt (page index, 71 pages) and docs/site/static/llms-full.txt (full clean markdown, ~351 KB), served at docs.erigon.tech/llms.txt and docs.erigon.tech/llms-full.txt
  • Updates the repo-root llms.txt, which was pointing to the deleted docs/gitbook/ folder — now mirrors the Docusaurus-generated index with live docs.erigon.tech URLs
  • Adds a CI guard in .github/workflows/docs-deploy.yml that runs generate-llms.py --check and the unit tests before the npm build, blocking drift between any of the four committed files (root + static/)
  • Adds a unit test suite (docs/site/scripts/test_generate_llms.py, 25 tests) covering placeholder preservation, fence transparency, JSX stripping, multi-line expr blocks, frontmatter parsing, and landing-page synthesis

Why a custom script instead of docusaurus-plugin-llms-txt (replaces #20993)

PR #20993 used the docusaurus-plugin-llms-txt@0.1.3 npm package. After review, we decided against it:

  • Wrong approach: the plugin works on compiled HTML output and converts it back to markdown — a lossy round-trip. Our source is already markdown.
  • Supply chain risk: the package has no declared source repo, is maintained by a personal Gmail address, and has not been updated in 16 months.
  • Unnecessary dependency: a Python stdlib script does the same job with no external dependencies, no build-time coupling, and cleaner output.

The custom script reads .md/.mdx files directly, strips MDX-specific syntax (imports, JSX components, HTML tags, expressions), extracts frontmatter titles and descriptions, and maps file paths to their deployed docs.erigon.tech URLs. Both Docusaurus plugin instances (main docs and help-center) are supported. Card-grid landing pages (e.g. docs/index.mdx) are detected via the lp-card JSX pattern and synthesized into structured "## Sections" + bullet lists rather than collapsing into a soup of title/desc fragments.

How to update

Re-run the script whenever doc content changes:

python3 docs/site/scripts/generate-llms.py

To verify on-disk files match what the script would generate (used by CI):

python3 docs/site/scripts/generate-llms.py --check

The CI guard in docs-deploy.yml runs --check and the unittest suite on every push touching docs/site/**, so a forgotten regeneration after a docs edit will fail the build before deploy.
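
A minimal sketch of how a drift check like `--check` could work, assuming the generator can produce each output as an in-memory string before writing. The name `check_drift` and the `expected` mapping are hypothetical and not taken from generate-llms.py.

```python
from pathlib import Path
from typing import Dict, List


def check_drift(expected: Dict[Path, str]) -> List[Path]:
    """Return the paths whose on-disk content differs from the expected text.

    `expected` maps each output path to the text the generator would write.
    A missing file counts as drift. Nothing is written to disk.
    """
    stale: List[Path] = []
    for path, text in expected.items():
        try:
            on_disk = path.read_text(encoding="utf-8")
        except FileNotFoundError:
            on_disk = None
        if on_disk != text:
            stale.append(path)
    return stale
```

A caller could exit non-zero when the returned list is non-empty, which is the behavior a CI step needs in order to fail the build.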

Updates after review (commit 05a81fcd)

Addressing yperbasis CHANGES_REQUESTED + Copilot follow-ups:

Blockers

  • ✅ Preserve {ERIGON_VERSION} and other ALL_CAPS identifier placeholders in prose and table cells. The brace-strip regex now skips pure-uppercase identifier braces, mirroring the existing <IP>/<PID> angle-tag guard. Verified against the install-instructions table cell (erigon_{ERIGON_VERSION}_amd64.deb) and the version selector prose ((e.g., v{ERIGON_VERSION})) the reviewer flagged.
  • ✅ Test-plan H1 assertion replaced — the prior ^# count incorrectly counted shell comments inside bash fences (e.g. # Reduce disk latency impact). Now uses ^URL: (one synthetic URL line per page = 71).
  • ✅ Drift guard via CI rather than prebuild (catches drift in all 4 files, no Python coupling in the npm build path).

Non-blocking review items

  • ✅ Singleton "## Erigon Docs" header dropped — the Introduction bullet sits directly under the preamble now.
  • ✅ Landing-page MDX synthesis (no more title/desc soup for docs/index.mdx, staking/index.mdx, help-center/index.mdx, etc.).
  • parse_frontmatter hardened: skip indented YAML continuations, _safe_int wrapper for sidebar_position.
  • ✅ Nested _category_.json honored via ancestor_positions() for sort tie-breaking.
  • --check flag for CI.
  • first_description tightened — only skip lines that look like JSX leaks (^<tag, ^{, arrow-fn) instead of skipping any line that mentions those tokens mid-sentence.
  • # Requires: Python 3.8+ documented at the top.

Test plan

Deployment

  • llms.txt renders correctly at docs.erigon.tech/llms.txt after deploy
  • llms-full.txt renders at docs.erigon.tech/llms-full.txt
  • Root llms.txt no longer references deleted docs/gitbook/ paths
  • Re-running the script produces identical output (--check returns OK)

Output quality — run after regenerating

Page index (llms.txt)

  • Every section header (## Get Started, ## Fundamentals, etc.) appears exactly once
  • No singleton section header (the Introduction bullet should sit directly under the preamble, no ## Erigon Docs line above it)
  • Index pages (e.g. get-started/index.mdx) appear before their siblings within each section
  • No entry has a blank or missing title
  • No entry description contains raw JSX (<Component, {props., import )

Full export (llms-full.txt)

  • No page has back-to-back duplicate H1 headings (synthetic title + body's own H1)
  • Fenced code blocks are intact — content between fences is unchanged, including shell export VAR=… lines
  • Inline code placeholders survive — {ERIGON_VERSION}, <YOUR_ADDRESS> style tokens are preserved both inside backtick spans and in bare prose / table cells
  • No truncated shell commands — curl, docker run, erigon invocations with {…} args are complete
  • Nested list indentation is preserved — sublists appear indented, not flush-left
  • No raw HTML/JSX tags leak into prose (<Link, <Tabs, <div, <section)
  • No raw MDX imports/exports leak (import Link from, export const)
  • Landing pages (docs/index.mdx, help-center/index.mdx, etc.) emit a ## Sections heading + bullet list, not unstructured title/desc fragments

Sanity checks (quick greps)

# Page count — synthetic URL line per page (should equal 71)
grep -c '^URL: ' docs/site/static/llms-full.txt

# Real JSX component leaks — uppercase-then-lowercase tag pattern (should be 0)
grep -cE '<[A-Z][a-z][a-zA-Z]+' docs/site/static/llms-full.txt

# MDX imports/exports leaked outside fences (should be 0)
grep -cE '^(import|export const|export function|export default)' docs/site/static/llms-full.txt

# Identifier placeholders preserved — should be > 0 if source uses any
grep -c '{ERIGON_VERSION}' docs/site/static/llms-full.txt

# Shell `export VAR=` lines preserved inside ```bash fences — should be > 0
grep -c '^export ' docs/site/static/llms-full.txt

Current values (regenerated, commit 05a81fcd): URL 71, JSX leaks 0, MDX imports/exports 0, {ERIGON_VERSION} 15, ^export 9.

Tests

python3 -m unittest discover docs/site/scripts -v
# Ran 25 tests in 0.001s — OK

🤖 Generated with Claude Code

Adds docs/site/scripts/generate-llms.py — a pure Python stdlib script
(no npm deps) that reads all .md/.mdx source files from both Docusaurus
plugin instances (docs/ and help-center/) and generates:

  docs/site/static/llms.txt      — page index with titles/descriptions
  docs/site/static/llms-full.txt — full clean markdown for long-context LLMs

The script is run once and its outputs are committed; re-run whenever
docs content changes (or hook it into the pre-build step).

Also updates the repo-root llms.txt, which was pointing to the now-deleted
docs/gitbook/ folder. It now mirrors the Docusaurus-generated index with
live docs.erigon.tech URLs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bloxster and others added 2 commits May 5, 2026 15:30
LLMs struggle with files over ~2 MB. Print a warning at build time
if the generated llms-full.txt crosses the 1.5 MB threshold so the
operator knows to prune content before it becomes a problem.
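
A rough sketch of the size check described above. The function name, the message wording, and the reading of "1.5 MB" as 1.5 * 1024 * 1024 bytes are assumptions for illustration only.

```python
from typing import Optional

# Assumption: "1.5 MB" is interpreted here as 1.5 * 1024 * 1024 bytes.
WARN_THRESHOLD_BYTES = int(1.5 * 1024 * 1024)


def size_warning(text: str, limit: int = WARN_THRESHOLD_BYTES) -> Optional[str]:
    """Return a warning message if the UTF-8 size of `text` exceeds `limit`."""
    size = len(text.encode("utf-8"))
    if size > limit:
        return f"WARNING: llms-full.txt is {size:,} bytes (threshold {limit:,})"
    return None
```

Measuring the encoded byte length rather than `len(text)` keeps the check aligned with the size of the file on disk when the docs contain non-ASCII characters.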

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ript

The script now writes to two locations in one pass:
  - docs/site/static/ — served at docs.erigon.tech/llms{,-full}.txt
  - repo root         — for LLMs/tools that read the GitHub repo directly

Both pairs are identical; no more manual sync needed. Also adds
llms-full.txt at the repo root (was missing) as the expected companion
to llms.txt per the llms.txt standard.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@bloxster bloxster marked this pull request as draft May 5, 2026 13:32
…full.txt

After removing JSX/HTML tags, text content inside those tags was left
indented (e.g. card titles and descriptions on the landing page). Strip
leading whitespace from all non-code lines so the output reads as clean
prose. Regenerate both static and root copies.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@bloxster bloxster requested a review from Copilot May 5, 2026 13:38
@bloxster bloxster added docs Caplin Caplin: Consensus Layer, Beacon API labels May 5, 2026
@bloxster bloxster removed the Caplin Caplin: Consensus Layer, Beacon API label May 5, 2026
Contributor

Copilot AI left a comment

Pull request overview

Adds a stdlib-only Python generator to export the Docusaurus docs/help-center MD/MDX sources into LLM-friendly llms.txt (index) and llms-full.txt (concatenated markdown), and updates the repo-root llms.txt to point at the live docs.erigon.tech URLs (instead of the removed docs/gitbook/ paths).

Changes:

  • Add docs/site/scripts/generate-llms.py to generate/overwrite llms.txt and llms-full.txt in both docs/site/static/ and repo root.
  • Add generated docs/site/static/llms.txt (+ llms-full.txt) for publication by Docusaurus at the site root.
  • Update repo-root llms.txt to mirror the generated index with live URLs.

Reviewed changes

Copilot reviewed 3 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
| File | Description |
| --- | --- |
| docs/site/scripts/generate-llms.py | New generator script: walks docs/help-center sources, strips MDX/JSX, builds sorted index + full export, and writes to static + repo root. |
| docs/site/static/llms.txt | Generated LLM routing index intended to be served at docs.erigon.tech/llms.txt. |
| docs/site/static/llms-full.txt | Generated concatenated markdown intended to be served at docs.erigon.tech/llms-full.txt. |
| llms.txt | Root index updated from deleted GitBook paths to Docusaurus/live docs.erigon.tech URLs. |
| llms-full.txt | Root full export copy written for repo consumers (mirrors site output). |

Comment thread docs/site/scripts/generate-llms.py Outdated
Comment thread docs/site/scripts/generate-llms.py Outdated
Comment thread docs/site/scripts/generate-llms.py Outdated
Comment thread docs/site/scripts/generate-llms.py Outdated
Four fixes:

1. Preserve indentation inside fenced code blocks during MDX stripping —
   track fence state and skip lstrip() while inside a ``` fence so YAML/
   Python examples with meaningful whitespace aren't corrupted.

2. Fix section sort order — replace the broken is_index heuristic with a
   depth-first sort key: (section_pos, depth_from_base, is_not_index,
   position). Section overview pages (index.mdx at depth 1) now reliably
   lead their section; subsections follow at depth 2+; leaf pages last.

3. Remove unused `import os`.

4. Remove stray space before `:` in loop header.
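
The sort key from fix 2 can be sketched as follows. This is an illustration under assumptions: the `Page` record and its field names are hypothetical, and only the tuple shape (section_pos, depth_from_base, is_not_index, position) comes from the commit message.

```python
from pathlib import PurePosixPath
from typing import NamedTuple, Tuple


class Page(NamedTuple):
    rel_path: str      # path relative to the docs base, e.g. "get-started/index.mdx"
    section_pos: int   # position of the top-level section
    position: int      # sidebar_position of the page itself


def sort_key(page: Page) -> Tuple[int, int, bool, int]:
    path = PurePosixPath(page.rel_path)
    depth_from_base = len(path.parts) - 1   # number of directories above the file
    is_not_index = path.stem != "index"     # False sorts before True
    return (page.section_pos, depth_from_base, is_not_index, page.position)
```

Because `False` sorts before `True`, an index page leads its siblings at the same depth even when its `position` value is larger.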

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@bloxster bloxster marked this pull request as ready for review May 5, 2026 14:14
@bloxster bloxster requested a review from Copilot May 5, 2026 14:52
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 5 changed files in this pull request and generated 4 comments.


Comment thread docs/site/scripts/generate-llms.py Outdated
Comment thread docs/site/scripts/generate-llms.py Outdated
Comment thread docs/site/scripts/generate-llms.py Outdated
Comment thread docs/site/scripts/generate-llms.py Outdated
@bloxster
Collaborator Author

bloxster commented May 6, 2026

@copilot apply changes based on the comments in this thread

Contributor

Copilot AI commented May 6, 2026

@copilot apply changes based on the comments in this thread

Applied all changes from that review thread in commit dfa060c:

  1. Code-fence preservation (strip_mdx): Restructured the function to do a single line-by-line pass that tracks fenced code state. HTML/JSX tag stripping (<...>) and JSX expression stripping ({...}) now only apply outside fenced code blocks — <placeholder> tokens and JSON-RPC {"jsonrpc":"2.0",...} payloads inside code fences are preserved unchanged.

  2. UTF-8 encoding: Added encoding="utf-8" to both _category_.json reads in get_category_label() and get_category_position().

Regenerated llms.txt and llms-full.txt with the fixed script.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 5 changed files in this pull request and generated 3 comments.

Comment thread docs/site/scripts/generate-llms.py Outdated
Comment thread docs/site/scripts/generate-llms.py Outdated
Comment thread docs/site/scripts/generate-llms.py
Three fixes:

1. Backtick-aware substitution — add _sub_outside_backticks() helper and
   use it for tag and {expr} stripping so inline code spans like
   `erigon:v{ERIGON_VERSION}` and `<YOUR_ADDRESS>` are preserved.

2. Preserve indentation — remove lstrip() from non-fence lines to avoid
   corrupting nested lists and indented continuations; JSX-orphaned
   whitespace is harmless to LLM consumers.

3. Deduplicate H1 in llms-full.txt — strip the leading H1 from each
   page body before appending, since many docs pages open with an H1
   matching the frontmatter title, causing back-to-back duplicate headings.
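
A small sketch of the leading-H1 removal in fix 3, assuming the body is already cleaned markdown. The helper name and the exact pattern are illustrative and may differ from the script.

```python
import re

# Matches optional leading whitespace, then a single "# Title" line.
# "## Subheading" does not match because a space must follow the first "#".
_LEADING_H1 = re.compile(r"\A\s*# [^\n]*\n?")


def strip_leading_h1(body: str) -> str:
    """Drop the first line of `body` if it is an H1; otherwise return it unchanged."""
    return _LEADING_H1.sub("", body, count=1)
```

Anchoring with `\A` means only an H1 at the very start of the body is removed, so `# comment` lines inside later bash fences are untouched.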

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@bloxster
Collaborator Author

bloxster commented May 6, 2026

Copilot review — addressed in ae6ade4

Reviewed all 11 Copilot comments. Several were already resolved in the previous update (unused os import, extra space before colon, _category_.json encoding, tag/expression stripping inside fenced code blocks, sort key logic). Three genuine issues remained and are fixed in this commit:

1. Inline code placeholders were being stripped ({...} and <...> outside fences)
Added _sub_outside_backticks() helper that splits on backtick spans before applying regex substitution. Inline code like `erigon:v{ERIGON_VERSION}` and `<YOUR_ADDRESS>` is now preserved correctly.

2. lstrip() was removing meaningful indentation
Removed the blanket lstrip() on non-fence lines. It was intended to clean up orphaned JSX indentation but also corrupted nested list continuations and indented paragraphs. Residual 2–4 space indent from stripped JSX wrappers is harmless to LLM consumers.

3. Duplicate H1 headings in llms-full.txt
Many doc pages open with an H1 matching the frontmatter title. The synthetic # {title} header added during export produced back-to-back duplicates. Now strips the leading H1 from each page body before appending.

The regenerated llms-full.txt and root llms-full.txt are included in the commit.
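
For illustration, the backtick-aware substitution from item 1 could look roughly like this. It is a sketch based only on the description above (split on backtick spans, substitute elsewhere); the real `_sub_outside_backticks()` signature is not shown in this thread, so the one here is an assumption.

```python
import re

# One capturing group, so re.split keeps the inline code spans in the result.
_INLINE_CODE = re.compile(r"(`[^`\n]*`)")


def sub_outside_backticks(pattern: re.Pattern, repl: str, line: str) -> str:
    """Apply `pattern.sub(repl, ...)` only to text outside `inline code` spans."""
    parts = _INLINE_CODE.split(line)
    # Odd indexes are the captured backtick spans; leave them untouched.
    return "".join(
        part if i % 2 else pattern.sub(repl, part)
        for i, part in enumerate(parts)
    )
```

With this shape, the same helper can wrap both the tag pattern and the `{...}` pattern, so inline placeholders survive either pass.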

Bloxster and others added 2 commits May 6, 2026 11:53
Add _strip_multiline_expr_blocks() — a brace-depth-aware pass that
discards {[...].map(...)} and similar inline JSX expression blocks that
span multiple lines.  These contain nested braces that defeat the
single-line \{[^}]{0,120}\} pattern, causing artifacts like
'<div key= style={{' and orphaned CSS-property lines to leak into
llms-full.txt.

The pass runs after import/export/comment stripping and before the
line-by-line tag/expression cleanup.  Code fences are tracked to avoid
stripping brace-heavy content (JSON examples, shell scripts).
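
A simplified sketch of such a brace-depth pass, under stated assumptions: a block starts on a non-fenced line that begins with `{` and has more opening than closing braces, and everything is dropped until the depth returns to zero. The real `_strip_multiline_expr_blocks()` may use different start conditions.

```python
from typing import List

FENCE = "`" * 3  # triple backtick, built this way to keep the example fence-safe


def strip_multiline_expr_blocks(text: str) -> str:
    out: List[str] = []
    in_fence = False
    skip_depth = 0  # > 0 while inside a multi-line {...} block
    for line in text.split("\n"):
        stripped = line.strip()
        if skip_depth > 0:
            # Inside a skipped block: drop the line, fence markers included.
            skip_depth = max(skip_depth + line.count("{") - line.count("}"), 0)
            continue
        if stripped.startswith(FENCE):
            in_fence = not in_fence
            out.append(line)
            continue
        if not in_fence and stripped.startswith("{"):
            depth = line.count("{") - line.count("}")
            if depth > 0:
                skip_depth = depth
                continue
        out.append(line)
    return "\n".join(out)
```

Counting raw braces is deliberately naive: braces inside string literals are counted too, which is acceptable for a sketch but worth keeping in mind for unusual inputs.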

Sanity checks now pass: 0 JSX component tags, 0 MDX imports/exports,
0 div/span leaks, {ERIGON_VERSION} placeholders preserved (9).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-block stripper

Three fixes addressing Zabaniya review feedback:

1. PascalCase tag pre-pass now requires [A-Z][a-z] (lowercase second letter)
   so ALL_CAPS placeholders like <IP>, <PID>, <DOWNLOADED_FILE_NAME>,
   <CHAIN>, <EL_RPC_URL>, <VALIDATOR_INFO_JSON> are no longer stripped
   when they appear inside code fences.

2. Update comment to accurately describe the restriction (was misleadingly
   claiming PLACEHOLDER tokens survived under the old [A-Z][a-zA-Z]* regex).

3. _strip_multiline_expr_blocks: fence markers (```) encountered while
   skip_depth > 0 are now discarded rather than appended to output, so
   fence markers that are part of a skipped JSX block don't leak through.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 5 changed files in this pull request and generated 1 comment.

Comment thread docs/site/scripts/generate-llms.py Outdated
bloxster and others added 2 commits May 7, 2026 10:14
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
**export_block_end_re matches }; but not }**
Multi-line export blocks that close with bare `}` (no semicolon) kept
`in_export_block = True` forever, silently swallowing all subsequent
page content. Make the semicolon optional: `};?`.

**Pre-fence passes not fence-aware**
`{/* comment */}` removal and `<Component>` tag removal ran as raw
`re.sub` across the entire text before any fence-tracking. Code
examples showing JSX comments or capitalized component tags inside
fenced blocks would be corrupted. Add `_split_fenced` / `_apply_nonfenced`
helpers and route these passes through them.
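
As an illustration of the fence-aware routing, here is one way helpers like `_split_fenced` / `_apply_nonfenced` could be written. The signatures and the chunk representation are assumptions; only the idea (run a transform on non-fenced text and pass fenced blocks through unchanged) comes from the commit message.

```python
import re
from typing import Callable, List, Tuple

FENCE = "`" * 3  # triple backtick, built this way to keep the example fence-safe


def split_fenced(text: str) -> List[Tuple[bool, str]]:
    """Split text into (is_fenced, chunk) pairs; fence lines stay in fenced chunks."""
    chunks: List[Tuple[bool, str]] = []
    current: List[str] = []
    in_fence = False
    for line in text.splitlines(keepends=True):
        if line.lstrip().startswith(FENCE):
            if in_fence:
                current.append(line)
                chunks.append((True, "".join(current)))
                current, in_fence = [], False
            else:
                if current:
                    chunks.append((False, "".join(current)))
                current, in_fence = [line], True
        else:
            current.append(line)
    if current:
        # An unterminated fence is treated as fenced to the end of the text.
        chunks.append((in_fence, "".join(current)))
    return chunks


def apply_nonfenced(text: str, fn: Callable[[str], str]) -> str:
    """Apply `fn` to every non-fenced chunk and leave fenced chunks untouched."""
    return "".join(chunk if fenced else fn(chunk) for fenced, chunk in split_fenced(text))
```

A multi-line pass such as JSX comment removal can then be routed through `apply_nonfenced` instead of running `re.sub` over the whole text.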

**first_description artifact filter too broad**
Pattern `(const|let|var|function|...)` matched ordinary English words:
"This function returns..." and "The var keyword..." were incorrectly
classified as code artifacts and skipped. Replace with unambiguous
patterns only: `=>`, `{`, `}`, `<[a-z][a-z]` (2-char lowercase tags).

**first_description skips sentences starting with [**
`startswith("[")` filtered reference-style link definitions correctly
but also dropped valid opening sentences like "[Erigon](url) is a
high-performance client." Replace blanket skip with a targeted check
for reference definitions (`[label]: url` pattern).
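
The targeted check for reference definitions might look like the sketch below. The regex and the function name are illustrative assumptions; the commit only states that the `[label]: url` shape is what gets skipped.

```python
import re

# A reference-style link definition: "[label]: destination"
_REF_DEF = re.compile(r"^\[[^\]]+\]:\s+\S+")


def is_reference_definition(line: str) -> bool:
    """True for lines like "[erigon]: https://erigon.tech", False for inline links."""
    return bool(_REF_DEF.match(line.strip()))
```

An opening sentence that starts with an inline link has `(` right after the closing bracket rather than `:`, so it is no longer filtered out.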

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Member

@yperbasis yperbasis left a comment

Overview

Replaces the abandoned npm-based approach (#20993) with a 459-line Python stdlib script that reads Docusaurus MDX/MD source directly and emits two artifacts (llms.txt index + llms-full.txt corpus) in two locations (docs/site/static/ for the live site, repo root for raw GitHub readers). The repo-root llms.txt previously linked into the deleted docs/gitbook/ tree, so this also fixes a broken pointer.

The reasoning in the PR description (avoiding a stale, single-maintainer npm package; preferring source markdown over HTML round-trip) is sound and well-argued.

What works well

  • Right approach for the source: parsing markdown directly avoids the lossy compile→HTML→markdown round-trip the npm plugin does.
  • Fence-awareness is layered correctly: _apply_nonfenced for multi-line, then a per-line in_fence pass, with _sub_outside_backticks protecting inline code spans. Verified empirically: {ERIGON_VERSION} and <YOUR_ADDRESS> survive inside backticks/fences (grep -c '{ERIGON_VERSION}' docs/site/static/llms-full.txt → 8 hits, all inside ```sh blocks).
  • Smart placeholder guard: the JSX-stripping regex <[A-Z][a-z][a-zA-Z]*[^>]*> requires a lowercase second char, so ALL-CAPS shell placeholders like <DOWNLOADED_FILE_NAME>, <IP>, <PID>, <CHAIN> survive in prose. Verified — 8 hits remain, all legitimate placeholders, not JSX leaks.
  • Earlier Copilot feedback already addressed: unused os import removed, _category_.json reads have explicit encoding="utf-8", sort key correctly puts index files first (is_not_index = stem != "index"; False sorts before True), leading H1 stripped to avoid duplicates.
  • Idempotency: re-running the script on unchanged source produces identical output (deterministic ordering, no timestamps).
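
The placeholder guard can be demonstrated with the regex quoted above. The wrapper function is a hypothetical name added for illustration, and this sketch covers only tags matched by that one pattern (closing tags such as `</Tabs>` would need a separate pattern).

```python
import re

# Pattern as quoted in the review: uppercase first letter, lowercase second letter.
JSX_TAG = re.compile(r"<[A-Z][a-z][a-zA-Z]*[^>]*>")


def strip_jsx_tags(line: str) -> str:
    """Remove PascalCase JSX opening tags; ALL-CAPS placeholders do not match."""
    return JSX_TAG.sub("", line)
```

`<PID>` fails at the second character (`I` is not lowercase), which is why shell placeholders pass through while `<Tabs ...>` is removed.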

Concerns

1. Bare {ERIGON_VERSION} in prose and tables gets stripped (correctness, medium)

The \{[^}]{0,120}\} pass strips MDX expression placeholders even when they reference frontmatter constants the reader needs. Visible in the generated llms-full.txt:

"select the latest stable version (e.g., v) or whichever version you prefer."
                                        ^ was: v{ERIGON_VERSION}

| 64-bit Intel/AMD | Debian Package (.deb) | erigon__amd64.deb |
                                              ^ was: erigon_v{ERIGON_VERSION}_amd64.deb

The version disappears entirely from prose and table cells. Two ways to address:

  • Source-side: ask docs authors to wrap version placeholders in backticks (`v{ERIGON_VERSION}`) — preserved by the existing pipeline.
  • Script-side: extend the placeholder heuristic — keep {IDENTIFIER} (uppercase identifier-only) the way <IP> is kept. Replace the strip regex with one that excludes pure-identifier braces, or substitute with a typeset form like [ERIGON_VERSION].

I'd lean script-side because table cells and inline prose both lose information silently and authors won't catch this in review.
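
One possible shape for the script-side option, as a sketch: keep the existing `\{[^}]{0,120}\}` pattern but add a negative lookahead that excludes braces containing only an uppercase identifier. The exact character class for "identifier" is an assumption here, and the pattern eventually adopted in the script may differ.

```python
import re

# Strip {...} expressions, except {IDENTIFIER} made of uppercase letters,
# digits, and underscores (for example {ERIGON_VERSION}).
BRACE_EXPR = re.compile(r"\{(?![A-Z][A-Z0-9_]*\})[^}]{0,120}\}")


def strip_brace_exprs(line: str) -> str:
    return BRACE_EXPR.sub("", line)
```

This mirrors the `<IP>`-style guard for angle tags: mixed-case or dotted expressions such as `{props.title}` are still removed, while placeholders in prose and table cells are kept.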

2. Test-plan assertion is wrong (PR description, low)

grep -c '^# ' docs/site/static/llms-full.txt — One H1 per page — should equal page count (71)

Actual: 115, not 71. Some doc bodies use # for what should be ## subsections (e.g. # Register with Claude Code in one command inside why-using-erigon, # Mainnet instance / # Sepolia instance inside multiple-instances). Either:

  • Fix the source docs to use ## for sub-sections, or
  • Soften the assertion to ≥71.

Otherwise reviewers running the test plan will see a "failure" that's noise.

3. Two pairs of identical files committed (~700 KB; drift risk)

SHA(llms-full.txt) == SHA(docs/site/static/llms-full.txt) and same for llms.txt — verified via API. Pros: raw GitHub readers can find them at the repo root. Cons: doubles repo storage and creates drift potential if anyone hand-edits one.

The script writes both, so drift only happens if a human edits without rerunning. Two mitigations to consider:

  • CI guard: add a workflow that runs python3 docs/site/scripts/generate-llms.py and then git diff --exit-code on the docs/site/** paths. This catches both stale generated files and accidental hand-edits. The existing .github/workflows/docs-deploy.yml (paths: docs/site/**) is a natural place: it currently only runs npm run build, so adding a regen step is one block.
  • Or drop one copy: keep the root files only and have Docusaurus serve them from static/ via a tiny copy step in package.json's prebuild. Symlinks won't work cross-platform; a one-line copy will.

4. MDX landing pages are noise in llms-full.txt (content, low)

The two heavy MDX index files (docs/index.mdx and get-started/index.mdx) collapse to a flat stream of card titles + descriptions:

Erigon Client Documentation
  An efficient, modular Ethereum execution client built for performance, ...

  Get Started
  Hardware requirements, installation options, and first-run guides ...

  Easy Nodes
  ...

— no headings, no list structure, just whitespace-indented title/desc fragments because all the <Link>/<div>/<svg> carriers were stripped. The content is technically still there, just structureless.

Since these landing pages duplicate child page content that's also present in the corpus, this isn't blocking — but worth noting as a known limitation. A cleaner option would be to emit a synthesized H2 + bullet list from the card title/desc pairs when a page has no prose, but that's scope creep.

5. Singleton "Erigon Docs" section header (cosmetic)

llms.txt lines 7-9:

## Erigon Docs

- [Introduction](https://docs.erigon.tech/): Official documentation for Erigon — ...

The fallback "Erigon Docs" if instance == "docs" else "Help Center" is only used by root index pages, producing a 1-item section that's redundant with the # Erigon H1 at the top. Either inline the introduction into the preamble or merge the root index into the first real section.

6. Frontmatter parser fragility (low)

parse_frontmatter is fine for the current corpus but won't handle:

  • YAML lists (tags:\n - a) — list value lost; subsequent indented items skipped.
  • Multi-line strings (description: >) — only the first line kept, indented continuation lost.
  • Embedded colons in unquoted values — handled (uses partition).
  • Escaped quotes — outer strip catches them, inner \" stays literal.

Also position = int(meta.get("sidebar_position", 50)) will raise on a non-numeric value. Wrap in try/except for resilience.

These don't bite today (current frontmatter is simple), but if the docs ever add tags, keywords:, or admonition descriptions, surprises follow. Consider switching to a stricter line-handling subset or just being explicit about the supported subset in the docstring.
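
To make the supported subset concrete, here is a sketch of a tolerant frontmatter reader plus an integer wrapper. It is an assumption about one reasonable design, not the script's actual `parse_frontmatter`: top-level `key: value` lines only, indented continuations skipped, and a fallback of 50 for `sidebar_position` as in the line quoted above.

```python
from typing import Dict


def safe_int(value: object, default: int = 50) -> int:
    """Convert to int, falling back to `default` on non-numeric input."""
    try:
        return int(str(value).strip())
    except (TypeError, ValueError):
        return default


def parse_frontmatter(text: str) -> Dict[str, str]:
    """Read top-level `key: value` pairs from a leading --- block."""
    meta: Dict[str, str] = {}
    lines = text.split("\n")
    if not lines or lines[0].strip() != "---":
        return meta
    for line in lines[1:]:
        if line.strip() == "---":
            break
        if not line.strip() or line[0] in " \t":
            continue  # blank line or indented YAML continuation: skip
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            meta[key.strip()] = value.strip().strip("\"'")
    return meta
```

List values and folded strings are still lost under this approach, but they no longer leak into other keys or crash the sort.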

7. Nested _category_.json ordering ignored

fundamentals/configuring-erigon/_category_.json and fundamentals/modules/_category_.json are not consulted — only the top-level section's category file is read. Within a section, ordering at depth >1 falls back to sidebar_position plus a depth tiebreaker. The current output happens to read sensibly but doesn't match Docusaurus's own resolution. Probably fine for now; flag as a TODO when categories grow.
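
A hypothetical sketch of how nested category files could be consulted: walk from the docs base down to the page's directory and collect each level's `position` from `_category_.json`. The function name `ancestor_positions`, its signature, and the default of 50 are assumptions for illustration.

```python
import json
from pathlib import Path
from typing import List, Tuple


def ancestor_positions(page: Path, base: Path, default: int = 50) -> Tuple[int, ...]:
    """Return the category position of every directory between `base` and `page`."""
    positions: List[int] = []
    current = base
    for part in page.relative_to(base).parent.parts:
        current = current / part
        category_file = current / "_category_.json"
        pos = default
        if category_file.is_file():
            try:
                data = json.loads(category_file.read_text(encoding="utf-8"))
                pos = int(data.get("position", default))
            except (ValueError, TypeError, AttributeError):
                pos = default  # malformed JSON or non-numeric position
        positions.append(pos)
    return tuple(positions)
```

Placing this tuple in the sort key lets a subsection's own category position break ties instead of relying on depth alone.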

Suggestions / nits

  • Pin Python version: shebang says #!/usr/bin/env python3 but uses f"…{var:>8,}" formatting + Path.rglob + re.split patterns that work back to 3.8. A # Requires: Python 3.8+ comment near the top would prevent surprises.
  • Stale-output detection: the script prints a ⚠ WARNING if llms-full.txt > 1.5 MB — nice. Consider adding a --check flag that exits non-zero if the freshly-generated output differs from the on-disk version. Lets CI use it without committing.
  • first_description skip rules over-aggressive: lines containing <[a-z][a-z] or { are skipped — fine for JSX leaks, but a prose line that legitimately mentions <...> syntax (e.g. "<...> denotes a placeholder") is also skipped. Low priority, since the H1 + frontmatter description usually win first.
  • _strip_multiline_expr_blocks brace counting is naive: { / } inside string literals or backticks are also counted. Inside fences it's OK (skipped by in_fence check), but a stray prose line starting with { and containing {...} strings in attributes could mis-count. Hasn't manifested in the current output; flagging for future-proofing.
  • No tests: a 459-line MDX parser with layered regex and fence handling is exactly the kind of code that benefits from a few short unit tests pinning each behavior (placeholder preservation, fence transparency, multi-line tag stripping, leading-H1 dedupe). Worth ~30 lines of tests for the next person who needs to extend it.

Security considerations

None — the script reads files only, no network or shell-out, no user input. int(...) could raise on malformed input, but that's a crash, not a vuln. Stdlib-only minimizes supply chain surface.

Recommendation

Requesting changes for the following before merge:

  1. Fix or document the {ERIGON_VERSION} stripping (concern #1) — either source side or add a placeholder-identifier exclusion. This actively breaks information in tables.
  2. Correct the test-plan H1 assertion (concern #2) — drop "should equal 71" or fix the docs so it's true.
  3. Add a CI regen check (concern #3) — either the existing docs-deploy.yml or a small standalone job to fail when docs/site/** changes without regenerated llms*.txt. Otherwise the files will go stale within a release cycle.

Concerns #4 through #7 are acceptable as follow-ups. Overall the script is well-structured, the design choice (Python stdlib over npm plugin) is correct, and the bulk of earlier review feedback has been addressed.

…hesize landing pages, harden generator

Addresses yperbasis review (CHANGES_REQUESTED) and Copilot follow-ups on PR #21000.

Blockers fixed:
- Preserve `{ERIGON_VERSION}` and other ALL_CAPS identifier placeholders in
  prose and table cells. The brace-strip regex now skips pure-uppercase
  identifier braces, mirroring the existing `<IP>`/`<PID>` angle-tag guard.
  Affected: install instructions, .deb table cells, version refs throughout.
- Add CI guard: `.github/workflows/docs-deploy.yml` runs
  `generate-llms.py --check` and the new unittest suite before the npm build,
  catching drift across the four committed llms files (root + static/).

Non-blocking review items:
- Synthesize landing-page index pages (docs/index.mdx, get-started/index.mdx,
  staking/index.mdx, etc.) into structured "## Sections" + bullet lists
  extracted from the lp-card JSX, instead of collapsing them into structureless
  title/desc fragments.
- Drop the singleton "## Erigon Docs" section header — the Introduction bullet
  now sits directly under the preamble, no redundant 1-item header.
- Harden parse_frontmatter: skip indented YAML continuations (so `tags:` lists
  no longer pollute keys), `_safe_int` wrapper around `sidebar_position` to
  tolerate non-numeric values.
- Walk nested `_category_.json` files via `ancestor_positions()` so deeper
  sections honor their own category position in the sort, not just the
  top-level dir.
- Add `--check` flag for CI use (exits non-zero on drift, no writes).
- Tighten `first_description` to skip lines that LOOK LIKE JSX leaks
  (`^<tag`, `^{`, arrow-fn) instead of skipping any line containing those
  tokens — preserves prose paragraphs that mention them mid-sentence.
- Add `# Requires: Python 3.8+` documentation at the top.

Tests:
- New `docs/site/scripts/test_generate_llms.py` — 25 tests covering placeholder
  preservation (inline, prose, table cells), JSX expression stripping, fence
  transparency for `export VAR=` and Python imports, multi-line expr blocks,
  landing-page synthesis, frontmatter parsing edge cases, and `_safe_int`.
- Run with `python3 -m unittest discover docs/site/scripts -v`.

Verification on regenerated corpus (71 pages, 351 KB):
- `grep -c '{ERIGON_VERSION}' llms-full.txt` → 15 (was 8 — +7 in prose/tables)
- `grep -c '^export ' llms-full.txt` → 9 (was 0 — fence preservation working)
- Real JSX leaks (`<[A-Z][a-z]`) → 0
- MDX imports/exports → 0
- URL lines → 71 (one per page)
- `--check` exits 0 on fresh output, 1 on tampered output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bloxster bloxster requested a review from mriccobene as a code owner May 7, 2026 17:05
bloxster pushed a commit that referenced this pull request May 7, 2026
…esize landings, regenerate

Brings #21013 in line with the latest state of #21000:

- generate-llms.py: preserve {ERIGON_VERSION}/identifier braces in prose +
  table cells; multi-line JSX-expr block stripping; landing-page synthesis
  (## Sections + bullets) for card-grid index pages; singleton section
  header drop; nested _category_.json honored via ancestor_positions;
  hardened parse_frontmatter (skip indented continuations, _safe_int);
  tightened first_description (only skip JSX-leak-shaped lines);
  --check flag for drift detection; argparse main; Python 3.8+.

- test_generate_llms.py: 25-test unittest suite covering placeholder
  preservation, fence transparency, JSX/MDX stripping, frontmatter
  parsing, and landing-page synthesis.

- Regenerated llms.txt + llms-full.txt (root + static/) from current
  source, picking up sync-to-main's "four prune modes" / typo / yaml
  exec-form fixes.
bloxster pushed a commit that referenced this pull request May 7, 2026
When a PR touches docs/site/**, also verify:
- generate-llms.py --check (regenerated outputs match committed files)
- unittest discover (25-test suite stays green)

Mirrors the guard in #21000's docs-deploy.yml so PRs to main get the
same llms.txt drift protection that release/3.4 gets at deploy time.
