# Rustdoc LLM Text Format RFC — rust-lang/rfcs#3751

**PR:** https://github.com/rust-lang/rfcs/pull/3751
**Author:** Folyd
**Date:** December 2024
**Status:** Not merged. Received significant pushback.

## The Proposal

Add an LLM-friendly text output format to `rustdoc`. The author needed an LLM to use the `oas3` crate but got code for v0.4.0 (knowledge cutoff) instead of v0.13.1. The existing rustdoc JSON format is 5.5MB for this crate — too noisy for LLM consumption. A text summary would be <1KB with just the public API.

Proposed URL pattern: `https://docs.rs/oas3/0.11.3/docs.txt`

## Arguments FOR

**Folyd (author):**
- Rustdoc's JSON format is 161,873 lines for a single crate — unsuitable for LLM context windows
- Text formats for LLMs are becoming standard (cites llms.txt)
- AI models can understand arbitrary text — what they need is a suitable format, not more intelligence
- The problem isn't HTML comprehension, it's engineering effort to extract useful info from complex pages

**lebensterben:**
- The format is useful to humans too — "much more readable than other formats when you just need a synopsis of a library"

**ahicks92 (initial skeptic, then converted by Copilot experience):**
- Personal account: Copilot autocompleted full trait impls, macros, 20-30 lines at a time — in niche audio synthesis code doing "horrifying things to the type system"
- "Does my chores, and sometimes reads my mind, producing code in my style" is super valuable
- Two usage modes: beginners chat with it to learn, experts lean on it for boring parts

## Arguments AGAINST

**clarfonthey:**
- "I prefer my oceans unboiled" (environmental cost of LLM usage)
- You can already view collapsed code in IDEs — not adding anything new for humans
- The format isn't searchable, so not useful to humans either
- Feeding info into "statistical models which cannot understand it" is not a good use case
- Links to arxiv.org/abs/2410.05229 (paper on limits of LLM understanding)

**juntyr:**
- The JSON format is already meant to be machine-consumed
- Going through a textual representation just to re-obtain semantic information seems wasteful
- Why not use an external tool like `rusty-man` to produce reduced output?

**ahicks92 (despite being pro-AI, argues against THIS approach):**
- No real reason this can't be an external tool — RFC doesn't have a path for consumption
- Nobody in AI has standardized how to provide context yet
- "Every time anyone says 'AI is like X', 6 months from now that's no longer the case"
- This feels premature given pace of progress
- "Consume complex HTML" will likely be a solved problem by the time this stabilizes
- Rust's RFC process is too slow (4+ months) for a field with 2-year product cycles
- "This RFC argues for stripping context. History argues for providing more context."
- Anthropic is standardizing context via MCP; OpenAI claims coding AI better than most devs
- Would bet significant money that "make a text version as humans" won't matter in 2 years

**workingjubilee (T-rustdoc team member):**
- LLMs are in-flux technology with changing context windows, compression methods, and input requirements
- A "redux" format targeted at current LLM usage is ill-suited for a stability guarantee
- "Very likely it is not the desired format within 3 months, never mind 3 years"
- **Could simply be implemented as a library** that filters the JSON output
- Rustdoc already has a JSON format — use that as the foundation for external tools

**aDotInTheVoid:**
- "The JSON format allows for this to be done outside of rustdoc, and there's no advantage to having this be in rustdoc itself"

## Key Themes from the Discussion

1. **External tool vs. built-in**: Strong consensus that this should be an external tool consuming rustdoc's JSON, not a core rustdoc feature. Mature language toolchains shouldn't bake in AI-specific formats because the requirements change too fast.

2. **Pace of change**: Multiple commenters argue AI capabilities advance faster than RFC processes. Any format designed for today's LLMs will be obsolete by the time it ships.

3. **Strip context vs. provide more context**: The fundamental tension — LLMs currently have limited context windows (argues for stripping), but context windows are growing fast and AI is getting better at processing complex inputs (argues for providing everything).

4. **Environmental concerns**: "Ocean boiling" argument against LLM-optimized tooling.

5. **JSON as the right intermediate format**: Rustdoc already has machine-readable JSON output. The community prefers external tools that transform this JSON rather than adding new output formats to rustdoc itself.

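The "external tool consuming rustdoc's JSON" consensus can be illustrated with a short script. This is a hedged sketch, not a real tool: it assumes a simplified version of rustdoc's unstable JSON layout (a top-level `index` map of items carrying `name`, `visibility`, and `docs` fields), and the actual format differs between toolchain versions. The point is only that a ~5.5MB JSON dump can be reduced to a sub-kilobyte text summary entirely outside rustdoc.

```python
import json

def summarize_public_api(rustdoc_json: str) -> str:
    """Reduce rustdoc-style JSON to a compact plain-text API summary.

    Sketch only: assumes a simplified, hypothetical version of the
    (unstable) rustdoc JSON layout where a top-level "index" maps
    item IDs to objects with "name", "visibility", and "docs".
    """
    data = json.loads(rustdoc_json)
    lines = []
    for item in data.get("index", {}).values():
        # Keep only named, public items; drop crate-private ones.
        if item.get("visibility") != "public" or not item.get("name"):
            continue
        # Keep only the first line of each doc comment to stay small.
        doc = (item.get("docs") or "").splitlines()
        summary = f" - {doc[0]}" if doc else ""
        lines.append(f"{item['name']}{summary}")
    return "\n".join(sorted(lines))

# Tiny synthetic input standing in for a real multi-megabyte dump.
sample = json.dumps({
    "format_version": 39,
    "index": {
        "0:1": {"name": "Spec", "visibility": "public",
                "docs": "Root OpenAPI object.\nMore detail follows..."},
        "0:2": {"name": "internal_helper", "visibility": "crate",
                "docs": "Not part of the public API."},
    },
})
print(summarize_public_api(sample))  # only the public item survives
```

Because the filtering lives in an ordinary script, it can track rustdoc's format changes on its own release cadence — exactly the flexibility workingjubilee and aDotInTheVoid argue would be lost by baking a text format into rustdoc itself.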
## Relevance to RDoc

The Rust community rejected baking LLM-specific output into rustdoc. But their situation differs from RDoc's:
- Rustdoc already has a JSON output format. RDoc doesn't have an equivalent machine-readable output.
- Rust's RFC process is slow. RDoc can ship features faster as a gem.
- The counter-arguments about pace of change apply equally to RDoc — any AI-specific format may be obsolete soon.
- The "external tool" argument is strong: maybe RDoc should provide good structured data (Markdown, type info) and let external tools (MCP servers, skills) handle AI-specific consumption.