Commit 0f158a8

Revise talk outline and add deep research on types, AI, and documentation
- Restructure talk: AI hook → feature walkthrough → honest forward-looking section
- New framing: RDoc prioritizes better contributing experience, not chasing AI hype
- Add research: Mundler PLDI 2025, Endoh benchmark, Rustdoc LLM RFC, APIMatic, OpenAI/Astral, Shopify Boba — all verified against primary sources
- Add research: documentation skills landscape, Markdown for agents
Parent: 8d4187c

5 files changed: 580 additions & 216 deletions

Lines changed: 45 additions & 0 deletions

# AI Documentation Skills — Research

## Key Insight

No major doc generator (JSDoc, YARD, Sphinx, RDoc) has shipped AI-specific features yet. All AI integration happens at the consumption layer (MCP servers), not the generation layer.

## Doc Generator AI Efforts

### Rustdoc — Markdown Output (Pre-RFC, active)

- Proposal: `cargo doc --output-format markdown` for agent-consumable API docs
- Also: `rustdoc-md` crate that converts rustdoc JSON to Markdown today
- Cargo issue: github.com/rust-lang/cargo/issues/16720
- Internals discussion: internals.rust-lang.org/t/pre-rfc-add-llm-text-version-to-rustdoc/22090
- **Most relevant parallel to what RDoc could do**

### llms.txt Standard

- `/llms.txt` (summaries) and `/llms-full.txt` (full Markdown)
- 600+ adopters: Anthropic, Stripe, Cloudflare, Cursor
- Doc-site-side effort, not doc-generator-side
- Stan already evaluated and rejected it based on impact data
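
For concreteness, an llms.txt file is plain Markdown at a well-known path: an H1 title, a blockquote summary, then H2 sections of annotated links. A minimal sketch of what one could look like for an RDoc-generated site (the gem name and URLs are illustrative, not from any real project):

```
# MyGem

> MyGem does X. This index links Markdown documentation for AI agents.

## API Docs

- [MyGem::Client](https://example.org/mygem/client.md): HTTP client entry point
- [Changelog](https://example.org/mygem/changelog.md): release history
```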
## Consumption-Layer Tools (MCP Servers)

### Dash MCP Server

- Official MCP server from Kapeli (Dash 8+)
- Search installed docsets, list docs, extract content
- github.com/Kapeli/dash-mcp-server

### Context7 (Upstash)

- Resolves library names to version-specific documentation
- **Vercel measured: task pass rate rose from 53% to 100% with version-matched docs**
- github.com/upstash/context7

### DevDocs MCP

- Multiple implementations wrapping devdocs.io
- Local search across downloaded docsets

### Espressif (ESP32)

- Hardware vendor shipping a dedicated MCP server for its docs
- developer.espressif.com/blog/2026/04/doc-mcp-server/

## Takeaway for RDoc

- RDoc could be the first major doc generator to add AI-specific output
- The Rust community is working on the same idea (Markdown output from the doc tool)
- The consumption layer (MCP servers) is ahead of the generation layer
- Question: should RDoc generate Markdown alongside HTML, or expose an MCP interface?
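
To make the "Markdown alongside HTML" option concrete, here is a minimal sketch in Ruby. It deliberately does not use RDoc's real generator API; the input shape (class name plus a hash of method names to comments) is a hypothetical stand-in for whatever RDoc's store would actually provide.

```ruby
# Hypothetical sketch: render a per-class Markdown page from parsed
# documentation data. The input shape is illustrative, not RDoc's API.
def render_markdown(class_name, methods)
  lines = ["# #{class_name}", ""]
  methods.each do |name, comment|
    lines << "## ##{name}" << "" << comment << ""
  end
  lines.join("\n")
end
```

Usage: `render_markdown("MyGem::Client", { "get" => "Performs an HTTP GET." })` yields a small Markdown page, one `##` section per method.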

research/markdown_for_agents.md (40 additions & 0 deletions)

# Markdown for AI Agents — Research

## The Problem

How does an AI agent programmatically get Markdown from a documentation site?

## Current Approaches

### llms.txt (closest to a standard)

- `/llms.txt` (overview + links) and `/llms-full.txt` (complete Markdown)
- Adopted by Anthropic, Cloudflare, Read the Docs, Mintlify (~784 sites)
- No major LLM provider officially consumes it
- Stan already evaluated and rejected it based on impact data

### External Proxies

- **Jina Reader** (`r.jina.ai/URL`) — prefix any URL to get Markdown. Widely used.
- **Firecrawl** — API that crawls sites and returns clean Markdown for RAG pipelines
- **Trafilatura, readability** — libraries for content extraction

### Source-as-Markdown (MDN Model)

- MDN content is authored as Markdown on GitHub (`mdn/content`)
- 1:1 mapping between a rendered URL and the raw `.md` file on GitHub
- An AI can fetch the raw Markdown from GitHub given any MDN URL
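
The MDN model can be sketched as a URL rewrite. The Ruby below assumes the `mdn/content` repository layout (`files/en-us/<lowercased slug>/index.md`); that layout is an approximation drawn from browsing the repo, not a guaranteed contract.

```ruby
# Sketch of the MDN model: derive the raw Markdown URL on GitHub from a
# rendered page URL. The files/en-us/<slug>/index.md layout (slug
# lowercased) is an approximation of the mdn/content repo structure.
def mdn_raw_markdown_url(page_url)
  slug = page_url[%r{developer\.mozilla\.org/en-US/docs/(.+)\z}, 1]
  return nil unless slug
  "https://raw.githubusercontent.com/mdn/content/main/files/en-us/#{slug.downcase}/index.md"
end
```

The point is not this particular mapping but that it exists at all: any agent holding a rendered URL can compute the source URL with no server cooperation.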
### Content Negotiation

- No `Accept: text/markdown` convention exists
- No `.md` URL-extension convention
- llms.txt sidesteps content negotiation entirely (a separate well-known path, like robots.txt)

## What Doesn't Work Well

- "Copy as Markdown" buttons — require human action, bad UX
- Manual conversion — a transitional approach at best

## Open Questions

- How does an AI agent know Markdown is available without trying?
- Should doc sites serve Markdown at a predictable URL pattern?
- Or should the rendering tool (RDoc) generate Markdown output alongside the HTML?
- If RDoc generates `.md` files, how are they discovered?

## Takeaway

No standard exists yet. The field is split between well-known paths (llms.txt), external proxies (Jina), and source-as-Markdown (MDN). This is a genuinely open problem worth discussing honestly.

research/rustdoc_llm_rfc.md (83 additions & 0 deletions)

# Rustdoc LLM Text Format RFC — rust-lang/rfcs#3751

**PR:** https://github.com/rust-lang/rfcs/pull/3751
**Author:** Folyd
**Date:** December 2024
**Status:** Not merged. Received significant pushback.

## The Proposal

Add an LLM-friendly text output format to `rustdoc`. The author needed an LLM to use the `oas3` crate but got code for v0.4.0 (the model's knowledge cutoff) instead of v0.13.1. The existing rustdoc JSON output is 5.5MB for this crate — too noisy for LLM consumption. A text summary with just the public API would be under 1KB.

Proposed URL pattern: `https://docs.rs/oas3/0.11.3/docs.txt`

## Arguments FOR

**Folyd (author):**

- Rustdoc's JSON format is 161,873 lines for a single crate — unsuitable for LLM context windows
- Text formats for LLMs are becoming standard (cites llms.txt)
- AI models can understand arbitrary text — what they need is a suitable format, not more intelligence
- The problem isn't HTML comprehension; it's the engineering effort to extract useful info from complex pages

**lebensterben:**

- The format is useful to humans too — "much more readable than other formats when you just need a synopsis of a library"

**ahicks92 (initial skeptic, converted by a Copilot experience):**

- Personal account: Copilot autocompleted full trait impls and macros, 20-30 lines at a time, in niche audio-synthesis code doing "horrifying things to the type system"
- "Does my chores, and sometimes reads my mind, producing code in my style" is super valuable
- Two usage modes: beginners chat with it to learn; experts lean on it for the boring parts

## Arguments AGAINST

**clarfonthey:**

- "I prefer my oceans unboiled" (environmental cost of LLM usage)
- You can already view collapsed code in IDEs — this adds nothing new for humans
- The format isn't searchable, so it's not useful to humans either
- Feeding info into "statistical models which cannot understand it" is not a good use case
- Links to arxiv.org/abs/2410.05229 (paper on the limits of LLM understanding)

**juntyr:**

- The JSON format is already meant to be machine-consumed
- Going through a textual representation just to re-obtain semantic information seems wasteful
- Why not use an external tool like `rusty-man` to produce reduced output?

**ahicks92 (despite being pro-AI, argues against THIS approach):**

- No real reason this can't be an external tool — the RFC has no path for consumption
- Nobody in AI has standardized how to provide context yet
- "Every time anyone says 'AI is like X', 6 months from now that's no longer the case"
- This feels premature given the pace of progress
- "Consume complex HTML" will likely be a solved problem by the time this stabilizes
- Rust's RFC process is too slow (4+ months) for a field with 2-year product cycles
- "This RFC argues for stripping context. History argues for providing more context."
- Anthropic is standardizing context via MCP; OpenAI claims its coding AI is better than most devs
- Would bet significant money that "make a text version as humans" won't matter in 2 years

**workingjubilee (T-rustdoc team member):**

- LLMs are an in-flux technology with changing context windows, compression methods, and input requirements
- A "redux" format targeted at current LLM usage is ill-suited for a stability guarantee
- "Very likely it is not the desired format within 3 months, never mind 3 years"
- **Could simply be implemented as a library** that filters the JSON output
- Rustdoc already has a JSON format — use that as the foundation for external tools

**aDotInTheVoid:**

- "The JSON format allows for this to be done outside of rustdoc, and there's no advantage to having this be in rustdoc itself"
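
The "library that filters the JSON output" idea is easy to picture. A hedged Ruby sketch, assuming only the broad shape of rustdoc's JSON output (a top-level `index` hash of items carrying `name`, `visibility`, and a one-key `inner` hash naming the item kind; check the real format docs before relying on this):

```ruby
require "json"

# Sketch of the "external tool" approach: reduce rustdoc's JSON output
# to a terse public-API listing. The assumed layout (index hash; items
# with "name", "visibility", one-key "inner") approximates the rustdoc
# JSON format and may not match every format_version.
def public_api_summary(json_text)
  doc = JSON.parse(json_text)
  doc.fetch("index", {}).filter_map do |_id, item|
    next unless item["visibility"] == "public" && item["name"]
    kind = item["inner"].is_a?(Hash) ? item["inner"].keys.first : "item"
    "#{kind} #{item["name"]}"
  end.sort
end
```

A few dozen lines like this, living outside the doc tool, is exactly what the commenters argue for: the stable machine-readable format stays in core, and the fast-moving LLM-shaped reduction lives where it can change monthly.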
## Key Themes from the Discussion

1. **External tool vs. built-in**: Strong consensus that this should be an external tool consuming rustdoc's JSON, not a core rustdoc feature. Mature language toolchains shouldn't bake in AI-specific formats because the requirements change too fast.

2. **Pace of change**: Multiple commenters argue that AI capabilities advance faster than RFC processes. Any format designed for today's LLMs will be obsolete by the time it ships.

3. **Strip context vs. provide more context**: The fundamental tension — LLMs currently have limited context windows (which argues for stripping), but context windows are growing fast and AI is getting better at processing complex inputs (which argues for providing everything).

4. **Environmental concerns**: The "ocean boiling" argument against LLM-optimized tooling.

5. **JSON as the right intermediate format**: Rustdoc already has machine-readable JSON output. The community prefers external tools that transform this JSON rather than adding new output formats to rustdoc itself.

## Relevance to RDoc

The Rust community rejected baking LLM-specific output into rustdoc. But their situation differs from RDoc's:

- Rustdoc already has a JSON output format. RDoc doesn't have an equivalent machine-readable output.
- Rust's RFC process is slow. RDoc, shipped as a gem, can move faster.
- The counter-arguments about pace of change apply equally to RDoc — any AI-specific format may be obsolete soon.
- The "external tool" argument is strong: maybe RDoc should provide good structured data (Markdown, type info) and let external tools (MCP servers, skills) handle AI-specific consumption.
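
As a thought experiment, the machine-readable output RDoc lacks could start as small as a JSON dump of the public API. A hypothetical Ruby sketch (the input shape is illustrative and is not RDoc's actual data model; `format_version` mirrors rustdoc's convention):

```ruby
require "json"

# Hypothetical sketch: serialize a minimal public-API map to JSON, the
# kind of structured output external AI tools could filter and reshape.
# The input shape (class name => method names) is illustrative only.
def api_json(classes)
  payload = classes.map do |name, methods|
    { "class" => name, "methods" => methods }
  end
  JSON.generate({ "format_version" => 1, "classes" => payload })
end
```

Even this much would let the external-tool ecosystem (MCP servers, skills) do for RDoc what the Rust community says external tools should do for rustdoc.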
