Replies: 2 comments 2 replies
-
Having the table of contents as a separate element seems useful for generating AI summaries of PDFs, assuming that we want to feed the AI JSON in the first place and that it will recognize the TOC as something important rather than just another page. For questions:
-
Thanks for raising this. Here's where we stand, along with some thoughts on both questions.
Current state in the codebase:
Q1: Dedicated JSON object type? I'd suggest introducing a dedicated
Since
{
  "type": "toc",
  "items": [
    {
      "type": "toc item",
      "text": "Chapter 1: Introduction",
      "level": 1,
      "page": 5,
      "link": "#heading-1",
      "children": [
        {
          "type": "toc item",
          "text": "1.1 Background",
          "level": 2,
          "page": 7,
          "link": "#heading-1-1"
        }
      ]
    }
  ]
}
For Markdown output, rendering as a nested list with hyperlinks seems like a natural fit:
- [Chapter 1: Introduction](#heading-1) ... 5
  - [1.1 Background](#heading-1-1) ... 7
Q2: Recognize TOC in non-tagged PDFs? Rather than building separate heuristics, it might be worth leveraging Autotagging for this. Since Autotagging can assign structure tags, including
Would love to hear your thoughts on this direction!
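To make the Markdown idea under Q1 a bit more concrete, here is a minimal Python sketch of how the proposed JSON could be turned into a nested list. This is only an illustration under assumptions: the function name `render_toc_markdown` is hypothetical and not part of the codebase, the input follows the `toc` object shape shown above, `page` and `children` are treated as optional, and indentation is derived from nesting depth rather than from the `level` field.

```python
import json


def render_toc_markdown(items, depth=0):
    """Render TOC item dicts as nested Markdown list lines (hypothetical helper)."""
    lines = []
    for item in items:
        indent = "  " * depth  # two spaces per nesting level
        line = f"{indent}- [{item['text']}]({item['link']})"
        # Assumption: "page" is optional; append it only when present.
        if "page" in item:
            line += f" ... {item['page']}"
        lines.append(line)
        # Assumption: "children" is optional; recurse one level deeper.
        lines.extend(render_toc_markdown(item.get("children", []), depth + 1))
    return lines


# Sample input copied from the JSON proposal above.
toc = json.loads("""
{
  "type": "toc",
  "items": [
    {
      "type": "toc item",
      "text": "Chapter 1: Introduction",
      "level": 1,
      "page": 5,
      "link": "#heading-1",
      "children": [
        {
          "type": "toc item",
          "text": "1.1 Background",
          "level": 2,
          "page": 7,
          "link": "#heading-1-1"
        }
      ]
    }
  ]
}
""")

markdown = "\n".join(render_toc_markdown(toc["items"]))
print(markdown)
```

Using nesting depth for indentation keeps the output consistent even if `level` values skip a step; whether `level` or `children` should be the source of truth is an open design choice.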
-
Tagged PDF has special objects for the table of contents, represented as TOC and TOCI structure elements. A table of contents may be seen as a special type of list, where each list item is a link to a heading within the document.
Questions: