support for Table of Contents #87

bdoubrov · 2025-10-30T14:01:55Z

bdoubrov
Oct 30, 2025
Maintainer

Tagged PDF has special objects for a Table of Contents, represented as TOC and TOCI structure elements. A TOC may be seen as a special type of list, where each list item is a link to a heading within the document.

Questions:

1. Do we want to have new dedicated object types in JSON? In Markdown they would be seen as lists, as discussed here: https://stackoverflow.com/questions/11948245/markdown-to-create-pages-and-table-of-contents
2. Do we want to recognize a table of contents even if the PDF is not tagged?

denisbialy · 2025-11-11T21:38:30Z

denisbialy
Nov 11, 2025
Maintainer

Having the table of contents as a separate element seems useful for generating AI summaries of PDFs, assuming that we want to feed the JSON to an AI in the first place and that it will recognize the TOC as something important and not just another page.

On the questions:

1. Yes.
2. A TOC has a very high chance of being recognized as a borderless table (the title is one column, the page number is another). I don't know if we want to mess with table recognition even more.

0 replies

hnc-jglee · 2026-03-17T07:59:09Z

hnc-jglee
Mar 17, 2026
Maintainer

Thanks for raising this — here's where we stand and some thoughts on both questions.

Current state in the codebase:
The TaggedDocumentProcessor already has a commented-out stub for TABLE_OF_CONTENT (lines 128-130), and this is on the Q2 2026 roadmap. The infrastructure (entity classes, serializers, schema) doesn't exist yet, but the patterns from ListProcessor/ListSerializer could serve as a good starting template.

Q1: Dedicated JSON object type?

I'd suggest introducing a dedicated toc type. While TOC is structurally similar to a list, it carries distinct semantics that would be worth preserving:

- Each item is a link to a heading with an associated page number
- The hierarchy mirrors the document's heading structure
- Consumers (especially AI pipelines, as @denisbialy noted) would benefit from knowing this is a TOC rather than an arbitrary list

Since TOCI elements in tagged PDFs typically contain hyperlinks (internal links to headings), it would also be worth capturing the link destination. One possible JSON shape could look like:

{
  "type": "toc",
  "items": [
    {
      "type": "toc item",
      "text": "Chapter 1: Introduction",
      "level": 1,
      "page": 5,
      "link": "#heading-1",
      "children": [
        {
          "type": "toc item",
          "text": "1.1 Background",
          "level": 2,
          "page": 7,
          "link": "#heading-1-1"
        }
      ]
    }
  ]
}
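
To make the shape above concrete, here is a minimal, language-neutral sketch in Python of an entity that serializes to it. The names `TocItem` and `toc_to_dict` are hypothetical and do not exist in the codebase; the real implementation would presumably follow the ListProcessor/ListSerializer patterns instead.

```python
import json
from dataclasses import dataclass, field


@dataclass
class TocItem:
    """One TOCI entry (hypothetical class; fields follow the proposed JSON)."""
    text: str
    level: int
    page: int
    link: str
    children: "list[TocItem]" = field(default_factory=list)

    def to_dict(self) -> dict:
        node = {
            "type": "toc item",
            "text": self.text,
            "level": self.level,
            "page": self.page,
            "link": self.link,
        }
        # Emit "children" only when there are nested entries,
        # matching the leaf item in the proposed shape.
        if self.children:
            node["children"] = [child.to_dict() for child in self.children]
        return node


def toc_to_dict(items: "list[TocItem]") -> dict:
    """Wrap top-level entries in the proposed "toc" container."""
    return {"type": "toc", "items": [item.to_dict() for item in items]}


toc = toc_to_dict([
    TocItem("Chapter 1: Introduction", 1, 5, "#heading-1", [
        TocItem("1.1 Background", 2, 7, "#heading-1-1"),
    ]),
])
print(json.dumps(toc, indent=2))
```

Omitting `children` on leaf items keeps the output compact; always emitting an empty array would be an equally reasonable choice if consumers prefer a uniform schema.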

For Markdown output, rendering as a nested list with hyperlinks seems like a natural fit:

- [Chapter 1: Introduction](#heading-1) ... 5
  - [1.1 Background](#heading-1-1) ... 7
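
As a rough sketch of that Markdown rendering, the following Python function walks a `toc` object of the proposed shape and emits the nested list. `render_toc_markdown` is a hypothetical name, and the two-space indent per nesting level is an assumption rather than a project convention.

```python
def render_toc_markdown(toc: dict) -> str:
    """Render a proposed-shape "toc" object as a nested Markdown list."""
    lines = []

    def walk(items, depth):
        for item in items:
            indent = "  " * depth  # assumed: two spaces per nesting level
            lines.append(
                f'{indent}- [{item["text"]}]({item["link"]}) ... {item["page"]}'
            )
            # Leaf items may omit "children" entirely.
            walk(item.get("children", []), depth + 1)

    walk(toc["items"], 0)
    return "\n".join(lines)


example_toc = {
    "type": "toc",
    "items": [{
        "type": "toc item", "text": "Chapter 1: Introduction",
        "level": 1, "page": 5, "link": "#heading-1",
        "children": [{
            "type": "toc item", "text": "1.1 Background",
            "level": 2, "page": 7, "link": "#heading-1-1",
        }],
    }],
}
markdown = render_toc_markdown(example_toc)
print(markdown)
```

The indent here is derived from nesting depth rather than the `level` field, so the list stays well-formed even if a document skips heading levels.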

Q2: Recognize TOC in non-tagged PDFs?

Rather than building separate heuristics, it might be worth leveraging Autotagging for this. Since Autotagging can assign structure tags — including TOC/TOCI — to non-tagged PDFs, the tagged TOC processing path could be reused as-is. If that approach works, we could focus on implementing tagged TOC/TOCI handling first, and non-tagged PDFs would get TOC support naturally through Autotagging.

Would love to hear your thoughts on this direction!

2 replies

MaximPlusov Mar 17, 2026
Maintainer

@hnc-jglee

> non-tagged PDFs would get TOC support naturally through Autotagging.

Auto-tagging just uses the structure elements that we detect during the layout recognition step, so recognizing a table of contents in non-tagged PDFs is a separate task.

hnc-jglee Apr 3, 2026
Maintainer

@MaximPlusov Good point — you're right. Looking at it more carefully, auto-tagging and TOC detection are indeed separate concerns. The roadmap actually reflects this as well: "Auto-Tagging Engine" and "TOC Extraction" are listed as distinct items.

So the practical plan would be:

Phase 1 — Tagged PDF TOC processing: Activate the existing TABLE_OF_CONTENT stub in TaggedDocumentProcessor and build out the entity classes, serializers, and JSON schema (the toc type proposed above). This handles PDFs that already have TOC/TOCI structure elements.
Phase 2 — Untagged PDF TOC detection: Build dedicated heuristics to recognize TOC-like patterns from visual layout (e.g., repeated "title ... page number" lines, leader dots, internal hyperlinks). This would be a standalone detection step, not dependent on auto-tagging.
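
For Phase 2, one of the simplest signals could be sketched as below: match "title, leader dots, page number" lines and require several in a row. This is only an illustration in Python operating on plain text lines. The regex, the `min_run` threshold, and the names `TOC_LINE` and `find_toc_blocks` are all assumptions, and a real detector would also need layout positions and internal hyperlinks.

```python
import re

# Assumed pattern: a title, a leader of 3+ dots (optionally spaced),
# then a trailing page number.
TOC_LINE = re.compile(r"^(?P<title>\S.*?)\s*(?:\.\s?){3,}\s*(?P<page>\d+)\s*$")


def find_toc_blocks(lines, min_run=3):
    """Return runs of at least `min_run` consecutive TOC-like lines.

    Each run is a list of (title, page) tuples. Requiring a run filters
    out isolated lines that merely happen to end in dots and a number.
    """
    blocks, current = [], []
    for line in lines:
        match = TOC_LINE.match(line)
        if match:
            current.append((match.group("title"), int(match.group("page"))))
            continue
        if len(current) >= min_run:
            blocks.append(current)
        current = []
    if len(current) >= min_run:
        blocks.append(current)
    return blocks


page_lines = [
    "Contents",
    "Chapter 1: Introduction ........ 5",
    "1.1 Background . . . . . 7",
    "Chapter 2: Methods ........ 12",
    "This paragraph mentions version 1.2.3 and is ordinary prose.",
]
blocks = find_toc_blocks(page_lines)
print(blocks)
```

TOCs without leader dots (title and page number separated only by whitespace) are not matched by this pattern, which is where the overlap with borderless-table recognition mentioned earlier in the thread would come in.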

Thanks for the correction — it keeps the implementation plan realistic.
