feat: Auto generate chapters for podcasts that provide timestamps #5119

Open
harryr0se wants to merge 18 commits into advplyr:master from harryr0se:auto-generate-chapters-from-timestamps

Conversation

harryr0se commented Mar 11, 2026

Brief summary

This PR adds support for automatically generating chapters when a podcast episode provides timestamps in its description. It does this by scraping the description line by line and building up a chapter list.

Which issue is fixed?

I started working on this as it's something I really wanted, but I've found the following related issue:
#2363

In-depth Description

If the newly added autoGenerateChapters field is true on the Podcast object, the generation code runs when ABS creates a PodcastEpisode object from a newly downloaded RSSPodcastEpisode

The generation steps:

  1. Break the description up into lines. Currently it splits on any of the following: </p>, <br /> or \n
  2. Iterate over each line, looking for a timestamp via regex
  3. If a line matches, work out whether the timestamp contains an hour. It's common for descriptions to only start including hours once they tick over the hour mark, for example:
• 00:00 Chapter 1
• 30:00 Chapter 2
• 1:04:14 Chapter 3
  4. Calculate the chapter start time in seconds from this timestamp
  5. Extract the title with a further regex, which attempts to find the text after the timestamp
  6. If other chapters have already been generated, update the previous chapter's end value to be this new chapter's start. This assumes that timestamps are sequential and contiguous
  7. Once out of the loop, update the last chapter to end at the duration of the audio file
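
For readers following along, the steps above can be sketched roughly as below. This is a minimal illustration, not the code in this PR: the function names, the timestamp regex and the chapter object shape (id, start, end, title) are all assumptions, and it skips the error handling described in the next section as well as chapter titles that contain HTML tags.

```javascript
// Hypothetical sketch of the generation steps; names and regexes are assumptions.

// Finds the first timestamp in a line, e.g. "30:00" or "1:04:14".
// Group 1 = optional hours, group 2 = minutes, group 3 = seconds.
const TIMESTAMP_REGEX = /\b(?:(\d{1,2}):)?(\d{1,2}):(\d{2})(?!\d)/

function splitDescriptionIntoLines(description) {
  // Split on closing paragraph tags, <br> variants and newlines
  return description.split(/<\/p>|<br\s*\/?>|\n/i)
}

function timestampToSeconds(hours, minutes, seconds) {
  // Hours are undefined when the timestamp is only mm:ss
  return Number(hours || 0) * 3600 + Number(minutes) * 60 + Number(seconds)
}

function chaptersFromDescription(description, audioDurationSecs) {
  const chapters = []
  for (const line of splitDescriptionIntoLines(description)) {
    const match = TIMESTAMP_REGEX.exec(line)
    if (!match) continue

    const start = timestampToSeconds(match[1], match[2], match[3])
    // The title is whatever text follows the timestamp on the same line
    const title = line.slice(match.index + match[0].length).trim()

    // Chapters are assumed sequential and contiguous, so the previous
    // chapter ends where this one starts
    if (chapters.length) chapters[chapters.length - 1].end = start
    chapters.push({ id: chapters.length, start, end: start, title })
  }
  // The last chapter runs to the end of the audio file
  if (chapters.length) chapters[chapters.length - 1].end = audioDurationSecs
  return chapters
}
```

With the example description above and an audio duration of 4000 seconds, this would produce three chapters starting at 0, 1800 and 3854 seconds.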

Error checking

I believe this sort of feature should be quite conservative: if there are instances where we would be unsure of the state of a given timestamp, we should bail out of the entire process for that podcast episode. This is particularly important because we're treating the timestamps as neighboring chapters, so a single error could propagate to the chapters around it

This implementation currently has the following error handling:

  1. Throw on basic argument null checks
  2. Throw if we're unable to scrape the title of a given chapter
  3. Throw if we scrape and are only able to find one chapter (perhaps this isn't required, but one chapter seems unhelpful and I felt it could indicate some parsing failure)
  4. Throw if there are timestamps past the end of the audio file
  5. Throw if minutes or seconds are over 59
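
As a rough illustration of how conservative checks like these could be expressed, here is a minimal sketch. The function names, error messages and chapter shape are assumptions made for the example, not the PR's actual code.

```javascript
// Hypothetical validation sketch; names and messages are assumptions.

function assertValidTimestampParts(hours, minutes, seconds) {
  // Minutes and seconds must stay within 0-59
  if (minutes > 59 || seconds > 59) {
    throw new Error(`Invalid timestamp ${hours}:${minutes}:${seconds}`)
  }
}

function validateChapters(chapters, audioDurationSecs) {
  // Basic argument checks
  if (!Array.isArray(chapters)) throw new Error('chapters is required')
  if (typeof audioDurationSecs !== 'number') throw new Error('audioDurationSecs is required')

  // A single chapter is unhelpful and may indicate a parsing failure
  if (chapters.length < 2) throw new Error('Fewer than two chapters found')

  for (const chapter of chapters) {
    if (!chapter.title) throw new Error(`No title found for chapter at ${chapter.start}s`)
    if (chapter.start > audioDurationSecs) {
      throw new Error(`Chapter starts at ${chapter.start}s, after the audio ends at ${audioDurationSecs}s`)
    }
  }
  return chapters
}

// Bail out of the whole episode on any failure by falling back to no chapters
function chaptersOrEmpty(chapters, audioDurationSecs) {
  try {
    return validateChapters(chapters, audioDurationSecs)
  } catch (error) {
    return []
  }
}
```

Returning an empty list on any failure is one way to express "bail out of the entire process"; a caller could just as well log the error and leave the episode's chapters untouched.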

How have you tested this?

I have added a new test suite for this scraping code, and I've tried to cover a number of success and failure cases.
All of the checks listed under "Error checking" above should be covered by tests

I've also been running my fork with this for nearly a week and it's working well on the 3 podcasts I subscribe to which provide timestamps

Screenshots

Web interface

image image

iOS app beta

image

Next steps

I wanted to open this PR to start a discussion with maintainers and get feedback.

I'm aware that there's a re-write of the front end ongoing, so I've tried to craft this PR as something that could land server side and then be included in the new UI. In the meantime it could be enabled on a per-podcast basis via the API.
It would be nice to know if that would be something you'd be open to

harryr0se marked this pull request as ready for review March 13, 2026 19:21
…asts table

- Bump minor version (I wasn't sure if this was needed for the migration)
- Feature is now controlled by the field in the podcast database object
- Move parsing code and tests to existing utils/parsers/ dir
- Add more test cases
advplyr (Owner) commented Mar 17, 2026

I'm open to this, but I don't think we need a flag for it. If we can determine that this is reliable enough then we can have it on by default. It could possibly be a library setting that you could turn off.

If the podcast episode has chapters in the audio file, or it has chapters in the RSS feed then we should always prefer those. If it has neither and the description timestamps meet our criteria, then we pull chapters from the description.

harryr0se (Author) commented

@advplyr Thanks for the feedback!

> I'm open to this, but I don't think we need a flag for it. If we can determine that this is reliable enough then we can have it on by default. It could possibly be a library setting that you could turn off.

That sounds great, let me update the PR to remove the flag and make it the default

> If the podcast episode has chapters in the audio file, or it has chapters in the RSS feed then we should always prefer those. If it has neither and the description timestamps meet our criteria, then we pull chapters from the description.

This makes sense to me. I believe that priority is already part of this PR: auto generation is the final fallback after checking the audioFile and the rssPodcastEpisode objects

    if (audioFile.chapters?.length) {
      // Chapters embedded in the audio file take priority
      podcastEpisode.chapters = audioFile.chapters.map((ch) => ({ ...ch }))
    } else if (rssPodcastEpisode.chapters?.length) {
      // Next, chapters supplied by the RSS feed
      podcastEpisode.chapters = rssPodcastEpisode.chapters.map((ch) => ({ ...ch }))
    } else {
      // ... Try auto generating
    }

> If we can determine that this is reliable enough

Regarding this, do you have any particular tests you'd like to see for such a feature?
I've been going through the podcasts that I personally subscribe to which support timestamps and testing with them.

Last night I also went through the top podcasts on PocketCasts searching for more test cases, which is where I came across an example where chapter titles could contain HTML tags. Currently I'm capturing all of these with automated tests. If you have any further scenarios or error checking you'd like to see, let me know

harryr0se (Author) commented

@advplyr I've updated the PR to remove the flag and added a few more test cases

I've also added a high-level early out for when there are no timestamps anywhere in the full description string, to make sure this code only runs when it's most likely to succeed
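
A cheap pre-check of this kind might look something like the sketch below. The regex and function name are assumptions for illustration; the PR's actual check is not shown in this thread.

```javascript
// Hypothetical early-out sketch; the regex and name are assumptions.

// Matches anything shaped like mm:ss (which also covers h:mm:ss).
// No global flag, so test() keeps no state between calls.
const ANY_TIMESTAMP_REGEX = /\b\d{1,2}:\d{2}\b/

function mightContainTimestamps(description) {
  // Skip the line-by-line scrape entirely when nothing timestamp-like exists
  return typeof description === 'string' && ANY_TIMESTAMP_REGEX.test(description)
}
```

A caller would then only split and scrape the description when `mightContainTimestamps(description)` returns true.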

throw new Error(`Chapter found that starts after the audio duration. Duration: ${audioDurationSecs}s - Chapter start ${startTime}s`)
}

let chapterTitleMatch = chapterTitleRegex.exec(line)

Check failure

Code scanning / CodeQL

Polynomial regular expression used on uncontrolled data High

This regular expression that depends on a user-provided value may run slow on strings with many repetitions of ' '.
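
One possible way to sidestep this class of alert is sketched below. It is only a suggestion under stated assumptions: the actual chapterTitleRegex is not shown in this thread, and the names, separator set and length cap here are hypothetical. The idea is to locate the timestamp with a regex that only uses bounded quantifiers, then pull the title out with plain string operations, which run in linear time regardless of how many spaces the line contains.

```javascript
// Hypothetical sketch; not the PR's chapterTitleRegex. Names and limits are assumptions.

// Only bounded quantifiers, so there is no run of repeated characters to backtrack over
const LINE_TIMESTAMP_REGEX = /\b(?:\d{1,2}:)?\d{1,2}:\d{2}\b/

// Assumed cap on line length as an extra guard against pathological input
const MAX_LINE_LENGTH = 500

// Characters allowed between the timestamp and the title, e.g. "00:00 - Intro"
const SEPARATOR_CHARS = ' \t-:|'

function extractTitleAfterTimestamp(line) {
  if (typeof line !== 'string' || line.length > MAX_LINE_LENGTH) return null

  const match = LINE_TIMESTAMP_REGEX.exec(line)
  if (!match) return null

  const rest = line.slice(match.index + match[0].length)

  // Skip leading separators with a simple linear scan instead of a regex
  let i = 0
  while (i < rest.length && SEPARATOR_CHARS.includes(rest[i])) i++

  const title = rest.slice(i).trimEnd()
  return title.length ? title : null
}
```

For example, `extractTitleAfterTimestamp('00:00 - Intro')` would return `'Intro'`, and a line consisting of a timestamp followed only by spaces would return `null`, which the caller could treat as a failure to scrape the title.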

3 participants