feat: Auto generate chapters for podcasts that provide timestamps#5119
feat: Auto generate chapters for podcasts that provide timestamps#5119harryr0se wants to merge 18 commits intoadvplyr:masterfrom
Conversation
…efix to match other logs
…ub.com/harryr0se/audiobookshelf into auto-generate-chapters-from-timestamps
…asts table - Bump minor version (I wasn't sure if this was needed for the migration) - Feature is now controlled by the field in the podcast database object - Move parsing code and tests to existing utils/parsers/ dir - Add more test cases
|
I'm open to this, but I don't think we need a flag for it. If we can determine that this is reliable enough then we can have it on by default. It could possibly be a library setting that you could turn off. If the podcast episode has chapters in the audio file, or it has chapters in the RSS feed then we should always prefer those. If it has neither and the description timestamps meet our criteria, then we pull chapters from the description. |
|
@advplyr Thanks for the feedback!
That sounds great, let me update the PR to remove the flag and make it the default
This makes sense to me, I believe that priority is already part of this PR, this is the final fallback after checking the if (audioFile.chapters?.length) {
podcastEpisode.chapters = audioFile.chapters.map((ch) => ({ ...ch }))
} else if (rssPodcastEpisode.chapters?.length) {
podcastEpisode.chapters = rssPodcastEpisode.chapters.map((ch) => ({ ...ch }))
} else {
... Try auto generating
}
Regarding this, do you have any particular tests you'd like to see for such a feature? Last night I also went through the top podcasts on PocketCasts searching for more test cases, which is where I came across an example where chapter titles could contain html tags. Currently I'm capturing all of these with automated tests, if you have any further scenarios or error checking you'd like to see let me know |
|
@advplyr I've updated the PR to remove the flag and added a few more test cases I've also added a high level early out if there's no timestamps in the full description string to make sure this code only runs when it's most likely to succeed |
| throw new Error(`Chapter found that starts after over audio duration. Duration: ${audioDurationSecs}s - Chapter start ${startTime}s`) | ||
| } | ||
|
|
||
| let chapterTitleMatch = chapterTitleRegex.exec(line) |
Check failure
Code scanning / CodeQL
Polynomial regular expression used on uncontrolled data High
Brief summary
This PR adds support for the automatic generation of chapters when a podcast episode provides timestamps in the description, it does this by scraping the description line by line and building up a chapter list
Which issue is fixed?
I started working on this as it's something I really wanted, but I've found the following related issue:
#2363
In-depth Description
If the newly added
autoGenerateChaptersfield is true on thePodcastobject, the generation code will run when ABS creates aPodcastEpisodeobject from a newly downloadedRSSPocastEpisodeThe generation steps:
</p>,<br />or\nendvalue to be this new chaptersstart, this makes the assumption that timestamps will be sequential and contiguousError checking
I believe that this sort of feature should be quite conservative and if there are instances where we would be unsure of the state of a given timestamp we should bail out of the entire process for the podcast episode. This is particularly important due to the fact we're treating them as neighboring chapters, so errors could propagate
This implementation currently has the following error handling:
nullchecksHow have you tested this?
I have added a new test suite for this scraping code, I've tried to cover a number of success and failure cases
All of the above checks if "error checking" should be captured by tests
I've also been running my fork with this for nearly a week and it's working well on the 3 podcasts I subscribe to which provide timestamps
Screenshots
Web interface
iOS app beta
Next steps
I wanted to open this PR to start a discussion with maintainers and get feedback.
I'm aware that there's a re-write of the front end ongoing, so I've tried to craft this PR the something that could land server side and then be included in the new UI. In the meantime it could be enabled on a per podcast basis via the api
It would be nice to know if that would be something you'd be open to