the-ai-entrepreneur-ai-hub/youtube-transcript-extractor

YouTube Transcript Extractor — Get Video Transcripts for AI, RAG & Content Repurposing

Extract transcripts, subtitles, and captions from any YouTube video. Supports playlists, channels, 100+ languages, and both manual and auto-generated captions. Perfect for AI training data, RAG pipelines, content repurposing, and SEO analysis. No YouTube API key required.

Run on Apify | Available on RapidAPI | License: ISC

What It Does

flowchart LR
    A["YouTube URLs<br/>Videos, Playlists, Channels"] --> B["InnerTube API<br/>Android / iOS / TV clients"]
    B --> C["Caption Tracks<br/>100+ Languages"]
    C --> D["Structured Output<br/>Full Text + Timestamps"]
    D --> E["Your Pipeline<br/>RAG / AI / Content"]

    style A fill:#ff0000,color:#fff,stroke:none
    style B fill:#1a1a2e,color:#fff,stroke:none
    style C fill:#0f3460,color:#fff,stroke:none
    style D fill:#533483,color:#fff,stroke:none
    style E fill:#34a853,color:#fff,stroke:none

This extractor uses YouTube's internal InnerTube API to fetch caption tracks directly — no YouTube Data API key required, no OAuth, no quotas. It tries multiple client types (Android, iOS, TV) to maximize success rate, even for music videos and VEVO content.
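
The multi-client fallback described above can be pictured as a simple "try each client in order" loop. The sketch below is not the extractor's actual implementation: the `fetch_with_fallback` helper and the three stub fetchers are hypothetical stand-ins, and none of them call YouTube. It only illustrates the control flow of moving to the next client when one fails or returns no caption tracks.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical type: a fetcher takes a video ID and returns caption tracks.
# It may raise, or return an empty list, when that client yields nothing.
Fetcher = Callable[[str], list]


def fetch_with_fallback(
    video_id: str,
    fetchers: Sequence[Tuple[str, Fetcher]],
) -> Tuple[Optional[str], list]:
    """Try each (client_name, fetcher) in order; return the first non-empty result."""
    for client_name, fetch in fetchers:
        try:
            tracks = fetch(video_id)
        except Exception:
            # This client failed; move on to the next one.
            continue
        if tracks:
            return client_name, tracks
    # No client produced caption tracks.
    return None, []


# Stand-in fetchers for demonstration only; they do not contact YouTube.
def android_stub(video_id: str) -> list:
    raise RuntimeError("simulated failure")


def ios_stub(video_id: str) -> list:
    return []


def tv_stub(video_id: str) -> list:
    return [{"languageCode": "en"}]


client_used, tracks = fetch_with_fallback(
    "dQw4w9WgXcQ",
    [("android", android_stub), ("ios", ios_stub), ("tv", tv_stub)],
)
print(client_used, tracks)
```

Ordering the clients from most to least reliable keeps the common case to a single request; the order used here is arbitrary.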

What Data You Get

{
  "videoId": "dQw4w9WgXcQ",
  "videoUrl": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
  "title": "Rick Astley - Never Gonna Give You Up",
  "channelName": "Rick Astley",
  "viewCount": 1500000000,
  "publishDate": "2009-10-25",
  "language": "en",
  "isAutoGenerated": true,
  "hasTranscript": true,
  "transcriptText": "We're no strangers to love. You know the rules and so do I...",
  "wordCount": 284,
  "segments": [
    {
      "text": "We're no strangers to love",
      "start": 18.0,
      "duration": 3.5,
      "startFormatted": "0:18"
    }
  ],
  "availableLanguages": [
    { "code": "en", "name": "English", "isAutoGenerated": true },
    { "code": "es", "name": "Spanish", "isAutoGenerated": true }
  ]
}
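
As a small illustration of working with the `segments` array, the sketch below turns each segment into a timestamped line. The `format_start` helper is an assumption: it reproduces the `0:18` value shown for `start: 18.0` in the sample, but the `H:MM:SS` form used for videos longer than an hour is a guess, not documented behavior of the extractor.

```python
def format_start(seconds: float) -> str:
    """Format a start offset in seconds as M:SS, or H:MM:SS past one hour."""
    total = int(seconds)  # drop the fractional part
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def to_timestamped_lines(segments: list) -> list:
    """Render each segment as '[M:SS] text'."""
    return [f"[{format_start(seg['start'])}] {seg['text']}" for seg in segments]


# The single segment from the sample output above.
sample_segments = [
    {"text": "We're no strangers to love", "start": 18.0, "duration": 3.5},
]

lines = to_timestamped_lines(sample_segments)
print("\n".join(lines))

# A segment's end time is its start plus its duration.
end_time = sample_segments[0]["start"] + sample_segments[0]["duration"]
print(end_time)
```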

Quick Start

cURL

curl "https://api.apify.com/v2/acts/george.the.developer~youtube-transcript-scraper/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
  -X POST \
  -d '{
    "urls": ["https://www.youtube.com/watch?v=dQw4w9WgXcQ"],
    "language": "en",
    "outputFormat": "both",
    "includeMetadata": true
  }' \
  -H 'Content-Type: application/json'

Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('george.the.developer/youtube-transcript-scraper').call({
    urls: [
        'https://www.youtube.com/watch?v=dQw4w9WgXcQ',
        'https://www.youtube.com/playlist?list=PLrAXtmErZgOeiKm4sgNOknGvNjby9efdf',
    ],
    language: 'en',
    outputFormat: 'full-text',
    includeMetadata: true,
    maxVideos: 50,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
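items.forEach(video => {
    // Videos without captions have no transcript text, so skip them.
    if (!video.hasTranscript) return;
    console.log(`${video.title} (${video.wordCount} words)`);
    console.log(video.transcriptText.substring(0, 200) + '...');
});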

Python — Build a RAG Pipeline

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Extract transcripts from an entire playlist
run = client.actor("george.the.developer/youtube-transcript-scraper").call(run_input={
    "urls": ["https://www.youtube.com/playlist?list=YOUR_PLAYLIST_ID"],
    "language": "en",
    "outputFormat": "full-text",
    "includeMetadata": True,
    "maxVideos": 100,
})

# Build documents for RAG/vector store
documents = []
for video in client.dataset(run["defaultDatasetId"]).iterate_items():
    if video.get("hasTranscript"):
        documents.append({
            "text": video["transcriptText"],
            "metadata": {
                "source": video["videoUrl"],
                "title": video["title"],
                "channel": video["channelName"],
                "date": video.get("publishDate", ""),
            }
        })

print(f"Built {len(documents)} documents for RAG pipeline")

# Feed into your vector store (Pinecone, Weaviate, Chroma, etc.)
# for doc in documents:
#     vector_store.add(doc["text"], metadata=doc["metadata"])
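
Full transcripts are often too long to embed as a single document, so a common preprocessing step is to split each one into overlapping chunks before adding it to a vector store. The sketch below shows one simple way to do that by word count. The `chunk_words` helper and its default sizes are illustrative assumptions, not part of the extractor or of any particular vector store's API.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list:
    """Split text into chunks of up to chunk_size words, overlapping by overlap words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# Tiny demonstration with ten "words" so the overlap is easy to see.
demo_text = " ".join(str(i) for i in range(10))
demo_chunks = chunk_words(demo_text, chunk_size=4, overlap=1)
print(demo_chunks)
```

Each chunk can then be stored with the same `metadata` dictionary as its parent video, so retrieved passages still point back to the source URL.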

Use Cases

  • AI Training Data — Build datasets from YouTube transcripts for LLM fine-tuning or NLP research
  • RAG Pipelines — Index video content in vector databases for retrieval-augmented generation
  • Content Repurposing — Turn videos into blog posts, newsletters, social media threads
  • SEO Analysis — Analyze competitor video content and keyword usage
  • Accessibility — Generate text versions of video content
  • Research — Analyze speeches, lectures, interviews, and educational content at scale
  • Podcast Notes — Auto-generate show notes from video podcasts

Input Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | string[] | required | YouTube video, playlist, or channel URLs |
| language | string | en | Preferred language code |
| includeTimestamps | boolean | true | Include start time per segment |
| outputFormat | string | both | full-text, segments, or both |
| maxVideos | number | 50 | Max videos to process (1-5000) |
| includeMetadata | boolean | true | Include title, channel, views, etc. |
| maxConcurrency | number | 5 | Concurrent requests |
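
A client-side helper can apply the defaults from the table and catch obvious mistakes before a run is started. The sketch below is a convenience wrapper under stated assumptions: `build_run_input` is a hypothetical name, and whether the actor itself performs the same validation is not stated here. The defaults, the allowed `outputFormat` values, and the 1-5000 range for `maxVideos` are taken from the table above.

```python
# Defaults and allowed values as listed in the Input Parameters table.
DEFAULTS = {
    "language": "en",
    "includeTimestamps": True,
    "outputFormat": "both",
    "maxVideos": 50,
    "includeMetadata": True,
    "maxConcurrency": 5,
}
OUTPUT_FORMATS = {"full-text", "segments", "both"}


def build_run_input(urls, **overrides) -> dict:
    """Merge overrides onto the documented defaults and validate the result."""
    if isinstance(urls, str) or not urls:
        raise ValueError("urls must be a non-empty list of URL strings")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    run_input = {"urls": list(urls), **DEFAULTS, **overrides}
    if run_input["outputFormat"] not in OUTPUT_FORMATS:
        raise ValueError("outputFormat must be full-text, segments, or both")
    if not 1 <= run_input["maxVideos"] <= 5000:
        raise ValueError("maxVideos must be between 1 and 5000")
    return run_input


run_input = build_run_input(
    ["https://www.youtube.com/watch?v=dQw4w9WgXcQ"],
    outputFormat="segments",
    maxVideos=10,
)
print(run_input)
```

The resulting dictionary can be passed as `run_input` to the Python client call shown in the Quick Start.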

Supported URL Formats

  • https://www.youtube.com/watch?v=VIDEO_ID
  • https://youtu.be/VIDEO_ID
  • https://www.youtube.com/playlist?list=PLAYLIST_ID
  • https://www.youtube.com/@ChannelName
  • https://www.youtube.com/channel/CHANNEL_ID
  • https://www.youtube.com/shorts/VIDEO_ID
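
When preparing a batch of URLs, it can help to check locally which of the formats above each one matches. The sketch below uses only Python's standard `urllib.parse`. The `classify_youtube_url` function is a hypothetical helper and does not reflect how the extractor parses URLs internally; it covers just the six formats listed.

```python
from urllib.parse import parse_qs, urlparse


def classify_youtube_url(url: str) -> tuple:
    """Return (kind, identifier) where kind is video, playlist, channel, or unknown."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    parts = [p for p in parsed.path.split("/") if p]
    query = parse_qs(parsed.query)

    # Short links: https://youtu.be/VIDEO_ID
    if host == "youtu.be" and parts:
        return ("video", parts[0])
    if host != "youtube.com":
        return ("unknown", None)
    if parsed.path == "/watch" and "v" in query:
        return ("video", query["v"][0])
    if parsed.path == "/playlist" and "list" in query:
        return ("playlist", query["list"][0])
    if len(parts) == 2 and parts[0] == "shorts":
        return ("video", parts[1])
    if len(parts) == 2 and parts[0] == "channel":
        return ("channel", parts[1])
    # Handle-style channel URLs: https://www.youtube.com/@ChannelName
    if parts and parts[0].startswith("@"):
        return ("channel", parts[0])
    return ("unknown", None)


for example in [
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "https://youtu.be/dQw4w9WgXcQ",
    "https://www.youtube.com/@ChannelName",
]:
    print(example, classify_youtube_url(example))
```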

Run on Apify

Run this actor on Apify — extract transcripts from hundreds of videos in minutes.

Also Available on RapidAPI

Prefer a standard REST API? This extractor is also available on RapidAPI with simple API key authentication:

  • Free tier: 30 requests/month
  • Pro: $19/month (500 requests)
  • Ultra: $49/month (2,000 requests)
  • Mega: $129/month (10,000 requests)

Limitations

  • Not all YouTube videos have captions/transcripts. The extractor reports hasTranscript: false for videos without available captions.
  • Auto-generated captions may contain errors (especially for technical jargon or non-English content).
  • This tool does not bypass age restrictions or geo-blocking. Routing requests through a proxy can help with geo-restricted videos.
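
Because some videos come back with `hasTranscript: false`, it is worth separating those items out and reporting them instead of silently dropping them. The sketch below assumes dataset items shaped like the sample output. The `split_by_transcript` helper and the second video ID are made up for illustration.

```python
def split_by_transcript(items: list) -> tuple:
    """Partition dataset items into (with_transcript, without_transcript)."""
    with_transcript, without_transcript = [], []
    for item in items:
        if item.get("hasTranscript"):
            with_transcript.append(item)
        else:
            without_transcript.append(item)
    return with_transcript, without_transcript


# Illustrative items; "noCaptions01" is a made-up video ID.
sample_items = [
    {"videoId": "dQw4w9WgXcQ", "hasTranscript": True, "wordCount": 284},
    {"videoId": "noCaptions01", "hasTranscript": False},
]

usable, missing = split_by_transcript(sample_items)
print(f"{len(usable)} with transcripts, {len(missing)} without")
for item in missing:
    print("No captions available:", item["videoId"])
```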

License

ISC License. See LICENSE for details.


Built by george.the.developer on Apify.
