Sample JSON structure:

{
  "videoId": "abc123",
  "title": "placeholder title",
  "channel": "channel name",
  "subtitles": [
    { "start": 0.0, "text": "..." }
  ]
}

Collected subtitles and metadata from 2,900+ YouTube videos and packaged them into structured JSON datasets for further research and analysis.
The client needed to collect and structure content from several YouTube channels to analyze recurring topics and industry discussions. The goal was a dataset suitable for downstream text analysis and research.
I developed a TypeScript CLI pipeline running on Node.js. The system retrieves video metadata, extracts automatic subtitles, stores each video as a separate JSON file, and then combines them into channel-level datasets.
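The per-video and channel-level stages can be sketched as follows. This is a minimal illustration, not the project's actual code; the type shapes follow the sample JSON structure above, while the function and directory names are hypothetical.

```typescript
import * as fs from "fs";
import * as path from "path";

interface SubtitleLine {
  start: number; // subtitle start time in seconds
  text: string;
}

interface VideoRecord {
  videoId: string;
  title: string;
  channel: string;
  subtitles: SubtitleLine[];
}

// Store each video as a separate JSON file, then combine all records
// into a single channel-level dataset file.
function exportChannelDataset(videos: VideoRecord[], outDir: string): string {
  fs.mkdirSync(outDir, { recursive: true });
  for (const video of videos) {
    const videoPath = path.join(outDir, `${video.videoId}.json`);
    fs.writeFileSync(videoPath, JSON.stringify(video, null, 2));
  }
  const channel = videos[0]?.channel ?? "unknown";
  const datasetPath = path.join(outDir, `${channel}.json`);
  fs.writeFileSync(datasetPath, JSON.stringify(videos, null, 2));
  return datasetPath;
}
```

Keeping one JSON file per video means a long run can resume without re-fetching already-processed videos; the channel-level export is a cheap merge step at the end.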
Large-scale extraction from YouTube requires dealing with unstable access and platform limits. To keep the pipeline reliable during long runs, the system includes retries, randomized delays, cooldown periods, API key rotation, user-agent rotation, and a status registry that tracks processing attempts.
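The retry-and-rotation pattern described above can be sketched roughly like this. All names here (the key list, delay parameters, helper functions) are illustrative assumptions, not the project's real configuration:

```typescript
// Placeholder keys: in a real run these would come from configuration.
const API_KEYS = ["key-a", "key-b", "key-c"];
let keyIndex = 0;

// Rotate through the available API keys on each attempt.
function nextApiKey(): string {
  keyIndex = (keyIndex + 1) % API_KEYS.length;
  return API_KEYS[keyIndex];
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run a task with retries: exponential backoff plus random jitter,
// switching API keys between attempts.
async function withRetries<T>(
  task: (apiKey: string) => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 1000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task(nextApiKey());
    } catch (err) {
      lastError = err;
      // Randomized delay avoids a fixed request rhythm that platforms can detect.
      const delay = baseDelayMs * 2 ** attempt + Math.random() * baseDelayMs;
      await sleep(delay);
    }
  }
  throw lastError;
}
```

A status registry would sit on top of this: each video's attempt count and last outcome recorded to disk, so failed items can be retried after a cooldown instead of blocking the whole run.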
The repository contains processed data for more than 2,900 videos across three channels, exported as structured JSON datasets totaling about 164 MB. The resulting data can be used for topic analysis, keyword extraction, and content clustering.
channel-a.json
channel-b.json
channel-c.json
videos: 2900+
size: ~164 MB

channels -> metadata fetch -> subtitle extraction -> per-video JSON -> channel dataset export