YouTube data collection pipeline for research

Collected subtitles and metadata from 2,900+ YouTube videos and packaged them into structured JSON datasets for further research and analysis.

overview

The client needed to collect and structure content from several YouTube channels to analyze recurring topics and industry discussions. The goal was a dataset ready for downstream text analysis and research.

solution

I developed a TypeScript CLI pipeline running on Node.js. The system retrieves video metadata, extracts automatic subtitles, stores each video as a separate JSON file, and then combines them into channel-level datasets.
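The per-video flow described above can be sketched roughly as follows. All names here are hypothetical, and the fetchers are stubs standing in for the real metadata and subtitle extraction steps, which are not shown in this write-up:

```typescript
// Shapes of the data flowing through the pipeline (illustrative).
interface VideoMeta { videoId: string; title: string; channel: string; }
interface SubtitleLine { start: number; text: string; }
interface VideoRecord extends VideoMeta { subtitles: SubtitleLine[]; }

// Stub fetchers; the real pipeline retrieves these from YouTube.
async function fetchMetadata(videoId: string): Promise<VideoMeta> {
  return { videoId, title: "placeholder title", channel: "channel name" };
}
async function fetchSubtitles(videoId: string): Promise<SubtitleLine[]> {
  return [{ start: 0.0, text: "..." }];
}

// One record per video, then combined into a channel-level dataset.
async function processVideo(videoId: string): Promise<VideoRecord> {
  const meta = await fetchMetadata(videoId);
  const subtitles = await fetchSubtitles(videoId);
  return { ...meta, subtitles };
}

async function buildChannelDataset(videoIds: string[]): Promise<VideoRecord[]> {
  const records: VideoRecord[] = [];
  for (const id of videoIds) {
    // Sequential rather than parallel, to stay under platform limits.
    records.push(await processVideo(id));
  }
  return records;
}
```

Processing videos sequentially trades speed for reliability, which matters more than throughput on a one-off collection run.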

challenges

Large-scale extraction from YouTube requires dealing with unstable access and platform limits. To keep the pipeline reliable during long runs, the system includes retries, randomized delays, cooldown periods, API key rotation, user-agent rotation, and a status registry that tracks processing attempts.
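The retry-with-randomized-delay part of that list can be illustrated with a small wrapper. This is a sketch, not the project's actual code; parameter names and defaults are illustrative:

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries a failing async task with exponential backoff plus random jitter,
// so long runs don't hammer the platform at predictable intervals.
async function withRetries<T>(
  task: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      // Doubling delay per attempt, with jitter added on top.
      const delay = baseDelayMs * 2 ** (attempt - 1) + Math.random() * baseDelayMs;
      await sleep(delay);
    }
  }
  throw lastError;
}
```

Key rotation and user-agent rotation follow the same shape: a small pool cycled on each attempt, so a blocked credential or fingerprint doesn't stall the whole run.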

result

The repository contains processed data for more than 2,900 videos across three channels, exported as structured JSON datasets totaling about 164 MB. The resulting data can be used for topic analysis, keyword extraction, and content clustering.

stack

  • TypeScript
  • Node.js
  • YouTube internal endpoints
  • JSON dataset pipeline

proof / artifacts

sample JSON structure

{
  "videoId": "abc123",
  "title": "placeholder title",
  "channel": "channel name",
  "subtitles": [
    { "start": 0.0, "text": "..." }
  ]
}
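The sample record above maps onto a TypeScript type like the following. The light runtime check is a sketch of how a record could be validated before use; the actual validation in the project may differ:

```typescript
interface SubtitleLine {
  start: number; // offset in seconds from the start of the video
  text: string;
}

interface VideoRecord {
  videoId: string;
  title: string;
  channel: string;
  subtitles: SubtitleLine[];
}

// Parse a per-video JSON file with a minimal shape check.
function parseVideoRecord(json: string): VideoRecord {
  const data = JSON.parse(json);
  if (typeof data.videoId !== "string" || !Array.isArray(data.subtitles)) {
    throw new Error("unexpected record shape");
  }
  return data as VideoRecord;
}
```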

dataset export preview

channel-a.json
channel-b.json
channel-c.json

videos: 2900+
size: ~164 MB

pipeline overview

channels -> metadata fetch -> subtitle extraction -> per-video JSON -> channel dataset export
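The final step of that flow, combining per-video JSON files into one channel dataset, could look like this. Paths and function names are illustrative, not taken from the repository:

```typescript
import { readdirSync, readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Read every per-video JSON file in a directory and write them out
// as a single channel-level dataset array. Returns the video count.
function exportChannelDataset(videoDir: string, outFile: string): number {
  const records = readdirSync(videoDir)
    .filter((name) => name.endsWith(".json"))
    .map((name) => JSON.parse(readFileSync(join(videoDir, name), "utf8")));
  writeFileSync(outFile, JSON.stringify(records, null, 2));
  return records.length;
}
```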