YouTube data collection pipeline for research

Collected subtitles and metadata from 2,900+ YouTube videos and packaged them into structured JSON datasets for further research and analysis.

overview

The client needed to collect and structure content from several YouTube channels to analyze recurring topics and industry discussions. The goal was a dataset ready for downstream text analysis and research.

solution

I developed a TypeScript CLI pipeline running on Node.js. The system retrieves video metadata, extracts automatic subtitles, stores each video as a separate JSON file, and then combines them into channel-level datasets.
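The per-video flow described above can be sketched roughly as follows. All names here are hypothetical, and the fetchers are stubs standing in for the real metadata and subtitle extraction steps, which are not shown in this write-up:

```typescript
// Shapes of the data flowing through the pipeline (illustrative).
interface VideoMeta { videoId: string; title: string; channel: string; }
interface SubtitleLine { start: number; text: string; }
interface VideoRecord extends VideoMeta { subtitles: SubtitleLine[]; }

// Stub fetchers; the real pipeline retrieves these from YouTube.
async function fetchMetadata(videoId: string): Promise<VideoMeta> {
  return { videoId, title: "placeholder title", channel: "channel name" };
}
async function fetchSubtitles(videoId: string): Promise<SubtitleLine[]> {
  return [{ start: 0.0, text: "..." }];
}

// One record per video, then combined into a channel-level dataset.
async function processVideo(videoId: string): Promise<VideoRecord> {
  const meta = await fetchMetadata(videoId);
  const subtitles = await fetchSubtitles(videoId);
  return { ...meta, subtitles };
}

async function buildChannelDataset(videoIds: string[]): Promise<VideoRecord[]> {
  const records: VideoRecord[] = [];
  for (const id of videoIds) {
    // Sequential rather than parallel, to stay under platform limits.
    records.push(await processVideo(id));
  }
  return records;
}
```

Processing videos sequentially trades speed for reliability, which matters more than throughput on a one-off collection run.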

challenges

Large-scale extraction from YouTube requires dealing with unstable access and platform limits. To keep the pipeline reliable during long runs, the system includes retries, randomized delays, cooldown periods, API key rotation, user-agent rotation, and a status registry that tracks processing attempts.
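The retry-with-randomized-delay part of that list can be illustrated with a small wrapper. This is a sketch, not the project's actual code; parameter names and defaults are illustrative:

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries a failing async task with exponential backoff plus random jitter,
// so long runs don't hammer the platform at predictable intervals.
async function withRetries<T>(
  task: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      // Doubling delay per attempt, with jitter added on top.
      const delay = baseDelayMs * 2 ** (attempt - 1) + Math.random() * baseDelayMs;
      await sleep(delay);
    }
  }
  throw lastError;
}
```

Key rotation and user-agent rotation follow the same shape: a small pool cycled on each attempt, so a blocked credential or fingerprint doesn't stall the whole run.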

result

The repository contains processed data for more than 2,900 videos across three channels, exported as structured JSON datasets totaling about 164 MB. The resulting data can be used for topic analysis, keyword extraction, and content clustering.

stack

  • TypeScript
  • Node.js
  • YouTube internal endpoints
  • JSON dataset pipeline

proof / artifacts

sample JSON structure

{
  "videoId": "abc123",
  "title": "placeholder title",
  "channel": "channel name",
  "subtitles": [
    { "start": 0.0, "text": "..." }
  ]
}
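The sample record above maps onto a TypeScript type like the following. The light runtime check is a sketch of how a record could be validated before use; the actual validation in the project may differ:

```typescript
interface SubtitleLine {
  start: number; // offset in seconds from the start of the video
  text: string;
}

interface VideoRecord {
  videoId: string;
  title: string;
  channel: string;
  subtitles: SubtitleLine[];
}

// Parse a per-video JSON file with a minimal shape check.
function parseVideoRecord(json: string): VideoRecord {
  const data = JSON.parse(json);
  if (typeof data.videoId !== "string" || !Array.isArray(data.subtitles)) {
    throw new Error("unexpected record shape");
  }
  return data as VideoRecord;
}
```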

dataset export preview

channel-a.json
channel-b.json
channel-c.json

videos: 2900+
size: ~164 MB

pipeline overview

channels -> metadata fetch -> subtitle extraction -> per-video JSON -> channel dataset export
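The final step of that flow, combining per-video JSON files into one channel dataset, could look like this. Paths and function names are illustrative, not taken from the repository:

```typescript
import { readdirSync, readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Read every per-video JSON file in a directory and write them out
// as a single channel-level dataset array. Returns the video count.
function exportChannelDataset(videoDir: string, outFile: string): number {
  const records = readdirSync(videoDir)
    .filter((name) => name.endsWith(".json"))
    .map((name) => JSON.parse(readFileSync(join(videoDir, name), "utf8")));
  writeFileSync(outFile, JSON.stringify(records, null, 2));
  return records.length;
}
```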