per-video JSON
{
  "video_id": "abc123",
  "channel": "DevOops_conf",
  "title": "video title",
  "views": 12345,
  "duration": 901,
  "published": "2024-05-01",
  "text": "normalized transcript text"
}

Automates collection and structuring of YouTube data for scalable analysis and research workflows.
01
This case study presents a CLI and data pipeline that collects YouTube data and publishes stable JSON artifacts for downstream analysis. The emphasis is not only on fetching subtitles but on a repository with a clear public contract: a clean README, a runbook, architecture notes, curated examples, and predictable exports.
02
I built a TypeScript CLI on Node.js with an explicit flow for single-video fetches, batch processing, and channel-level exports. The repository now documents the operational layer as well as the code: a cleaner README/runbook, `docs/architecture.md`, CI, typed response boundaries, and small representative fixtures under `examples/`.
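As an illustration of a typed response boundary, here is a minimal sketch that narrows untyped input (for example, parsed JSON) to the per-video record shape shown on this page. The names `VideoRecord`, `requireString`, `requireCount`, and `parseVideoRecord` are hypothetical and not taken from the repository; treating `duration` as seconds and `published` as a `YYYY-MM-DD` string is an assumption based on the sample record.

```typescript
// Hypothetical record type mirroring the per-video JSON fields shown on this page.
interface VideoRecord {
  video_id: string;
  channel: string;
  title: string;
  views: number;
  duration: number; // assumed to be seconds
  published: string; // assumed YYYY-MM-DD
  text: string;
}

// Read a field that must be a string, or fail with the field name.
function requireString(r: Record<string, unknown>, key: string): string {
  const v = r[key];
  if (typeof v !== "string") throw new Error(`field "${key}" must be a string`);
  return v;
}

// Read a field that must be a non-negative integer, or fail with the field name.
function requireCount(r: Record<string, unknown>, key: string): number {
  const v = r[key];
  if (typeof v !== "number" || !Number.isInteger(v) || v < 0) {
    throw new Error(`field "${key}" must be a non-negative integer`);
  }
  return v;
}

// Narrow unknown input to VideoRecord; nothing untyped crosses this boundary.
function parseVideoRecord(input: unknown): VideoRecord {
  if (typeof input !== "object" || input === null) {
    throw new Error("record must be an object");
  }
  const r = input as Record<string, unknown>;
  const published = requireString(r, "published");
  if (!/^\d{4}-\d{2}-\d{2}$/.test(published)) {
    throw new Error('field "published" must look like YYYY-MM-DD');
  }
  return {
    video_id: requireString(r, "video_id"),
    channel: requireString(r, "channel"),
    title: requireString(r, "title"),
    views: requireCount(r, "views"),
    duration: requireCount(r, "duration"),
    published,
    text: requireString(r, "text"),
  };
}
```

Rebuilding the object field by field, rather than casting the input, also drops any extra properties the upstream response may carry, which helps keep the written artifact stable.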
03
In practice the pipeline had to deal with unstable player responses, platform limits, and long-running batches. The current repo captures that reliability work with retries, backoff, cooldown windows, key and user-agent rotation, deterministic exports, and transparent notes about the current `youtubei/v1/player` adapter, while keeping that endpoint detail out of the headline positioning.
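The retry-and-backoff part of that reliability work can be sketched roughly as follows. This is a generic capped exponential backoff, not the repository's actual implementation; `RetryOptions`, `backoffDelay`, `withRetry`, and `sleep` are hypothetical names, and the doubling schedule is an assumption. Cooldown windows and key or user-agent rotation are left out.

```typescript
// Hypothetical options for a retry wrapper; names and defaults are illustrative.
interface RetryOptions {
  attempts: number; // total tries, including the first one
  baseDelayMs: number; // wait before the second try
  maxDelayMs: number; // upper bound for any single wait
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Delay after a failed attempt (0-based): base * 2^attempt, capped at maxDelayMs.
function backoffDelay(attempt: number, opts: RetryOptions): number {
  return Math.min(opts.maxDelayMs, opts.baseDelayMs * Math.pow(2, attempt));
}

// Run an async task, retrying on any thrown error until attempts are exhausted.
async function withRetry<T>(task: () => Promise<T>, opts: RetryOptions): Promise<T> {
  if (opts.attempts < 1) throw new RangeError("attempts must be at least 1");
  let lastError: unknown;
  for (let attempt = 0; attempt < opts.attempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      // Only wait if another attempt will follow.
      if (attempt < opts.attempts - 1) {
        await sleep(backoffDelay(attempt, opts));
      }
    }
  }
  throw lastError;
}
```

A caller would wrap a single fetch, for example `withRetry(() => fetchOne(id), { attempts: 4, baseDelayMs: 500, maxDelayMs: 8000 })`, where `fetchOne` stands in for whatever function performs the request.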
04
The result is a polished public repo that makes the project contract easy to review: per-video JSON under `video_data/`, combined channel exports under `exports/`, a runbook for `fetch` / `batch` / `export`, architecture notes, and curated examples for quick inspection. It works as both an engineering showcase and a practical base for text-analysis workflows.
repo
The public repository shows the current shape of the pipeline: README and runbook, architecture notes, CI, sample outputs, and canonical exports for review.
stack
artifacts
{
  "video_id": "abc123",
  "channel": "DevOops_conf",
  "title": "video title",
  "views": 12345,
  "duration": 901,
  "published": "2024-05-01",
  "text": "normalized transcript text"
}

exports/DevOops_conf.json
exports/HighLoadChannel.json
array of sorted per-video records
stable shape for downstream analysis

examples/single-video.json
examples/channel-export.json
small curated artifacts that mirror the public output contract

fetch -> typed per-video JSON -> batch status updates -> export -> canonical channel dataset
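
The export step at the end of this flow can be sketched as a pure function that turns per-video records into one deterministic channel file. This is an illustration under assumptions: the sort key (`published`, then `video_id`), the fixed field order, and the names `ChannelRecord`, `FIELD_ORDER`, `compareRecords`, and `buildChannelExport` are hypothetical, since the page only states that exports are sorted and stable.

```typescript
// Hypothetical record shape, matching the per-video JSON fields shown above.
interface ChannelRecord {
  video_id: string;
  channel: string;
  title: string;
  views: number;
  duration: number;
  published: string;
  text: string;
}

// Fixed key order so serialized output does not depend on how records were built.
const FIELD_ORDER = [
  "video_id", "channel", "title", "views", "duration", "published", "text",
] as const;

// Assumed ordering: publication date first, then video_id as a tie-breaker.
function compareRecords(a: ChannelRecord, b: ChannelRecord): number {
  if (a.published !== b.published) return a.published < b.published ? -1 : 1;
  if (a.video_id !== b.video_id) return a.video_id < b.video_id ? -1 : 1;
  return 0;
}

// Produce the same JSON text for the same set of records, whatever the input order.
function buildChannelExport(records: readonly ChannelRecord[]): string {
  const sorted = [...records].sort(compareRecords);
  const normalized = sorted.map((r) => {
    const out: Record<string, string | number> = {};
    for (const key of FIELD_ORDER) out[key] = r[key];
    return out;
  });
  return JSON.stringify(normalized, null, 2) + "\n";
}
```

Because the function copies the input before sorting and writes keys in a fixed order, re-running an export over unchanged per-video files should yield a byte-identical result, which keeps diffs of the channel dataset meaningful.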