YouTube Video Automation
An end-to-end pipeline that automatically converts kids' news articles into narrated YouTube videos with AI-generated tags and a human review workflow.
March 2026
Overview
YouTube Video Automation is a full-stack platform that turns kids' news articles into ready-to-publish YouTube videos with minimal manual effort. It fetches articles from a news API, generates narration with ElevenLabs TTS, composes videos with full-screen background images and timed text overlays, generates tags with a local AI model, and uploads to YouTube with thumbnails and playlist support.
The motivation was simple: producing educational video content for kids is repetitive and time-consuming. This system automates the entire pipeline while keeping a human-in-the-loop review step so nothing goes live without approval. It runs on a daily schedule but can also be triggered manually from a web dashboard.
Features
- Automated content pipeline — Fetches articles, generates TTS audio, composes video, generates tags, and uploads to YouTube on a configurable daily schedule
- Full-screen video composition — Article image scaled to cover 1920x1080, title at top with text shadow, paragraphs shown one at a time at the bottom with a semi-transparent dark backdrop
- Human review workflow — Preview, approve, reject, or regenerate videos before they go to YouTube
- AI-powered tag generation — Local Ollama LLM generates topic-specific tags, merged with default tags, respecting YouTube's 500-character limit
- YouTube integration — OAuth 2.0 auth, automatic upload with metadata, custom thumbnails from article images, auto-add to playlist, and deletion support
- Background video regeneration — Non-blocking regeneration with real-time status polling and spinner UI
- Bulk operations — Dropdown menu for bulk deleting generated, failed, or rejected videos
- Retry failed uploads — One-click retry for videos that failed during upload
- Configurable settings — Daily article limit, age group targeting (3-6 or 7-10), scheduler timing, ElevenLabs voice selection
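The tag-merging step above can be sketched as a small function. This is an illustrative sketch, not the project's actual code: the function name and the separator accounting are assumptions, but the core idea (defaults first, case-insensitive dedupe, stay under YouTube's 500-character total tag budget) matches the behavior described.

```python
def merge_tags(ai_tags, default_tags, max_chars=500):
    """Merge AI-generated tags with a fixed default set, deduplicating
    case-insensitively and respecting YouTube's ~500-character limit
    across all tags combined. (Hypothetical sketch, not the real module.)"""
    merged, seen, used = [], set(), 0
    for raw in default_tags + ai_tags:  # defaults take priority
        tag = raw.strip()
        key = tag.lower()
        if not tag or key in seen:
            continue
        cost = len(tag) + (1 if merged else 0)  # +1 for a separator character
        if used + cost > max_chars:
            continue  # skip tags that would blow the budget
        merged.append(tag)
        seen.add(key)
        used += cost
    return merged
```

Putting defaults first guarantees the baseline tags always survive the character budget, so an overly verbose AI suggestion can never crowd them out.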
Architecture
The system follows a service-oriented architecture with a FastAPI backend orchestrating multiple specialized services:
Articles API → Article Fetcher → TTS Service (ElevenLabs) → Video Composer (MoviePy + PIL)
→ Tag Generator (Ollama) → YouTube Uploader (Google API) → Playlist
Backend (Python/FastAPI): Central REST API handling all business logic. Uses aiosqlite for async database operations and APScheduler for daily cron jobs. Each pipeline stage is a separate service module with its own error handling and status tracking.
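The project uses APScheduler's cron trigger for the daily run; the underlying scheduling math can be sketched with the standard library alone (a hypothetical helper, not taken from the codebase):

```python
from datetime import datetime, timedelta

def seconds_until_next_run(now: datetime, hour: int, minute: int) -> float:
    """Seconds from `now` until the next daily run at hour:minute.
    Stdlib sketch of what a daily cron trigger computes; the real
    backend delegates this to APScheduler."""
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:  # today's slot already passed: schedule for tomorrow
        target += timedelta(days=1)
    return (target - now).total_seconds()
```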
Video Composition: PIL renders text to images (for precise control over alignment and shadows), which MoviePy composites over the full-screen background. Each paragraph is timed proportionally to its character count relative to the audio duration, shown only during its time slot.
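The proportional timing described above reduces to a few lines. A minimal sketch, assuming the composer wants (start, end) pairs to feed into MoviePy clip timing (the function name is hypothetical):

```python
def paragraph_time_slots(paragraphs, audio_duration):
    """Split audio_duration across paragraphs proportionally to their
    character counts, returning (start, end) second pairs. Each paragraph's
    overlay is shown only during its slot."""
    total_chars = sum(len(p) for p in paragraphs) or 1  # avoid divide-by-zero
    slots, start = [], 0.0
    for p in paragraphs:
        duration = audio_duration * len(p) / total_chars
        slots.append((start, start + duration))
        start += duration
    return slots
```

Because the slots are derived from the same character counts the TTS narration roughly follows, each paragraph stays on screen while its sentences are being read, without needing word-level timestamps from the TTS API.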
Status Machine: Videos flow through 8 states — pending_tts → pending_video → generated → approved/rejected → uploading → uploaded/failed. Each state is independently recoverable, so a failure at any stage doesn't lose prior work.
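The state machine can be captured as a transition table plus a guard. This is a sketch reconstructed from the flow above; the actual app may name or connect states slightly differently (e.g. which states allow regeneration):

```python
# Allowed transitions between pipeline states (assumed from the described flow).
TRANSITIONS = {
    "pending_tts":   {"pending_video", "failed"},
    "pending_video": {"generated", "failed"},
    "generated":     {"approved", "rejected", "pending_video"},  # regenerate
    "approved":      {"uploading"},
    "rejected":      {"pending_video"},   # regenerate after rejection
    "uploading":     {"uploaded", "failed"},
    "failed":        {"uploading"},       # one-click retry
    "uploaded":      set(),               # terminal (aside from deletion)
}

def can_transition(current: str, target: str) -> bool:
    """Guard used before updating a video's status row."""
    return target in TRANSITIONS.get(current, set())
```

Encoding the flow as data rather than scattered `if` checks is what makes each state independently recoverable: any handler can validate a requested transition against one table.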
Frontend (React + Vite): Single-page app with Axios API client, Tailwind CSS styling, and polling-based real-time updates for background operations.
Learnings
- Understanding the YouTube Data API — working with quota limits, OAuth token lifecycle, and undocumented upload restrictions required careful handling of auth refresh, retry logic, and user-friendly error messages.
- TTS cost management — ElevenLabs charges per character, so generated audio is saved separately from the video. This way, adjusting the video layout only requires recomposing the video without regenerating (and repaying for) the audio.
- Video composition is an iterative visual process — getting the layout right (image scaling, text positioning, font sizes, overlay opacity) required constant regeneration and review. Building a one-click regenerate button that works in the background was essential — without it, every layout tweak meant waiting for the full compose cycle.
- The balance between automation and manual control is the real design challenge — full automation sounds ideal but produces mistakes that are expensive to fix once they're on YouTube. The sweet spot is automating the repetitive work (fetching, TTS, composing, tagging) while keeping humans in the loop for quality decisions (approve/reject, tag editing, review before upload). This means every automated step needs a manual override: regenerate the video, edit the tags, retry the upload, delete from YouTube.
- Automated workflows need manual escape hatches at every stage — the pipeline runs daily on a schedule, but things go wrong: bad articles, weird TTS output, quota limits, auth expiry. The status-based pipeline design means each video's state is tracked independently, so a failure at one stage doesn't block others or lose prior work. Every action is reversible — you can regenerate without re-doing TTS, retry uploads without regenerating, delete from YouTube and re-upload. This saves time in the normal case while making it easy to correct mistakes when they happen.
- AI-generated content needs human curation, not replacement — Ollama generates useful tags most of the time, but occasionally produces irrelevant or overly generic suggestions. AI-generated tags are merged with a fixed set of predefined tags to ensure full coverage. The editable tag UI lets you review and adjust before publishing.
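The retry behavior mentioned in these learnings follows a standard pattern. A generic sketch with exponential backoff, assuming the uploader wraps its Google API calls in something like this (the helper and its defaults are illustrative, not the project's actual code):

```python
import time

def retry_with_backoff(operation, max_attempts=3, base_delay=1.0,
                       retryable=(ConnectionError, TimeoutError)):
    """Retry a flaky operation (e.g. a YouTube upload) with exponential
    backoff. On final failure the caller marks the video 'failed' so the
    one-click retry button remains available as a manual escape hatch."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the review UI
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Keeping the automatic attempt count low matters here: quota-limited APIs punish aggressive retries, and the status machine already guarantees a failed upload can be retried manually without regenerating anything.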