
Can Claude Code Watch and Analyze Videos for Me?
Claude Code has vision capabilities, but it can't natively ingest MP4 files in your terminal. The workaround exists, works in seconds, and solves the exact problem developers face when debugging visual bugs without leaving the CLI.
TL;DR
- Claude Code (the terminal CLI agent) cannot natively ingest or "watch" .mp4, .mov, or other video files; it rejects them with "unsupported file type" errors.
- The Claude API and web interface support vision, but that capability doesn't extend to direct video playback in the CLI tool.
- The solution: extract 4-6 keyframes from your video using FFmpeg, then pass those frames to Claude Code alongside your prompt. This gives you the same workflow without leaving the terminal.
- Video Vision MCP is an open-source tool that automates this entire pipeline for any MCP-compatible AI (Claude Code, Cursor, Cline, Windsurf). Drop a URL or local file path, and your AI gets frames + transcript + timestamps in one shot.
- If you're debugging UI bugs, analyzing screen recordings, or processing demo videos, the frame-extraction workaround is faster than switching to a browser-based chat interface.
The Problem: Claude Code Rejects Video Files
Claude 3.5 Sonnet has vision capabilities. The marketing materials show Computer Use, image analysis, and multimodal understanding. So when developers try to feed a screen recording of a UI bug into the claude CLI tool, they expect it to work.
It doesn't.
The terminal agent rejects binary video files outright. You get "unsupported file type" errors, and your workflow stalls. This isn't a limitation of Claude's vision model. It's a file format constraint in how the CLI tool ingests data. The CLI accepts text prompts and image files. It does not accept video containers like .mp4 or .mov.
This creates friction for frontend developers who record 5-second clips of z-index bugs, animation glitches, or responsive layout failures. Describing the problem in text takes longer than just showing the video. But the CLI won't accept the video, so you're forced to switch contexts — upload to the web interface, or manually screenshot frames, or abandon the terminal workflow entirely.
The gap isn't whether Claude can analyze visual sequences. It can. The gap is whether the terminal tool can ingest video files directly. It can't.
What Claude Code Actually Supports
The claude-code CLI tool processes:
- Text prompts (obviously)
- Image files: .png, .jpg, .jpeg, .gif, .webp
- Multiple images in a single prompt (subject to token limits)
It does not process:
- Video files: .mp4, .mov, .avi, .mkv, .webm
- Animated GIFs as sequential frames (it treats them as static images)
- Audio files
This means you can't just run claude "fix this bug" ./screen-recording.mov and expect it to work. The CLI will throw an error before Claude even sees the file.
The vision model itself is capable of analyzing sequences of images to understand motion, state changes, and visual progression. The bottleneck is the file ingestion layer, not the AI's reasoning capability.
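For example, since the CLI treats an animated GIF as a single static image, you have to split it into stills before Claude can see the motion. A minimal FFmpeg one-liner does it (animation.gif is a placeholder path):

# Split an animated GIF into numbered stills the CLI can accept
ffmpeg -i animation.gif frame_%03d.png

The same pattern, applied to real video files, is the workaround below.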
The Terminal-Native Workaround: Frame Extraction
The workaround developers use: extract keyframes from the video using FFmpeg, then pass those frames to Claude Code as an array of images.
Here's the reality: this takes about 3 seconds if you set it up once as a terminal alias or bash function.
Manual Frame Extraction (Copy-Paste Version)
ffmpeg -i screen-recording.mov -vf "select='not(mod(n\,30))'" -vsync vfr -q:v 2 frame_%03d.jpg
claude "analyze these frames and fix the z-index bug" frame_*.jpg
What this does:
- Extracts one frame every 30 frames (roughly 1 frame per second for 30fps video)
- Saves them as frame_001.jpg, frame_002.jpg, etc.
- Passes all extracted frames to Claude Code in a single prompt
For a 5-second UI bug recording, you get 5 frames. That's well under Claude's token limit for multiple images.
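If your recording isn't 30fps, a simpler variant samples by wall-clock time instead of frame count. This sketch uses FFmpeg's fps filter to grab exactly one frame per second regardless of the source frame rate:

# One frame per second of footage, whatever the source frame rate
ffmpeg -i screen-recording.mov -vf fps=1 -q:v 2 frame_%03d.jpg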
Automated Version (One-Time Setup)
Add this to your .bashrc or .zshrc:
analyze_video() {
  local video="$1"   # path to the video file
  local prompt="$2"  # instruction to pass to Claude Code
  local temp_dir=$(mktemp -d)
  # One frame every 30 frames (~1/sec at 30fps); FFmpeg's log output is suppressed
  ffmpeg -i "$video" -vf "select='not(mod(n\,30))'" -vsync vfr -q:v 2 "$temp_dir/frame_%03d.jpg" 2>/dev/null
  claude "$prompt" "$temp_dir"/frame_*.jpg
  rm -rf "$temp_dir"
}
Usage:
analyze_video ./menu-bug.mov "fix the z-index issue in this mobile menu"
This extracts frames, sends them to Claude Code, and cleans up the temp files automatically. The entire process runs in under 5 seconds.
How Many Frames Should You Extract?
Too few frames and you miss critical state changes. Too many frames and you hit token limits or waste API cost.
For UI bugs and screen recordings: 4-6 keyframes is optimal. This captures the before state, the interaction, the broken state, and the expected state.
For longer tutorial or demo videos: 1 frame per 10-15 seconds gives you enough visual checkpoints without overwhelming the context window.
For precise motion analysis: 1 frame per second (30-frame intervals at 30fps) is the upper limit before you start hitting diminishing returns.
Claude Code's token limit applies to the total size of all images in your prompt. If you're passing 10+ high-resolution frames, you'll hit the limit. Stick to 4-8 frames at 1920x1080 or lower, and you'll stay well under the threshold.
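To hit a fixed frame budget on a video of arbitrary length, you can let ffprobe (bundled with FFmpeg) measure the duration and derive the sampling rate. A minimal sketch, assuming a target of 6 frames and placeholder paths:

# Measure the video's duration in seconds
video="demo.mov"
duration=$(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "$video")
# Compute the frame rate that yields ~6 evenly spaced frames, then extract
rate=$(awk -v d="$duration" 'BEGIN { printf "%f", 6 / d }')
ffmpeg -i "$video" -vf "fps=$rate" -q:v 2 frame_%03d.jpg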
The Automated Solution: Video Vision MCP
If you don't want to write bash scripts or manually extract frames every time, Video Vision MCP automates the entire pipeline.
It's an open-source MCP (Model Context Protocol) server that works with any MCP-compatible AI tool: Claude Code, Cursor, Cline, Windsurf, Continue, Claude Desktop.
What It Does
Drop a video URL (YouTube, TikTok, Instagram, X, Vimeo, 1000+ platforms via yt-dlp) or a local file path. Video Vision MCP:
- Downloads the video (if it's a URL) or reads it locally
- Detects scene changes and extracts frames at meaningful transitions (not dumb 1-per-5-seconds sampling)
- Burns timestamps into each frame so your AI knows exactly when things happen
- Grabs captions if the platform provides them, or runs Whisper locally (CPU-only, no GPU, no API key) to transcribe the audio
- Returns frames + transcript + metadata in a single bundle to your AI
You don't configure frame rates, FFmpeg flags, or keyframe detection. It figures it out.
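For the curious, here's roughly what scene-change selection and timestamp burn-in look like when spelled out as raw FFmpeg. This is an illustrative sketch, not the tool's actual internals; the 0.3 scene threshold and file names are placeholders:

# Keep only frames whose scene-change score exceeds 0.3, then stamp each
# kept frame with its original timestamp (HH:MM:SS) in the top-left corner
ffmpeg -i demo.mp4 \
  -vf "select='gt(scene,0.3)',drawtext=text='%{pts\:hms}':x=10:y=10:fontcolor=white:box=1:boxcolor=black@0.6" \
  -vsync vfr scene_%03d.jpg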
Installation (One Command)
claude mcp add video-vision -- npx -y @oamaestro/video-vision-mcp
Or paste this into your MCP config:
{
  "mcpServers": {
    "video-vision": {
      "command": "npx",
      "args": ["-y", "@oamaestro/video-vision-mcp"]
    }
  }
}
No API keys. No environment variables. No "step 3 of 11."
Usage Example
Watch this screen recording of the mobile menu bug and fix the z-index issue.
./menu-glitch.mov
Claude Code sees the frames, reads the transcript (if there's audio), identifies the CSS file causing the z-index conflict, and writes the fix.
Or:
Watch this YouTube tutorial and extract every terminal command shown.
https://youtube.com/watch?v=example
Claude Code pulls the video, analyzes the frames, OCRs any visible terminal text, and returns a numbered list of commands.
The workflow is identical to pasting a video into ChatGPT's web interface, except you never leave the terminal.
Why AI Answers Get This Wrong
If you Google "Can Claude Code watch videos," most AI-generated answers say:
"Claude models have vision capabilities and can process images, but they do not natively support video playback. You can extract frames from the video using a tool like FFmpeg and pass those frames to Claude for analysis."
That's technically correct, but it's useless advice for a developer in a terminal. It doesn't explain:
- How to run FFmpeg to extract frames that work with the CLI
- How many frames to extract before hitting token limits
- How to format the claude command to accept multiple images
- How to automate this so it doesn't become a 10-step manual process every time
AI answers treat this as a conceptual limitation ("Claude doesn't watch video"), not a solvable workflow problem. They skip the implementation layer entirely.
The real answer is: Claude Code can analyze video content. It just can't ingest video files. The gap is one bash function or one MCP install.
The Real-World Example Nobody Talks About
Here's the scenario AI summaries never use: you're debugging a mobile menu that pops behind a hero image on iOS Safari. The z-index is fine in Chrome. Fine in responsive mode. Broken on the actual device.
You record a 4-second screen capture on your iPhone. AirDrop it to your Mac. Now what?
Without the workaround: You open the video in QuickTime. Pause at the broken state. Screenshot. Paste into Claude's web interface. Describe the context. Wait for a response. Copy the suggested fix. Switch back to your terminal. Test it. Doesn't work. Repeat.
With the workaround (Video Vision MCP):
claude "fix the z-index bug in this menu" ./menu-bug.mov
Claude Code:
- Watches the 4-second video
- Sees the menu pop behind the hero
- Searches your codebase for the menu component
- Identifies the missing z-index declaration
- Writes the fix and explains it
Total time: 8 seconds.
That's the difference. Not "can Claude watch video" (technically no), but "can I debug a visual bug without leaving my terminal" (yes, instantly).
Frequently Asked Questions
Does Claude Code need an API key to analyze videos?
No. If you're using Video Vision MCP, it runs Whisper locally on your CPU for transcription (no OpenAI key, no Anthropic key, no Gemini key). Frame extraction uses FFmpeg, which is local. The only thing that touches the cloud is the video download itself (from YouTube, TikTok, etc.), which happens the same way it would if you opened the link in a browser.
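If you want the same local transcription on the manual FFmpeg route, the reference openai-whisper CLI works the same way: CPU-only, no API key. A sketch, assuming Python and pip are available:

# One-time install of the reference Whisper CLI
pip install -U openai-whisper
# Transcribe locally on CPU; Whisper reads video files directly via FFmpeg
whisper recording.mp4 --model base --device cpu --output_format txt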
Can I analyze Instagram Reels and TikToks directly in Claude Code?
Yes, if you're using Video Vision MCP. It supports YouTube, TikTok, Instagram Reels, X/Twitter videos, Vimeo, Twitch clips, and 1000+ platforms via yt-dlp. Paste the URL, and it handles the download and analysis.
How long does it take to process a 30-minute video?
Frame extraction takes seconds. If the video has captions (YouTube auto-generated subtitles, for example), those are grabbed instantly. If there are no captions, Whisper transcribes the audio locally on CPU — that takes a few minutes. You can use start_time and end_time parameters to analyze specific sections instead of the full 30 minutes.
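On the manual FFmpeg route, the equivalent is a seek-and-trim before extraction. The -ss and -t flags are standard FFmpeg; the times and path here are placeholders:

# Extract one frame per second from only the 2:00-2:30 window of a long video
ffmpeg -ss 00:02:00 -t 30 -i long-tutorial.mp4 -vf fps=1 -q:v 2 frame_%03d.jpg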
Will this work in Cursor, Cline, or Windsurf?
Yes. Video Vision MCP works with any MCP-compatible tool. Installation is identical — paste the JSON config into your MCP settings file.
What's the disk usage for processing videos?
Roughly 8 MB per minute of video for temp files. These are auto-cleaned when the MCP server stops, or you can call the cleanup tool manually anytime.
Can I just drop an MP4 into Claude Code without any tools?
No. The CLI rejects video files. You need either the manual FFmpeg workaround (extract frames, pass to Claude) or an automated tool like Video Vision MCP.
Does this cost anything?
Video Vision MCP is open source (MIT license). Free forever. No subscriptions, no API fees. The only cost is your Claude API usage (if you're using the API) or your existing Claude Pro/Team plan (if you're using the web interface or Claude Code with an account).
The Bottom Line
Claude Code can't natively watch MP4 files. But that doesn't mean you can't analyze video in your terminal. The frame-extraction workaround is a one-time setup that runs in seconds, and it works well for debugging UI bugs, analyzing screen recordings, or processing demo videos.
If you want the fully automated version — drop a URL, get frames + transcript + timestamps in one shot — Video Vision MCP is open source, works with any MCP-compatible AI, and requires zero configuration beyond the initial install.
The question isn't whether Claude Code can watch videos. It's whether you're still switching to a browser when you don't have to.