Visual Mode | summarize

By default, the app transcribes audio and summarizes the transcript. Visual mode skips transcription entirely and sends the video itself — including visible content and audio — to a video-capable model. Long videos are split into timestamped temporal chunks using the configured visual limits.

How to enable it:

CLI: Add the --visual flag to your command
Streamlit GUI: Toggle “Visual Mode” in the left sidebar

Why Visual Mode Exists

The idea for visual mode came from a community discussion in PR #13 and PR #14. Contributors pointed out a real gap: a significant amount of key information appears as on-screen text overlaid on the video, not spoken in the audio.

Use cases where visual mode shines:

Cooking reels — ingredients, quantities, and steps written on screen

Travel tips — locations, prices, and directions shown as text overlays

Meme videos — visual humor and context that speech-to-text misses entirely

Educational content — diagrams, code snippets, and formulas displayed visually

Any video where audio and visual together tell the full story

For long-form content like lectures and podcasts, audio-only summaries are usually more accurate. Visual mode is designed for content where what you see matters as much as what you hear.

Visual Input Modes

| Mode | Behavior | Use it for |
| --- | --- | --- |
| `base64` (default) | Downloads or reads the video, normalizes it if needed, splits long videos into time chunks, and sends each chunk as a base64 `video_url` payload. | Any OpenAI-compatible video endpoint: NVIDIA, OpenRouter, custom proxies, etc. |
| `url` | Sends the original YouTube URL directly to the provider without downloading or splitting. | Gemini models that can fetch YouTube URLs remotely. |

base64 is the default. URL mode is enabled per provider with visual-input-mode: url. Local files still use base64. Non-YouTube remote URLs are rejected in URL mode; use base64 mode for those sources.

Google Gemini's own OpenAI-compatible endpoint (generativelanguage.googleapis.com/v1beta/openai) does not accept video_url content parts. For Gemini YouTube URL passthrough, route the request through an OpenAI-compatible provider that supports video_url, such as OpenRouter.
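To make the two input modes concrete, here is a minimal sketch of what a `video_url` content part might look like in each case. The payload shape follows the convention used by common OpenAI-compatible video endpoints and is an assumption here, not a copy of the app's code; the function names `video_content_part` and `youtube_content_part` are hypothetical.

```python
import base64


def video_content_part(video_bytes: bytes, mime_type: str = "video/mp4") -> dict:
    """base64 mode: wrap raw video bytes in a data-URL `video_url` part.

    Assumption: the provider accepts an OpenAI-style content part of the
    form {"type": "video_url", "video_url": {"url": ...}}.
    """
    encoded = base64.b64encode(video_bytes).decode("ascii")
    return {
        "type": "video_url",
        "video_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


def youtube_content_part(url: str) -> dict:
    """url mode: pass the original YouTube link through unchanged."""
    return {"type": "video_url", "video_url": {"url": url}}
```

In base64 mode each chunk would produce one such part; in URL mode a single part carries the link and the provider fetches the video itself.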

Playback Speed

The speed setting (CLI --speed, YAML speed, API speed) can speed up or slow down video before it is sent to the model.

| Input mode | Behavior when `speed` ≠ `1.0` |
| --- | --- |
| `base64` (default) | Downloads or reads the video, applies speed via ffmpeg in `normalize_video`, then splits and encodes chunks. Video-only files (no audio track) are handled correctly. |
| `url` | Cannot preprocess a remote YouTube URL. The app falls back to base64 mode automatically: it downloads the video, applies `speed`, then continues the normal base64 pipeline. |
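As an illustration of the speed step, the sketch below builds an ffmpeg command that changes playback speed with the standard `setpts` (video) and `atempo` (audio) filters. This is not the app's actual `normalize_video` implementation; `build_speed_command` and its parameters are hypothetical, and the choice of filters is an assumption about how such a step is commonly done.

```python
def build_speed_command(src: str, dst: str, speed: float, has_audio: bool) -> list[str]:
    """Return an ffmpeg argument list that re-times `src` by `speed`.

    Hypothetical helper: the real app may use different filters or flags.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")

    # Dividing presentation timestamps by `speed` makes video play faster
    # for speed > 1 and slower for speed < 1.
    cmd = ["ffmpeg", "-y", "-i", src, "-filter:v", f"setpts=PTS/{speed}"]

    if has_audio:
        # atempo changes audio tempo without changing pitch. It accepts a
        # limited range per filter instance, so extreme speeds may need
        # several atempo filters chained together.
        cmd += ["-filter:a", f"atempo={speed}"]
    else:
        # Video-only input: skip the audio filter and emit no audio stream.
        cmd += ["-an"]

    cmd.append(dst)
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Branching on `has_audio` is one way to keep video-only files from failing on a missing audio stream.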

At the default speed (1.0), URL mode still sends the original YouTube URL directly without downloading.
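The rules for choosing an input mode can be summarized in a small decision function. This is a sketch of the behavior described above, not the app's code: `resolve_input_mode`, `is_youtube_url`, and the host list are hypothetical, and the order in which the checks run (rejecting non-YouTube URLs before considering `speed`) is an assumption.

```python
from urllib.parse import urlparse

# Assumed set of hosts treated as YouTube; the app's own check may differ.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def is_youtube_url(source: str) -> bool:
    host = (urlparse(source).hostname or "").lower()
    return host in YOUTUBE_HOSTS


def resolve_input_mode(
    source: str,
    is_local: bool,
    provider_mode: str = "base64",
    speed: float = 1.0,
) -> str:
    """Return the effective input mode, "base64" or "url"."""
    # base64 is the default, and local files always use it.
    if provider_mode != "url" or is_local:
        return "base64"
    # URL mode only passes YouTube links through.
    if not is_youtube_url(source):
        raise ValueError("URL mode only accepts YouTube URLs; use base64 mode")
    # A remote URL cannot be re-timed, so fall back to download + preprocess.
    if speed != 1.0:
        return "base64"
    return "url"
```

For example, `resolve_input_mode("https://youtu.be/abc", False, "url", speed=1.5)` returns `"base64"`, which mirrors the automatic fallback.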

# 2× visual playback in base64 mode
$ python -m summarizer --source "URL" --provider nvidia --visual --speed 2.0

# URL-mode provider with non-default speed (auto-fallback to download + preprocess)
$ python -m summarizer --source "URL" --provider openrouter-youtube --visual --speed 1.5

Set the default in summarizer.yaml:

defaults:
  speed: 1.5

Example Provider Configs

providers:
  openrouter-video:
    base_url: https://openrouter.ai/api/v1
    model: minimax/minimax-m3

  openrouter-youtube:
    base_url: https://openrouter.ai/api/v1
    model: google/gemini-3.1-flash-lite
    visual-input-mode: url

  nvidia:
    base_url: https://integrate.api.nvidia.com/v1
    model: nvidia/nemotron-3-nano-omni-30b-a3b-reasoning

CLI Examples

# YouTube video via NVIDIA visual mode
$ python -m summarizer --source "URL" --provider nvidia --visual

# Local file via NVIDIA visual mode
$ python -m summarizer --type "Local File" --source "./clip.mp4" --provider nvidia --visual

# Generic OpenRouter video model (base64 mode)
$ python -m summarizer --source "URL" \
>   --base-url "https://openrouter.ai/api/v1" \
>   --model "minimax/minimax-m3" \
>   --visual

# OpenRouter URL mode: sends the original YouTube URL directly
$ python -m summarizer --source "URL" \
>   --provider openrouter-youtube \
>   --visual

Visual Limits

Default conservative values:

Maximum duration per request: 120 seconds
Maximum file size: 100 MB
Supported formats: MP4, MPEG, MOV, WEBM

Long visual runs are split automatically when the provider supports chunking. For example, an 820-second video sent to NVIDIA becomes seven visual requests: six 120-second chunks and one 100-second chunk. The final output is timestamped by segment.
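The chunk arithmetic can be reproduced with a short sketch. The helper names `plan_chunks` and `format_timestamp` are hypothetical, and the handling of overlap is an assumption; only the 120-second default and the 820-second example come from the text above.

```python
def plan_chunks(
    duration: float,
    chunk_seconds: float = 120,
    overlap: float = 0,
) -> list[tuple[float, float]]:
    """Split [0, duration] into (start, end) windows of at most chunk_seconds."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    if not 0 <= overlap < chunk_seconds:
        raise ValueError("overlap must be >= 0 and smaller than chunk_seconds")

    duration = float(duration)
    step = chunk_seconds - overlap  # how far each new chunk advances
    chunks: list[tuple[float, float]] = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_seconds, duration)
        chunks.append((start, end))
        if end >= duration:
            break  # the last window already reaches the end of the video
        start += step
    return chunks


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS for segment labels."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


chunks = plan_chunks(820)
print(len(chunks))                      # 7
print(chunks[-1])                       # (720.0, 820.0)
print(format_timestamp(chunks[-1][0]))  # 12:00
```

With the defaults, the first six windows are 120 seconds long and the last covers the remaining 100 seconds, matching the example.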

Enable automatic compression in summarizer.yaml:

defaults:
  visual-compression: auto
  visual-chunk-seconds: auto
  visual-chunk-overlap-seconds: 0

Or override limits per-provider:

defaults:
  visual-max-size-mb: 200
  visual-max-duration-seconds: 300
  visual-chunk-seconds: 100

chunk-size remains text-only and is ignored in visual mode.