By default, the app transcribes audio and summarizes the transcript. Visual mode skips transcription entirely and sends the video itself — including visible content and audio — to a video-capable model. Long videos are split into timestamped temporal chunks using the configured visual limits.
How to enable it:
--visual flag to your commandThe idea for visual mode came from a community discussion in PR #13 and PR #14. Contributors pointed out a real gap: a significant amount of key information appears as on-screen text overlaid on the video, not spoken in the audio.
Use cases where visual mode shines:
- Cooking reels — ingredients, quantities, and steps written on screen
- Travel tips — locations, prices, and directions shown as text overlays
- Meme videos — visual humor and context that speech-to-text misses entirely
- Educational content — diagrams, code snippets, and formulas displayed visually
- Any video where audio and visual together tell the full story
For long-form content like lectures and podcasts, audio-only summaries are usually more accurate. Visual mode is designed for content where what you see matters as much as what you hear.
base64 is the default. URL mode is enabled per provider with visual-input-mode: url. Local files still use base64. Non-YouTube remote URLs are rejected in URL mode; use base64 mode for those sources.
Direct Google Gemini’s OpenAI-compatible endpoint (generativelanguage.googleapis.com/v1beta/openai) does not accept video_url content parts. For Gemini YouTube URL passthrough, use an OpenAI-compatible provider that supports video_url, such as OpenRouter.
Default conservative values:
Long visual runs are split automatically when the provider supports chunking. For example, an 820 second NVIDIA video becomes seven visual requests: six 120 second chunks and one 100 second chunk. The final output is timestamped by segment.
Enable automatic compression in summarizer.yaml:
Or override limits per-provider:
chunk-size remains text-only and is ignored in visual mode.