nano-banana-video-generation

The-Focus-AI/nano-banana-video-generation

AI & ML

About

Generate videos using Google Veo models via the nano-banana CLI. Use this skill when the user asks to create, generate, animate, or produce videos with AI.

SKILL.md

Nano Banana Video Generation

Generate videos using Google Veo 3.1 models via the nano-banana CLI.

Prerequisites

The GEMINI_API_KEY environment variable must be set
The CLI is available via npx @the-focus-ai/nano-banana

Quick Reference

# Generate a video from text
nano-banana --video "A sunset over mountains, slow dolly-in, cinematic lighting"

# Animate an existing image
nano-banana --video "The character slowly turns and smiles" --file portrait.png

# Cost-optimized development mode
nano-banana --video "Quick test scene" --video-fast --no-audio --resolution 720p

# Specify output path
nano-banana --video "A cat playing" --output cat-video.mp4

# Full control over settings
nano-banana --video "Dramatic reveal scene" \
  --duration 8 --aspect 16:9 --resolution 1080p --seed 42

Understanding Video Requests

Before generating, clarify these video-specific aspects:

Core Scene: What's the main action or subject?
Camera Movement: Static, dolly, pan, tracking, crane?
Style: Cinematic, documentary, commercial, casual?
Audio: Dialogue? Sound effects? Ambient sounds? Music?
Duration: 4, 6, or 8 seconds?
Orientation: Landscape (16:9) or portrait (9:16)?

The Five-Part Video Prompt Formula

Structure prompts with these elements:

[Camera Movement] + [Subject] + [Action] + [Environment] + [Audio/Style]

Example - Weak prompt:

"a person walking"

Example - Strong prompt:

"Slow dolly-in shot. A woman in her 30s, shoulder-length wavy black hair,
green jacket, walks confidently through a sunlit park. Golden hour lighting,
warm color grading. Ambient sounds: birds chirping, distant traffic.
Cinematic, aspirational mood. No subtitles, no text overlay."
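
As a rough illustration of the formula, the five parts can be kept as separate strings and joined just before the call. This is a minimal sketch: `build_prompt` is a hypothetical helper, not part of the nano-banana CLI, and the sentence layout it produces is only one reasonable choice.

```shell
# Hypothetical helper (not part of nano-banana): joins the five parts of the
# formula and appends the "no subtitles" suffix recommended in this guide.
build_prompt() {
  # $1 camera movement, $2 subject, $3 action, $4 environment, $5 audio/style
  printf '%s. %s %s. %s. %s. No subtitles, no text overlay, no captions.\n' \
    "$1" "$2" "$3" "$4" "$5"
}

prompt=$(build_prompt \
  "Slow dolly-in shot" \
  "A woman in her 30s in a green jacket" \
  "walks confidently through a sunlit park" \
  "Golden hour lighting, warm color grading" \
  "Ambient sounds: birds chirping, distant traffic")

# Review the assembled prompt before spending money on a render.
printf '%s\n' "$prompt"

# Then pass it to the CLI:
# nano-banana --video "$prompt"
```

Keeping the parts separate makes it easier to change one element (for example, the camera movement) between iterations while the rest of the prompt stays fixed.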

Workflow

Step 1: Craft the Prompt

See prompting-guide.md for comprehensive guidance.

Key principles:

Start with camera movement (dolly, pan, static, tracking)
Describe subject in detail (appearance, wardrobe, expression)
Specify action with timing cues
Include lighting and environment
Add audio design (dialogue, SFX, ambient)
Always end with: "No subtitles, no text overlay, no captions"

Step 2: Consider Cost

Video generation is significantly more expensive than image generation:

| Model | Cost per Second | 8-Second Video |
|---|---|---|
| `veo-3.1-generate-001` | $0.40 | $3.20 |
| `veo-3.1-fast-generate-001` | $0.15 | $1.20 |

Development workflow:

Iterate with --video-fast --no-audio (cheapest)
Test with --video-fast (add audio when needed)
Final render with default model (premium quality)
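
The per-clip figures follow directly from rate times duration. The sketch below reproduces that arithmetic; `estimate_cost` and the tier names `premium` and `fast` are hypothetical, and the rates are simply copied from the cost table in this section, so they may not match current pricing.

```shell
# Hypothetical helper: estimates clip cost from the per-second rates listed
# in this guide. Works in whole cents to stay within integer shell arithmetic.
estimate_cost() {
  # $1 = tier ("premium" or "fast"), $2 = duration in seconds
  case "$1" in
    premium) rate=40 ;;   # veo-3.1-generate-001: $0.40 per second
    fast)    rate=15 ;;   # veo-3.1-fast-generate-001: $0.15 per second
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
  cents=$((rate * $2))
  printf '$%d.%02d\n' $((cents / 100)) $((cents % 100))
}

estimate_cost premium 8   # prints $3.20
estimate_cost fast 8      # prints $1.20
estimate_cost fast 4      # prints $0.60
```

Shorter durations reduce cost proportionally, which is one more reason to draft quick actions at 4 or 6 seconds.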

Step 3: Generate

nano-banana --video "your detailed prompt here"

Generation typically takes 2-4 minutes. Progress is shown in the terminal.

Step 4: Iterate

If the result isn't right:

Refine camera movement - Be more explicit (e.g., "slow dolly-in over 8 seconds")
Add negative guidance - Describe what to avoid
Simplify - Focus on one main action per clip
Try a different duration - 4s or 6s may work better for quick actions

Commands

Text-to-Video

nano-banana --video "<prompt>"

Image-to-Video (Animation)

nano-banana --video "<motion description>" --file <input-image>

The motion description should describe how the image should animate:

"The character slowly turns their head and smiles"
"The scene comes alive with subtle wind movement"
"Zoom out to reveal the full landscape"

Options

| Option | Description | Default |
|---|---|---|
| `--video` | Enable video mode | (required) |
| `--video-model <name>` | Veo model to use | veo-3.1-generate-001 |
| `--video-fast` | Use fast/cheap model | (premium model) |
| `--duration <sec>` | 4, 6, or 8 seconds | 8 |
| `--aspect <ratio>` | 16:9 or 9:16 | 16:9 |
| `--resolution <res>` | 720p, 1080p, or 4K | 1080p |
| `--audio` | Generate audio | (enabled) |
| `--no-audio` | Disable audio | - |
| `--seed <number>` | Reproducibility seed | (random) |
| `--output <file>` | Output path | output/video-.mp4 |
| `--file <image>` | Input image to animate | - |
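
Because each render costs money and takes minutes, it can help to reject invalid option values before calling the CLI. The following is a sketch under assumptions: `validate_video_opts` is a hypothetical wrapper-side check, and the accepted values are taken from the options table here rather than from the CLI's own validation, which may differ (for example, in how it spells `4K`).

```shell
# Hypothetical pre-flight check; accepted values mirror the options table.
validate_video_opts() {
  # $1 = duration, $2 = aspect ratio, $3 = resolution
  case "$1" in
    4|6|8) ;;
    *) echo "duration must be 4, 6, or 8 (got: $1)" >&2; return 1 ;;
  esac
  case "$2" in
    16:9|9:16) ;;
    *) echo "aspect must be 16:9 or 9:16 (got: $2)" >&2; return 1 ;;
  esac
  case "$3" in
    720p|1080p|4K) ;;
    *) echo "resolution must be 720p, 1080p, or 4K (got: $3)" >&2; return 1 ;;
  esac
  return 0
}

if validate_video_opts 8 16:9 1080p; then
  echo "options look valid"
  # nano-banana --video "Dramatic reveal scene" \
  #   --duration 8 --aspect 16:9 --resolution 1080p
fi
```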

Camera Movement Reference

Use these terms for precise camera control:

| Movement | Description | Example Prompt |
|---|---|---|
| Static | No movement | "Static shot on tripod. A coffee cup steaming..." |
| Pan | Horizontal rotation | "Slow pan left across the city skyline..." |
| Tilt | Vertical rotation | "Tilt down from face to hands..." |
| Dolly In | Camera moves closer | "Slow dolly-in from medium to close-up..." |
| Dolly Out | Camera moves away | "Dolly-out revealing the vast landscape..." |
| Tracking | Parallel to subject | "Tracking shot following character walking..." |
| Crane | Sweeping vertical | "Crane shot ascending from ground level..." |
| Handheld | Realistic shake | "Handheld camera, documentary style..." |

Important: Use ONE primary movement per shot. Don't combine multiple movements.

Dialogue Formatting

For spoken dialogue, use the colon format:

Character description says: "Exact dialogue here."

Example:

"A friendly young woman, excited and cheerful, says: 'Welcome to our store!'
Standing in bright retail environment. Natural lip-sync. No subtitles."

Guidelines:

Keep dialogue to 6-12 words for 8 seconds
Describe the speaker's tone and emotion
Always add "No subtitles, no text overlay, no captions"
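
The word-count guideline is easy to check mechanically before a render. This sketch assumes the upper end of the 6-12 range is the limit worth enforcing, since shorter lines still fit in an 8-second clip; `dialogue_words` and `check_dialogue` are hypothetical helpers.

```shell
# Hypothetical helpers: count the words in a line of dialogue and flag lines
# that exceed the 12-word upper bound suggested for an 8-second clip.
dialogue_words() {
  printf '%s\n' "$1" | wc -w | tr -d '[:space:]'
}

check_dialogue() {
  words=$(dialogue_words "$1")
  if [ "$words" -gt 12 ]; then
    echo "too long: $words words (guideline: 6-12 for 8 seconds)"
    return 1
  fi
  echo "ok: $words words"
}

check_dialogue "Welcome to our store, everything is half price today!"
# prints: ok: 9 words
```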

Audio Design

Structure audio in layers:

Dialogue (highest priority) - Always clear
Sound Effects - Specific, timed actions
Ambient - 3-5 background elements max
Music - Lowest priority, "ducks under dialogue"

Example:

"Sound effects: Door closing at 2-second mark, footsteps on wood.
Ambient sounds: Quiet office hum, distant typing.
Background music: Soft jazz, low volume, ducks under dialogue."

Best Practices

For Better Results

Front-load important info - Camera, subject, action first
Use cinematic terms - "35mm lens", "shallow depth of field", "golden hour"
Be specific about lighting - "Soft window light from left", not just "good lighting"
Describe the mood - "Intimate", "epic", "suspenseful", "uplifting"
Include negative guidance - What to avoid

For Image-to-Video

Match the image - Describe motion that fits what's in the image
Start subtle - Small movements work better than dramatic changes
Keep lighting consistent - Don't describe lighting changes that differ from the image

For Consistency Across Shots

When creating multiple related videos:

Create a character description and reuse it exactly
Keep lighting style consistent
Use the same camera movement style family
Use --seed for more reproducible results
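
One way to apply these points is to define the character description and seed once and build every shot's command from them. The sketch below only prints the commands for review (a dry run); `shot_command`, the variable names, and the output file names are hypothetical, and it uses only the `--video`, `--seed`, and `--output` flags documented in this guide.

```shell
# Shared across all shots so the wording never drifts between clips.
character="A woman in her 30s, shoulder-length wavy black hair, green jacket,"
suffix="No subtitles, no text overlay, no captions."
seed=42

# Hypothetical helper: prints the nano-banana command for one shot.
shot_command() {
  # $1 = camera movement, $2 = action, $3 = output file
  printf 'nano-banana --video "%s. %s %s. %s" --seed %s --output %s\n' \
    "$1" "$character" "$2" "$suffix" "$seed" "$3"
}

shot_command "Static shot on tripod" "sits on a park bench reading" shot-01.mp4
shot_command "Slow dolly-in" "looks up from her book and smiles" shot-02.mp4
```

Printing first and running second keeps an accidental typo in the shared description from being paid for across every shot.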

Troubleshooting

"Video generation timeout"

Generation can take 2-4 minutes
If persistent, try simpler prompts
Use --video-fast for faster generation

Poor quality or wrong content

Add more specific descriptions
Include negative guidance
Try the premium model instead of fast

Subtitles appearing in video

Always include "No subtitles, no text overlay, no captions" in prompt
Veo tends to add subtitles, likely because its training data included subtitled videos

Audio doesn't match video

Be more specific about when sounds occur
Use "Sound effect: X at Y-second mark"
Simplify audio layers (fewer elements)

Safety filter rejection

Avoid violence, weapons, explicit content
Rephrase ambiguous terms
Try more generic descriptions

Cost Optimization

# Development (cheapest): ~$1.20 per 8-second video
nano-banana --video "test prompt" --video-fast --no-audio --resolution 720p

# Testing with audio: ~$1.20 per 8-second video
nano-banana --video "test prompt" --video-fast

# Production quality: ~$3.20 per 8-second video
nano-banana --video "final prompt" --resolution 1080p

Example Prompts

See the examples/ directory for complete prompt examples:

cinematic-shots.md - Camera movements
dialogue-and-audio.md - Speech and sound
image-to-video.md - Animating images

Environment Setup

Ensure GEMINI_API_KEY is set:

export GEMINI_API_KEY="your-api-key-here"

Or create a .env file in your project:

GEMINI_API_KEY=your-api-key-here
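
A script that wraps the CLI can fail fast when the key is missing. This is a minimal sketch with a hypothetical `require_api_key` helper. Note that it only sees variables exported in the current shell, so a key stored solely in a .env file will not be detected by this check.

```shell
# Hypothetical pre-flight check: returns non-zero if the key is not exported.
require_api_key() {
  if [ -z "${GEMINI_API_KEY:-}" ]; then
    echo "GEMINI_API_KEY is not set; export it before generating" >&2
    return 1
  fi
  return 0
}

# Example: only call the CLI when the key is present.
# require_api_key && nano-banana --video "A cat playing"
```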
