skill-kokoro-tts-tool

dnvriend/skill-kokoro-tts-tool

AI & ML

About

Local text-to-speech using Kokoro TTS

SKILL.md

When to use

When you need to convert text to speech locally (no API keys)
When you need to generate audio from long documents (books, articles)
When you need seamless audiobook rendering without pop artifacts
When you need fast offline TTS rendering (20-50x real-time)

kokoro-tts-tool Skill

Purpose

This skill provides access to the kokoro-tts-tool CLI for local text-to-speech synthesis using the Kokoro-82M model. It runs entirely on-device with ONNX Runtime and is optimized for Apple Silicon.

When to Use This Skill

Use this skill when:

Converting text to speech without cloud APIs
Generating audio from markdown/text documents
Creating audiobooks from long-form content
Needing 60+ voices across 8 languages

Do NOT use this skill for:

Cloud-based TTS services
Real-time voice conversion
Speech-to-text (transcription)

CLI Tool: kokoro-tts-tool

Local text-to-speech CLI using Kokoro-82M (82 million parameters).

Installation

# Clone and install
git clone https://github.com/dnvriend/kokoro-tts-tool.git
cd kokoro-tts-tool
uv tool install .

Prerequisites

Python 3.14+
uv package manager
Apple Silicon Mac (recommended)

Quick Start

# Initialize (downloads ~350MB models)
kokoro-tts-tool init

# Synthesize text to speakers
kokoro-tts-tool synthesize "Hello world"

# Save to file
kokoro-tts-tool synthesize "Hello" --output speech.wav

# Stream a document
kokoro-tts-tool infinite --input book.md

Progressive Disclosure

📖 Core Commands

init - Download TTS Models

Downloads the Kokoro ONNX model (~300MB) and voice embeddings (~50MB).

Usage:

kokoro-tts-tool init [OPTIONS]

Options:

--force, -f: Re-download models even if they exist

Examples:

# Download models (skips if already present)
kokoro-tts-tool init

# Force re-download
kokoro-tts-tool init --force

synthesize - Convert Text to Speech

Synthesizes text using the Kokoro TTS model. Audio can be played through speakers or saved to file.

Usage:

kokoro-tts-tool synthesize [TEXT] [OPTIONS]

Arguments:

TEXT: Text to synthesize (optional if using --stdin)

Options:

--stdin, -s: Read text from stdin
--voice, -v VALUE: Voice ID (default: af_heart)
--output, -o PATH: Save to WAV file
--speed FLOAT: Speech speed 0.5-2.0 (default: 1.0)
--silence INT: Trailing silence in ms (default: 200)
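
Since --speed is documented as accepting 0.5-2.0, a script can check a value before invoking the tool. The following is a minimal sketch; `valid_speed` is a hypothetical helper and not part of kokoro-tts-tool, and it assumes the range is inclusive at both ends.

```shell
# Hypothetical helper: succeeds only if $1 is a plain decimal number
# in the documented range 0.5-2.0 (assumed inclusive).
valid_speed() {
    awk -v s="$1" 'BEGIN {
        # Reject anything that is not a plain decimal number
        if (s !~ /^[0-9]+(\.[0-9]+)?$/) exit 1
        exit !(s + 0 >= 0.5 && s + 0 <= 2.0)
    }'
}

speed=1.5
if valid_speed "$speed"; then
    echo "speed $speed is within range"
else
    echo "speed $speed is outside 0.5-2.0" >&2
fi
```

Checking up front gives a clearer error message in batch scripts than waiting for the tool to reject the value.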

Examples:

# Play text with default voice
kokoro-tts-tool synthesize "Hello world"

# Use different voice
kokoro-tts-tool synthesize "Hello" --voice am_adam

# Save to file
kokoro-tts-tool synthesize "Hello" --output speech.wav

# Read from stdin
echo "Hello world" | kokoro-tts-tool synthesize --stdin

# Adjust speed
kokoro-tts-tool synthesize "Hello" --speed 1.5

# Multiple options
cat article.txt | kokoro-tts-tool synthesize --stdin \
    --voice bf_emma \
    --output article.wav \
    --speed 0.9

Output: Audio played through speakers (default) or saved as WAV file (24kHz, mono, 16-bit).

infinite - Stream Long Documents

Reads markdown or plain text, splits it into chunks of roughly --chunk-size words, and either streams the audio to the speakers or renders it to a file.

Usage:

kokoro-tts-tool infinite [OPTIONS]

Options:

--input, -i PATH: Input text/markdown file
--stdin, -s: Read text from stdin
--output, -o PATH: Save to WAV file (fast offline mode)
--voice VALUE: Voice ID (default: af_heart)
--speed FLOAT: Speech speed 0.5-2.0 (default: 1.0)
--chunk-size INT: Target words per chunk 50-1000 (default: 200)
--pause INT: Pause between chunks in ms 0-2000 (default: 150)
--no-markdown: Treat input as plain text
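
To get a feel for how --chunk-size and --pause interact, the sketch below estimates the number of chunks and the total inserted pause for a given word count. It assumes every chunk holds exactly the target number of words and that one pause falls between each pair of adjacent chunks; the tool's actual splitting may differ. `estimate_chunks` is a hypothetical helper, not part of the CLI.

```shell
# Rough estimate under the assumptions stated above.
# Usage: estimate_chunks WORDS [CHUNK_SIZE] [PAUSE_MS]
estimate_chunks() (
    words=$1
    chunk_size=${2:-200}   # documented default
    pause_ms=${3:-150}     # documented default
    chunks=$(( (words + chunk_size - 1) / chunk_size ))   # ceiling division
    pauses=0
    if [ "$chunks" -gt 0 ]; then
        pauses=$(( chunks - 1 ))
    fi
    echo "chunks=$chunks total_pause_ms=$(( pauses * pause_ms ))"
)

estimate_chunks 10000           # defaults: 200 words, 150 ms
estimate_chunks 10000 500 300   # larger chunks, longer pauses
```

For a real file, the word count can come from `wc -w < book.md | tr -d ' '`.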

Examples:

# Stream to speakers
kokoro-tts-tool infinite --input book.md

# Render to WAV (fast, roughly 1-3 min for 1 hr of audio at 20-50x real-time)
kokoro-tts-tool infinite --input book.md --output audiobook.wav

# Pipe from stdin
cat chapter.md | kokoro-tts-tool infinite --stdin

# With custom voice and speed
kokoro-tts-tool infinite --input notes.md \
    --voice am_adam \
    --speed 1.2

# Render audiobook with narrator voice
kokoro-tts-tool infinite --input book.md \
    --output book.wav \
    --voice bm_george \
    --speed 0.95

# Longer pauses for studying (chunk size shown is the default)
kokoro-tts-tool infinite --input study.md \
    --chunk-size 200 \
    --pause 600

Output:

Speaker mode: Real-time playback, seamless audio
File mode: Fast offline rendering (20-50x real-time on M4)
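
The real-time factor translates directly into render time: wall-clock seconds are the audio duration divided by the factor. A minimal sketch of that arithmetic follows; `render_seconds` is a hypothetical helper, and actual throughput depends on the hardware.

```shell
# Wall-clock seconds needed to render AUDIO_SECONDS of audio at a given
# real-time factor (integer math, rounded up).
render_seconds() (
    audio_seconds=$1
    factor=$2
    echo $(( (audio_seconds + factor - 1) / factor ))
)

render_seconds 3600 20   # one hour of audio at 20x
render_seconds 3600 50   # one hour of audio at 50x
```

At 20x, one hour of audio takes 180 seconds to render; at 50x, it takes 72 seconds.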

list-voices - List Available Voices

Lists voice information including ID, name, gender, accent, quality grade, and description.

Usage:

kokoro-tts-tool list-voices [OPTIONS]

Options:

--language, -l VALUE: Filter by language (English, Japanese, etc.)
--gender, -g VALUE: Filter by gender (Male, Female)
--json: Output as JSON for scripting

Examples:

# List all voices
kokoro-tts-tool list-voices

# Filter by language
kokoro-tts-tool list-voices --language English

# Filter by gender
kokoro-tts-tool list-voices --gender Female

# Combined filters
kokoro-tts-tool list-voices --language English --gender Male

# JSON output for scripting
kokoro-tts-tool list-voices --json

Voice ID Format:

Pattern: [language][gender]_[name]
First letter: language (a=American, b=British, j=Japanese, etc.)
Second letter: gender (f=Female, m=Male)
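
The pattern above can be decoded mechanically. The sketch below is a hypothetical helper (`describe_voice` is not part of the CLI) that maps only the language letters listed here; any other letter is reported as-is rather than guessed.

```shell
# Decode a voice ID of the form [language][gender]_[name].
describe_voice() (
    id=$1
    case $id in
        [a-z][fm]_?*) ;;   # looks like a voice ID
        *) echo "unrecognized voice ID: $id" >&2; exit 1 ;;
    esac
    prefix=${id%%_*}       # e.g. "af"
    name=${id#*_}          # e.g. "heart"
    case $prefix in
        a?) language=American ;;
        b?) language=British ;;
        j?) language=Japanese ;;
        *)  language="other (prefix ${prefix%?})" ;;
    esac
    case $prefix in
        ?f) gender=Female ;;
        ?m) gender=Male ;;
    esac
    echo "$id: language=$language gender=$gender name=$name"
)

describe_voice af_heart
describe_voice bm_george
```

For the authoritative list of voices and their metadata, use `kokoro-tts-tool list-voices --json`.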

Quality Grades:

A/A-: Highest quality (af_heart, af_bella, am_adam)
B+/B: Good quality
B-: Acceptable quality

info - Display Configuration

Shows information about the Kokoro TTS installation.

Usage:

kokoro-tts-tool info

Examples:

kokoro-tts-tool info

Output:

Model status (Ready/Not downloaded)
Model file locations
Default settings
Supported languages

completion - Shell Completion

Generate shell completion scripts for bash, zsh, or fish.

Usage:

kokoro-tts-tool completion SHELL

Arguments:

SHELL: Shell type (bash, zsh, fish)

Examples:

# Bash (add to ~/.bashrc)
eval "$(kokoro-tts-tool completion bash)"

# Zsh (add to ~/.zshrc)
eval "$(kokoro-tts-tool completion zsh)"

# Fish
kokoro-tts-tool completion fish > ~/.config/fish/completions/kokoro-tts-tool.fish

⚙️ Advanced Features

Multi-Level Verbosity Logging

Control logging detail with progressive verbosity levels. All logs output to stderr.

Logging Levels:

| Flag | Level | Output | Use Case |
|------|-------|--------|----------|
| (none) | WARNING | Errors and warnings only | Production, quiet mode |
| `-v` | INFO | + High-level operations | Normal debugging |
| `-vv` | DEBUG | + Detailed info, full tracebacks | Development |
| `-vvv` | TRACE | + Library internals | Deep debugging |

Examples:

# INFO level
kokoro-tts-tool -v synthesize "Hello"

# DEBUG level
kokoro-tts-tool -vv infinite --input book.md

# TRACE level
kokoro-tts-tool -vvv synthesize "Hello"

Pipeline Composition

Compose commands with Unix pipes for workflows.

Examples:

# Get voice IDs as JSON and filter
kokoro-tts-tool list-voices --json | jq '.[].id'

# Read from another command
cat document.md | kokoro-tts-tool infinite --stdin

# Chain with file processing
find . -name "*.md" -exec cat {} \; | kokoro-tts-tool infinite --stdin
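
Concatenating files into one stream produces a single audio output. To render one WAV per document instead, a loop over the inputs may be more convenient. The sketch below uses only the `infinite --input` and `--output` options documented above; `wav_name_for` and `render_all` are hypothetical helper names, and output files land in the current directory.

```shell
# Derive an output name from an input path: chapters/ch01.md -> ch01.wav
wav_name_for() (
    base=$(basename "$1")
    echo "${base%.*}.wav"
)

# Render each input file to its own WAV (requires kokoro-tts-tool on PATH).
render_all() {
    for src in "$@"; do
        kokoro-tts-tool infinite --input "$src" --output "$(wav_name_for "$src")"
    done
}

wav_name_for chapters/ch01.md
# Example call: render_all chapters/*.md
```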

🔧 Troubleshooting

Common Issues

Issue: Command not found

# Verify installation
kokoro-tts-tool --version

# Reinstall if needed
cd kokoro-tts-tool
uv tool install . --reinstall

Issue: Models not downloaded

# Initialize models
kokoro-tts-tool init

# Force re-download
kokoro-tts-tool init --force

Issue: Audio not playing

Check system volume
Try saving to file: --output test.wav
Check with verbose: -vv

Issue: Voice not found

# List available voices
kokoro-tts-tool list-voices

# Check voice ID format
kokoro-tts-tool list-voices --json | jq '.[].id'

Getting Help

# General help
kokoro-tts-tool --help

# Command-specific help
kokoro-tts-tool synthesize --help
kokoro-tts-tool infinite --help

Exit Codes

0: Success
1: Error (validation, runtime, or unexpected)
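
Because every failure maps to a non-zero exit code, scripts can branch on the status in the usual way. The sketch below wraps an arbitrary command; `run_checked` is a hypothetical helper, and `true` and `false` stand in for the tool so the example runs without it installed.

```shell
# Run any command; report failure on stderr and propagate its exit status.
run_checked() {
    if "$@"; then
        echo "ok: $1"
    else
        status=$?
        echo "failed (exit $status): $*" >&2
        return "$status"
    fi
}

# Stand-ins for the real command:
run_checked true
run_checked false || echo "handled failure"

# Real usage would look like:
# run_checked kokoro-tts-tool synthesize "Hello" --output hello.wav
```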

Output Formats

Default Output:

Human-readable formatted output
Audio played through speakers

File Output (--output):

WAV format (24kHz, mono, 16-bit)
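
The stated format implies a predictable file size, which helps when planning disk space for long renders. The sketch below computes the uncompressed PCM payload and ignores the small WAV header; `wav_bytes` is a hypothetical helper.

```shell
# PCM bytes for a duration: sample rate * (bits / 8) * channels * seconds.
# 24000 Hz, 16-bit, mono, as stated above; the WAV header is not counted.
wav_bytes() (
    seconds=$1
    echo $(( 24000 * 16 / 8 * 1 * seconds ))
)

wav_bytes 1      # one second of audio
wav_bytes 3600   # one hour of audio
```

One second of audio is 48,000 bytes, so an hour comes to 172,800,000 bytes (about 165 MiB).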

JSON Output (--json on list-voices):

Machine-readable voice data
Suited to pipelines and scripted processing (for example with jq)

Best Practices

Initialize first: Run kokoro-tts-tool init before synthesis
Use appropriate voices: Match voice to content (am_adam for audiobooks, bf_emma for education)
Leverage infinite for documents: Better than synthesize for long content
Use file output for production: --output for consistent results
Check voice quality grades: A/A- voices recommended for production

Resources

GitHub: https://github.com/dnvriend/kokoro-tts-tool
Kokoro-82M Model: https://huggingface.co/hexgrad/Kokoro-82M
kokoro-onnx: https://github.com/thewh1teagle/kokoro-onnx
