generating-tts

warrenzhu050413/generating-tts

AI & ML

1 installs

About

SKILL.md

generating-tts

warrenzhu050413/generating-tts

AI & ML

1 installs

About

Generate and play multilingual text-to-speech audio using mlx-audio with Kokoro model. Use when user asks to hear pronunciation, speak text aloud, or wants audio for language learning.

SKILL.md

Text-to-Speech Generation Skill

When to Use

Trigger: User asks to hear pronunciation, say something aloud, or wants audio for language learning.

Supported Languages

Code	Language	Notes
`a`	American English	Default
`b`	British English
`e`	Spanish
`f`	French
`h`	Hindi
`i`	Italian
`j`	Japanese	Requires `pip install misaki[ja]`
`p`	Portuguese (Brazilian)
`z`	Mandarin Chinese	Requires `pip install misaki[zh]`

Available Voices

Pattern: [language][gender]_[name] (e.g., af_heart = American Female Heart)

American Female:

af_heart - Warm, friendly ⭐ Default
af_nova - Clear, precise (best for pronunciation)
af_bella - Expressive
af_sky - Bright
af_sarah - Gentle

American Male:

am_adam - Strong
am_michael - Authoritative (great for language learning)
am_eric - Friendly

British Female:

bf_emma - Elegant
bf_isabella - Sophisticated

British Male:

bm_george - Distinguished
bm_lewis - Professional

Speed Control

Range: 0.5x to 2.0x (default 1.0x)

0.5-0.8x: Slow, for difficult pronunciation or beginners
1.0x: Natural pace
1.2-1.5x: Faster, for advanced learners
1.8-2.0x: Very fast, speed listening

Prerequisites & Setup

Required Installation

Before using TTS, install mlx-audio:

pip install mlx-audio

Optional Language Support

For Japanese and Chinese, install additional components:

pip install misaki[ja]    # For Japanese
pip install misaki[zh]    # For Chinese

Server Startup

The generate_tts function automatically starts the mlx-audio server if it's not running, but you can also start it manually:

# Start server on port 9876 (runs in background)
mlx_audio.server --port 9876 &

# Or start with log output to monitor
mlx_audio.server --port 9876 > /tmp/mlx_audio_server_9876.log 2>&1 &

First run startup time: 6-10 seconds (model loads and caches) Subsequent calls: 1-2 seconds per audio generation

Verify Server is Running

# Check if server is responding
curl http://127.0.0.1:9876/languages

# If you get JSON response, server is ready

Implementation

Bash Function

generate_tts() {
    local text="$1"
    local voice="${2:-af_heart}"
    local lang_code="${3:-a}"
    local speed="${4:-1.0}"
    local server_url="http://127.0.0.1:9876"

    # Validate
    [ -z "$text" ] && { echo "❌ No text provided"; return 1; }

    case "$lang_code" in
        a|b|e|f|h|i|j|p|z) ;;
        *) echo "❌ Invalid language code: $lang_code"; return 1 ;;
    esac

    # Language names
    declare -A lang_names=([a]="American English" [b]="British English" [e]="Spanish" [f]="French" [h]="Hindi" [i]="Italian" [j]="Japanese" [p]="Portuguese" [z]="Mandarin Chinese")

    # Start server if needed
    if ! curl -s "$server_url/languages" > /dev/null 2>&1; then
        echo "🚀 Starting mlx-audio server..."
        nohup mlx_audio.server --port 9876 > /tmp/mlx_audio_server_9876.log 2>&1 &

        for i in {1..20}; do
            curl -s "$server_url/languages" > /dev/null 2>&1 && { echo "✅ Server ready"; break; }
            sleep 0.5
        done

        curl -s "$server_url/languages" > /dev/null 2>&1 || { echo "❌ Server failed. Check: tail -f /tmp/mlx_audio_server_9876.log"; return 1; }
    fi

    # Generate audio
    echo "🎙️  Generating ${lang_names[$lang_code]} audio..."
    local response=$(curl -s -X POST "$server_url/tts" \
      -d "text=$text" -d "voice=$voice" -d "speed=$speed" \
      -d "language=$lang_code" -d "model=mlx-community/Kokoro-82M-4bit")

    # Extract filename
    echo "$response" | grep -q '"error"' && { echo "❌ TTS failed"; return 1; }
    local filename=$(echo "$response" | python3 -c "import json, sys; print(json.load(sys.stdin)['filename'])" 2>/dev/null)
    [ -z "$filename" ] && { echo "❌ No audio filename"; return 1; }

    # Download and play
    local output="/tmp/tts_$(date +%s).wav"
    curl -s "$server_url/audio/$filename" -o "$output"

    echo ""
    echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
    echo "🎤 ${voice} says (${lang_names[$lang_code]}, ${speed}x):"
    echo "   \"$text\""
    echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
    echo ""

    echo "▶️  Playing audio..."
    afplay "$output"
    echo "✅ Playback complete"
    rm "$output"
}

export -f generate_tts

Usage

generate_tts "text" "voice" "lang_code" "speed"

# Examples
generate_tts "Hello"                                    # Default (English, af_heart, 1.0x)
generate_tts "Hola" "am_michael" "e" "0.8"             # Spanish, slower
generate_tts "Bonjour" "bf_emma" "f" "1.0"             # French, British voice
generate_tts "Ciao" "af_bella" "i" "1.0"               # Italian

Workflow

When user requests TTS:

Extract text to speak
Determine language from context
Choose voice:
- Default: af_heart
- Clear pronunciation: af_nova
- Language learning: am_michael
Set speed:
- Beginners: 0.8x
- Normal: 1.0x
- Advanced: 1.2x
Call generate_tts with parameters

Voice Selection Guide

Use Case	Voice	Reason
General	`af_heart`	Warm, approachable
Clear pronunciation	`af_nova`	Precise
Language learning	`am_michael`	Authoritative
Professional	`bf_emma`, `bm_george`	Distinguished

Language	Best Voices	Speed
Spanish/Portuguese/Chinese	`am_michael`, `af_heart`	0.8-1.0x
French	`af_nova`, `bf_emma`	0.8x
Italian	`af_bella`, `am_adam`	1.0x
Japanese	`af_nova`, `af_heart`	1.0x

Best Practices

DO:

Show text before playing
Use appropriate speed for context
Keep text moderate length (1-3 sentences)
Generate only when user requests

DON'T:

Auto-generate without request
Use very long text (split into chunks)
Mix languages in one call

Example Interactions

Pronunciation:

User: "How do you pronounce 'entrepreneur'?"
Claude: "The word 'entrepreneur' is pronounced: /ˌɑːntrəprəˈnɜːr/"
[Calls: generate_tts "entrepreneur" "af_nova" "a" "0.8"]

Language Learning:

User: "How do you say 'good morning' in Spanish?"
Claude: "In Spanish: **Buenos días** (buenos = good, días = days/morning)"
[Calls: generate_tts "Buenos días" "am_michael" "e" "0.8"]

Troubleshooting

Server Won't Start

1. Check if mlx-audio is installed:

python3 -c "import mlx_audio; print('✅ mlx-audio installed')"

2. If not installed, install it:

pip install mlx-audio

3. Check if port 9876 is in use:

lsof -i :9876                    # List what's using the port
kill $(lsof -t -i:9876)          # Kill existing process

4. Start server manually and monitor logs:

mlx_audio.server --port 9876 > /tmp/mlx_audio_server_9876.log 2>&1 &
tail -f /tmp/mlx_audio_server_9876.log  # Watch startup logs

5. If server still fails to start:

Check available disk space (model cache requires ~2GB)
Verify Python 3.9+ is installed
Try on a machine with better hardware (requires GPU/CPU acceleration)

TTS Generation Fails

Server is running but audio generation fails:

Check server logs: tail -f /tmp/mlx_audio_server_9876.log
Verify curl can reach server: curl http://127.0.0.1:9876/languages
Check if text is valid (not empty, properly quoted)

Audio Not Playing

File generated but won't play:

# Test afplay works on macOS
afplay /System/Library/Sounds/Glass.aiff

# Check if audio files are being created
ls -lh /tmp/tts_*.wav

Missing Language Dependencies

Install optional language support if needed:

pip install misaki[ja]           # For Japanese
pip install misaki[zh]           # For Chinese

Performance Notes

Typical timing:

Server already running: ~1-2 seconds per call
Server cold start: ~6-10 seconds (model loads once)
First generation: ~3-5 seconds (model cached in memory)
Subsequent calls: ~1-2 seconds (model cached)

Memory usage:

Server baseline: ~200MB
Running model: ~2GB RAM
Cache: ~2GB disk

Optimization tips:

Start server once at session beginning if doing multiple TTS calls
Keep text moderate length (1-3 sentences) for faster generation
Don't stop server between calls - it stays ready in background

About

SKILL.md

About

Generate and play multilingual text-to-speech audio using mlx-audio with Kokoro model. Use when user asks to hear pronunciation, speak text aloud, or wants audio for language learning.

SKILL.md

Text-to-Speech Generation Skill

When to Use

Trigger: User asks to hear pronunciation, say something aloud, or wants audio for language learning.

Supported Languages

Code	Language	Notes
`a`	American English	Default
`b`	British English
`e`	Spanish
`f`	French
`h`	Hindi
`i`	Italian
`j`	Japanese	Requires `pip install misaki[ja]`
`p`	Portuguese (Brazilian)
`z`	Mandarin Chinese	Requires `pip install misaki[zh]`

Available Voices

Pattern: [language][gender]_[name] (e.g., af_heart = American Female Heart)

American Female:

af_heart - Warm, friendly ⭐ Default
af_nova - Clear, precise (best for pronunciation)
af_bella - Expressive
af_sky - Bright
af_sarah - Gentle

American Male:

am_adam - Strong
am_michael - Authoritative (great for language learning)
am_eric - Friendly

British Female:

bf_emma - Elegant
bf_isabella - Sophisticated

British Male:

bm_george - Distinguished
bm_lewis - Professional

Speed Control

Range: 0.5x to 2.0x (default 1.0x)

0.5-0.8x: Slow, for difficult pronunciation or beginners
1.0x: Natural pace
1.2-1.5x: Faster, for advanced learners
1.8-2.0x: Very fast, speed listening

Prerequisites & Setup

Required Installation

Before using TTS, install mlx-audio:

pip install mlx-audio

Optional Language Support

For Japanese and Chinese, install additional components:

pip install misaki[ja]    # For Japanese
pip install misaki[zh]    # For Chinese

Server Startup

The generate_tts function automatically starts the mlx-audio server if it's not running, but you can also start it manually:

# Start server on port 9876 (runs in background)
mlx_audio.server --port 9876 &

# Or start with log output to monitor
mlx_audio.server --port 9876 > /tmp/mlx_audio_server_9876.log 2>&1 &

First run startup time: 6-10 seconds (model loads and caches) Subsequent calls: 1-2 seconds per audio generation

Verify Server is Running

# Check if server is responding
curl http://127.0.0.1:9876/languages

# If you get JSON response, server is ready

Implementation

Bash Function

generate_tts() {
    local text="$1"
    local voice="${2:-af_heart}"
    local lang_code="${3:-a}"
    local speed="${4:-1.0}"
    local server_url="http://127.0.0.1:9876"

    # Validate
    [ -z "$text" ] && { echo "❌ No text provided"; return 1; }

    case "$lang_code" in
        a|b|e|f|h|i|j|p|z) ;;
        *) echo "❌ Invalid language code: $lang_code"; return 1 ;;
    esac

    # Language names
    declare -A lang_names=([a]="American English" [b]="British English" [e]="Spanish" [f]="French" [h]="Hindi" [i]="Italian" [j]="Japanese" [p]="Portuguese" [z]="Mandarin Chinese")

    # Start server if needed
    if ! curl -s "$server_url/languages" > /dev/null 2>&1; then
        echo "🚀 Starting mlx-audio server..."
        nohup mlx_audio.server --port 9876 > /tmp/mlx_audio_server_9876.log 2>&1 &

        for i in {1..20}; do
            curl -s "$server_url/languages" > /dev/null 2>&1 && { echo "✅ Server ready"; break; }
            sleep 0.5
        done

        curl -s "$server_url/languages" > /dev/null 2>&1 || { echo "❌ Server failed. Check: tail -f /tmp/mlx_audio_server_9876.log"; return 1; }
    fi

    # Generate audio
    echo "🎙️  Generating ${lang_names[$lang_code]} audio..."
    local response=$(curl -s -X POST "$server_url/tts" \
      -d "text=$text" -d "voice=$voice" -d "speed=$speed" \
      -d "language=$lang_code" -d "model=mlx-community/Kokoro-82M-4bit")

    # Extract filename
    echo "$response" | grep -q '"error"' && { echo "❌ TTS failed"; return 1; }
    local filename=$(echo "$response" | python3 -c "import json, sys; print(json.load(sys.stdin)['filename'])" 2>/dev/null)
    [ -z "$filename" ] && { echo "❌ No audio filename"; return 1; }

    # Download and play
    local output="/tmp/tts_$(date +%s).wav"
    curl -s "$server_url/audio/$filename" -o "$output"

    echo ""
    echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
    echo "🎤 ${voice} says (${lang_names[$lang_code]}, ${speed}x):"
    echo "   \"$text\""
    echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
    echo ""

    echo "▶️  Playing audio..."
    afplay "$output"
    echo "✅ Playback complete"
    rm "$output"
}

export -f generate_tts

Usage

generate_tts "text" "voice" "lang_code" "speed"

# Examples
generate_tts "Hello"                                    # Default (English, af_heart, 1.0x)
generate_tts "Hola" "am_michael" "e" "0.8"             # Spanish, slower
generate_tts "Bonjour" "bf_emma" "f" "1.0"             # French, British voice
generate_tts "Ciao" "af_bella" "i" "1.0"               # Italian

Workflow

When user requests TTS:

Extract text to speak
Determine language from context
Choose voice:
- Default: af_heart
- Clear pronunciation: af_nova
- Language learning: am_michael
Set speed:
- Beginners: 0.8x
- Normal: 1.0x
- Advanced: 1.2x
Call generate_tts with parameters

Voice Selection Guide

Use Case	Voice	Reason
General	`af_heart`	Warm, approachable
Clear pronunciation	`af_nova`	Precise
Language learning	`am_michael`	Authoritative
Professional	`bf_emma`, `bm_george`	Distinguished

Language	Best Voices	Speed
Spanish/Portuguese/Chinese	`am_michael`, `af_heart`	0.8-1.0x
French	`af_nova`, `bf_emma`	0.8x
Italian	`af_bella`, `am_adam`	1.0x
Japanese	`af_nova`, `af_heart`	1.0x

Best Practices

DO:

Show text before playing
Use appropriate speed for context
Keep text moderate length (1-3 sentences)
Generate only when user requests

DON'T:

Auto-generate without request
Use very long text (split into chunks)
Mix languages in one call

Example Interactions

Pronunciation:

User: "How do you pronounce 'entrepreneur'?"
Claude: "The word 'entrepreneur' is pronounced: /ˌɑːntrəprəˈnɜːr/"
[Calls: generate_tts "entrepreneur" "af_nova" "a" "0.8"]

Language Learning:

User: "How do you say 'good morning' in Spanish?"
Claude: "In Spanish: **Buenos días** (buenos = good, días = days/morning)"
[Calls: generate_tts "Buenos días" "am_michael" "e" "0.8"]

Troubleshooting

Server Won't Start

1. Check if mlx-audio is installed:

python3 -c "import mlx_audio; print('✅ mlx-audio installed')"

2. If not installed, install it:

pip install mlx-audio

3. Check if port 9876 is in use:

lsof -i :9876                    # List what's using the port
kill $(lsof -t -i:9876)          # Kill existing process

4. Start server manually and monitor logs:

mlx_audio.server --port 9876 > /tmp/mlx_audio_server_9876.log 2>&1 &
tail -f /tmp/mlx_audio_server_9876.log  # Watch startup logs

5. If server still fails to start:

Check available disk space (model cache requires ~2GB)
Verify Python 3.9+ is installed
Try on a machine with better hardware (requires GPU/CPU acceleration)

TTS Generation Fails

Server is running but audio generation fails:

Check server logs: tail -f /tmp/mlx_audio_server_9876.log
Verify curl can reach server: curl http://127.0.0.1:9876/languages
Check if text is valid (not empty, properly quoted)

Audio Not Playing

File generated but won't play:

# Test afplay works on macOS
afplay /System/Library/Sounds/Glass.aiff

# Check if audio files are being created
ls -lh /tmp/tts_*.wav

Missing Language Dependencies

Install optional language support if needed:

pip install misaki[ja]           # For Japanese
pip install misaki[zh]           # For Chinese

Performance Notes

Typical timing:

Server already running: ~1-2 seconds per call
Server cold start: ~6-10 seconds (model loads once)
First generation: ~3-5 seconds (model cached in memory)
Subsequent calls: ~1-2 seconds (model cached)

Memory usage:

Server baseline: ~200MB
Running model: ~2GB RAM
Cache: ~2GB disk

Optimization tips:

Start server once at session beginning if doing multiple TTS calls
Keep text moderate length (1-3 sentences) for faster generation
Don't stop server between calls - it stays ready in background