Generate and play multilingual text-to-speech audio using mlx-audio with the Kokoro model. Use when the user asks to hear a pronunciation, wants text spoken aloud, or wants audio for language learning.
Trigger: The user asks to hear a pronunciation, asks for something to be said aloud, or wants audio for language learning.
| Code | Language | Notes |
|---|---|---|
| a | American English | Default |
| b | British English | |
| e | Spanish | |
| f | French | |
| h | Hindi | |
| i | Italian | |
| j | Japanese | Requires pip install misaki[ja] |
| p | Portuguese (Brazilian) | |
| z | Mandarin Chinese | Requires pip install misaki[zh] |
Pattern: [language][gender]_[name] (e.g., af_heart = American Female Heart)
American Female:
- af_heart - Warm, friendly ⭐ Default
- af_nova - Clear, precise (best for pronunciation)
- af_bella - Expressive
- af_sky - Bright
- af_sarah - Gentle

American Male:
- am_adam - Strong
- am_michael - Authoritative (great for language learning)
- am_eric - Friendly

British Female:
- bf_emma - Elegant
- bf_isabella - Sophisticated

British Male:
- bm_george - Distinguished
- bm_lewis - Professional

Speed range: 0.5x to 2.0x (default 1.0x)
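The naming pattern and the speed range lend themselves to a quick argument check before any audio is requested. The sketch below is illustrative only: `voice_lang_code` and `valid_speed` are hypothetical helper names, not part of mlx-audio, and accepting any of the nine language codes as the first letter of a voice ID is an assumption (the voices listed here only use `a` and `b`).

```shell
# Hypothetical helpers (not part of mlx-audio): check arguments before calling generate_tts.

# Print the language code implied by a voice ID, e.g. af_heart -> a.
# Returns 1 if the ID does not match [language][gender]_[name].
voice_lang_code() {
    case "$1" in
        [abefhijpz][fm]_?*) printf '%s\n' "$1" | cut -c1 ;;
        *) return 1 ;;
    esac
}

# Succeed only if the speed is a plain decimal number between 0.5 and 2.0.
valid_speed() {
    case "$1" in
        ''|*[!0-9.]*|*.*.*) return 1 ;;   # empty, non-numeric, or more than one dot
    esac
    # awk handles the floating-point comparison that the shell cannot do natively
    awk -v s="$1" 'BEGIN { exit !(s >= 0.5 && s <= 2.0) }'
}
```

For example, `voice_lang_code bf_emma` prints `b`, and `valid_speed 2.5` returns a non-zero status.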
Before using TTS, install mlx-audio:
pip install mlx-audio
For Japanese and Chinese, install additional components:
pip install "misaki[ja]" # For Japanese (quotes stop zsh from globbing the brackets)
pip install "misaki[zh]" # For Chinese
The generate_tts function automatically starts the mlx-audio server if it's not running, but you can also start it manually:
# Start server on port 9876 (runs in background)
mlx_audio.server --port 9876 &
# Or start with log output to monitor
mlx_audio.server --port 9876 > /tmp/mlx_audio_server_9876.log 2>&1 &
- First run startup time: 6-10 seconds (model loads and caches)
- Subsequent calls: 1-2 seconds per audio generation
# Check if server is responding
curl http://127.0.0.1:9876/languages
# If you get JSON response, server is ready
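Because the first start can take several seconds, a single curl check may fail simply because the model is still loading. One way to handle this is to poll the same `/languages` endpoint until it answers. This is a minimal sketch: `wait_for_server` is a hypothetical helper name, and the 10-second default timeout is an assumption based on the startup time quoted above.

```shell
# Hypothetical helper: poll /languages until the server answers or the timeout passes.
# Usage: wait_for_server [base_url] [timeout_seconds]
wait_for_server() {
    local url="${1:-http://127.0.0.1:9876}"
    local timeout="${2:-10}"
    local waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # -s: silent, -f: treat HTTP errors as failure, --max-time: cap each probe at 2 s
        if curl -sf --max-time 2 "$url/languages" > /dev/null 2>&1; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}
```

A typical call would be `wait_for_server || tail /tmp/mlx_audio_server_9876.log`, which shows the log only when the server never came up.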
generate_tts() {
    local text="$1"
    local voice="${2:-af_heart}"
    local lang_code="${3:-a}"
    local speed="${4:-1.0}"
    local server_url="http://127.0.0.1:9876"
    local log_file="/tmp/mlx_audio_server_9876.log"
    local lang_name response filename output i

    # Validate
    [ -z "$text" ] && { echo "❌ No text provided"; return 1; }

    # Language names (a case statement instead of an associative array,
    # so this also runs on the bash 3.2 that ships with macOS)
    case "$lang_code" in
        a) lang_name="American English" ;;
        b) lang_name="British English" ;;
        e) lang_name="Spanish" ;;
        f) lang_name="French" ;;
        h) lang_name="Hindi" ;;
        i) lang_name="Italian" ;;
        j) lang_name="Japanese" ;;
        p) lang_name="Portuguese" ;;
        z) lang_name="Mandarin Chinese" ;;
        *) echo "❌ Invalid language code: $lang_code"; return 1 ;;
    esac

    # Start server if needed
    if ! curl -s "$server_url/languages" > /dev/null 2>&1; then
        echo "🚀 Starting mlx-audio server..."
        nohup mlx_audio.server --port 9876 > "$log_file" 2>&1 &
        # Poll every 0.5 s, for up to 10 s
        for i in {1..20}; do
            curl -s "$server_url/languages" > /dev/null 2>&1 && { echo "✅ Server ready"; break; }
            sleep 0.5
        done
        curl -s "$server_url/languages" > /dev/null 2>&1 || { echo "❌ Server failed. Check: tail -f $log_file"; return 1; }
    fi

    # Generate audio (--data-urlencode keeps spaces, "&", and accented characters in the text intact)
    echo "🎙️ Generating ${lang_name} audio..."
    response=$(curl -s -X POST "$server_url/tts" \
        --data-urlencode "text=$text" -d "voice=$voice" -d "speed=$speed" \
        -d "language=$lang_code" -d "model=mlx-community/Kokoro-82M-4bit")

    # Extract filename
    echo "$response" | grep -q '"error"' && { echo "❌ TTS failed"; return 1; }
    filename=$(echo "$response" | python3 -c "import json, sys; print(json.load(sys.stdin)['filename'])" 2>/dev/null)
    [ -z "$filename" ] && { echo "❌ No audio filename"; return 1; }

    # Download and play
    output="/tmp/tts_$(date +%s).wav"
    curl -s "$server_url/audio/$filename" -o "$output"
    echo ""
    echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
    echo "🎤 ${voice} says (${lang_name}, ${speed}x):"
    echo " \"$text\""
    echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
    echo ""
    echo "▶️ Playing audio..."
    afplay "$output"
    echo "✅ Playback complete"
    rm -f "$output"
}
export -f generate_tts
generate_tts "text" "voice" "lang_code" "speed"
# Examples
generate_tts "Hello" # Default (English, af_heart, 1.0x)
generate_tts "Hola" "am_michael" "e" "0.8" # Spanish, slower
generate_tts "Bonjour" "bf_emma" "f" "1.0" # French, British voice
generate_tts "Ciao" "af_bella" "i" "1.0" # Italian
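For language learning it can help to hear a phrase slowly first and then at normal speed. The wrapper below is a sketch of that idea, not part of the skill itself: `speak_slow_then_normal` is a hypothetical name, the 0.8x and 1.0x speeds are taken from the examples above, and it assumes `generate_tts` is already defined in the current shell.

```shell
# Hypothetical wrapper: say a phrase at 0.8x, then again at 1.0x.
# Assumes generate_tts (defined earlier in this document) is loaded in the current shell.
speak_slow_then_normal() {
    local text="$1"
    local voice="${2:-af_heart}"
    local lang_code="${3:-a}"
    if ! command -v generate_tts > /dev/null 2>&1; then
        echo "generate_tts is not defined; load it first" >&2
        return 1
    fi
    # Stop early if the slow pass fails, so the error is not reported twice
    generate_tts "$text" "$voice" "$lang_code" "0.8" || return 1
    generate_tts "$text" "$voice" "$lang_code" "1.0"
}
```

A call such as `speak_slow_then_normal "Buenos días" "am_michael" "e"` would play the phrase twice with the same voice.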
When user requests TTS:
| Use Case | Voice | Reason |
|---|---|---|
| General | af_heart | Warm, approachable |
| Clear pronunciation | af_nova | Precise |
| Language learning | am_michael | Authoritative |
| Professional | bf_emma, bm_george | Distinguished |
| Language | Best Voices | Speed |
|---|---|---|
| Spanish/Portuguese/Chinese | am_michael, af_heart | 0.8-1.0x |
| French | af_nova, bf_emma | 0.8x |
| Italian | af_bella, am_adam | 1.0x |
| Japanese | af_nova, af_heart | 1.0x |
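The per-language recommendations can be expressed as a small lookup so the voice and speed do not have to be chosen by hand each time. This is a sketch under stated assumptions: `recommend_voice` is a hypothetical helper, it always picks the first voice listed for a language, it uses 0.8 where the table gives a 0.8-1.0x range, and it falls back to the documented defaults (af_heart, 1.0) for languages the table does not cover.

```shell
# Hypothetical helper: print "<voice> <speed>" for a language code.
recommend_voice() {
    case "$1" in
        e|p|z) echo "am_michael 0.8" ;;  # table gives 0.8-1.0x; 0.8 is an assumption
        f)     echo "af_nova 0.8" ;;
        i)     echo "af_bella 1.0" ;;
        j)     echo "af_nova 1.0" ;;
        a|b|h) echo "af_heart 1.0" ;;    # not in the table; documented defaults
        *)     return 1 ;;               # unknown language code
    esac
}

# Example use (commented out because it needs generate_tts and a running server):
#   set -- $(recommend_voice f)
#   generate_tts "Bonjour" "$1" "f" "$2"
```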
DO:
DON'T:
Pronunciation:
User: "How do you pronounce 'entrepreneur'?"
Claude: "The word 'entrepreneur' is pronounced: /ˌɑːntrəprəˈnɜːr/"
[Calls: generate_tts "entrepreneur" "af_nova" "a" "0.8"]
Language Learning:
User: "How do you say 'good morning' in Spanish?"
Claude: "In Spanish: **Buenos días** (buenos = good, días = days/morning)"
[Calls: generate_tts "Buenos días" "am_michael" "e" "0.8"]
1. Check if mlx-audio is installed:
python3 -c "import mlx_audio; print('✅ mlx-audio installed')"
2. If not installed, install it:
pip install mlx-audio
3. Check if port 9876 is in use:
lsof -i :9876 # List what's using the port
kill $(lsof -t -i:9876) # Kill existing process
4. Start server manually and monitor logs:
mlx_audio.server --port 9876 > /tmp/mlx_audio_server_9876.log 2>&1 &
tail -f /tmp/mlx_audio_server_9876.log # Watch startup logs
5. If server still fails to start:
Server is running but audio generation fails:
tail -f /tmp/mlx_audio_server_9876.log
curl http://127.0.0.1:9876/languages
File generated but won't play:
# Test afplay works on macOS
afplay /System/Library/Sounds/Glass.aiff
# Check if audio files are being created
ls -lh /tmp/tts_*.wav
Install optional language support if needed:
pip install "misaki[ja]" # For Japanese (quotes stop zsh from globbing the brackets)
pip install "misaki[zh]" # For Chinese
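The individual checks in this section can be bundled into one function that stops at the first failure, which makes it quicker to see where a setup is broken. This is only a sketch: `tts_doctor` is a hypothetical name, and it reuses the commands, port, and log path already shown here rather than any diagnostic feature of mlx-audio.

```shell
# Hypothetical diagnostic: run the checks in order and report the first failure.
tts_doctor() {
    # 1. Is the Python package importable?
    if ! python3 -c "import mlx_audio" > /dev/null 2>&1; then
        echo "mlx-audio is not importable; try: pip install mlx-audio"
        return 1
    fi
    # 2. Does the server answer on the port used in this document?
    if ! curl -sf --max-time 2 "http://127.0.0.1:9876/languages" > /dev/null 2>&1; then
        echo "server is not responding on port 9876; check /tmp/mlx_audio_server_9876.log"
        return 1
    fi
    # 3. Is the macOS audio player available?
    if ! command -v afplay > /dev/null 2>&1; then
        echo "afplay not found; playback requires macOS"
        return 1
    fi
    echo "all checks passed"
}
```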
Typical timing:
Memory usage:
Optimization tips: