Build self-hosted TTS APIs using HuggingFace models (Parler-TTS, F5-TTS, XTTS-v2) and create LiveKit voice agent plugins with streaming support...
Build production-ready self-hosted Text-to-Speech APIs using HuggingFace models and integrate them with LiveKit voice agents through custom plugins.
This skill enables you to:
When to use this skill:
Select the best model for your use case from the HuggingFace ecosystem.
Load model comparison: TTS Models Reference
Quick Selection Guide:
| Use Case | Recommended Model | Why |
|---|---|---|
| Production voice agents | Parler-TTS Mini | Fast, CPU-friendly, text-based voice control |
| High-quality synthesis | Parler-TTS Large / F5-TTS | Superior natural quality |
| Multilingual support | XTTS-v2 | 17 languages, voice cloning |
| Cost optimization | Parler-TTS Mini on CPU | Runs efficiently without GPU |
Example decision:
parler-tts/parler-tts-mini-v1
"A friendly, professional voice with moderate pace"
Create a FastAPI server that hosts the TTS model with both batch and streaming endpoints.
Use the provided implementation:
The skill includes a complete TTS API server at tts-api/main.py that supports:
/synthesize for simple text-to-speech
/ws/synthesize for real-time incremental synthesis
Quick start:
# Navigate to TTS API directory
cd tts-api
# Install dependencies
pip install -r requirements.txt
# Configure environment
cp .env.example .env
# Edit .env to set:
# TTS_MODEL_TYPE=parler
# TTS_MODEL_NAME=parler-tts/parler-tts-mini-v1
# TTS_DEVICE=cpu
# Run the server
python main.py
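The three settings above could be read at startup along these lines. This is a minimal sketch, not the code in tts-api/main.py: `TTSConfig` and `load_config` are hypothetical names, and the defaults simply mirror the example values shown above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class TTSConfig:
    """Hypothetical container for the TTS_* settings."""
    model_type: str
    model_name: str
    device: str


def load_config(env: Mapping[str, str] = os.environ) -> TTSConfig:
    """Read TTS_* settings, falling back to the quick-start example values."""
    config = TTSConfig(
        model_type=env.get("TTS_MODEL_TYPE", "parler"),
        model_name=env.get("TTS_MODEL_NAME", "parler-tts/parler-tts-mini-v1"),
        device=env.get("TTS_DEVICE", "cpu"),
    )
    # Fail early on a typo instead of at model load time.
    if config.device not in ("cpu", "cuda"):
        raise ValueError(f"TTS_DEVICE must be 'cpu' or 'cuda', got {config.device!r}")
    return config
```

Passing a plain dict instead of `os.environ` makes the helper easy to exercise in tests.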
The server will:
GET /health
POST /synthesize and WS /ws/synthesize
Test the API:
# Test batch synthesis
curl -X POST http://localhost:8001/synthesize \
-H "Content-Type: application/json" \
-d '{"text": "Hello! This is a test.", "format": "wav"}' \
--output test.wav
# Check health
curl http://localhost:8001/health
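The same batch request can be issued from Python with only the standard library. The sketch below assumes, as the `--output test.wav` flag in the curl command suggests, that the response body is the raw audio; `build_synthesize_request` and `synthesize_to_file` are hypothetical helper names.

```python
import json
import urllib.request


def build_synthesize_request(
    text: str,
    fmt: str = "wav",
    base_url: str = "http://localhost:8001",
) -> urllib.request.Request:
    """Build the same POST /synthesize request as the curl command."""
    body = json.dumps({"text": text, "format": fmt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/synthesize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, path: str, timeout: float = 60.0) -> int:
    """Send the request and write the response body to path; returns bytes written.

    Assumption: the server replies with raw audio bytes.
    """
    request = build_synthesize_request(text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(path, "wb") as fh:
        fh.write(audio)
    return len(audio)
```

With the server running, `synthesize_to_file("Hello! This is a test.", "test.wav")` would mirror the curl call.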
For detailed implementation patterns and best practices:
Key implementation details:
The provided tts-api/main.py includes:
Build a LiveKit plugin that connects your voice agents to the self-hosted TTS API.
Use the provided plugin implementation:
The skill includes a complete LiveKit plugin at livekit-plugin-custom-tts/ with:
Install the plugin:
# Navigate to plugin directory
cd livekit-plugin-custom-tts
# Install in development mode
pip install -e .
# Or install from source
pip install .
Use in a voice agent:
from livekit import agents
from livekit.agents import AgentSession
from livekit.plugins import openai, deepgram, silero
from livekit.plugins import custom_tts

async def entrypoint(ctx: agents.JobContext):
    # Initialize session with custom TTS
    session = AgentSession(
        vad=silero.VAD.load(),
        stt=deepgram.STT(model="nova-2-general"),
        llm=openai.LLM(model="gpt-4o-mini"),
        # Use custom self-hosted TTS
        tts=custom_tts.TTS(
            api_url="http://localhost:8001",
            options=custom_tts.TTSOptions(
                voice_description="A friendly, conversational voice.",
                sample_rate=24000,
            ),
        ),
    )
    await ctx.connect()
    # YourAgent is a placeholder for your own agent class
    await session.start(agent=YourAgent(), room=ctx.room)
For detailed plugin development patterns:
Key plugin features:
The provided implementation includes:
Verify that the TTS API and plugin work together correctly.
Testing levels:
1. API-level testing:
# Start the TTS API
cd tts-api
python main.py
# In another terminal, test synthesis
curl -X POST http://localhost:8001/synthesize \
-H "Content-Type: application/json" \
-d '{"text": "Testing TTS API", "format": "wav"}' \
--output test.wav
# Play the audio
ffplay test.wav # or open test.wav
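Besides listening to the file, the WAV header can be inspected programmatically. The following is a small sketch using Python's standard `wave` module; `describe_wav` is a hypothetical helper, and it assumes the API returns PCM-encoded WAV, since `wave` raises `wave.Error` on formats it does not support.

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Return basic properties of PCM WAV audio held in memory."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wav.getsampwidth(),
            # Duration follows from frame count divided by frames per second.
            "duration_s": frames / rate if rate else 0.0,
        }
```

For example, `describe_wav(open("test.wav", "rb").read())` lets you confirm that the sample rate matches what the plugin is configured to expect.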
2. Plugin-level testing:
Use the provided example script:
cd livekit-plugin-custom-tts/examples
python basic_usage.py
This will:
output.wav
3. Voice agent testing:
Use the provided voice agent example:
# Set environment variables
export LIVEKIT_URL="wss://your-livekit.cloud"
export LIVEKIT_API_KEY="your-api-key"
export LIVEKIT_API_SECRET="your-api-secret"
export OPENAI_API_KEY="your-openai-key"
export DEEPGRAM_API_KEY="your-deepgram-key"
# Run the voice agent
cd livekit-plugin-custom-tts/examples
python voice_agent.py start
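A missing variable tends to surface only as a confusing connection or authentication error, so a preflight check can help. This sketch uses the variable names from the exports above; `missing_env_vars` is a hypothetical helper and not part of the provided examples.

```python
import os
from typing import Mapping

# Variable names taken from the export commands above.
REQUIRED_VARS = (
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
    "OPENAI_API_KEY",
    "DEEPGRAM_API_KEY",
)


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
    print("All required environment variables are set.")
```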
Verification checklist:
Deploy the TTS API and voice agent to production infrastructure.
Deployment options:
Option 1: Docker Compose (Quick Start)
Use the provided configuration:
# Create deployment directory
mkdir deployment
cd deployment
# Copy TTS API
cp -r ../tts-api .
# Set environment variables
export LIVEKIT_URL="wss://your-livekit.cloud"
export LIVEKIT_API_KEY="your-api-key"
# ... other vars
# Run docker-compose
docker-compose up -d
Option 2: Kubernetes (Production Scale)
Use the provided Kubernetes manifests in the deployment guide.
For comprehensive deployment instructions:
Production deployment includes:
Production checklist:
Create different agents with different voices:
# Customer support agent
support_agent = AgentSession(
    tts=custom_tts.TTS(
        api_url="http://localhost:8001",
        options=custom_tts.TTSOptions(
            voice_description="A professional, clear voice with measured pace.",
        ),
    ),
)

# Sales agent
sales_agent = AgentSession(
    tts=custom_tts.TTS(
        api_url="http://localhost:8001",
        options=custom_tts.TTSOptions(
            voice_description="An energetic, friendly voice with upbeat delivery.",
        ),
    ),
)
Use XTTS-v2 for multilingual support:
# Configure TTS API for XTTS
TTS_MODEL_TYPE=xtts
TTS_MODEL_NAME=tts_models/multilingual/multi-dataset/xtts_v2
# In your agent
tts=custom_tts.TTS(
    api_url="http://localhost:8001",
    options=custom_tts.TTSOptions(
        voice_description="Multilingual voice",
        sample_rate=24000,
    ),
)
Implement fallback to cloud TTS on self-hosted failure:
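import logging

logger = logging.getLogger(__name__)

async def create_tts():
    """Create TTS with fallback."""
    try:
        # Try self-hosted first. Note: this only catches errors raised
        # while constructing the client, not failures during synthesis.
        return custom_tts.TTS(api_url="http://localhost:8001")
    except Exception as e:
        logger.warning(f"Self-hosted TTS unavailable: {e}")
        # Fallback to cloud provider
        return openai.TTS(voice="alloy")

# Inside your async entrypoint:
session = AgentSession(
    tts=await create_tts(),
    # ... other config
)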
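Because constructing a client object may succeed even when the server is down, one option is to probe the API's GET /health endpoint before choosing a backend. The sketch below is standard-library only; `http_health_check` and `choose_backend` are hypothetical helpers, and treating HTTP 200 as "healthy" is an assumption about the server's behavior.

```python
import http.client
import urllib.request
from typing import Callable


def http_health_check(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # URLError/HTTPError and socket timeouts derive from OSError;
        # a malformed URL raises ValueError.
        return False


def choose_backend(is_healthy: Callable[[], bool]) -> str:
    """Pick 'self_hosted' when the probe passes, otherwise 'cloud'."""
    return "self_hosted" if is_healthy() else "cloud"
```

A caller could then branch on `choose_backend(lambda: http_health_check("http://localhost:8001"))`. The probe is a blocking call, so inside an async agent it would typically run via `asyncio.to_thread`.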
Cache common phrases to reduce synthesis latency:
# functools.lru_cache would cache the coroutine object rather than the
# audio, and a coroutine cannot be awaited twice, so use a plain dict.
_audio_cache: dict[str, bytes] = {}

async def synthesize_cached(text: str) -> bytes:
    """Synthesize with caching for common phrases."""
    if text not in _audio_cache:
        # First time: synthesize and cache
        _audio_cache[text] = await synthesize_async(text)
    # Subsequent times: return cached audio
    return _audio_cache[text]

# Common greetings, confirmations, etc. are cached
audio = await synthesize_cached("Hello! How can I help you today?")
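To keep memory bounded, a phrase cache can evict its least recently used entry once it reaches a size limit. The class below is a sketch of that idea built on `collections.OrderedDict`; `AsyncLRUCache` is a hypothetical name, and the synthesis function is passed in so that no particular TTS client is assumed.

```python
import asyncio
from collections import OrderedDict
from typing import Awaitable, Callable


class AsyncLRUCache:
    """Small LRU cache for the results of an async synthesis function."""

    def __init__(self, synthesize: Callable[[str], Awaitable[bytes]], maxsize: int = 100):
        self._synthesize = synthesize
        self._maxsize = maxsize
        self._entries = OrderedDict()

    async def get(self, text: str) -> bytes:
        if text in self._entries:
            self._entries.move_to_end(text)  # mark as most recently used
            return self._entries[text]
        audio = await self._synthesize(text)
        self._entries[text] = audio
        if len(self._entries) > self._maxsize:
            self._entries.popitem(last=False)  # evict least recently used
        return audio
```

Two concurrent misses for the same phrase will both trigger synthesis in this version; that is harmless but wasteful, and a per-key lock would avoid it if it matters.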
Symptoms: High latency (>2 seconds per sentence)
Solutions:
Use GPU acceleration:
Set TTS_DEVICE=cuda in environment
Confirm the GPU is visible (nvidia-smi)
Use faster model:
Switch from parler-tts-large-v1 to parler-tts-mini-v1
Optimize inference:
# Add to TTS API startup
import torch
model = torch.compile(model, mode="reduce-overhead")
Implement sentence-level streaming:
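Sentence-level streaming means synthesis can start on the first complete sentence while later text is still arriving. The generator below sketches the buffering step only; `stream_sentences` is a hypothetical helper, and the punctuation-based regex is a simplification that will split incorrectly on abbreviations such as "Dr.".

```python
import re
from typing import Iterable, Iterator

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split a complete text into sentences."""
    return [part for part in _SENTENCE_END.split(text.strip()) if part]


def stream_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in incremental text."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever remains when the input ends.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence could be sent to the TTS API immediately, so playback of the first sentence overlaps with synthesis of the next.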
Symptoms: Connection lost during long syntheses
Solutions:
Verify keepalive is enabled:
_keepalive_loop()
Increase timeouts:
# In plugin (asyncio.timeout requires Python 3.11+)
async with asyncio.timeout(60):  # 60 second timeout
    message = await self._ws.recv()
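On Python versions before 3.11, `asyncio.wait_for` gives the same per-call timeout, and it combines naturally with a few retries. The helper below is a sketch under that assumption; `recv_with_retry` is a hypothetical name, and `recv` stands in for any zero-argument coroutine function such as a WebSocket receive method.

```python
import asyncio
from typing import Awaitable, Callable


async def recv_with_retry(
    recv: Callable[[], Awaitable[bytes]],
    timeout: float = 60.0,
    attempts: int = 3,
    base_delay: float = 0.5,
) -> bytes:
    """Await recv() with a per-attempt timeout, backing off between attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(recv(), timeout=timeout)
        except asyncio.TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the timeout to the caller
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Whether retrying a receive is safe depends on the WebSocket library's cancellation behavior, so treat this as a starting point rather than a drop-in fix.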
Check network policies:
Symptoms: Robotic voice, artifacts, glitches
Solutions:
Use larger model:
parler-tts-large-v1 or F5-TTS
Adjust voice description:
# More detailed description = better quality
voice_description="A warm, natural voice speaking clearly with good articulation and moderate pace"
Check audio format:
Symptoms: Server runs out of memory, crashes
Solutions:
Use smaller model:
parler-tts-mini-v1 uses ~880M parameters vs. ~2.4B for large
Limit concurrent requests:
import asyncio

MAX_CONCURRENT = 5
semaphore = asyncio.Semaphore(MAX_CONCURRENT)

@app.post("/synthesize")
async def synthesize(request: TTSRequest):
    # Requests beyond the limit wait here instead of loading more work into memory
    async with semaphore:
        return await synthesize_async(request)
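The effect of the semaphore can be checked in isolation, without the web framework. This sketch runs a batch of dummy jobs through the same pattern; `run_limited` is a hypothetical helper and the jobs stand in for synthesis calls.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

MAX_CONCURRENT = 5


async def run_limited(
    jobs: Sequence[Callable[[], Awaitable[int]]],
    limit: int = MAX_CONCURRENT,
) -> list:
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        # Jobs beyond the limit wait here until a slot frees up.
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Counting active jobs inside each dummy job, as a test would, shows that the number in flight never exceeds the limit even when many more are submitted at once.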
Clear audio buffers:
Symptoms: Error loading model from HuggingFace
Solutions:
Check internet connectivity:
curl -I https://huggingface.co
Use authentication token (if needed):
export HF_TOKEN="your-huggingface-token"
Pre-download models:
from huggingface_hub import snapshot_download
snapshot_download("parler-tts/parler-tts-mini-v1")
✅ DO:
❌ DON'T:
✅ DO:
aclose()
❌ DON'T:
✅ DO:
❌ DON'T:
Load these resources as needed for detailed information:
TTS Models Comparison: Comprehensive comparison of HuggingFace TTS models (Parler-TTS, F5-TTS, XTTS-v2) with performance benchmarks, code examples, and selection guide.
API Implementation Guide: Best practices for building TTS APIs including streaming patterns, model management, audio format handling, security, and optimization strategies.
Plugin Development Guide: Detailed guide for implementing LiveKit TTS plugins including WebSocket communication, async patterns, error handling, and testing.
Deployment Guide: Production deployment with Docker Compose, Kubernetes, scaling strategies, monitoring, and security best practices.
TTS API Server:
tts-api/main.py: Complete FastAPI server with batch and streaming endpoints
tts-api/requirements.txt: Python dependencies
tts-api/Dockerfile: Container configuration
tts-api/.env.example: Environment variables template
LiveKit Plugin:
livekit-plugin-custom-tts/: Complete plugin package
livekit-plugin-custom-tts/livekit/plugins/custom_tts/tts.py: Plugin implementation
livekit-plugin-custom-tts/examples/voice_agent.py: Voice agent integration example
livekit-plugin-custom-tts/examples/basic_usage.py: Standalone usage example
GET / - Health check and info
GET /health - Health status
POST /synthesize - Batch synthesis
WS /ws/synthesize - Streaming synthesis
# TTS API
TTS_MODEL_TYPE=parler # parler, f5, xtts
TTS_MODEL_NAME=parler-tts/parler-tts-mini-v1
TTS_DEVICE=cuda # cuda or cpu
# LiveKit Agent
LIVEKIT_URL=wss://your-livekit.cloud
LIVEKIT_API_KEY=your-api-key
LIVEKIT_API_SECRET=your-api-secret
TTS_API_URL=http://localhost:8001
| Model | Size | Speed | Quality | Multilingual |
|---|---|---|---|---|
| parler-tts-mini-v1 | 880M | ★★★★★ | ★★★★☆ | Limited |
| parler-tts-large-v1 | 2.4B | ★★★☆☆ | ★★★★★ | Limited |
| F5-TTS | Varies | ★★★☆☆ | ★★★★★ | Good |
| XTTS-v2 | 1.2B | ★★★★☆ | ★★★★★ | Excellent |
"A friendly, conversational voice with moderate pace."
"A professional, clear voice speaking slowly and deliberately."
"An energetic, upbeat voice with fast delivery and enthusiasm."
"A calm, soothing voice with gentle, measured speech."