
    davila7/voice-ai-development
    Productivity · 19,892 · 2 installs

    About

    Expert in building voice AI applications - from real-time voice agents to voice-enabled apps...

    SKILL.md

    Voice AI Development

    Role: Voice AI Architect

    You are an expert in building real-time voice applications. You think in terms of latency budgets, audio quality, and user experience. You know that voice apps feel magical when fast and broken when slow. You choose the right combination of providers for each use case and optimize relentlessly for perceived responsiveness.
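
    To make "latency budget" concrete, it helps to write the budget down and compare it against measurements from test calls. A minimal sketch (the numbers are illustrative targets, not provider benchmarks):

    # Illustrative per-stage budget for one voice turn, in milliseconds.
    # A common target is under ~800 ms from end of user speech to first audio.
    LATENCY_BUDGET_MS = {
        "vad_endpointing": 300,   # silence needed to decide the user stopped
        "stt_final": 150,         # final transcript after the endpoint
        "llm_first_token": 200,   # time to first streamed token
        "tts_first_chunk": 150,   # time to first audio chunk out
    }

    def check_budget(measured: dict[str, float]) -> None:
        """Flag stages that exceeded their budget in a test call."""
        for stage, limit in LATENCY_BUDGET_MS.items():
            actual = measured.get(stage, 0.0)
            print(f"{stage:16s} {actual:6.0f} / {limit} ms "
                  f"{'OK' if actual <= limit else 'OVER'}")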

    Capabilities

    • OpenAI Realtime API
    • Vapi voice agents
    • Deepgram STT/TTS
    • ElevenLabs voice synthesis
    • LiveKit real-time infrastructure
    • WebRTC audio handling
    • Voice agent design
    • Latency optimization

    Requirements

    • Python or Node.js
    • API keys for providers
    • Audio handling knowledge

    Patterns

    OpenAI Realtime API

    Native voice-to-voice with GPT-4o

    When to use: When you want integrated voice AI without separate STT/TTS

    import asyncio
    import websockets
    import json
    import base64
    
    OPENAI_API_KEY = "sk-..."
    
    async def voice_session():
        url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
        headers = {
            "Authorization": f"Bearer {OPENAI_API_KEY}",
            "OpenAI-Beta": "realtime=v1"
        }
    
        # websockets < 14 uses extra_headers; newer releases renamed it
        # to additional_headers
        async with websockets.connect(url, extra_headers=headers) as ws:
            # Configure session
            await ws.send(json.dumps({
                "type": "session.update",
                "session": {
                    "modalities": ["text", "audio"],
                    "voice": "alloy",  # alloy, echo, fable, onyx, nova, shimmer
                    "input_audio_format": "pcm16",
                    "output_audio_format": "pcm16",
                    "input_audio_transcription": {
                        "model": "whisper-1"
                    },
                    "turn_detection": {
                        "type": "server_vad",  # Voice activity detection
                        "threshold": 0.5,
                        "prefix_padding_ms": 300,
                        "silence_duration_ms": 500
                    },
                    "tools": [
                        {
                            "type": "function",
                            "name": "get_weather",
                            "description": "Get weather for a location",
                            "parameters": {
                                "type": "object",
                                "properties": {
                                    "location": {"type": "string"}
                                }
                            }
                        }
                    ]
                }
            }))
    
            # Send audio (PCM16, 24kHz, mono)
            async def send_audio(audio_bytes):
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(audio_bytes).decode()
                }))
    
            # Receive events
            async for message in ws:
                event = json.loads(message)

                if event["type"] == "response.audio.delta":
                    # Base64-encoded PCM16 audio chunk from the model
                    audio_chunk = base64.b64decode(event["delta"])
                    # ...feed audio_chunk to your playback path here
                elif event["type"] == "response.done":
                    break  # the model finished responding for this turn

    asyncio.run(voice_session())


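    To feed send_audio, you need mono PCM16 at 24 kHz from the microphone. A minimal capture sketch, assuming the third-party sounddevice library (any audio stack that yields raw int16 frames works; the queue wiring to the session above is schematic):

    import asyncio
    import sounddevice as sd

    async def mic_to_session(queue: asyncio.Queue, samplerate=24000):
        """Capture mono PCM16 mic audio and forward it to the session."""
        loop = asyncio.get_running_loop()

        def callback(indata, frames, time_info, status):
            # RawInputStream hands us a buffer of int16 samples
            loop.call_soon_threadsafe(queue.put_nowait, bytes(indata))

        with sd.RawInputStream(samplerate=samplerate, channels=1,
                               dtype="int16", blocksize=samplerate // 25,
                               callback=callback):
            while True:
                chunk = await queue.get()  # ~40 ms of audio per block
                await send_audio(chunk)  # helper defined inside voice_session
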
    Vapi Voice Agent

    Build voice agents with Vapi platform

    When to use: Phone-based agents, quick deployment

    # Vapi provides hosted voice agents with webhooks
    
    from flask import Flask, request, jsonify
    from vapi import Vapi  # pip install vapi_server_sdk
    
    app = Flask(__name__)
    client = Vapi(token="...")
    
    # Create an assistant
    assistant = client.assistants.create(
        name="Support Agent",
        model={
            "provider": "openai",
            "model": "gpt-4o",
            "messages": [
                {
                    "role": "system",
                    "content": "You are a helpful support agent..."
                }
            ]
        },
        voice={
            "provider": "11labs",
            "voiceId": "21m00Tcm4TlvDq8ikWAM"  # Rachel
        },
        firstMessage="Hi! How can I help you today?",
        transcriber={
            "provider": "deepgram",
            "model": "nova-2"
        }
    )
    
    # Webhook for conversation events
    @app.route("/vapi/webhook", methods=["POST"])
    def vapi_webhook():
        event = request.json
    
        if event["type"] == "function-call":
            # Handle tool call
            name = event["functionCall"]["name"]
            args = event["functionCall"]["parameters"]
    
            if name == "check_order":
                result = check_order(args["order_id"])  # your business logic
                return jsonify({"result": result})
    
        elif event["type"] == "end-of-call-report":
            # Call ended - save the transcript (app-specific persistence)
            transcript = event["transcript"]
            save_transcript(event["call"]["id"], transcript)
    
        return jsonify({"ok": True})
    
    # Start outbound call
    call = client.calls.create(
        assistant_id=assistant.id,
        customer={
            "number": "+1234567890"
        },
        phoneNumber={
            "twilioPhoneNumber": "+0987654321"
        }
    )
    
    # Or create web call
    web_call = client.calls.create(
        assistant_id=assistant.id,
        type="web"
    )
    # Returns URL for WebRTC connection
    
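    Before wiring up a real call, you can sanity-check the webhook handler by posting a synthetic event at it. A sketch, assuming the Flask app above is running locally on port 5000 (the payload mirrors the function-call branch in this snippet, not the full Vapi schema; the order id is hypothetical):

    import requests

    fake_event = {
        "type": "function-call",
        "functionCall": {
            "name": "check_order",
            "parameters": {"order_id": "A1001"},  # hypothetical order id
        },
    }
    resp = requests.post("http://localhost:5000/vapi/webhook", json=fake_event)
    print(resp.status_code, resp.json())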

    Deepgram STT + ElevenLabs TTS

    Best-in-class transcription and synthesis

    When to use: High quality voice, custom pipeline

    import asyncio
    import base64
    import json
    
    import websockets
    from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions
    from elevenlabs import ElevenLabs
    
    # Deepgram real-time transcription
    deepgram = DeepgramClient(api_key="...")
    
    async def transcribe_stream(audio_stream):
        # Async live client in SDK v3 (sync code uses listen.live.v("1"))
        connection = deepgram.listen.asynclive.v("1")
    
        # SDK v3 handlers receive the client as the first argument
        async def on_transcript(self, result, **kwargs):
            transcript = result.channel.alternatives[0].transcript
            if transcript:
                print(f"Heard: {transcript}")
                if result.is_final:
                    # Hand the final transcript to your app logic
                    await handle_user_input(transcript)
    
        connection.on(LiveTranscriptionEvents.Transcript, on_transcript)
    
        await connection.start(LiveOptions(
            model="nova-2",          # best quality
            language="en",
            smart_format=True,
            interim_results=True,    # emit partial results as they form
            utterance_end_ms="1000",
            vad_events=True,         # voice activity detection events
            encoding="linear16",
            sample_rate=16000,
        ))
    
        # Stream audio
        async for chunk in audio_stream:
            await connection.send(chunk)
    
        await connection.finish()
    
    # ElevenLabs streaming synthesis
    eleven = ElevenLabs(api_key="...")
    
    def text_to_speech_stream(text: str):
        """Stream TTS audio chunks."""
        audio_stream = eleven.text_to_speech.convert_as_stream(
            voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel
            model_id="eleven_turbo_v2_5",  # Fastest
            text=text,
            output_format="pcm_24000"  # Raw PCM for low latency
        )
    
        for chunk in audio_stream:
            yield chunk
    
    # Or drive the stream-input WebSocket directly for the lowest latency.
    # Protocol: an init message carrying the API key, then text chunks as
    # they arrive from the LLM, then an empty text to close; base64 audio
    # comes back before the full text is known.
    async def tts_websocket(text_stream, voice_id="21m00Tcm4TlvDq8ikWAM"):
        uri = (f"wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
               "/stream-input?model_id=eleven_turbo_v2_5")
        async with websockets.connect(uri) as ws:
            await ws.send(json.dumps({"text": " ", "xi_api_key": "..."}))
    
            async def sender():
                async for text_chunk in text_stream:
                    await ws.send(json.dumps({"text": text_chunk}))
                await ws.send(json.dumps({"text": ""}))  # end of input
    
            send_task = asyncio.create_task(sender())
            async for message in ws:
                data = json.loads(message)
                if data.get("audio"):
                    yield base64.b64decode(data["audio"])
                if data.get("isFinal"):
                    break
            await send_task
    
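    The pcm_24000 output format pairs naturally with a raw output stream, so playback can start on the first chunk. A sketch, again assuming the sounddevice library (24 kHz mono int16 to match the format requested above):

    import sounddevice as sd

    def play_pcm_stream(chunks, samplerate=24000):
        """Play raw PCM16 chunks as they arrive instead of buffering them."""
        with sd.RawOutputStream(samplerate=samplerate, channels=1,
                                dtype="int16") as out:
            for chunk in chunks:
                out.write(chunk)

    play_pcm_stream(text_to_speech_stream("Hi! How can I help you today?"))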

    Anti-Patterns

    ❌ Non-streaming Pipeline

    Why bad: Adds seconds of latency. The user perceives it as slow. Loses conversation flow.

    Instead: Stream everything:

    • STT: interim results
    • LLM: token streaming
    • TTS: chunk streaming

    Start TTS before the LLM finishes (see the sentence-chunking sketch below).
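
    A sketch of starting TTS before the LLM finishes: buffer streamed tokens and flush each completed sentence to TTS (llm_token_stream and tts_stream_fn stand in for whichever providers you chose, e.g. text_to_speech_stream above):

    import re

    SENTENCE_END = re.compile(r"[.!?]\s")

    def stream_llm_to_tts(llm_token_stream, tts_stream_fn):
        """Yield TTS audio per sentence while the LLM keeps generating."""
        buffer = ""
        for token in llm_token_stream:
            buffer += token
            while (match := SENTENCE_END.search(buffer)):
                sentence, buffer = buffer[:match.end()], buffer[match.end():]
                yield from tts_stream_fn(sentence)  # audio starts early
        if buffer.strip():
            yield from tts_stream_fn(buffer)  # flush the trailing fragment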

    ❌ Ignoring Interruptions

    Why bad: Frustrating user experience. Feels like talking to a machine. Wastes time.

    Instead: Implement barge-in detection. Use VAD to detect user speech, stop TTS immediately, and clear the audio queue, as in the sketch below.
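
    A minimal barge-in sketch with asyncio: playback runs as a cancellable task, and a VAD speech-start event cancels it and drains the queue (the play_chunk callable and the VAD hookup are app-specific):

    import asyncio

    class Playback:
        def __init__(self):
            self.queue: asyncio.Queue = asyncio.Queue()
            self.task = None

        def start(self, play_chunk):
            async def run():
                while True:
                    chunk = await self.queue.get()
                    await play_chunk(chunk)
            self.task = asyncio.create_task(run())

        def barge_in(self):
            """Call on a VAD speech-start event: stop audio immediately."""
            if self.task:
                self.task.cancel()
                self.task = None
            while not self.queue.empty():
                self.queue.get_nowait()  # drop queued audio so it never plays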

    ❌ Single Provider Lock-in

    Why bad: May not be best quality. Single point of failure. Harder to optimize.

    Instead: Mix best providers (see the interface sketch after this list):

    • Deepgram for STT (speed + accuracy)
    • ElevenLabs for TTS (voice quality)
    • OpenAI/Anthropic for LLM
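
    One way to stay swappable is to code against small protocols rather than SDK types. A sketch (the adapter classes named in the comment are hypothetical):

    from typing import AsyncIterator, Iterator, Protocol

    class STT(Protocol):
        def transcribe(self, audio: AsyncIterator[bytes]) -> AsyncIterator[str]: ...

    class TTS(Protocol):
        def synthesize(self, text: str) -> Iterator[bytes]: ...

    # Concrete adapters (e.g. DeepgramSTT, ElevenLabsTTS) implement these,
    # so swapping a provider is a one-line change at the call site.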

    Limitations

    • Latency varies by provider
    • Cost per minute adds up
    • Quality depends on network
    • Complex debugging

    Related Skills

    Works well with: langgraph, structured-output, langfuse

    Repository: davila7/claude-code-templates