parlor-on-device-ai

On-device, real-time multimodal AI voice and vision assistant powered by Gemma 4 E2B and Kokoro TTS, running entirely locally via FastAPI WebSocket server.

---
name: parlor-on-device-ai
description: On-device, real-time multimodal AI voice and vision assistant powered by Gemma 4 E2B and Kokoro TTS, running entirely locally via FastAPI WebSocket server.
triggers:
  - "set up parlor on-device AI"
  - "run local voice AI with camera"
  - "configure parlor multimodal assistant"
  - "use Gemma 4 with Kokoro TTS locally"
  - "build real-time voice assistant on device"
  - "parlor websocket voice vision server"
  - "on-device speech and vision AI"
  - "run parlor with Apple Silicon"
---

# Parlor On-Device AI

> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.

Parlor is a real-time, on-device multimodal AI assistant. It combines Gemma 4 E2B (via LiteRT-LM) for speech and vision understanding with Kokoro TTS for voice output. After the one-time model download, everything runs locally: no API keys, no cloud calls, no cost per request.

## Architecture

```
Browser (mic + camera)
    │
    │  WebSocket (audio PCM + JPEG frames)
    ▼
FastAPI server
    ├── Gemma 4 E2B via LiteRT-LM (GPU)  →  understands speech + vision
    └── Kokoro TTS (MLX on Mac, ONNX on Linux)  →  speaks back
    │
    │  WebSocket (streamed audio chunks)
    ▼
Browser (playback + transcript)
```

Key features:
- **Silero VAD** in browser — hands-free, no push-to-talk
- **Barge-in** — interrupt AI mid-sentence by speaking
- **Sentence-level TTS streaming** — audio starts before full response is ready
- **Platform-aware TTS** — MLX backend on Apple Silicon, ONNX on Linux
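
Barge-in combined with sentence-level streaming can be sketched with plain `asyncio`: the reply is streamed by a task, and that task is cancelled as soon as the user starts speaking. This is a minimal illustration under stated assumptions, not Parlor's actual code; `speak`, `send`, and `demo` are hypothetical names, and the "audio" is a placeholder string.

```python
import asyncio


async def speak(sentences, send):
    """Stand-in for TTS streaming: emit one audio chunk per sentence."""
    for sentence in sentences:
        await send(f"audio:{sentence}")
        await asyncio.sleep(0.05)  # simulated synthesis and playback time


async def demo():
    sent = []
    user_spoke = asyncio.Event()

    async def send(chunk):
        sent.append(chunk)
        if len(sent) == 2:
            user_spoke.set()  # simulate the user talking during sentence two

    reply = asyncio.create_task(speak(["One.", "Two.", "Three."], send))
    await user_spoke.wait()
    reply.cancel()  # barge-in: drop the rest of the reply
    try:
        await reply
    except asyncio.CancelledError:
        pass
    return sent
```

Running `asyncio.run(demo())` returns only the first two chunks, because the third sentence is never synthesized once the reply task is cancelled.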

## Requirements

- Python 3.12+
- macOS with Apple Silicon **or** Linux with a supported GPU
- ~3 GB free RAM
- [`uv`](https://github.com/astral-sh/uv) package manager

## Installation

```bash
git clone https://github.com/fikrikarim/parlor.git
cd parlor

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

cd src
uv sync
uv run server.py
```

Open [http://localhost:8000](http://localhost:8000), grant camera and microphone permissions, and start talking.

Models download automatically on first run (~2.6 GB for Gemma 4 E2B, plus TTS models).

## Configuration

Set environment variables before running:

```bash
# Use a pre-downloaded model instead of auto-downloading
export MODEL_PATH=/path/to/gemma-4-E2B-it.litertlm

# Change server port (default: 8000)
export PORT=9000

uv run server.py
```

| Variable     | Default                        | Description                                    |
|--------------|-------------------------------|------------------------------------------------|
| `MODEL_PATH` | auto-download from HuggingFace | Path to local `.litertlm` model file           |
| `PORT`       | `8000`                         | Server port                                    |
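
As a rough sketch of how a server might read these two variables, the snippet below applies the defaults from the table. `Settings` and `load_settings` are hypothetical names introduced for illustration; the actual `server.py` may read its configuration differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    model_path: Optional[str]  # None means "auto-download"
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read MODEL_PATH and PORT, falling back to the documented defaults."""
    model_path = env.get("MODEL_PATH") or None  # treat an empty string as unset
    port = int(env.get("PORT", "8000"))
    if not 1 <= port <= 65535:
        raise ValueError(f"PORT out of range: {port}")
    return Settings(model_path=model_path, port=port)
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise, for example `load_settings({"PORT": "9000"})`.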

## Project Structure

```
src/
├── server.py              # FastAPI WebSocket server + Gemma 4 inference
├── tts.py                 # Platform-aware TTS (MLX on Mac, ONNX on Linux)
├── index.html             # Frontend UI (VAD, camera, audio playback)
├── pyproject.toml         # Dependencies
└── benchmarks/
    ├── bench.py           # End-to-end WebSocket benchmark
    └── benchmark_tts.py   # TTS backend comparison
```

## Key Components

### server.py — FastAPI WebSocket Server

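The simplified pattern below uses a single bidirectional WebSocket per client: the browser sends audio and camera frames, and the server streams synthesized audio back over the same connection.

```python
# Simplified pattern, not the verbatim server.py.
# run_gemma_inference and synthesize_streaming are sketched in the sections below.
import base64

from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    # Assumed message format (matches the client example later in this document):
    # JSON with base64 PCM audio plus an optional base64 JPEG frame
    async for message in websocket.iter_json():
        audio_pcm = base64.b64decode(message["audio"])
        image_jpeg = base64.b64decode(message["image"]) if "image" in message else None
        response_text = await run_gemma_inference(audio_pcm, image_jpeg)
        # Stream audio back sentence by sentence; each chunk must be bytes
        async for chunk in synthesize_streaming(response_text):
            await websocket.send_bytes(chunk)
```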

### tts.py — Platform-Aware TTS

`tts.py` selects the Kokoro TTS backend based on the platform:

```python
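# Simplified pattern; the import and class names below are illustrative
# and may differ from the actual tts.py
import asyncio
import platform

def get_tts_backend():
    # MLX needs Apple Silicon, so check the CPU architecture as well as the OS
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        from kokoro_mlx import KokoroMLX
        return KokoroMLX()
    # Everything else (e.g. Linux): use the ONNX backend
    from kokoro import KokoroPipeline
    return KokoroPipeline(lang_code='a')

tts = get_tts_backend()

# Sentence-level streaming: yields audio as each sentence is ready.
# split_sentences is assumed to be defined elsewhere.
async def synthesize_streaming(text: str):
    for sentence in split_sentences(text):
        # Run the blocking synthesis call in a thread so the event loop stays responsive
        audio = await asyncio.to_thread(tts.synthesize, sentence)
        yield audio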
```
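
The snippet above calls `split_sentences` without defining it. One possible implementation, offered as an assumption rather than Parlor's actual helper, is a small regex-based splitter:

```python
import re

# Split on whitespace that follows ., !, or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter; does not handle abbreviations such as 'Dr.'."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]
```

A splitter this simple is usually enough for short conversational replies, where the goal is to hand the first sentence to TTS as early as possible.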

### Gemma 4 E2B Inference via LiteRT-LM

```python
# LiteRT-LM inference pattern (simplified; the class and method names here
# are illustrative, so check server.py for the actual LiteRT-LM calls)
import os

from litert_lm import LiteRTLM

# None when MODEL_PATH is unset, which triggers the auto-download
model_path = os.environ.get("MODEL_PATH")

model = LiteRTLM.from_pretrained(
    "google/gemma-4-E2B-it",
    local_path=model_path
)

async def run_gemma_inference(audio_pcm: bytes, image_jpeg: bytes | None = None) -> str:
    inputs = {"audio": audio_pcm}
    if image_jpeg:
        inputs["image"] = image_jpeg

    # Collect streamed tokens, then join once at the end
    tokens = []
    async for token in model.generate_stream(**inputs):
        tokens.append(token)
    return "".join(tokens)
```

## Running Benchmarks

```bash
cd src

# End-to-end WebSocket latency benchmark
uv run benchmarks/bench.py

# Compare TTS backends (MLX vs ONNX)
uv run benchmarks/benchmark_tts.py
```

## Performance Reference (Apple M3 Pro)

| Stage                            | Time          |
|----------------------------------|---------------|
| Speech + vision understanding    | ~1.8–2.2s     |
| Response generation (~25 tokens) | ~0.3s         |
| Text-to-speech (1–3 sentences)   | ~0.3–0.7s     |
| **Total end-to-end**             | **~2.5–3.0s** |

Decode speed: ~83 tokens/sec on GPU.
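
As a quick sanity check on these figures, the generation time follows from the decode speed, and the stage ranges can be summed. The calculation below uses only the numbers in the table; the summed range comes out slightly wider than the stated total, which is expected if the slowest cases of each stage rarely coincide.

```python
DECODE_TOKENS_PER_SEC = 83  # decode speed quoted above
RESPONSE_TOKENS = 25        # typical response length from the table

generation_s = RESPONSE_TOKENS / DECODE_TOKENS_PER_SEC

# (low, high) bounds in seconds for the other two stages
understanding = (1.8, 2.2)
tts = (0.3, 0.7)

total_low = understanding[0] + generation_s + tts[0]
total_high = understanding[1] + generation_s + tts[1]

print(f"generation: {generation_s:.2f}s")            # generation: 0.30s
print(f"total: {total_low:.1f}-{total_high:.1f}s")   # total: 2.4-3.2s
```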

## Common Patterns

### Extending the System Prompt

Modify the prompt in `server.py` to change the AI's persona or task:

```python
SYSTEM_PROMPT = """You are a helpful language tutor. 
Respond conversationally in 1-3 sentences.
If the user makes a grammar mistake, gently correct them.
You can see through the user's camera and discuss what you observe."""
```

### Adding a New Language for TTS

Kokoro supports multiple language codes. Set `lang_code` in `tts.py`:

```python
# Language codes: 'a' = American English, 'b' = British English
# 'e' = Spanish, 'f' = French, 'z' = Chinese, 'j' = Japanese
pipeline = KokoroPipeline(lang_code='e')  # Spanish
```

### Customizing VAD Sensitivity (index.html)

The Silero VAD threshold can be tuned in the frontend:

```javascript
// In index.html: lower positiveSpeechThreshold = more sensitive
const vad = await MicVAD.new({
  positiveSpeechThreshold: 0.6,   // speech starts above this probability; lower = triggers more easily
  negativeSpeechThreshold: 0.35,  // frames below this probability count as silence
  minSpeechFrames: 3,             // shorter detections are discarded as misfires
  onSpeechStart: () => { /* UI feedback */ },
  onSpeechEnd: (audio) => sendAudioToServer(audio),
});
```
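
The two thresholds act as hysteresis: a higher bar to start a speech segment and a lower bar to end it. The Python sketch below models that interaction on a list of per-frame speech probabilities. It is a deliberately simplified model, not the VAD library's implementation (which also waits for several silent frames before ending a segment), and `detect_segments` is a hypothetical name.

```python
def detect_segments(probs, positive=0.6, negative=0.35, min_speech_frames=3):
    """Return (start, end) frame index pairs, with end exclusive.

    Simplified model: speech starts when a frame's probability reaches
    `positive` and ends at the first frame that falls below `negative`.
    Segments shorter than `min_speech_frames` are dropped as misfires.
    """
    segments = []
    start = None
    for i, p in enumerate(probs):
        if start is None:
            if p >= positive:
                start = i
        elif p < negative:
            if i - start >= min_speech_frames:
                segments.append((start, i))
            start = None
    # Close a segment that is still open at the end of the input
    if start is not None and len(probs) - start >= min_speech_frames:
        segments.append((start, len(probs)))
    return segments
```

With `probs = [0.1, 0.7, 0.5, 0.4, 0.2, 0.9, 0.1]`, the default settings yield one segment covering frames 1 to 3, while raising `positive` to 0.8 yields none: the 0.7 frame no longer starts a segment and the lone 0.9 frame is too short to count.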

### Sending Frames Programmatically (WebSocket Client Example)

```python
import asyncio
import base64
import json

import websockets

async def send_audio_frame(audio_pcm_bytes: bytes, jpeg_bytes: bytes | None = None):
    uri = "ws://localhost:8000/ws"
    async with websockets.connect(uri) as ws:
        payload = {
            "audio": base64.b64encode(audio_pcm_bytes).decode(),
        }
        if jpeg_bytes:
            payload["image"] = base64.b64encode(jpeg_bytes).decode()

        await ws.send(json.dumps(payload))

        # Receive the streamed audio response; this loop runs until the
        # server closes the connection
        async for message in ws:
            audio_chunk = message  # raw PCM bytes
            # play or save audio_chunk

# Usage: asyncio.run(send_audio_frame(pcm_bytes))
```
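
To save the received chunks, raw PCM can be wrapped in a WAV container with the standard library. The sample format here is an assumption: 16-bit mono at 24 kHz is used as a placeholder, so check what the server actually emits before relying on these defaults. `pcm_chunks_to_wav` is a hypothetical helper name.

```python
import io
import wave


def pcm_chunks_to_wav(chunks, sample_rate=24000, channels=1, sample_width=2) -> bytes:
    """Wrap raw PCM chunks in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # 2 bytes = 16-bit samples (assumed)
        wav.setframerate(sample_rate)   # 24 kHz is an assumption
        for chunk in chunks:
            wav.writeframes(chunk)
    return buffer.getvalue()
```

The returned bytes can be written straight to disk, for example `open("reply.wav", "wb").write(pcm_chunks_to_wav(chunks))`.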

## Troubleshooting

### Model download fails
```bash
# Pre-download manually via huggingface_hub
uv run python -c "
from huggingface_hub import hf_hub_download
path = hf_hub_download('google/gemma-4-E2B-it', 'gemma-4-E2B-it.litertlm')
print(path)
"
export MODEL_PATH=/path/shown/above
uv run server.py
```

### Microphone/camera not working in browser
- Open the page at `http://localhost:8000` or serve it over HTTPS. Browsers expose camera and microphone APIs only in secure contexts, so plain HTTP on a LAN IP address is blocked
- Check browser permissions: address bar → lock icon → reset permissions
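
The rule browsers apply can be approximated in a few lines: HTTPS is always allowed, and plain HTTP is allowed only for `localhost` and loopback addresses. The function below is a rough approximation of that secure-context check for diagnosing a URL, not a full implementation of the browser rule; `allows_media_capture` is a hypothetical name.

```python
import ipaddress
from urllib.parse import urlsplit


def allows_media_capture(url: str) -> bool:
    """Approximate whether a browser treats this URL as a secure context."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        return True
    host = parts.hostname or ""
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        # Loopback addresses such as 127.0.0.1 and ::1 also qualify
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False
```

For example, `allows_media_capture("http://192.168.1.20:8000")` is `False`, which is the case that typically breaks when opening Parlor from another device on the network.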

### TTS not loading on Linux
```bash
# Ensure ONNX runtime is installed
uv add onnxruntime
# Or for GPU:
uv add onnxruntime-gpu
```

### High latency or slow inference
- Verify GPU is being used: check for Metal (Mac) or CUDA (Linux) in startup logs
- Close other GPU-heavy applications
- On Linux, confirm CUDA drivers match installed `onnxruntime-gpu` version

### Port already in use
```bash
export PORT=8080
uv run server.py
# Or kill the existing process:
lsof -ti:8000 | xargs kill
```
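
Where `lsof` is not available, a port can be probed from Python with the standard library. This is a small sketch; `port_in_use` is a hypothetical helper, and it only detects listeners reachable on the given host.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0
```

Calling `port_in_use(8000)` before starting the server indicates whether to pick a different `PORT`.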

### `uv sync` fails — Python version mismatch
```bash
# Parlor requires Python 3.12+
python3 --version
# Install 3.12 (uv can manage Python versions itself), then pin it for the project:
uv python install 3.12
uv python pin 3.12
uv sync
```

## Dependencies (pyproject.toml)

Key packages installed by `uv sync`:
- `litert-lm` — Google AI Edge inference runtime for Gemma
- `fastapi` + `uvicorn` — async web/WebSocket server
- `kokoro` — Kokoro TTS ONNX backend
- `kokoro-mlx` — Kokoro TTS MLX backend (Mac only)
- `huggingface-hub` — model auto-download

Silero VAD (voice activity detection) is not installed by `uv sync`; it runs in the browser and is loaded from a CDN by `index.html`.

Source

Creator's repository · aradotso/trending-skills
