You think I went quiet for no reason?
I built a speech synthesis server for Linux. From scratch. In Go. It's called voicego. SSIP-compatible, so Orca and spd-say just work — but that's the least interesting part.
Not "yet another TTS for screen readers." A network-native, security-first speech server for distributed and IoT setups that also happens to be a killer screen reader backend. In that order.
The Linux speech stack was architected decades ago. Single machine, local socket, zero thought given to "what if this goes over a network." It shows. So I rebuilt it on tools that didn't exist back then: Go, gRPC, TLS 1.3, actual multicore.
The architecture:
Engines aren't linked into the daemon. Each one (espeak-ng, RHVoice, piper) is a separate executable, launched via hashicorp/go-plugin and spoken to over gRPC. No more symbol collisions between espeak and espeak-ng exporting the same C names. Per-engine sample rates handled cleanly. And a segfault in one engine can't corrupt the server: it surfaces as 503 ERR ENGINE_DEAD and everything else keeps talking.
Real parallelism. Every (client, engine, voice) triple gets its own worker goroutine with a bounded multi-lane priority queue: IMPORTANT → MESSAGE → TEXT → NOTIFICATION → PROGRESS. High-priority speech preempts queued lower-priority text within a triple. Separate triples never block each other. No global lock choking the whole pipeline.
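A rough sketch of what a bounded multi-lane queue for one triple could look like. This is not voicego's actual code: the names LaneQueue, Push, Pop and ErrLaneFull are made up for illustration, and it only models queue ordering, not cancelling speech that is already playing.

```go
package main

import (
	"errors"
	"fmt"
)

// Priority lanes, highest first, in the order the post lists them.
type Priority int

const (
	Important Priority = iota
	Message
	Text
	Notification
	Progress
	numLanes // number of lanes, not a real priority
)

// ErrLaneFull is a hypothetical error for a lane that hit its bound.
var ErrLaneFull = errors.New("lane full")

// LaneQueue holds one bounded FIFO lane per priority (illustrative type).
type LaneQueue struct {
	lanes [numLanes]chan string
}

// NewLaneQueue creates a queue where every lane holds at most capPerLane items.
func NewLaneQueue(capPerLane int) *LaneQueue {
	q := &LaneQueue{}
	for i := range q.lanes {
		q.lanes[i] = make(chan string, capPerLane)
	}
	return q
}

// Push enqueues without blocking; a full lane is reported to the caller.
func (q *LaneQueue) Push(p Priority, text string) error {
	select {
	case q.lanes[p] <- text:
		return nil
	default:
		return ErrLaneFull
	}
}

// Pop returns the oldest item from the highest-priority non-empty lane.
// The bool is false when every lane is empty.
func (q *LaneQueue) Pop() (string, Priority, bool) {
	for p := range q.lanes {
		select {
		case s := <-q.lanes[p]:
			return s, Priority(p), true
		default:
		}
	}
	return "", 0, false
}

func main() {
	q := NewLaneQueue(8)
	_ = q.Push(Text, "reading paragraph")
	_ = q.Push(Notification, "battery low")
	_ = q.Push(Important, "incoming call")

	// Items come out by lane priority, not by arrival order.
	for {
		s, p, ok := q.Pop()
		if !ok {
			break
		}
		fmt.Println(p, s)
	}
}
```

One queue like this per (client, engine, voice) triple, drained by that triple's own goroutine, is one way to get the "separate triples never block each other" property: there is no shared structure to contend on.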
Now the part nobody thinks about: security. On a local Unix socket, fine. But distribute voice across devices and synthesis goes over the wire: you're shipping commands and sensitive text (spoken passwords, private messages) and getting PCM back. In most setups that's UNENCRYPTED. A reverse open mic.
voicego wraps any TCP transport in mutual TLS — 1.3 only, RequireAndVerifyClientCert. Cert-authed clients, encrypted traffic, rogue devices stay out. Load-bearing wall, not a checkbox.
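For the Go folks, the two settings named above map onto the standard crypto/tls package roughly like this. A minimal sketch, not the project's code: the function name ServerTLSConfig and its PEM-bytes parameters are my assumptions here, and how voicego actually loads its certificates may differ.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// ServerTLSConfig builds a TLS 1.3-only config that demands a client
// certificate signed by one of the CAs in clientCAPEM.
// (Hypothetical helper; inputs are PEM-encoded bytes.)
func ServerTLSConfig(certPEM, keyPEM, clientCAPEM []byte) (*tls.Config, error) {
	cert, err := tls.X509KeyPair(certPEM, keyPEM)
	if err != nil {
		return nil, fmt.Errorf("load server key pair: %w", err)
	}

	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(clientCAPEM) {
		return nil, errors.New("no client CA certificates found")
	}

	return &tls.Config{
		MinVersion:   tls.VersionTLS13, // refuse TLS 1.2 and older
		Certificates: []tls.Certificate{cert},
		ClientCAs:    pool,
		ClientAuth:   tls.RequireAndVerifyClientCert, // the mutual part
	}, nil
}

func main() {
	// Placeholder input: real PEM bytes would come from files on disk.
	if _, err := ServerTLSConfig(nil, nil, nil); err != nil {
		fmt.Println("config rejected:", err)
	}
}
```

A listener would then be created with tls.Listen("tcp", addr, cfg). A client that presents no certificate, or one signed by an unknown CA, fails the handshake before any speech command is read.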
Audio sinks are pluggable: null for CI, alsa for compat, or a direct JACK client that connects straight to system playback ports and resamples each PCM chunk via libsamplerate. First-chunk latency target ≤50ms, synth-start ≤30ms p99 on short utterances.
There's also an optional binary framing mode on top of text SSIP, auto-detected per connection — no flag dance, it just negotiates.
Where it's going: IoT and the smart home. Heavy synth on a beefy node, a kitchen speaker or hallway box does the talking — each mTLS-authed, encrypted, process-isolated. A voice layer across your whole home. Without that foundation it's a sieve.
Honesty: NOT stable. First release. It runs, it talks, it needs you to break it. Safety net — the Orca add-on auto-falls back to speech-dispatcher if voicego is unreachable. Nobody's left in silence. Non-negotiable.
So break it. Screen reader on Linux? Run it, tell me what exploded. Write Go? Tear the architecture apart. Write your own engine plugin against the gRPC interface — that's the whole point.
Why I care: I'm a blind engineer. I live inside a screen reader. I wanted synth that doesn't choke or die on one bad component. But once I got serious, the real problem wasn't "a tool for me" — it was a secure networked speech server where the screen reader is just one client.
Start from the real constraint. Design for where it's going. Ship it open.
AGPL. Come break it:
github.com/Ravino/voicego
#Go #Golang #gRPC #OpenSource #a11y #Linux #IoT #InfoSec #DistributedSystems #SystemsProgramming