I built VANTA, a neural target speaker extraction system that isolates one specific voice from even the messiest audio recordings.
Live demo: vanta.komalpreet.me
Code: github.com/Komalpreet2809/Va…
Current audio separation tools are a blunt instrument.
When dealing with audio files, users often struggle with:
• Overlapping voices
• Loud background chatter
• Complex room acoustics
• Standard noise cancellation tools that blindly suppress "noise"
• Systems that don't know who to focus on when multiple people are speaking
Most AI audio tools act like a black box, aggressively muffle everything, or leave the target speaker sounding metallic and robotic.
I wanted to build something different.
Vanta is an informed audio separator.
Instead of guessing what to suppress, it uses a 5-second reference clip of your target speaker to compute their voice fingerprint (a speaker embedding).
It then scans the messy mixture and extracts only that person, returning a crystal-clear track of their voice, plus a residue track of everything it removed.
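As a rough illustration (not the actual Vanta code), the residue track can be thought of as whatever is left after subtracting the extracted voice from the mixture. The sketch below assumes that definition. `split_target_and_residue`, `toy_extractor`, the 16 kHz sample rate, and the 192-dimensional fingerprint are all hypothetical stand-ins.

```python
import torch


def split_target_and_residue(mixture, extractor, fingerprint):
    """Return (target, residue) so that target + residue equals the mixture.

    Assumption: the residue is defined as mixture minus the extracted voice.
    """
    with torch.no_grad():
        target = extractor(mixture, fingerprint)
    residue = mixture - target
    return target, residue


def toy_extractor(mixture, fingerprint):
    # Hypothetical stand-in for a trained model: it just halves the signal.
    return 0.5 * mixture


mixture = torch.randn(1, 16000)    # 1 second of audio, assuming 16 kHz
fingerprint = torch.randn(1, 192)  # random placeholder for a speaker embedding
target, residue = split_target_and_residue(mixture, toy_extractor, fingerprint)
```

Under this definition the two tracks always sum back to the original recording, which is what makes the residue useful for checking what was removed.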
What it can do:
• Ingest a 5-second reference voice fingerprint
• Isolate the target speaker from highly noisy mixtures
• Mask out interfering voices (even at similar volumes)
• Preserve the natural phase of the audio (no STFT round trip, so no robotic artifacts)
• Generate a residue track of the removed noise/speakers
• Operate robustly across different simulated room environments
The main principle behind the project is:
More signal.
More informed extraction.
Zero metallic artifacts.
Less blind noise cancellation.
The tech stack:
• PyTorch for the core ML architecture and training
• Time-domain 1-D Convolutions to avoid spectrogram artifacts
• Frozen ECAPA-TDNN (VoxCeleb) for robust voice fingerprinting
• Temporal Convolutional Networks (TCN) with speaker conditioning
• FastAPI for the backend API
• Next.js and Tailwind CSS for the frontend shell
• Hugging Face Spaces & Vercel for deployment
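To make the model side of the stack concrete, here is a minimal PyTorch sketch of a time-domain extractor with speaker-conditioned TCN blocks. It is not the Vanta architecture: the class names, layer sizes, sigmoid mask, and the FiLM-style scale-and-shift conditioning are assumptions chosen for illustration, and the random 192-dimensional vector stands in for a frozen ECAPA-TDNN embedding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionedTCNBlock(nn.Module):
    """Dilated 1-D conv block that re-injects the speaker embedding (FiLM-style, assumed)."""

    def __init__(self, channels, emb_dim, dilation):
        super().__init__()
        self.film = nn.Linear(emb_dim, 2 * channels)
        # padding == dilation keeps the sequence length unchanged for kernel_size=3
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              dilation=dilation, padding=dilation)
        self.act = nn.PReLU()
        self.norm = nn.GroupNorm(1, channels)

    def forward(self, x, emb):
        # Turn the embedding into a per-channel scale and shift.
        scale, shift = self.film(emb).unsqueeze(-1).chunk(2, dim=1)
        h = x * (1 + scale) + shift
        h = self.norm(self.act(self.conv(h)))
        return x + h  # residual connection


class TinyExtractor(nn.Module):
    """Waveform in, waveform out: learned encoder, conditioned TCN, learned decoder."""

    def __init__(self, emb_dim=192, channels=64, kernel=16, n_blocks=4):
        super().__init__()
        stride = kernel // 2
        self.encoder = nn.Conv1d(1, channels, kernel, stride=stride, bias=False)
        self.blocks = nn.ModuleList(
            [ConditionedTCNBlock(channels, emb_dim, 2 ** i) for i in range(n_blocks)]
        )
        self.mask = nn.Conv1d(channels, channels, kernel_size=1)
        self.decoder = nn.ConvTranspose1d(channels, 1, kernel, stride=stride, bias=False)

    def forward(self, mixture, emb):
        # mixture: (batch, samples), emb: (batch, emb_dim)
        feats = self.encoder(mixture.unsqueeze(1))
        h = feats
        for block in self.blocks:
            h = block(h, emb)  # fingerprint is injected at every block
        masked = feats * torch.sigmoid(self.mask(h))
        out = self.decoder(masked).squeeze(1)
        # Zero-pad the tail if the strided encoder dropped a few samples.
        return F.pad(out, (0, mixture.shape[-1] - out.shape[-1]))


model = TinyExtractor()
mixture = torch.randn(2, 16000)  # two 1-second clips, assuming 16 kHz
fingerprint = torch.randn(2, 192)  # placeholder for ECAPA-TDNN embeddings
estimate = model(mixture, fingerprint)
```

The point of the sketch is the shape of the data flow: no spectrogram appears anywhere, and the embedding modulates the features inside each block rather than only at the input.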
One of the biggest goals is audio purity. Your isolated audio shouldn't sound like it's trapped in a tin can.
• Time-domain architecture: Operates directly on the raw audio waveform.
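• SI-SDR optimization: Trains on scale-invariant signal-to-distortion ratio, which rewards waveform fidelity and ignores overall volume differences.
• Continuous conditioning: The voice fingerprint is injected at every block, so the model never loses track of the target speaker.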
• Explainable separation: Outputs a separate residue track so you can hear exactly what was removed.
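For readers who haven't met SI-SDR, the metric behind the optimization bullet above, here is a minimal PyTorch sketch of the standard definition. The function name and the `eps` value are my own choices, not taken from the Vanta code. A training loss would typically be the negative of this value.

```python
import torch


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB, one value per clip.

    estimate, target: tensors of shape (batch, samples).
    """
    # Remove any DC offset so it does not count as signal or distortion.
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)

    # Project the estimate onto the target to find the best-matching scale.
    dot = (estimate * target).sum(dim=-1, keepdim=True)
    target_energy = target.pow(2).sum(dim=-1, keepdim=True)
    projection = (dot / (target_energy + eps)) * target

    # Everything not explained by the scaled target counts as distortion.
    distortion = estimate - projection
    ratio = projection.pow(2).sum(dim=-1) / (distortion.pow(2).sum(dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)
```

Because the target is rescaled before the comparison, making the estimate louder or quieter leaves the score essentially unchanged. Only the shape of the waveform matters.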
If you’re an ML engineer, audio researcher, developer, or someone who has felt the pain of noisy recordings and overlapping voices, I’d love your feedback, ideas, issues, PRs, or even just a star ⭐
#OpenSource #MachineLearning #DeepLearning #AudioAI #SpeechProcessing #PyTorch #FastAPI #NextJS #SpeechSeparation #TargetSpeakerExtraction