🤖 ✨ Following up on my earlier post from nearly a month ago about my 'Home Robot' with reactive obstacle-avoidance navigation, I am super excited to share my next technical milestone: a proactive voice and vision navigation assistant, fully integrated and running entirely on-device.
Goal: Autonomous home assistant 🤖 (say Alexa on wheels 😉)
Capabilities:
- Completely vision-based (the only sensor is an RGB-D camera): maps the whole area and stores landmarks used for navigation.
- Hands-free conversation: wake-word activation, then continuous follow-up dialogue with short-term memory; auto-sleep on silence. Voice-commanded autonomous navigation to landmarks.
- Multimodal scene understanding: "what do you see?" answered by an on-device VLM from the camera feed
- Real-time object detection via YOLOv8n
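The hands-free conversation flow above (wake word, follow-up dialogue with short-term memory, auto-sleep on silence) can be sketched as a tiny state machine. This is only an illustration, not the robot's actual code: the class name, the 8-second timeout, and the 4-turn memory are all assumptions.

```python
from collections import deque
from enum import Enum


class State(Enum):
    ASLEEP = "asleep"
    LISTENING = "listening"


class Conversation:
    """Hypothetical wake/sleep controller; timeout and memory size are assumed values."""

    def __init__(self, silence_timeout_s=8.0, memory_turns=4):
        self.state = State.ASLEEP
        self.silence_timeout_s = silence_timeout_s
        # Short-term memory: only the most recent turns are kept.
        self.history = deque(maxlen=memory_turns)
        self.last_heard = None

    def on_wake_word(self, now):
        """Wake-word detection opens a conversation."""
        self.state = State.LISTENING
        self.last_heard = now

    def on_utterance(self, text, now):
        """Accept follow-up speech only while awake; returns True if accepted."""
        if self.state is not State.LISTENING:
            return False
        self.history.append(text)
        self.last_heard = now
        return True

    def tick(self, now):
        """Call periodically; goes back to sleep after enough silence."""
        if (self.state is State.LISTENING
                and now - self.last_heard >= self.silence_timeout_s):
            self.state = State.ASLEEP
            self.history.clear()
```

Timestamps are passed in explicitly (for example from `time.monotonic()`), which keeps the logic easy to test without real audio.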
Tech Stack / Tools (100% on-device):
Master controller: @NVIDIAAI Jetson Orin Nano Super
Chassis: Waveshare Wave Rover 4WD
Audio: ReSpeaker XVF-3000 mic array (AEC/beamforming), Piper TTS
Speech: openWakeWord, whisper.cpp (CUDA / GPU accelerated)
Language/Vision: @GoogleAI Gemma 4 E2B multimodal via llama.cpp (GPU accelerated)
Perception: YOLOv8n on TensorRT (GPU accelerated)
Autonomy: @OpenRoboticsOrg ROS2 Humble, Nav2 (Planner: Smac Hybrid, Controller: RPP), RTAB-Map VSLAM (an EKF fuses RTAB-Map's visual odometry with dead-reckoning odometry)
Peak memory: ~6.7GB / 8GB with the full stack live
Key engineering takeaways:
- NVIDIA developer forums are great; they helped me solve a lot of issues really fast.
- It took me a while to figure out the right VSLAM approach for my vision-based sensor. The learnings here have been immense: RTAB-Map for an RGB-D camera, SLAM Toolbox for LiDAR-based nav, cuVSLAM for a stereo camera.
- On constrained edge hardware, the system-level bottleneck is memory and GPU contention, not model quality. Right-sizing the models (2B over 4B, a small STT model over a medium one) was what made concurrent operation stable.
- A tiered runtime, with a lightweight always-on tier and an on-demand navigation tier, was essential to make sure everything fit in 8GB.
- The bulk of the engineering effort was robustness: surviving audio glitches, boot-time race conditions, and resource pressure, not adding capabilities.
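To make the tiered-runtime idea concrete, here is a minimal budget check, assuming each component has a rough memory estimate. Everything in it is hypothetical: the component-to-tier split, the per-component MB figures (placeholders, not measurements from the robot), and the names `Tier`, `TIERS`, and `fits`.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    always_on: bool
    components: dict  # component name -> estimated MB (placeholder values)

    def total_mb(self):
        return sum(self.components.values())


# Illustrative numbers only; the real split and footprints are not given in the post.
TIERS = [
    Tier("always_on", True,
         {"wake_word": 150, "stt": 500, "tts": 150, "llm_vlm": 3000}),
    Tier("navigation", False,
         {"slam": 1500, "nav2": 700, "detector": 600}),
]


def fits(tiers, active_on_demand, budget_mb):
    """Return (fits, total_mb) for always-on tiers plus the named on-demand tiers."""
    total = sum(t.total_mb() for t in tiers
                if t.always_on or t.name in active_on_demand)
    return total <= budget_mb, total


if __name__ == "__main__":
    ok, total = fits(TIERS, {"navigation"}, budget_mb=8 * 1024)
    print(f"navigation active: {total} MB, fits={ok}")
```

Running a check like this before launching the on-demand tier is one way to refuse a launch that would push the device past its memory budget, rather than finding out from an out-of-memory kill.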
Here's 🥂 to my technical PoC: a robot that sees, hears, reasons, and moves, with nothing leaving the device.