📢SAM4D: Segment Anything in Camera and LiDAR Streams
SAM4D introduces a 4D foundation model for promptable segmentation across camera and LiDAR streams, addressing the limitations of frame-centric and modality-isolated approaches in autonomous driving.
Key Highlights:
✅Promptable Multi-modal Segmentation (PMS) – Enables interactive segmentation across sequences from both modalities using diverse prompts (points, boxes, masks), allowing cross-modal propagation and long-term object tracking.
✅Unified Multi-modal Positional Encoding (UMPE) – Aligns image and LiDAR features in a shared 3D space using sinusoidal and MLP-based encoding for seamless cross-modal interaction while preserving modality-specific structure.
✅Motion-aware Cross-modal Memory Attention (MCMA) – Incorporates ego-motion compensation into memory attention, enabling temporally consistent retrieval and robust segmentation in dynamic scenes.
✅Multi-modal Architecture – Builds on SAM2, using Hiera for image encoding and MinkUNet (via TorchSparse) for encoding voxelized LiDAR point clouds, allowing efficient 2D-3D joint segmentation.
✅Efficient Prompt Handling – Supports point, box, and mask prompts from either modality, using a unified decoder to produce temporally consistent masks across the stream.
✅Waymo-4DSeg Dataset – A large-scale pseudo-labeled dataset containing 15M image masks, 30M LiDAR masks, and 300k cross-modal masklets, generated via VFM segmentation, 4D LiDAR reconstruction, and ray casting.
✅Cross-Modal Label Fusion Pipeline – Builds dense pixel-to-voxel mappings, filters noisy masklets using DBSCAN clustering, and merges multi-view data into high-quality voxel masklets.
✅Cross-Dataset Generalization – Demonstrates strong zero-shot and fine-tuned performance on nuScenes, validating robust transferability across sensor configurations and environments.
✅Quantitative Performance – Achieves 69.8% mIoU on images and 55.7% on LiDAR with 80.1% J&F, significantly outperforming single-modality and projection-based baselines.
✅Scalable & Efficient Design – A 119.88M-parameter model optimized with memory banks, FIFO queues, and prompt imitation logic for high-throughput 4D segmentation.
✅Future-Proof Foundation – Roadmap includes natural language prompting via LLMs, multi-sensor scaling, weak/self-supervised learning, and improved memory and compute efficiency.
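
To make the MCMA and FIFO memory ideas above more concrete, here is a minimal sketch in plain Python. It is not taken from the SAM4D code: the names `MemoryBank`, `yaw_pose`, `to_world`, and `to_ego` are hypothetical, poses are assumed to be ego-to-world rigid transforms, and raw 3D points stand in for the feature positions a real model would store. It only illustrates how a bounded FIFO memory can re-express past positions in the current ego frame before attention looks them up.

```python
import math
from collections import deque


def yaw_pose(yaw, tx, ty, tz=0.0):
    """Build an assumed ego-to-world pose (R, t) from a yaw angle and a translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    R = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    return R, [tx, ty, tz]


def to_world(pose, p):
    """Map a point from an ego frame into the world frame: R @ p + t."""
    R, t = pose
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]


def to_ego(pose, p_world):
    """Map a world point into an ego frame: R^T @ (p - t)."""
    R, t = pose
    d = [p_world[i] - t[i] for i in range(3)]
    # The inverse of a rotation matrix is its transpose.
    return [sum(R[j][i] * d[j] for j in range(3)) for i in range(3)]


class MemoryBank:
    """Hypothetical FIFO memory: keeps only the most recent `max_frames` entries."""

    def __init__(self, max_frames):
        # deque with maxlen drops the oldest entry automatically.
        self.frames = deque(maxlen=max_frames)

    def add(self, pose, points):
        self.frames.append((pose, points))

    def compensated(self, current_pose):
        """Return stored points re-expressed in the current ego frame."""
        return [
            [to_ego(current_pose, to_world(pose, p)) for p in points]
            for pose, points in self.frames
        ]


# A static object at world position (10, 0, 0), seen while the ego vehicle drives forward.
bank = MemoryBank(max_frames=2)
bank.add(yaw_pose(0.0, 0.0, 0.0), [[10.0, 0.0, 0.0]])
bank.add(yaw_pose(0.0, 2.0, 0.0), [[8.0, 0.0, 0.0]])
aligned = bank.compensated(yaw_pose(0.0, 4.0, 0.0))
```

After compensation, both stored observations of the static object land on the same position in the current ego frame, which is the property that makes memory retrieval consistent over time while the vehicle moves.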
➡️Project:
SAM4D-Project.github.io
➡️GitHub Repo:
github.com/CN-ADLab/SAM4D
➡️LearnOpenCV blog post:
learnopencv.com/sam-2/
#SegmentAnything #SAM4D #LiDAR #Camera #4DPerception #AutonomousDriving #MultiModal #PromptableSegmentation