📢 Segment Any Motion in Videos: fine-grained video object segmentation, without flow supervision or manual annotations during inference.
By integrating long-range motion trajectories, DINO-based semantics, and SAM2 prompting, SAMotion produces a segmentation mask for each moving object, even in complex, real-world scenes.
Key Highlights:
✅ Spatio-Temporal Trajectory Attention (ST-ATT) – Encodes long-range motion by alternating spatial attention (across trajectories) and temporal attention (along each trajectory), capturing both global inter-object relationships and local motion evolution.
✅ Motion-Semantic Decoupled Embedding (MSDE) – Separates motion and semantic reasoning in the decoder: motion-only attention is followed by DINO-based semantic augmentation through cross-attention, ensuring semantic cues refine but do not dominate motion prediction.
✅ BootsTAP-Based Track Generation – Leverages high-confidence 2D trajectories from BootsTAP with visibility and confidence filtering, enriching motion cues with depth and frame-to-frame deltas (Δu, Δv, Δd) for enhanced temporal modeling.
✅ Frequency-Based Positional Encoding (PE) – Adopts NeRF-style sinusoidal embeddings on spatial and temporal signals to avoid oversmoothing and preserve fine-grained motion localization across trajectories.
✅ Depth-Enhanced Motion Encoding – Incorporates monocular depth estimates from Depth-Anything to model scene structure and occlusions, enabling better segmentation under 3D layout variations and partial visibility.
✅ Two-Stage SAM2 Prompting –
1. Groups tracks per object (spatial/frame heuristics).
2. Prompts SAM2 with long-range point prompts and merges fragmented masks.
✅ Fine-Grained Instance-Level Masks – Handles multiple similarly-moving objects, complex articulation, clothing, limbs, etc.
✅ Superior Benchmark Results – Outperforms state-of-the-art MOS and fine-grained MOS baselines (e.g., RCF, ABR, OCLR) across DAVIS17, SegTrackv2, and FBMS59:
DAVIS17-Moving (Fine-grained MOS): J=77.4, F=83.6
DAVIS16-Moving (MOS): J=89.0, F=89.2
✅ Robust in Challenging Conditions – Demonstrates resilience to:
Camouflage textures and motion blur
Transparent surfaces and reflections
Strong camera motion and partial occlusion
✅ Ablation-Backed Architecture – Removing DINO, MSDE, or ST-ATT leads to significant drops (up to 17% in J&F), confirming the necessity of decoupled semantic integration and spatio-temporal modeling.
✅ Modular & Data-Efficient Training – Trained on a mix of synthetic (Kubric, DynamicReplica) and real-world (HOI4D) datasets, showing generalization across scene types without needing dense motion annotations at inference.
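To make the track-generation and positional-encoding highlights above more concrete, here is a minimal NumPy sketch of how per-track motion features could be assembled. It is an illustration under stated assumptions, not the authors' code: the function names (`motion_features`, `frequency_encode`), the thresholds (`min_conf`, `min_valid_ratio`), the feature layout, and the number of frequencies are all hypothetical, and coordinates are assumed to be normalized beforehand.

```python
import numpy as np

def motion_features(u, v, d, visible, conf, min_conf=0.5, min_valid_ratio=0.3):
    """Build (u, v, d, du, dv, dd) features per track and frame.

    u, v, d, conf: float arrays of shape (N tracks, T frames).
    visible: bool array of shape (N, T).
    Thresholds are illustrative assumptions, not values from the paper.
    """
    # A point counts as valid only if it is visible and confidently tracked.
    valid = visible & (conf >= min_conf)
    # Drop tracks that are valid in too few frames.
    keep = valid.mean(axis=1) >= min_valid_ratio
    u, v, d, valid = u[keep], v[keep], d[keep], valid[keep]

    def delta(x):
        # Frame-to-frame difference; the first frame gets a zero delta.
        return np.diff(x, axis=1, prepend=x[:, :1])

    feats = np.stack([u, v, d, delta(u), delta(v), delta(d)], axis=-1)
    # A delta is meaningful only if both of its endpoints are valid.
    prev_valid = np.concatenate([valid[:, :1], valid[:, :-1]], axis=1)
    feats[..., 3:] *= (valid & prev_valid)[..., None]
    return feats, valid, keep

def frequency_encode(x, num_freqs=4):
    """NeRF-style encoding: append sin/cos(2^k * pi * x) for k < num_freqs."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    angles = x[..., None] * freqs                      # (..., C, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    enc = enc.reshape(*x.shape[:-1], -1)               # (..., C * 2F)
    return np.concatenate([x, enc], axis=-1)
```

With 6 input channels and 4 frequencies, each point ends up with 6 + 6 × 8 = 54 features. The sinusoidal expansion gives the network access to high-frequency variation in the raw coordinates, which is the usual motivation for this kind of encoding.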
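The ST-ATT highlight above describes alternating attention across trajectories and along each trajectory. A rough NumPy sketch of one such block follows. It assumes single-head attention with plain residual connections and randomly initialized weights; layer normalization, MLPs, multi-head splitting, and masking of invalid points are omitted, and the names `st_block` and `init_params` are hypothetical rather than taken from the released code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head self-attention over the second-to-last axis of x (..., L, D)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def init_params(dim, seed=0):
    """Random projection weights, for illustration only."""
    rng = np.random.default_rng(seed)
    make = lambda: [rng.normal(size=(dim, dim)) * 0.1 for _ in range(3)]
    return {"spatial": make(), "temporal": make()}

def st_block(x, params):
    """One spatial step followed by one temporal step. x: (N tracks, T frames, D)."""
    # Spatial attention: within each frame, tracks attend to one another,
    # so the sequence axis must be N.
    xs = np.swapaxes(x, 0, 1)                           # (T, N, D)
    xs = xs + self_attention(xs, *params["spatial"])
    x = np.swapaxes(xs, 0, 1)                           # (N, T, D)
    # Temporal attention: within each track, frames attend to one another,
    # so the sequence axis is T.
    return x + self_attention(x, *params["temporal"])
```

Stacking several `st_block` calls lets information flow both between objects in the same frame and along each trajectory over time. Because the spatial step has no notion of track order in this sketch, reordering the input tracks simply reorders the output rows.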
Paper:
lnkd.in/giH-YuFr
GitHub:
lnkd.in/gquJ_TwP
Project:
lnkd.in/gxmiJ6q9
Related articles from LearnOpenCV:
SAM2:
lnkd.in/gkG7dx65
MedSAM2:
lnkd.in/gg78Pri3
#SAM2 #Segmentation #SegmentAnything