Many of the eCommerce product showcase videos you now see on Instagram and TikTok are AI-generated. They are usually easy to spot: the hands are mangled, the faces drift between frames, or the person in the video interacts with the product in physically implausible ways (a hand passing through it, for example). This is because diffusion transformers (DiTs) are trained primarily on RGB representations, so they have to infer human-object boundaries from appearance alone.
A new paper from Alibaba introduces CoInteract, which trains a DiT on two parallel representations of the same scene: a standard RGB video stream and an auxiliary Human-Object Interaction (HOI) stream that strips away texture from the human representation while preserving body structure and interaction geometry. The two streams are trained jointly through a shared backbone with co-attention, allowing the model to learn physical interaction priors during training. The HOI stream is discarded at inference, so most of the quality gains come with almost no additional generation cost.
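To make the "train with two streams, generate with one" idea concrete, here is a toy sketch of the data flow only. It is not the paper's implementation: there are no learned weights and no training loop, and the names `attention` and `shared_block` are made up for illustration. It assumes that co-attention can be approximated as ordinary self-attention over the concatenated RGB and HOI tokens.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Plain scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out

def shared_block(rgb_tokens, hoi_tokens=None):
    """Hypothetical shared block: one function serves both streams.

    Training: RGB and HOI tokens are concatenated so each stream attends to the other.
    Inference: hoi_tokens is None, so only the RGB sequence is processed.
    """
    tokens = rgb_tokens + (hoi_tokens or [])
    mixed = attention(tokens, tokens, tokens)
    n = len(rgb_tokens)
    return mixed[:n], mixed[n:]

rgb = [[1.0, 0.0], [0.0, 1.0]]   # toy RGB tokens
hoi = [[1.0, 1.0]]               # toy HOI token (texture-free structure)

train_rgb, train_hoi = shared_block(rgb, hoi)   # both streams present
infer_rgb, infer_hoi = shared_block(rgb)        # HOI stream dropped
```

In this toy version the inference call processes only the RGB tokens, so the sequence length (and with it the attention cost) is the same as for a plain RGB model. In the real model, whatever the HOI stream taught the network would live in the shared weights, which this sketch does not model.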
The framework also introduces face- and hand-specific experts in a mixture-of-experts (MoE) architecture that routes tokens via a "spatially supervised router." The training pipeline relies heavily on off-the-shelf tools for object segmentation, human mesh recovery, and hand/face detection to create human- and object-masked streams and to encode both into a shared latent representation. The model was trained on 12k clips with "RGB–HOI representations, hand/face bounding boxes, and silhouette masks." According to the paper, CoInteract outperformed the other state-of-the-art models it was compared against on nearly every benchmark.
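One plausible reading of "spatially supervised router" is that the hand/face bounding boxes are turned into per-token routing labels, and the router is penalized when it sends a token to the wrong expert. The sketch below illustrates that reading only; it is an assumption, not the paper's method. The expert names, the box format (token-grid coordinates, end-exclusive), and the functions `routing_targets` and `router_loss` are all hypothetical.

```python
import math

EXPERTS = ("base", "face", "hand")  # hypothetical expert set; index = routing label

def routing_targets(grid_h, grid_w, face_boxes, hand_boxes):
    """Label every token in a grid_h x grid_w grid with an expert index.

    Boxes are (x0, y0, x1, y1) in token coordinates, end-exclusive.
    Tokens outside every box go to the base expert (0).
    Hand boxes are applied last, so they win where boxes overlap.
    """
    targets = [[0] * grid_w for _ in range(grid_h)]
    for label, boxes in ((1, face_boxes), (2, hand_boxes)):
        for x0, y0, x1, y1 in boxes:
            for y in range(max(0, y0), min(grid_h, y1)):
                for x in range(max(0, x0), min(grid_w, x1)):
                    targets[y][x] = label
    return targets

def router_loss(logits, targets):
    """Mean cross-entropy between per-token router logits and box-derived labels."""
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, targets):
        for z, t in zip(row_logits, row_targets):
            m = max(z)
            log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
            total += log_sum_exp - z[t]
            count += 1
    return total / count

# Toy 4x4 token grid: face in the top-left corner, hand in the bottom-right.
targets = routing_targets(4, 4, face_boxes=[(0, 0, 2, 2)], hand_boxes=[(2, 2, 4, 4)])

# An untrained router with identical logits for every expert.
uniform_logits = [[[0.0, 0.0, 0.0] for _ in range(4)] for _ in range(4)]
loss = router_loss(uniform_logits, targets)
```

A loss like this would be added to the main diffusion objective so that the router learns to send hand and face regions to their dedicated experts rather than discovering the split on its own.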
Paper linked below.