DLP decomposes scenes to particles with several attributes (keypoints, bounding-boxes, masks), fully unsupervised.
These act as visual ātokens,ā making cross-modal long-horizon reasoning (vision ā language) far more natural than the standard pixel patches.
3/n