I am still not fully convinced you need more architectures. MLA compresses key-value pairs into a single latent, so why not compress them further along the time dimension? Maybe with PerceiverIO, or maybe, after having read enough tokens, you compress them into a few latents, decoder-only.
PerReg : New prediction method from Valeo. Claims SOTA on nuScenes, Argo, and Waymo.
Uses a PerceiverIO arch. Adds registers for scene-level information. Combines self-distillation and reconstruction losses. Pretraining on all datasets, then fine-tuning only the decoder, works best.
Maybe because of the architectural differences?
The base video models:
- infer all output tokens at once, instead of auto-regressively
- progressively iterate on the output instead of one-shotting it
Maybe a better text archi would be something like PerceiverIO diffusion
SceneDiffuser: neat idea from @Waymo using diffusion to simulate traffic: cast scene-related tasks as in-painting. Runs with PerceiverIO, transformer backbone, AdaLN denoising, L2 denoising loss.
Perks: easy to inject constraints, controllability, good results #CVPR2024 #DDADS
To compress our high-dimension input sequence of spike-level tokens, we use a PerceiverIO backbone to map a sequence of spikes to a sequence of behavior outputs.
Taking lessons from our DeFiNe (sites.google.com/view/tri-de…) work, we find that camera embeddings (a PerceiverIO-like architecture) allow for generalizable scale transfer. ZeroDepth has strong metric zero-shot results on KITTI and NYUv2, despite a large domain (and depth scale) gap.
Dieter Fox showed @mohito1905's "Perceiver-Actor" at #ICRA2023 today. It produces voxelspace action outputs from RGBD language input, using PerceiverIO for the backbone. I was thinking a 3D CNN would be a strong backbone too, but actually the exps show Perceiver winning by a lot
Very interesting work, similar to the idea behind PerceiverIO (sensiocoders.com/blog/079_pe…), where you can ask the features computed by a neural network for "things" using attention mechanisms (of course), but focused on image segmentation.
Today we're releasing the Segment Anything Model (SAM) — a step toward the first foundation model for image segmentation.
SAM is capable of one-click segmentation of any object from any photo or video, plus zero-shot transfer to other segmentation tasks ➡️ bit.ly/433YuBI
Our paper proposes a PerceiverIO-based approach and applies it to synthetic scenarios and audio-visual datasets.
A2MT is a challenging task with many practical applications, and we would be excited for the community to join our efforts!
6/17
To deal with high-dimensional inputs, we adapt PerceiverIO by @drew_jaegle et al. (DeepMind). PerceiverIO uses a small set of latent vectors that are cross-attended with the input. These latent vectors are randomly initialized and trained end-to-end.
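A minimal sketch of that latent cross-attention step, in plain Python with no ML library. The sizes, the helper names (`matmul`, `random_matrix`, `cross_attend`), and the single-head, no-residual layout are assumptions for illustration, not the paper's implementation. The weights here are random rather than trained; in a real model both the latents and the projections would be learned end-to-end.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.1):
    """Gaussian-initialized matrix (stand-in for learned parameters)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def cross_attend(latents, inputs, w_q, w_k, w_v):
    """Latents act as queries; the (long) input provides keys and values."""
    q = matmul(latents, w_q)              # (num_latents, d)
    k = matmul(inputs, w_k)               # (num_inputs, d)
    v = matmul(inputs, w_v)               # (num_inputs, d)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # (d, num_inputs)
    scores = matmul(q, k_t)               # (num_latents, num_inputs)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights    # output: (num_latents, d)


rng = random.Random(0)
num_inputs, input_dim = 50, 8     # hypothetical long input sequence
num_latents, latent_dim = 4, 6    # small latent array, independent of input length

inputs = random_matrix(num_inputs, input_dim, rng, scale=1.0)
latents = random_matrix(num_latents, latent_dim, rng)  # randomly initialized
w_q = random_matrix(latent_dim, latent_dim, rng)
w_k = random_matrix(input_dim, latent_dim, rng)
w_v = random_matrix(input_dim, latent_dim, rng)

updated, weights = cross_attend(latents, inputs, w_q, w_k, w_v)
print(len(updated), len(updated[0]))  # shape of the updated latents
```

The point of the design is visible in the shapes: the attention matrix is `num_latents × num_inputs`, so cost grows linearly with input length, and the output stays at `num_latents` rows however long the input is.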
Rewatched @Tesla's AI day recently, and when @karpathy introduced the Transformer used in AutoPilot, it immediately reminded me of @DeepMind's #PerceiverIO, which I recently contributed to @huggingface. Wonder whether Tesla's approach was inspired by it...
Ok now the new #PerceiverIO deserves my attention…pardon! My self-attention!
Let’s start from these wonderful Jupyter notebooks: MLM (masked language modeling), image and text classification just to begin!
github.com/NielsRogge/Transf…
DeepMind is doing a lot of interesting stuff this year, with AlphaFold, PerceiverIO, more recently with applying AI to pure math, and now Gopher! Looking forward to seeing what other advances DeepMind makes in the field of AI!
Our brain can take any kind of input: video, audio, text. So can we have one single architecture that can handle all these types of input? @DeepMind has come out with this new model: #PerceiverIO that can do that!