Image/video models can transfer knowledge from Internet data to robot agents by generating goal images. But what happens when the generated images contain harmful visual artifacts? We present GHIL-Glue, a method to align image/video models and low-level policies.
ghil-glue.github.io/
We study the interplay between image/video foundation models and low-level robot control policies, and present GHIL-Glue, a simple method for aligning the two for robotic control.