Understanding and predicting motion is a fundamental part of visual intelligence. Although modern video models exhibit a strong understanding of scene dynamics, exploring multiple possible futures through full video synthesis remains prohibitively inefficient. We model scene dynamics orders of magnitude more efficiently by operating directly on a long-term motion embedding learned from large-scale trajectories obtained from tracker models. This enables efficient generation of long, realistic motions that satisfy goals specified via text prompts or spatial pokes. To achieve this, we first learn a highly compressed motion embedding with a temporal compression factor of 64×. In this space, we train a conditional flow-matching model to generate motion latents conditioned on task descriptions. The resulting motion distributions outperform those of both state-of-the-art video models and specialized task-specific approaches.
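A minimal sketch of the conditional flow-matching objective described above, applied to a temporally compressed motion latent. All shapes, the linear "network", and the conditioning vector are hypothetical placeholders (the paper's model and its 64× compression ratio would operate on far larger sequences); only the interpolation path and the velocity regression target follow standard flow matching.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: with 64x temporal compression, e.g. 256 video
# frames would map to 4 latent timesteps.
T_latent, D = 4, 16
x1 = rng.normal(size=(T_latent, D))  # target motion latent (from a motion encoder)
x0 = rng.normal(size=(T_latent, D))  # Gaussian noise sample
cond = rng.normal(size=(D,))         # conditioning embedding (e.g. a text prompt)

t = rng.uniform()                    # random flow time in [0, 1]
x_t = (1.0 - t) * x0 + t * x1        # linear interpolation path
v_target = x1 - x0                   # flow-matching regression target (velocity)

# Toy stand-in for the network: a linear map of the noisy latent plus
# the condition and time; the actual model would be a sequence model
# over the latent trajectory.
W = rng.normal(size=(D, D)) * 0.01
v_pred = x_t @ W + cond + t

# Mean-squared velocity-matching loss, minimized during training.
loss = float(np.mean((v_pred - v_target) ** 2))
print(loss >= 0.0)
```

At sampling time, one would integrate the learned velocity field from noise (`t = 0`) to `t = 1` to produce a motion latent, then decode it back to trajectories.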
- †CompVis at LMU, Germany
- ‡ Munich Center for Machine Learning






