LA-Pose
tl;dr. Pretrain latent actions with an inverse dynamics model, then extract the camera pose from the latent action during fine-tuning.
Method
Latent Action Pretraining: Pretrain on driving scenes. A spatiotemporal Transformer is trained on next-frame prediction using an inverse and a forward dynamics model. Assume states s_t and s_{t+1} and a transition / action a_t between them.
- Inverse Dynamics: [s_t, s_{t+1}] -> a_t
- Forward Dynamics: [s_t, a_t] -> s_{t+1}
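To make the data flow of the inverse / forward dynamics setup concrete, here is a minimal toy sketch. It is not the paper's architecture: plain linear maps stand in for the spatiotemporal Transformer, states are flat feature vectors, and the names `STATE_DIM`, `ACTION_DIM`, `w_idm`, and `w_fdm` are hypothetical. The only point is that a_t is a narrow bottleneck that the forward model must use to predict s_{t+1}.

```python
import random

def linear(weights, x):
    """Dense layer without bias: returns weights @ x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def init_weights(rows, cols, rng, scale=0.1):
    """Small random weight matrix (toy stand-in for a trained network)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def inverse_dynamics(s_t, s_next, w_idm):
    """[s_t, s_{t+1}] -> a_t, a low-dimensional latent action."""
    return linear(w_idm, s_t + s_next)  # list concatenation of both states

def forward_dynamics(s_t, a_t, w_fdm):
    """[s_t, a_t] -> predicted s_{t+1}."""
    return linear(w_fdm, s_t + a_t)

def next_frame_loss(s_t, s_next, w_idm, w_fdm):
    """Mean squared error between the predicted and the true next state."""
    a_t = inverse_dynamics(s_t, s_next, w_idm)
    pred = forward_dynamics(s_t, a_t, w_fdm)
    return sum((p - q) ** 2 for p, q in zip(pred, s_next)) / len(s_next)

# Hypothetical toy sizes; the action is much smaller than the state.
STATE_DIM, ACTION_DIM = 16, 4
rng = random.Random(0)
w_idm = init_weights(ACTION_DIM, 2 * STATE_DIM, rng)
w_fdm = init_weights(STATE_DIM, STATE_DIM + ACTION_DIM, rng)

s_t = [rng.gauss(0, 1) for _ in range(STATE_DIM)]
s_next = [rng.gauss(0, 1) for _ in range(STATE_DIM)]

a_t = inverse_dynamics(s_t, s_next, w_idm)
pred_next = forward_dynamics(s_t, a_t, w_fdm)
loss = next_frame_loss(s_t, s_next, w_idm, w_fdm)
```

In actual training, both models would be optimized jointly to minimize this loss, so the only way to improve the prediction is to encode the transition in a_t.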
Camera Pose Post-training: Remove the forward dynamics model and add a head on a_t that decodes the relative camera pose between s_t and s_{t+1}.
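A rough sketch of what such a pose head could look like. The notes do not say how the paper parameterizes the pose, so this assumes a linear head that outputs 6 numbers (axis-angle rotation plus translation) and converts them to a 4x4 rigid transform via the Rodrigues formula. `pose_head`, `w_head`, and the identity weights in the demo are hypothetical.

```python
import math

def axis_angle_to_matrix(rx, ry, rz):
    """Rodrigues formula: axis-angle vector -> 3x3 rotation matrix."""
    theta = math.sqrt(rx * rx + ry * ry + rz * rz)
    if theta < 1e-12:
        # No rotation: return the identity.
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    x, y, z = rx / theta, ry / theta, rz / theta
    c, s = math.cos(theta), math.sin(theta)
    k = 1.0 - c
    return [
        [c + x * x * k, x * y * k - z * s, x * z * k + y * s],
        [y * x * k + z * s, c + y * y * k, y * z * k - x * s],
        [z * x * k - y * s, z * y * k + x * s, c + z * z * k],
    ]

def pose_head(a_t, w_head):
    """Latent action a_t -> 4x4 relative camera pose between s_t and s_{t+1}.

    Assumption: the head is a single linear layer with 6 outputs,
    (rx, ry, rz, tx, ty, tz).
    """
    out = [sum(w * v for w, v in zip(row, a_t)) for row in w_head]
    rot = axis_angle_to_matrix(out[0], out[1], out[2])
    tx, ty, tz = out[3], out[4], out[5]
    return [
        rot[0] + [tx],
        rot[1] + [ty],
        rot[2] + [tz],
        [0.0, 0.0, 0.0, 1.0],
    ]

# Demo with identity head weights, so a_t is read directly as the 6 pose values.
ACTION_DIM = 6
w_head = [[1.0 if i == j else 0.0 for j in range(ACTION_DIM)] for i in range(6)]
a_t = [0.0, 0.0, math.pi / 2, 1.0, 2.0, 3.0]  # 90 degree yaw about z, plus a translation
T_rel = pose_head(a_t, w_head)
```

Chaining the per-step `T_rel` matrices by matrix multiplication would give a camera trajectory over a clip.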
Results
- outperforms VGGT on driving scenes
- Ablation on latent action dimension (50 vs 1536): the higher dimension leads to worse downstream performance (“information leakage and weaker abstraction”)
Thoughts
- can this handle dynamic scenes out of the box, since it focuses on ego-motion rather than on finding correspondences?
- cool self-supervised pretraining alternative to something like E-RayZer.
- seems super promising for embodied AI
- I would like to see a more detailed analysis of the latent actions