LA-Pose

· Paper

tl;dr. latent action pretraining via inverse dynamics + extract the camera pose from the latent action during fine-tuning

Method

Latent Action Pretraining: Uses driving scenes. A spatiotemporal Transformer is trained on next-frame prediction with an inverse and a forward dynamics model. Assume states s_t and s_{t+1}, and a transition / action a_t between them:

  • Inverse Dynamics: [s_t, s_{t+1}] -> a_t
  • Forward Dynamics: [s_t, a_t] -> s_{t+1}
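
A minimal sketch of the inverse / forward dynamics loop, to make the shapes concrete. This is not the paper's implementation: the linear maps stand in for the spatiotemporal Transformer, the MSE loss is an assumption, and all names and sizes (STATE_DIM, ACTION_DIM, W_inv, W_fwd) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM = 64   # hypothetical size of a flattened frame feature
ACTION_DIM = 8   # hypothetical latent action size (the bottleneck)

# Linear maps standing in for the Transformer blocks (assumption).
W_inv = rng.normal(scale=0.1, size=(2 * STATE_DIM, ACTION_DIM))
W_fwd = rng.normal(scale=0.1, size=(STATE_DIM + ACTION_DIM, STATE_DIM))


def inverse_dynamics(s_t, s_next):
    """Inverse dynamics: [s_t, s_{t+1}] -> a_t."""
    return np.concatenate([s_t, s_next], axis=-1) @ W_inv


def forward_dynamics(s_t, a_t):
    """Forward dynamics: [s_t, a_t] -> predicted s_{t+1}."""
    return np.concatenate([s_t, a_t], axis=-1) @ W_fwd


def next_frame_loss(s_t, s_next):
    """Next-frame prediction error routed through the latent action."""
    a_t = inverse_dynamics(s_t, s_next)
    s_pred = forward_dynamics(s_t, a_t)
    # MSE is an assumed choice of reconstruction loss.
    return float(np.mean((s_pred - s_next) ** 2))


# A batch of 4 random "frames" as stand-in data.
s_t = rng.normal(size=(4, STATE_DIM))
s_next = rng.normal(size=(4, STATE_DIM))
a_t = inverse_dynamics(s_t, s_next)
loss = next_frame_loss(s_t, s_next)
```

Because s_{t+1} can only reach the forward model through a_t, the width of a_t controls how much of the next frame can be copied through instead of being summarized as a transition.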

Camera Pose Posttraining: Remove the Forward Dynamics model and add a head on a_t that decodes the relative camera pose between s_t and s_{t+1}.
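
A rough sketch of what such a pose head could look like. The notes do not say how the pose is parameterized, so the 6-DoF output (3 translation + 3 axis-angle rotation), the single linear layer, and the names pose_head, W_pose and b_pose are all assumptions for illustration.

```python
import numpy as np


def axis_angle_to_matrix(r):
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    theta = np.linalg.norm(r)
    if theta < 1e-8:
        return np.eye(3)
    k = r / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)


def pose_head(a_t, W_pose, b_pose):
    """Decode a latent action (ACTION_DIM,) into a 4x4 relative pose.

    Assumed layout of the 6 outputs: [tx, ty, tz, rx, ry, rz].
    """
    out = a_t @ W_pose + b_pose
    T = np.eye(4)
    T[:3, :3] = axis_angle_to_matrix(out[3:])
    T[:3, 3] = out[:3]
    return T


rng = np.random.default_rng(0)
ACTION_DIM = 8  # hypothetical latent action size
W_pose = rng.normal(scale=0.1, size=(ACTION_DIM, 6))
b_pose = np.zeros(6)

a_t = rng.normal(size=ACTION_DIM)   # stand-in for an inverse dynamics output
T_rel = pose_head(a_t, W_pose, b_pose)
```

In this sketch only the head is new; the inverse dynamics model that produces a_t is reused from pretraining.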

Results

  • outperforms VGGT on driving scenes
  • Ablation on latent action dimension (50 vs. 1536): the higher dimension leads to worse downstream performance (“information leakage and weaker abstraction”)

Thoughts

  • can this handle dynamic scenes out of the box, since it focuses on ego-motion rather than on finding correspondences?
  • cool self-supervised pretraining alternative to something like E-RayZer.
  • seems super promising for embodied AI
  • I would like to see a more detailed analysis of the latent actions