LA-Pose

May 13, 2026 · Paper

tl;dr. latent action pretraining with inverse dynamics, then extract camera pose from the latent action during fine-tuning

Method

Latent Action Pretraining: Use driving scenes. A spatiotemporal Transformer is trained on next-frame prediction with an inverse and a forward dynamics model. Assume states s_t and s_{t+1}, and a transition / action a_t between them.
Inverse Dynamics: [s_t, s_{t+1}] -> a_t
Forward Dynamics: [s_t, a_t] -> s_{t+1}
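
To make the two mappings concrete, here is a minimal sketch of the pretraining loop's data flow. It is not the paper's architecture: `ToyLatentActionModel`, the plain linear layers, the mean-squared-error loss, and the dimensions are all assumptions chosen for illustration, and states are flat vectors rather than video frames.

```python
import random

def linear(weights, x):
    """Dense layer without bias; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def init(rows, cols, rng):
    """Small random weight matrix (illustrative initialisation only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class ToyLatentActionModel:
    """Hypothetical linear stand-in for the spatiotemporal Transformer."""

    def __init__(self, state_dim, action_dim, seed=0):
        rng = random.Random(seed)
        # Inverse dynamics: [s_t, s_{t+1}] -> a_t
        self.w_inv = init(action_dim, 2 * state_dim, rng)
        # Forward dynamics: [s_t, a_t] -> s_{t+1}
        self.w_fwd = init(state_dim, state_dim + action_dim, rng)

    def inverse_dynamics(self, s_t, s_next):
        return linear(self.w_inv, s_t + s_next)

    def forward_dynamics(self, s_t, a_t):
        return linear(self.w_fwd, s_t + a_t)

    def next_frame_loss(self, s_t, s_next):
        """Next-frame prediction error, routed through the latent action."""
        a_t = self.inverse_dynamics(s_t, s_next)
        pred = self.forward_dynamics(s_t, a_t)
        return sum((p - q) ** 2 for p, q in zip(pred, s_next)) / len(s_next)

model = ToyLatentActionModel(state_dim=8, action_dim=2)
s_t = [0.1 * i for i in range(8)]
s_next = [0.1 * i + 0.05 for i in range(8)]
a_t = model.inverse_dynamics(s_t, s_next)
loss = model.next_frame_loss(s_t, s_next)
```

The only supervision is s_{t+1} itself, so the latent a_t has to carry whatever the forward model cannot infer from s_t alone.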

Camera Pose Posttraining: Remove the Forward Dynamics model and add a head on a_t that decodes the relative camera pose between s_t and s_{t+1}.
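
A rough sketch of what such a head could look like follows. The notes do not say how the pose is parameterised, so everything here is an assumption: `pose_head` is a hypothetical linear layer, the output is taken to be an axis-angle rotation plus a translation, and `to_se3` converts that into a 4x4 relative transform with Rodrigues' formula.

```python
import math

def pose_head(a_t, weights):
    """Hypothetical linear head: latent action -> [rx, ry, rz, tx, ty, tz]."""
    return [sum(w * v for w, v in zip(row, a_t)) for row in weights]

def to_se3(pose6):
    """Turn an axis-angle rotation plus translation into a 4x4 rigid transform."""
    rx, ry, rz, tx, ty, tz = pose6
    theta = math.sqrt(rx * rx + ry * ry + rz * rz)
    if theta < 1e-12:
        # No rotation: fall back to the identity to avoid dividing by zero.
        rot = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    else:
        kx, ky, kz = rx / theta, ry / theta, rz / theta
        c, s = math.cos(theta), math.sin(theta)
        v = 1.0 - c
        # Rodrigues' rotation formula, written out entry by entry.
        rot = [
            [c + kx * kx * v, kx * ky * v - kz * s, kx * kz * v + ky * s],
            [ky * kx * v + kz * s, c + ky * ky * v, ky * kz * v - kx * s],
            [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
        ]
    return [rot[0] + [tx], rot[1] + [ty], rot[2] + [tz], [0.0, 0.0, 0.0, 1.0]]

# Toy 2-D latent (assumed axes): entry 0 drives rotation about z, entry 1 drives motion along x.
weights = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
a_t = [math.pi / 2, 2.0]
T = to_se3(pose_head(a_t, weights))
```

In this toy setup the latent already has a clean geometric meaning; in practice the head would have to learn that mapping from pose labels during posttraining.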

Results

outperforms VGGT on driving scenes
Ablation on Latent Action Dimension: 50 vs 1536. The higher dimension leads to worse downstream performance (“information leakage and weaker abstraction”)
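
One way to read the "information leakage" remark, sketched as a toy below. This is an interpretation, not the paper's analysis, and all function names are hypothetical. If the latent is as wide as the state, the inverse model can copy s_{t+1} into a_t and the forward model can return it unchanged, reaching zero loss without encoding any motion. A narrow latent rules out that shortcut and forces a_t to describe the transition.

```python
def mse(x, y):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(x, y)) / len(x)

# Wide latent (same size as the state): the degenerate "copy" solution.
def copy_inverse_dynamics(s_t, s_next):
    return list(s_next)          # ignores s_t, leaks the whole next frame

def copy_forward_dynamics(s_t, a_t):
    return list(a_t)             # ignores s_t, just echoes the latent

# Narrow latent (1-D): only a summary of the change fits.
def shift_inverse_dynamics(s_t, s_next):
    return [sum(q - p for p, q in zip(s_t, s_next)) / len(s_t)]

def shift_forward_dynamics(s_t, a_t):
    return [p + a_t[0] for p in s_t]

s_t = [0.0, 1.0, 2.0, 3.0]
s_next = [0.5, 1.5, 2.5, 3.5]

wide_latent = copy_inverse_dynamics(s_t, s_next)
wide_loss = mse(copy_forward_dynamics(s_t, wide_latent), s_next)

narrow_latent = shift_inverse_dynamics(s_t, s_next)
narrow_loss = mse(shift_forward_dynamics(s_t, narrow_latent), s_next)
```

Both variants reconstruct this toy transition perfectly, but only the narrow latent is a compact description of the motion, which is the kind of signal a pose head can use.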

Thoughts

can this handle dynamic scenes out of the box, since it focuses on ego-motion rather than on finding correspondences?
cool alternative self-supervised pretraining to something like E-RayZer.
seems super promising for embodied AI
I would like to see a more detailed analysis of the latent actions