LVSM

November 8, 2025 · Paper

Goal

scalable and generalizable NVS (from sparse-view). They want to avoid/minimize 3D inductive biases:
(a) representation-level bias: using NeRF or 3DGS as an intermediate representation introduces bias not only in the structure, but also in the rendering function.
Images -> scene representation -> specific rendering model.
(b) architecture-level bias using specialized architectures like CNNs, cost volumes etc. LVSM instead just uses the transformer architecture and MLPs.

Method

Introduce two methods, one that has a latent scene representation, and one that removes the concept of it in general:

encoder-decoder: encoder is a reconstructor, latent tokens are the scene representation, decoder is a neural renderer \
decoder-only: patchify images + target view plücker rays, decoder only to synthesize target view

Results

Overall, the decoder-only architecture has better quality and scales better. However, when it comes to pure rendering speed of novel views, encoder-decoder is faster as it needs to only calculate the latent scene representation once from which it can render views.

Ablation: MLP Tokenizer > CNN-based tokenizer.

Thoughts

Cool to see that even on low-gpu budget they can train competitive models.