LagerNVS

· Paper

Hypothesis & Idea:

Novel View Synthesis (NVS) has long been approached through per-scene optimization (eg NeRF, 3DGS): Images -> Encoder(Optimization) -> Render. Recently, people have started to use learned feed-forward reconstuction models: Images -> Encoder(Learned) -> Render. The next, pioneered by [LVSM], is to not only learn the scene encoding but also the rendering end-to-end. Authors explore this idea and architectural choices for it.

Method

The main architectural contribution is the following: LagerNVS Architecture They follow the encoder-decoder paradigm. Instead of having a “bottleneck” that restricts the size of the latent representation, they use a “highway” latent representation, which size corresponds to the number of input images.

Results

  1. highway > bottleneck
  2. VGGT pretrained weights >> Dino pretrained weights > random init.
  3. LagerNVS >> Feed-forward 3DGS methods

Thoughts

  • I think that the results that highway > bottleneck is expected as is just increases the capacity of the latent scene representation.
  • The fact that VGGT pretrained weights >> Dino pretrained weights shows that some understanding of 3D is beneficial (and that VGGT has a consistent latent representation)
  • Most interesting and somewhat surprising is that LagerNVS is way better than feed-forward 3DGS methods. The authors say that this is in part because the neural approach can fill simple unseen regions, which 3DGS cannot by design.