True Self-Supervised Novel View Synthesis is Transferable

· Paper

tl;dr: a camera trajectory extracted from one scene should render the same trajectory in another scene.

Claim: The key criterion for determining whether a model is capable of Novel View Synthesis (NVS) is transferability of poses: given a camera trajectory extracted from one scene, the model should render the same trajectory in a different scene. Current self-supervised methods like RayZer do not produce transferable poses. The authors hypothesize that these methods instead use the latent pose to encode how to interpolate the context views to synthesize the target view.

Current self-supervised NVS systems consist of three main components:

  • $PoseEnc$: Images $I$ $\to$ Poses $P$
  • $SceneEnc$: Images + Poses $\to$ Scene Representation $S$
  • $Render$: Target Pose + Scene Representation $\to$ Target Render

For traditional methods, $PoseEnc = Oracle$, where $Oracle$ is often chosen as COLMAP or VGGT.

To measure transferability, they introduce True Pose Similarity (TPS):

$$TPS(I^A, I^B) = d_R(Oracle(I^A), Oracle(I^B))$$

where $I^A$ and $I^B$ are sequences of images from two scenes, and $d_R(\cdot, \cdot)$ is some pose distance metric, like RRA/RTA/AUC.
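
To make the pose distance concrete, here is a minimal NumPy sketch of one plausible choice for $d_R$: a relative rotation accuracy (RRA) computed over all frame pairs. The pairwise convention, the 15-degree threshold, and the helper names (`rotation_angle_deg`, `relative_rotation_accuracy`, `rot_z`) are assumptions for illustration and may differ from the exact metric used in the paper. RTA and AUC are not shown.

```python
import numpy as np

def rotation_angle_deg(R):
    """Geodesic angle of a 3x3 rotation matrix, in degrees."""
    cos = (np.trace(R) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def relative_rotation_accuracy(R_a, R_b, threshold_deg=15.0):
    """Fraction of frame pairs (i, j) whose relative rotations agree within the threshold."""
    n = len(R_a)
    hits, total = 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            rel_a = R_a[i].T @ R_a[j]  # relative rotation i -> j in trajectory A
            rel_b = R_b[i].T @ R_b[j]  # same pair in trajectory B
            err = rotation_angle_deg(rel_a.T @ rel_b)
            hits += int(err < threshold_deg)
            total += 1
    return hits / total

def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees."""
    t = np.radians(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

traj_a = [rot_z(a) for a in (0, 10, 20, 30)]
# Same trajectory in a different global frame: relative rotations are unchanged.
traj_b = [rot_z(45) @ R for R in traj_a]
# A different trajectory: every relative rotation is off by at least 30 degrees.
traj_c = [rot_z(a) for a in (0, 40, 80, 120)]

rra_same = relative_rotation_accuracy(traj_a, traj_b)
rra_diff = relative_rotation_accuracy(traj_a, traj_c)
```

Because only relative rotations are compared, the score ignores the choice of global coordinate frame, which is what one wants when comparing trajectories from two unrelated scenes.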

Applying this to the self-supervised framework gives:

$$TPS(I^A, I^B) = d(I^A, Render(S^B, P^A))$$
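
The transfer test can be illustrated with a toy example. This sketch assumes that the comparison is made between oracle poses of the original frames and oracle poses of the rendered frames. Everything in it is hypothetical: an "image" is just a pair `[scene_code, camera_angle]`, the oracle reads the angle off directly, and the two hand-written models stand in for a transferable and a non-transferable pose encoder. It is not the paper's evaluation code.

```python
import numpy as np

# Toy world (assumption): an "image" is [scene_code, camera_angle_deg].
def make_scene(scene_code, angles):
    return np.array([[scene_code, a] for a in angles], dtype=float)

def oracle(images):
    """Stand-in for COLMAP/VGGT: camera angles relative to the first frame."""
    return images[:, 1] - images[0, 1]

# Model 1: the pose latent is the true relative angle (transferable).
def pose_enc_clean(images):
    return images[:, 1] - images[0, 1]

def render_clean(context, pose):
    return np.array([context[0], context[1] + pose])

# Model 2: the pose latent is entangled with scene content (not transferable).
def pose_enc_entangled(images):
    return (images[:, 1] - images[0, 1]) * images[0, 0]

def render_entangled(context, pose):
    return np.array([context[0], context[1] + pose / context[0]])

def transfer_error(pose_enc, render, frames_a, context_b):
    """Mean angular gap between oracle poses of A and of A's trajectory rendered in B."""
    poses_a = pose_enc(frames_a)
    rendered = np.stack([render(context_b, p) for p in poses_a])
    return float(np.mean(np.abs(oracle(frames_a) - oracle(rendered))))

frames_a = make_scene(2.0, [0.0, 10.0, 20.0, 30.0])
context_b = np.array([5.0, 100.0])  # one context frame from a different scene

# Both models reconstruct scene A perfectly from its own context frame,
# so a same-scene reconstruction loss cannot tell them apart.
for enc, ren in [(pose_enc_clean, render_clean), (pose_enc_entangled, render_entangled)]:
    recon = np.stack([ren(frames_a[0], p) for p in enc(frames_a)])
    assert np.allclose(recon, frames_a)

err_clean = transfer_error(pose_enc_clean, render_clean, frames_a, context_b)
err_entangled = transfer_error(pose_enc_entangled, render_entangled, frames_a, context_b)
```

In this toy setting the clean model transfers with zero error. The entangled model reconstructs its own scene perfectly but renders a rescaled trajectory in scene B. The paper argues that same-scene reconstruction cannot detect this failure and that a transfer test can.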

XFactor

Based on the transferability insight above, and to truly disentangle poses, the authors derive a Transferability Objective: render novel views of sequence $A$ using pose encodings from sequence $B$. Assume both $I^A$ and $I^B$ are split into context $I_C$ and target $I_T$ frames. Then the training objective becomes

$$L = d_I(I_T^A, Render(I_C^A, PoseEnc(I^B)))$$

where the scene encoding is absorbed into the renderer: $Render(I_C, \cdot) \equiv Render(SceneEnc(I_C), \cdot)$. The main issue with this formulation is that we cannot guarantee that two image sequences share the exact same camera trajectory. Therefore, the authors propose pose-preserving augmentations that separate a single sequence into two non-overlapping subsequences:

[Figure: XFactor method]

They train a stereo-monocular model (one context frame, one target frame).
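
A minimal sketch of the transferability objective for one context/target pair follows. Several pieces are assumptions made only so the example runs: the augmentations are modelled as a hypothetical split into even and odd pixel columns (one simple way to get two views with the same camera motion but disjoint pixels), camera motion is a vertical image shift, $d_I$ is taken to be mean squared error, and `toy_pose_enc` / `toy_render` are brute-force stand-ins rather than learned networks. None of this is claimed to match the paper's actual augmentations or architecture.

```python
import numpy as np

def transferability_loss(seq, pose_enc, render, augment_a, augment_b):
    """L = d_I(I_T^A, Render(I_C^A, PoseEnc(I^B))) for one context/target pair."""
    seq_a, seq_b = augment_a(seq), augment_b(seq)  # same trajectory, different pixels
    context_a, target_a = seq_a[0], seq_a[1]
    pose_b = pose_enc(seq_b[0], seq_b[1])          # pose latent comes from B only
    prediction = render(context_a, pose_b)         # rendered in A with B's pose
    return float(np.mean((prediction - target_a) ** 2))  # d_I: MSE (assumption)

# Hypothetical pose-preserving augmentations: disjoint pixel columns.
def keep_even_cols(seq):
    return seq[:, :, 0::2]

def keep_odd_cols(seq):
    return seq[:, :, 1::2]

# Toy pose encoder: brute-force search for the vertical shift between two frames.
def toy_pose_enc(ctx, tgt):
    errors = [np.sum((np.roll(ctx, k, axis=0) - tgt) ** 2) for k in range(ctx.shape[0])]
    return int(np.argmin(errors))

# Toy renderer: apply the shift to the context frame.
def toy_render(ctx, pose):
    return np.roll(ctx, pose, axis=0)

rng = np.random.default_rng(0)
frame0 = rng.random((8, 8))
seq = np.stack([frame0, np.roll(frame0, 3, axis=0)])  # "camera motion" = shift by 3 rows

pose_from_b = toy_pose_enc(*keep_odd_cols(seq))
loss = transferability_loss(seq, toy_pose_enc, toy_render, keep_even_cols, keep_odd_cols)
```

Since the pose latent is computed from pixels the renderer never sees, the only information it can usefully carry is the shared camera motion, which is the disentanglement the objective is after.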

Results

True Pose Similarity: XFactor >> RayZer/RUST both qualitatively and quantitatively.

[Figure: XFactor result]

Left: First compute pairwise $PoseEnc$ from the first frame to all other frames, then render novel views given a context frame.
Right: VGGT poses of rendered images.

Pose Probe: Probing the pose latents shows that while XFactor outperforms RayZer/RUST, these models do still encode useful pose information (it just doesn’t show up in the TPS score).

Ablations: Adding an additional view to $PoseEnc$ / $Render$ (to make it multi-view) degrades TPS and probed poses. Why?

What does this mean? That the network starts to interpolate and reuses the pose encoding for something else?

Thoughts

  • TPS is only necessary because XFactor has purely latent poses. RayZer, e.g., has explicit poses that can be used directly at inference, but for XFactor (a latent variable model), controllability = transferability.
  • Evaluation focuses on poses, not on NVS quality (the appendix still shows that it's significantly better than RayZer or RUST).
  • PoseEnc is always pairwise. Making it multi-view while ensuring pose latents are still well-behaved could be interesting.
  • Claim: “NVS is simply the ability to render a scene from a user-controllable viewpoint: It is critical that the same camera pose always results in the same viewpoint being rendered. If the model cannot do this, it is not a true NVS model, but rather, a frame interpolator.”
    • I feel like NVS can be seen as frame interpolation.
    • Scale ambiguity: what is the correct way to transfer poses?
  • How does rendering quality behave when adding more context frames?