Intrinsic Image Fusion

· Paper

Idea:

Inverse rendering - the task of decomposing single RGB observations into PBR material space (albedo, roughness, metallic) - is heavily underconstrained. Optimization-based analysis-by-synthesis approaches are brittle. Recently, diffusion models have been shown to successfully decompose RGB -> PBR space. However, since they model a distribution, you need to draw multiple samples from the ambiguous solution space. The paper focuses on how to fuse these independent predictions into a 3D model.

Method

Input: Images + Poses + 3D reconstruction (mesh).

  1. Image -> PBR (using RGB-X)
  2. Fit a Laplace distribution per image and per object over that object's pixels. Think of it as: the pixels of every object in every image are modeled by one distribution.
  3. Distill the different 2D image distributions into 3D (Instant-NGP-like)
  4. Analysis by Synthesis to optimize and refine
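
Step 2 can be sketched in a few lines. This is a minimal sketch, not the paper's implementation: it assumes the fit is a plain maximum-likelihood Laplace fit per channel (location = median, scale = mean absolute deviation from the median). The function name `fit_laplace_per_object`, the array layout and the per-pixel object-id mask are all assumptions for illustration.

```python
import numpy as np

def fit_laplace_per_object(pbr_map, object_ids):
    """Fit an independent Laplace distribution per object and channel.

    pbr_map:    (H, W, C) float array, e.g. one predicted albedo map.
    object_ids: (H, W) int array of per-pixel object labels (hypothetical input).
    Returns {object_id: (mu, b)} with mu and b of shape (C,).
    """
    params = {}
    for obj in np.unique(object_ids):
        pixels = pbr_map[object_ids == obj]        # (N, C) pixels of this object
        mu = np.median(pixels, axis=0)             # MLE of the Laplace location
        b = np.mean(np.abs(pixels - mu), axis=0)   # MLE of the Laplace scale
        params[int(obj)] = (mu, b)
    return params

# Toy data: left half is a dark object (0.2), right half a bright one (0.8).
rng = np.random.default_rng(0)
ids = np.zeros((4, 4), dtype=int)
ids[:, 2:] = 1
albedo = np.where(ids[..., None] == 0, 0.2, 0.8) + 0.01 * rng.standard_normal((4, 4, 3))
params = fit_laplace_per_object(albedo, ids)
```

Running this once per sampled prediction gives one (mu, b) pair per image, per object, and per sample, which is the kind of 2D statistic that step 3 would then distill into 3D.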

Results

  • Tab2: the parametric aggregation is way better than naive per-object or per-texel averaging (30 dB vs 13 dB PSNR).
  • Having multiple predictions per image improves the
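
To get a feeling for the PSNR gap in Tab2, the standard definition PSNR = 10 * log10(MAX^2 / MSE) can be inverted to an RMSE. A minimal sketch, assuming values are normalized to [0, 1] (the notes do not say which range the paper uses); the helper names are made up for illustration.

```python
import math

def psnr(mse, max_val=1.0):
    """Standard PSNR in dB for a given mean squared error."""
    return 10 * math.log10(max_val ** 2 / mse)

def psnr_to_rmse(psnr_db, max_val=1.0):
    """Invert PSNR = 20 * log10(max_val / RMSE)."""
    return max_val / (10 ** (psnr_db / 20))

rmse_30 = psnr_to_rmse(30)  # parametric aggregation in Tab2
rmse_13 = psnr_to_rmse(13)  # naive averaging in Tab2
print(f"30 dB -> RMSE {rmse_30:.3f}, 13 dB -> RMSE {rmse_13:.3f}")
```

Under that assumption, 30 dB corresponds to an RMSE of roughly 0.03 and 13 dB to roughly 0.22, i.e. about a 7x difference in per-pixel error.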

Thoughts

  • super engineering heavy but cool!
  • the fact that multiple predictions only slightly improve performance is weird and breaks the hypothesis that diffusion models draw diverse samples from an ambiguous solution space.
  • I have no idea if Tab2 shows something meaningful or not? I think it makes sense that it is better because it has more parameters?
  • I would like to see how a per-texel averaging performs downstream
  • What does Tab3 with a single image look like? Does this mean we don't fit a Laplace distribution in 2D?
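
A toy illustration of why per-texel averaging could behave differently from a Laplace-style fit when samples really are multimodal. The numbers are invented, not from the paper; the only fact used is that the maximum-likelihood location of a Laplace distribution is the median.

```python
import numpy as np

# Five hypothetical albedo samples for one texel: three agree on a dark
# material, two on a bright one (toy numbers, not from the paper).
samples = np.array([0.2, 0.2, 0.2, 0.8, 0.8])

mean_fused = samples.mean()         # 0.44: a value no sample predicted
median_fused = np.median(samples)   # 0.2: stays on the majority mode
```

If the samples barely differ, as the small gain from multiple predictions suggests, mean and median nearly coincide and this effect would not explain the gap on its own.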