Intrinsic Image Fusion
Idea
Inverse rendering - the task of decomposing single RGB observations into PBR material space (albedo, roughness, metallic) - is heavily underconstrained. Optimization-based approaches that rely on analysis-by-synthesis are brittle. Recently, diffusion models have been shown to decompose RGB -> PBR successfully. However, since they model a distribution, you need to draw multiple samples from the ambiguous solution space. The paper focuses on how to fuse these independent predictions into a 3D model.
Method
Input: Images + Poses + 3D reconstruction (mesh).
- Image -> PBR (using RGB-X)
- Fit a Laplace distribution per image and per object over that object's pixels. Think of it as: every pixel of every object is a sample from some distribution.
- Distill the different 2D image distributions into 3D (Instant-NGP-like)
- Analysis by Synthesis to optimize and refine
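
To make the Laplace-fitting step concrete, here is a minimal sketch of the closed-form maximum-likelihood fit (location = median, scale = mean absolute deviation from the median), applied per object within one predicted map. This is not the paper's implementation: the function names, the array shapes, and the integer instance mask `object_ids` are all assumptions made for illustration.

```python
import numpy as np

def fit_laplace(values):
    """Closed-form maximum-likelihood Laplace fit along axis 0.

    values: (N, C) array of N pixel values with C channels.
    Returns (mu, b): per-channel location (the median) and scale
    (the mean absolute deviation from the median).
    """
    values = np.asarray(values, dtype=np.float64)
    mu = np.median(values, axis=0)
    b = np.mean(np.abs(values - mu), axis=0)
    return mu, b

def fit_per_object(prediction, object_ids):
    """Fit one Laplace distribution per object in a single predicted map.

    prediction: (H, W, C) predicted material map, e.g. albedo (assumed layout).
    object_ids: (H, W) integer instance mask (hypothetical input).
    Returns a dict mapping object id -> (mu, b).
    """
    fits = {}
    for obj in np.unique(object_ids):
        pixels = prediction[object_ids == obj]  # boolean mask -> (N, C)
        fits[int(obj)] = fit_laplace(pixels)
    return fits
```

The median makes the location estimate robust to outlier pixels, which is one plausible reason to prefer a Laplace over a Gaussian fit here; the notes do not say why the paper chose it.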
Results
- Tab2: the parametric aggregation is far better than naive per-object or per-texel averaging (30 dB vs. 13 dB PSNR).
- Having multiple predictions per image improves the results, though only slightly.
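
To get a feel for what the PSNR gap means, the sketch below computes PSNR and converts a PSNR value back to an RMSE. It assumes values normalized to [0, 1] (`max_val=1.0`); the notes do not record which range or which maps the paper evaluates on, so treat the numbers as an illustration only.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical inputs
    return 10.0 * np.log10(max_val ** 2 / mse)

def rmse_from_psnr(psnr_db, max_val=1.0):
    """Invert the PSNR formula to recover the root-mean-square error."""
    return max_val * 10.0 ** (-psnr_db / 20.0)
```

Under that assumption, 30 dB corresponds to an RMSE of about 0.03 and 13 dB to about 0.22, i.e. the mean squared error differs by a factor of roughly 50.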
Thoughts
- super engineering heavy but cool!
- the fact that multiple predictions only slightly improve performance is weird and seems to contradict the hypothesis that diffusion models draw diverse samples from an ambiguous solution space.
- I'm not sure whether Tab2 shows something meaningful. It makes sense that the parametric aggregation is better, but maybe that is just because it has more parameters?
- I would like to see how per-texel averaging performs downstream.
- What does Tab3 with a single image look like? Does this mean we don't fit a Laplace distribution in 2D in that case?