Intrinsic Image Fusion

May 11, 2026 · Paper

Idea:

Inverse rendering - the task of decomposing single RGB observations into PBR material space (albedo, roughness, metallic) - is heavily underconstrained. Optimization-based approaches that rely on analysis-by-synthesis are brittle. Recently, diffusion models have been shown to successfully decompose RGB -> PBR space. However, since they model a distribution, you need to draw multiple samples from the ambiguous solution space. The paper focuses on how to fuse these independent predictions into a 3D model.

Method

Input: Images + Poses + 3D reconstruction (mesh).

Image -> PBR (using RGB-X)
Fit a Laplace distribution per image, per object, over that object's pixels. Think of it as: the pixels of each object in each image are treated as samples from one distribution.
Distill the different 2D image distributions into 3D (Instant-NGP-like)
Analysis-by-synthesis to optimize and refine
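
To make the Laplace-fitting step above concrete, here is a minimal sketch of what a per-image, per-object fit could look like. It assumes a plain maximum-likelihood fit per channel (location = median, scale = mean absolute deviation from the median); the paper's actual parameterization may differ. The names `fit_laplace` and `fit_per_object` and the array layouts are my own assumptions, not from the paper.

```python
import numpy as np

def fit_laplace(samples):
    """Maximum-likelihood Laplace fit along axis 0.

    The MLE location is the median and the MLE scale is the
    mean absolute deviation from that median.
    """
    samples = np.asarray(samples, dtype=np.float64)
    mu = np.median(samples, axis=0)
    b = np.mean(np.abs(samples - mu), axis=0)
    return mu, b

def fit_per_object(values, object_ids):
    """Fit one Laplace distribution per object in a single image.

    values:     (H, W, C) predicted material map (hypothetical layout)
    object_ids: (H, W) integer object mask (hypothetical layout)
    Returns {object_id: (mu, b)} with mu and b of shape (C,).
    """
    values = np.asarray(values, dtype=np.float64)
    object_ids = np.asarray(object_ids)
    params = {}
    for obj in np.unique(object_ids):
        mask = object_ids == obj
        # values[mask] has shape (N_pixels, C)
        params[int(obj)] = fit_laplace(values[mask])
    return params

# Tiny example: a 2x2 single-channel "albedo" map with two objects.
albedo = np.array([[[0.2], [0.4]],
                   [[0.9], [0.6]]])
ids = np.array([[0, 0],
                [0, 1]])
params = fit_per_object(albedo, ids)
```

Using the median as the location makes the per-object estimate robust to a few outlier pixels, which is presumably part of the appeal over a plain mean.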

Results

Tab2: the parametric aggregation is far better than naive per-object or per-texel averaging (30dB vs 13dB PSNR).
Having multiple predictions per image improves the results, but only slightly.
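
To get a feeling for how large the PSNR gap in Tab2 is, here is a small sketch of the standard PSNR definition, 10 * log10(max_val^2 / MSE). I am assuming values normalized to [0, 1] (`max_val=1.0`); the notes do not record how the paper normalizes. Under that assumption, 30dB corresponds to an RMSE of about 0.03 and 13dB to an RMSE of about 0.22, and a 17dB gap is roughly a 50x difference in MSE.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB; max_val=1.0 is an assumption."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical inputs
    return 10.0 * np.log10(max_val ** 2 / mse)

def rmse_from_psnr(psnr_db, max_val=1.0):
    """Invert the PSNR formula to recover the implied RMSE."""
    return max_val * 10.0 ** (-psnr_db / 20.0)

# A constant error of 0.1 on a [0, 1] map gives an MSE of 0.01.
example_db = psnr(np.full((4, 4), 0.5), np.full((4, 4), 0.6))
```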

Thoughts

super engineering-heavy but cool!
the fact that multiple predictions only slightly improve performance is weird and contradicts the hypothesis that diffusion models draw diverse samples from an ambiguous solution space.
I have no idea whether Tab2 shows something meaningful. I think it makes sense that the parametric aggregation is better, because it has more parameters?
I would like to see how per-texel averaging performs downstream
What does Tab3 with a single image look like? Does this mean we don't fit a Laplace distribution in 2D?