In this work we address the challenging problem of multiview 3D surface reconstruction. We introduce the Implicit Differentiable Renderer (IDR): a neural network architecture that simultaneously learns the unknown geometry, the camera parameters, and a neural renderer that approximates the light reflected from the surface towards the camera. The geometry is represented as the zero level-set of a neural network, while the neural renderer, derived from the rendering equation, is capable of (implicitly) modeling a wide set of lighting conditions and materials. We trained our network on real-world 2D images of objects with different material properties, lighting conditions, and noisy camera initializations from the DTU MVS dataset. Our model produces state-of-the-art 3D surface reconstructions with high fidelity, resolution, and detail.
Given a set of input masked 2D images, our goal is to infer the following three unknowns: (i) the geometry of the scene, represented as a zero level-set of an MLP f; (ii) the light and reflectance properties of the scene; and (iii) the unknown camera parameters. Toward that goal we simulate the rendering process of an implicit neural geometry inspired by the rendering equation.
The IDR forward model produces differentiable RGB values for a learnable camera position c and some fixed image pixel p as follows: the camera parameters and pixel define a viewing direction v, and we denote by x the intersection of the viewing ray c+tv with the implicit surface.
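Finding the intersection x of the viewing ray c + tv with the zero level-set is a root-finding problem along the ray. A common technique for implicit surfaces defined by a signed distance function is sphere tracing; the sketch below is illustrative and not the paper's exact intersection routine:

```python
import numpy as np

def sphere_trace(sdf, c, v, t0=0.0, max_steps=64, eps=1e-5):
    """March along the ray c + t*v until sdf(c + t*v) ~ 0.

    sdf: callable mapping a 3D point to its signed distance.
    c: camera position (3,); v: unit viewing direction (3,).
    Returns (t, hit), where hit indicates convergence.
    """
    t = t0
    for _ in range(max_steps):
        d = sdf(c + t * v)
        if abs(d) < eps:
            return t, True
        # For a true SDF, a sphere of radius |d| around the current
        # point contains no surface, so stepping by d is safe.
        t += d
    return t, False

# Toy example: unit sphere at the origin, camera on the -z axis looking along +z.
unit_sphere = lambda p: np.linalg.norm(p) - 1.0
t, hit = sphere_trace(unit_sphere,
                      np.array([0.0, 0.0, -3.0]),
                      np.array([0.0, 0.0, 1.0]))
# The ray enters the sphere at t = 2, two units in front of the camera.
```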
A Sample Network module represents x, and the normal to the surface n as differentiable functions of the implicit geometry and camera parameters.
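The key idea behind the Sample Network is a first-order correction that makes the intersection point an exact differentiable function of the geometry and camera parameters at the current estimate. A minimal numpy sketch of that correction, with illustrative notation (the function and variable names here are our own), is:

```python
import numpy as np

def sample_network_point(f, grad_f, c, v, t0):
    """Differentiable surface point via a first-order correction:

        x = c + t0*v - (v / (grad_f(x0) . v)) * f(x0),  x0 = c + t0*v

    x0 is the current (non-differentiable) intersection estimate; the
    correction term restores exact first-order dependence of x on the
    geometry f and on the camera parameters (c, v).
    """
    x0 = c + t0 * v
    denom = grad_f(x0) @ v
    return c + t0 * v - (v / denom) * f(x0)

# Unit sphere: at the true intersection the correction term vanishes.
f = lambda p: np.linalg.norm(p) - 1.0
grad_f = lambda p: p / np.linalg.norm(p)
c, v, t0 = np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]), 2.0
x = sample_network_point(f, grad_f, c, v, t0)
# x = (0, 0, -1), the point where the ray meets the sphere.
```

The surface normal n is simply the (normalized) gradient of f at x, which is already a differentiable function of the network parameters.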
The final radiance reflected from the geometry toward the camera c in direction v, i.e., RGB, is approximated by the Neural Renderer M, an MLP that takes as input the surface point x and normal n, the viewing direction v, and a global geometry feature vector z.
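The renderer M is an MLP mapping the concatenated inputs (x, n, v, z) to an RGB value. The toy stand-in below uses random placeholder weights and an assumed 8-dimensional feature z, just to make the input/output interface concrete; it is not the paper's trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_renderer(x, n, v, z, params):
    """Tiny stand-in for the neural renderer M: concatenate surface point,
    normal, view direction, and global geometry feature, apply two affine
    layers with a ReLU in between, and squash the output into valid RGB
    range with a sigmoid."""
    h = np.concatenate([x, n, v, z])
    h = np.maximum(params["W1"] @ h + params["b1"], 0.0)  # hidden layer, ReLU
    rgb = params["W2"] @ h + params["b2"]
    return 1.0 / (1.0 + np.exp(-rgb))  # sigmoid -> values in (0, 1)

dim_in = 3 + 3 + 3 + 8  # x, n, v, and z (the feature size 8 is an assumption)
params = {
    "W1": rng.normal(size=(32, dim_in)), "b1": np.zeros(32),
    "W2": rng.normal(size=(3, 32)),      "b2": np.zeros(3),
}
rgb = mlp_renderer(np.zeros(3), np.array([0.0, 0.0, 1.0]),
                   np.array([0.0, 0.0, -1.0]), np.zeros(8), params)
# rgb is a length-3 vector with entries in (0, 1)
```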
In turn, the IDR model is incorporated in a loss that compares its output to the ground-truth pixel color, enabling simultaneous learning of the geometry, its appearance, and the camera parameters.
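A minimal sketch of such a composite loss, combining an L1 color term with an Eikonal regularizer that keeps f close to a signed distance function (the paper's full loss also includes a mask term, omitted here, and the weight `lam` is a placeholder):

```python
import numpy as np

def idr_style_loss(pred_rgb, gt_rgb, grad_norms, lam=0.1):
    """Illustrative composite loss:
    - color: L1 distance between rendered and ground-truth pixel colors;
    - eikonal: drives |grad f| -> 1 at sampled points, so the implicit
      function behaves like a signed distance function.
    """
    color = np.abs(pred_rgb - gt_rgb).mean()
    eikonal = ((grad_norms - 1.0) ** 2).mean()
    return color + lam * eikonal

loss = idr_style_loss(np.array([0.5, 0.5, 0.5]),   # rendered pixel
                      np.array([0.6, 0.4, 0.5]),   # ground-truth pixel
                      np.array([1.0, 1.1, 0.9]))   # |grad f| at sample points
# loss = mean|pred - gt| + 0.1 * mean(|grad f| - 1)^2
```

Because every term is differentiable in the geometry network, the renderer, and the camera parameters (through the Sample Network), a single gradient-based optimization updates all three jointly.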
Examples of reconstructed 3D geometry and rendering of novel views computed from 49-64 input 2D images of the DTU dataset. We show results in two different setups: (1) fixed ground truth cameras, and (2) trainable camera parameters with noisy initializations.
Geometry and novel views computed from the 11 images of the Fountain-P11 image collection.
Towards the goal of disentangling geometry and appearance, we show the transfer of the appearance learned from one scene to unseen geometry. For two models trained on two different DTU scenes, we show (left to right): the reconstructed geometry; novel views using the trained renderer; and novel views rendered using the renderer from the other (unseen) scene.