pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction

David Charatan, Sizhe Li, Andrea Tagliasacchi, and Vincent Sitzmann

Dec 2023

Paper Abstract

We introduce pixelSplat, a feed-forward model that learns to reconstruct 3D radiance fields parameterized by 3D Gaussian primitives from pairs of images. Our model features real-time and memory-efficient rendering for scalable training as well as fast 3D reconstruction at inference time. To overcome local minima inherent to sparse and locally supported representations, we predict a dense probability distribution over 3D and sample Gaussian means from that probability distribution. We make this sampling operation differentiable via a reparameterization trick, allowing us to back-propagate gradients through the Gaussian splatting representation. We benchmark our method on wide-baseline novel view synthesis on the real-world RealEstate10k and ACID datasets, where we outperform state-of-the-art light field transformers and accelerate rendering by 2.5 orders of magnitude while reconstructing an interpretable and editable 3D radiance field.

@article{2312.12337v4,
  author = {Charatan, David and Li, Sizhe and Tagliasacchi, Andrea and Sitzmann, Vincent},
  title = {pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable
    Generalizable 3D Reconstruction},
  eprint = {2312.12337v4},
  archiveprefix = {arXiv},
  primaryclass = {cs.CV},
  year = {2023},
  month = dec,
  url = {http://arxiv.org/abs/2312.12337v4},
  file = {2312.12337v4.pdf},
  eprintnover = {2312.12337}
}

Important Points

1. Adaptive Density Control Revisited

In 3D-GS, randomly initialized Gaussian primitives must move through space to reach their final locations. During this process, two issues lead to local minima: gradients vanish if the distance to the optimal location exceeds a few standard deviations (the paper dubs this “local support”), and each Gaussian needs a path to its final location along which the loss decreases monotonically. Prior 3D-GS work relied on “Adaptive Density Control” to address this problem.
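To make the vanishing-gradient issue concrete, here is a minimal one-dimensional sketch. It is an illustration under simplifying assumptions, not the paper's code: the function name `gaussian_grad_wrt_mean`, the unnormalized Gaussian, and the chosen distances are all hypothetical. It evaluates how strongly a single Gaussian's mean is pulled toward a target location as the distance between them grows.

```python
import math


def gaussian_grad_wrt_mean(x, mu, sigma):
    """Derivative of exp(-(x - mu)^2 / (2 sigma^2)) with respect to mu."""
    g = math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
    return g * (x - mu) / sigma ** 2


sigma = 1.0
target = 0.0  # location where the Gaussian "should" end up

# Gradient magnitude when the mean sits k standard deviations from the target.
grads = {
    k: abs(gaussian_grad_wrt_mean(target, k * sigma, sigma))
    for k in (1, 3, 6, 10)
}

for k, g in grads.items():
    print(f"{k:>2} sigma away: |grad| = {g:.3e}")
```

The gradient magnitude shrinks roughly like $e^{-k^2/2}$, so a Gaussian that starts many standard deviations away from where it is needed receives almost no signal telling it to move.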

pixelSplat revisits Adaptive Density Control via a differentiable reparameterization trick. It roughly goes like this:

For each pixel, the network outputs logits which are softmaxed into a discrete distribution
\[\phi = [\phi_1, \phi_2, \dots, \phi_Z], \quad \phi_z = P(z)\]
over $Z$ predefined depth-bins.
Sample
\[z \sim \mathrm{Categorical}(\phi)\,,\]
to choose a bin index. Then, calculate the depth:
\[d = b_z + \delta_z\]
(where $b_z$ is the bin center and $\delta_z$ a learned offset), and unproject along the camera ray:
\[\mu = o + d \, d_u\]
with camera origin $o$ and ray direction $d_u$.
Instead of a hard spawn/prune, set the Gaussian’s opacity to
\[\alpha = \phi_z\,.\]
During backpropagation, by the chain rule:
\[\frac{\partial L}{\partial \phi_z} = \frac{\partial L}{\partial \alpha} \frac{\partial \alpha}{\partial \phi_z} = \frac{\partial L}{\partial \alpha}\,,\]
effectively “reparameterizing” the sampling step and allowing gradients to update $\phi$ directly.
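The steps above can be sketched for a single pixel as follows. This is a minimal illustration in plain Python rather than the authors' implementation: the helper names `softmax` and `sample_gaussian`, the linearly spaced bin centers, and all numeric values are assumptions made for the example. In the actual model the logits and offsets are network outputs.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_gaussian(logits, offsets, bin_centers, origin, direction, rng):
    """Sample one Gaussian mean and opacity along a single pixel's ray."""
    phi = softmax(logits)  # phi_z = P(z)
    # z ~ Categorical(phi)
    z = rng.choices(range(len(phi)), weights=phi, k=1)[0]
    # d = b_z + delta_z
    depth = bin_centers[z] + offsets[z]
    # mu = o + d * d_u
    mean = tuple(o + depth * u for o, u in zip(origin, direction))
    # Opacity is the probability of the bin that was sampled.
    alpha = phi[z]
    return z, depth, mean, alpha, phi


# Hypothetical per-pixel values, chosen only for illustration.
rng = random.Random(0)
logits = [0.1, 2.0, 0.3, -1.0]
offsets = [0.05, -0.02, 0.00, 0.10]
bin_centers = [1.0, 2.0, 3.0, 4.0]  # assumed spacing
origin = (0.0, 0.0, 0.0)
direction = (0.0, 0.0, 1.0)

z, depth, mean, alpha, phi = sample_gaussian(
    logits, offsets, bin_centers, origin, direction, rng
)
print(f"bin {z}, depth {depth:.2f}, mean {mean}, opacity {alpha:.3f}")
```

The draw of `z` is not differentiable, but `alpha` is read directly from `phi`. In an autodiff framework, any loss gradient reaching `alpha` would therefore flow into the sampled entry of `phi` and from there into the logits.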

By recasting Adaptive Density Control as probabilistic sampling combined with the reparameterization trick, pixelSplat generates Gaussians where the loss indicates that more density is needed and prunes them in low-error regions, all while remaining differentiable.

2. Scale Ambiguity Problem

Most existing datasets for 3D reconstruction provide poses computed with structure-from-motion (SfM) software. Because SfM reconstructs each scene only up to scale, different scenes are scaled by individual, arbitrary factors. Intuitively, only relative distances are ever recovered; there is no fixed correspondence between units in the 3D reconstruction and units in the real world.

The implication is that a neural network predicting the geometry of a scene from a single image cannot predict depths that match the poses reconstructed by structure-from-motion, because nothing in one image reveals the scene's arbitrary scale factor. pixelSplat therefore uses a two-view encoder in an attempt to resolve the scale ambiguity: with two posed views, depth can be estimated at the scale implied by the poses.
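The scale ambiguity described above can be checked with a toy pinhole camera. The sketch below is only an illustration under simplifying assumptions (identity rotation, unit focal length, and made-up coordinates; `project` and `scale` are hypothetical helpers). It shows that multiplying both the scene point and the camera positions by the same factor leaves every projected pixel unchanged, so images alone cannot pin down the scale.

```python
import math


def project(point, cam_center):
    """Pinhole projection with identity rotation and unit focal length."""
    x, y, z = (p - c for p, c in zip(point, cam_center))
    return (x / z, y / z)


def scale(v, s):
    """Multiply every coordinate of v by s."""
    return tuple(s * a for a in v)


point = (0.5, -0.2, 4.0)   # a 3D scene point
cam_a = (0.0, 0.0, 0.0)    # first camera center
cam_b = (1.0, 0.0, 0.0)    # second camera center
s = 3.7                    # arbitrary global scale factor

results = []
for cam in (cam_a, cam_b):
    original = project(point, cam)
    rescaled = project(scale(point, s), scale(cam, s))
    results.append((original, rescaled))
    print(original, rescaled)
```

Both cameras see the point at the same pixel coordinates (up to floating-point rounding) before and after rescaling, which is why the depth a network predicts has to be tied to the scale of the given poses by some other means.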

Concluding Thoughts

This paper is on the older side, but it has been influential as an example of how later work can use probabilistic models in 3D-GS. Since its publication, a steady flow of follow-up papers has built on and refined its ideas.