LiftImage3D: Lifting Any Single Image to 3D Gaussians with Video Generation Priors


1Shanghai Jiao Tong University 2Huawei Inc.
* Equal contributions in no particular order.   † Project lead.
TL;DR: We present LiftImage3D, a framework for reconstructing 3D-consistent scenes from single images using Latent Video Diffusion Models (LVDMs), overcoming the distortions inherent in LVDM generation to achieve high-quality 3D consistency and rendering across diverse inputs.

Single Image to 3D Scene

Abstract

Single-image 3D reconstruction remains a fundamental challenge in computer vision due to inherent geometric ambiguities and limited viewpoint information. Recent advances in Latent Video Diffusion Models (LVDMs) offer promising 3D priors learned from large-scale video data. However, leveraging these priors effectively faces three key challenges: (1) degradation in quality across large camera motions, (2) difficulties in achieving precise camera control, and (3) geometric distortions inherent to the diffusion process that damage 3D consistency. We address these challenges by proposing LiftImage3D, a framework that effectively releases LVDMs' generative priors while ensuring 3D consistency. Specifically, we design an articulated trajectory strategy to generate video frames, which decomposes video sequences with large camera motions into ones with controllable small motions. Then we use robust neural matching models, i.e. MASt3R, to calibrate the camera poses of generated frames and produce corresponding point clouds. Finally, we propose a distortion-aware 3D Gaussian splatting representation, which can learn independent distortions between frames and output undistorted canonical Gaussians. Extensive experiments demonstrate that LiftImage3D achieves state-of-the-art performance on two challenging datasets, i.e. LLFF and DL3DV, and generalizes well to diverse in-the-wild images, from cartoon illustrations to complex real-world scenes.
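To make the articulated trajectory strategy concrete, the sketch below is a minimal, hypothetical outline rather than the released implementation: lvdm_generate and small_rotation are assumed stand-ins, and the step sizes and branch layout are placeholders. It only illustrates the idea that a large target motion is reached through chains of small relative motions, each of which stays inside the range where the LVDM remains reliable.

import numpy as np

def lvdm_generate(image, relative_pose, num_frames=16):
    # Hypothetical stand-in for an LVDM call conditioned on an input frame and
    # a small relative camera motion; here it simply repeats the input frame.
    return [image] * num_frames

def small_rotation(axis, degrees):
    # Build a small relative camera pose (4x4) rotating around `axis`.
    theta = np.deg2rad(degrees)
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    if axis == "yaw":      # rotate around the up axis
        T[:3, :3] = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
    elif axis == "pitch":  # rotate around the right axis
        T[:3, :3] = [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]
    return T

def articulated_trajectory(input_image, directions=("yaw", "pitch"),
                           step_deg=5.0, steps_per_branch=3):
    # Decompose a large camera motion into chains of small, controllable steps.
    # Each branch starts from the input image; every step conditions the LVDM
    # on the last generated frame plus a small relative pose, so no single
    # generation has to cover a large motion on its own.
    all_clips = []
    for direction in directions:
        for sign in (+1.0, -1.0):
            anchor = input_image
            for _ in range(steps_per_branch):
                clip = lvdm_generate(anchor, small_rotation(direction, sign * step_deg))
                all_clips.append(clip)
                anchor = clip[-1]  # continue this branch from the last frame
    return all_clips

The clips generated along all branches are what the later matching and registration stages consume.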

[Overview figure]

Interactive Viewer

Click on the images below to render 3D scenes in real-time in your browser, powered by Brush!
Note that this is experimental and quality may be reduced.

Framework

The overall pipeline of LiftImage3D. First, we extend the LVDM to generate diverse video clips from a single image using an articulated camera trajectory strategy. Then all generated frames are matched with the robust neural matching module and registered into a point cloud. After that, we initialize Gaussians from the registered point cloud and construct a distortion field to model the independent distortion of each video frame on top of the canonical 3DGS.
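As a rough illustration of the registration step, the sketch below chains pairwise matching results into global poses and one fused point cloud. neural_match is a hypothetical stand-in for a robust matcher such as MASt3R (its real interface differs), and the sequential chaining is a simplification of how the frame graph is actually handled; only the pose-composition bookkeeping is the point here.

import numpy as np

def neural_match(frame_a, frame_b):
    # Hypothetical stand-in for a robust neural matcher (e.g. MASt3R).
    # Assumed to return a relative pose (4x4, mapping frame_a coordinates to
    # frame_b coordinates) and matched 3D points in frame_a's camera frame.
    raise NotImplementedError("replace with a real matching backend")

def register_frames(frames):
    # Chain pairwise matches into camera-to-world poses and one point cloud.
    poses = [np.eye(4)]          # the first frame defines the world frame
    world_points = []
    for frame_a, frame_b in zip(frames[:-1], frames[1:]):
        rel_pose, points_in_a = neural_match(frame_a, frame_b)
        # Lift frame_a's local points into world coordinates.
        homog = np.concatenate([points_in_a, np.ones((len(points_in_a), 1))], axis=1)
        world_points.append((poses[-1] @ homog.T).T[:, :3])
        # Compose the next camera-to-world pose from the relative estimate.
        poses.append(poses[-1] @ np.linalg.inv(rel_pose))
    return poses, np.concatenate(world_points, axis=0)

The fused point cloud is what initializes the Gaussians in the next stage.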

[Pipeline overview figure]
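The distortion field can be pictured with the following minimal PyTorch sketch, written under our own simplifying assumptions (it is not the paper's implementation): canonical Gaussian centers are stored once, and a small MLP conditioned on a learned per-frame embedding predicts position offsets that absorb frame-specific inconsistencies during training, while novel views are rendered from the undistorted canonical Gaussians. A full representation would also distort covariance, opacity, and appearance attributes.

from typing import Optional

import torch
import torch.nn as nn

class DistortionAwareGaussians(nn.Module):
    # Canonical 3D Gaussian centers plus a per-frame distortion field.
    # Simplified: only positions are modeled and rasterization is omitted.

    def __init__(self, init_xyz: torch.Tensor, num_frames: int, embed_dim: int = 32):
        super().__init__()
        # Canonical (undistorted) centers, initialized from the registered point cloud.
        self.xyz = nn.Parameter(init_xyz.clone())
        # One learned embedding per generated video frame.
        self.frame_embed = nn.Embedding(num_frames, embed_dim)
        # Tiny MLP: (canonical position, frame embedding) -> per-Gaussian offset.
        self.distortion_mlp = nn.Sequential(
            nn.Linear(3 + embed_dim, 64), nn.ReLU(),
            nn.Linear(64, 3),
        )

    def forward(self, frame_idx: Optional[int] = None) -> torch.Tensor:
        # Canonical centers for novel-view rendering; distorted centers when
        # fitting a specific generated training frame.
        if frame_idx is None:
            return self.xyz
        embed = self.frame_embed(torch.tensor(frame_idx, device=self.xyz.device))
        embed = embed.expand(self.xyz.shape[0], -1)
        offset = self.distortion_mlp(torch.cat([self.xyz, embed], dim=-1))
        return self.xyz + offset

During training, each generated frame would be rendered from forward(frame_idx) and supervised with that frame, so frame-specific warps flow into the distortion field instead of the canonical Gaussians; at test time forward(None) yields the clean canonical scene.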

Single Image to 3D Scene That Can Be Dragged Freely

Citation

@misc{chen2024liftimage3d,
    title={LiftImage3D: Lifting Any Single Image to 3D Gaussians with Video Generation Priors},
    author={Yabo Chen and Chen Yang and Jiemin Fang and Xiaopeng Zhang and Lingxi Xie and Wei Shen and Wenrui Dai and Hongkai Xiong and Qi Tian},
    year={2024},
    eprint={2412.09597},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}