Overview of our proposed method, where AGVS and T\(\mathbf{^2}\)GR denote Attention-Guided View Sampling and Text&Texture-Guided Resampling, respectively. We first sample \(N\) viewpoints around the object. As shown in (a), our texture sampling strategy interleaves texture generation with diffusion denoising. Specifically, the texture sampling process is structured into \(T\) denoising steps of the diffusion process, and a complete RGB texture map (\(\hat{U}_{t}^N\)) is generated at the end of each step. As shown in (b), at denoising step \(t\), each AGVS module receives noisy latent features \(x_{t}^i\) as input to sample an image and produce a partial texture map \(\hat{U}_{t}^i\), along with a noise estimate \(\epsilon_\theta(x_t^i)\). The generated \(\hat{U}_{t}^i\) serves as guidance for sampling the subsequent view. The complete texture map \(\hat{U}_{t}^N\) is then employed to refine the noise estimate of each view within the T\(^2\)GR modules, facilitating the prediction of the noisy features for the ensuing denoising step (\(x_{t-1}^{1...N}\)).
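The interleaved loop described above can be sketched as follows. This is a minimal stand-in, not the actual implementation: `agvs` and `t2gr` are hypothetical placeholders (the real modules involve a diffusion UNet, attention guidance, and UV projection), and the array shapes and update rule are illustrative assumptions.

```python
import numpy as np

N_VIEWS, T_STEPS, LATENT_DIM, UV_DIM = 4, 3, 8, 16
rng = np.random.default_rng(0)

def agvs(x_t, partial_texture):
    """Stand-in AGVS: estimate noise for one view, form the denoised
    observation, and splat it onto the growing UV texture map."""
    eps = 0.1 * x_t                    # placeholder noise estimate eps_theta(x_t^i)
    x0_hat = x_t - eps                 # placeholder denoised observation
    texture = partial_texture + x0_hat.mean()  # placeholder UV assembly
    return texture, eps

def t2gr(eps, full_texture):
    """Stand-in T2GR: refine a view's noise estimate conditioned on the
    complete texture map of the current denoising step."""
    return eps + 0.01 * full_texture.mean()

def sample_texture():
    x = [rng.standard_normal(LATENT_DIM) for _ in range(N_VIEWS)]
    for t in range(T_STEPS, 0, -1):
        texture = np.zeros(UV_DIM)
        eps_list = []
        # (a) AGVS pass: iterate views, each one extending the partial texture
        for i in range(N_VIEWS):
            texture, eps = agvs(x[i], texture)
            eps_list.append(eps)
        # (b) T2GR pass: the complete texture refines every view's noise
        # estimate before stepping all latents to t-1
        for i in range(N_VIEWS):
            eps_ref = t2gr(eps_list[i], texture)
            x[i] = x[i] - eps_ref      # placeholder for the sampler update
    return texture

tex = sample_texture()
```

The key structural point is that the texture map is rebuilt from scratch at every denoising step, while the per-view latents carry state across steps.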
Details of denoising for view \(i+1\) at step \(t\). The AGVS module generates the denoised observation \(\hat{x}_0^{i+1}(x_t^{i+1})\), which is assembled onto the UV space to form the intermediate texture \(\hat{U}_{t}^{i+1}\). The attention guidance is omitted from the figure for simplicity. After iterating over all sampled views from \(i=1\) to \(N\), we obtain a complete texture map \(\hat{U}_{t}^N\) for each denoising step. Conditioned on the current aggregated texture map, the T\(^2\)GR module updates the noise estimate \(\epsilon_\theta(x_t^{i+1})\) with multi-conditioned classifier-free guidance (CFG) to compute the noisy latent feature \(x_{t-1}^{i+1}\) for the next denoising step.
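For reference, multi-conditioned CFG with two guidance signals generally takes the form below; the guidance weights \(w_{\text{txt}}\), \(w_{\text{tex}}\) and the use of \(\hat{U}_t^N\) as the texture condition are assumptions for illustration, as the caption does not spell out the exact formulation:

```latex
\tilde{\epsilon}_\theta(x_t^{i+1}) =
  \epsilon_\theta(x_t^{i+1} \mid \varnothing)
  + w_{\text{txt}}\left[\epsilon_\theta(x_t^{i+1} \mid c_{\text{txt}}) - \epsilon_\theta(x_t^{i+1} \mid \varnothing)\right]
  + w_{\text{tex}}\left[\epsilon_\theta(x_t^{i+1} \mid \hat{U}_t^N) - \epsilon_\theta(x_t^{i+1} \mid \varnothing)\right]
```

Here \(\varnothing\) denotes the unconditional branch, \(c_{\text{txt}}\) the text prompt, and the two weighted differences steer the estimate toward the text and texture conditions, respectively.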
“A Golden Lion”
“A Medieval Clock”
“A Backpack in Ironman Style”
“A NextGen Nascar in Red”
@inproceedings{huo2024texgen,
  author    = {Huo, Dong and Guo, Zixin and Zuo, Xinxin and Shi, Zhihao and Lu, Juwei and Dai, Peng and Xu, Songcen and Cheng, Li and Yang, Yee-Hong},
  title     = {TexGen: Text-Guided 3D Texture Generation with Multi-view Sampling and Resampling},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2024},
}