MeshAvatar: Learning High-quality Triangular Human Avatars from Multi-view Videos

Tsinghua University, NNKosmos Technology
ECCV 2024

Given multi-view videos of a specific subject, our method learns a triangular avatar of that subject with intrinsic material decomposition. After training, the avatar not only supports synthesis under novel poses and novel lighting conditions, but also enables texture editing and material manipulation.

Abstract

We present a novel pipeline for learning high-quality triangular human avatars from multi-view videos. Recent methods for avatar learning are typically based on neural radiance fields (NeRF), which are not compatible with the traditional graphics pipeline and pose great challenges for operations such as editing or synthesis under different environments. To overcome these limitations, our method represents the avatar with an explicit triangular mesh extracted from an implicit SDF field, complemented by an implicit material field conditioned on the given poses. Leveraging this triangular avatar representation, we incorporate physics-based rendering to accurately decompose geometry and texture. To enhance both geometric and appearance details, we further employ a 2D UNet as the network backbone and introduce pseudo normal ground truth as additional supervision. Experiments show that our method can learn triangular avatars with high-quality geometry reconstruction and plausible material decomposition, inherently supporting editing, manipulation, and relighting operations.
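As a concrete illustration of the representation described above, the sketch below (our illustration, not the authors' released code) shows how an explicit triangular mesh can be extracted from a learned SDF network via marching cubes. The name sdf_net, the grid resolution, and the bounding volume are hypothetical stand-ins, and the actual pipeline may use a different (e.g., differentiable) iso-surfacing scheme.

import numpy as np
import torch
from skimage.measure import marching_cubes

@torch.no_grad()
def extract_mesh(sdf_net, resolution=256, bound=1.0):
    # Sample the implicit SDF on a regular grid over the canonical bounding box.
    xs = torch.linspace(-bound, bound, resolution)
    grid = torch.stack(torch.meshgrid(xs, xs, xs, indexing="ij"), dim=-1)
    pts = grid.reshape(-1, 3)

    # Query the network in chunks to keep memory bounded.
    sdf = torch.cat([sdf_net(chunk) for chunk in pts.split(65536)])
    sdf = sdf.reshape(resolution, resolution, resolution).cpu().numpy()

    # Marching cubes at the zero level set yields an explicit triangle mesh.
    verts, faces, _, _ = marching_cubes(sdf, level=0.0)

    # Map voxel-index coordinates back to world space.
    verts = verts / (resolution - 1) * 2.0 * bound - bound
    return verts.astype(np.float32), faces

The resulting vertices and faces can then be skinned, rasterized, or ray-traced like any ordinary mesh asset, which is what makes the representation compatible with the traditional graphics pipeline.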

Method Overview


Our pipeline learns a hybrid human avatar represented in the form of (a) an explicit skinned mesh and (b) implicit pose-dependent material fields. Such a representation inherently supports (c) physics-based ray tracing and can be trained in an end-to-end manner using (d) normal estimation as an additional supervision signal.
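To make the four components above concrete, here is a hedged sketch of one training iteration under this hybrid representation. Every interface in it (material_net, pose_code, the differentiable renderer, and the pseudo-normal estimates) is an assumption for illustration; the paper's actual modules, losses, and loss weighting will differ.

import torch
import torch.nn.functional as F

def training_step(verts_canonical, faces, lbs_weights, bone_transforms,
                  material_net, pose_code, rays, gt_image, pseudo_normals,
                  renderer, lambda_normal=0.1):
    # (a) Explicit skinned mesh: pose the canonical vertices with linear
    #     blend skinning (lbs_weights: (V, J), bone_transforms: (J, 4, 4)).
    T = torch.einsum("vj,jrc->vrc", lbs_weights, bone_transforms)
    verts_h = F.pad(verts_canonical, (0, 1), value=1.0)  # homogeneous coords
    verts_posed = torch.einsum("vrc,vc->vr", T, verts_h)[:, :3]

    # (b) Implicit pose-dependent materials: query albedo and roughness
    #     at canonical surface points, conditioned on the pose.
    albedo, roughness = material_net(verts_canonical, pose_code)

    # (c) Physics-based rendering: a differentiable ray tracer turns the
    #     posed mesh plus materials into an image and per-pixel normals.
    pred_image, pred_normals = renderer(verts_posed, faces, albedo,
                                        roughness, rays)

    # (d) Photometric loss plus pseudo-normal supervision.
    loss_rgb = F.l1_loss(pred_image, gt_image)
    loss_normal = F.l1_loss(pred_normals, pseudo_normals)
    return loss_rgb + lambda_normal * loss_normal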

Video Presentation

Comparisons

Quantitative Results

|                               | Ours  | AvatarReX | Animatable Gaussians | Animatable Gaussians* | Xu et al. | Lin et al. | IntrinsicAvatar |
|-------------------------------|-------|-----------|----------------------|-----------------------|-----------|------------|-----------------|
| Representation                | mesh  | SDF       | 3DGS                 | 3DGS                  | SDF       | SDF        | SDF             |
| Relightable?                  | ✓     | ✗         | ✗                    | ✗                     | ✓         | ✓          | ✓               |
| Training time (~100 frames)   | ~3h   | –         | –                    | –                     | 2.5 days  | –          | 4h (mono.)      |
| Training time (~1000 frames)  | ~16h  | 2 days    | 2 days (RTX 4090)    | 2 days (RTX 4090)     | –         | 30h        | –               |
| Inference time (per image)    | 180ms | 30s       | 100ms                | 4~10s                 | 5s        | 40s        | 20s             |

("–" = not reported; "mono." = trained from monocular input.)

Comparison with recent SOTA methods on neural avatars. Our method achieves roughly 20x faster inference.

Qualitative Results


We evaluated our method on the AvatarReX [1] and ActorsHQ [2] datasets. Our method reconstructs fine-grained dynamic human geometry.

References

[1] Zheng, Zerong, et al. "AvatarReX: Real-time Expressive Full-body Avatars." ACM Transactions on Graphics (TOG) 42.4 (2023): 1-19.
[2] Işık, Mustafa, et al. "HumanRF: High-Fidelity Neural Radiance Fields for Humans in Motion." ACM Transactions on Graphics (TOG) 42.4 (2023): 1-12.

BibTeX

@misc{chen2024meshavatar,
    title={MeshAvatar: Learning High-quality Triangular Human Avatars from Multi-view Videos}, 
    author={Yushuo Chen and Zerong Zheng and Zhe Li and Chao Xu and Yebin Liu},
    year={2024},
    eprint={2407.08414},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2407.08414}, 
}