SEREP: Semantic Facial Expression Representation
for Robust In-the-Wild Capture and Retargeting

1Ecole de Technologie Supérieure, 2Ubisoft LaForge
*Equal contribution

SEREP captures and retargets facial expressions from in-the-wild videos. It handles extreme facial expressions and head poses while accounting for the target face morphology.

SEREP face performance capture and retargeting


Abstract

Monocular facial performance capture in-the-wild is challenging due to varied capture conditions, face shapes, and expressions. Most current methods rely on linear 3D Morphable Models, which represent facial expressions independently of identity at the vertex displacement level. We propose SEREP (Semantic Expression Representation), a model that disentangles expression from identity at the semantic level. It first learns an expression representation from unpaired 3D facial expressions using a cycle consistency loss. Then we train a model to predict expression from monocular images using a novel semi-supervised scheme that relies on domain adaptation. In addition, we introduce MultiREX, a benchmark addressing the lack of evaluation resources for the expression capture task. Our experiments show that SEREP outperforms state-of-the-art methods, capturing challenging expressions and transferring them to novel identities.
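To make the cycle-consistency idea concrete, here is a minimal sketch (not the paper's architecture) of one way an expression code could be learned from unpaired meshes: transfer an expression from identity A to identity B, then back to A, and penalize the reconstruction error. The `ExpressionAutoencoder` module, its layer sizes, and the flattened-vertex input format are all assumptions made for illustration.

```python
# Minimal sketch of a cycle-consistency objective for expression transfer.
# Hypothetical modules and dimensions; not the SEREP implementation.
import torch
import torch.nn as nn

class ExpressionAutoencoder(nn.Module):
    """Encodes an expression mesh to a latent code and decodes it onto a
    given neutral (identity) mesh. Inputs are flattened (n_verts * 3) tensors."""
    def __init__(self, n_verts: int, latent_dim: int = 128):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Linear(n_verts * 3, 512), nn.ReLU(), nn.Linear(512, latent_dim))
        self.decode = nn.Sequential(
            nn.Linear(latent_dim + n_verts * 3, 512), nn.ReLU(),
            nn.Linear(512, n_verts * 3))

    def forward(self, expr_mesh, neutral_mesh):
        z = self.encode(expr_mesh)                              # expression code
        recon = self.decode(torch.cat([z, neutral_mesh], dim=-1))
        return z, recon

def cycle_loss(model, expr_a, neutral_a, neutral_b):
    """Transfer A's expression to identity B, re-encode it, map it back to A,
    and compare against the original expression mesh of A."""
    z_a, _ = model(expr_a, neutral_a)
    expr_on_b = model.decode(torch.cat([z_a, neutral_b], dim=-1))   # A's expression on B
    z_b = model.encode(expr_on_b)
    expr_back_a = model.decode(torch.cat([z_b, neutral_a], dim=-1)) # back to identity A
    return nn.functional.l1_loss(expr_back_a, expr_a)
```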

Video (with audio)

Face capture and retargeting

For each video: (left) input video; (center) capture result; (right) retargeting to another face morphology.
In-the-wild videos have sound (muted by default)



Comparison to state-of-the-art

For each video: (left) input video; (center) EMICA results; (right) SEREP results.
The top row shows results for a frontal camera view; the bottom row shows results for a side view.

MultiREX benchmark

We release a new benchmark, named MultiREX (Multiface Region-based Expression evaluation), that evaluates the geometry estimated by monocular face capture systems on complex expression sequences captured from multiple camera views. In particular, the protocol evaluates mesh deformations related to expression alone, treating the identity as a given.
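A minimal sketch of what "evaluating expression deformations alone" can mean in practice, assuming meshes share the same topology and are rigidly aligned: compare per-vertex offsets from each subject's neutral mesh rather than absolute vertex positions. The function name and exact metric are illustrative, not the benchmark's released code.

```python
# Minimal sketch: per-vertex error between predicted and ground-truth
# expression deformations (offsets from the neutral mesh). Hypothetical
# helper, not the official MultiREX evaluation script.
import numpy as np

def expression_error(pred_verts, pred_neutral, gt_verts, gt_neutral):
    """All inputs are (V, 3) arrays in the same units (e.g., millimetres).
    Returns a (V,) array of L2 errors on the expression deformation."""
    pred_offset = pred_verts - pred_neutral   # predicted deformation
    gt_offset = gt_verts - gt_neutral         # ground-truth deformation
    return np.linalg.norm(pred_offset - gt_offset, axis=-1)
```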

The benchmark is based on the Multiface dataset and includes 8 identities captured simultaneously from five viewpoints: a frontal view, two angled views (yaw rotation of approximately 40 degrees), and two profile views (yaw rotation of approximately 60 degrees). Each subject performs a range-of-motion sequence covering a wide range of expressions, including extreme and asymmetrical motions. The benchmark comprises 10k ground-truth meshes and 49k images.


We obtain the ground-truth identity (i.e., neutral mesh) by manually selecting a neutral frame for each subject and retopologizing the corresponding mesh to the FLAME topology using the Wrap 3D commercial software. From these two meshes (the original Multiface neutral mesh and its FLAME-topology counterpart), we compute a per-subject sparse conversion matrix that enables fast conversion from the FLAME to the Multiface topology and, later, quantitative evaluation.
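Applying such a conversion matrix reduces to a single sparse matrix product. The sketch below assumes a precomputed matrix `W` of shape (V_multiface, V_flame), e.g. built from barycentric weights; the file name and loading step are hypothetical.

```python
# Minimal sketch: convert FLAME-topology vertices to the Multiface topology
# with a precomputed sparse conversion matrix. Hypothetical file name below.
import numpy as np
import scipy.sparse as sp

def flame_to_multiface(flame_verts: np.ndarray, W: sp.csr_matrix) -> np.ndarray:
    """flame_verts: (V_flame, 3) array -> (V_multiface, 3) array."""
    return W @ flame_verts

# Example usage (hypothetical path):
# W = sp.load_npz("subject_01_flame_to_multiface.npz")
# multiface_verts = flame_to_multiface(flame_verts, W)
```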

The benchmark is being made public (coming soon), and we release: (i) the code to download assets, (ii) neutral meshes in the FLAME topology alongside the code to convert between the FLAME and Multiface topologies, and (iii) code to run the benchmark and compute the metrics.

Dataset samples

Camera views and identities included in the MultiREX benchmark, along with their corresponding neutral meshes in the Multiface and FLAME topologies.

BibTeX

@article{josi2024serep,
    title={SEREP: Semantic Facial Expression Representation for Robust In-the-Wild Capture and Retargeting},
    author={Josi, Arthur and Hafemann, Luiz Gustavo and Dib, Abdallah and Got, Emeline and Cruz, Rafael MO and Carbonneau, Marc-Andre},
    journal={arXiv preprint arXiv:2412.14371},
    year={2024}
}