SEREP: Semantic Facial Expression Representation
for Robust In-the-Wild Capture and Retargeting

1Ecole de Technologie Supérieure, 2Ubisoft LaForge
*Equal contribution

SEREP captures and retargets facial expressions from in-the-wild videos. It handles extreme facial expressions and head poses while accounting for the target face morphology.

SEREP face performance capture and retargeting


Abstract

Monocular facial performance capture in-the-wild is challenging due to varied capture conditions, face shapes, and expressions. Most current methods rely on linear 3D Morphable Models, which represent facial expressions independently of identity at the vertex displacement level. We propose SEREP (Semantic Expression Representation), a model that disentangles expression from identity at the semantic level. It first learns an expression representation from unpaired 3D facial expressions using a cycle consistency loss. Then we train a model to predict expression from monocular images using a novel semi-supervised scheme that relies on domain adaptation. In addition, we introduce MultiREX, a benchmark addressing the lack of evaluation resources for the expression capture task. Our experiments show that SEREP outperforms state-of-the-art methods, capturing challenging expressions and transferring them to novel identities.
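To make the cycle-consistency idea concrete, here is a minimal sketch (not the paper's architecture) of one way an expression code could be learned from unpaired meshes: transfer an expression from identity A to identity B, then back to A, and penalize the reconstruction error. The `ExpressionAutoencoder` module, its layer sizes, and the flattened-vertex input format are all assumptions made for illustration.

```python
# Minimal sketch of a cycle-consistency objective for expression transfer.
# Hypothetical modules and dimensions; not the SEREP implementation.
import torch
import torch.nn as nn

class ExpressionAutoencoder(nn.Module):
    """Encodes an expression mesh to a latent code and decodes it onto a
    given neutral (identity) mesh. Inputs are flattened (n_verts * 3) tensors."""
    def __init__(self, n_verts: int, latent_dim: int = 128):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Linear(n_verts * 3, 512), nn.ReLU(), nn.Linear(512, latent_dim))
        self.decode = nn.Sequential(
            nn.Linear(latent_dim + n_verts * 3, 512), nn.ReLU(),
            nn.Linear(512, n_verts * 3))

    def forward(self, expr_mesh, neutral_mesh):
        z = self.encode(expr_mesh)                              # expression code
        recon = self.decode(torch.cat([z, neutral_mesh], dim=-1))
        return z, recon

def cycle_loss(model, expr_a, neutral_a, neutral_b):
    """Transfer A's expression to identity B, re-encode it, map it back to A,
    and compare against the original expression mesh of A."""
    z_a, _ = model(expr_a, neutral_a)
    expr_on_b = model.decode(torch.cat([z_a, neutral_b], dim=-1))   # A's expression on B
    z_b = model.encode(expr_on_b)
    expr_back_a = model.decode(torch.cat([z_b, neutral_a], dim=-1)) # back to identity A
    return nn.functional.l1_loss(expr_back_a, expr_a)
```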

Video (with audio)

Face capture and retargeting

For each video: (left) input video; (center) capture result; (right) retargeting to another face morphology.
In-the-wild videos have sound (muted by default)



Comparison to state-of-the-art

For each video: (left) input video; (center) EMICA results; (right) SEREP results.
The top row shows results for a frontal camera view; the bottom row shows results for a side view.

MultiREX benchmark

We release a new benchmark, named MultiREX (Multiface Region-based Expression evaluation), that evaluates the geometry estimated by monocular face capture systems on complex expression sequences captured from multiple camera views. In particular, the protocol evaluates mesh deformations related to expression alone, treating the identity as a given.
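A minimal sketch of what "evaluating expression deformations alone" can mean in practice, assuming meshes share the same topology and are rigidly aligned: compare per-vertex offsets from each subject's neutral mesh rather than absolute vertex positions. The function name and exact metric are illustrative, not the benchmark's released code.

```python
# Minimal sketch: per-vertex error between predicted and ground-truth
# expression deformations (offsets from the neutral mesh). Hypothetical
# helper, not the official MultiREX evaluation script.
import numpy as np

def expression_error(pred_verts, pred_neutral, gt_verts, gt_neutral):
    """All inputs are (V, 3) arrays in the same units (e.g., millimetres).
    Returns a (V,) array of L2 errors on the expression deformation."""
    pred_offset = pred_verts - pred_neutral   # predicted deformation
    gt_offset = gt_verts - gt_neutral         # ground-truth deformation
    return np.linalg.norm(pred_offset - gt_offset, axis=-1)
```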

The benchmark is based on the Multiface dataset and includes 8 identities captured simultaneously from five viewpoints: a frontal view, two angled views (yaw rotation of approximately 40 degrees), and two profile views (yaw rotation of approximately 60 degrees). Each subject performs a range-of-motion sequence covering a wide range of expressions, including extreme and asymmetrical motions. The benchmark comprises 10k ground-truth meshes and 49k images.


We obtain the ground-truth identity (i.e., neutral mesh) by manually selecting a neutral frame for each subject and retopologizing the corresponding mesh to the FLAME topology using the Wrap 3D commercial software. From these two meshes (the original Multiface neutral mesh and its FLAME-topology counterpart), we compute a per-subject sparse conversion matrix that enables fast conversion from the FLAME to the Multiface topology and, later, quantitative evaluation.
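Applying such a conversion matrix reduces to a single sparse matrix product. The sketch below assumes a precomputed matrix `W` of shape (V_multiface, V_flame), e.g. built from barycentric weights; the file name and loading step are hypothetical.

```python
# Minimal sketch: convert FLAME-topology vertices to the Multiface topology
# with a precomputed sparse conversion matrix. Hypothetical file name below.
import numpy as np
import scipy.sparse as sp

def flame_to_multiface(flame_verts: np.ndarray, W: sp.csr_matrix) -> np.ndarray:
    """flame_verts: (V_flame, 3) array -> (V_multiface, 3) array."""
    return W @ flame_verts

# Example usage (hypothetical path):
# W = sp.load_npz("subject_01_flame_to_multiface.npz")
# multiface_verts = flame_to_multiface(flame_verts, W)
```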

The benchmark is being made public (coming soon), and we release: (i) the code to download assets, (ii) neutral meshes in the FLAME topology alongside the code to convert between the FLAME and Multiface topologies, and (iii) code to run the benchmark and compute the metrics.

Dataset samples

Camera views and identities included in the MultiREX benchmark, along with their corresponding neutral meshes in the Multiface and FLAME topologies.

BibTeX

@article{josi2024serep,
    title={SEREP: Semantic Facial Expression Representation for Robust In-the-Wild Capture and Retargeting},
    author={Josi, Arthur and Hafemann, Luiz Gustavo and Dib, Abdallah and Got, Emeline and Cruz, Rafael MO and Carbonneau, Marc-Andre},
    journal={arXiv preprint arXiv:2412.14371},
    year={2024}
}