Noise-Robust Audio-Visual Speech-Driven Body Language Synthesis

Abstract

With the continuous advancement of video generation, researchers have achieved speech-driven body language synthesis, such as co-speech gestures. However, due to the lack of paired data for visual speech (i.e., lip movements) and body language, existing methods typically rely on audio-only speech, which struggles to synthesize correct results in noisy environments. To overcome this limitation, we propose an \textbf{A}udio-\textbf{V}isual \textbf{S}peech-\textbf{D}riven \textbf{S}ynthesis (\textbf{AV-SDS}) method tailored for body language synthesis, aiming for robust synthesis even under noisy conditions. Since each body language modality is paired with its corresponding audio speech, AV-SDS adopts a two-stage synthesis framework built on speech discrete units, consisting of the \texttt{AV-S2UM} and \texttt{Unit2X} modules; the discrete units serve as carriers that establish a direct mapping from audio-visual speech to each body language modality. Considering the distinct characteristics of different body languages, AV-SDS can be instantiated with either semantic or acoustic discrete units to achieve high-semantic or high-rhythm body language synthesis, respectively. Experimental results demonstrate that AV-SDS achieves superior performance in synthesizing multiple body language modalities in noisy environments, delivering noise-robust body language synthesis.
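To make the two-stage design concrete, the sketch below shows one possible PyTorch realization: an AV-S2UM-style encoder that fuses audio and lip features into a sequence of discrete-unit predictions, and a Unit2X-style decoder that maps those units to per-frame body-language parameters. Apart from the module names AV-S2UM and Unit2X, every layer choice, dimension, and variable here is an illustrative assumption rather than the architecture actually used.

import torch
import torch.nn as nn

class AVS2UM(nn.Module):
    """Hypothetical AV-S2UM-style module: fuses audio and lip features and
    predicts a sequence of speech discrete units (illustrative only)."""
    def __init__(self, audio_dim=80, visual_dim=512, hidden=256, num_units=1000):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.visual_proj = nn.Linear(visual_dim, hidden)
        self.encoder = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.unit_head = nn.Linear(hidden, num_units)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, T, audio_dim); visual_feats: (B, T, visual_dim)
        x = self.audio_proj(audio_feats) + self.visual_proj(visual_feats)
        x, _ = self.encoder(x)
        return self.unit_head(x)               # (B, T, num_units) unit logits

class Unit2X(nn.Module):
    """Hypothetical Unit2X-style decoder: maps discrete units to a per-frame
    pose/expression vector for one target body-language modality."""
    def __init__(self, num_units=1000, hidden=256, pose_dim=135):
        super().__init__()
        self.unit_emb = nn.Embedding(num_units, hidden)
        self.decoder = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.pose_head = nn.Linear(hidden, pose_dim)

    def forward(self, units):
        # units: (B, T) discrete unit indices
        x, _ = self.decoder(self.unit_emb(units))
        return self.pose_head(x)               # (B, T, pose_dim)

if __name__ == "__main__":
    av_s2um, unit2x = AVS2UM(), Unit2X()
    audio = torch.randn(2, 100, 80)            # e.g. log-mel frames (dummy)
    lips = torch.randn(2, 100, 512)            # e.g. lip-region embeddings (dummy)
    units = av_s2um(audio, lips).argmax(dim=-1)
    poses = unit2x(units)
    print(units.shape, poses.shape)            # (2, 100) and (2, 100, 135)

In this kind of setup, the unit vocabulary is the only interface between the two stages, so a Unit2X-style decoder can in principle be retrained per target modality (talking head, 3D landmarks, gestures, mesh) without changing the speech-to-unit front end.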

A. Unit-Based Synthesis for Co-Speech Modalities.

A.1 Talking Head & Lip Movements

[Video grid: Driving Audio | Audio2X | U2S+S2X | U2X]

A.2 3D Landmark

[Video grid: Ground Truth | Audio2X | U2S+S2X | U2X]

A.3 Co-Speech Gesture

[Video grid: Audio2X | U2S+S2X | U2X]

A.4 Mesh

[Video grid: Driving Audio | Ground Truth | Audio2X | U2S+S2X | U2X]



B. Audio-Visual Speech-Driven Body Language Synthesis.

B.1 Audio-Visual Speech-Driven Talking Head Generation

The samples presented here are dubbed with clean audio.

[Video grid: Sample 1 and Sample 2, each showing Driven speech, S2X, and AV-SDS at SNR = 15, 5, -5, and -15 dB]
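For reference, the noisy conditions above can be reproduced by scaling an interfering noise signal against the clean driving speech to hit a target SNR. The snippet below is a minimal, generic sketch of that mixing step; the actual noise types and mixing procedure used for these demos are not specified here, so treat it purely as an illustration of what "SNR = 15/5/-5/-15 dB" means.

import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix a noise signal into clean speech at a target SNR (in dB).

    clean, noise: 1-D float arrays at the same sample rate.
    Returns the noisy mixture used as degraded driving speech.
    """
    # Tile or trim the noise so it matches the clean signal length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10*log10(clean_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)   # stand-in "speech"
    noise = rng.standard_normal(16000)
    for snr in (15, 5, -5, -15):
        noisy = mix_at_snr(clean, noise, snr)
        achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
        print(f"target {snr:>3} dB -> achieved {achieved:5.1f} dB")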

B.2 Audio-Visual Speech-Driven 3D Facial Animation

The samples presented here are dubbed with clean audio.

[Video grid: two samples, each showing Driven speech, S2X, and AV-SDS at SNR = 15, 5, -5, and -15 dB]



C. Speech Enhancement.

[Audio samples at SNR = 15, 5, -5, and -15 dB: Target Audio | Input Audio | Resynthesis | ReVISE (AV-S2UM (semantic)) | AV-S2UM (acoustic) (ours)]
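For context on the semantic vs. acoustic unit variants compared above: speech discrete units are commonly obtained by quantizing frame-level speech features with k-means, where the choice of feature extractor determines whether the units behave more semantically (e.g., features from a self-supervised speech model) or more acoustically (e.g., codec-style encoder features). The sketch below illustrates only the quantization idea with a deliberately simple stand-in extractor; the actual feature models and codebook sizes behind AV-S2UM are assumptions not stated here.

import numpy as np
from sklearn.cluster import KMeans

def extract_frame_features(waveform, frame_len=400, hop=160, dim=39):
    """Stand-in feature extractor: in practice this would be a pretrained
    self-supervised speech model (semantic units) or a codec-style encoder
    (acoustic units). Here we just compute a crude framed spectral summary."""
    frames = [waveform[i:i + frame_len]
              for i in range(0, len(waveform) - frame_len + 1, hop)]
    feats = []
    for f in frames:
        spec = np.abs(np.fft.rfft(f, n=512))[:dim]
        feats.append(np.log(spec + 1e-8))
    return np.stack(feats)                      # (T, dim)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    waveform = rng.standard_normal(16000)       # 1 s of stand-in audio
    feats = extract_frame_features(waveform)

    # Fit a small codebook and map every frame to its nearest cluster index;
    # the resulting index sequence is the "speech discrete unit" sequence.
    kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(feats)
    units = kmeans.predict(feats)
    print(units[:20])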