Noise-Robust Audio-Visual Speech-Driven Body Language Synthesis

Abstract

With the continuous advancement of video generation, researchers have achieved speech-driven body language synthesis, such as co-speech gestures. However, due to the lack of paired data for visual speech (i.e., lip movements) and body language, existing methods typically rely on audio-only speech, which struggles to synthesize correct results in noisy environments. To overcome this limitation, we propose an \textbf{A}udio-\textbf{V}isual \textbf{S}peech-\textbf{D}riven \textbf{S}ynthesis (\textbf{AV-SDS}) method tailored for body language synthesis, aiming for robust synthesis even under noisy conditions. Since each body language modality is paired with its corresponding audio speech, AV-SDS adopts a two-stage synthesis framework built on speech discrete units, consisting of the \texttt{AV-S2UM} and \texttt{Unit2X} modules; the discrete units serve as carriers that establish a direct mapping from audio-visual speech to each body language modality. Considering the distinct characteristics of different body languages, AV-SDS can be instantiated with either semantic or acoustic discrete units to achieve high-semantic or high-rhythm body language synthesis, respectively. Experimental results demonstrate that AV-SDS achieves superior performance in synthesizing multiple body language modalities in noisy environments, delivering noise-robust body language synthesis.
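To make the two-stage design concrete, the sketch below shows one possible PyTorch realization: an AV-S2UM-style encoder that fuses audio and lip features into a sequence of discrete-unit predictions, and a Unit2X-style decoder that maps those units to per-frame body-language parameters. Apart from the module names AV-S2UM and Unit2X, every layer choice, dimension, and variable here is an illustrative assumption rather than the architecture actually used.

import torch
import torch.nn as nn

class AVS2UM(nn.Module):
    """Hypothetical AV-S2UM-style module: fuses audio and lip features and
    predicts a sequence of speech discrete units (illustrative only)."""
    def __init__(self, audio_dim=80, visual_dim=512, hidden=256, num_units=1000):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.visual_proj = nn.Linear(visual_dim, hidden)
        self.encoder = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.unit_head = nn.Linear(hidden, num_units)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, T, audio_dim); visual_feats: (B, T, visual_dim)
        x = self.audio_proj(audio_feats) + self.visual_proj(visual_feats)
        x, _ = self.encoder(x)
        return self.unit_head(x)               # (B, T, num_units) unit logits

class Unit2X(nn.Module):
    """Hypothetical Unit2X-style decoder: maps discrete units to a per-frame
    pose/expression vector for one target body-language modality."""
    def __init__(self, num_units=1000, hidden=256, pose_dim=135):
        super().__init__()
        self.unit_emb = nn.Embedding(num_units, hidden)
        self.decoder = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.pose_head = nn.Linear(hidden, pose_dim)

    def forward(self, units):
        # units: (B, T) discrete unit indices
        x, _ = self.decoder(self.unit_emb(units))
        return self.pose_head(x)               # (B, T, pose_dim)

if __name__ == "__main__":
    av_s2um, unit2x = AVS2UM(), Unit2X()
    audio = torch.randn(2, 100, 80)            # e.g. log-mel frames (dummy)
    lips = torch.randn(2, 100, 512)            # e.g. lip-region embeddings (dummy)
    units = av_s2um(audio, lips).argmax(dim=-1)
    poses = unit2x(units)
    print(units.shape, poses.shape)            # (2, 100) and (2, 100, 135)

In this kind of setup, the unit vocabulary is the only interface between the two stages, so a Unit2X-style decoder can in principle be retrained per target modality (talking head, 3D landmarks, gestures, mesh) without changing the speech-to-unit front end.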

A. Unit-Based Synthesis for Co-Speech Modalities.

A.1 Talking Head & Lip Movements

[Video grid: Driving Audio | Audio2X | U2S+S2X | U2X]

A.2 3D Landmark

[Video grid: Ground Truth | Audio2X | U2S+S2X | U2X]

A.3 Co-Speech Gesture

[Video grid: Audio2X | U2S+S2X | U2X]

A.4 Mesh

[Video grid: Driving Audio | Ground Truth | Audio2X | U2S+S2X | U2X]



B. Audio-Visual Speech-Driven Body Language Synthesis.

B.1 Audio-Visual Speech-Driven Talking Head Generation

The samples presented here are dubbed with clean audio.

[Video grid: Sample 1 and Sample 2, each showing Driven speech, S2X, and AV-SDS at SNR = 15, 5, -5, and -15 dB]
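For reference, the noisy conditions above can be reproduced by scaling an interfering noise signal against the clean driving speech to hit a target SNR. The snippet below is a minimal, generic sketch of that mixing step; the actual noise types and mixing procedure used for these demos are not specified here, so treat it purely as an illustration of what "SNR = 15/5/-5/-15 dB" means.

import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix a noise signal into clean speech at a target SNR (in dB).

    clean, noise: 1-D float arrays at the same sample rate.
    Returns the noisy mixture used as degraded driving speech.
    """
    # Tile or trim the noise so it matches the clean signal length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10*log10(clean_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)   # stand-in "speech"
    noise = rng.standard_normal(16000)
    for snr in (15, 5, -5, -15):
        noisy = mix_at_snr(clean, noise, snr)
        achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
        print(f"target {snr:>3} dB -> achieved {achieved:5.1f} dB")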

B.2 Audio-Visual Speech-Driven 3D Facial Animation

The samples presented here are dubbed with clean audio.

[Video grid: two samples, each showing Driven speech, S2X, and AV-SDS at SNR = 15, 5, -5, and -15 dB]



C. Speech Enhancement.

[Audio samples at SNR = 15, 5, -5, and -15 dB: Target Audio | Input Audio | Resynthesis | ReVISE (AV-S2UM (semantic)) | AV-S2UM (acoustic) (ours)]
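For context on the semantic vs. acoustic unit variants compared above: speech discrete units are commonly obtained by quantizing frame-level speech features with k-means, where the choice of feature extractor determines whether the units behave more semantically (e.g., features from a self-supervised speech model) or more acoustically (e.g., codec-style encoder features). The sketch below illustrates only the quantization idea with a deliberately simple stand-in extractor; the actual feature models and codebook sizes behind AV-S2UM are assumptions not stated here.

import numpy as np
from sklearn.cluster import KMeans

def extract_frame_features(waveform, frame_len=400, hop=160, dim=39):
    """Stand-in feature extractor: in practice this would be a pretrained
    self-supervised speech model (semantic units) or a codec-style encoder
    (acoustic units). Here we just compute a crude framed spectral summary."""
    frames = [waveform[i:i + frame_len]
              for i in range(0, len(waveform) - frame_len + 1, hop)]
    feats = []
    for f in frames:
        spec = np.abs(np.fft.rfft(f, n=512))[:dim]
        feats.append(np.log(spec + 1e-8))
    return np.stack(feats)                      # (T, dim)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    waveform = rng.standard_normal(16000)       # 1 s of stand-in audio
    feats = extract_frame_features(waveform)

    # Fit a small codebook and map every frame to its nearest cluster index;
    # the resulting index sequence is the "speech discrete unit" sequence.
    kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(feats)
    units = kmeans.predict(feats)
    print(units[:20])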