With the continuous advancement of video generation, researchers have achieved speech-driven body language synthesis, such as co-speech gestures. However, due to the lack of data pairing visual speech (i.e., lip movements) with body language, existing methods typically rely on audio speech alone and thus struggle to synthesize correct results in noisy environments. To overcome this limitation, we propose an \textbf{A}udio-\textbf{V}isual \textbf{S}peech-\textbf{D}riven \textbf{S}ynthesis (\textbf{AV-SDS}) method tailored for body language synthesis, aiming at robust synthesis even under noisy conditions. Since each body language modality is paired with its corresponding audio speech, AV-SDS adopts a two-stage synthesis framework built on speech discrete units, consisting of an \texttt{AV-S2UM} module and a \texttt{Unit2X} module; the discrete units serve as carriers that establish a direct mapping from audio-visual speech to each body language modality. Considering the distinct characteristics of different body language modalities, AV-SDS can be instantiated with semantic or acoustic discrete units to achieve high-semantic or high-rhythm body language synthesis, respectively. Experimental results demonstrate that AV-SDS achieves superior performance in synthesizing multiple body language modalities under noise, delivering noise-robust body language synthesis.
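To make the two-stage structure concrete, below is a minimal PyTorch sketch of the unit-as-carrier pipeline. The class names (\texttt{AVS2UM}, \texttt{Unit2X}), feature dimensions, and the gesture-pose output are illustrative assumptions, not the released implementation; the sketch only shows how audio-visual speech is first mapped to discrete units and the units are then decoded into one body language modality.
\begin{verbatim}
# Illustrative sketch of the two-stage AV-SDS pipeline.
# All names, signatures, and dimensions are assumptions for exposition,
# not the authors' implementation.
import torch
import torch.nn as nn

class AVS2UM(nn.Module):
    """Maps audio-visual speech (audio + lip features) to discrete speech units."""
    def __init__(self, audio_dim=80, visual_dim=512, hidden=256, num_units=1000):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.visual_proj = nn.Linear(visual_dim, hidden)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
            num_layers=2)
        self.unit_head = nn.Linear(hidden, num_units)  # logits over unit vocabulary

    def forward(self, audio_feats, lip_feats):
        # Fuse the two speech streams; the visual cue keeps the mapping
        # usable when the audio channel is noisy.
        fused = self.audio_proj(audio_feats) + self.visual_proj(lip_feats)
        return self.unit_head(self.encoder(fused)).argmax(-1)  # (B, T) unit ids

class Unit2X(nn.Module):
    """Decodes discrete speech units into one body language modality (here: poses)."""
    def __init__(self, num_units=1000, hidden=256, pose_dim=135):
        super().__init__()
        self.unit_emb = nn.Embedding(num_units, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.pose_head = nn.Linear(hidden, pose_dim)

    def forward(self, unit_ids):
        out, _ = self.decoder(self.unit_emb(unit_ids))
        return self.pose_head(out)  # (B, T, pose_dim) motion sequence

# Two-stage inference: audio-visual speech -> discrete units -> body language.
audio = torch.randn(1, 120, 80)   # e.g., log-mel frames
lips = torch.randn(1, 120, 512)   # e.g., lip-region visual features
units = AVS2UM()(audio, lips)
motion = Unit2X()(units)
\end{verbatim}
Whether the unit vocabulary is semantic or acoustic is a design choice in this sketch; the abstract's semantic vs. acoustic variants would correspond to different unit extractors feeding the same interface.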