X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

Intra-Lingual Voice Cloning

Given reference speech in one language, the model synthesizes new speech in the same language.

Cross-Lingual Voice Cloning: Less Accent Leakage, Stronger Generalization

Better Intelligibility: Reduced Accent Leakage in Cross-Lingual Synthesis

Under the same prompt and target text, we compare Chinese outputs across models. Our model preserves speaker identity while producing cleaner pronunciation with less accent leakage.

Reference (Italian)

Target text: (Chinese) 凯尔投资的钱与其他投资者一样人间蒸发。

Stronger Generalization: No Need for Reference Audio Transcripts

Any voice input is supported, including dialectal and unwritten speech. No reference transcript required.
Below, we use Cantonese and a dialect from Hunan, China (both unseen during training) as reference audio and generate speech in 30 languages.

Reference (Cantonese)

Reference (Hunan Dialect)

Emotion and Speech-Rate Synthesis

This section demonstrates emotion-aware voice cloning and generation at variable speech rates.