M-Voice

Revisiting Voice Large Language Models as Scalable Multi-Lingual and Multi-Task Learners

Anonymous Authors

Abstract. Large language models (LLMs) have successfully served as a general-purpose interface across multiple tasks and languages, whereas voice LLMs are mostly adapted for specific purposes (either single-task or monolingual), so the advantages of LLMs, especially for low-resource language processing and zero-shot task generalization, remain underexploited in the audio community. To bridge this gap, we introduce MVoice, a multi-modal voice LLM, and conduct a comprehensive study of its ability to handle multiple tasks and languages. Trained on ~200K hours of 6-language data for 4 voice generation applications, MVoice shows notable advantages: 1) as a scalable learner, it improves performance with end-to-end local and global multiscale transformers; 2) as a multitask learner, it shares common knowledge across modalities (speech/singing) by adjusting prompts and exhibits in-context learning by generalizing to unseen tasks it was not explicitly trained on; and 3) as a multilingual learner, it alleviates the data scarcity of low-resource languages by including high-resource language training data. Experimental results demonstrate that MVoice achieves superior audio quality and style similarity compared with competitive baseline models in monolingual and cross-lingual voice generation.

Overview

Text to Speech
Voice Conversion
Singing Voice Synthesis
Singing Voice Conversion
Crosslingual Text to Speech
Comparison with VALL-E
Comparison with SPEAR-TTS
Comparison with Voicebox
Ablation on Model Scales
Ablation on Multilingual / Monolingual Data
Cross Lingual Style Transfer
Emotion Transfer
Noise Condition Transfer

Text to Speech

In this section, we compare our generated audio samples with those of other systems on the text-to-speech task.

Text	GT	GenerSpeech	YourTTS	M-Voice (ours)

Voice Conversion

In this section, we compare our generated audio samples with those of other systems on the voice-conversion task.

Source Audio	Prompt	NANSY	ppg-vc	M-Voice (ours)

Singing Voice Synthesis

In this section, we compare our generated audio samples with those of other systems on the singing-voice-synthesis task.

Text	Ground-truth	FFT-Singer	DiffSinger	M-Voice (ours)

Singing Voice Conversion

In this section, we provide generated audio samples for the singing-voice-conversion task.

Ground-truth	Prompt	M-Voice (ours)

Crosslingual Text to Speech

In this section, we compare our generated audio samples with those of other systems on the crosslingual text-to-speech task.

X-to-English

Text	Prompt Language	Prompt	Voicebox	YourTTS	M-Voice (ours)
	French
	Spanish
	German

X-to-German

Text	Prompt Language	Prompt	Voicebox	YourTTS	M-Voice (ours)
	English
	French
	Spanish

X-to-Chinese

Text	Prompt Language	Prompt	YourTTS	M-Voice (ours)
	German
	English
	Spanish
	French

Comparison with VALL-E

In this section, we compare our results with demo samples of VALL-E.

Text	Prompt	VALL-E	M-Voice (ours)

Comparison with SPEAR-TTS

In this section, we compare our results with demo samples of SPEAR-TTS.

Text	Prompt	SPEAR-TTS	M-Voice (ours)

Comparison with Voicebox

In this section, we compare our results with demo samples of Voicebox.

Text	Prompt	Voicebox	M-Voice (ours)

Ablation on Model Scales

In this section, we compare the generated audio samples of models with different sizes.

Text to Speech

Text	GT	Base	Medium	Large

Voice Conversion

Source	Prompt	Base	Medium	Large

Ablation on Multilingual / Monolingual Data

In this section, we compare the generated audio samples of monolingual and multilingual models.

Text to Speech

Text	Language	Prompt	Monolingual	Multilingual
	German
	German
	German
	French
	French

Voice Conversion

Source	Language	Prompt	Monolingual	Multilingual
	German
	German
	German
	German
	French
	French

Cross Lingual Style Transfer

In this section, we provide samples that demonstrate the cross-lingual style transfer ability of our model.

Source Audio	Prompt	M-Voice

Emotion Transfer

In this section, we provide samples that demonstrate the emotion transfer ability of our model.

Emotion	Source Audio	Prompt	M-Voice
Angry
Sad
Happy

Noise Condition Transfer

In this section, we provide samples that demonstrate the noise condition transfer ability of our model.

Text or Source	Prompt	M-Voice
