Daft-Exprt: Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis

Paper: [arXiv]

Authors: Julian Zaïdi, Hugo Seuté, Benjamin van Niekerk, Marc-André Carbonneau

Abstract: This paper presents Daft-Exprt, a multi-speaker acoustic model advancing the state-of-the-art for cross-speaker prosody transfer on any text. This is one of the most challenging, and rarely directly addressed, tasks in speech synthesis, especially for highly expressive data. Daft-Exprt uses FiLM conditioning layers to strategically inject different prosodic information in all parts of the architecture. The model explicitly encodes traditional low-level prosody features such as pitch, loudness and duration, but also higher-level prosodic information that helps generate convincing voices in highly expressive styles. Speaker identity and prosodic information are disentangled through an adversarial training strategy that enables accurate prosody transfer across speakers. Experimental results show that Daft-Exprt significantly outperforms strong baselines on inter-text cross-speaker prosody transfer tasks, while yielding naturalness comparable to state-of-the-art expressive models. Moreover, results indicate that the model discards speaker identity information from the prosody representation, and consistently generates speech with the desired voice.


Contents

Prosody Transfer & Naturalness
Prosody and Speaker Identity Disentanglement
Supplementary Results

Seen Reference Speakers
Unseen Reference Speakers

Local Prosody Control

Prosody Transfer & Naturalness

We present some speech samples that were used in Sections 3.3 and 3.4 of our paper.
We compare our proposed model Daft-Exprt against several strong baseline models capable of prosody transfer.
Prosody is extracted from a single reference utterance (leftmost column) and is transferred to a different speaker and text.

Model comparison
Reference	Daft-Exprt	GST-Tacotron	VAE-Tacotron	Flowtron
"Had this been common practice?"	"Breaking the threads gently, one by one."	"Breaking the threads gently, one by one."	"Breaking the threads gently, one by one."	"Breaking the threads gently, one by one."
"It probably means death."	"For the relief of physical suffering."	"For the relief of physical suffering."	"For the relief of physical suffering."	"For the relief of physical suffering."
"You are not so important after all, Pau Amma, he said."	"On the mossy trunk of the fallen tree."	"On the mossy trunk of the fallen tree."	"On the mossy trunk of the fallen tree."	"On the mossy trunk of the fallen tree."

Prosody and Speaker Identity Disentanglement

We present some speech samples that were used in Section 3.5 of our paper.
We evaluate the impact of the adversarial loss on the disentanglement between prosody and speaker identity.
Prosody is extracted from a single reference utterance (leftmost column) and is transferred to a different speaker and text.
All samples are generated using Daft-Exprt.

Daft-Exprt
Reference	No Adversarial Loss (λ_a = 0.)	Medium Adversarial Loss (λ_a = 0.01)	Strong Adversarial Loss (λ_a = 1.)
"Please, let me in!"	"But on the table she saw the half of a stale loaf."	"But on the table she saw the half of a stale loaf."	"But on the table she saw the half of a stale loaf."
"Are you going to tell a story?"	"And unsubstantial, and transparent, instead it had gone opaque."	"And unsubstantial, and transparent, instead it had gone opaque."	"And unsubstantial, and transparent, instead it had gone opaque."
"I haven't the least idea what you're talking about, said Alice."	"And as he drew the cover over her knees, he simply said."	"And as he drew the cover over her knees, he simply said."	"And as he drew the cover over her knees, he simply said."

Supplementary Results

We present additional prosody transfer examples where we showcase the ability of Daft-Exprt to transfer prosody in different situations.
We show that the model can transfer expressive prosody to speakers for which we have only neutral training samples.
We also show that it can transfer prosody from speakers never seen during training.
For simplicity, we fix the text to "Recent advancements in speech synthesis are exciting".
Prosody is extracted from a single reference utterance (leftmost column) and is transferred to several speakers.
All samples are generated using Daft-Exprt.

Seen Reference Speakers

Daft-Exprt
Reference	Expressive Target Speakers			Neutral Target Speakers
Reference	Speaker 1	Speaker 2	Speaker 3	Speaker 4	Speaker 5	Linda Johnson [LJ Speech]
"A yacht slid around the point into the bay."
"This new fact sent them spinning into the background."
"Had she enjoyed the experience?"
"He seized Richmond by the arm and led him to the door."
"I shall, this very night."
"The desk and both chairs were painted tan."
"Ponto slept on the rug."

Unseen Reference Speakers

Daft-Exprt
Reference	Expressive Target Speakers			Neutral Target Speakers
Reference	Speaker 1	Speaker 2	Speaker 3	Speaker 4	Speaker 5	Linda Johnson [LJ Speech]
"For the first time in her life she had been danced tired."
"That's right! shouted the Queen."
"Pull his canoe home with your line, Fisherman."
"Not in that way."
"I charge you by all that is sacred, not to attempt concealment."
"Kids are talking by the door"
"Dogs are sitting by the door"

Local Prosody Control

Daft-Exprt is capable of fine-grained prosody control because it predicts prosodic attributes such as duration and pitch at the phoneme level.
We showcase some examples of local prosody control combined with prosody transfer.
Prosody is extracted from a single reference utterance (leftmost column) and is transferred to a different speaker and text. Then, one prosodic attribute is modified before generation.
All samples are generated using Daft-Exprt.
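
The kind of edits applied to the predicted phoneme-level values can be sketched as follows. This is a minimal illustration, not the Daft-Exprt implementation: the function names are hypothetical, durations are assumed to be frame counts, pitch is assumed to be in Hz with 0.0 marking unvoiced phonemes, and "amplified", "inverted" and "flattened" are assumed to mean scaling the contour around its voiced mean by a gain above 1, of -1, and of 0 respectively. The actual model may work on normalized or log-scale values.

```python
# Hypothetical helpers: none of these names come from the Daft-Exprt code base.

def scale_durations(durations, factors):
    """Scale per-phoneme durations (assumed to be frame counts).

    `factors` is either one number applied to every phoneme or a list with
    one factor per phoneme. A factor above 1 slows a phoneme down.
    """
    if isinstance(factors, (int, float)):
        factors = [factors] * len(durations)
    if len(factors) != len(durations):
        raise ValueError("need one factor per phoneme")
    if any(f <= 0 for f in factors):
        raise ValueError("factors must be positive")
    # Keep at least one frame per phoneme so that no phoneme disappears.
    return [max(1, round(d * f)) for d, f in zip(durations, factors)]


def voiced_mean(pitch):
    """Mean pitch over voiced phonemes (assumed convention: 0.0 = unvoiced)."""
    voiced = [p for p in pitch if p > 0]
    return sum(voiced) / len(voiced) if voiced else 0.0


def shift_pitch(pitch, offset_hz):
    """Add a constant offset in Hz to voiced phonemes, floored at 1 Hz."""
    return [max(p + offset_hz, 1.0) if p > 0 else 0.0 for p in pitch]


def scale_pitch_around_mean(pitch, gain):
    """Rescale the contour around its voiced mean.

    gain > 1 amplifies, gain = -1 inverts, gain = 0 flattens.
    Large gains can push values below zero; a real system would clip them.
    """
    mean = voiced_mean(pitch)
    return [mean + gain * (p - mean) if p > 0 else 0.0 for p in pitch]


durations = [4, 10, 6]              # frames per phoneme (made-up values)
pitch = [100.0, 0.0, 200.0, 150.0]  # Hz per phoneme (made-up values)

slow = scale_durations(durations, 1.5)               # whole sentence slower
local = scale_durations(durations, [1.0, 2.0, 0.5])  # per-phoneme control
shifted = shift_pitch(pitch, 50.0)
amplified = scale_pitch_around_mean(pitch, 2.0)
inverted = scale_pitch_around_mean(pitch, -1.0)
flattened = scale_pitch_around_mean(pitch, 0.0)
```

Treating amplification, inversion and flattening as one gain parameter keeps the sketch small; unvoiced phonemes are left untouched in every case.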

Daft-Exprt - Speed Control
Reference	Normal	Slow	Fast
"A shocked gasp arose from the circle."	"It is possible to control the speed of each phoneme in the sentence."	"It is possible to control the speed of each phoneme in the sentence."	"It is possible to control the speed of each phoneme in the sentence."
"There was nothing the matter with Pete Allen's shooting or with his nerve."	"It is possible to control the speed of each phoneme in the sentence."	"It is possible to control the speed of each phoneme in the sentence."	"It is possible to control the speed of each phoneme in the sentence."

Daft-Exprt - Pitch Control
Reference	Normal	Pitch Shift (+50Hz)	Pitch Shift (-50Hz)	Pitch Amplified	Pitch Inverted	Pitch Flattened
"Do you scold them for not admiring her?"	"It is possible to control the pitch of each phoneme in the sentence."	"It is possible to control the pitch of each phoneme in the sentence."	"It is possible to control the pitch of each phoneme in the sentence."	"It is possible to control the pitch of each phoneme in the sentence."	"It is possible to control the pitch of each phoneme in the sentence."	"It is possible to control the pitch of each phoneme in the sentence."
"Kick the ball straight, and follow through?"	"It is possible to control the pitch of each phoneme in the sentence."	"It is possible to control the pitch of each phoneme in the sentence."	"It is possible to control the pitch of each phoneme in the sentence."	"It is possible to control the pitch of each phoneme in the sentence."	"It is possible to control the pitch of each phoneme in the sentence."	"It is possible to control the pitch of each phoneme in the sentence."