Rhythm Modeling for Voice Conversion

Authors: Benjamin van Niekerk, Marc-André Carbonneau, Herman Kamper

Abstract: Voice conversion aims to transform source speech into a different target voice. However, typical voice conversion systems do not account for rhythm, which is an important factor in the perception of speaker identity. To bridge this gap, we introduce Urhythmic, an unsupervised method for rhythm conversion that does not require parallel data or text transcriptions. Using self-supervised representations, we first divide source audio into segments approximating sonorants, obstruents, and silences. Then we model rhythm by estimating speaking rate or the duration distribution of each segment type. Finally, we match the target speaking rate or rhythm by time-stretching the speech segments. Experiments show that Urhythmic outperforms existing unsupervised methods in terms of quality and prosody.
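As a rough illustration of the global variant described in the abstract, the sketch below derives a single time-stretch factor from source and target speaking rates. It is not the released implementation: the `Segment` class, the helper names, and the rate proxy (sonorant segments per second of non-silent speech) are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical segment record; not part of the released code."""
    kind: str     # "sonorant", "obstruent", or "silence"
    start: float  # seconds
    end: float    # seconds


def speaking_rate(segments):
    """Assumed proxy: sonorant segments per second of non-silent speech."""
    speech = [s for s in segments if s.kind != "silence"]
    total = sum(s.end - s.start for s in speech)
    count = sum(1 for s in speech if s.kind == "sonorant")
    return count / total


def global_stretch_factor(source_rate, target_rate):
    """A factor above 1 lengthens the source (the target speaks more slowly)."""
    return source_rate / target_rate


def stretch(segments, factor):
    """Rescale every segment boundary by the same factor."""
    return [Segment(s.kind, s.start * factor, s.end * factor) for s in segments]
```

With this convention, a source at 4 segments per second and a target at 2 segments per second gives a factor of 2, so every segment is played back at twice its original duration.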

Code and pretrained models are available here.


Samples from Urhythmic

This section contains speech samples from our fine-grained approach.
We pick the three fastest and three slowest speakers from VCTK. To avoid conflating accent and speaker identity, we limit the selection to a single region (Southern England).
The target speakers in the following table are ordered by speaking rate from slowest on the left to fastest on the right.

	targets
	p228	p268	p225	p232	p257	p231
source

Comparison to baselines

Here we present samples from the subjective evaluations.
We compare against two baselines: AutoPST and DISSC.

source	target	AutoPST	DISSC	Urhythmic global	Urhythmic fine

Duration Control

Next, we use Urhythmic to edit the rhythm of an utterance without supervision. For this example, we cut the dendrogram (Fig. 4 in the paper) into eight clusters to illustrate control over the different sound types (vowel, approximant, nasal, fricative, stop, silence).
We visualize the effect of stretching or contracting some clusters in Fig. 2.

Fig. 2: An example of fine-grained rhythm control. (middle) The converted utterance without modification. The result of (top) stretching segments corresponding to nasals, or (bottom) shortening vowels.

Next, we stretch the segments in different clusters to slow down different sound types by a factor of two.

sound type	cluster#
no-modification
fricatives	6,7
vowels	2,4
silences	0
stops	3
approximants	1
nasals	5
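The edit behind these samples can be sketched as follows, assuming each utterance is already segmented into contiguous `(cluster_id, start, end)` tuples in seconds. That representation and the function name `stretch_sound_type` are hypothetical. The cluster grouping is copied from the table above, and the sketch only recomputes segment boundaries; it does not resample any audio.

```python
# Cluster-to-sound-type grouping, copied from the table above.
SOUND_TYPE_CLUSTERS = {
    "fricatives": {6, 7},
    "vowels": {2, 4},
    "silences": {0},
    "stops": {3},
    "approximants": {1},
    "nasals": {5},
}


def stretch_sound_type(segments, sound_type, factor=2.0):
    """Return new (cluster, start, end) tuples with one sound type stretched.

    `segments` is a hypothetical list of (cluster_id, start, end) tuples,
    contiguous and sorted by time. Segments outside the chosen sound type
    keep their duration but shift later to make room.
    """
    selected = SOUND_TYPE_CLUSTERS[sound_type]
    out = []
    cursor = segments[0][1] if segments else 0.0
    for cluster, start, end in segments:
        duration = end - start
        if cluster in selected:
            duration *= factor
        out.append((cluster, cursor, cursor + duration))
        cursor += duration
    return out
```

For instance, calling `stretch_sound_type(segments, "vowels")` doubles the duration of every segment in clusters 2 and 4 and leaves the duration of all other segments unchanged.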

Supplementary Results

This section presents an additional visualization of the correlation between the estimated speaking rate and the syllable rate.