Ubisoft La Forge, Stellenbosch University

Rhythm Modeling for Voice Conversion

Authors: Benjamin van Niekerk, Marc-André Carbonneau, Herman Kamper

Abstract: Voice conversion aims to transform source speech into a different target voice. However, typical voice conversion systems do not account for rhythm, which is an important factor in the perception of speaker identity. To bridge this gap, we introduce Urhythmic, an unsupervised method for rhythm conversion that does not require parallel data or text transcriptions. Using self-supervised representations, we first divide source audio into segments approximating sonorants, obstruents, and silences. Then we model rhythm by estimating speaking rate or the duration distribution of each segment type. Finally, we match the target speaking rate or rhythm by time-stretching the speech segments. Experiments show that Urhythmic outperforms existing unsupervised methods in terms of quality and prosody.
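
As a rough illustration of the global (speaking-rate) variant described in the abstract, the sketch below estimates a rate from a segmentation and derives a single time-stretch factor. It is not the released implementation: the function names, the (label, start, end) segment format, and the definition of speaking rate as sonorant segments per second are assumptions made for this example, and the segment times are made-up toy values.

```python
# Hypothetical sketch: names, data layout, and the rate definition are
# assumptions for illustration, not the authors' code.

def speaking_rate(segments):
    """Estimate speaking rate as sonorant segments per second (assumed definition).

    `segments` is a list of (label, start, end) tuples with times in seconds.
    """
    total = segments[-1][2] - segments[0][1]
    sonorants = sum(1 for label, _, _ in segments if label == "sonorant")
    return sonorants / total


def global_stretch_factor(source_rate, target_rate):
    """Duration multiplier that maps the source rate onto the target rate."""
    return source_rate / target_rate


def stretch_uniform(segments, factor):
    """Scale every segment duration by `factor`, re-timing from zero."""
    stretched, cursor = [], 0.0
    for label, start, end in segments:
        duration = (end - start) * factor
        stretched.append((label, cursor, cursor + duration))
        cursor += duration
    return stretched


# Toy segmentation (made-up values): two sonorants in one second of audio.
source = [
    ("silence", 0.0, 0.2),
    ("sonorant", 0.2, 0.5),
    ("obstruent", 0.5, 0.6),
    ("sonorant", 0.6, 1.0),
]
factor = global_stretch_factor(speaking_rate(source), target_rate=4.0)
converted = stretch_uniform(source, factor)
```

A factor below one shortens the utterance (a faster target speaker), and a factor above one lengthens it. The fine-grained variant would instead pick a separate factor per segment type from the duration distributions.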

Code and pretrained models are available here.


Samples from Urhythmic

This section contains speech samples from our fine-grained approach.
We pick the three fastest and three slowest speakers from VCTK. To avoid conflating accent and speaker identity, we limit the selection to a single region (Southern England).
The target speakers in the following table are ordered by speaking rate from slowest on the left to fastest on the right.

p228 p268 p225 p232 p257 p231

Comparison to baselines

Here we present samples from the subjective evaluations.
We compare against two baselines: AutoPST and DISSC.

source | target | AutoPST | DISSC | Urhythmic (global) | Urhythmic (fine)

Duration Control

Next, we use Urhythmic to edit the rhythm of an utterance without supervision. For this example, we cut the dendrogram (Fig.4 in the paper) into eight clusters to illustrate control over the different sound types (vowel, approximant, nasal, fricative, stop, silence).
We visualize the effect of stretching or contracting some clusters in Fig.2.

Fig.2 - An example of fine-grained rhythm control. (middle) The converted utterance without modification. The result of (top) stretching segments corresponding to nasals, or (bottom) shortening vowels.

Next, we stretch the segments in different clusters to slow down different sound types by a factor of two.

sound type | cluster #
fricatives | 6, 7
vowels | 2, 4
silences | 0
stops | 3
approximants | 1
nasals | 5
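
The factor-of-two slowdown of a single sound type can be sketched as follows. The cluster mapping is copied from the table above; everything else (the function name, the (cluster, duration) segment format, and the toy durations) is a hypothetical illustration rather than the released code, and the actual audio time-stretching step is omitted.

```python
# Cluster ids per sound type, taken from the table above.
SOUND_TYPE_CLUSTERS = {
    "fricatives": (6, 7),
    "vowels": (2, 4),
    "silences": (0,),
    "stops": (3,),
    "approximants": (1,),
    "nasals": (5,),
}


def stretch_sound_type(segments, sound_type, factor=2.0):
    """Return new (cluster, duration) pairs with one sound type stretched.

    `segments` is a list of (cluster_id, duration_in_seconds) pairs
    (a hypothetical format). Only segments whose cluster belongs to
    `sound_type` are scaled; all others keep their duration.
    """
    selected = set(SOUND_TYPE_CLUSTERS[sound_type])
    return [
        (cluster, duration * factor if cluster in selected else duration)
        for cluster, duration in segments
    ]


# Toy utterance (made-up durations): silence, vowel, nasal, vowel.
utterance = [(0, 0.10), (2, 0.20), (5, 0.05), (4, 0.15)]
slow_vowels = stretch_sound_type(utterance, "vowels", factor=2.0)
```

Passing a factor below one shortens the chosen sound type instead, as in the bottom panel of Fig.2.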

Supplementary Results

This section presents an additional visualization of the correlation between the estimated speaking rate and the syllable rate.

Fig.3 - Syllable rate versus estimated speaking rate using Urhythmic. Each point represents a speaker from the LibriSpeech dev or test split.
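
For readers who want to reproduce this kind of comparison on their own data, a correlation between per-speaker syllable rates and estimated speaking rates could be computed along these lines. This is a minimal sketch assuming a Pearson correlation; the page does not state which correlation measure was used, the function name is hypothetical, and the numbers below are made-up placeholders, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder per-speaker values (made up for illustration only).
syllable_rates = [3.8, 4.1, 4.5, 5.0]
estimated_rates = [3.5, 3.9, 4.4, 4.8]
r = pearson(syllable_rates, estimated_rates)
```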