Controllable Context-aware Conversational Speech Synthesis

Jian Cong¹, Shan Yang², Na Hu², Guangzhi Li², Lei Xie¹, Dan Su²
¹ Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
² Tencent AI Lab, China

Abstract

In spoken conversation, spontaneous behavior is common, and conversational partners tend to align features of their speech with those of their interlocutor, a phenomenon known as entrainment. To produce human-like conversations, we propose a unified, controllable, spontaneous conversational speech synthesis framework that models both phenomena. Specifically, we use explicit labels in the acoustic model to represent two typical spontaneous behaviors, filled pause and prolongation, and develop a neural-network-based predictor to predict the occurrences of the two behaviors from text. We then develop an algorithm based on the predictor to control the occurrence frequency of the behaviors, so that the synthesized speech can vary in disfluency. To model speech entrainment at the acoustic level, we use a context acoustic encoder to extract a global style embedding from the previous speech, which conditions the synthesis of the current speech. Furthermore, since the current and previous utterances belong to different speakers in a conversation, we add a domain adversarial training module that eliminates speaker-related information in the acoustic encoder while retaining style-related information. Experiments show that our proposed approach can synthesize realistic conversations and control the occurrences of the spontaneous behaviors naturally.

Evaluation on acoustic model
Controllability of the occurrence frequency of the spontaneous behaviors
Comparison of predicted label and random label

1. Evaluation on acoustic model

M1: baseline Tacotron-like model without spontaneous labels.
M2: baseline model with explicit spontaneous labels.
M3: M2 with extra BERT encoder.
M4: M3 with audio context encoder to model entrainment.

Samples (one audio clip per model, M1 to M4, for each text below):
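A: 这个景点在哪儿？ B: 嗯[filled-pause]这个也不远，就在东城区景山前街那边。
(English: A: Where is this scenic spot? B: Um [filled-pause], it's not far either, it's over on Jingshan Front Street in Dongcheng District.)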

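A: 哎[filled-pause]，那个[prolongation]学校的事儿，你帮我问了吗。 B: 呃[pl+fp]，学校的事[prolongation]，我朋友说他过几天才有空，嗯[pl+fp]你别太担心了
(English: A: Hey [filled-pause], about that [prolongation] school matter, did you ask for me? B: Uh [pl+fp], the school matter [prolongation], my friend says he won't be free for a few days, um [pl+fp], don't worry too much.)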

A: 你哪儿不舒服啊? B: 我跟你说过，我感冒了。
(English: A: Where do you feel unwell? B: I told you, I have a cold.)

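A: 你就别装了。 B: 哎，我原本不想告诉你的，我的脑袋里啊长了个瘤。
(English: A: Stop pretending. B: Sigh, I didn't want to tell you, but there is a tumor growing in my head.)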
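
The sample texts above mark spontaneous behaviors with inline bracketed tags such as [filled-pause] and [prolongation]. As a rough illustration of how such tagged text could be turned into explicit per-character labels for an acoustic model, here is a minimal Python sketch. It is not the authors' preprocessing: the function name `split_labels` is hypothetical, and the sketch assumes that each tag applies to the single character immediately before it and treats tag names as opaque strings.

```python
import re

# One non-bracket character, optionally followed by a bracketed tag,
# e.g. "嗯[filled-pause]". Assumption: a tag labels only the character
# directly in front of it.
_TOKEN = re.compile(r"([^\[\]])(?:\[([^\[\]]+)\])?")


def split_labels(text):
    """Return (character, label) pairs; label is None for untagged characters."""
    return [(m.group(1), m.group(2)) for m in _TOKEN.finditer(text)]


if __name__ == "__main__":
    for char, label in split_labels("嗯[filled-pause]这个也不远"):
        print(char, label)
```

A word-level variant would need a word segmenter to decide how far back a tag reaches; the character-level assumption keeps the sketch self-contained.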

2. Controllability of the occurrence frequency of the spontaneous behaviors

Synthesized samples with different occurrence frequencies of the spontaneous behaviors. As the frequency increases, the speech goes from less to more disfluent.
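
The abstract mentions a predictor-based algorithm for controlling how often the behaviors occur, but this page does not describe it. One plausible way to do it, shown purely as an assumption, is to rank token positions by the predictor's probability and label the top fraction. In the sketch below, `select_positions`, `probs`, and `rate` are hypothetical names, and the probabilities are made-up illustrative values rather than model outputs.

```python
def select_positions(probs, rate):
    """Pick the token positions most likely to carry a spontaneous behavior.

    probs: per-token probabilities from a (hypothetical) behavior predictor.
    rate:  desired fraction of tokens to label, between 0 and 1.
    Returns the chosen positions in ascending order.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    k = round(rate * len(probs))
    # Rank positions from most to least probable, then keep the top k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    illustrative_probs = [0.1, 0.9, 0.3, 0.7]  # made-up values
    for rate in (0.0, 0.5, 1.0):
        print(rate, select_positions(illustrative_probs, rate))
```

Raising `rate` labels more positions, which corresponds to the more disfluent end of the samples; a fixed probability threshold would be an alternative design with the same intent.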