Abstract: In contrast to the conventional "front-end"--"back-end"--"vocoder" pipeline, attention-based end-to-end speech synthesis systems map a text sequence directly to an acoustic feature sequence and are trained as a whole. More recently, a more computationally efficient architecture named Transformer, based solely on self-attention, was proposed to model global dependencies between the input and output sequences. However, despite its many advantages, the Transformer lacks position information in its structure. Moreover, the weighted-sum form of self-attention may disperse attention over the whole input sequence rather than focusing on the more important neighbouring positions. To address these problems, this paper proposes a hybrid structure that combines self-attention with recurrent neural networks (RNNs), and further enhances the proposed structure with relative-position-aware biases. Experiments are conducted to compare the proposed hybrid structure with other comparable speech synthesis structures. Mean opinion score (MOS) test results indicate that, when enhanced with relative-position-aware biases, the hybrid self-attention structure achieves the best performance, with a MOS only 0.11 lower than that of natural recordings.
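The relative-position-aware (RPA) biases mentioned in the abstract add a learned, distance-dependent term to the attention scores, so that each query position can weight nearby keys differently from distant ones. As a rough illustration only, the NumPy sketch below follows the relative-position formulation of Shaw et al. (2018), which is one common way to realize such biases; the exact formulation used in this paper is not given on this page, and the clipping distance, shapes, and variable names here are illustrative assumptions, not the system described in the paper.

# Minimal sketch (NumPy) of single-head self-attention with relative-position-aware
# biases, in the spirit of Shaw et al. (2018). Shapes, names, and max_dist are
# illustrative assumptions; this is not the paper's exact implementation.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relative_self_attention(x, Wq, Wk, Wv, rel_k, max_dist=16):
    """x: (T, d) input sequence; Wq/Wk/Wv: (d, d) projections;
    rel_k: (2*max_dist + 1, d) learned relative-position embeddings for the keys."""
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv

    # Relative distance j - i, clipped to [-max_dist, max_dist] and shifted to index rel_k.
    idx = np.clip(np.arange(T)[None, :] - np.arange(T)[:, None], -max_dist, max_dist) + max_dist
    a = rel_k[idx]                                   # (T, T, d) position-dependent key bias

    # Content term q_i . k_j plus relative-position term q_i . a_{ij}.
    scores = (q @ k.T + np.einsum('id,ijd->ij', q, a)) / np.sqrt(d)
    return softmax(scores) @ v                        # (T, d)

# Toy usage: a short input sequence of length 5 with model dimension 8.
rng = np.random.default_rng(0)
T, d = 5, 8
x = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
rel_k = rng.normal(size=(2 * 16 + 1, d)) * 0.1
out = relative_self_attention(x, Wq, Wk, Wv, rel_k)
print(out.shape)  # (5, 8)

In the hybrid structure described in the abstract, such an attention block would be combined with an RNN component; that combination is not shown in this sketch.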
Figure 1. MOS values of different systems
1. Comparing the purely self-attention-based model with and without relative-position-aware (RPA) biases (SELF-P vs. SELF-R):
1.1. The top time for severe weather in the south-central Plains and the Ohio Valley is late spring.
Natural:
SELF-P (without RPA):
SELF-R (with RPA):
1.2. I made a raft of far-reaching promises and improbable bargains.
Natural:
SELF-P (without RPA):
SELF-R (with RPA):
1.3. The final report is due on October twenty seventh.
Natural:
SELF-P (without RPA):
SELF-R (with RPA):
1.4. In cancer, cells ignore the normal signals from the body and proliferate uncontrollably.
Natural:
SELF-P (without RPA):
SELF-R (with RPA):
2. Using a CNN pre-net, comparing the self-attention tower with and without relative-position-aware (RPA) biases (CNN-P vs. CNN-R):
2.1. Daughtry elaborated on the counting trick by bringing Cocky along.
Natural:
CNN-P (without RPA):
CNN-R (with RPA):
2.2. The reorganization of these countries took the form of revolution.
Natural:
CNN-P (without RPA):
CNN-R (with RPA):
2.3. We threaten to be of the one mind before the voyage is completed.
Natural:
CNN-P (without RPA):
CNN-R (with RPA):
2.4. Those later years are, of course, a time of adjustments.
Natural:
CNN-P (without RPA):
CNN-R (with RPA):
3. Comparing the CBHG tower with and without relative-position-aware (RPA) biases (CBHG-P vs. CBHG-R):
3.1. Just go to Google dot org and then check out flu trends.
Natural:
CBHG-P (without RPA):
CBHG-R (with RPA):
3.2. They have another more powerful drug coursing through their veins.
Natural:
CBHG-P (without RPA):
CBHG-R (with RPA):
3.3. A big push is under way to step up the live cattle trade between northern Australia and Asia.
Natural:
CBHG-P (without RPA):
CBHG-R (with RPA):
3.4. He may toss coins; he may take slips of paper from a grab bag.
Natural:
CBHG-P (without RPA):
CBHG-R (with RPA):
4. Comparing the hybrid system with and without relative-position-aware (RPA) biases (HYBRID-P vs. HYBRID-R):
4.1. I also understand that similar branch organizations have made their appearance in Europe.
Natural:
HYBRID-P (without RPA):
HYBRID-R (with RPA):
4.2. Edward Chandler is a board observer for Lessac Technologies.
Natural:
HYBRID-P (without RPA):
HYBRID-R (with RPA):
4.3. All of these have been policies of Mayor Giuliani.
Natural:
HYBRID-P (without RPA):
HYBRID-R (with RPA):
4.4. She is messianic in sharing her linguistic, spatial, and personal intelligence gifts.