Samples are from a model trained on the Blizzard 2013 database. The 2013 Challenge provides about 200 hours of unsegmented speech and 9741 segmented waveforms. I ran all the experiments on the 9741 segmented waveforms, since it is hard for me to split the unsegmented data (Google's paper used 147 hours of data read by the 2013 Blizzard Challenge speaker).
The first column is speech generated with global style tokens (GSTs) after 100K steps.
The second column is speech generated with GSTs after 200K steps.
The third column is speech generated without GSTs (reference encoder only) after 150K steps.
The last column is the reference audio.
1. All the horses would come to him, but I think we were his favorites.
With GST (100K steps)
With GST (200K steps)
Without GST (150K steps)
Reference audio
2. The oldest of the colts raised his head, pricked his ears, and said, There are the hounds.
With GST (100K steps)
With GST (200K steps)
Without GST (150K steps)
Reference audio
3. He was gone again, glad to get away even from Fanny.