Audio Samples

Samples are from a model trained on the Blizzard Challenge 2013 database. The 2013 Challenge provides about 200 hours of unsegmented speech and 9741 segmented waveforms. I ran all experiments on the 9741 segmented waveforms, since it is hard for me to split the unsegmented data (Google's paper used 147 hours of data read by the 2013 Blizzard Challenge speaker).
The first column is speech generated with global style tokens (GSTs) after 100K steps.
The second column is speech generated with GSTs after 200K steps.
The third column is speech generated without GSTs (reference encoder only) after 150K steps.
The last column is the reference audio.

1. All the horses would come to him, but I think we were his favorites.