Rajan Saha Raju, Prithwiraj Bhattacharjee, Arif Ahmad, Mohammad Shahidur Rahman, “A Bangla Text-to-Speech System using Deep Neural Networks,” 2019 International Conference on Bangla Speech and Language Processing (ICBSLP), Sylhet, 2019.
We present a Deep Neural Network (DNN) based statistical parametric Text-to-Speech (TTS) system for Bangla (also known as Bengali). A first step in building a DNN-based TTS system is obtaining a large speech dataset. Since no good speech dataset for Bangla TTS is publicly available, we created our own. We prepared a phonetically rich, studio-quality speech database of 12,500 utterances containing more than 40 hours of speech. We also prepared a pronunciation dictionary (lexicon) of 135,000 words for front-end text processing, which, to our knowledge, is the largest lexicon for Bangla. Our system extracts linguistic features from the input text and then uses deep neural networks to map these linguistic features to acoustic features. We developed two TTS voices from our dataset, one male and one female. Both objective and subjective evaluations show that our system performs significantly better than traditional Bangla TTS systems and is comparable to the best commercially available Bangla TTS system.
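
As a rough illustration of the mapping from linguistic features to acoustic features described above, the following minimal sketch passes one frame of linguistic features through a small feed-forward network. It is not the authors' implementation: the abstract does not state the architecture, so the layer sizes, the tanh activation, the feature dimensions, and all names (`init_layer`, `dense`, `predict_acoustic`) are hypothetical, and the weights are random rather than trained.

```python
import math
import random

# Hypothetical dimensions chosen only for illustration; the abstract
# does not give the real feature sizes.
LINGUISTIC_DIM = 8   # e.g. encoded phone identity and positional features
HIDDEN_DIM = 16
ACOUSTIC_DIM = 4     # e.g. a few spectral and pitch parameters


def init_layer(n_in, n_out, rng):
    """Create one dense layer as (weights, biases) with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, activation=None):
    """Apply one dense layer to the input vector x."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    if activation is not None:
        out = [activation(v) for v in out]
    return out


def predict_acoustic(linguistic_frame, layers):
    """Map one frame of linguistic features to acoustic features."""
    h = linguistic_frame
    for layer in layers[:-1]:
        h = dense(h, layer, math.tanh)   # hidden layers (assumed tanh)
    return dense(h, layers[-1])          # linear output layer for regression


rng = random.Random(0)  # fixed seed so the sketch is reproducible
layers = [
    init_layer(LINGUISTIC_DIM, HIDDEN_DIM, rng),
    init_layer(HIDDEN_DIM, HIDDEN_DIM, rng),
    init_layer(HIDDEN_DIM, ACOUSTIC_DIM, rng),
]

# A made-up linguistic feature vector for a single frame.
frame = [1.0, 0.0, 0.0, 1.0, 0.0, 0.5, 0.25, 0.0]
acoustic = predict_acoustic(frame, layers)
print(len(acoustic))
```

In a statistical parametric system of this kind, the predicted acoustic features would then be passed to a vocoder to generate the waveform; that step, and the training of the network on the recorded speech, are omitted here.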