Controlling Prosody in End-to-End TTS: A Case Study on Contrastive Focus Generation
| Paper/Presentation Title | Controlling Prosody in End-to-End TTS: A Case Study on Contrastive Focus Generation | 
|---|---|
| Presentation Type | Paper | 
| Authors | Latif, Siddique (Author), Kim, Inyoung (Author), Calapodescu, Ioan (Author) and Besacier, Laurent (Author) | 
| Journal or Proceedings Title | Proceedings of the 25th Conference on Computational Natural Language Learning (CoNLL 2021) | 
| ERA Conference ID | 42652 | 
| Number of Pages | 8 | 
| Year | 2021 | 
| Place of Publication | Stroudsburg, Pennsylvania | 
| ISBN | 9781955917056 | 
| Digital Object Identifier (DOI) | https://doi.org/10.18653/v1/2021.conll-1.42 | 
| Web Address (URL) of Paper | https://aclanthology.org/2021.conll-1.42/ | 
| Conference/Event | 25th Conference on Computational Natural Language Learning (CoNLL 2021) | 
| | Conference on Natural Language Learning |
| Event Details | Conference on Natural Language Learning (CoNLL), Rank A |
| Event Details | 25th Conference on Computational Natural Language Learning (CoNLL 2021). Event Date: 10 to 11 Nov 2021. Event Location: Punta Cana, Dominican Republic |
| Abstract | While End-2-End Text-to-Speech (TTS) has made significant progress over the past few years, these systems still lack intuitive user controls over prosody. For instance, generating speech with fine-grained prosody control (prosodic prominence, contextually appropriate emotions) is still an open challenge. In this paper, we investigate whether we can control prosody directly from the input text, in order to encode information related to contrastive focus, which emphasizes a specific word that is contrary to the presuppositions of the interlocutor. We build and share a specific dataset for this purpose and show that it allows us to train a TTS system where this fine-grained prosodic feature can be correctly conveyed using control tokens. Our evaluation compares synthetic and natural utterances and shows that prosodic patterns of contrastive focus (variations of F0, intensity and duration) can be learnt accurately. Such a milestone is important to allow, for example, smart speakers to be programmatically controlled in terms of output prosody. |
| Keywords | End-to-End TTS, fine-grained prosody control, contrastive focus, interrogative/assertive sentences | 
| ANZSRC Field of Research 2020 | 460211. Speech production | 
| | 460208. Natural language processing |
| | 461104. Neural networks |
| | 461103. Deep learning |
| Byline Affiliations | School of Sciences | 
| | NAVER LABS, United Kingdom |
| Institution of Origin | University of Southern Queensland | 
https://research.usq.edu.au/item/q6y81/controlling-prosody-in-end-to-end-tts-a-case-study-on-contrastive-focus-generation
Download files
Published Version
| 2021.conll-1.42.pdf |
| License: CC BY 4.0 |
| File access level: Anyone |
Other Documentation
| 2021.conll-Proceedings Front matter.pdf |
| License: CC BY 4.0 |
| File access level: Anyone |
- 180 total views
- 714 total downloads
- 4 views this month
- 30 downloads this month