Controlling Prosody in End-to-End TTS: A Case Study on Contrastive Focus Generation

Paper


Latif, Siddique, Kim, Inyoung, Calapodescu, Ioan and Besacier, Laurent. 2021. "Controlling Prosody in End-to-End TTS: A Case Study on Contrastive Focus Generation." 25th Conference on Computational Natural Language Learning (CoNLL 2021). Punta Cana, Dominican Republic 10 - 11 Nov 2021 Stroudsburg, Pennsylvania. https://doi.org/10.18653/v1/2021.conll-1.42
Paper/Presentation Title

Controlling Prosody in End-to-End TTS: A Case Study on Contrastive Focus Generation

Presentation Type Paper
Authors Latif, Siddique (Author), Kim, Inyoung (Author), Calapodescu, Ioan (Author) and Besacier, Laurent (Author)
Journal or Proceedings Title Proceedings of the 25th Conference on Computational Natural Language Learning (CoNLL 2021)
ERA Conference ID 42652
Number of Pages 8
Year 2021
Place of Publication Stroudsburg, Pennsylvania
ISBN 9781955917056
Digital Object Identifier (DOI) https://doi.org/10.18653/v1/2021.conll-1.42
Web Address (URL) of Paper https://aclanthology.org/2021.conll-1.42/
Conference/Event 25th Conference on Computational Natural Language Learning (CoNLL 2021)
Event Details
Conference on Natural Language Learning (CoNLL)
Rank A
25th Conference on Computational Natural Language Learning (CoNLL 2021)
Event Date 10 - 11 Nov 2021
Event Location Punta Cana, Dominican Republic
Abstract

While End-to-End Text-to-Speech (TTS) has made significant progress over the past few years, these systems still lack intuitive user controls over prosody. For instance, generating speech with fine-grained prosody control (prosodic prominence, contextually appropriate emotions) is still an open challenge. In this paper, we investigate whether we can control prosody directly from the input text, in order to encode information related to contrastive focus, which emphasizes a specific word that is contrary to the presuppositions of the interlocutor. We build and share a specific dataset for this purpose and show that it allows training a TTS system where this fine-grained prosodic feature can be correctly conveyed using control tokens. Our evaluation compares synthetic and natural utterances and shows that the prosodic patterns of contrastive focus (variations of F0, intensity and duration) can be learnt accurately. Such a milestone is important to allow, for example, smart speakers to be programmatically controlled in terms of output prosody.
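
The abstract describes two concrete mechanisms: marking contrastive focus with control tokens in the input text, and evaluating synthetic against natural utterances via F0, intensity and duration. The Python sketch below illustrates both ideas under stated assumptions: the "<focus>" token name and the single-word tagging scheme are illustrative only (the paper does not specify its token inventory here), and librosa is used merely as one possible tool for the acoustic measurements.

import numpy as np
import librosa


def mark_focus(words, focus_index, token="<focus>"):
    """Insert a (hypothetical) control token before the word to be emphasized."""
    out = list(words)
    out.insert(focus_index, token)
    return " ".join(out)


def prosody_stats(wav_path):
    """Rough F0 / intensity / duration summary for comparing two utterances."""
    y, sr = librosa.load(wav_path, sr=None)
    f0, _, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)  # F0 track; NaN on unvoiced frames
    rms = librosa.feature.rms(y=y)[0]                         # frame-level intensity proxy
    return {
        "mean_f0_hz": float(np.nanmean(f0)),
        "mean_rms": float(rms.mean()),
        "duration_s": float(librosa.get_duration(y=y, sr=sr)),
    }


if __name__ == "__main__":
    # "MARY ate the apple" -- focus on the subject, contradicting a presupposition.
    print(mark_focus(["Mary", "ate", "the", "apple"], focus_index=0))
    # e.g. compare prosody_stats("synthetic.wav") with prosody_stats("natural.wav").

In practice the tagged text would replace the plain sentence at the input of the end-to-end TTS model; the summary statistics only approximate the utterance-level F0/intensity/duration comparison reported in the paper.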

Keywords End-to-End TTS, fine-grained prosody control, contrastive focus, interrogative/assertive sentences
ANZSRC Field of Research 2020 460211. Speech production
460208. Natural language processing
461104. Neural networks
461103. Deep learning
Byline Affiliations School of Sciences
NAVER LABS, France
Institution of Origin University of Southern Queensland
Permalink

https://research.usq.edu.au/item/q6y81/controlling-prosody-in-end-to-end-tts-a-case-study-on-contrastive-focus-generation

Download files


Published Version
2021.conll-1.42.pdf
License: CC BY 4.0
File access level: Anyone


Other Documentation
2021.conll-Proceedings Front matter.pdf
License: CC BY 4.0
File access level: Anyone

  • 138 total views
  • 429 total downloads
  • 2 views this month
  • 19 downloads this month


Related outputs

Medicine's New Rhythm: Harnessing Acoustic Sensing via the Internet of Audio Things for Healthcare
Pervez, Farrukh, Shoukat, Moazzam, Suresh, Varsha, Farooq, Muhammad Umar Bin, Sandhu, Moid, Qayyum, Adnan, Usama, Muhammad, Girardi, Adnan, Latif, Siddique and Qadir, Junaid. 2024. "Medicine's New Rhythm: Harnessing Acoustic Sensing via the Internet of Audio Things for Healthcare." IEEE Open Journal of the Computer Society. 5, pp. 491-510. https://doi.org/10.1109/OJCS.2024.3462812
SSMD-UNet: semi-supervised multi-task decoders network for diabetic retinopathy segmentation
Ullah, Zahid, Akram, Muhammad, Latif, Siddique, Khan, Asifullah and Gwak, Jeonghwan. 2023. "SSMD-UNet: semi-supervised multi-task decoders network for diabetic retinopathy segmentation." Scientific Reports. 13 (1). https://doi.org/10.1038/s41598-023-36311-0
Densely attention mechanism based network for COVID-19 detection in chest X-rays
Ullah, Zahid, Usman, Muhammad, Latif, Siddique and Gwak, Jeonghwan. 2023. "Densely attention mechanism based network for COVID-19 detection in chest X-rays." Scientific Reports. 13 (1). https://doi.org/10.1038/s41598-022-27266-9
Selective Deeply Supervised Multi-Scale Attention Network for Brain Tumor Segmentation
Rehman, Azka, Usman, Muhammad, Shahid, Abdullah, Latif, Siddique and Qadir, Junaid. 2023. "Selective Deeply Supervised Multi-Scale Attention Network for Brain Tumor Segmentation." Sensors. 23 (4). https://doi.org/10.3390/s23042346
Multitask Learning From Augmented Auxiliary Data for Improving Speech Emotion Recognition
Latif, Siddique, Rana, Rajib, Khalifa, Sara, Jurdak, Raja and Schuller, Bjorn W.. 2023. "Multitask Learning From Augmented Auxiliary Data for Improving Speech Emotion Recognition." IEEE Transactions on Affective Computing. 14 (4), pp. 3164-3176. https://doi.org/10.1109/TAFFC.2022.3221749
Self Supervised Adversarial Domain Adaptation for Cross-Corpus and Cross-Language Speech Emotion Recognition
Latif, Siddique, Rana, Rajib, Khalifa, Sara, Jurdak, Raja and Schuller, Bjorn. 2023. "Self Supervised Adversarial Domain Adaptation for Cross-Corpus and Cross-Language Speech Emotion Recognition." IEEE Transactions on Affective Computing. 14 (3), pp. 1912-1926. https://doi.org/10.1109/TAFFC.2022.3167013
A survey on deep reinforcement learning for audio‑based applications
Latif, Siddique, Cuayahuitl, Heriberto, Pervez, Farrukh, Shamshad, Fahad, Ali, Hafiz Shehbaz and Cambria, Erik. 2023. "A survey on deep reinforcement learning for audio-based applications." Artificial Intelligence Review: an international survey and tutorial journal. 56 (3), pp. 2193-2240. https://doi.org/10.1007/s10462-022-10224-2
Multi-Task Semi-Supervised Adversarial Autoencoding for Speech Emotion Recognition
Latif, Siddique, Rana, Rajib, Khalifa, Sara, Jurdak, Raja, Epps, Julien and Schuller, Bjorn W.. 2022. "Multi-Task Semi-Supervised Adversarial Autoencoding for Speech Emotion Recognition." IEEE Transactions on Affective Computing. 13 (2), pp. 992-1004. https://doi.org/10.1109/TAFFC.2020.2983669
Privacy Enhanced Speech Emotion Communication using Deep Learning Aided Edge Computing
Ali, Hafiz Shehbaz, Hassan, Fakhar ul, Latif, Siddique, Manzoor, Habib Ullah and Qadir, Junaid. 2021. "Privacy Enhanced Speech Emotion Communication using Deep Learning Aided Edge Computing." IEEE International Conference on Communications Workshops (2021). Montreal, Canada 14 - 23 Jun 2021 United States. https://doi.org/10.1109/ICCWorkshops50388.2021.9473669
Deep Representation Learning for Speech Emotion Recognition
Latif, Siddique. 2022. Deep Representation Learning for Speech Emotion Recognition. PhD by Publication Doctor of Philosophy (DPHD). University of Southern Queensland. https://doi.org/10.26192/w8w00
Survey of Deep Representation Learning for Speech Emotion Recognition
Latif, Siddique, Rana, Rajib, Khalifa, Sara, Jurdak, Raja, Qadir, Junaid and Schuller, Bjorn. 2023. "Survey of Deep Representation Learning for Speech Emotion Recognition." IEEE Transactions on Affective Computing. 14 (2), pp. 1634-1654. https://doi.org/10.1109/TAFFC.2021.3114365
Deep Architecture Enhancing Robustness to Noise, Adversarial Attacks, and Cross-corpus Setting for Speech Emotion Recognition
Latif, Siddique, Rana, Rajib, Khalifa, Sara, Jurdak, Raja and Schuller, Bjorn W.. 2020. "Deep Architecture Enhancing Robustness to Noise, Adversarial Attacks, and Cross-corpus Setting for Speech Emotion Recognition." 21st Annual Conference of the International Speech Communication Association: Cognitive Intelligence for Speech Processing (INTERSPEECH 2020). Shanghai, China 25 - 29 Oct 2020 France. https://doi.org/10.21437/Interspeech.2020-3190
Augmenting Generative Adversarial Networks for Speech Emotion Recognition
Latif, Siddique, Asim, Muhammad, Rana, Rajib, Khalifa, Sara, Jurdak, Raja and Schuller, Bjorn W.. 2020. "Augmenting Generative Adversarial Networks for Speech Emotion Recognition." 21st Annual Conference of the International Speech Communication Association: Cognitive Intelligence for Speech Processing (INTERSPEECH 2020). Shanghai, China 25 - 29 Oct 2020 France. https://doi.org/10.21437/Interspeech.2020-3194
Federated Learning for Speech Emotion Recognition Applications
Latif, Siddique, Khalifa, Sara, Rana, Rajib and Jurdak, Raja. 2020. "Federated Learning for Speech Emotion Recognition Applications." 19th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN 2020). Sydney, Australia 21 - 24 Apr 2020 United States. https://doi.org/10.1109/IPSN48710.2020.00-16
Direct modelling of speech emotion from raw speech
Latif, Siddique, Rana, Rajib, Khalifa, Sara, Jurdak, Raja and Epps, Julien. 2019. "Direct modelling of speech emotion from raw speech." 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language (INTERSPEECH 2019). Graz, Austria 15 - 19 Sep 2019 France. https://doi.org/10.21437/Interspeech.2019-3252
Variational Autoencoders to Learn Latent Representations of Speech Emotion
Latif, Siddique, Rana, Rajib, Qadir, Junaid and Epps, Julien. 2018. "Variational Autoencoders to Learn Latent Representations of Speech Emotion." 19th Annual Conference of the International Speech Communication Association: Speech Research for Emerging Markets in Multilingual Societies (INTERSPEECH 2018). Hyderabad, India 02 - 06 Sep 2018 France. https://doi.org/10.21437/Interspeech.2018-1568
Transfer learning for improving speech emotion classification accuracy
Latif, Siddique, Rana, Rajib, Younis, Shahzad, Qadir, Junaid and Epps, Julien. 2018. "Transfer learning for improving speech emotion classification accuracy." 19th Annual Conference of the International Speech Communication Association: Speech Research for Emerging Markets in Multilingual Societies (INTERSPEECH 2018). Hyderabad, India 02 - 06 Sep 2018 France. https://doi.org/10.21437/Interspeech.2018-1625
Automated screening for distress: A perspective for the future
Rana, Rajib, Latif, Siddique, Gururajan, Raj, Gray, Anthony, Mackenzie, Geraldine, Humphris, Gerald and Dunn, Jeff. 2019. "Automated screening for distress: A perspective for the future." European Journal of Cancer Care. 28 (4). https://doi.org/10.1111/ecc.13033
Phonocardiographic sensing using deep learning for abnormal heartbeat detection
Latif, Siddique, Usman, Muhammad, Rana, Rajib and Qadir, Junaid. 2018. "Phonocardiographic sensing using deep learning for abnormal heartbeat detection." IEEE Sensors Journal. 18 (22), pp. 9393-9400. https://doi.org/10.1109/JSEN.2018.2870759
Mobile health in the Developing World: review of literature and lessons from a case study
Latif, Siddique, Rana, Rajib, Qadir, Junaid, Ali, Anwaar, Imran, Muhammad Ali and Younis, Muhammad Shahzad. 2017. "Mobile health in the Developing World: review of literature and lessons from a case study." IEEE Access. 5, pp. 11540-11556. https://doi.org/10.1109/ACCESS.2017.2710800