Direct modelling of speech emotion from raw speech
Paper/Presentation Title | Direct modelling of speech emotion from raw speech |
---|---|
Presentation Type | Paper |
Authors | Latif, Siddique (Author), Rana, Rajib (Author), Khalifa, Sara (Author), Jurdak, Raja (Author) and Epps, Julien (Author) |
Journal or Proceedings Title | Proceedings of the 20th Annual Conference of the International Speech Communication Association (INTERSPEECH 2019) |
Number of Pages | 5 |
Year | 2019 |
Place of Publication | France |
ISBN | 9781510896833 |
Digital Object Identifier (DOI) | https://doi.org/10.21437/Interspeech.2019-3252 |
Web Address (URL) of Paper | https://www.isca-speech.org/archive/interspeech_2019/latif19_interspeech.html |
Conference/Event | 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language (INTERSPEECH 2019) |
Event Details | 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language (INTERSPEECH 2019), 15–19 Sep 2019, Graz, Austria |
Abstract | Speech emotion recognition is a challenging task and heavily depends on hand-engineered acoustic features, which are typically crafted to echo human perception of speech signals. However, a filter bank designed from perceptual evidence is not always guaranteed to be the best in a statistical modelling framework where the end goal is, for example, emotion classification. This has fuelled the emerging trend of learning representations from raw speech, especially using deep neural networks. In particular, the combination of Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks has gained great traction, owing to the intrinsic ability of LSTMs to learn the contextual information crucial for emotion recognition and of CNNs to overcome the scalability problem of regular neural networks. In this paper, we show that there are still opportunities to improve the performance of emotion recognition from raw speech by exploiting the properties of CNNs in modelling contextual information. We propose the use of parallel convolutional layers to harness multiple temporal resolutions in the feature extraction block, which is jointly trained with the LSTM-based classification network for the emotion recognition task. Our results suggest that the proposed model can reach the performance of a CNN trained on hand-engineered features on both the IEMOCAP and MSP-IMPROV datasets. |
Keywords | speech emotion, raw speech, convolutional neural networks, long short term memory |
Contains Sensitive Content | Does not contain sensitive content |
ANZSRC Field of Research 2020 | 460212. Speech recognition |
Public Notes | Files associated with this item cannot be displayed due to copyright restrictions. |
Byline Affiliations | Institute for Resilient Regions; University of New South Wales; Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia; University of New South Wales |
Institution of Origin | University of Southern Queensland |
https://research.usq.edu.au/item/q566w/direct-modelling-of-speech-emotion-from-raw-speech
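As a rough illustration of the architecture described in the abstract (parallel convolutional branches over raw speech at multiple temporal resolutions, feeding an LSTM classifier), here is a minimal sketch in PyTorch. This is not the authors' exact model: the kernel sizes, strides, channel counts, and the four-class output are illustrative assumptions.

```python
# Minimal sketch (not the published model): parallel 1-D convolutions over raw
# speech at several temporal resolutions, concatenated and fed to an LSTM classifier.
# Kernel sizes, strides, channel counts, and num_classes are illustrative assumptions.
import torch
import torch.nn as nn

class ParallelCNNLSTM(nn.Module):
    def __init__(self, num_classes: int = 4):
        super().__init__()
        # Parallel convolutional branches with different kernel widths, so each
        # branch sees the raw waveform at a different temporal resolution.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(1, 32, kernel_size=k, stride=8, padding=k // 2),
                nn.BatchNorm1d(32),
                nn.ReLU(),
                nn.MaxPool1d(4),
            )
            for k in (40, 200, 400)  # hypothetical widths (~2.5, 12.5, 25 ms at 16 kHz)
        ])
        # The LSTM consumes the concatenated branch features frame by frame.
        self.lstm = nn.LSTM(input_size=32 * 3, hidden_size=128, batch_first=True)
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) raw speech
        x = waveform.unsqueeze(1)                      # (batch, 1, samples)
        feats = [branch(x) for branch in self.branches]
        # Trim branches to a common length before concatenating along channels.
        min_len = min(f.shape[-1] for f in feats)
        feats = torch.cat([f[..., :min_len] for f in feats], dim=1)
        feats = feats.transpose(1, 2)                  # (batch, frames, channels)
        _, (h_n, _) = self.lstm(feats)
        return self.classifier(h_n[-1])                # emotion logits

# Example: a batch of two 2-second utterances at 16 kHz.
model = ParallelCNNLSTM()
logits = model(torch.randn(2, 32000))
print(logits.shape)  # torch.Size([2, 4])
```

Concatenating the branch outputs along the channel dimension lets the LSTM see short- and long-window views of the same frames, which is the basic idea behind harnessing multiple temporal resolutions in the feature extraction block.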