Design of Objective Quality Measures for Time-Scale Modification of Audio

PhD Thesis


Roberts, Timothy. 2021. Design of Objective Quality Measures for Time-Scale Modification of Audio. PhD Thesis Doctor of Philosophy. Griffith University. https://doi.org/10.25904/1912/4070
Title

Design of Objective Quality Measures for Time-Scale Modification of Audio

Type: PhD Thesis
Authors: Roberts, Timothy
Supervisor: Prof. Kuldip Paliwal, Dr Andrew Busch
Institution of Origin: Griffith University
Qualification Name: Doctor of Philosophy
Number of Pages: 260
Year: 2021
Publisher: Griffith University
Place of Publication: Australia
Digital Object Identifier (DOI): https://doi.org/10.25904/1912/4070
Web Address (URL): http://hdl.handle.net/10072/401637
Abstract

This dissertation describes the design of effective objective measures of quality for Time-Scale Modification (TSM).

TSM methods are single channel algorithms that give poor results when applied to multi-channel signals, as the phase relationship between channels must be maintained. This dissertation proposes a method and additional variant for maintaining the phase relationship between channels and retaining the presence in the centre of the stereo signal. The method involves pre- and post-processing the signal, with the variant processing each frame for real-time suitability. Sum and difference transformations of the stereo signal are used for TSM and result in a large improvement in stereo phase coherence, consequently maintaining the stereo field. The proposed method produces a high quality stereo output and greatly improves quality over the independent channel processing method.

A modification to the Epoch-Synchronous Overlap-Add (ESOLA) TSM algorithm is proposed in this dissertation. The proposed method, Fuzzy Epoch-Synchronous Overlap-Add, improves on the previous ESOLA method through cross-correlation of time-smeared epochs before overlap-adding. This reduces distortion and artefacts while the speaker's fundamental frequency is stable, as well as reducing artefacts during pitch modulation. The proposed method is tested against well-known TSM algorithms. It is preferred over ESOLA and gives similar performance to other TSM algorithms for voice signals. It is also shown that this algorithm can work effectively with solo instrument signals containing strong fundamental frequencies.

No effective objective measure of quality for TSM exists. This dissertation details the creation, subjective evaluation and analysis of a dataset, for use in the development of an objective measure of quality for TSM.
Comprising two parts, the training subset contains 88 source files processed using six TSM methods at 10 time-scales, while the testing subset contains 20 source files processed using three additional methods at four time-scales. The source material contains speech, solo harmonic and percussive instruments, sound effects and a range of music genres. 42,529 ratings were collected from 633 sessions using laboratory and remote collection methods. Analysis of results shows no correlation between age and quality of rating; equivalence between expert and non-expert listeners; negligible differences between participants with and without hearing issues; and negligible differences between testing modalities. Comparison of published objective measures and subjective scores shows the objective measures to be poor indicators of subjective quality. Initial results for a retrained objective measure of quality are presented, with results approaching the average loss and correlation values of subjective sessions.

An objective measure of quality for time-scaled audio is proposed that makes use of the previously developed dataset and improves on reported results. The measure uses hand-crafted features and a fully connected network to predict subjective mean opinion scores. Basic and Advanced Perceptual Evaluation of Audio Quality features are used in addition to nine features specific to TSM artefacts. Six methods of alignment are explored, with interpolation of the reference magnitude spectrum to the length of the test magnitude spectrum giving the best performance. The proposed measure achieves an average Root Mean Squared Error (RMSE) of 0.490 and a mean Pearson Correlation Coefficient (PCC) of 0.864, equivalent to 97th and 82nd percentiles of subjective sessions respectively.
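
For readers unfamiliar with the two figures of merit quoted above, the following is a minimal NumPy sketch of how RMSE and PCC between predicted and subjective mean opinion scores are typically computed. It is not taken from the thesis code; the function names and the sample scores are illustrative assumptions.

```python
import numpy as np


def rmse(predicted, subjective):
    """Root mean squared error between two equal-length score sequences."""
    predicted = np.asarray(predicted, dtype=float)
    subjective = np.asarray(subjective, dtype=float)
    return float(np.sqrt(np.mean((predicted - subjective) ** 2)))


def pcc(predicted, subjective):
    """Pearson correlation coefficient between two equal-length score sequences."""
    predicted = np.asarray(predicted, dtype=float)
    subjective = np.asarray(subjective, dtype=float)
    p = predicted - predicted.mean()  # centre both sequences
    s = subjective - subjective.mean()
    return float(np.sum(p * s) / np.sqrt(np.sum(p ** 2) * np.sum(s ** 2)))


# Made-up opinion scores on a 1 to 5 scale, for illustration only.
subjective_mos = [1.8, 2.5, 3.1, 3.9, 4.4]
predicted_mos = [2.0, 2.4, 3.5, 3.6, 4.6]

print("RMSE:", rmse(predicted_mos, subjective_mos))
print("PCC:", pcc(predicted_mos, subjective_mos))
```

A lower RMSE means predictions sit closer to the listener scores in absolute terms, while a PCC near 1 means the predictions rank and scale with the listener scores, so the two are usually reported together.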
The proposed measure is used to evaluate TSM algorithms, finding that Elastique gives the highest objective quality for solo instrument and voice signals, while the Identity Phase-Locking Phase Vocoder gives the highest objective quality for music signals and the best overall quality.

Two single-ended objective quality measures for time-scaled audio are also proposed. These measures require neither a reference signal nor alignment. Data-driven features are created by either a convolutional neural network (CNN) or a bidirectional gated recurrent unit (BGRU) network, and are fed to a fully connected network to predict subjective mean opinion scores. The proposed CNN and BGRU measures achieve an average RMSE of 0.608 and 0.576, and a mean PCC of 0.771 and 0.794, respectively. The proposed measures are used to evaluate TSM algorithms, and comparisons are provided for 16 TSM implementations.

A literature review is included with required background knowledge. It includes the fundamentals of sound perception, sound capture, digital signal processing, time-scale modification methods used within research, and subjective and objective measures of quality. Full implementation of all proposed methods and measures can be found at github.com/zygurt/TSM, while the labelled dataset is available at http://ieee-dataport.org/1987.
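
As an illustration of the sum and difference processing that the abstract describes for stereo signals, the sketch below wraps an arbitrary single-channel TSM function. It is a hedged outline rather than the thesis implementation: the 0.5 scaling, the function names, the convention that a ratio above 1 lengthens the output, and the placeholder time-scaler are all assumptions made here for illustration.

```python
import numpy as np


def to_sum_difference(stereo):
    """Split an (n_samples, 2) array into sum (mid) and difference (side) signals."""
    left, right = stereo[:, 0], stereo[:, 1]
    return 0.5 * (left + right), 0.5 * (left - right)


def from_sum_difference(mid, side):
    """Rebuild left and right channels from sum and difference signals."""
    return np.stack([mid + side, mid - side], axis=1)


def stereo_tsm(stereo, tsm, ratio):
    """Time-scale the sum and difference signals, then convert back to left/right."""
    mid, side = to_sum_difference(stereo)
    mid_out = tsm(mid, ratio)
    side_out = tsm(side, ratio)
    n = min(len(mid_out), len(side_out))  # guard against off-by-one lengths
    return from_sum_difference(mid_out[:n], side_out[:n])


def placeholder_tsm(x, ratio):
    """Stand-in only: linear-interpolation resampling, which also shifts pitch.

    Replace with a real single-channel TSM algorithm in practice.
    """
    n_out = int(round(len(x) * ratio))
    positions = np.linspace(0, len(x) - 1, n_out)
    return np.interp(positions, np.arange(len(x)), x)


# Illustrative input: a 5 Hz tone, louder in the left channel than the right.
t = np.arange(1000) / 1000.0
tone = np.sin(2 * np.pi * 5 * t)
stereo = np.stack([tone, 0.5 * tone], axis=1)

stretched = stereo_tsm(stereo, placeholder_tsm, 2.0)
print(stretched.shape)
```

Because the placeholder is a linear operation, this particular demo gives the same result as processing each channel independently. The transformation matters for real TSM algorithms, whose frame alignment and phase decisions depend on the signal and can therefore differ between independently processed channels.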

Keywords: time scale modification; objective; subjective; quality; opinion score
Contains Sensitive Content: Does not contain sensitive content
ANZSRC Field of Research 2020: 400607. Signal processing; 461104. Neural networks
Public Notes

Files associated with this item cannot be displayed due to copyright restrictions.

Byline Affiliations: Griffith University
Permalink: https://research.usq.edu.au/item/zz0z0/design-of-objective-quality-measures-for-time-scale-modification-of-audio


Related outputs

Deep learning-based single-ended quality prediction for time-scale modified audio
Roberts, Timothy, Nicolson, Aaron and Paliwal, Kuldip K. 2021. "Deep learning-based single-ended quality prediction for time-scale modified audio." Journal of the Audio Engineering Society. 69 (9), pp. 644-655. https://doi.org/10.17743/jaes.2021.0031