Design of Objective Quality Measures for Time-Scale Modification of Audio
PhD Thesis
Title | Design of Objective Quality Measures for Time-Scale Modification of Audio |
---|---|
Type | PhD Thesis |
Authors | Roberts, Timothy |
Supervisor | Prof. Kuldip Paliwal; Dr Andrew Busch |
Institution of Origin | Griffith University |
Qualification Name | Doctor of Philosophy |
Number of Pages | 260 |
Year | 2021 |
Publisher | Griffith University |
Place of Publication | Australia |
Digital Object Identifier (DOI) | https://doi.org/10.25904/1912/4070 |
Web Address (URL) | http://hdl.handle.net/10072/401637 |
Abstract | This dissertation describes the design of effective objective measures of quality for Time-Scale Modification (TSM). TSM methods are single channel algorithms that give poor results when applied to multi-channel signals, as the phase relationship between channels must be maintained. This dissertation proposes a method and additional variant for maintaining the phase relationship between channels and retaining the presence in the centre of the stereo signal. The method involves pre- and post-processing the signal, with the variant processing each frame for real-time suitability. Sum and difference transformations of the stereo signal are used for TSM and result in a large improvement in stereo phase coherence, consequently maintaining the stereo field. The proposed method produces a high quality stereo output and greatly improves quality over the independent channel processing method. A modification to the Epoch-Synchronous Overlap-Add (ESOLA) TSM algorithm is proposed in this dissertation. The proposed method, Fuzzy Epoch-Synchronous Overlap-Add, improves on the previous ESOLA method through cross-correlation of time-smeared epochs before overlap-adding. This reduces distortion and artefacts while the speaker's fundamental frequency is stable, as well as reducing artefacts during pitch modulation. The proposed method is tested against well-known TSM algorithms. It is preferred over ESOLA and gives similar performance to other TSM algorithms for voice signals. It is also shown that this algorithm can work effectively with solo instrument signals containing strong fundamental frequencies. No effective objective measure of quality for TSM exists. This dissertation details the creation, subjective evaluation and analysis of a dataset, for use in the development of an objective measure of quality for TSM. 
The dataset comprises two parts: the training subset contains 88 source files processed using six TSM methods at 10 time-scales, while the testing subset contains 20 source files processed using three additional methods at four time-scales. The source material contains speech, solo harmonic and percussive instruments, sound effects and a range of music genres. 42,529 ratings were collected from 633 sessions using laboratory and remote collection methods. Analysis of results shows no correlation between age and quality of rating; equivalence between expert and non-expert listeners; negligible differences between participants with and without hearing issues; and negligible differences between testing modalities. Comparison of published objective measures and subjective scores shows the objective measures to be poor indicators of subjective quality. Initial results for a retrained objective measure of quality are presented with results approaching average loss and correlation values of subjective sessions. An objective measure of quality for time-scaled audio is proposed that makes use of the previously developed dataset and improves on reported results. The measure uses hand-crafted features and a fully connected network to predict subjective mean opinion scores. Basic and Advanced Perceptual Evaluation of Audio Quality features are used in addition to nine features specific to TSM artefacts. Six methods of alignment are explored, with interpolation of the reference magnitude spectrum to the length of the test magnitude spectrum giving the best performance. The proposed measure achieves an average Root Mean Squared Error (RMSE) of 0.490 and a mean Pearson Correlation Coefficient (PCC) of 0.864, equivalent to the 97th and 82nd percentiles of subjective sessions, respectively. 
The proposed measure is used to evaluate TSM algorithms, finding that Elastique gives the highest objective quality for solo instrument and voice signals, while the Identity Phase-Locking Phase Vocoder gives the highest objective quality for music signals and the best overall quality. Two single-ended objective quality measures for time-scaled audio are also proposed. These measures do not require a reference signal, nor alignment. Data-driven features are created by either a convolutional neural network (CNN) or a bidirectional gated recurrent unit (BGRU) network, and are fed to a fully connected network to predict subjective mean opinion scores. The proposed CNN and BGRU measures achieve an average RMSE of 0.608 and 0.576, and a mean PCC of 0.771 and 0.794, respectively. The proposed measures are used to evaluate TSM algorithms, and comparisons are provided for 16 TSM implementations. A literature review is included with required background knowledge. It includes the fundamentals of sound perception, sound capture, digital signal processing, time-scale modification methods used within research, and subjective and objective measures of quality. Full implementation of all proposed methods and measures can be found at github.com/zygurt/TSM, while the labelled dataset is available at http://ieee-dataport.org/1987. |
Keywords | time scale modification ; objective ; subjective ; quality ; opinion score |
Contains Sensitive Content | Does not contain sensitive content |
ANZSRC Field of Research 2020 | 400607. Signal processing; 461104. Neural networks |
Public Notes | Files associated with this item cannot be displayed due to copyright restrictions. |
Byline Affiliations | Griffith University |
https://research.usq.edu.au/item/zz0z0/design-of-objective-quality-measures-for-time-scale-modification-of-audio
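
The abstract describes applying TSM to sum and difference transformations of a stereo signal rather than to each channel independently. The following is a minimal sketch of that general idea, not the thesis implementation: the function names, the 1/2 scaling of the sum and difference channels, and the `tsm_fn` callable (standing in for any single-channel TSM algorithm) are all assumptions made for illustration.

```python
import numpy as np


def to_sum_difference(stereo):
    """Convert an (n_samples, 2) left/right array to sum/difference channels.

    The 1/2 scaling is an assumption; it keeps the round trip exact.
    """
    left, right = stereo[:, 0], stereo[:, 1]
    return np.stack([(left + right) / 2.0, (left - right) / 2.0], axis=1)


def to_left_right(sum_diff):
    """Invert to_sum_difference: recover left/right from sum/difference."""
    total, diff = sum_diff[:, 0], sum_diff[:, 1]
    return np.stack([total + diff, total - diff], axis=1)


def stereo_tsm(stereo, tsm_fn):
    """Time-scale a stereo signal via its sum and difference channels.

    tsm_fn is a hypothetical placeholder: any function mapping a 1-D
    signal to a time-scaled 1-D signal.
    """
    sum_diff = to_sum_difference(stereo)
    sum_out = tsm_fn(sum_diff[:, 0])
    diff_out = tsm_fn(sum_diff[:, 1])
    # Guard against the two channels differing in length by a few samples.
    n = min(len(sum_out), len(diff_out))
    return to_left_right(np.stack([sum_out[:n], diff_out[:n]], axis=1))
```

In this sketch, content panned to the centre (identical in both channels) has an all-zero difference channel. As long as `tsm_fn` maps silence to silence, the two output channels therefore stay identical, which is one way to see why the centre of the stereo image is retained.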