Non-intrusive Quality Evaluation of Speech Processed in Noisy and Reverberant Environments

Benjamin Cauchi
In many speech applications, such as hands-free telephony or voice-controlled home assistants, the distance between the user and the recording microphones can be relatively large. In such far-field scenarios, the recorded microphone signals are typically corrupted by noise and reverberation, which may severely degrade the performance of speech recognition systems and reduce the intelligibility and quality of speech in communication applications. Speech enhancement algorithms are typically applied to limit these effects. The main objective of this thesis is to develop novel speech enhancement algorithms for noisy and reverberant environments, together with signal-based measures to evaluate these algorithms, focusing on solutions that are applicable in realistic scenarios.
First, we propose a single-channel speech enhancement algorithm for joint noise and reverberation reduction. The proposed algorithm enhances the input signal with a spectral gain, which is computed using a combination of a statistical room acoustics model, minimum statistics and temporal cepstrum smoothing. This single-channel spectral enhancement algorithm can easily be combined with existing beamforming techniques when multiple microphones are available. Evaluation results show that the proposed algorithm improves speech recognition accuracy with both clean and multi-condition training data. In addition, signal-based measures and the results of a listening test show that the proposed algorithm is beneficial in terms of both speech quality and reverberation suppression. In the REVERB Challenge, the proposed single-channel speech enhancement algorithm obtained the best performance in terms of subjective speech quality among all submitted single-channel algorithms.
Second, we propose two non-intrusive speech quality measures that combine perceptually motivated features with predicting functions based on machine learning. The first measure uses time-averaged modulation energies as input features to a model tree. The second measure uses time-varying modulation energies as input features to a recurrent neural network in order to take the time dependency of the test signal into account. Both measures are trained and evaluated using a dataset of perceptually evaluated signals comprising a wide range of algorithms, settings and acoustic scenarios. The results show that the speech quality measure using a recurrent neural network as its predicting function outperforms existing non-intrusive measures and performs comparably to intrusive measures when trained and evaluated for a single category of algorithms. When trained and evaluated for several categories of algorithms, it even outperforms the intrusive benchmark measures, making it suitable for selecting algorithms or algorithm parameters.
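As a rough illustration of what enhancing a signal with a spectral gain can look like, the following Python sketch applies a Wiener-like gain in the short-time Fourier domain. It is not the algorithm proposed in the thesis: the statistical room acoustics model and the temporal cepstrum smoothing are omitted, and the noise estimate is simply the per-bin minimum of a recursively smoothed periodogram over the whole signal, a crude stand-in for minimum statistics. The function name `spectral_gain_enhance` and all parameter values are assumptions chosen for the example.

```python
import numpy as np


def spectral_gain_enhance(x, frame_len=512, hop=256, gain_floor=0.1, smooth=0.9):
    """Apply a Wiener-like spectral gain to a 1-D signal (illustrative sketch only)."""
    # Periodic square-root Hann window: used for analysis and synthesis,
    # the two windows multiply to a Hann window that sums to 1 at 50% overlap.
    window = np.sqrt(np.hanning(frame_len + 1)[:-1])
    n_frames = 1 + (len(x) - frame_len) // hop

    # Short-time Fourier transform of the windowed frames.
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    power = np.abs(spec) ** 2

    # Recursive smoothing of the periodogram over time.
    smoothed = np.empty_like(power)
    smoothed[0] = power[0]
    for t in range(1, n_frames):
        smoothed[t] = smooth * smoothed[t - 1] + (1.0 - smooth) * power[t]

    # Assumption: noise PSD taken as the global per-bin minimum of the
    # smoothed periodogram (a simplification of minimum statistics).
    noise_psd = smoothed.min(axis=0, keepdims=True)

    # Wiener-like gain with a lower bound to limit musical noise.
    gain = np.maximum(1.0 - noise_psd / np.maximum(power, 1e-12), gain_floor)

    # Apply the gain, transform back, and overlap-add.
    enhanced = np.fft.irfft(gain * spec, n=frame_len, axis=1) * window
    y = np.zeros(len(x))
    for i in range(n_frames):
        y[i * hop:i * hop + frame_len] += enhanced[i]
    return y, gain


# Usage on a synthetic noisy tone (assumed 16 kHz sampling rate).
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
noisy = np.sin(2 * np.pi * 440.0 * t) + 0.3 * rng.standard_normal(t.size)
enhanced, gain = spectral_gain_enhance(noisy)
```

The gain floor keeps the attenuation bounded, which is a common way to trade noise reduction against processing artifacts. Setting `gain_floor=1.0` turns the processing into a plain analysis and synthesis chain, which is a convenient check that the overlap-add reconstruction is transparent away from the signal edges.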
August / 2021
Dr. Hut