
Deep Learning Based Spatial Information Modeling for Multi-Channel Speech Enhancement toward Noise Robust Automatic Speech Recognition

Abstract

Speech signal processing plays an increasing role in human interaction with smart devices. The problem of improving the robustness of automatic speech recognition (ASR) in noisy environments has therefore attracted considerable research effort. Deep learning approaches to speech enhancement, particularly those that incorporate a denoising auto-encoder, have achieved great success when applied to single-channel audio signals. In the single-channel case, the signal intensity in the time–frequency domain is the main information resource representing an input signal, and a simple neural network (NN) topology has proved sufficiently effective. Deep learning is effective in improving the performance of an ASR system, but single-channel approaches exhibit a performance limit in clean-signal estimation. Multichannel approaches are expected to overcome this limitation by exploiting spatial information in addition to signal intensity. Attempts have been made to extend deep learning into multichannel speech enhancement. However, the full potential of deep learning-based approaches for microphone array processing has not yet been attained, because applying NNs to multiple channels faces many obstacles. The main reason is that phase information in the time–frequency domain plays a vital role in delivering the spatial information of multichannel signals, whereas NNs traditionally process real-valued physical data and rely on real-valued weights. The application of deep learning to the spatial information of multichannel acoustic signals is still under study, and no representative solution has emerged. This is supported by the fact that a variety of approaches are still being actively proposed, such as designing features with which a multichannel signal can be fed into an NN, or replacing part of an existing spatial filter-based algorithm with an NN.
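The single-channel denoising auto-encoder idea mentioned above can be sketched minimally: a real-valued network maps noisy time–frequency magnitude frames to clean ones. The data, layer sizes, and training loop below are illustrative assumptions only, not the dissertation's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

F = 32   # frequency bins per frame (illustrative)
H = 64   # hidden units (illustrative)
N = 512  # training frames

# Synthetic "clean" log-magnitude frames and their noise-corrupted versions.
clean = rng.standard_normal((N, F))
noisy = clean + 0.3 * rng.standard_normal((N, F))

# One-hidden-layer denoising auto-encoder: noisy frame -> clean estimate.
W1 = 0.1 * rng.standard_normal((F, H)); b1 = np.zeros(H)
W2 = 0.1 * rng.standard_normal((H, F)); b2 = np.zeros(F)

def forward(x):
    h = np.maximum(x @ W1 + b1, 0.0)  # ReLU encoder
    return h, h @ W2 + b2             # linear decoder

mse0 = np.mean((forward(noisy)[1] - clean) ** 2)  # error before training

lr = 1e-2
for _ in range(200):
    h, out = forward(noisy)
    err = out - clean                      # gradient of the MSE loss w.r.t. out
    gW2 = h.T @ err / N; gb2 = err.mean(0)
    dh = (err @ W2.T) * (h > 0)            # backprop through the ReLU
    gW1 = noisy.T @ dh / N; gb1 = dh.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

mse1 = np.mean((forward(noisy)[1] - clean) ** 2)  # error after training
print(mse1 < mse0)  # True: training reduces the reconstruction error
```

In a real system the frames would come from an STFT of noisy speech paired with clean references, and the network would be deeper, but the intensity-only input is the point: no phase, and hence no spatial information, reaches the model.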
It is noteworthy that fully deep learning-based approaches have not been as successful in multichannel speech enhancement as in speech recognition or image classification. In this dissertation, problems encountered in applying deep learning to multichannel speech enhancement are addressed, and mitigating approaches that improve existing methods are proposed. For in-depth analysis and problem formulation, traditional signal processing-based approaches are also discussed. Notable approaches incorporating independent component analysis and non-negative matrix factorization are revisited, and a relevant previous approach is introduced. This provides an analysis of how and why deep learning is required. The application of spatial diversity within a deep learning framework is then evaluated and analyzed from various perspectives, including previous work combining a traditional spatial filter with a deep learning-based postfilter. By designing effective features for the subsequent postfilter, the beamformer structure yielded improved speech enhancement performance. Existing deep learning-based algorithms are also analyzed, and improvement methods are proposed. Modeling the phase information of the input signal is emphasized to overcome the limitations of existing algorithms. By introducing a front-end that accepts a real-valued representation of the time–frequency-domain signal, the proposed structure avoids forcing the NN to approximate complex algebra in order to decode the phase difference between the input channels. As a result, a real-valued NN can be effectively applied to exploit the spatial information embedded in the inter-channel phase difference. To ensure applicability as the front-end of an arbitrary ASR system, the proposed methods focus on clean speech estimation without requiring an acoustic model trained with noise information.
This allows ASR performance to be improved without predefining noise conditions, as demonstrated on two-channel real-recorded noisy signals, showing that the proposed method can handle unseen noise situations encountered in everyday life.
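As a concrete illustration of such a real-valued front-end, the inter-channel phase difference (IPD) of a two-channel STFT frame can be encoded with cosine/sine components, so that a real-weight NN receives the spatial cue directly instead of raw complex values. The signal geometry and feature layout below are hypothetical; the dissertation's exact feature design may differ.

```python
import numpy as np

rng = np.random.default_rng(1)

n_fft = 256
delay = 2  # samples of inter-channel delay (hypothetical array geometry)

# Two-channel toy capture: channel 2 is channel 1 delayed by `delay` samples,
# emulating a source arriving off-broadside at a two-microphone array.
src = rng.standard_normal(n_fft + delay)
ch1 = src[delay:]
ch2 = src[:-delay]

win = np.hanning(n_fft)
X1 = np.fft.rfft(ch1 * win)
X2 = np.fft.rfft(ch2 * win)

# Inter-channel phase difference per frequency bin.
ipd = np.angle(X1 * np.conj(X2))

# Real-valued encoding: [cos(IPD), sin(IPD)] is continuous across the
# +/- pi wrap-around and keeps every input real, so a standard NN need
# not approximate complex algebra to recover the spatial information.
features = np.concatenate([np.cos(ipd), np.sin(ipd)])
print(features.shape)  # (258,): 2 * (n_fft // 2 + 1) real features
```

Feeding such features alongside magnitude inputs is one plausible way to let a mask-estimation network use spatial diversity without any complex-valued weights.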


Table of Contents

Contents
Abstract i
Contents v
List of Figures vii
List of Tables viii
List of Abbreviations ix
Chapter 1. Introduction 1
1.1. Background 1
1.2. Organization of Dissertation 2
Chapter 2. Classical Multichannel Speech Enhancement Algorithms 5
2.1. Signal Model and Definitions 5
2.2. Overview of Classical Beamformers 7
Chapter 3. Blind Source Separation-based Schemes in Relative Transfer Function Estimation 13
3.1. Issues and Related Works 13
3.2. RTF Estimation Using Peaks in Time-Domain RTF 19
3.3. Experiments 25
3.4. Conclusions 27
Chapter 4. Neural Network-based Approaches 29
4.1. Issues and Related Works 29
4.2. Motivation 39
4.3. Proposed system with phase-encoded input for NN-based mask estimation 42
4.4. The ASR stage in the proposed system 47
4.5. Experiments 52
4.6. Conclusions 59
Chapter 5. Neural Network-based Postfilter 60
5.1. Issues and Related Works 60
5.2. New Generalized Sidelobe Canceller with Denoising Auto-Encoder for Improved Speech Enhancement 60
5.3. Experiments 65
5.4. Conclusions 68
Chapter 6. Conclusions and Future Works 70
6.1. Conclusions 70
6.2. Future Works 72
Bibliography 73

