
Neural Attentive Approaches on Voice Activity Detection and Speech Separation

Abstract/Summary

Speech is the fastest and most natural method of human communication. This fact has motivated researchers to regard speech as a fast and efficient means of interaction between humans and machines. However, natural human-machine interaction remains out of reach due to the nature of the speech signal, which can easily be contaminated by unwanted background noise or even by other speech signals of no interest. This dissertation aims to mitigate these difficulties in speech signal processing by means of voice activity detection and speech separation, both of which are important preprocessing stages in speech-related audio signal processing. Before the proposed approaches to voice activity detection and speech separation are presented, the general frameworks and research issues of both tasks are described.

First, a spectro-temporal attention-based voice activity detection method is presented. Voice activity detection systems suffer from unexpected, non-stationary background noise at magnitudes high enough to mask the speech signal. The solution proposed here makes use of neural attention mechanisms. A spectral attention module extracts meaningful information through a gating convolution mechanism called gated linear units. A gated linear unit controls the bandwidth of information flowing through a single neuron by exploiting the output of a sigmoid function, which allows speech-related features to be extracted from the time-frequency representation of the input signal. In addition, a temporal attention module aggregates information from different neurons and different positions with a self-attention algorithm. Multiple loss functions are introduced to aid the convergence of the proposed modules and to prevent the vanishing gradient problem. Extensive experimental results validate the robustness of the proposed method and its ability to generalize to environments with unknown or unexpected noise.
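As a rough illustration of the gating idea behind gated linear units (not the dissertation's actual network, whose layer sizes and convolutions are not specified here), the input is split into two halves and a sigmoid of the second half scales the first half elementwise, so each gate value in (0, 1) decides how much of a feature passes through:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(x, axis=-1):
    """Gated linear unit: split `x` in two along `axis`, then gate the
    first half with a sigmoid of the second half. Each sigmoid output
    lies in (0, 1), so it throttles how much of a feature flows on."""
    a, b = np.split(x, 2, axis=axis)
    return a * sigmoid(b)

# Toy time-frequency input: 4 frames, 6 channels (3 gated output channels).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))
y = glu(x)
print(y.shape)  # (4, 3)
```

Because the gate is strictly between 0 and 1, every output magnitude is bounded by the corresponding ungated feature, which is the "bandwidth control" described above.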
Second, a neural attentive speech separation model is proposed. Most previous approaches rest on the assumption that the number of speakers is known in advance. This assumption compels the separation model to always generate a predefined number of outputs, which limits its flexibility and generalization. To relax this assumption, a speaker clustering module based on slot attention is proposed. Speaker clustering maps a set of input feature vectors to a set of output vectors that can be regarded as speaker centroids. Once the speaker centroids are obtained, the input feature vectors, conditioned on the centroids, are fed into the subsequent layers to predict one source per centroid. Various experimental results show that the proposed method performs reasonably well even when the number of speakers is unknown at test time.
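The clustering step can be sketched as follows. This is a deliberately simplified NumPy illustration of slot attention, not the dissertation's model: the learned projections, GRU, and MLP updates of the full algorithm are replaced by a plain scaled dot-product and a weighted-mean update. The key property survives, though: the softmax is taken over the slot axis, so slots compete for each input vector, and each slot converges toward a centroid of the inputs it wins.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, n_slots=3, n_iters=3, seed=0):
    """Simplified slot attention: map (n, d) input features to
    (n_slots, d) centroid-like slot vectors by iterated competitive
    attention. Slots are initialized randomly and refined in place."""
    n, d = inputs.shape
    rng = np.random.default_rng(seed)
    slots = rng.normal(size=(n_slots, d))
    scale = d ** -0.5
    for _ in range(n_iters):
        # Similarity between every slot and every input vector.
        logits = scale * slots @ inputs.T            # (n_slots, n)
        # Softmax over the *slot* axis: slots compete for each input.
        attn = softmax(logits, axis=0)               # (n_slots, n)
        # Each slot becomes a weighted mean of the inputs it attends to.
        weights = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)
        slots = weights @ inputs                     # (n_slots, d)
    return slots

feats = np.random.default_rng(1).normal(size=(10, 4))
centroids = slot_attention(feats, n_slots=3)
print(centroids.shape)  # (3, 4)
```

Since each slot is a convex combination of the inputs, the returned centroids always lie within the range of the input features. In the set-to-set view described above, `n_slots` can be chosen larger than the expected speaker count, which is what allows the model to handle an unknown number of speakers.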


Contents

Abstract i
Contents iv
List of Figures vi
List of Tables ix
List of Abbreviations x
Chapter 1. Introduction 1
1.1. Background 1
1.2. Research Goals and Contributions 5
1.3. Organization of Dissertation 7
Chapter 2. General Frameworks for Voice Activity Detection and Speech Separation 9
2.1. General Framework for Voice Activity Detection 9
2.2. Related Work of VAD 10
2.3. General Framework for Speech Separation 12
2.4. Related Work of Speech Separation 13
2.5. Attention Mechanism 21
Chapter 3. The Proposed Voice Activity Detection Method 31
3.1. Overview 31
3.2. Architecture of Proposed VAD method 33
3.3. Experiments 40
Chapter 4. The Proposed Speech Separation Method 49
4.1. Overview 49
4.2. Architecture of Proposed Speech Separation Method 51
4.3. Experiments 58
Chapter 5. Conclusions and Future Works 66
5.1. Conclusions 66
5.2. Future Works 67
Bibliography 69
Curriculum Vitae 76
Publications 77
