Acoustic Feature Fusion and Denoising Representation for Robust Audio Based Recognition System
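- Subject (keywords) Denoising Autoencoder, Correlation Distance Measure (CDM), Automatic Speech Recognition (ASR)
- Publisher Korea University Graduate School
- Advisor Hanseok Ko (고한석)
- Publication year 2020
- Degree conferred February 2020
- Degree type Master's
- Department Graduate School, Department of Electrical and Electronic Engineering
- Major Signal and Multimedia
- Pages 67 p
- UCI I804:11009-000000127274
- DOI 10.23186/korea.000000127274.11009.0000946
- Language of text English
- Submitted original 000046025814
Abstract/Summary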
This thesis presents two novel approaches to handling corrupted acoustic signals, applied to bird species sound classification and Automatic Speech Recognition (ASR). First, several feature extraction algorithms are combined using deep learning ("feature fusion") to build robust sound classification models. Second, a novel denoising autoencoder is developed to enhance corrupted extracted features for a robust ASR model.
For environmental (bird species) sound classification, "feature fusion" is developed to improve classification accuracy in noisy environments. The fusion models are built on a CNN (Convolutional Neural Network) structure. Features are extracted with several methods, including a robust log Mel-filter bank using a Wiener filter and PNCCs (Power Normalized Cepstral Coefficients). These features are combined into a 3-dimensional feature that is then used as input to the CNN. A database from https://ebird.org is used to train the model and classify 43 bird species recorded in their natural environment. The proposed "feature fusion" is evaluated by injecting 3 types of noise at four SNRs (Signal to Noise Ratios): 20 dB, 10 dB, 5 dB, and 0 dB. The fusion feature is compared with the log Mel-filter bank, both with and without the Wiener filter, and with the PNCCs. Average accuracy in the clean environment increases by 1.34 %. Noisy-environment accuracy across the 4 SNR levels increases by 1.06 % and 0.65 % for shop and schoolyard background noise, respectively.
The performance of ASR is also susceptible to noise, especially when the noise is present in the testing data but not in the training data. The second method therefore focuses on feature enhancement for robust end-to-end ASR systems. A novel variant of a denoising autoencoder (DAE) is proposed. The proposed method uses skip connections on both the encoder and decoder sides, so that speech information from the target frame of the input is passed through the model. It also uses a new objective function for training. A correlation distance measure is used in the penalty terms to measure the dependency between the latent target features and the model's outputs (the latent features and the enhanced features obtained from the DAE). The proposed method is compared against a conventional model and a state-of-the-art model in both seen and unseen noisy environments, using 7 types of background noise at different SNR levels (0, 5, 10, and 20 dB). It is further tested with both linear and non-linear penalty terms. In both cases, the overall average word error rate (WER) improves.
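The noise-injection protocol mentioned in the abstract (mixing background noise into clean audio at 20, 10, 5, and 0 dB SNR) can be sketched as follows. This is a minimal illustration, not the thesis's code: the function names `mix_at_snr` and `measured_snr_db` are hypothetical, and defining SNR as the ratio of average signal power to average noise power over the whole clip is an assumption.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so the mixture has the requested SNR in dB.

    Assumption: SNR is defined over the whole clip as
    10 * log10(mean(clean^2) / mean(scaled_noise^2)).
    """
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Repeat or trim the noise so it covers the clean signal exactly.
    noise = np.resize(noise, clean.shape)

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    if clean_power == 0.0 or noise_power == 0.0:
        raise ValueError("clean and noise signals must both have non-zero power")

    # Solve snr_db = 10*log10(clean_power / (gain^2 * noise_power)) for gain.
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise


def measured_snr_db(clean, noisy):
    """Return the SNR (dB) of `noisy`, treating `noisy - clean` as the noise."""
    clean = np.asarray(clean, dtype=np.float64)
    residual = np.asarray(noisy, dtype=np.float64) - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))
```

Calling `mix_at_snr(clean, noise, snr_db)` once for each of 20, 10, 5, and 0 would produce the four test conditions. Whether the thesis computes signal power over the whole clip or only over active segments is not stated in the abstract.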
Table of Contents
Contents
Abstract
Contents
List of Figures
List of Tables
List of Abbreviations
Chapter 1. Introduction
1.1. Background
1.2. Research goals and Contributions
1.3. Organization of Thesis
Chapter 2. Bird Sounds Classification by Combining PNCC and Robust Mel-log filter Bank Features
2.1. Introduction
2.2. Proposed Method
2.3. Experimental Work
Chapter 3. Correlation Distance Skip Connection Denoising Autoencoder (CDSK-DAE) for Speech Feature Enhancement
3.1. Introduction
3.2. Related Work
3.3. Proposed Model
3.4. Experimental Work
Chapter 4. Conclusions and Future Works
4.1. Conclusion
4.2. Future Works
Bibliography

