META VIEW

Item SOMS Field Content Language
Title dc.title Real-time Sound-Effects Synthesis of Raw-Waveform Audio with Generative Adversarial Networks
Author dc.creator Minwook Chang
Author (Second Language) somsterms.otherName 장민욱
Affiliation somsterms.affiliation Graduate School, Department of Computer Science, Software Major
Subject (Keywords) dc.subject Sound Synthesis, Generative Adversarial Network, Virtual Reality
Publisher dc.publisher Korea University Graduate School
Advisor somsterms.advisor 김정현
Year of Publication dcterms.issued 2020
Degree Conferred somsterms.awarded 2020. 2
Document Type somsterms.subType Thesis
Degree Level somsterms.thesisDegree Master's
Department somsterms.major Department of Computer Science, Graduate School (College of Informatics)
Specialty somsterms.specialty Software Major
Format dc.format application/pdf
File Size dcterms.extent 1051537 bytes
Medium dcterms.medium application/pdf
Pages somsterms.page 54 p.
URL dc.identifier http://dcollection.korea.ac.kr/common/orgView/000000127372
UCI somsterms.UCI I804:11009-000000127372
DOI somsterms.DOI 10.23186/korea.000000127372.11009.0000942
Language dc.language English
Submitted Original somsterms.isBasedOn 000046026271
Abstract dcterms.abstract Conventional methods for real-time sound effects in 3D graphical and virtual environments have relied on preparing all of the needed samples ahead of time and simply replaying them as needed, or on parametrically modifying a basic set of samples with physically based techniques such as spring-damper simulation and modal analysis/synthesis. In this work, we propose (1) applying the generative adversarial network (GAN) approach to this problem and (2) PUGAN, a novel generative model that progressively synthesizes high-quality raw-waveform audio.
We demonstrate our claim by training a GAN on the sounds of different drums and synthesizing the sounds on the fly in a virtual drum-playing environment. A perceptual test revealed that subjects could neither discern the synthesized sounds from the ground truth nor perceive any noticeable delay after the corresponding physical event.
PUGAN leverages the recently proposed idea of progressively generating higher-resolution images by stacking multiple encoder-decoder architectures. To apply it effectively to raw audio generation, we propose two novel modules: (1) a neural upsampling layer and (2) a sinc convolutional layer. Compared to the existing state-of-the-art model, WaveGAN, which uses a single decoder architecture, our model generates audio signals and converts them to higher resolutions progressively while using a significantly smaller number of parameters, e.g., 20x fewer for 44.1 kHz output. Our experiments show that the audio signals can be generated in real time with quality comparable to that of WaveGAN in terms of both inception score and human evaluation.
English
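The abstract names a sinc convolutional layer as one of PUGAN's two novel modules. As a rough illustration of what such a layer can look like, here is a minimal sketch of SincNet-style learnable band-pass filters, assuming a PyTorch implementation; the class name SincConv, the cutoff initialization, and the Hamming window are illustrative assumptions, not details taken from the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SincConv(nn.Module):
    """1-D convolution whose kernels are learnable band-pass sinc filters.

    Each output channel learns just two scalars (low cutoff and bandwidth),
    so the layer needs far fewer parameters than a free-form convolution
    with the same kernel size.
    """

    def __init__(self, out_channels, kernel_size, sample_rate=44100):
        super().__init__()
        assert kernel_size % 2 == 1, "odd kernel keeps filters symmetric"
        self.kernel_size = kernel_size
        self.sample_rate = sample_rate
        # Spread the initial pass bands across the spectrum (hypothetical init).
        self.low_hz = nn.Parameter(
            torch.linspace(30.0, sample_rate / 2 - 200.0, out_channels).unsqueeze(1))
        self.band_hz = nn.Parameter(torch.full((out_channels, 1), 100.0))
        # Fixed pieces: a time axis in seconds and a Hamming window.
        n = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1).float()
        self.register_buffer("t", (n / sample_rate).unsqueeze(0))
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):  # x: (batch, 1, time)
        low = torch.abs(self.low_hz)
        high = torch.clamp(low + torch.abs(self.band_hz),
                           max=self.sample_rate / 2)

        # A band-pass filter is the difference of two low-pass sinc filters.
        def lowpass(fc):
            return 2 * fc * torch.sinc(2 * fc * self.t)

        filters = (lowpass(high) - lowpass(low)) * self.window
        filters = filters / (filters.abs().sum(dim=1, keepdim=True) + 1e-8)
        return F.conv1d(x, filters.unsqueeze(1), padding=self.kernel_size // 2)
```

Because only the two cutoffs per channel are trained, a front end built this way keeps the parameter count small, which is consistent with the abstract's emphasis on a model far smaller than WaveGAN.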
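Similarly, for the progressive-upsampling idea (a lightweight base generator followed by bandwidth extension (BWE) stages, per Sections 4.2.1 and 4.2.2 of the table of contents below), here is a minimal sketch of how such stages might be stacked, again assuming PyTorch; the layer shapes, channel counts, and the linear-interpolation upsampler are hypothetical stand-ins for the thesis's neural upsampling layer, not its actual architecture.

```python
import torch
import torch.nn as nn


class BWEStage(nn.Module):
    """Doubles the sampling rate of a raw waveform (one progressive step)."""

    def __init__(self, channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="linear", align_corners=False),
            nn.Conv1d(1, channels, kernel_size=9, padding=4),
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, 1, kernel_size=9, padding=4),
            nn.Tanh(),
        )

    def forward(self, x):  # x: (batch, 1, time) -> (batch, 1, 2 * time)
        return self.net(x)


class ProgressiveGenerator(nn.Module):
    """A lightweight base generator plus a chain of BWE stages."""

    def __init__(self, latent_dim=100, base_len=2756, num_stages=4):
        super().__init__()
        # Toy stand-in for the low-rate base generator.
        self.base = nn.Sequential(nn.Linear(latent_dim, base_len), nn.Tanh())
        self.stages = nn.ModuleList(BWEStage() for _ in range(num_stages))

    def forward(self, z, up_to=None):
        x = self.base(z).unsqueeze(1)      # (batch, 1, base_len)
        for stage in self.stages[:up_to]:  # grow the resolution progressively
            x = stage(x)
        return x


# Usage: 2756 samples doubled 4 times gives 44096 samples (~1 s at 44.1 kHz).
g = ProgressiveGenerator()
audio = g(torch.randn(8, 100))  # (8, 1, 44096)
```

The design point this sketch tries to capture is that each stage only refines an already-plausible low-rate signal, so the per-stage networks can stay small and the model can be trained and run stage by stage.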
Table of Contents dcterms.tableOfContents CHAPTER 1. INTRODUCTION 1
CHAPTER 2. RELATED WORK 5
2.1 PHYSICALLY BASED SOUND SYNTHESIS 5
2.2 GAN-BASED AUDIO GENERATION 6
2.3 AUDIO-TO-AUDIO CONVERSION 8
CHAPTER 3. DATA CHARACTERISTICS: AUDIO VERSUS IMAGE 9
CHAPTER 4. PUGAN: PROGRESSIVE UPSAMPLING GAN 12
4.2 GENERATOR 15
4.2.1 Lightweight WaveGAN module 15
4.2.2 Bandwidth extension module (BWE) 16
4.3 DISCRIMINATOR 18
CHAPTER 5. EXPERIMENT 19
5.1 VIRTUAL ENVIRONMENT EXPERIMENT 19
5.1.1 Dataset 19
5.1.2 Experimental design 20
5.2 PUGAN EXPERIMENT 24
5.2.1 Dataset 24
5.2.2 Training 25
5.2.3 Inception score (IS) 26
5.2.4 Human evaluation 27
CHAPTER 6. RESULTS AND DISCUSSION 28
6.1 VIRTUAL ENVIRONMENT RESULTS 28
6.1.1 Naturalness and realism 28
6.1.2 Perceived delay 30
6.2 PUGAN RESULTS 32
6.2.1 Inception score and human evaluation 32
6.2.2 Computation cost 35
CHAPTER 7. CONCLUSION AND FUTURE WORK 37
REFERENCES 39
ACKNOWLEDGEMENT