Item | Field | Content | Language |
---|---|---|---|
Title | dc.title | Real-time Sound-Effects Synthesis of Raw-Waveform Audio with Generative Adversarial Networks | |
Author | dc.creator | Minwook Chang | |
Author (second language) | somsterms.otherName | 장민욱 | |
Affiliation | somsterms.affiliation | Department of Computer Science, Software Major, Graduate School | |
Subject (keywords) | dc.subject | Sound Synthesis, Generative Adversarial Network, Virtual Reality | |
Publisher | dc.publisher | Korea University Graduate School | |
Advisor | somsterms.advisor | 김정현 | |
Year issued | dcterms.issued | 2020 | |
Degree awarded | somsterms.awarded | February 2020 | |
Resource type | somsterms.subType | Thesis | |
Degree | somsterms.thesisDegree | Master's | |
Department | somsterms.major | Department of Computer Science, Graduate School (College of Informatics) | |
Specialty | somsterms.specialty | Software Major | |
Format | dc.format | application/pdf | |
Size | dcterms.extent | 1051537 bytes | |
Medium | dcterms.medium | application/pdf | |
Pages | somsterms.page | 54 p. | |
URL | dc.identifier | http://dcollection.korea.ac.kr/common/orgView/000000127372 | |
UCI | somsterms.UCI | I804:11009-000000127372 | |
DOI | somsterms.DOI | 10.23186/korea.000000127372.11009.0000942 | |
Text language | dc.language | English | |
Submitted original | somsterms.isBasedOn | 000046026271 | |
Abstract | dcterms.abstract |
Conventional methods for real-time sound effects in 3D graphical and virtual environments have relied on preparing all the needed samples ahead of time and simply replaying them as needed, or on parametrically modifying a basic set of samples using physically based techniques such as spring-damper simulation and modal analysis/synthesis. In this work, we propose (1) applying the generative adversarial network (GAN) approach to the problem at hand and (2) a novel generative model called PUGAN, which progressively synthesizes high-quality audio as a raw waveform.
We demonstrate our claim by training a GAN with sounds of different drums and synthesizing the sounds on the fly for a virtual drum-playing environment. The perceptual test revealed that the subjects could neither discern the synthesized sounds from the ground truth nor perceive any noticeable delay after the corresponding physical event. PUGAN builds on the recently proposed idea of progressively generating higher-resolution images by stacking multiple encoder-decoder architectures. To apply it effectively to raw audio generation, we propose two novel modules: (1) a neural upsampling layer and (2) a sinc convolutional layer. Compared to WaveGAN, the existing state-of-the-art model, which uses a single decoder architecture, our model generates audio signals and converts them to higher resolution in a progressive manner, while using a significantly smaller number of parameters, e.g., 20x fewer for 44.1 kHz output. Our experiments show that the audio signals can be generated in real time with quality comparable to that of WaveGAN with respect to inception scores and human evaluation. |
English |
Table of contents | dcterms.tableOfContents |
CHAPTER 1. INTRODUCTION 1
CHAPTER 2. RELATED WORK 5
  2.1 PHYSICALLY BASED SOUND SYNTHESIS 5
  2.2 GAN BASED AUDIO GENERATION 6
  2.3 AUDIO-TO-AUDIO CONVERSION 8
CHAPTER 3. DATA CHARACTERISTICS: AUDIO VERSUS IMAGE 9
CHAPTER 4. PUGAN: PROGRESSIVE UPSAMPLING GAN 12
  4.2 GENERATOR 15
    4.2.1 Lightweight WaveGAN module 15
    4.2.2 Bandwidth extension module (BWE) 16
  4.3 DISCRIMINATOR 18
CHAPTER 5. EXPERIMENT 19
  5.1 VIRTUAL ENVIRONMENT EXPERIMENT 19
    5.1.1 Dataset 19
    5.1.2 Experimental design 20
  5.2 PUGAN EXPERIMENT 24
    5.2.1 Dataset 24
    5.2.2 Training 25
    5.2.3 Inception score (IS) 26
    5.2.4 Human evaluation 27
CHAPTER 6. RESULT AND DISCUSSION 28
  6.1 VIRTUAL ENVIRONMENT RESULTS 28
    6.1.1 Naturalness and realism 28
    6.1.2 Perceived delay 30
  6.2 PUGAN RESULTS 32
    6.2.1 Inception score and human evaluation 32
    6.2.2 Computation cost 35
CHAPTER 7. CONCLUSION AND FUTURE WORK 37
REFERENCES 39
ACKNOWLEDGEMENT | |
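The abstract names two novel modules, a neural upsampling layer and a sinc convolutional layer, but this record does not include their formulations. The sketch below is a minimal, hypothetical illustration of what such modules could look like, assuming a SincNet-style band-pass parameterization (Ravanelli and Bengio) for the sinc convolution and a sub-pixel (pixel-shuffle) convolution for 2x neural upsampling; the class names, kernel sizes, and the 16 kHz sample rate are illustrative assumptions, not the thesis's actual implementation.

```python
# Hedged sketch: a SincNet-style sinc convolution and a sub-pixel neural
# upsampling stage, as plausible stand-ins for the two modules the abstract
# names. All hyperparameters here are assumptions, not taken from the thesis.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SincConv1d(nn.Module):
    """Band-pass filters parameterized by learnable cutoff frequencies:
    two scalars per filter instead of kernel_size free weights."""

    def __init__(self, out_channels, kernel_size, sample_rate=16000):
        super().__init__()
        if kernel_size % 2 == 0:
            kernel_size += 1  # odd length keeps the filter symmetric
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.sample_rate = sample_rate
        # Initialize cutoffs roughly evenly across the spectrum (in Hz).
        hz = torch.linspace(30.0, sample_rate / 2 - 200.0, out_channels + 1)
        self.low_hz = nn.Parameter(hz[:-1].unsqueeze(1))
        self.band_hz = nn.Parameter((hz[1:] - hz[:-1]).unsqueeze(1))
        # Fixed pieces: time axis (seconds) and a Hamming window.
        n = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1).float()
        self.register_buffer("t", n / sample_rate)
        self.register_buffer("window",
                             torch.hamming_window(kernel_size, periodic=False))

    def forward(self, x):  # x: (batch, 1, time)
        f1 = torch.abs(self.low_hz)
        f2 = torch.clamp(f1 + torch.abs(self.band_hz),
                         max=self.sample_rate / 2)

        def lowpass(f):  # windowed-sinc low-pass impulse response, cutoff f
            return 2 * f * torch.sinc(2 * f * self.t)

        # Ideal band-pass = difference of two low-pass filters.
        filters = (lowpass(f2) - lowpass(f1)) * self.window
        filters = filters.view(self.out_channels, 1, self.kernel_size)
        return F.conv1d(x, filters, padding=self.kernel_size // 2)


class NeuralUpsample(nn.Module):
    """2x temporal upsampling via sub-pixel convolution: a conv doubles the
    channel count, then the extra channels are shuffled into the time axis."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels * 2, kernel_size=9, padding=4)

    def forward(self, x):  # x: (batch, C, T) -> (batch, C, 2T)
        b, c, t = x.shape
        y = self.conv(x)                            # (b, 2C, T)
        y = y.view(b, c, 2, t).permute(0, 1, 3, 2)  # (b, C, T, 2)
        return y.reshape(b, c, 2 * t)               # interleave along time


if __name__ == "__main__":
    wav = torch.randn(4, 1, 16384)                  # batch of short clips
    feats = SincConv1d(out_channels=32, kernel_size=251)(wav)
    up = NeuralUpsample(channels=32)(feats)
    print(feats.shape, up.shape)  # (4, 32, 16384) (4, 32, 32768)
```

A sinc layer of this kind learns only two cutoff frequencies per filter rather than kernel_size free weights, which is consistent in spirit with the abstract's claim of a much smaller parameter count than WaveGAN; whether PUGAN uses this exact parameterization or upsampling scheme cannot be confirmed from this record alone.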