Performance Analysis and Optimization of three-dimensional FDTD on GPU Cluster
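- Subject (keywords): FDTD, GPU cluster, CUDA, OpenCL
- Publisher: Korea University Graduate School
- Advisor: 박규환
- Year of publication: 2012
- Degree conferred: 2012. 2
- Degree: Doctoral
- Department: Graduate School, Department of Physics
- Pages: 115 p
- URI: http://www.dcollection.net/handler/korea/000000033177
- Language of text: English
- Submitted original: 000045696565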
Abstract/Summary
In this thesis, we analyze the performance of the FDTD (Finite-Difference Time-Domain) method on a GPU (Graphics Processing Unit) cluster and propose optimization methods for it. The FDTD method is an especially popular numerical method for electromagnetic field simulation. Although the FDTD method has many strengths, such as accuracy, robustness, and ease of understanding, three-dimensional FDTD simulation for practical applications requires a large amount of fast memory and a significant run time. In recent years, GPUs have been used for scientific computation because of their high computational power and memory bandwidth, and successful hardware acceleration of large-scale FDTD simulations using GPUs has been reported. We identify the performance bottleneck and the theoretical maximum performance of the 3D FDTD method on a GPU, in units of FLOPS (FLoating-point OPerations per Second), by adopting the Roofline model. We confirm that the primary performance bottleneck is the memory bandwidth of the GPU. Based on this bottleneck, we propose suitable optimization methods for 3D FDTD on a GPU and calculate theoretically the performance improvement obtained by applying them. To test the proposed optimization methods, we performed theoretical analyses and compared them with experiments on NVIDIA GPUs. The theoretically attainable performance shows good agreement with the experimental results. To compare the speedups, we additionally obtained experimental results for two Intel CPUs and two NVIDIA GPUs. The optimized 3D FDTD program on a GTX 480 GPU was 64 times faster than the naively implemented program, and 7 times faster than the optimized program, on an i7 2600 CPU. To further increase the computational power and overcome the limited memory size of a single GPU, clustering of GPUs is necessary. The main issues affecting the performance of FDTD on a GPU cluster are load balancing and communication overhead. We focus on methods to reduce the communication overhead.
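The Roofline estimate for a single GPU mentioned above can be illustrated with a minimal sketch. It is not the thesis's own calculation: the function names are hypothetical, the peak and bandwidth values are approximate published figures for a GTX 480 treated here as assumptions, and the per-cell FLOP and byte counts are placeholder values chosen only to show the arithmetic.

```python
# Minimal Roofline sketch (illustrative; not taken from the thesis).
# Attainable performance = min(peak compute, memory bandwidth * arithmetic intensity).


def arithmetic_intensity(flops_per_cell, bytes_per_cell):
    """FLOPs performed per byte moved to or from device memory."""
    return flops_per_cell / bytes_per_cell


def roofline_gflops(peak_gflops, bandwidth_gb_s, intensity):
    """Upper bound on sustained GFLOPS according to the Roofline model."""
    return min(peak_gflops, bandwidth_gb_s * intensity)


def is_memory_bound(peak_gflops, bandwidth_gb_s, intensity):
    """True when the bandwidth roof lies below the compute roof."""
    return bandwidth_gb_s * intensity < peak_gflops


# Assumed hardware figures (approximate single-precision specs of a GTX 480).
PEAK_GFLOPS = 1345.0
BANDWIDTH_GB_S = 177.4

# Assumed (hypothetical) cost of updating one Yee cell for one time step.
FLOPS_PER_CELL = 36
BYTES_PER_CELL = 96

intensity = arithmetic_intensity(FLOPS_PER_CELL, BYTES_PER_CELL)
attainable = roofline_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, intensity)

print(f"arithmetic intensity: {intensity:.3f} FLOP/byte")
print(f"attainable performance: {attainable:.1f} GFLOPS")
print(f"memory bound: {is_memory_bound(PEAK_GFLOPS, BANDWIDTH_GB_S, intensity)}")
```

With any per-cell counts of this order, the bandwidth roof sits far below the compute roof, which is consistent with the abstract's statement that memory bandwidth is the primary bottleneck. Reducing the bytes moved per cell raises the attainable bound directly.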
The communication overhead can be hidden by overlapping computation and communication. We propose two methods, which we call the 'kernel-split method' and the 'host-buffer method', that overlap computation and communication for 3D FDTD on a GPU cluster. The host-buffer method in particular enables the overlap without any modification to the update kernels already in use on the GPU. We also present theoretical formulas to predict the overlap threshold and the total throughput for each method. To test the methods, we performed theoretical analyses and compared them with experiments on a GPU cluster with NVIDIA Tesla C2075 GPUs. Using the overlap methods with 6 GPU nodes, we verify that the theoretical predictions agree well with the experimental results. We also demonstrate that the total performance of 3D FDTD reaches 92 % of the upper limit that would apply if there were no communication overhead. Separately, we analyze the modes in a metallic rectangular waveguide. In a metal of finite conductivity, the non-vanishing skin depth requires boundary conditions on the surfaces of metallic waveguides that differ from those of a perfect conductor. This implies that the TE and TM modes are no longer allowed modes inside a metal waveguide. Instead, they must be replaced by more general ones, the HE and EH modes. By extending Marcatili's method to the metal case, we derive explicit approximate expressions for the HE and EH modes and compute them numerically. We show that, in the limiting case of a perfect conductor, these modes indeed reduce to the TE and TM modes. We also clarify the correspondence between the HE and EH modes and the TE and TM modes. Additionally, we describe attempts to improve the FDTD method for waveguide simulations.
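The idea of an overlap threshold for the cluster methods described in the abstract can be sketched with a generic cost model. This is an assumption-laden illustration, not the formulas derived in the thesis: it assumes a cubic subdomain of n x n x n cells per GPU, compute time proportional to the volume, communication time proportional to one exchanged face, and ideal overlap. All function names and numbers are hypothetical.

```python
# Generic cost model for overlapping computation and communication
# (illustrative assumptions; not the thesis's formulas).


def compute_time(n, cells_per_s):
    """Time to update an n*n*n subdomain at a given cell-update rate."""
    return n ** 3 / cells_per_s


def comm_time(n, bytes_per_face_cell, link_bytes_per_s, latency_s=0.0):
    """Time to exchange one n*n boundary face over the interconnect."""
    return latency_s + (n ** 2 * bytes_per_face_cell) / link_bytes_per_s


def step_time(t_compute, t_comm, overlap):
    """Ideal overlap hides the shorter phase; otherwise the phases add."""
    return max(t_compute, t_comm) if overlap else t_compute + t_comm


def efficiency(t_compute, t_comm, overlap):
    """Fraction of the no-communication upper limit that is achieved."""
    return t_compute / step_time(t_compute, t_comm, overlap)


def overlap_threshold(cells_per_s, bytes_per_face_cell, link_bytes_per_s):
    """Edge length n at which compute time equals communication time
    (zero latency assumed). Above it, communication can be fully hidden."""
    return cells_per_s * bytes_per_face_cell / link_bytes_per_s


# Hypothetical figures, chosen only to exercise the model.
CELLS_PER_S = 1e9
BYTES_PER_FACE_CELL = 16
LINK_BYTES_PER_S = 1e8

n = 200
t_c = compute_time(n, CELLS_PER_S)
t_m = comm_time(n, BYTES_PER_FACE_CELL, LINK_BYTES_PER_S)
print("threshold edge length:",
      overlap_threshold(CELLS_PER_S, BYTES_PER_FACE_CELL, LINK_BYTES_PER_S))
print("efficiency without overlap:", efficiency(t_c, t_m, overlap=False))
print("efficiency with overlap:", efficiency(t_c, t_m, overlap=True))
```

In this idealized model the efficiency reaches 1.0 above the threshold. A measured figure below that, such as the 92 % reported in the abstract, would reflect costs the model leaves out.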
Table of Contents
Abstract 2
1. Introduction 14
2. Performance Analysis and Optimization on Single GPU 18
2.1. Overview 18
2.2. Roofline model for GPU with 3D FDTD 19
2.3. Optimization 24
2.4. Experimental results 29
2.5. Conclusion 36
3. Overlap Computation and Communication on Cluster 38
3.1. Overview 38
3.2. Overlap methods 40
3.3. Theoretical analysis 43
3.4. Experimental results 47
3.5. Conclusion 51
4. Mode Analysis for Metallic Rectangular Waveguide 53
4.1. Overview 53
4.2. Extend the Marcatili's method 54
4.3. Verify in the perfect conductor limit 58
4.4. Attempts to improve the FDTD method 61
4.4.1. Attempt to improve the PML 62
4.4.2. Attempt to define the boundary 67
4.5. Conclusion 72
Summary 73
Bibliography 76
Appendix 82
A. Source Code 82
A.1. CUDA version 82
A.2. OpenCL version 84
A.3. C version (with SSE and OpenMP) 86
B. Building the Diskless GPU Cluster 90
B.1. Introduction 90
B.2. Setup Server 91
B.2.1. PXELINUX bootloader image 91
B.2.2. TFTP 92
B.2.3. DHCP 93
B.2.4. NFS 95
B.2.5. RSH, SSH 96
B.2.6. NTP 97
B.2.7. Configure /etc/hosts file 97
B.3. Node Kernel 97
B.4. Root File System for Node 98
B.4.1. Create Root File System using debootstrap and chroot 99
B.4.2. Configure the Root File System 100
B.4.3. Other useful packages 108
B.5. Troubleshooting 108
B.5.1. Node boot error 108
B.5.2. General Error 110

