Mercurial > hg > Members > yuuhi > OpenCL
comparison fft_Example/ReadMe.txt @ 2:ccea4e6a1945
add OpenCL example
author | Yuhi TOMARI <yuhi@cr.ie.u-ryukyu.ac.jp> |
---|---|
date | Tue, 22 Jan 2013 23:19:41 +0900 |
parents | |
children | ea2e7ce9d5bb |
comparison
equal
deleted
inserted
replaced
1:b511640282d2 | 2:ccea4e6a1945 |
---|---|
1 ### OpenCL FFT (Fast Fourier Transform) ### | |
2 | |
3 =========================================================================== | |
4 DESCRIPTION: | |
5 | |
6 This example shows how OpenCL can be used to compute FFT. Algorithm implemented | |
7 is described in the following references | |
8 | |
9 1) Fitting FFT onto the G80 Architecture | |
10 by Vasily Volkov and Brian Kazian | |
11 University of California, Berkeley, May 19, 2008 | |
12 http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project6_report.pdf | |
13 | |
14 2) High Performance Discrete Fourier Tansforms on Graphics Processors | |
15 by Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John Manferdelli | |
16 Supercomputing 2008. | |
17 http://portal.acm.org/citation.cfm?id=1413373 | |
18 | |
19 Current version only supports power of two transform sizes however it should be straight forward | |
20 to extend the sample to non-power of two but power of base primes like 3, 5, 7. | |
21 | |
22 Current version supports 1D, 2D, 3D batched transforms. | |
23 | |
24 Current version supports both in-place and out-of-place transforms. | |
25 | |
26 Current version supports both forward and inverse transform. | |
27 | |
28 Current version supports both plannar and interleaved data format. | |
29 | |
30 Current version only supports complex-to-complex transform. For real | |
31 transform, one can use plannar data format with imaginary array mem set to zero. | |
32 | |
33 Current version only supports transform on GPU device. Accelerate framework can be used on CPU. | |
34 | |
35 Current version supports sizes that fits in device global memory although "Twist Kernel" is | |
36 included in fft plan if user wants to virtualize (implement sizes larger than what can fit | |
37 in GPU global memory). | |
38 | |
39 Users can dump all the kernels and global, local dimensions with which these kernels are run | |
40 so that they can not only inspect/modify these kernels and understand how FFT is being | |
41 computed on GPU, but also create their own stand along app for executing FFT of size of | |
42 their interest. | |
43 | |
44 For any given signal size n, sample crates a clFFT_Plan, that encapsulates the kernel string, | |
45 associated compiled cl_program. Note that kernel string is generated at runtime based on | |
46 input size, dimension (1D, 2D, 3D) and data format (plannar or interleaved) along with some | |
47 device depended parameters encapsulated in the clFFT_Plan. These device dependent parameters | |
48 are set such that kernel is generated for high performance meeting following requirements | |
49 | |
50 1) Access pattern to global memory (of-chip DRAM) is such that memory transaction | |
51 coalesceing is achieved if device supports it thus achieving full bandwidth | |
52 2) Local shuffles (matrix transposes or data sharing among work items of a workgroup) | |
53 are band conflict free if local memory is banked. | |
54 3) Kernel is fully optimized for memory hierarcy meaning that it uses GPU's large | |
55 vector register file, which is fastest, first before reverting to local memory | |
56 for data sharing among work items to save global DRAM bandwidth and only then | |
57 reverts to global memory if signal size is such that transform cannnot be computed | |
58 by singal workgroup and thus require global communation among work groups. | |
59 | |
60 Users can modify these parameters to get best performance on their particular GPU. | |
61 | |
62 Users how really want to understand the details of implementation are highly encouraged | |
63 to read above two references but here is a high level description. | |
64 At a higher the algorithm decomposes signal of length N into factors as | |
65 | |
66 N = N1 x N2 x N3 x N4 x .... Nn | |
67 | |
68 where the factors (N1, ....., Nn) are sorted such that N1 is largest. It thus decomposes | |
69 N into n-dimensional matrix. It than applies fft along each dimension, multiply by twiddle | |
70 factors and transposes the matrix as follow | |
71 | |
72 N2 x N3 x N4 x ............ x Nn x N1 (fft along N1 and transpose) | |
73 N3 x N4 x N5 x .... x Nn x N2 x N1 (fft along N2 and transpose) | |
74 N4 x N5 x N6 x .. x Nn x N3 x N2 x N1 (fft along N3 and transpose) | |
75 | |
76 ...... | |
77 Nn x Nn-1 x Nn-2 x ........ N3 x N2 x N1 (fft along Nn and transpose) | |
78 | |
79 Decomposition is such that algorithm is fully optimized for memory hierarchy. N1 (largest | |
80 base radix) is constrained by maximum register usage by work item (largest size of in-register | |
81 fft) and product N2 x N3 .... x Nn determine the maximum size of work group which is constrained | |
82 by local memory used by work group (local memory is used to share data among work items i.e. | |
83 local transposes). Togather these two parameters determine the maximum size fft that can be | |
84 computed by just using register file and local memory without reverting to global memory | |
85 for transpose (i.e. these sizes do not require global transpose and thus no inter work group | |
86 communication). However, for larger sizes, global communication among workgroup is required | |
87 and multiple kernel launches are needed depending on the size and the base radix used. | |
88 | |
89 For details of parameters user can play with, please see the comments in fft_internal.h | |
90 and kernel_string.cpp, which has the main kernel generator functions ... especially | |
91 see the comments preceeding function getRadixArray and getGlobalRadixInfo. | |
92 User can adjust these parameters you achieve best performance on his device. | |
93 | |
94 Description of API Calls | |
95 ========================= | |
96 clFFT_Plan clFFT_CreatePlan( cl_context context, clFFT_Dim3 n, clFFT_Dimension dim, clFFT_DataFormat dataFormat, cl_int *error_code ); | |
97 | |
98 This function creates a plan and returns a handle to it for use with other functions below. | |
99 context context in which things are happening | |
100 n n.x, n.y, n.z contain the dimension of signal (length along each dimension) | |
101 dim much be one of clFFT_1D, clFFT_2D, clFFT_3D for one, two or three dimensional fft | |
102 dataFormat much be either clFFT_InterleavedComplexFormat or clFFT_SplitComplexFormat for either interleaved or plannar data (real and imaginary) | |
103 error_code pointer for getting error back in plan creation. In case of error NULL plan is returned | |
104 ========================== | |
105 void clFFT_DestroyPlan( clFFT_Plan plan ); | |
106 | |
107 Function to release/free resources | |
108 ========================== | |
109 cl_int clFFT_ExecuteInterleaved( cl_command_queue queue, clFFT_Plan plan, cl_int batchSize, clFFT_Direction dir, | |
110 cl_mem data_in, cl_mem data_out, | |
111 cl_int num_events, cl_event *event_list, cl_event *event ); | |
112 | |
113 Function for interleaved fft execution. | |
114 queue command queue for the device on which fft needs to be executed. It should be present in the context for this plan was created | |
115 plan fft plan that was created using clFFT_CreatePlan | |
116 batchSize size of the batch for batched transform | |
117 dir much be either clFFT_Forward or clFFT_Inverse for forward or inverse transform | |
118 data_in input data | |
119 data_out output data. For in-place transform, pass same mem object for both data_in and data_out | |
120 num_events, event_list and event are for future use for letting fft fit in other CL based application pipeline through event dependency. | |
121 Not implemented in this version yet so these parameters are redundant right now. Just pass NULL. | |
122 | |
123 ========================= | |
124 cl_int clFFT_ExecutePlannar( cl_command_queue queue, clFFT_Plan plan, cl_int batchSize, clFFT_Direction dir, | |
125 cl_mem data_in_real, cl_mem data_in_imag, cl_mem data_out_real, cl_mem data_out_imag, | |
126 cl_int num_events, cl_event *event_list, cl_event *event ); | |
127 | |
128 Same as above but for plannar data type. | |
129 ========================= | |
130 cl_int clFFT_1DTwistInterleaved( clFFT_Plan plan, cl_mem mem, size_t numRows, size_t numCols, size_t startRow, clFFT_Direction dir ); | |
131 | |
132 Function for applying twist (twiddle factor multiplication) for virtualizing computation of very large ffts that cannot fit into global | |
133 memory at once but can be decomposed into many global memory fitting ffts followed by twiddle multiplication (twist) followed by transpose | |
134 followed by again many global memory fitting ffts. | |
135 | |
136 ========================= | |
137 cl_int clFFT_1DTwistPlanner( clFFT_Plan plan, cl_mem mem_real, cl_mem mem_imag, size_t numRows, size_t numCols, size_t startRow, clFFT_Direction dir ); | |
138 | |
139 Same fucntion as above but for plannar data | |
140 ========================= | |
141 | |
142 void clFFT_DumpPlan( clFFT_Plan plan, FILE *file); | |
143 | |
144 Function to dump the plan. Passing stdout to file prints out the plan to standard out. It prints out | |
145 the kernel string and local, global dimension with which each kernel is executed in this plan. | |
146 | |
147 ================================================================================== | |
148 IMPORTANT NOTE ON PERFORMANCE: | |
149 | |
150 Currently there are a few known performance issues (bug) that this sample has discovered | |
151 in rumtime and code generation that are being actively fixed. Hence, for sizes >= 1024, | |
152 performance is much below the expected peak for any particular size. However, we have | |
153 internally verified that once these bugs are fixed, performance should be on par with | |
154 expected peak. Note that these are bugs in OpenCL runtime/compiler and not in this | |
155 sample. | |
156 | |
157 =========================================================================== | |
158 BUILD REQUIREMENTS: | |
159 | |
160 Mac OS X v10.6 or later | |
161 | |
162 If you are running in Xcode, be sure to pass file name "param.txt". You can do that | |
163 by double clicking OpenCL_FFT under executable and then click on Argument tab and | |
164 add ./../../param.txt under "Arguments to be passed on launch" section. | |
165 | |
166 =========================================================================== | |
167 RUNTIME REQUIREMENTS: | |
168 | |
169 . Mac OS X v10.6 or later with OpenCL 1.0 | |
170 . For good performance, device should support local memory. | |
171 FFT performance critically depend on how efficiently local shuffles | |
172 (matrix transposes) using local memory to reduce external DRAM bandwidth | |
173 requirement. | |
174 | |
175 =========================================================================== | |
176 PACKAGING LIST: | |
177 | |
178 AccelerateError.pdf | |
179 clFFT.h | |
180 Error.pdf | |
181 fft_base_kernels.h | |
182 fft_execute.cpp | |
183 fft_internal.h | |
184 fft_kernelstring.cpp | |
185 fft_setup.cpp | |
186 main.cpp | |
187 Makefile | |
188 OpenCL_FFT.xcodeproj | |
189 OpenCLError.pdf | |
190 param.txt | |
191 procs.h | |
192 ReadMe.txt | |
193 | |
194 =========================================================================== | |
195 CHANGES FROM PREVIOUS VERSIONS: | |
196 | |
197 Version 1.0 | |
198 - First version. | |
199 | |
200 =========================================================================== | |
201 Copyright (C) 2008 Apple Inc. All rights reserved. |