comparison fft_Example/ReadMe.txt @ 2:ccea4e6a1945

add OpenCL example
author Yuhi TOMARI <yuhi@cr.ie.u-ryukyu.ac.jp>
date Tue, 22 Jan 2013 23:19:41 +0900
parents
children ea2e7ce9d5bb
comparison
equal deleted inserted replaced
1:b511640282d2 2:ccea4e6a1945
1 ### OpenCL FFT (Fast Fourier Transform) ###
2
3 ===========================================================================
4 DESCRIPTION:
5
6 This example shows how OpenCL can be used to compute FFT. Algorithm implemented
7 is described in the following references
8
9 1) Fitting FFT onto the G80 Architecture
10 by Vasily Volkov and Brian Kazian
11 University of California, Berkeley, May 19, 2008
12 http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project6_report.pdf
13
14 2) High Performance Discrete Fourier Tansforms on Graphics Processors
15 by Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John Manferdelli
16 Supercomputing 2008.
17 http://portal.acm.org/citation.cfm?id=1413373
18
19 Current version only supports power of two transform sizes however it should be straight forward
20 to extend the sample to non-power of two but power of base primes like 3, 5, 7.
21
22 Current version supports 1D, 2D, 3D batched transforms.
23
24 Current version supports both in-place and out-of-place transforms.
25
26 Current version supports both forward and inverse transform.
27
28 Current version supports both plannar and interleaved data format.
29
30 Current version only supports complex-to-complex transform. For real
31 transform, one can use plannar data format with imaginary array mem set to zero.
32
33 Current version only supports transform on GPU device. Accelerate framework can be used on CPU.
34
35 Current version supports sizes that fits in device global memory although "Twist Kernel" is
36 included in fft plan if user wants to virtualize (implement sizes larger than what can fit
37 in GPU global memory).
38
39 Users can dump all the kernels and global, local dimensions with which these kernels are run
40 so that they can not only inspect/modify these kernels and understand how FFT is being
41 computed on GPU, but also create their own stand along app for executing FFT of size of
42 their interest.
43
44 For any given signal size n, sample crates a clFFT_Plan, that encapsulates the kernel string,
45 associated compiled cl_program. Note that kernel string is generated at runtime based on
46 input size, dimension (1D, 2D, 3D) and data format (plannar or interleaved) along with some
47 device depended parameters encapsulated in the clFFT_Plan. These device dependent parameters
48 are set such that kernel is generated for high performance meeting following requirements
49
50 1) Access pattern to global memory (of-chip DRAM) is such that memory transaction
51 coalesceing is achieved if device supports it thus achieving full bandwidth
52 2) Local shuffles (matrix transposes or data sharing among work items of a workgroup)
53 are band conflict free if local memory is banked.
54 3) Kernel is fully optimized for memory hierarcy meaning that it uses GPU's large
55 vector register file, which is fastest, first before reverting to local memory
56 for data sharing among work items to save global DRAM bandwidth and only then
57 reverts to global memory if signal size is such that transform cannnot be computed
58 by singal workgroup and thus require global communation among work groups.
59
60 Users can modify these parameters to get best performance on their particular GPU.
61
62 Users how really want to understand the details of implementation are highly encouraged
63 to read above two references but here is a high level description.
64 At a higher the algorithm decomposes signal of length N into factors as
65
66 N = N1 x N2 x N3 x N4 x .... Nn
67
68 where the factors (N1, ....., Nn) are sorted such that N1 is largest. It thus decomposes
69 N into n-dimensional matrix. It than applies fft along each dimension, multiply by twiddle
70 factors and transposes the matrix as follow
71
72 N2 x N3 x N4 x ............ x Nn x N1 (fft along N1 and transpose)
73 N3 x N4 x N5 x .... x Nn x N2 x N1 (fft along N2 and transpose)
74 N4 x N5 x N6 x .. x Nn x N3 x N2 x N1 (fft along N3 and transpose)
75
76 ......
77 Nn x Nn-1 x Nn-2 x ........ N3 x N2 x N1 (fft along Nn and transpose)
78
79 Decomposition is such that algorithm is fully optimized for memory hierarchy. N1 (largest
80 base radix) is constrained by maximum register usage by work item (largest size of in-register
81 fft) and product N2 x N3 .... x Nn determine the maximum size of work group which is constrained
82 by local memory used by work group (local memory is used to share data among work items i.e.
83 local transposes). Togather these two parameters determine the maximum size fft that can be
84 computed by just using register file and local memory without reverting to global memory
85 for transpose (i.e. these sizes do not require global transpose and thus no inter work group
86 communication). However, for larger sizes, global communication among workgroup is required
87 and multiple kernel launches are needed depending on the size and the base radix used.
88
89 For details of parameters user can play with, please see the comments in fft_internal.h
90 and kernel_string.cpp, which has the main kernel generator functions ... especially
91 see the comments preceeding function getRadixArray and getGlobalRadixInfo.
92 User can adjust these parameters you achieve best performance on his device.
93
94 Description of API Calls
95 =========================
96 clFFT_Plan clFFT_CreatePlan( cl_context context, clFFT_Dim3 n, clFFT_Dimension dim, clFFT_DataFormat dataFormat, cl_int *error_code );
97
98 This function creates a plan and returns a handle to it for use with other functions below.
99 context context in which things are happening
100 n n.x, n.y, n.z contain the dimension of signal (length along each dimension)
101 dim much be one of clFFT_1D, clFFT_2D, clFFT_3D for one, two or three dimensional fft
102 dataFormat much be either clFFT_InterleavedComplexFormat or clFFT_SplitComplexFormat for either interleaved or plannar data (real and imaginary)
103 error_code pointer for getting error back in plan creation. In case of error NULL plan is returned
104 ==========================
105 void clFFT_DestroyPlan( clFFT_Plan plan );
106
107 Function to release/free resources
108 ==========================
109 cl_int clFFT_ExecuteInterleaved( cl_command_queue queue, clFFT_Plan plan, cl_int batchSize, clFFT_Direction dir,
110 cl_mem data_in, cl_mem data_out,
111 cl_int num_events, cl_event *event_list, cl_event *event );
112
113 Function for interleaved fft execution.
114 queue command queue for the device on which fft needs to be executed. It should be present in the context for this plan was created
115 plan fft plan that was created using clFFT_CreatePlan
116 batchSize size of the batch for batched transform
117 dir much be either clFFT_Forward or clFFT_Inverse for forward or inverse transform
118 data_in input data
119 data_out output data. For in-place transform, pass same mem object for both data_in and data_out
120 num_events, event_list and event are for future use for letting fft fit in other CL based application pipeline through event dependency.
121 Not implemented in this version yet so these parameters are redundant right now. Just pass NULL.
122
123 =========================
124 cl_int clFFT_ExecutePlannar( cl_command_queue queue, clFFT_Plan plan, cl_int batchSize, clFFT_Direction dir,
125 cl_mem data_in_real, cl_mem data_in_imag, cl_mem data_out_real, cl_mem data_out_imag,
126 cl_int num_events, cl_event *event_list, cl_event *event );
127
128 Same as above but for plannar data type.
129 =========================
130 cl_int clFFT_1DTwistInterleaved( clFFT_Plan plan, cl_mem mem, size_t numRows, size_t numCols, size_t startRow, clFFT_Direction dir );
131
132 Function for applying twist (twiddle factor multiplication) for virtualizing computation of very large ffts that cannot fit into global
133 memory at once but can be decomposed into many global memory fitting ffts followed by twiddle multiplication (twist) followed by transpose
134 followed by again many global memory fitting ffts.
135
136 =========================
137 cl_int clFFT_1DTwistPlanner( clFFT_Plan plan, cl_mem mem_real, cl_mem mem_imag, size_t numRows, size_t numCols, size_t startRow, clFFT_Direction dir );
138
139 Same fucntion as above but for plannar data
140 =========================
141
142 void clFFT_DumpPlan( clFFT_Plan plan, FILE *file);
143
144 Function to dump the plan. Passing stdout to file prints out the plan to standard out. It prints out
145 the kernel string and local, global dimension with which each kernel is executed in this plan.
146
147 ==================================================================================
148 IMPORTANT NOTE ON PERFORMANCE:
149
150 Currently there are a few known performance issues (bug) that this sample has discovered
151 in rumtime and code generation that are being actively fixed. Hence, for sizes >= 1024,
152 performance is much below the expected peak for any particular size. However, we have
153 internally verified that once these bugs are fixed, performance should be on par with
154 expected peak. Note that these are bugs in OpenCL runtime/compiler and not in this
155 sample.
156
157 ===========================================================================
158 BUILD REQUIREMENTS:
159
160 Mac OS X v10.6 or later
161
162 If you are running in Xcode, be sure to pass file name "param.txt". You can do that
163 by double clicking OpenCL_FFT under executable and then click on Argument tab and
164 add ./../../param.txt under "Arguments to be passed on launch" section.
165
166 ===========================================================================
167 RUNTIME REQUIREMENTS:
168
169 . Mac OS X v10.6 or later with OpenCL 1.0
170 . For good performance, device should support local memory.
171 FFT performance critically depend on how efficiently local shuffles
172 (matrix transposes) using local memory to reduce external DRAM bandwidth
173 requirement.
174
175 ===========================================================================
176 PACKAGING LIST:
177
178 AccelerateError.pdf
179 clFFT.h
180 Error.pdf
181 fft_base_kernels.h
182 fft_execute.cpp
183 fft_internal.h
184 fft_kernelstring.cpp
185 fft_setup.cpp
186 main.cpp
187 Makefile
188 OpenCL_FFT.xcodeproj
189 OpenCLError.pdf
190 param.txt
191 procs.h
192 ReadMe.txt
193
194 ===========================================================================
195 CHANGES FROM PREVIOUS VERSIONS:
196
197 Version 1.0
198 - First version.
199
200 ===========================================================================
201 Copyright (C) 2008 Apple Inc. All rights reserved.