Members/yuuhi/OpenCL: fft_Example/ReadMe.txt comparison

comparison fft_Example/ReadMe.txt @ 2:ccea4e6a1945

add OpenCL example

author	Yuhi TOMARI <yuhi@cr.ie.u-ryukyu.ac.jp>
date	Tue, 22 Jan 2013 23:19:41 +0900
parents
children	ea2e7ce9d5bb

comparison

equal deleted inserted replaced

-:b511640282d2
+:ccea4e6a1945
+### OpenCL FFT (Fast Fourier Transform) ###
+===========================================================================
+DESCRIPTION:
+This example shows how OpenCL can be used to compute FFT. Algorithm implemented
+is described in the following references
+1) Fitting FFT onto the G80 Architecture
+by Vasily Volkov and Brian Kazian
+University of California, Berkeley, May 19, 2008
+http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project6_report.pdf
+2) High Performance Discrete Fourier Tansforms on Graphics Processors
+by Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John Manferdelli
+Supercomputing 2008.
+http://portal.acm.org/citation.cfm?id=1413373
+Current version only supports power of two transform sizes however it should be straight forward
+to extend the sample to non-power of two but power of base primes like 3, 5, 7.
+Current version supports 1D, 2D, 3D batched transforms.
+Current version supports both in-place and out-of-place transforms.
+Current version supports both forward and inverse transform.
+Current version supports both plannar and interleaved data format.
+Current version only supports complex-to-complex transform. For real
+transform, one can use plannar data format with imaginary array mem set to zero.
+Current version only supports transform on GPU device. Accelerate framework can be used on CPU.
+Current version supports sizes that fits in device global memory although "Twist Kernel" is
+included in fft plan if user wants to virtualize (implement sizes larger than what can fit
+in GPU global memory).
+Users can dump all the kernels and global, local dimensions with which these kernels are run
+so that they can not only inspect/modify these kernels and understand how FFT is being
+computed on GPU, but also create their own stand along app for executing FFT of size of
+their interest.
+For any given signal size n, sample crates a clFFT_Plan, that encapsulates the kernel string,
+associated compiled cl_program. Note that kernel string is generated at runtime based on
+input size, dimension (1D, 2D, 3D) and data format (plannar or interleaved) along with some
+device depended parameters encapsulated in the clFFT_Plan. These device dependent parameters
+are set such that kernel is generated for high performance meeting following requirements
+1) Access pattern to global memory (of-chip DRAM) is such that memory transaction
+coalesceing is achieved if device supports it thus achieving full bandwidth
+2) Local shuffles (matrix transposes or data sharing among work items of a workgroup)
+are band conflict free if local memory is banked.
+3) Kernel is fully optimized for memory hierarcy meaning that it uses GPU's large
+vector register file, which is fastest, first before reverting to local memory
+for data sharing among work items to save global DRAM bandwidth and only then
+reverts to global memory if signal size is such that transform cannnot be computed
+by singal workgroup and thus require global communation among work groups.
+Users can modify these parameters to get best performance on their particular GPU.
+Users how really want to understand the details of implementation are highly encouraged
+to read above two references but here is a high level description.
+At a higher the algorithm decomposes signal of length N into factors as
+N = N1 x N2 x N3 x N4 x .... Nn
+where the factors (N1, ....., Nn) are sorted such that N1 is largest. It thus decomposes
+N into n-dimensional matrix. It than applies fft along each dimension, multiply by twiddle
+factors and transposes the matrix as follow
+N2 x N3 x N4 x ............ x Nn x N1   (fft along N1 and transpose)
+N3 x N4 x N5 x ....    x Nn x N2 x N1   (fft along N2 and transpose)
+N4 x N5 x N6 x .. x Nn x N3 x N2 x N1   (fft along N3 and transpose)
+......
+Nn x Nn-1 x Nn-2 x ........ N3 x N2 x N1 (fft along Nn and transpose)
+Decomposition is such that algorithm is fully optimized for memory hierarchy. N1 (largest
+base radix) is constrained by maximum register usage by work item (largest size of in-register
+fft) and product N2 x N3 .... x Nn determine the maximum size of work group which is constrained
+by local memory used by work group (local memory is used to share data among work items i.e.
+local transposes). Togather these two parameters determine the maximum size fft that can be
+computed by just using register file and local memory without reverting to global memory
+for transpose (i.e. these sizes do not require global transpose and thus no inter work group
+communication). However, for larger sizes, global communication among workgroup is required
+and multiple kernel launches are needed depending on the size and the base radix used.
+For details of parameters user can play with, please see the comments in fft_internal.h
+and kernel_string.cpp, which has the main kernel generator functions ... especially
+see the comments preceeding function getRadixArray and getGlobalRadixInfo.
+User can adjust these parameters you achieve best performance on his device.
+Description of API Calls
+=========================
+clFFT_Plan clFFT_CreatePlan( cl_context context, clFFT_Dim3 n, clFFT_Dimension dim, clFFT_DataFormat dataFormat, cl_int *error_code );
+This function creates a plan and returns a handle to it for use with other functions below.
+context    context in which things are happening
+n          n.x, n.y, n.z contain the dimension of signal (length along each dimension)
+dim        much be one of clFFT_1D, clFFT_2D, clFFT_3D for one, two or three dimensional fft
+dataFormat much be either clFFT_InterleavedComplexFormat or clFFT_SplitComplexFormat for either interleaved or plannar data (real and imaginary)
+error_code pointer for getting error back in plan creation. In case of error NULL plan is returned
+==========================
+void clFFT_DestroyPlan( clFFT_Plan plan );
+Function to release/free resources
+==========================
+cl_int clFFT_ExecuteInterleaved( cl_command_queue queue, clFFT_Plan plan, cl_int batchSize, clFFT_Direction dir,
+								 cl_mem data_in, cl_mem data_out,
+								 cl_int num_events, cl_event *event_list, cl_event *event );
+Function for interleaved fft execution.
+queue      command queue for the device on which fft needs to be executed. It should be present in the context for this plan was created
+plan       fft plan that was created using clFFT_CreatePlan
+batchSize  size of the batch for batched transform
+dir        much be either clFFT_Forward or clFFT_Inverse for forward or inverse transform
+data_in    input data
+data_out   output data. For in-place transform, pass same mem object for both data_in and data_out
+num_events, event_list and event are for future use for letting fft fit in other CL based application pipeline through event dependency.
+Not implemented in this version yet so these parameters are redundant right now. Just pass NULL.
+=========================
+cl_int clFFT_ExecutePlannar( cl_command_queue queue, clFFT_Plan plan, cl_int batchSize, clFFT_Direction dir,
+							 cl_mem data_in_real, cl_mem data_in_imag, cl_mem data_out_real, cl_mem data_out_imag,
+							 cl_int num_events, cl_event *event_list, cl_event *event );
+Same as above but for plannar data type.
+=========================
+cl_int clFFT_1DTwistInterleaved( clFFT_Plan plan, cl_mem mem, size_t numRows, size_t numCols, size_t startRow, clFFT_Direction dir );
+Function for applying twist (twiddle factor multiplication) for virtualizing computation of very large ffts that cannot fit into global
+memory at once but can be decomposed into many global memory fitting ffts followed by twiddle multiplication (twist) followed by transpose
+followed by again many global memory fitting ffts.
+=========================
+cl_int clFFT_1DTwistPlanner( clFFT_Plan plan, cl_mem mem_real, cl_mem mem_imag, size_t numRows, size_t numCols, size_t startRow, clFFT_Direction dir );
+Same fucntion as above but for plannar data
+=========================
+void clFFT_DumpPlan( clFFT_Plan plan, FILE *file);
+Function to dump the plan. Passing stdout to file prints out the plan to standard out. It prints out
+the kernel string and local, global dimension with which each kernel is executed in this plan.
+==================================================================================
+IMPORTANT NOTE ON PERFORMANCE:
+Currently there are a few known performance issues (bug) that this sample has discovered
+in rumtime and code generation that are being actively fixed. Hence, for sizes >= 1024,
+performance is much below the expected peak for any particular size. However, we have
+internally verified that once these bugs are fixed, performance should be on par with
+expected peak. Note that these are bugs in OpenCL runtime/compiler and not in this
+sample.
+===========================================================================
+BUILD REQUIREMENTS:
+Mac OS X v10.6 or later
+If you are running in Xcode, be sure to pass file name "param.txt". You can do that
+by double clicking OpenCL_FFT under executable and then click on Argument tab and
+add ./../../param.txt under "Arguments to be passed on launch" section.
+===========================================================================
+RUNTIME REQUIREMENTS:
+. Mac OS X v10.6 or later with OpenCL 1.0
+. For good performance, device should support local memory.
+FFT performance critically depend on how efficiently local shuffles
+(matrix transposes) using local memory to reduce external DRAM bandwidth
+requirement.
+===========================================================================
+PACKAGING LIST:
+AccelerateError.pdf
+clFFT.h
+Error.pdf
+fft_base_kernels.h
+fft_execute.cpp
+fft_internal.h
+fft_kernelstring.cpp
+fft_setup.cpp
+main.cpp
+Makefile
+OpenCL_FFT.xcodeproj
+OpenCLError.pdf
+param.txt
+procs.h
+ReadMe.txt
+===========================================================================
+CHANGES FROM PREVIOUS VERSIONS:
+Version 1.0
+- First version.
+===========================================================================
+Copyright (C) 2008 Apple Inc. All rights reserved.

Mercurial > hg > Members > yuuhi > OpenCL

comparison fft_Example/ReadMe.txt @ 2:ccea4e6a1945