changeset 7:ea2e7ce9d5bb

add sample.pgm
author Yuhi TOMARI <yuhi@cr.ie.u-ryukyu.ac.jp>
date Tue, 05 Feb 2013 15:19:02 +0900
parents db074091ed0b
children 1b8da19bb31c
files fft_Example/Makefile fft_Example/ReadMe.txt fft_Example/clFFT.h fft_Example/fft_base_kernels.h fft_Example/fft_internal.h fft_Example/fft_kernelstring.cc fft_Example/fft_setup.cc fft_Example/param.txt fft_Example/param_small.txt fft_fixstart/sample2.jpg fft_fixstart/sample2.pgm
diffstat 11 files changed, 1649 insertions(+), 1621 deletions(-) [+]
line wrap: on
line diff
--- a/fft_Example/Makefile	Tue Feb 05 15:12:19 2013 +0900
+++ b/fft_Example/Makefile	Tue Feb 05 15:19:02 2013 +0900
@@ -2,11 +2,10 @@
 ATF = -framework ATF
 USE_ATF = -DUSE_ATF
 endif
-
 SRCS = fft_execute.cc fft_setup.cc main.cc fft_kernelstring.cc
 HEADERS = procs.h fft_internal.h fft_base_kernels.h clFFT.h
 TARGET = test_clFFT
-COMPILERFLAGS = -c -g -Wall -Werror -O0
+COMPILERFLAGS = -c -g -Wall -Werror -O0 #-Dgen_kernel
 CFLAGS = $(COMPILERFLAGS) ${RC_CFLAGS} ${USE_ATF}
 CC = clang++
 LIBRARIES = -framework OpenCL -framework Accelerate -framework AppKit ${RC_CFLAGS} ${ATF}
@@ -19,7 +18,7 @@
 	$(CC) $(CFLAGS) $(SRCS)
 
 $(TARGET): $(OBJECTS)
-	$(CC) $(OBJECTS) -o $@ $(LIBRARIES)
+	$(CC) $(OBJECTS) -o $(TARGET) $(LIBRARIES)
 
 clean:
 	rm -f $(TARGET) $(OBJECTS)
--- a/fft_Example/ReadMe.txt	Tue Feb 05 15:12:19 2013 +0900
+++ b/fft_Example/ReadMe.txt	Tue Feb 05 15:19:02 2013 +0900
@@ -1,1 +1,1 @@
-### OpenCL FFT (Fast Fourier Transform) ###

===========================================================================
DESCRIPTION:

This example shows how OpenCL can be used to compute FFT. Algorithm implemented
is described in the following references

1) Fitting FFT onto the G80 Architecture
   by Vasily Volkov and Brian Kazian
   University of California, Berkeley, May 19, 2008
   http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project6_report.pdf
   
2) High Performance Discrete Fourier Tansforms on Graphics Processors
   by Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John Manferdelli
   Supercomputing 2008.
   http://portal.acm.org/citation.cfm?id=1413373
   
Current version only supports power of two transform sizes however it should be straight forward
to extend the sample to non-power of two but power of base primes like 3, 5, 7. 

Current version supports 1D, 2D, 3D batched transforms. 

Current version supports both in-place and out-of-place transforms.

Current version supports both forward and inverse transform.

Current version supports both plannar and interleaved data format.

Current version only supports complex-to-complex transform. For real 
transform, one can use plannar data format with imaginary array mem set to zero. 

Current version only supports transform on GPU device. Accelerate framework can be used on CPU.

Current version supports sizes that fits in device global memory although "Twist Kernel" is 
included in fft plan if user wants to virtualize (implement sizes larger than what can fit 
in GPU global memory).

Users can dump all the kernels and global, local dimensions with which these kernels are run 
so that they can not only inspect/modify these kernels and understand how FFT is being 
computed on GPU, but also create their own stand along app for executing FFT of size of
their interest.

For any given signal size n, sample crates a clFFT_Plan, that encapsulates the kernel string, 
associated compiled cl_program. Note that kernel string is generated at runtime based on 
input size, dimension (1D, 2D, 3D) and data format (plannar or interleaved) along with some 
device depended parameters encapsulated in the clFFT_Plan. These device dependent parameters 
are set such that kernel is generated for high performance meeting following requirements

   1) Access pattern to global memory (of-chip DRAM) is such that memory transaction 
      coalesceing is achieved if device supports it thus achieving full bandwidth
   2) Local shuffles (matrix transposes or data sharing among work items of a workgroup)
      are band conflict free if local memory is banked.
   3) Kernel is fully optimized for memory hierarcy meaning that it uses GPU's large 
      vector register file, which is fastest, first before reverting to local memory 
      for data sharing among work items to save global DRAM bandwidth and only then 
      reverts to global memory if signal size is such that transform cannnot be computed
      by singal workgroup and thus require global communation among work groups.
      
Users can modify these parameters to get best performance on their particular GPU.     

Users how really want to understand the details of implementation are highly encouraged 
to read above two references but here is a high level description.
At a higher the algorithm decomposes signal of length N into factors as 

                   N = N1 x N2 x N3 x N4 x .... Nn
                   
where the factors (N1, ....., Nn) are sorted such that N1 is largest. It thus decomposes 
N into n-dimensional matrix. It than applies fft along each dimension, multiply by twiddle
factors and transposes the matrix as follow 

                      N2 x N3 x N4 x ............ x Nn x N1   (fft along N1 and transpose)
                      N3 x N4 x N5 x ....    x Nn x N2 x N1   (fft along N2 and transpose)
                      N4 x N5 x N6 x .. x Nn x N3 x N2 x N1   (fft along N3 and transpose)
                      
                      ......
                     Nn x Nn-1 x Nn-2 x ........ N3 x N2 x N1 (fft along Nn and transpose)
                     
 Decomposition is such that algorithm is fully optimized for memory hierarchy. N1 (largest
 base radix) is constrained by maximum register usage by work item (largest size of in-register 
 fft) and product N2 x N3 .... x Nn determine the maximum size of work group which is constrained
 by local memory used by work group (local memory is used to share data among work items i.e.
 local transposes). Togather these two parameters determine the maximum size fft that can be 
 computed by just using register file and local memory without reverting to global memory 
 for transpose (i.e. these sizes do not require global transpose and thus no inter work group 
 communication). However, for larger sizes, global communication among workgroup is required
 and multiple kernel launches are needed depending on the size and the base radix used. 
 
 For details of parameters user can play with, please see the comments in fft_internal.h
 and kernel_string.cpp, which has the main kernel generator functions ... especially
 see the comments preceeding function getRadixArray and getGlobalRadixInfo.
 User can adjust these parameters you achieve best performance on his device. 

Description of API Calls
=========================
clFFT_Plan clFFT_CreatePlan( cl_context context, clFFT_Dim3 n, clFFT_Dimension dim, clFFT_DataFormat dataFormat, cl_int *error_code );

This function creates a plan and returns a handle to it for use with other functions below. 
context    context in which things are happening
n          n.x, n.y, n.z contain the dimension of signal (length along each dimension)
dim        much be one of clFFT_1D, clFFT_2D, clFFT_3D for one, two or three dimensional fft
dataFormat much be either clFFT_InterleavedComplexFormat or clFFT_SplitComplexFormat for either interleaved or plannar data (real and imaginary)
error_code pointer for getting error back in plan creation. In case of error NULL plan is returned
==========================
void clFFT_DestroyPlan( clFFT_Plan plan );

Function to release/free resources
==========================
cl_int clFFT_ExecuteInterleaved( cl_command_queue queue, clFFT_Plan plan, cl_int batchSize, clFFT_Direction dir, 
								 cl_mem data_in, cl_mem data_out,
								 cl_int num_events, cl_event *event_list, cl_event *event );
								 
Function for interleaved fft execution.
queue      command queue for the device on which fft needs to be executed. It should be present in the context for this plan was created
plan       fft plan that was created using clFFT_CreatePlan
batchSize  size of the batch for batched transform
dir        much be either clFFT_Forward or clFFT_Inverse for forward or inverse transform
data_in    input data
data_out   output data. For in-place transform, pass same mem object for both data_in and data_out
num_events, event_list and event are for future use for letting fft fit in other CL based application pipeline through event dependency.
Not implemented in this version yet so these parameters are redundant right now. Just pass NULL.

=========================
cl_int clFFT_ExecutePlannar( cl_command_queue queue, clFFT_Plan plan, cl_int batchSize, clFFT_Direction dir, 
							 cl_mem data_in_real, cl_mem data_in_imag, cl_mem data_out_real, cl_mem data_out_imag,
							 cl_int num_events, cl_event *event_list, cl_event *event );
							 
Same as above but for plannar data type.							 
=========================
cl_int clFFT_1DTwistInterleaved( clFFT_Plan plan, cl_mem mem, size_t numRows, size_t numCols, size_t startRow, clFFT_Direction dir );

Function for applying twist (twiddle factor multiplication) for virtualizing computation of very large ffts that cannot fit into global
memory at once but can be decomposed into many global memory fitting ffts followed by twiddle multiplication (twist) followed by transpose
followed by again many global memory fitting ffts.

=========================
cl_int clFFT_1DTwistPlanner( clFFT_Plan plan, cl_mem mem_real, cl_mem mem_imag, size_t numRows, size_t numCols, size_t startRow, clFFT_Direction dir );

Same fucntion as above but for plannar data
=========================	

void clFFT_DumpPlan( clFFT_Plan plan, FILE *file);	

Function to dump the plan. Passing stdout to file prints out the plan to standard out. It prints out
the kernel string and local, global dimension with which each kernel is executed in this plan.
						
==================================================================================
IMPORTANT NOTE ON PERFORMANCE:

Currently there are a few known performance issues (bug) that this sample has discovered
in rumtime and code generation that are being actively fixed. Hence, for sizes >= 1024, 
performance is much below the expected peak for any particular size. However, we have 
internally verified that once these bugs are fixed, performance should be on par with 
expected peak. Note that these are bugs in OpenCL runtime/compiler and not in this
sample.

===========================================================================
BUILD REQUIREMENTS:

Mac OS X v10.6 or later

If you are running in Xcode, be sure to pass file name "param.txt". You can do that
by double clicking OpenCL_FFT under executable and then click on Argument tab and 
add ./../../param.txt under "Arguments to be passed on launch" section. 

===========================================================================
RUNTIME REQUIREMENTS:

. Mac OS X v10.6 or later with OpenCL 1.0
. For good performance, device should support local memory. 
  FFT performance critically depend on how efficiently local shuffles 
  (matrix transposes) using local memory to reduce external DRAM bandwidth
  requirement.

===========================================================================
PACKAGING LIST:

AccelerateError.pdf
clFFT.h
Error.pdf
fft_base_kernels.h
fft_execute.cpp
fft_internal.h
fft_kernelstring.cpp
fft_setup.cpp
main.cpp
Makefile
OpenCL_FFT.xcodeproj
OpenCLError.pdf
param.txt
procs.h
ReadMe.txt

===========================================================================
CHANGES FROM PREVIOUS VERSIONS:

Version 1.0
- First version.

===========================================================================
Copyright (C) 2008 Apple Inc. All rights reserved.
\ No newline at end of file
+### OpenCL FFT (Fast Fourier Transform) ###

===========================================================================
DESCRIPTION:

This example shows how OpenCL can be used to compute FFT. Algorithm implemented
is described in the following references

1) Fitting FFT onto the G80 Architecture
   by Vasily Volkov and Brian Kazian
   University of California, Berkeley, May 19, 2008
   http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project6_report.pdf
   
2) High Performance Discrete Fourier Tansforms on Graphics Processors
   by Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John Manferdelli
   Supercomputing 2008.
   http://portal.acm.org/citation.cfm?id=1413373
   
Current version only supports power of two transform sizes however it should be straight forward
to extend the sample to non-power of two but power of base primes like 3, 5, 7. 

Current version supports 1D, 2D, 3D batched transforms. 

Current version supports both in-place and out-of-place transforms.

Current version supports both forward and inverse transform.

Current version supports both plannar and interleaved data format.

Current version only supports complex-to-complex transform. For real 
transform, one can use plannar data format with imaginary array mem set to zero. 

oCurrent version only supports transform on GPU device. Accelerate framework can be used on CPU.

Current version supports sizes that fits in device global memory although "Twist Kernel" is 
included in fft plan if user wants to virtualize (implement sizes larger than what can fit 
in GPU global memory).

Users can dump all the kernels and global, local dimensions with which these kernels are run 
so that they can not only inspect/modify these kernels and understand how FFT is being 
computed on GPU, but also create their own stand along app for executing FFT of size of
their interest.

For any given signal size n, sample crates a clFFT_Plan, that encapsulates the kernel string, 
associated compiled cl_program. Note that kernel string is generated at runtime based on 
input size, dimension (1D, 2D, 3D) and data format (plannar or interleaved) along with some 
device depended parameters encapsulated in the clFFT_Plan. These device dependent parameters 
are set such that kernel is generated for high performance meeting following requirements

   1) Access pattern to global memory (of-chip DRAM) is such that memory transaction 
      coalesceing is achieved if device supports it thus achieving full bandwidth
   2) Local shuffles (matrix transposes or data sharing among work items of a workgroup)
      are band conflict free if local memory is banked.
   3) Kernel is fully optimized for memory hierarcy meaning that it uses GPU's large 
      vector register file, which is fastest, first before reverting to local memory 
      for data sharing among work items to save global DRAM bandwidth and only then 
      reverts to global memory if signal size is such that transform cannnot be computed
      by singal workgroup and thus require global communation among work groups.
      
Users can modify these parameters to get best performance on their particular GPU.     

Users how really want to understand the details of implementation are highly encouraged 
to read above two references but here is a high level description.
At a higher the algorithm decomposes signal of length N into factors as 

                   N = N1 x N2 x N3 x N4 x .... Nn
                   
where the factors (N1, ....., Nn) are sorted such that N1 is largest. It thus decomposes 
N into n-dimensional matrix. It than applies fft along each dimension, multiply by twiddle
factors and transposes the matrix as follow 

                      N2 x N3 x N4 x ............ x Nn x N1   (fft along N1 and transpose)
                      N3 x N4 x N5 x ....    x Nn x N2 x N1   (fft along N2 and transpose)
                      N4 x N5 x N6 x .. x Nn x N3 x N2 x N1   (fft along N3 and transpose)
                      
                      ......
                     Nn x Nn-1 x Nn-2 x ........ N3 x N2 x N1 (fft along Nn and transpose)
                     
 Decomposition is such that algorithm is fully optimized for memory hierarchy. N1 (largest
 base radix) is constrained by maximum register usage by work item (largest size of in-register 
 fft) and product N2 x N3 .... x Nn determine the maximum size of work group which is constrained
 by local memory used by work group (local memory is used to share data among work items i.e.
 local transposes). Togather these two parameters determine the maximum size fft that can be 
 computed by just using register file and local memory without reverting to global memory 
 for transpose (i.e. these sizes do not require global transpose and thus no inter work group 
 communication). However, for larger sizes, global communication among workgroup is required
 and multiple kernel launches are needed depending on the size and the base radix used. 
 
 For details of parameters user can play with, please see the comments in fft_internal.h
 and kernel_string.cpp, which has the main kernel generator functions ... especially
 see the comments preceeding function getRadixArray and getGlobalRadixInfo.
 User can adjust these parameters you achieve best performance on his device. 

Description of API Calls
=========================
clFFT_Plan clFFT_CreatePlan( cl_context context, clFFT_Dim3 n, clFFT_Dimension dim, clFFT_DataFormat dataFormat, cl_int *error_code );

This function creates a plan and returns a handle to it for use with other functions below. 
context    context in which things are happening
n          n.x, n.y, n.z contain the dimension of signal (length along each dimension)
dim        much be one of clFFT_1D, clFFT_2D, clFFT_3D for one, two or three dimensional fft
dataFormat much be either clFFT_InterleavedComplexFormat or clFFT_SplitComplexFormat for either interleaved or plannar data (real and imaginary)
error_code pointer for getting error back in plan creation. In case of error NULL plan is returned
==========================
void clFFT_DestroyPlan( clFFT_Plan plan );

Function to release/free resources
==========================
cl_int clFFT_ExecuteInterleaved( cl_command_queue queue, clFFT_Plan plan, cl_int batchSize, clFFT_Direction dir, 
								 cl_mem data_in, cl_mem data_out,
								 cl_int num_events, cl_event *event_list, cl_event *event );
								 
Function for interleaved fft execution.
queue      command queue for the device on which fft needs to be executed. It should be present in the context for this plan was created
plan       fft plan that was created using clFFT_CreatePlan
batchSize  size of the batch for batched transform
dir        much be either clFFT_Forward or clFFT_Inverse for forward or inverse transform
data_in    input data
data_out   output data. For in-place transform, pass same mem object for both data_in and data_out
num_events, event_list and event are for future use for letting fft fit in other CL based application pipeline through event dependency.
Not implemented in this version yet so these parameters are redundant right now. Just pass NULL.

=========================
cl_int clFFT_ExecutePlannar( cl_command_queue queue, clFFT_Plan plan, cl_int batchSize, clFFT_Direction dir, 
							 cl_mem data_in_real, cl_mem data_in_imag, cl_mem data_out_real, cl_mem data_out_imag,
							 cl_int num_events, cl_event *event_list, cl_event *event );
							 
Same as above but for plannar data type.							 
=========================
cl_int clFFT_1DTwistInterleaved( clFFT_Plan plan, cl_mem mem, size_t numRows, size_t numCols, size_t startRow, clFFT_Direction dir );

Function for applying twist (twiddle factor multiplication) for virtualizing computation of very large ffts that cannot fit into global
memory at once but can be decomposed into many global memory fitting ffts followed by twiddle multiplication (twist) followed by transpose
followed by again many global memory fitting ffts.

=========================
cl_int clFFT_1DTwistPlanner( clFFT_Plan plan, cl_mem mem_real, cl_mem mem_imag, size_t numRows, size_t numCols, size_t startRow, clFFT_Direction dir );

Same fucntion as above but for plannar data
=========================	

void clFFT_DumpPlan( clFFT_Plan plan, FILE *file);	

Function to dump the plan. Passing stdout to file prints out the plan to standard out. It prints out
the kernel string and local, global dimension with which each kernel is executed in this plan.
						
==================================================================================
IMPORTANT NOTE ON PERFORMANCE:

Currently there are a few known performance issues (bug) that this sample has discovered
in rumtime and code generation that are being actively fixed. Hence, for sizes >= 1024, 
performance is much below the expected peak for any particular size. However, we have 
internally verified that once these bugs are fixed, performance should be on par with 
expected peak. Note that these are bugs in OpenCL runtime/compiler and not in this
sample.

===========================================================================
BUILD REQUIREMENTS:

Mac OS X v10.6 or later

If you are running in Xcode, be sure to pass file name "param.txt". You can do that
by double clicking OpenCL_FFT under executable and then click on Argument tab and 
add ./../../param.txt under "Arguments to be passed on launch" section. 

===========================================================================
RUNTIME REQUIREMENTS:

. Mac OS X v10.6 or later with OpenCL 1.0
. For good performance, device should support local memory. 
  FFT performance critically depend on how efficiently local shuffles 
  (matrix transposes) using local memory to reduce external DRAM bandwidth
  requirement.

===========================================================================
PACKAGING LIST:

AccelerateError.pdf
clFFT.h
Error.pdf
fft_base_kernels.h
fft_execute.cpp
fft_internal.h
fft_kernelstring.cpp
fft_setup.cpp
main.cpp
Makefile
OpenCL_FFT.xcodeproj
OpenCLError.pdf
param.txt
procs.h
ReadMe.txt

===========================================================================
CHANGES FROM PREVIOUS VERSIONS:

Version 1.0
- First version.

===========================================================================
Copyright (C) 2008 Apple Inc. All rights reserved.
\ No newline at end of file
--- a/fft_Example/clFFT.h	Tue Feb 05 15:12:19 2013 +0900
+++ b/fft_Example/clFFT.h	Tue Feb 05 15:19:02 2013 +0900
@@ -106,19 +106,19 @@
 void clFFT_DestroyPlan( clFFT_Plan plan );
 
 cl_int clFFT_ExecuteInterleaved( cl_command_queue queue, clFFT_Plan plan, cl_int batchSize, clFFT_Direction dir, 
-								 cl_mem data_in, cl_mem data_out,
-								 cl_int num_events, cl_event *event_list, cl_event *event );
+                 cl_mem data_in, cl_mem data_out,
+                 cl_int num_events, cl_event *event_list, cl_event *event );
 
 cl_int clFFT_ExecutePlannar( cl_command_queue queue, clFFT_Plan plan, cl_int batchSize, clFFT_Direction dir, 
-							 cl_mem data_in_real, cl_mem data_in_imag, cl_mem data_out_real, cl_mem data_out_imag,
-							 cl_int num_events, cl_event *event_list, cl_event *event );
+            	 cl_mem data_in_real, cl_mem data_in_imag, cl_mem data_out_real, cl_mem data_out_imag,
+            	 cl_int num_events, cl_event *event_list, cl_event *event );
 
 cl_int clFFT_1DTwistInterleaved(clFFT_Plan Plan, cl_command_queue queue, cl_mem array, 
-						        size_t numRows, size_t numCols, size_t startRow, size_t rowsToProcess, clFFT_Direction dir);
+                    size_t numRows, size_t numCols, size_t startRow, size_t rowsToProcess, clFFT_Direction dir);
 	
 
 cl_int clFFT_1DTwistPlannar(clFFT_Plan Plan, cl_command_queue queue, cl_mem array_real, cl_mem array_imag, 
-					        size_t numRows, size_t numCols, size_t startRow, size_t rowsToProcess, clFFT_Direction dir);
+        	        size_t numRows, size_t numCols, size_t startRow, size_t rowsToProcess, clFFT_Direction dir);
 	
 void clFFT_DumpPlan( clFFT_Plan plan, FILE *file);	
 
--- a/fft_Example/fft_base_kernels.h	Tue Feb 05 15:12:19 2013 +0900
+++ b/fft_Example/fft_base_kernels.h	Tue Feb 05 15:19:02 2013 +0900
@@ -55,222 +55,222 @@
 
 static string baseKernels = string(
                           "#ifndef M_PI\n"
-						  "#define M_PI 0x1.921fb54442d18p+1\n"
-						  "#endif\n"
-						  "#define complexMul(a,b) ((float2)(mad(-(a).y, (b).y, (a).x * (b).x), mad((a).y, (b).x, (a).x * (b).y)))\n"
-						  "#define conj(a) ((float2)((a).x, -(a).y))\n"
-						  "#define conjTransp(a) ((float2)(-(a).y, (a).x))\n"		   
-						  "\n"
-						  "#define fftKernel2(a,dir) \\\n"
-						  "{ \\\n"
-						  "    float2 c = (a)[0];    \\\n"
-						  "    (a)[0] = c + (a)[1];  \\\n"
-						  "    (a)[1] = c - (a)[1];  \\\n"
-						  "}\n"
-						  "\n"						  
-						  "#define fftKernel2S(d1,d2,dir) \\\n"
-						  "{ \\\n"
-						  "    float2 c = (d1);   \\\n"
-						  "    (d1) = c + (d2);   \\\n"
-						  "    (d2) = c - (d2);   \\\n"
-						  "}\n"
-						  "\n"						  
-						  "#define fftKernel4(a,dir) \\\n"
-						  "{ \\\n"
-						  "    fftKernel2S((a)[0], (a)[2], dir); \\\n"
-						  "    fftKernel2S((a)[1], (a)[3], dir); \\\n"
-						  "    fftKernel2S((a)[0], (a)[1], dir); \\\n"
-						  "    (a)[3] = (float2)(dir)*(conjTransp((a)[3])); \\\n"
-						  "    fftKernel2S((a)[2], (a)[3], dir); \\\n"
-						  "    float2 c = (a)[1]; \\\n"
-						  "    (a)[1] = (a)[2]; \\\n"
-						  "    (a)[2] = c; \\\n"
-						  "}\n"
-						  "\n"						  
-						  "#define fftKernel4s(a0,a1,a2,a3,dir) \\\n"
-						  "{ \\\n"
-						  "    fftKernel2S((a0), (a2), dir); \\\n"
-						  "    fftKernel2S((a1), (a3), dir); \\\n"
-						  "    fftKernel2S((a0), (a1), dir); \\\n"
-						  "    (a3) = (float2)(dir)*(conjTransp((a3))); \\\n"
-						  "    fftKernel2S((a2), (a3), dir); \\\n"
-						  "    float2 c = (a1); \\\n"
-						  "    (a1) = (a2); \\\n"
-						  "    (a2) = c; \\\n" 
-						  "}\n"
-						  "\n"						  
-						  "#define bitreverse8(a) \\\n"
-						  "{ \\\n"
-						  "    float2 c; \\\n"
-						  "    c = (a)[1]; \\\n"
-						  "    (a)[1] = (a)[4]; \\\n"
-						  "    (a)[4] = c; \\\n"
-						  "    c = (a)[3]; \\\n"
-						  "    (a)[3] = (a)[6]; \\\n"
-						  "    (a)[6] = c; \\\n"
-						  "}\n"
-						  "\n"						  
-						  "#define fftKernel8(a,dir) \\\n"
-						  "{ \\\n"
-						  "	const float2 w1  = (float2)(0x1.6a09e6p-1f,  dir*0x1.6a09e6p-1f);  \\\n"
-						  "	const float2 w3  = (float2)(-0x1.6a09e6p-1f, dir*0x1.6a09e6p-1f);  \\\n"
-						  "	float2 c; \\\n"
-						  "	fftKernel2S((a)[0], (a)[4], dir); \\\n"
-						  "	fftKernel2S((a)[1], (a)[5], dir); \\\n"
-						  "	fftKernel2S((a)[2], (a)[6], dir); \\\n"
-						  "	fftKernel2S((a)[3], (a)[7], dir); \\\n"
-						  "	(a)[5] = complexMul(w1, (a)[5]); \\\n"
-						  "	(a)[6] = (float2)(dir)*(conjTransp((a)[6])); \\\n"
-						  "	(a)[7] = complexMul(w3, (a)[7]); \\\n"
-						  "	fftKernel2S((a)[0], (a)[2], dir); \\\n"
-						  "	fftKernel2S((a)[1], (a)[3], dir); \\\n"
-						  "	fftKernel2S((a)[4], (a)[6], dir); \\\n"
-						  "	fftKernel2S((a)[5], (a)[7], dir); \\\n"
-						  "	(a)[3] = (float2)(dir)*(conjTransp((a)[3])); \\\n"
-						  "	(a)[7] = (float2)(dir)*(conjTransp((a)[7])); \\\n"
-						  "	fftKernel2S((a)[0], (a)[1], dir); \\\n"
-						  "	fftKernel2S((a)[2], (a)[3], dir); \\\n"
-						  "	fftKernel2S((a)[4], (a)[5], dir); \\\n"
-						  "	fftKernel2S((a)[6], (a)[7], dir); \\\n"
-						  "	bitreverse8((a)); \\\n"
-						  "}\n"
-						  "\n"						  
-						  "#define bitreverse4x4(a) \\\n"
-						  "{ \\\n"
-						  "	float2 c; \\\n"
-						  "	c = (a)[1];  (a)[1]  = (a)[4];  (a)[4]  = c; \\\n"
-						  "	c = (a)[2];  (a)[2]  = (a)[8];  (a)[8]  = c; \\\n"
-						  "	c = (a)[3];  (a)[3]  = (a)[12]; (a)[12] = c; \\\n"
-						  "	c = (a)[6];  (a)[6]  = (a)[9];  (a)[9]  = c; \\\n"
-						  "	c = (a)[7];  (a)[7]  = (a)[13]; (a)[13] = c; \\\n"
-						  "	c = (a)[11]; (a)[11] = (a)[14]; (a)[14] = c; \\\n"
-						  "}\n"
-						  "\n"						  
-						  "#define fftKernel16(a,dir) \\\n"
-						  "{ \\\n"
-						  "    const float w0 = 0x1.d906bcp-1f; \\\n"
-						  "    const float w1 = 0x1.87de2ap-2f; \\\n"
-						  "    const float w2 = 0x1.6a09e6p-1f; \\\n"
-						  "    fftKernel4s((a)[0], (a)[4], (a)[8],  (a)[12], dir); \\\n"
-						  "    fftKernel4s((a)[1], (a)[5], (a)[9],  (a)[13], dir); \\\n"
-						  "    fftKernel4s((a)[2], (a)[6], (a)[10], (a)[14], dir); \\\n"
-						  "    fftKernel4s((a)[3], (a)[7], (a)[11], (a)[15], dir); \\\n"
-						  "    (a)[5]  = complexMul((a)[5], (float2)(w0, dir*w1)); \\\n"
-						  "    (a)[6]  = complexMul((a)[6], (float2)(w2, dir*w2)); \\\n"
-						  "    (a)[7]  = complexMul((a)[7], (float2)(w1, dir*w0)); \\\n"
-						  "    (a)[9]  = complexMul((a)[9], (float2)(w2, dir*w2)); \\\n"
-						  "    (a)[10] = (float2)(dir)*(conjTransp((a)[10])); \\\n"
-						  "    (a)[11] = complexMul((a)[11], (float2)(-w2, dir*w2)); \\\n"
-						  "    (a)[13] = complexMul((a)[13], (float2)(w1, dir*w0)); \\\n"
-						  "    (a)[14] = complexMul((a)[14], (float2)(-w2, dir*w2)); \\\n"
-						  "    (a)[15] = complexMul((a)[15], (float2)(-w0, dir*-w1)); \\\n"
-						  "    fftKernel4((a), dir); \\\n"
-						  "    fftKernel4((a) + 4, dir); \\\n"
-						  "    fftKernel4((a) + 8, dir); \\\n"
-						  "    fftKernel4((a) + 12, dir); \\\n"
-						  "    bitreverse4x4((a)); \\\n"
-						  "}\n"
-						  "\n"						  
-						  "#define bitreverse32(a) \\\n"
-						  "{ \\\n"
-						  "    float2 c1, c2; \\\n"
-						  "    c1 = (a)[2];   (a)[2] = (a)[1];   c2 = (a)[4];   (a)[4] = c1;   c1 = (a)[8];   (a)[8] = c2;    c2 = (a)[16];  (a)[16] = c1;   (a)[1] = c2; \\\n"
-						  "    c1 = (a)[6];   (a)[6] = (a)[3];   c2 = (a)[12];  (a)[12] = c1;  c1 = (a)[24];  (a)[24] = c2;   c2 = (a)[17];  (a)[17] = c1;   (a)[3] = c2; \\\n"
-						  "    c1 = (a)[10];  (a)[10] = (a)[5];  c2 = (a)[20];  (a)[20] = c1;  c1 = (a)[9];   (a)[9] = c2;    c2 = (a)[18];  (a)[18] = c1;   (a)[5] = c2; \\\n"
-						  "    c1 = (a)[14];  (a)[14] = (a)[7];  c2 = (a)[28];  (a)[28] = c1;  c1 = (a)[25];  (a)[25] = c2;   c2 = (a)[19];  (a)[19] = c1;   (a)[7] = c2; \\\n"
-						  "    c1 = (a)[22];  (a)[22] = (a)[11]; c2 = (a)[13];  (a)[13] = c1;  c1 = (a)[26];  (a)[26] = c2;   c2 = (a)[21];  (a)[21] = c1;   (a)[11] = c2; \\\n"
-						  "    c1 = (a)[30];  (a)[30] = (a)[15]; c2 = (a)[29];  (a)[29] = c1;  c1 = (a)[27];  (a)[27] = c2;   c2 = (a)[23];  (a)[23] = c1;   (a)[15] = c2; \\\n"
-						  "}\n"
-						  "\n"						  
-						  "#define fftKernel32(a,dir) \\\n"
-						  "{ \\\n"
-						  "    fftKernel2S((a)[0],  (a)[16], dir); \\\n"
-						  "    fftKernel2S((a)[1],  (a)[17], dir); \\\n"
-						  "    fftKernel2S((a)[2],  (a)[18], dir); \\\n"
-						  "    fftKernel2S((a)[3],  (a)[19], dir); \\\n"
-						  "    fftKernel2S((a)[4],  (a)[20], dir); \\\n"
-						  "    fftKernel2S((a)[5],  (a)[21], dir); \\\n"
-						  "    fftKernel2S((a)[6],  (a)[22], dir); \\\n"
-						  "    fftKernel2S((a)[7],  (a)[23], dir); \\\n"
-						  "    fftKernel2S((a)[8],  (a)[24], dir); \\\n"
-						  "    fftKernel2S((a)[9],  (a)[25], dir); \\\n"
-						  "    fftKernel2S((a)[10], (a)[26], dir); \\\n"
-						  "    fftKernel2S((a)[11], (a)[27], dir); \\\n"
-						  "    fftKernel2S((a)[12], (a)[28], dir); \\\n"
-						  "    fftKernel2S((a)[13], (a)[29], dir); \\\n"
-						  "    fftKernel2S((a)[14], (a)[30], dir); \\\n"
-						  "    fftKernel2S((a)[15], (a)[31], dir); \\\n"
-						  "    (a)[17] = complexMul((a)[17], (float2)(0x1.f6297cp-1f, dir*0x1.8f8b84p-3f)); \\\n"
-						  "    (a)[18] = complexMul((a)[18], (float2)(0x1.d906bcp-1f, dir*0x1.87de2ap-2f)); \\\n"
-						  "    (a)[19] = complexMul((a)[19], (float2)(0x1.a9b662p-1f, dir*0x1.1c73b4p-1f)); \\\n"
-						  "    (a)[20] = complexMul((a)[20], (float2)(0x1.6a09e6p-1f, dir*0x1.6a09e6p-1f)); \\\n"
-						  "    (a)[21] = complexMul((a)[21], (float2)(0x1.1c73b4p-1f, dir*0x1.a9b662p-1f)); \\\n"
-						  "    (a)[22] = complexMul((a)[22], (float2)(0x1.87de2ap-2f, dir*0x1.d906bcp-1f)); \\\n"
-						  "    (a)[23] = complexMul((a)[23], (float2)(0x1.8f8b84p-3f, dir*0x1.f6297cp-1f)); \\\n"
-						  "    (a)[24] = complexMul((a)[24], (float2)(0x0p+0f, dir*0x1p+0f)); \\\n"
-						  "    (a)[25] = complexMul((a)[25], (float2)(-0x1.8f8b84p-3f, dir*0x1.f6297cp-1f)); \\\n"
-						  "    (a)[26] = complexMul((a)[26], (float2)(-0x1.87de2ap-2f, dir*0x1.d906bcp-1f)); \\\n"
-						  "    (a)[27] = complexMul((a)[27], (float2)(-0x1.1c73b4p-1f, dir*0x1.a9b662p-1f)); \\\n"
-						  "    (a)[28] = complexMul((a)[28], (float2)(-0x1.6a09e6p-1f, dir*0x1.6a09e6p-1f)); \\\n"
-						  "    (a)[29] = complexMul((a)[29], (float2)(-0x1.a9b662p-1f, dir*0x1.1c73b4p-1f)); \\\n"
-						  "    (a)[30] = complexMul((a)[30], (float2)(-0x1.d906bcp-1f, dir*0x1.87de2ap-2f)); \\\n"
-						  "    (a)[31] = complexMul((a)[31], (float2)(-0x1.f6297cp-1f, dir*0x1.8f8b84p-3f)); \\\n"
-						  "    fftKernel16((a), dir); \\\n"
-						  "    fftKernel16((a) + 16, dir); \\\n"
-						  "    bitreverse32((a)); \\\n"
-						  "}\n\n"
-						  );
+                          "#define M_PI 0x1.921fb54442d18p+1\n"
+                          "#endif\n"
+                          "#define complexMul(a,b) ((float2)(mad(-(a).y, (b).y, (a).x * (b).x), mad((a).y, (b).x, (a).x * (b).y)))\n"
+                          "#define conj(a) ((float2)((a).x, -(a).y))\n"
+                          "#define conjTransp(a) ((float2)(-(a).y, (a).x))\n"           
+                          "\n"
+                          "#define fftKernel2(a,dir) \\\n"
+                          "{ \\\n"
+                          "    float2 c = (a)[0];    \\\n"
+                          "    (a)[0] = c + (a)[1];  \\\n"
+                          "    (a)[1] = c - (a)[1];  \\\n"
+                          "}\n"
+                          "\n"                          
+                          "#define fftKernel2S(d1,d2,dir) \\\n"
+                          "{ \\\n"
+                          "    float2 c = (d1);   \\\n"
+                          "    (d1) = c + (d2);   \\\n"
+                          "    (d2) = c - (d2);   \\\n"
+                          "}\n"
+                          "\n"                          
+                          "#define fftKernel4(a,dir) \\\n"
+                          "{ \\\n"
+                          "    fftKernel2S((a)[0], (a)[2], dir); \\\n"
+                          "    fftKernel2S((a)[1], (a)[3], dir); \\\n"
+                          "    fftKernel2S((a)[0], (a)[1], dir); \\\n"
+                          "    (a)[3] = (float2)(dir)*(conjTransp((a)[3])); \\\n"
+                          "    fftKernel2S((a)[2], (a)[3], dir); \\\n"
+                          "    float2 c = (a)[1]; \\\n"
+                          "    (a)[1] = (a)[2]; \\\n"
+                          "    (a)[2] = c; \\\n"
+                          "}\n"
+                          "\n"                          
+                          "#define fftKernel4s(a0,a1,a2,a3,dir) \\\n"
+                          "{ \\\n"
+                          "    fftKernel2S((a0), (a2), dir); \\\n"
+                          "    fftKernel2S((a1), (a3), dir); \\\n"
+                          "    fftKernel2S((a0), (a1), dir); \\\n"
+                          "    (a3) = (float2)(dir)*(conjTransp((a3))); \\\n"
+                          "    fftKernel2S((a2), (a3), dir); \\\n"
+                          "    float2 c = (a1); \\\n"
+                          "    (a1) = (a2); \\\n"
+                          "    (a2) = c; \\\n" 
+                          "}\n"
+                          "\n"                          
+                          "#define bitreverse8(a) \\\n"
+                          "{ \\\n"
+                          "    float2 c; \\\n"
+                          "    c = (a)[1]; \\\n"
+                          "    (a)[1] = (a)[4]; \\\n"
+                          "    (a)[4] = c; \\\n"
+                          "    c = (a)[3]; \\\n"
+                          "    (a)[3] = (a)[6]; \\\n"
+                          "    (a)[6] = c; \\\n"
+                          "}\n"
+                          "\n"                          
+                          "#define fftKernel8(a,dir) \\\n"
+                          "{ \\\n"
+                          "    const float2 w1  = (float2)(0x1.6a09e6p-1f,  dir*0x1.6a09e6p-1f);  \\\n"
+                          "    const float2 w3  = (float2)(-0x1.6a09e6p-1f, dir*0x1.6a09e6p-1f);  \\\n"
+                          "    float2 c; \\\n"
+                          "    fftKernel2S((a)[0], (a)[4], dir); \\\n"
+                          "    fftKernel2S((a)[1], (a)[5], dir); \\\n"
+                          "    fftKernel2S((a)[2], (a)[6], dir); \\\n"
+                          "    fftKernel2S((a)[3], (a)[7], dir); \\\n"
+                          "    (a)[5] = complexMul(w1, (a)[5]); \\\n"
+                          "    (a)[6] = (float2)(dir)*(conjTransp((a)[6])); \\\n"
+                          "    (a)[7] = complexMul(w3, (a)[7]); \\\n"
+                          "    fftKernel2S((a)[0], (a)[2], dir); \\\n"
+                          "    fftKernel2S((a)[1], (a)[3], dir); \\\n"
+                          "    fftKernel2S((a)[4], (a)[6], dir); \\\n"
+                          "    fftKernel2S((a)[5], (a)[7], dir); \\\n"
+                          "    (a)[3] = (float2)(dir)*(conjTransp((a)[3])); \\\n"
+                          "    (a)[7] = (float2)(dir)*(conjTransp((a)[7])); \\\n"
+                          "    fftKernel2S((a)[0], (a)[1], dir); \\\n"
+                          "    fftKernel2S((a)[2], (a)[3], dir); \\\n"
+                          "    fftKernel2S((a)[4], (a)[5], dir); \\\n"
+                          "    fftKernel2S((a)[6], (a)[7], dir); \\\n"
+                          "    bitreverse8((a)); \\\n"
+                          "}\n"
+                          "\n"                          
+                          "#define bitreverse4x4(a) \\\n"
+                          "{ \\\n"
+                          "    float2 c; \\\n"
+                          "    c = (a)[1];  (a)[1]  = (a)[4];  (a)[4]  = c; \\\n"
+                          "    c = (a)[2];  (a)[2]  = (a)[8];  (a)[8]  = c; \\\n"
+                          "    c = (a)[3];  (a)[3]  = (a)[12]; (a)[12] = c; \\\n"
+                          "    c = (a)[6];  (a)[6]  = (a)[9];  (a)[9]  = c; \\\n"
+                          "    c = (a)[7];  (a)[7]  = (a)[13]; (a)[13] = c; \\\n"
+                          "    c = (a)[11]; (a)[11] = (a)[14]; (a)[14] = c; \\\n"
+                          "}\n"
+                          "\n"                          
+                          "#define fftKernel16(a,dir) \\\n"
+                          "{ \\\n"
+                          "    const float w0 = 0x1.d906bcp-1f; \\\n"
+                          "    const float w1 = 0x1.87de2ap-2f; \\\n"
+                          "    const float w2 = 0x1.6a09e6p-1f; \\\n"
+                          "    fftKernel4s((a)[0], (a)[4], (a)[8],  (a)[12], dir); \\\n"
+                          "    fftKernel4s((a)[1], (a)[5], (a)[9],  (a)[13], dir); \\\n"
+                          "    fftKernel4s((a)[2], (a)[6], (a)[10], (a)[14], dir); \\\n"
+                          "    fftKernel4s((a)[3], (a)[7], (a)[11], (a)[15], dir); \\\n"
+                          "    (a)[5]  = complexMul((a)[5], (float2)(w0, dir*w1)); \\\n"
+                          "    (a)[6]  = complexMul((a)[6], (float2)(w2, dir*w2)); \\\n"
+                          "    (a)[7]  = complexMul((a)[7], (float2)(w1, dir*w0)); \\\n"
+                          "    (a)[9]  = complexMul((a)[9], (float2)(w2, dir*w2)); \\\n"
+                          "    (a)[10] = (float2)(dir)*(conjTransp((a)[10])); \\\n"
+                          "    (a)[11] = complexMul((a)[11], (float2)(-w2, dir*w2)); \\\n"
+                          "    (a)[13] = complexMul((a)[13], (float2)(w1, dir*w0)); \\\n"
+                          "    (a)[14] = complexMul((a)[14], (float2)(-w2, dir*w2)); \\\n"
+                          "    (a)[15] = complexMul((a)[15], (float2)(-w0, dir*-w1)); \\\n"
+                          "    fftKernel4((a), dir); \\\n"
+                          "    fftKernel4((a) + 4, dir); \\\n"
+                          "    fftKernel4((a) + 8, dir); \\\n"
+                          "    fftKernel4((a) + 12, dir); \\\n"
+                          "    bitreverse4x4((a)); \\\n"
+                          "}\n"
+                          "\n"                          
+                          "#define bitreverse32(a) \\\n"
+                          "{ \\\n"
+                          "    float2 c1, c2; \\\n"
+                          "    c1 = (a)[2];   (a)[2] = (a)[1];   c2 = (a)[4];   (a)[4] = c1;   c1 = (a)[8];   (a)[8] = c2;    c2 = (a)[16];  (a)[16] = c1;   (a)[1] = c2; \\\n"
+                          "    c1 = (a)[6];   (a)[6] = (a)[3];   c2 = (a)[12];  (a)[12] = c1;  c1 = (a)[24];  (a)[24] = c2;   c2 = (a)[17];  (a)[17] = c1;   (a)[3] = c2; \\\n"
+                          "    c1 = (a)[10];  (a)[10] = (a)[5];  c2 = (a)[20];  (a)[20] = c1;  c1 = (a)[9];   (a)[9] = c2;    c2 = (a)[18];  (a)[18] = c1;   (a)[5] = c2; \\\n"
+                          "    c1 = (a)[14];  (a)[14] = (a)[7];  c2 = (a)[28];  (a)[28] = c1;  c1 = (a)[25];  (a)[25] = c2;   c2 = (a)[19];  (a)[19] = c1;   (a)[7] = c2; \\\n"
+                          "    c1 = (a)[22];  (a)[22] = (a)[11]; c2 = (a)[13];  (a)[13] = c1;  c1 = (a)[26];  (a)[26] = c2;   c2 = (a)[21];  (a)[21] = c1;   (a)[11] = c2; \\\n"
+                          "    c1 = (a)[30];  (a)[30] = (a)[15]; c2 = (a)[29];  (a)[29] = c1;  c1 = (a)[27];  (a)[27] = c2;   c2 = (a)[23];  (a)[23] = c1;   (a)[15] = c2; \\\n"
+                          "}\n"
+                          "\n"                          
+                          "#define fftKernel32(a,dir) \\\n"
+                          "{ \\\n"
+                          "    fftKernel2S((a)[0],  (a)[16], dir); \\\n"
+                          "    fftKernel2S((a)[1],  (a)[17], dir); \\\n"
+                          "    fftKernel2S((a)[2],  (a)[18], dir); \\\n"
+                          "    fftKernel2S((a)[3],  (a)[19], dir); \\\n"
+                          "    fftKernel2S((a)[4],  (a)[20], dir); \\\n"
+                          "    fftKernel2S((a)[5],  (a)[21], dir); \\\n"
+                          "    fftKernel2S((a)[6],  (a)[22], dir); \\\n"
+                          "    fftKernel2S((a)[7],  (a)[23], dir); \\\n"
+                          "    fftKernel2S((a)[8],  (a)[24], dir); \\\n"
+                          "    fftKernel2S((a)[9],  (a)[25], dir); \\\n"
+                          "    fftKernel2S((a)[10], (a)[26], dir); \\\n"
+                          "    fftKernel2S((a)[11], (a)[27], dir); \\\n"
+                          "    fftKernel2S((a)[12], (a)[28], dir); \\\n"
+                          "    fftKernel2S((a)[13], (a)[29], dir); \\\n"
+                          "    fftKernel2S((a)[14], (a)[30], dir); \\\n"
+                          "    fftKernel2S((a)[15], (a)[31], dir); \\\n"
+                          "    (a)[17] = complexMul((a)[17], (float2)(0x1.f6297cp-1f, dir*0x1.8f8b84p-3f)); \\\n"
+                          "    (a)[18] = complexMul((a)[18], (float2)(0x1.d906bcp-1f, dir*0x1.87de2ap-2f)); \\\n"
+                          "    (a)[19] = complexMul((a)[19], (float2)(0x1.a9b662p-1f, dir*0x1.1c73b4p-1f)); \\\n"
+                          "    (a)[20] = complexMul((a)[20], (float2)(0x1.6a09e6p-1f, dir*0x1.6a09e6p-1f)); \\\n"
+                          "    (a)[21] = complexMul((a)[21], (float2)(0x1.1c73b4p-1f, dir*0x1.a9b662p-1f)); \\\n"
+                          "    (a)[22] = complexMul((a)[22], (float2)(0x1.87de2ap-2f, dir*0x1.d906bcp-1f)); \\\n"
+                          "    (a)[23] = complexMul((a)[23], (float2)(0x1.8f8b84p-3f, dir*0x1.f6297cp-1f)); \\\n"
+                          "    (a)[24] = complexMul((a)[24], (float2)(0x0p+0f, dir*0x1p+0f)); \\\n"
+                          "    (a)[25] = complexMul((a)[25], (float2)(-0x1.8f8b84p-3f, dir*0x1.f6297cp-1f)); \\\n"
+                          "    (a)[26] = complexMul((a)[26], (float2)(-0x1.87de2ap-2f, dir*0x1.d906bcp-1f)); \\\n"
+                          "    (a)[27] = complexMul((a)[27], (float2)(-0x1.1c73b4p-1f, dir*0x1.a9b662p-1f)); \\\n"
+                          "    (a)[28] = complexMul((a)[28], (float2)(-0x1.6a09e6p-1f, dir*0x1.6a09e6p-1f)); \\\n"
+                          "    (a)[29] = complexMul((a)[29], (float2)(-0x1.a9b662p-1f, dir*0x1.1c73b4p-1f)); \\\n"
+                          "    (a)[30] = complexMul((a)[30], (float2)(-0x1.d906bcp-1f, dir*0x1.87de2ap-2f)); \\\n"
+                          "    (a)[31] = complexMul((a)[31], (float2)(-0x1.f6297cp-1f, dir*0x1.8f8b84p-3f)); \\\n"
+                          "    fftKernel16((a), dir); \\\n"
+                          "    fftKernel16((a) + 16, dir); \\\n"
+                          "    bitreverse32((a)); \\\n"
+                          "}\n\n"
+                          );
 
 static string twistKernelInterleaved = string(
-											  "__kernel void \\\n"
-											  "clFFT_1DTwistInterleaved(__global float2 *in, unsigned int startRow, unsigned int numCols, unsigned int N, unsigned int numRowsToProcess, int dir) \\\n"
-											  "{ \\\n"
-											  "   float2 a, w; \\\n"
-											  "   float ang; \\\n"
-											  "   unsigned int j; \\\n"
-											  "	unsigned int i = get_global_id(0); \\\n"
-											  "	unsigned int startIndex = i; \\\n"
-											  "	 \\\n"
-											  "	if(i < numCols) \\\n"
-											  "	{ \\\n"
-											  "	    for(j = 0; j < numRowsToProcess; j++) \\\n"
-											  "	    { \\\n"
-											  "	        a = in[startIndex]; \\\n"
-											  "	        ang = 2.0f * M_PI * dir * i * (startRow + j) / N; \\\n"
-											  "	        w = (float2)(native_cos(ang), native_sin(ang)); \\\n"
-											  "	        a = complexMul(a, w); \\\n"
-											  "	        in[startIndex] = a; \\\n"
-											  "	        startIndex += numCols; \\\n"
-											  "	    } \\\n"
-											  "	}	 \\\n"
-											  "} \\\n"
-											  );
+                                              "__kernel void \\\n"
+                                              "clFFT_1DTwistInterleaved(__global float2 *in, unsigned int startRow, unsigned int numCols, unsigned int N, unsigned int numRowsToProcess, int dir) \\\n"
+                                              "{ \\\n"
+                                              "   float2 a, w; \\\n"
+                                              "   float ang; \\\n"
+                                              "   unsigned int j; \\\n"
+                                              "    unsigned int i = get_global_id(0); \\\n"
+                                              "    unsigned int startIndex = i; \\\n"
+                                              "     \\\n"
+                                              "    if(i < numCols) \\\n"
+                                              "    { \\\n"
+                                              "        for(j = 0; j < numRowsToProcess; j++) \\\n"
+                                              "        { \\\n"
+                                              "            a = in[startIndex]; \\\n"
+                                              "            ang = 2.0f * M_PI * dir * i * (startRow + j) / N; \\\n"
+                                              "            w = (float2)(native_cos(ang), native_sin(ang)); \\\n"
+                                              "            a = complexMul(a, w); \\\n"
+                                              "            in[startIndex] = a; \\\n"
+                                              "            startIndex += numCols; \\\n"
+                                              "        } \\\n"
+                                              "    }     \\\n"
+                                              "} \\\n"
+                                              );
 
 static string twistKernelPlannar = string(
-										  "__kernel void \\\n"
-										  "clFFT_1DTwistSplit(__global float *in_real, __global float *in_imag , unsigned int startRow, unsigned int numCols, unsigned int N, unsigned int numRowsToProcess, int dir) \\\n"
-										  "{ \\\n"
-										  "    float2 a, w; \\\n"
-										  "    float ang; \\\n"
-										  "    unsigned int j; \\\n"
-										  "	unsigned int i = get_global_id(0); \\\n"
-										  "	unsigned int startIndex = i; \\\n"
-										  "	 \\\n"
-										  "	if(i < numCols) \\\n"
-										  "	{ \\\n"
-										  "	    for(j = 0; j < numRowsToProcess; j++) \\\n"
-										  "	    { \\\n"
-										  "	        a = (float2)(in_real[startIndex], in_imag[startIndex]); \\\n"
-										  "	        ang = 2.0f * M_PI * dir * i * (startRow + j) / N; \\\n"
-										  "	        w = (float2)(native_cos(ang), native_sin(ang)); \\\n"
-										  "	        a = complexMul(a, w); \\\n"
-										  "	        in_real[startIndex] = a.x; \\\n"
-										  "	        in_imag[startIndex] = a.y; \\\n"
-										  "	        startIndex += numCols; \\\n"
-										  "	    } \\\n"
-										  "	}	 \\\n"
-										  "} \\\n"
-										  );										  
+                                          "__kernel void \\\n"
+                                          "clFFT_1DTwistSplit(__global float *in_real, __global float *in_imag , unsigned int startRow, unsigned int numCols, unsigned int N, unsigned int numRowsToProcess, int dir) \\\n"
+                                          "{ \\\n"
+                                          "    float2 a, w; \\\n"
+                                          "    float ang; \\\n"
+                                          "    unsigned int j; \\\n"
+                                          "    unsigned int i = get_global_id(0); \\\n"
+                                          "    unsigned int startIndex = i; \\\n"
+                                          "     \\\n"
+                                          "    if(i < numCols) \\\n"
+                                          "    { \\\n"
+                                          "        for(j = 0; j < numRowsToProcess; j++) \\\n"
+                                          "        { \\\n"
+                                          "            a = (float2)(in_real[startIndex], in_imag[startIndex]); \\\n"
+                                          "            ang = 2.0f * M_PI * dir * i * (startRow + j) / N; \\\n"
+                                          "            w = (float2)(native_cos(ang), native_sin(ang)); \\\n"
+                                          "            a = complexMul(a, w); \\\n"
+                                          "            in_real[startIndex] = a.x; \\\n"
+                                          "            in_imag[startIndex] = a.y; \\\n"
+                                          "            startIndex += numCols; \\\n"
+                                          "        } \\\n"
+                                          "    }     \\\n"
+                                          "} \\\n"
+                                          );                                          
 
 
 
--- a/fft_Example/fft_internal.h	Tue Feb 05 15:12:19 2013 +0900
+++ b/fft_Example/fft_internal.h	Tue Feb 05 15:19:02 2013 +0900
@@ -80,82 +80,82 @@
 {
 	// context in which fft resources are created and kernels are executed
 	cl_context              context;
-	
-	// size of signal
-	clFFT_Dim3              n;
-	
-	// dimension of transform ... must be either 1D, 2D or 3D
-	clFFT_Dimension			dim;
-	
-	// data format ... must be either interleaved or plannar
-	clFFT_DataFormat		format;
-	
-	// string containing kernel source. Generated at runtime based on
-	// n, dim, format and other parameters
-	string                  *kernel_string;
-	
-	// CL program containing source and kernel this particular 
-	// n, dim, data format
-	cl_program				program;
-	
-	// linked list of kernels which needs to be executed for this fft
-	cl_fft_kernel_info		*kernel_info;
-	
-	// number of kernels
-	int                     num_kernels;
-	
-	// twist kernel for virtualizing fft of very large sizes that do not
-	// fit in GPU global memory
-	cl_kernel				twist_kernel;
-	
-	// flag indicating if temporary intermediate buffer is needed or not.
-	// this depends on fft kernels being executed and if transform is 
-	// in-place or out-of-place. e.g. Local memory fft (say 1D 1024 ... 
-	// one that does not require global transpose do not need temporary buffer)
-	// 2D 1024x1024 out-of-place fft however do require intermediate buffer.
-	// If temp buffer is needed, its allocation is lazy i.e. its not allocated
-	// until its needed
-	cl_int                  temp_buffer_needed;
-	
-	// Batch size is runtime parameter and size of temporary buffer (if needed)
-	// depends on batch size. Allocation of temporary buffer is lazy i.e. its
-	// only created when needed. Once its created at first call of clFFT_Executexxx
-	// it is not allocated next time if next time clFFT_Executexxx is called with 
-	// batch size different than the first call. last_batch_size caches the last
-	// batch size with which this plan is used so that we dont keep allocating/deallocating
-	// temp buffer if same batch size is used again and again.
-	unsigned                  last_batch_size;
-	
-	// temporary buffer for interleaved plan
-	cl_mem   				tempmemobj;
-	
-	// temporary buffer for planner plan. Only one of tempmemobj or 
-	// (tempmemobj_real, tempmemobj_imag) pair is valid (allocated) depending 
-	// data format of plan (plannar or interleaved)
-	cl_mem                  tempmemobj_real, tempmemobj_imag;
-	
-	// Maximum size of signal for which local memory transposed based
-	// fft is sufficient i.e. no global mem transpose (communication)
-	// is needed
-	unsigned					max_localmem_fft_size;
-	
-	// Maximum work items per work group allowed. This, along with max_radix below controls 
-	// maximum local memory being used by fft kernels of this plan. Set to 256 by default
-	unsigned                  max_work_item_per_workgroup;
-	
-	// Maximum base radix for local memory fft ... this controls the maximum register 
-	// space used by work items. Currently defaults to 16
-	unsigned                  max_radix;
-	
-	// Device depended parameter that tells how many work-items need to be read consecutive
-	// values to make sure global memory access by work-items of a work-group result in 
-	// coalesced memory access to utilize full bandwidth e.g. on NVidia tesla, this is 16
-	unsigned                  min_mem_coalesce_width;
-	
-	// Number of local memory banks. This is used to geneate kernel with local memory 
-	// transposes with appropriate padding to avoid bank conflicts to local memory
-	// e.g. on NVidia it is 16.
-	unsigned                  num_local_mem_banks;
+    
+    // size of signal
+    clFFT_Dim3              n;
+    
+    // dimension of transform ... must be either 1D, 2D or 3D
+    clFFT_Dimension            dim;
+    
+    // data format ... must be either interleaved or plannar
+    clFFT_DataFormat        format;
+    
+    // string containing kernel source. Generated at runtime based on
+    // n, dim, format and other parameters
+    string                  *kernel_string;
+    
+    // CL program containing source and kernel this particular 
+    // n, dim, data format
+    cl_program                program;
+    
+    // linked list of kernels which needs to be executed for this fft
+    cl_fft_kernel_info        *kernel_info;
+    
+    // number of kernels
+    int                     num_kernels;
+    
+    // twist kernel for virtualizing fft of very large sizes that do not
+    // fit in GPU global memory
+    cl_kernel                twist_kernel;
+    
+    // flag indicating if temporary intermediate buffer is needed or not.
+    // this depends on fft kernels being executed and if transform is 
+    // in-place or out-of-place. e.g. Local memory fft (say 1D 1024 ... 
+    // one that does not require global transpose do not need temporary buffer)
+    // 2D 1024x1024 out-of-place fft however do require intermediate buffer.
+    // If temp buffer is needed, its allocation is lazy i.e. its not allocated
+    // until its needed
+    cl_int                  temp_buffer_needed;
+    
+    // Batch size is runtime parameter and size of temporary buffer (if needed)
+    // depends on batch size. Allocation of temporary buffer is lazy i.e. its
+    // only created when needed. Once its created at first call of clFFT_Executexxx
+    // it is not allocated next time if next time clFFT_Executexxx is called with 
+    // batch size different than the first call. last_batch_size caches the last
+    // batch size with which this plan is used so that we dont keep allocating/deallocating
+    // temp buffer if same batch size is used again and again.
+    unsigned                  last_batch_size;
+    
+    // temporary buffer for interleaved plan
+    cl_mem                   tempmemobj;
+    
+    // temporary buffer for planner plan. Only one of tempmemobj or 
+    // (tempmemobj_real, tempmemobj_imag) pair is valid (allocated) depending 
+    // data format of plan (plannar or interleaved)
+    cl_mem                  tempmemobj_real, tempmemobj_imag;
+    
+    // Maximum size of signal for which local memory transposed based
+    // fft is sufficient i.e. no global mem transpose (communication)
+    // is needed
+    unsigned                    max_localmem_fft_size;
+    
+    // Maximum work items per work group allowed. This, along with max_radix below controls 
+    // maximum local memory being used by fft kernels of this plan. Set to 256 by default
+    unsigned                  max_work_item_per_workgroup;
+    
+    // Maximum base radix for local memory fft ... this controls the maximum register 
+    // space used by work items. Currently defaults to 16
+    unsigned                  max_radix;
+    
+    // Device depended parameter that tells how many work-items need to be read consecutive
+    // values to make sure global memory access by work-items of a work-group result in 
+    // coalesced memory access to utilize full bandwidth e.g. on NVidia tesla, this is 16
+    unsigned                  min_mem_coalesce_width;
+    
+    // Number of local memory banks. This is used to geneate kernel with local memory 
+    // transposes with appropriate padding to avoid bank conflicts to local memory
+    // e.g. on NVidia it is 16.
+    unsigned                  num_local_mem_banks;
 }cl_fft_plan;
 
 void FFT1D(cl_fft_plan *plan, cl_fft_kernel_dir dir);
--- a/fft_Example/fft_kernelstring.cc	Tue Feb 05 15:12:19 2013 +0900
+++ b/fft_Example/fft_kernelstring.cc	Tue Feb 05 15:19:02 2013 +0900
@@ -61,131 +61,132 @@
 #define max(A,B) ((A) > (B) ? (A) : (B))
 #define min(A,B) ((A) < (B) ? (A) : (B))
 
-static string 
+static string
 num2str(int num)
 {
-	char temp[200];
-	sprintf(temp, "%d", num);
-	return string(temp);
+    char temp[200];
+    sprintf(temp, "%d", num);
+    return string(temp);
 }
 
-// For any n, this function decomposes n into factors for loacal memory tranpose 
+// For any n, this function decomposes n into factors for loacal memory tranpose
 // based fft. Factors (radices) are sorted such that the first one (radixArray[0])
 // is the largest. This base radix determines the number of registers used by each
 // work item and product of remaining radices determine the size of work group needed.
 // To make things concrete with and example, suppose n = 1024. It is decomposed into
-// 1024 = 16 x 16 x 4. Hence kernel uses float2 a[16], for local in-register fft and 
+// 1024 = 16 x 16 x 4. Hence kernel uses float2 a[16], for local in-register fft and
 // needs 16 x 4 = 64 work items per work group. So kernel first performance 64 length
-// 16 ffts (64 work items working in parallel) following by transpose using local 
+// 16 ffts (64 work items working in parallel) following by transpose using local
 // memory followed by again 64 length 16 ffts followed by transpose using local memory
-// followed by 256 length 4 ffts. For the last step since with size of work group is 
+// followed by 256 length 4 ffts. For the last step since with size of work group is
 // 64 and each work item can array for 16 values, 64 work items can compute 256 length
-// 4 ffts by each work item computing 4 length 4 ffts. 
+// 4 ffts by each work item computing 4 length 4 ffts.
 // Similarly for n = 2048 = 8 x 8 x 8 x 4, each work group has 8 x 8 x 4 = 256 work
 // iterms which each computes 256 (in-parallel) length 8 ffts in-register, followed
 // by transpose using local memory, followed by 256 length 8 in-register ffts, followed
 // by transpose using local memory, followed by 256 length 8 in-register ffts, followed
 // by transpose using local memory, followed by 512 length 4 in-register ffts. Again,
 // for the last step, each work item computes two length 4 in-register ffts and thus
-// 256 work items are needed to compute all 512 ffts. 
-// For n = 32 = 8 x 4, 4 work items first compute 4 in-register 
+// 256 work items are needed to compute all 512 ffts.
+// For n = 32 = 8 x 4, 4 work items first compute 4 in-register
 // lenth 8 ffts, followed by transpose using local memory followed by 8 in-register
 // length 4 ffts, where each work item computes two length 4 ffts thus 4 work items
-// can compute 8 length 4 ffts. However if work group size of say 64 is choosen, 
-// each work group can compute 64/ 4 = 16 size 32 ffts (batched transform). 
+// can compute 8 length 4 ffts. However if work group size of say 64 is choosen,
+// each work group can compute 64/ 4 = 16 size 32 ffts (batched transform).
 // Users can play with these parameters to figure what gives best performance on
 // their particular device i.e. some device have less register space thus using
-// smaller base radix can avoid spilling ... some has small local memory thus 
+// smaller base radix can avoid spilling ... some has small local memory thus
 // using smaller work group size may be required etc
 
-static void 
+static void
 getRadixArray(unsigned int n, unsigned int *radixArray, unsigned int *numRadices, unsigned int maxRadix)
 {
     if(maxRadix > 1)
-    {
-        maxRadix = min(n, maxRadix);
-        unsigned int cnt = 0;
-        while(n > maxRadix)
+        {
+            maxRadix = min(n, maxRadix);
+            unsigned int cnt = 0;
+            while(n > maxRadix)
+                {
+                    radixArray[cnt++] = maxRadix;
+                    n /= maxRadix;
+                }
+            radixArray[cnt++] = n;
+            *numRadices = cnt;
+            return;
+        }
+
+    switch(n)
         {
-            radixArray[cnt++] = maxRadix;
-            n /= maxRadix;
-        }
-        radixArray[cnt++] = n;
-        *numRadices = cnt;
-        return;
-    }
+        case 2:
+            *numRadices = 1;
+            radixArray[0] = 2;
+            break;
+
+        case 4:
+            *numRadices = 1;
+            radixArray[0] = 4;
+            break;
+
+        case 8:
+            *numRadices = 1;
+            radixArray[0] = 8;
+            break;
+
+        case 16:
+            *numRadices = 2;
+            radixArray[0] = 8; radixArray[1] = 2;
+            break;
 
-	switch(n) 
-	{
-		case 2:
-			*numRadices = 1;
-			radixArray[0] = 2;
-			break;
-			
-		case 4:
-			*numRadices = 1;
-			radixArray[0] = 4;
-			break;
-			
-		case 8:
-			*numRadices = 1;
-			radixArray[0] = 8;
-			break;
-			
-		case 16:
-			*numRadices = 2;
-			radixArray[0] = 8; radixArray[1] = 2; 
-			break;
-			
-		case 32:
-			*numRadices = 2;
-			radixArray[0] = 8; radixArray[1] = 4;
-			break;
-			
-		case 64:
-			*numRadices = 2;
-			radixArray[0] = 8; radixArray[1] = 8;
-			break;
-			
-		case 128:
-			*numRadices = 3;
-			radixArray[0] = 8; radixArray[1] = 4; radixArray[2] = 4;
-			break;
-			
-		case 256:
-			*numRadices = 4;
-			radixArray[0] = 4; radixArray[1] = 4; radixArray[2] = 4; radixArray[3] = 4;
-			break;
-			
-		case 512:
-			*numRadices = 3;
-			radixArray[0] = 8; radixArray[1] = 8; radixArray[2] = 8;
-			break;			
-			
-		case 1024:
-			*numRadices = 3;
-			radixArray[0] = 16; radixArray[1] = 16; radixArray[2] = 4;
-			break;	
-		case 2048:
-			*numRadices = 4;
-			radixArray[0] = 8; radixArray[1] = 8; radixArray[2] = 8; radixArray[3] = 4;
-			break;
-		default:
-			*numRadices = 0;
-			return;
-	}
+        case 32:
+            *numRadices = 2;
+            radixArray[0] = 8; radixArray[1] = 4;
+            break;
+
+        case 64:
+            *numRadices = 2;
+            radixArray[0] = 8; radixArray[1] = 8;
+            break;
+
+        case 128:
+            *numRadices = 3;
+            radixArray[0] = 8; radixArray[1] = 4; radixArray[2] = 4;
+            break;
+
+        case 256:
+            *numRadices = 4;
+            radixArray[0] = 4; radixArray[1] = 4; radixArray[2] = 4; radixArray[3] = 4;
+            break;
+
+        case 512:
+            *numRadices = 3;
+            radixArray[0] = 8; radixArray[1] = 8; radixArray[2] = 8;
+            break;
+
+        case 1024:
+            *numRadices = 3;
+            radixArray[0] = 16; radixArray[1] = 16; radixArray[2] = 4;
+            break;
+        case 2048:
+            *numRadices = 4;
+            radixArray[0] = 8; radixArray[1] = 8; radixArray[2] = 8; radixArray[3] = 4;
+            break;
+        default:
+            *numRadices = 0;
+            return;
+        }
 }
 
 static void
 insertHeader(string &kernelString, string &kernelName, clFFT_DataFormat dataFormat)
 {
-	if(dataFormat == clFFT_SplitComplexFormat) 
-		kernelString += string("__kernel void ") + kernelName + string("(__global float *in_real, __global float *in_imag, __global float *out_real, __global float *out_imag, int dir, int S)\n");
-	else 
-		kernelString += string("__kernel void ") + kernelName + string("(__global float2 *in, __global float2 *out, int dir, int S)\n");
+    if(dataFormat == clFFT_SplitComplexFormat)
+        kernelString += string("__kernel void ") + kernelName + string("(__global float *in_real, __global float *in_imag, __global float *out_real, __global float *out_imag, int dir, int S)\n");
+    else
+        kernelString += string("__kernel void ") + kernelName + string("(__global float2 *in, __global float2 *out, int dir, int S)\n");
+    printf("%s\n",kernelName.c_str());
 }
 
-static void 
+static void
 insertVariables(string &kStream, int maxRadix)
 {
     kStream += string("    int i, j, r, indexIn, indexOut, index, tid, bNum, xNum, k, l;\n");
@@ -202,1056 +203,1059 @@
 static void
 formattedLoad(string &kernelString, int aIndex, int gIndex, clFFT_DataFormat dataFormat)
 {
-	if(dataFormat == clFFT_InterleavedComplexFormat)
-		kernelString += string("        a[") + num2str(aIndex) + string("] = in[") + num2str(gIndex) + string("];\n");
-	else
-	{
-		kernelString += string("        a[") + num2str(aIndex) + string("].x = in_real[") + num2str(gIndex) + string("];\n");
-		kernelString += string("        a[") + num2str(aIndex) + string("].y = in_imag[") + num2str(gIndex) + string("];\n");
-	}
+    if(dataFormat == clFFT_InterleavedComplexFormat)
+        kernelString += string("        a[") + num2str(aIndex) + string("] = in[") + num2str(gIndex) + string("];\n");
+    else
+        {
+            kernelString += string("        a[") + num2str(aIndex) + string("].x = in_real[") + num2str(gIndex) + string("];\n");
+            kernelString += string("        a[") + num2str(aIndex) + string("].y = in_imag[") + num2str(gIndex) + string("];\n");
+        }
 }
 
 static void
 formattedStore(string &kernelString, int aIndex, int gIndex, clFFT_DataFormat dataFormat)
 {
-	if(dataFormat == clFFT_InterleavedComplexFormat)
-		kernelString += string("        out[") + num2str(gIndex) + string("] = a[") + num2str(aIndex) + string("];\n");
-	else
-	{
-		kernelString += string("        out_real[") + num2str(gIndex) + string("] = a[") + num2str(aIndex) + string("].x;\n");
-		kernelString += string("        out_imag[") + num2str(gIndex) + string("] = a[") + num2str(aIndex) + string("].y;\n");
-	}
+    if(dataFormat == clFFT_InterleavedComplexFormat)
+        kernelString += string("        out[") + num2str(gIndex) + string("] = a[") + num2str(aIndex) + string("];\n");
+    else
+        {
+            kernelString += string("        out_real[") + num2str(gIndex) + string("] = a[") + num2str(aIndex) + string("].x;\n");
+            kernelString += string("        out_imag[") + num2str(gIndex) + string("] = a[") + num2str(aIndex) + string("].y;\n");
+        }
 }
 
 static int
 insertGlobalLoadsAndTranspose(string &kernelString, int N, int numWorkItemsPerXForm, int numXFormsPerWG, int R0, int mem_coalesce_width, clFFT_DataFormat dataFormat)
 {
-	int log2NumWorkItemsPerXForm = (int) log2(numWorkItemsPerXForm);
-	int groupSize = numWorkItemsPerXForm * numXFormsPerWG;
-	int i, j;
-	int lMemSize = 0;
-	
-	if(numXFormsPerWG > 1)
-	    kernelString += string("        s = S & ") + num2str(numXFormsPerWG - 1) + string(";\n");
-	
+    int log2NumWorkItemsPerXForm = (int) log2(numWorkItemsPerXForm);
+    int groupSize = numWorkItemsPerXForm * numXFormsPerWG;
+    int i, j;
+    int lMemSize = 0;
+
+    if(numXFormsPerWG > 1)
+        kernelString += string("        s = S & ") + num2str(numXFormsPerWG - 1) + string(";\n");
+
     if(numWorkItemsPerXForm >= mem_coalesce_width)
-    {   		
-		if(numXFormsPerWG > 1)
-		{
-            kernelString += string("    ii = lId & ") + num2str(numWorkItemsPerXForm-1) + string(";\n");
-            kernelString += string("    jj = lId >> ") + num2str(log2NumWorkItemsPerXForm) + string(";\n");
-            kernelString += string("    if( !s || (groupId < get_num_groups(0)-1) || (jj < s) ) {\n");
-			kernelString += string("        offset = mad24( mad24(groupId, ") + num2str(numXFormsPerWG) + string(", jj), ") + num2str(N) + string(", ii );\n");
-			if(dataFormat == clFFT_InterleavedComplexFormat)
-			{
-			    kernelString += string("        in += offset;\n");
-			    kernelString += string("        out += offset;\n");
-			}
-			else
-			{
-			    kernelString += string("        in_real += offset;\n");
-				kernelString += string("        in_imag += offset;\n");
-			    kernelString += string("        out_real += offset;\n");
-				kernelString += string("        out_imag += offset;\n");
-			}
-			for(i = 0; i < R0; i++)
-				formattedLoad(kernelString, i, i*numWorkItemsPerXForm, dataFormat);
-			kernelString += string("    }\n");
-		}
-		else
-		{
-			kernelString += string("    ii = lId;\n");
-			kernelString += string("    jj = 0;\n");
-			kernelString += string("    offset =  mad24(groupId, ") + num2str(N) + string(", ii);\n");
-			if(dataFormat == clFFT_InterleavedComplexFormat)
-			{
-			    kernelString += string("        in += offset;\n");
-			    kernelString += string("        out += offset;\n");
-			}
-			else
-			{
-			    kernelString += string("        in_real += offset;\n");
-				kernelString += string("        in_imag += offset;\n");
-			    kernelString += string("        out_real += offset;\n");
-				kernelString += string("        out_imag += offset;\n");
-			}
-			for(i = 0; i < R0; i++)
-				formattedLoad(kernelString, i, i*numWorkItemsPerXForm, dataFormat);
-		}
-    }
-    else if( N >= mem_coalesce_width )
-    {
-        int numInnerIter = N / mem_coalesce_width;
-        int numOuterIter = numXFormsPerWG / ( groupSize / mem_coalesce_width );
-		
-        kernelString += string("    ii = lId & ") + num2str(mem_coalesce_width - 1) + string(";\n");
-        kernelString += string("    jj = lId >> ") + num2str((int)log2(mem_coalesce_width)) + string(";\n");
-        kernelString += string("    lMemStore = sMem + mad24( jj, ") + num2str(N + numWorkItemsPerXForm) + string(", ii );\n");
-        kernelString += string("    offset = mad24( groupId, ") + num2str(numXFormsPerWG) + string(", jj);\n");
-        kernelString += string("    offset = mad24( offset, ") + num2str(N) + string(", ii );\n");
-		if(dataFormat == clFFT_InterleavedComplexFormat)
-		{
-			kernelString += string("        in += offset;\n");
-			kernelString += string("        out += offset;\n");
-		}
-		else
-		{
-			kernelString += string("        in_real += offset;\n");
-			kernelString += string("        in_imag += offset;\n");
-			kernelString += string("        out_real += offset;\n");
-			kernelString += string("        out_imag += offset;\n");
-		}
-        
-		kernelString += string("if((groupId == get_num_groups(0)-1) && s) {\n");
-        for(i = 0; i < numOuterIter; i++ )
         {
-            kernelString += string("    if( jj < s ) {\n");
-			for(j = 0; j < numInnerIter; j++ ) 
-				formattedLoad(kernelString, i * numInnerIter + j, j * mem_coalesce_width + i * ( groupSize / mem_coalesce_width ) * N, dataFormat);
-			kernelString += string("    }\n"); 
-			if(i != numOuterIter - 1)
-			    kernelString += string("    jj += ") + num2str(groupSize / mem_coalesce_width) + string(";\n");			 
+            if(numXFormsPerWG > 1)
+                {
+                    kernelString += string("    ii = lId & ") + num2str(numWorkItemsPerXForm-1) + string(";\n");
+                    kernelString += string("    jj = lId >> ") + num2str(log2NumWorkItemsPerXForm) + string(";\n");
+                    kernelString += string("    if( !s || (groupId < get_num_groups(0)-1) || (jj < s) ) {\n");
+                    kernelString += string("        offset = mad24( mad24(groupId, ") + num2str(numXFormsPerWG) + string(", jj), ") + num2str(N) + string(", ii );\n");
+                    if(dataFormat == clFFT_InterleavedComplexFormat)
+                        {
+                            kernelString += string("        in += offset;\n");
+                            kernelString += string("        out += offset;\n");
+                        }
+                    else
+                        {
+                            kernelString += string("        in_real += offset;\n");
+                            kernelString += string("        in_imag += offset;\n");
+                            kernelString += string("        out_real += offset;\n");
+                            kernelString += string("        out_imag += offset;\n");
+                        }
+                    for(i = 0; i < R0; i++)
+                        formattedLoad(kernelString, i, i*numWorkItemsPerXForm, dataFormat);
+                    kernelString += string("    }\n");
+                }
+            else
+                {
+                    kernelString += string("    ii = lId;\n");
+                    kernelString += string("    jj = 0;\n");
+                    kernelString += string("    offset =  mad24(groupId, ") + num2str(N) + string(", ii);\n");
+                    if(dataFormat == clFFT_InterleavedComplexFormat)
+                        {
+                            kernelString += string("        in += offset;\n");
+                            kernelString += string("        out += offset;\n");
+                        }
+                    else
+                        {
+                            kernelString += string("        in_real += offset;\n");
+                            kernelString += string("        in_imag += offset;\n");
+                            kernelString += string("        out_real += offset;\n");
+                            kernelString += string("        out_imag += offset;\n");
+                        }
+                    for(i = 0; i < R0; i++)
+                        formattedLoad(kernelString, i, i*numWorkItemsPerXForm, dataFormat);
+                }
         }
-		kernelString += string("}\n ");
-		kernelString += string("else {\n");
-        for(i = 0; i < numOuterIter; i++ )
+    else if( N >= mem_coalesce_width )
         {
-			for(j = 0; j < numInnerIter; j++ ) 
-				formattedLoad(kernelString, i * numInnerIter + j, j * mem_coalesce_width + i * ( groupSize / mem_coalesce_width ) * N, dataFormat);			
-        }		
-		kernelString += string("}\n");
-        
-		kernelString += string("    ii = lId & ") + num2str(numWorkItemsPerXForm - 1) + string(";\n");
-		kernelString += string("    jj = lId >> ") + num2str(log2NumWorkItemsPerXForm) + string(";\n");
-        kernelString += string("    lMemLoad  = sMem + mad24( jj, ") + num2str(N + numWorkItemsPerXForm) + string(", ii);\n");  
-		
-        for( i = 0; i < numOuterIter; i++ )
-		{
-			for( j = 0; j < numInnerIter; j++ )
-			{	
-				kernelString += string("    lMemStore[") + num2str(j * mem_coalesce_width + i * ( groupSize / mem_coalesce_width ) * (N + numWorkItemsPerXForm )) + string("] = a[") + 
-				                num2str(i * numInnerIter + j) + string("].x;\n");
-			}
-		}	
-        kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");
-        
-        for( i = 0; i < R0; i++ )
-			kernelString += string("    a[") + num2str(i) + string("].x = lMemLoad[") + num2str(i * numWorkItemsPerXForm) + string("];\n");            
-		kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");  
+            int numInnerIter = N / mem_coalesce_width;
+            int numOuterIter = numXFormsPerWG / ( groupSize / mem_coalesce_width );
 
-	    for( i = 0; i < numOuterIter; i++ )
-		{
-			for( j = 0; j < numInnerIter; j++ )
-			{	
-				kernelString += string("    lMemStore[") + num2str(j * mem_coalesce_width + i * ( groupSize / mem_coalesce_width ) * (N + numWorkItemsPerXForm )) + string("] = a[") + 
-								num2str(i * numInnerIter + j) + string("].y;\n");
-			}
-	    }	
-		kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");
-																						   
-		for( i = 0; i < R0; i++ )
-			kernelString += string("    a[") + num2str(i) + string("].y = lMemLoad[") + num2str(i * numWorkItemsPerXForm) + string("];\n");            
-		kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");  
-		
-		lMemSize = (N + numWorkItemsPerXForm) * numXFormsPerWG;
-    }  
-    else
-    {
-        kernelString += string("    offset = mad24( groupId,  ") + num2str(N * numXFormsPerWG) + string(", lId );\n");
-		if(dataFormat == clFFT_InterleavedComplexFormat)
-		{
-			kernelString += string("        in += offset;\n");
-			kernelString += string("        out += offset;\n");
-		}
-		else
-		{
-			kernelString += string("        in_real += offset;\n");
-			kernelString += string("        in_imag += offset;\n");
-			kernelString += string("        out_real += offset;\n");
-			kernelString += string("        out_imag += offset;\n");
-		}
-        
-        kernelString += string("    ii = lId & ") + num2str(N-1) + string(";\n");
-        kernelString += string("    jj = lId >> ") + num2str((int)log2(N)) + string(";\n");
-        kernelString += string("    lMemStore = sMem + mad24( jj, ") + num2str(N + numWorkItemsPerXForm) + string(", ii );\n");
-        
-		kernelString += string("if((groupId == get_num_groups(0)-1) && s) {\n");
-        for( i = 0; i < R0; i++ )
-        {
-            kernelString += string("    if(jj < s )\n");
-			formattedLoad(kernelString, i, i*groupSize, dataFormat);
-			if(i != R0 - 1)
-			    kernelString += string("    jj += ") + num2str(groupSize / N) + string(";\n");
-        }
-		kernelString += string("}\n");
-		kernelString += string("else {\n");
-        for( i = 0; i < R0; i++ )
-        {
-			formattedLoad(kernelString, i, i*groupSize, dataFormat);
-        }		
-		kernelString += string("}\n");
-        
-		if(numWorkItemsPerXForm > 1)
-		{
+            kernelString += string("    ii = lId & ") + num2str(mem_coalesce_width - 1) + string(";\n");
+            kernelString += string("    jj = lId >> ") + num2str((int)log2(mem_coalesce_width)) + string(";\n");
+            kernelString += string("    lMemStore = sMem + mad24( jj, ") + num2str(N + numWorkItemsPerXForm) + string(", ii );\n");
+            kernelString += string("    offset = mad24( groupId, ") + num2str(numXFormsPerWG) + string(", jj);\n");
+            kernelString += string("    offset = mad24( offset, ") + num2str(N) + string(", ii );\n");
+            if(dataFormat == clFFT_InterleavedComplexFormat)
+                {
+                    kernelString += string("        in += offset;\n");
+                    kernelString += string("        out += offset;\n");
+                }
+            else
+                {
+                    kernelString += string("        in_real += offset;\n");
+                    kernelString += string("        in_imag += offset;\n");
+                    kernelString += string("        out_real += offset;\n");
+                    kernelString += string("        out_imag += offset;\n");
+                }
+
+            kernelString += string("if((groupId == get_num_groups(0)-1) && s) {\n");
+            for(i = 0; i < numOuterIter; i++ )
+                {
+                    kernelString += string("    if( jj < s ) {\n");
+                    for(j = 0; j < numInnerIter; j++ )
+                        formattedLoad(kernelString, i * numInnerIter + j, j * mem_coalesce_width + i * ( groupSize / mem_coalesce_width ) * N, dataFormat);
+                    kernelString += string("    }\n");
+                    if(i != numOuterIter - 1)
+                        kernelString += string("    jj += ") + num2str(groupSize / mem_coalesce_width) + string(";\n");
+                }
+            kernelString += string("}\n ");
+            kernelString += string("else {\n");
+            for(i = 0; i < numOuterIter; i++ )
+                {
+                    for(j = 0; j < numInnerIter; j++ )
+                        formattedLoad(kernelString, i * numInnerIter + j, j * mem_coalesce_width + i * ( groupSize / mem_coalesce_width ) * N, dataFormat);
+                }
+            kernelString += string("}\n");
+
             kernelString += string("    ii = lId & ") + num2str(numWorkItemsPerXForm - 1) + string(";\n");
             kernelString += string("    jj = lId >> ") + num2str(log2NumWorkItemsPerXForm) + string(";\n");
-            kernelString += string("    lMemLoad = sMem + mad24( jj, ") + num2str(N + numWorkItemsPerXForm) + string(", ii );\n"); 
-		}
-		else 
-		{
-            kernelString += string("    ii = 0;\n");
-            kernelString += string("    jj = lId;\n");
-            kernelString += string("    lMemLoad = sMem + mul24( jj, ") + num2str(N + numWorkItemsPerXForm) + string(");\n"); 			
-		}
+            kernelString += string("    lMemLoad  = sMem + mad24( jj, ") + num2str(N + numWorkItemsPerXForm) + string(", ii);\n");
+
+            for( i = 0; i < numOuterIter; i++ )
+                {
+                    for( j = 0; j < numInnerIter; j++ )
+                        {
+                            kernelString += string("    lMemStore[") + num2str(j * mem_coalesce_width + i * ( groupSize / mem_coalesce_width ) * (N + numWorkItemsPerXForm )) + string("] = a[") +
+                                num2str(i * numInnerIter + j) + string("].x;\n");
+                        }
+                }
+            kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");
+
+            for( i = 0; i < R0; i++ )
+                kernelString += string("    a[") + num2str(i) + string("].x = lMemLoad[") + num2str(i * numWorkItemsPerXForm) + string("];\n");
+            kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");
+
+            for( i = 0; i < numOuterIter; i++ )
+                {
+                    for( j = 0; j < numInnerIter; j++ )
+                        {
+                            kernelString += string("    lMemStore[") + num2str(j * mem_coalesce_width + i * ( groupSize / mem_coalesce_width ) * (N + numWorkItemsPerXForm )) + string("] = a[") +
+                                num2str(i * numInnerIter + j) + string("].y;\n");
+                        }
+                }
+            kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");
+
+            for( i = 0; i < R0; i++ )
+                kernelString += string("    a[") + num2str(i) + string("].y = lMemLoad[") + num2str(i * numWorkItemsPerXForm) + string("];\n");
+            kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");
+
+            lMemSize = (N + numWorkItemsPerXForm) * numXFormsPerWG;
+        }
+    else
+        {
+            kernelString += string("    offset = mad24( groupId,  ") + num2str(N * numXFormsPerWG) + string(", lId );\n");
+            if(dataFormat == clFFT_InterleavedComplexFormat)
+                {
+                    kernelString += string("        in += offset;\n");
+                    kernelString += string("        out += offset;\n");
+                }
+            else
+                {
+                    kernelString += string("        in_real += offset;\n");
+                    kernelString += string("        in_imag += offset;\n");
+                    kernelString += string("        out_real += offset;\n");
+                    kernelString += string("        out_imag += offset;\n");
+                }
 
-		
-        for( i = 0; i < R0; i++ )
-            kernelString += string("    lMemStore[") + num2str(i * ( groupSize / N ) * ( N + numWorkItemsPerXForm )) + string("] = a[") + num2str(i) + string("].x;\n"); 
-        kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n"); 
-        
-        for( i = 0; i < R0; i++ )
-            kernelString += string("    a[") + num2str(i) + string("].x = lMemLoad[") + num2str(i * numWorkItemsPerXForm) + string("];\n");
-		kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");
-        
-        for( i = 0; i < R0; i++ )
-            kernelString += string("    lMemStore[") + num2str(i * ( groupSize / N ) * ( N + numWorkItemsPerXForm )) + string("] = a[") + num2str(i) + string("].y;\n"); 
-        kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n"); 
-        
-        for( i = 0; i < R0; i++ )
-            kernelString += string("    a[") + num2str(i) + string("].y = lMemLoad[") + num2str(i * numWorkItemsPerXForm) + string("];\n");
-		kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");
-		
-		lMemSize = (N + numWorkItemsPerXForm) * numXFormsPerWG;
-    }
-	
-	return lMemSize;
+            kernelString += string("    ii = lId & ") + num2str(N-1) + string(";\n");
+            kernelString += string("    jj = lId >> ") + num2str((int)log2(N)) + string(";\n");
+            kernelString += string("    lMemStore = sMem + mad24( jj, ") + num2str(N + numWorkItemsPerXForm) + string(", ii );\n");
+
+            kernelString += string("if((groupId == get_num_groups(0)-1) && s) {\n");
+            for( i = 0; i < R0; i++ )
+                {
+                    kernelString += string("    if(jj < s )\n");
+                    formattedLoad(kernelString, i, i*groupSize, dataFormat);
+                    if(i != R0 - 1)
+                        kernelString += string("    jj += ") + num2str(groupSize / N) + string(";\n");
+                }
+            kernelString += string("}\n");
+            kernelString += string("else {\n");
+            for( i = 0; i < R0; i++ )
+                {
+                    formattedLoad(kernelString, i, i*groupSize, dataFormat);
+                }
+            kernelString += string("}\n");
+
+            if(numWorkItemsPerXForm > 1)
+                {
+                    kernelString += string("    ii = lId & ") + num2str(numWorkItemsPerXForm - 1) + string(";\n");
+                    kernelString += string("    jj = lId >> ") + num2str(log2NumWorkItemsPerXForm) + string(";\n");
+                    kernelString += string("    lMemLoad = sMem + mad24( jj, ") + num2str(N + numWorkItemsPerXForm) + string(", ii );\n");
+                }
+            else
+                {
+                    kernelString += string("    ii = 0;\n");
+                    kernelString += string("    jj = lId;\n");
+                    kernelString += string("    lMemLoad = sMem + mul24( jj, ") + num2str(N + numWorkItemsPerXForm) + string(");\n");
+                }
+
+
+            for( i = 0; i < R0; i++ )
+                kernelString += string("    lMemStore[") + num2str(i * ( groupSize / N ) * ( N + numWorkItemsPerXForm )) + string("] = a[") + num2str(i) + string("].x;\n");
+            kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");
+
+            for( i = 0; i < R0; i++ )
+                kernelString += string("    a[") + num2str(i) + string("].x = lMemLoad[") + num2str(i * numWorkItemsPerXForm) + string("];\n");
+            kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");
+
+            for( i = 0; i < R0; i++ )
+                kernelString += string("    lMemStore[") + num2str(i * ( groupSize / N ) * ( N + numWorkItemsPerXForm )) + string("] = a[") + num2str(i) + string("].y;\n");
+            kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");
+
+            for( i = 0; i < R0; i++ )
+                kernelString += string("    a[") + num2str(i) + string("].y = lMemLoad[") + num2str(i * numWorkItemsPerXForm) + string("];\n");
+            kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");
+
+            lMemSize = (N + numWorkItemsPerXForm) * numXFormsPerWG;
+        }
+
+    return lMemSize;
 }
 
 static int
 insertGlobalStoresAndTranspose(string &kernelString, int N, int maxRadix, int Nr, int numWorkItemsPerXForm, int numXFormsPerWG, int mem_coalesce_width, clFFT_DataFormat dataFormat)
 {
-	int groupSize = numWorkItemsPerXForm * numXFormsPerWG;
-	int i, j, k, ind;
-	int lMemSize = 0;
-	int numIter = maxRadix / Nr;
-	string indent = string("");
-	
+    int groupSize = numWorkItemsPerXForm * numXFormsPerWG;
+    int i, j, k, ind;
+    int lMemSize = 0;
+    int numIter = maxRadix / Nr;
+    string indent = string("");
+
     if( numWorkItemsPerXForm >= mem_coalesce_width )
-    {   
-		if(numXFormsPerWG > 1)
-		{
-            kernelString += string("    if( !s || (groupId < get_num_groups(0)-1) || (jj < s) ) {\n");
-			indent = string("    ");
-		}	
-		for(i = 0; i < maxRadix; i++) 
-		{
-			j = i % numIter;
-			k = i / numIter;
-			ind = j * Nr + k;
-			formattedStore(kernelString, ind, i*numWorkItemsPerXForm, dataFormat);
-		}
-		if(numXFormsPerWG > 1)
-		    kernelString += string("    }\n");
-    }
+        {
+            if(numXFormsPerWG > 1)
+                {
+                    kernelString += string("    if( !s || (groupId < get_num_groups(0)-1) || (jj < s) ) {\n");
+                    indent = string("    ");
+                }
+            for(i = 0; i < maxRadix; i++)
+                {
+                    j = i % numIter;
+                    k = i / numIter;
+                    ind = j * Nr + k;
+                    formattedStore(kernelString, ind, i*numWorkItemsPerXForm, dataFormat);
+                }
+            if(numXFormsPerWG > 1)
+                kernelString += string("    }\n");
+        }
     else if( N >= mem_coalesce_width )
-    {
-        int numInnerIter = N / mem_coalesce_width;
-        int numOuterIter = numXFormsPerWG / ( groupSize / mem_coalesce_width );
-		
-        kernelString += string("    lMemLoad  = sMem + mad24( jj, ") + num2str(N + numWorkItemsPerXForm) + string(", ii );\n");  
-        kernelString += string("    ii = lId & ") + num2str(mem_coalesce_width - 1) + string(";\n");
-        kernelString += string("    jj = lId >> ") + num2str((int)log2(mem_coalesce_width)) + string(";\n");
-        kernelString += string("    lMemStore = sMem + mad24( jj,") + num2str(N + numWorkItemsPerXForm) + string(", ii );\n");
-		
-        for( i = 0; i < maxRadix; i++ )
-		{
-			j = i % numIter;
-			k = i / numIter;
-			ind = j * Nr + k;
-            kernelString += string("    lMemLoad[") + num2str(i*numWorkItemsPerXForm) + string("] = a[") + num2str(ind) + string("].x;\n");            
-		}	
-        kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");         
-		
-        for( i = 0; i < numOuterIter; i++ )
-			for( j = 0; j < numInnerIter; j++ )
-				kernelString += string("    a[") + num2str(i*numInnerIter + j) + string("].x = lMemStore[") + num2str(j*mem_coalesce_width + i*( groupSize / mem_coalesce_width )*(N + numWorkItemsPerXForm)) + string("];\n");
-        kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");
-		
-        for( i = 0; i < maxRadix; i++ )
-		{
-			j = i % numIter;
-			k = i / numIter;
-			ind = j * Nr + k;
-            kernelString += string("    lMemLoad[") + num2str(i*numWorkItemsPerXForm) + string("] = a[") + num2str(ind) + string("].y;\n");            
-		}	
-        kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");         
-		
-        for( i = 0; i < numOuterIter; i++ )
-			for( j = 0; j < numInnerIter; j++ )
-				kernelString += string("    a[") + num2str(i*numInnerIter + j) + string("].y = lMemStore[") + num2str(j*mem_coalesce_width + i*( groupSize / mem_coalesce_width )*(N + numWorkItemsPerXForm)) + string("];\n");
-        kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n"); 
-		
-		kernelString += string("if((groupId == get_num_groups(0)-1) && s) {\n");
-		for(i = 0; i < numOuterIter; i++ )
         {
-            kernelString += string("    if( jj < s ) {\n");
-			for(j = 0; j < numInnerIter; j++ ) 
-				formattedStore(kernelString, i*numInnerIter + j, j*mem_coalesce_width + i*(groupSize/mem_coalesce_width)*N, dataFormat); 
-			kernelString += string("    }\n"); 
-			if(i != numOuterIter - 1)
-			    kernelString += string("    jj += ") + num2str(groupSize / mem_coalesce_width) + string(";\n");			 
+            int numInnerIter = N / mem_coalesce_width;
+            int numOuterIter = numXFormsPerWG / ( groupSize / mem_coalesce_width );
+
+            kernelString += string("    lMemLoad  = sMem + mad24( jj, ") + num2str(N + numWorkItemsPerXForm) + string(", ii );\n");
+            kernelString += string("    ii = lId & ") + num2str(mem_coalesce_width - 1) + string(";\n");
+            kernelString += string("    jj = lId >> ") + num2str((int)log2(mem_coalesce_width)) + string(";\n");
+            kernelString += string("    lMemStore = sMem + mad24( jj,") + num2str(N + numWorkItemsPerXForm) + string(", ii );\n");
+
+            for( i = 0; i < maxRadix; i++ )
+                {
+                    j = i % numIter;
+                    k = i / numIter;
+                    ind = j * Nr + k;
+                    kernelString += string("    lMemLoad[") + num2str(i*numWorkItemsPerXForm) + string("] = a[") + num2str(ind) + string("].x;\n");
+                }
+            kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");
+
+            for( i = 0; i < numOuterIter; i++ )
+                for( j = 0; j < numInnerIter; j++ )
+                    kernelString += string("    a[") + num2str(i*numInnerIter + j) + string("].x = lMemStore[") + num2str(j*mem_coalesce_width + i*( groupSize / mem_coalesce_width )*(N + numWorkItemsPerXForm)) + string("];\n");
+            kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");
+
+            for( i = 0; i < maxRadix; i++ )
+                {
+                    j = i % numIter;
+                    k = i / numIter;
+                    ind = j * Nr + k;
+                    kernelString += string("    lMemLoad[") + num2str(i*numWorkItemsPerXForm) + string("] = a[") + num2str(ind) + string("].y;\n");
+                }
+            kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");
+
+            for( i = 0; i < numOuterIter; i++ )
+                for( j = 0; j < numInnerIter; j++ )
+                    kernelString += string("    a[") + num2str(i*numInnerIter + j) + string("].y = lMemStore[") + num2str(j*mem_coalesce_width + i*( groupSize / mem_coalesce_width )*(N + numWorkItemsPerXForm)) + string("];\n");
+            kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");
+
+            kernelString += string("if((groupId == get_num_groups(0)-1) && s) {\n");
+            for(i = 0; i < numOuterIter; i++ )
+                {
+                    kernelString += string("    if( jj < s ) {\n");
+                    for(j = 0; j < numInnerIter; j++ )
+                        formattedStore(kernelString, i*numInnerIter + j, j*mem_coalesce_width + i*(groupSize/mem_coalesce_width)*N, dataFormat);
+                    kernelString += string("    }\n");
+                    if(i != numOuterIter - 1)
+                        kernelString += string("    jj += ") + num2str(groupSize / mem_coalesce_width) + string(";\n");
+                }
+            kernelString += string("}\n");
+            kernelString += string("else {\n");
+            for(i = 0; i < numOuterIter; i++ )
+                {
+                    for(j = 0; j < numInnerIter; j++ )
+                        formattedStore(kernelString, i*numInnerIter + j, j*mem_coalesce_width + i*(groupSize/mem_coalesce_width)*N, dataFormat);
+                }
+            kernelString += string("}\n");
+
+            lMemSize = (N + numWorkItemsPerXForm) * numXFormsPerWG;
         }
-		kernelString += string("}\n");
-		kernelString += string("else {\n");
-		for(i = 0; i < numOuterIter; i++ )
-        {
-			for(j = 0; j < numInnerIter; j++ ) 
-				formattedStore(kernelString, i*numInnerIter + j, j*mem_coalesce_width + i*(groupSize/mem_coalesce_width)*N, dataFormat); 
-        }		
-		kernelString += string("}\n");
-		
-		lMemSize = (N + numWorkItemsPerXForm) * numXFormsPerWG;
-	}   	
     else
-    {   
-        kernelString += string("    lMemLoad  = sMem + mad24( jj,") + num2str(N + numWorkItemsPerXForm) + string(", ii );\n");  
-        
-		kernelString += string("    ii = lId & ") + num2str(N - 1) + string(";\n");
-        kernelString += string("    jj = lId >> ") + num2str((int) log2(N)) + string(";\n");
-        kernelString += string("    lMemStore = sMem + mad24( jj,") + num2str(N + numWorkItemsPerXForm) + string(", ii );\n");
-        
-        for( i = 0; i < maxRadix; i++ )
-		{
-			j = i % numIter;
-			k = i / numIter;
-			ind = j * Nr + k;
-            kernelString += string("    lMemLoad[") + num2str(i*numWorkItemsPerXForm) + string("] = a[") + num2str(ind) + string("].x;\n");
-		}	
-        kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");
-        
-        for( i = 0; i < maxRadix; i++ )
-            kernelString += string("    a[") + num2str(i) + string("].x = lMemStore[") + num2str(i*( groupSize / N )*( N + numWorkItemsPerXForm )) + string("];\n"); 
-        kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n"); 
-        
-        for( i = 0; i < maxRadix; i++ )
-		{
-			j = i % numIter;
-			k = i / numIter;
-			ind = j * Nr + k;
-            kernelString += string("    lMemLoad[") + num2str(i*numWorkItemsPerXForm) + string("] = a[") + num2str(ind) + string("].y;\n");
-		}	
-        kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");
-        
-        for( i = 0; i < maxRadix; i++ )
-            kernelString += string("    a[") + num2str(i) + string("].y = lMemStore[") + num2str(i*( groupSize / N )*( N + numWorkItemsPerXForm )) + string("];\n"); 
-        kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n"); 
-        
-		kernelString += string("if((groupId == get_num_groups(0)-1) && s) {\n");
-		for( i = 0; i < maxRadix; i++ )
         {
-            kernelString += string("    if(jj < s ) {\n");
-			formattedStore(kernelString, i, i*groupSize, dataFormat);
-			kernelString += string("    }\n");
-			if( i != maxRadix - 1)
-				kernelString += string("    jj +=") + num2str(groupSize / N) + string(";\n");
-        } 
-		kernelString += string("}\n");
-		kernelString += string("else {\n");
-		for( i = 0; i < maxRadix; i++ )
-        {
-			formattedStore(kernelString, i, i*groupSize, dataFormat);
-        } 		
-		kernelString += string("}\n");
-		
-		lMemSize = (N + numWorkItemsPerXForm) * numXFormsPerWG;
-    }
-	
-	return lMemSize;
+            kernelString += string("    lMemLoad  = sMem + mad24( jj,") + num2str(N + numWorkItemsPerXForm) + string(", ii );\n");
+
+            kernelString += string("    ii = lId & ") + num2str(N - 1) + string(";\n");
+            kernelString += string("    jj = lId >> ") + num2str((int) log2(N)) + string(";\n");
+            kernelString += string("    lMemStore = sMem + mad24( jj,") + num2str(N + numWorkItemsPerXForm) + string(", ii );\n");
+
+            for( i = 0; i < maxRadix; i++ )
+                {
+                    j = i % numIter;
+                    k = i / numIter;
+                    ind = j * Nr + k;
+                    kernelString += string("    lMemLoad[") + num2str(i*numWorkItemsPerXForm) + string("] = a[") + num2str(ind) + string("].x;\n");
+                }
+            kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");
+
+            for( i = 0; i < maxRadix; i++ )
+                kernelString += string("    a[") + num2str(i) + string("].x = lMemStore[") + num2str(i*( groupSize / N )*( N + numWorkItemsPerXForm )) + string("];\n");
+            kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");
+
+            for( i = 0; i < maxRadix; i++ )
+                {
+                    j = i % numIter;
+                    k = i / numIter;
+                    ind = j * Nr + k;
+                    kernelString += string("    lMemLoad[") + num2str(i*numWorkItemsPerXForm) + string("] = a[") + num2str(ind) + string("].y;\n");
+                }
+            kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");
+
+            for( i = 0; i < maxRadix; i++ )
+                kernelString += string("    a[") + num2str(i) + string("].y = lMemStore[") + num2str(i*( groupSize / N )*( N + numWorkItemsPerXForm )) + string("];\n");
+            kernelString += string("    barrier( CLK_LOCAL_MEM_FENCE );\n");
+
+            kernelString += string("if((groupId == get_num_groups(0)-1) && s) {\n");
+            for( i = 0; i < maxRadix; i++ )
+                {
+                    kernelString += string("    if(jj < s ) {\n");
+                    formattedStore(kernelString, i, i*groupSize, dataFormat);
+                    kernelString += string("    }\n");
+                    if( i != maxRadix - 1)
+                        kernelString += string("    jj +=") + num2str(groupSize / N) + string(";\n");
+                }
+            kernelString += string("}\n");
+            kernelString += string("else {\n");
+            for( i = 0; i < maxRadix; i++ )
+                {
+                    formattedStore(kernelString, i, i*groupSize, dataFormat);
+                }
+            kernelString += string("}\n");
+
+            lMemSize = (N + numWorkItemsPerXForm) * numXFormsPerWG;
+        }
+
+    return lMemSize;
 }
 
-static void 
+static void
 insertfftKernel(string &kernelString, int Nr, int numIter)
 {
-	int i;
-	for(i = 0; i < numIter; i++) 
-	{
-		kernelString += string("    fftKernel") + num2str(Nr) + string("(a+") + num2str(i*Nr) + string(", dir);\n");
-	}
+    int i;
+    for(i = 0; i < numIter; i++)
+        {
+            kernelString += string("    fftKernel") + num2str(Nr) + string("(a+") + num2str(i*Nr) + string(", dir);\n");
+        }
 }
 
 static void
 insertTwiddleKernel(string &kernelString, int Nr, int numIter, int Nprev, int len, int numWorkItemsPerXForm)
 {
-	int z, k;
-	int logNPrev = (int)log2(Nprev);
-	
-	for(z = 0; z < numIter; z++) 
-	{
-		if(z == 0)
-		{
-			if(Nprev > 1)
-			    kernelString += string("    angf = (float) (ii >> ") + num2str(logNPrev) + string(");\n");
-			else
-				kernelString += string("    angf = (float) ii;\n");
-		}	
-		else
-		{
-			if(Nprev > 1)
-			    kernelString += string("    angf = (float) ((") + num2str(z*numWorkItemsPerXForm) + string(" + ii) >>") + num2str(logNPrev) + string(");\n"); 
-			else
-				kernelString += string("    angf = (float) (") + num2str(z*numWorkItemsPerXForm) + string(" + ii);\n");
-		}	
-	
-		for(k = 1; k < Nr; k++) {
-			int ind = z*Nr + k;
-			//float fac =  (float) (2.0 * M_PI * (double) k / (double) len);
-			kernelString += string("    ang = dir * ( 2.0f * M_PI * ") + num2str(k) + string(".0f / ") + num2str(len) + string(".0f )") + string(" * angf;\n");
-			kernelString += string("    w = (float2)(native_cos(ang), native_sin(ang));\n");
-			kernelString += string("    a[") + num2str(ind) + string("] = complexMul(a[") + num2str(ind) + string("], w);\n");
-		}
-	}
+    int z, k;
+    int logNPrev = (int)log2(Nprev);
+
+    for(z = 0; z < numIter; z++)
+        {
+            if(z == 0)
+                {
+                    if(Nprev > 1)
+                        kernelString += string("    angf = (float) (ii >> ") + num2str(logNPrev) + string(");\n");
+                    else
+                        kernelString += string("    angf = (float) ii;\n");
+                }
+            else
+                {
+                    if(Nprev > 1)
+                        kernelString += string("    angf = (float) ((") + num2str(z*numWorkItemsPerXForm) + string(" + ii) >>") + num2str(logNPrev) + string(");\n");
+                    else
+                        kernelString += string("    angf = (float) (") + num2str(z*numWorkItemsPerXForm) + string(" + ii);\n");
+                }
+
+            for(k = 1; k < Nr; k++) {
+                int ind = z*Nr + k;
+                //float fac =  (float) (2.0 * M_PI * (double) k / (double) len);
+                kernelString += string("    ang = dir * ( 2.0f * M_PI * ") + num2str(k) + string(".0f / ") + num2str(len) + string(".0f )") + string(" * angf;\n");
+                kernelString += string("    w = (float2)(native_cos(ang), native_sin(ang));\n");
+                kernelString += string("    a[") + num2str(ind) + string("] = complexMul(a[") + num2str(ind) + string("], w);\n");
+            }
+        }
 }
 
 static int
 getPadding(int numWorkItemsPerXForm, int Nprev, int numWorkItemsReq, int numXFormsPerWG, int Nr, int numBanks, int *offset, int *midPad)
 {
-	if((numWorkItemsPerXForm <= Nprev) || (Nprev >= numBanks))
-		*offset = 0;
-	else {
-		int numRowsReq = ((numWorkItemsPerXForm < numBanks) ? numWorkItemsPerXForm : numBanks) / Nprev;
-		int numColsReq = 1;
-		if(numRowsReq > Nr)
-			numColsReq = numRowsReq / Nr;
-		numColsReq = Nprev * numColsReq;
-		*offset = numColsReq;
-	}
-	
-	if(numWorkItemsPerXForm >= numBanks || numXFormsPerWG == 1)
-		*midPad = 0;
-	else {
-		int bankNum = ( (numWorkItemsReq + *offset) * Nr ) & (numBanks - 1);
-		if( bankNum >= numWorkItemsPerXForm )
-			*midPad = 0;
-		else
-			*midPad = numWorkItemsPerXForm - bankNum;
-	}
-	
-	int lMemSize = ( numWorkItemsReq + *offset) * Nr * numXFormsPerWG + *midPad * (numXFormsPerWG - 1);
-	return lMemSize;
+    if((numWorkItemsPerXForm <= Nprev) || (Nprev >= numBanks))
+        *offset = 0;
+    else {
+        int numRowsReq = ((numWorkItemsPerXForm < numBanks) ? numWorkItemsPerXForm : numBanks) / Nprev;
+        int numColsReq = 1;
+        if(numRowsReq > Nr)
+            numColsReq = numRowsReq / Nr;
+        numColsReq = Nprev * numColsReq;
+        *offset = numColsReq;
+    }
+
+    if(numWorkItemsPerXForm >= numBanks || numXFormsPerWG == 1)
+        *midPad = 0;
+    else {
+        int bankNum = ( (numWorkItemsReq + *offset) * Nr ) & (numBanks - 1);
+        if( bankNum >= numWorkItemsPerXForm )
+            *midPad = 0;
+        else
+            *midPad = numWorkItemsPerXForm - bankNum;
+    }
+
+    int lMemSize = ( numWorkItemsReq + *offset) * Nr * numXFormsPerWG + *midPad * (numXFormsPerWG - 1);
+    return lMemSize;
 }
 
 
-static void 
+static void
 insertLocalStores(string &kernelString, int numIter, int Nr, int numWorkItemsPerXForm, int numWorkItemsReq, int offset, string &comp)
 {
-	int z, k;
+    int z, k;
 
-	for(z = 0; z < numIter; z++) {
-		for(k = 0; k < Nr; k++) {
-			int index = k*(numWorkItemsReq + offset) + z*numWorkItemsPerXForm;
-			kernelString += string("    lMemStore[") + num2str(index) + string("] = a[") + num2str(z*Nr + k) + string("].") + comp + string(";\n");
-		}
-	}
-	kernelString += string("    barrier(CLK_LOCAL_MEM_FENCE);\n");
+    for(z = 0; z < numIter; z++) {
+        for(k = 0; k < Nr; k++) {
+            int index = k*(numWorkItemsReq + offset) + z*numWorkItemsPerXForm;
+            kernelString += string("    lMemStore[") + num2str(index) + string("] = a[") + num2str(z*Nr + k) + string("].") + comp + string(";\n");
+        }
+    }
+    kernelString += string("    barrier(CLK_LOCAL_MEM_FENCE);\n");
 }
 
-static void 
+static void
 insertLocalLoads(string &kernelString, int n, int Nr, int Nrn, int Nprev, int Ncurr, int numWorkItemsPerXForm, int numWorkItemsReq, int offset, string &comp)
 {
-	int numWorkItemsReqN = n / Nrn;										
-	int interBlockHNum = max( Nprev / numWorkItemsPerXForm, 1 );			
-	int interBlockHStride = numWorkItemsPerXForm;							
-	int vertWidth = max(numWorkItemsPerXForm / Nprev, 1);					
-	vertWidth = min( vertWidth, Nr);									
-	int vertNum = Nr / vertWidth;										
-	int vertStride = ( n / Nr + offset ) * vertWidth;					
-	int iter = max( numWorkItemsReqN / numWorkItemsPerXForm, 1);
-	int intraBlockHStride = (numWorkItemsPerXForm / (Nprev*Nr)) > 1 ? (numWorkItemsPerXForm / (Nprev*Nr)) : 1;
-	intraBlockHStride *= Nprev;
-	
-	int stride = numWorkItemsReq / Nrn;									
-	int i;
-	for(i = 0; i < iter; i++) {
-		int ii = i / (interBlockHNum * vertNum);
-		int zz = i % (interBlockHNum * vertNum);
-		int jj = zz % interBlockHNum;
-		int kk = zz / interBlockHNum;
-		int z;
-		for(z = 0; z < Nrn; z++) {
-			int st = kk * vertStride + jj * interBlockHStride + ii * intraBlockHStride + z * stride;
-			kernelString += string("    a[") + num2str(i*Nrn + z) + string("].") + comp + string(" = lMemLoad[") + num2str(st) + string("];\n");
-		}
-	}
-	kernelString += string("    barrier(CLK_LOCAL_MEM_FENCE);\n");
+    int numWorkItemsReqN = n / Nrn;
+    int interBlockHNum = max( Nprev / numWorkItemsPerXForm, 1 );
+    int interBlockHStride = numWorkItemsPerXForm;
+    int vertWidth = max(numWorkItemsPerXForm / Nprev, 1);
+    vertWidth = min( vertWidth, Nr);
+    int vertNum = Nr / vertWidth;
+    int vertStride = ( n / Nr + offset ) * vertWidth;
+    int iter = max( numWorkItemsReqN / numWorkItemsPerXForm, 1);
+    int intraBlockHStride = (numWorkItemsPerXForm / (Nprev*Nr)) > 1 ? (numWorkItemsPerXForm / (Nprev*Nr)) : 1;
+    intraBlockHStride *= Nprev;
+
+    int stride = numWorkItemsReq / Nrn;
+    int i;
+    for(i = 0; i < iter; i++) {
+        int ii = i / (interBlockHNum * vertNum);
+        int zz = i % (interBlockHNum * vertNum);
+        int jj = zz % interBlockHNum;
+        int kk = zz / interBlockHNum;
+        int z;
+        for(z = 0; z < Nrn; z++) {
+            int st = kk * vertStride + jj * interBlockHStride + ii * intraBlockHStride + z * stride;
+            kernelString += string("    a[") + num2str(i*Nrn + z) + string("].") + comp + string(" = lMemLoad[") + num2str(st) + string("];\n");
+        }
+    }
+    kernelString += string("    barrier(CLK_LOCAL_MEM_FENCE);\n");
 }
 
 static void
 insertLocalLoadIndexArithmatic(string &kernelString, int Nprev, int Nr, int numWorkItemsReq, int numWorkItemsPerXForm, int numXFormsPerWG, int offset, int midPad)
-{	
-	int Ncurr = Nprev * Nr;
-	int logNcurr = (int)log2(Ncurr);
-	int logNprev = (int)log2(Nprev);
-	int incr = (numWorkItemsReq + offset) * Nr + midPad;
-	
-	if(Ncurr < numWorkItemsPerXForm) 
-	{
-		if(Nprev == 1)
-		    kernelString += string("    j = ii & ") + num2str(Ncurr - 1) + string(";\n");
-		else
-			kernelString += string("    j = (ii & ") + num2str(Ncurr - 1) + string(") >> ") + num2str(logNprev) + string(";\n");
-		
-		if(Nprev == 1) 
-			kernelString += string("    i = ii >> ") + num2str(logNcurr) + string(";\n");
-		else 
-			kernelString += string("    i = mad24(ii >> ") + num2str(logNcurr) + string(", ") + num2str(Nprev) + string(", ii & ") + num2str(Nprev - 1) + string(");\n"); 
-	}	
-	else 
-	{
-		if(Nprev == 1)
-		    kernelString += string("    j = ii;\n");
-		else
-			kernelString += string("    j = ii >> ") + num2str(logNprev) + string(";\n");
-		if(Nprev == 1) 
-			kernelString += string("    i = 0;\n"); 
-		else 
-			kernelString += string("    i = ii & ") + num2str(Nprev - 1) + string(";\n");
-	}
+{
+    int Ncurr = Nprev * Nr;
+    int logNcurr = (int)log2(Ncurr);
+    int logNprev = (int)log2(Nprev);
+    int incr = (numWorkItemsReq + offset) * Nr + midPad;
+
+    if(Ncurr < numWorkItemsPerXForm)
+        {
+            if(Nprev == 1)
+                kernelString += string("    j = ii & ") + num2str(Ncurr - 1) + string(";\n");
+            else
+                kernelString += string("    j = (ii & ") + num2str(Ncurr - 1) + string(") >> ") + num2str(logNprev) + string(";\n");
+
+            if(Nprev == 1)
+                kernelString += string("    i = ii >> ") + num2str(logNcurr) + string(";\n");
+            else
+                kernelString += string("    i = mad24(ii >> ") + num2str(logNcurr) + string(", ") + num2str(Nprev) + string(", ii & ") + num2str(Nprev - 1) + string(");\n");
+        }
+    else
+        {
+            if(Nprev == 1)
+                kernelString += string("    j = ii;\n");
+            else
+                kernelString += string("    j = ii >> ") + num2str(logNprev) + string(";\n");
+            if(Nprev == 1)
+                kernelString += string("    i = 0;\n");
+            else
+                kernelString += string("    i = ii & ") + num2str(Nprev - 1) + string(";\n");
+        }
 
     if(numXFormsPerWG > 1)
-        kernelString += string("    i = mad24(jj, ") + num2str(incr) + string(", i);\n");		
+        kernelString += string("    i = mad24(jj, ") + num2str(incr) + string(", i);\n");
 
-    kernelString += string("    lMemLoad = sMem + mad24(j, ") + num2str(numWorkItemsReq + offset) + string(", i);\n"); 
+    kernelString += string("    lMemLoad = sMem + mad24(j, ") + num2str(numWorkItemsReq + offset) + string(", i);\n");
 }
 
 static void
 insertLocalStoreIndexArithmatic(string &kernelString, int numWorkItemsReq, int numXFormsPerWG, int Nr, int offset, int midPad)
 {
-	if(numXFormsPerWG == 1) {
-		kernelString += string("    lMemStore = sMem + ii;\n");		
-	}
-	else {
-		kernelString += string("    lMemStore = sMem + mad24(jj, ") + num2str((numWorkItemsReq + offset)*Nr + midPad) + string(", ii);\n");	
-	}
+    if(numXFormsPerWG == 1) {
+        kernelString += string("    lMemStore = sMem + ii;\n");
+    }
+    else {
+        kernelString += string("    lMemStore = sMem + mad24(jj, ") + num2str((numWorkItemsReq + offset)*Nr + midPad) + string(", ii);\n");
+    }
 }
 
 
 static void
 createLocalMemfftKernelString(cl_fft_plan *plan)
 {
-	unsigned int radixArray[10];
-	unsigned int numRadix;
-	 
-	unsigned int n = plan->n.x;
-	
-	assert(n <= plan->max_work_item_per_workgroup * plan->max_radix && "signal lenght too big for local mem fft\n");
-	
-	getRadixArray(n, radixArray, &numRadix, 0);
-	assert(numRadix > 0 && "no radix array supplied\n");
-	
-	if(n/radixArray[0] > plan->max_work_item_per_workgroup)
-	    getRadixArray(n, radixArray, &numRadix, plan->max_radix);
+    unsigned int radixArray[10];
+    unsigned int numRadix;
+
+    unsigned int n = plan->n.x;
+
+    assert(n <= plan->max_work_item_per_workgroup * plan->max_radix && "signal lenght too big for local mem fft\n");
+
+    getRadixArray(n, radixArray, &numRadix, 0);
+    assert(numRadix > 0 && "no radix array supplied\n");
+
+    if(n/radixArray[0] > plan->max_work_item_per_workgroup)
+        getRadixArray(n, radixArray, &numRadix, plan->max_radix);
+
+    assert(radixArray[0] <= plan->max_radix && "max radix choosen is greater than allowed\n");
+    assert(n/radixArray[0] <= plan->max_work_item_per_workgroup && "required work items per xform greater than maximum work items allowed per work group for local mem fft\n");
+
+    unsigned int tmpLen = 1;
+    unsigned int i;
+    for(i = 0; i < numRadix; i++)
+        {
+            assert( radixArray[i] && !( (radixArray[i] - 1) & radixArray[i] ) );
+            tmpLen *= radixArray[i];
+        }
+    assert(tmpLen == n && "product of radices choosen doesnt match the length of signal\n");
+
+    int offset, midPad;
+    string localString(""), kernelName("");
+
+    clFFT_DataFormat dataFormat = plan->format;
+    string *kernelString = plan->kernel_string;
+
+
+    cl_fft_kernel_info **kInfo = &plan->kernel_info;
+    int kCount = 0;
+
+    while(*kInfo)
+        {
+            kInfo = &(*kInfo)->next;
+            kCount++;
+        }
+
+    kernelName = string("fft") + num2str(kCount);
+
+    *kInfo = (cl_fft_kernel_info *) malloc(sizeof(cl_fft_kernel_info));
+    (*kInfo)->kernel = 0;
+    (*kInfo)->lmem_size = 0;
+    (*kInfo)->num_workgroups = 0;
+    (*kInfo)->num_workitems_per_workgroup = 0;
+    (*kInfo)->dir = cl_fft_kernel_x;
+    (*kInfo)->in_place_possible = 1;
+    (*kInfo)->next = NULL;
+    (*kInfo)->kernel_name = (char *) malloc(sizeof(char)*(kernelName.size()+1));
+    strcpy((*kInfo)->kernel_name, kernelName.c_str());
 
-	assert(radixArray[0] <= plan->max_radix && "max radix choosen is greater than allowed\n");
-	assert(n/radixArray[0] <= plan->max_work_item_per_workgroup && "required work items per xform greater than maximum work items allowed per work group for local mem fft\n");
-	
-	unsigned int tmpLen = 1;
-	unsigned int i;
-	for(i = 0; i < numRadix; i++)
-	{	
-		assert( radixArray[i] && !( (radixArray[i] - 1) & radixArray[i] ) );
-	    tmpLen *= radixArray[i];
-	}
-	assert(tmpLen == n && "product of radices choosen doesnt match the length of signal\n");
-	
-	int offset, midPad;
-	string localString(""), kernelName("");
-	
-	clFFT_DataFormat dataFormat = plan->format;
-	string *kernelString = plan->kernel_string;
-	
-	
-	cl_fft_kernel_info **kInfo = &plan->kernel_info;
-	int kCount = 0;
-	
-	while(*kInfo)
-	{
-		kInfo = &(*kInfo)->next;
-		kCount++;
-	}
-	
-	kernelName = string("fft") + num2str(kCount);
-	
-	*kInfo = (cl_fft_kernel_info *) malloc(sizeof(cl_fft_kernel_info));
-	(*kInfo)->kernel = 0;
-	(*kInfo)->lmem_size = 0;
-	(*kInfo)->num_workgroups = 0;
-	(*kInfo)->num_workitems_per_workgroup = 0;
-	(*kInfo)->dir = cl_fft_kernel_x;
-	(*kInfo)->in_place_possible = 1;
-	(*kInfo)->next = NULL;
-	(*kInfo)->kernel_name = (char *) malloc(sizeof(char)*(kernelName.size()+1));
-	strcpy((*kInfo)->kernel_name, kernelName.c_str());
-	
-	unsigned int numWorkItemsPerXForm = n / radixArray[0];
-	unsigned int numWorkItemsPerWG = numWorkItemsPerXForm <= 64 ? 64 : numWorkItemsPerXForm; 
-	assert(numWorkItemsPerWG <= plan->max_work_item_per_workgroup);
-	int numXFormsPerWG = numWorkItemsPerWG / numWorkItemsPerXForm;
-	(*kInfo)->num_workgroups = 1;
+    unsigned int numWorkItemsPerXForm = n / radixArray[0];
+    unsigned int numWorkItemsPerWG = numWorkItemsPerXForm <= 64 ? 64 : numWorkItemsPerXForm;
+    assert(numWorkItemsPerWG <= plan->max_work_item_per_workgroup);
+    int numXFormsPerWG = numWorkItemsPerWG / numWorkItemsPerXForm;
+    (*kInfo)->num_workgroups = 1;
     (*kInfo)->num_xforms_per_workgroup = numXFormsPerWG;
-	(*kInfo)->num_workitems_per_workgroup = numWorkItemsPerWG;
-	
-	unsigned int *N = radixArray;
-	unsigned int maxRadix = N[0];
-	unsigned int lMemSize = 0;
-		
-	insertVariables(localString, maxRadix);
-	
-	lMemSize = insertGlobalLoadsAndTranspose(localString, n, numWorkItemsPerXForm, numXFormsPerWG, maxRadix, plan->min_mem_coalesce_width, dataFormat);
-	(*kInfo)->lmem_size = (lMemSize > (*kInfo)->lmem_size) ? lMemSize : (*kInfo)->lmem_size;
-	
-	string xcomp = string("x");
-	string ycomp = string("y");
-	
-	unsigned int Nprev = 1;
-	unsigned int len = n;
-	unsigned int r;
-	for(r = 0; r < numRadix; r++) 
-	{
-		int numIter = N[0] / N[r];
-		int numWorkItemsReq = n / N[r];
-		int Ncurr = Nprev * N[r];
-		insertfftKernel(localString, N[r], numIter);
-		
-		if(r < (numRadix - 1)) {
-			insertTwiddleKernel(localString, N[r], numIter, Nprev, len, numWorkItemsPerXForm);
-			lMemSize = getPadding(numWorkItemsPerXForm, Nprev, numWorkItemsReq, numXFormsPerWG, N[r], plan->num_local_mem_banks, &offset, &midPad);
-			(*kInfo)->lmem_size = (lMemSize > (*kInfo)->lmem_size) ? lMemSize : (*kInfo)->lmem_size;
-			insertLocalStoreIndexArithmatic(localString, numWorkItemsReq, numXFormsPerWG, N[r], offset, midPad);
-			insertLocalLoadIndexArithmatic(localString, Nprev, N[r], numWorkItemsReq, numWorkItemsPerXForm, numXFormsPerWG, offset, midPad);
-			insertLocalStores(localString, numIter, N[r], numWorkItemsPerXForm, numWorkItemsReq, offset, xcomp);
-			insertLocalLoads(localString, n, N[r], N[r+1], Nprev, Ncurr, numWorkItemsPerXForm, numWorkItemsReq, offset, xcomp);
-			insertLocalStores(localString, numIter, N[r], numWorkItemsPerXForm, numWorkItemsReq, offset, ycomp);
-			insertLocalLoads(localString, n, N[r], N[r+1], Nprev, Ncurr, numWorkItemsPerXForm, numWorkItemsReq, offset, ycomp);
-			Nprev = Ncurr;
-			len = len / N[r];
-		}
-	}
-	
-	lMemSize = insertGlobalStoresAndTranspose(localString, n, maxRadix, N[numRadix - 1], numWorkItemsPerXForm, numXFormsPerWG, plan->min_mem_coalesce_width, dataFormat);
-	(*kInfo)->lmem_size = (lMemSize > (*kInfo)->lmem_size) ? lMemSize : (*kInfo)->lmem_size;
-	
-	insertHeader(*kernelString, kernelName, dataFormat);
-	*kernelString += string("{\n");
-	if((*kInfo)->lmem_size)
+    (*kInfo)->num_workitems_per_workgroup = numWorkItemsPerWG;
+
+    unsigned int *N = radixArray;
+    unsigned int maxRadix = N[0];
+    unsigned int lMemSize = 0;
+
+    insertVariables(localString, maxRadix);
+
+    lMemSize = insertGlobalLoadsAndTranspose(localString, n, numWorkItemsPerXForm, numXFormsPerWG, maxRadix, plan->min_mem_coalesce_width, dataFormat);
+    (*kInfo)->lmem_size = (lMemSize > (*kInfo)->lmem_size) ? lMemSize : (*kInfo)->lmem_size;
+
+    string xcomp = string("x");
+    string ycomp = string("y");
+
+    unsigned int Nprev = 1;
+    unsigned int len = n;
+    unsigned int r;
+    for(r = 0; r < numRadix; r++)
+        {
+            int numIter = N[0] / N[r];
+            int numWorkItemsReq = n / N[r];
+            int Ncurr = Nprev * N[r];
+            insertfftKernel(localString, N[r], numIter);
+
+            if(r < (numRadix - 1)) {
+                insertTwiddleKernel(localString, N[r], numIter, Nprev, len, numWorkItemsPerXForm);
+                lMemSize = getPadding(numWorkItemsPerXForm, Nprev, numWorkItemsReq, numXFormsPerWG, N[r], plan->num_local_mem_banks, &offset, &midPad);
+                (*kInfo)->lmem_size = (lMemSize > (*kInfo)->lmem_size) ? lMemSize : (*kInfo)->lmem_size;
+                insertLocalStoreIndexArithmatic(localString, numWorkItemsReq, numXFormsPerWG, N[r], offset, midPad);
+                insertLocalLoadIndexArithmatic(localString, Nprev, N[r], numWorkItemsReq, numWorkItemsPerXForm, numXFormsPerWG, offset, midPad);
+                insertLocalStores(localString, numIter, N[r], numWorkItemsPerXForm, numWorkItemsReq, offset, xcomp);
+                insertLocalLoads(localString, n, N[r], N[r+1], Nprev, Ncurr, numWorkItemsPerXForm, numWorkItemsReq, offset, xcomp);
+                insertLocalStores(localString, numIter, N[r], numWorkItemsPerXForm, numWorkItemsReq, offset, ycomp);
+                insertLocalLoads(localString, n, N[r], N[r+1], Nprev, Ncurr, numWorkItemsPerXForm, numWorkItemsReq, offset, ycomp);
+                Nprev = Ncurr;
+                len = len / N[r];
+            }
+        }
+
+    lMemSize = insertGlobalStoresAndTranspose(localString, n, maxRadix, N[numRadix - 1], numWorkItemsPerXForm, numXFormsPerWG, plan->min_mem_coalesce_width, dataFormat);
+    (*kInfo)->lmem_size = (lMemSize > (*kInfo)->lmem_size) ? lMemSize : (*kInfo)->lmem_size;
+
+    insertHeader(*kernelString, kernelName, dataFormat);
+    *kernelString += string("{\n");
+    if((*kInfo)->lmem_size)
         *kernelString += string("    __local float sMem[") + num2str((*kInfo)->lmem_size) + string("];\n");
-	*kernelString += localString;
-	*kernelString += string("}\n");
+    *kernelString += localString;
+    *kernelString += string("}\n");
 }
 
 // For n larger than what can be computed using local memory fft, global transposes
 // multiple kernel launces is needed. For these sizes, n can be decomposed using
 // much larger base radices i.e. say n = 262144 = 128 x 64 x 32. Thus three kernel
 // launches will be needed, first computing 64 x 32, length 128 ffts, second computing
-// 128 x 32 length 64 ffts, and finally a kernel computing 128 x 64 length 32 ffts. 
-// Each of these base radices can futher be divided into factors so that each of these 
-// base ffts can be computed within one kernel launch using in-register ffts and local 
-// memory transposes i.e for the first kernel above which computes 64 x 32 ffts on length 
-// 128, 128 can be decomposed into 128 = 16 x 8 i.e. 8 work items can compute 8 length 
-// 16 ffts followed by transpose using local memory followed by each of these eight 
-// work items computing 2 length 8 ffts thus computing 16 length 8 ffts in total. This 
+// 128 x 32 length 64 ffts, and finally a kernel computing 128 x 64 length 32 ffts.
+// Each of these base radices can futher be divided into factors so that each of these
+// base ffts can be computed within one kernel launch using in-register ffts and local
+// memory transposes i.e for the first kernel above which computes 64 x 32 ffts on length
+// 128, 128 can be decomposed into 128 = 16 x 8 i.e. 8 work items can compute 8 length
+// 16 ffts followed by transpose using local memory followed by each of these eight
+// work items computing 2 length 8 ffts thus computing 16 length 8 ffts in total. This
 // means only 8 work items are needed for computing one length 128 fft. If we choose
 // work group size of say 64, we can compute 64/8 = 8 length 128 ffts within one
-// work group. Since we need to compute 64 x 32 length 128 ffts in first kernel, this 
-// means we need to launch 64 x 32 / 8 = 256 work groups with 64 work items in each 
+// work group. Since we need to compute 64 x 32 length 128 ffts in first kernel, this
+// means we need to launch 64 x 32 / 8 = 256 work groups with 64 work items in each
 // work group where each work group is computing 8 length 128 ffts where each length
 // 128 fft is computed by 8 work items. Same logic can be applied to other two kernels
-// in this example. Users can play with difference base radices and difference 
+// in this example. Users can play with difference base radices and difference
 // decompositions of base radices to generates different kernels and see which gives
 // best performance. Following function is just fixed to use 128 as base radix
 
 void
 getGlobalRadixInfo(int n, int *radix, int *R1, int *R2, int *numRadices)
 {
-	int baseRadix = min(n, 128);
-	
-	int numR = 0;
-	int N = n;
-	while(N > baseRadix) 
-	{
-		N /= baseRadix;
-		numR++;
-	}
-	
-	for(int i = 0; i < numR; i++)
-		radix[i] = baseRadix;
-	
-	radix[numR] = N;
-	numR++;
-	*numRadices = numR;
-		
-	for(int i = 0; i < numR; i++)
-	{
-		int B = radix[i];
-		if(B <= 8)
-		{
-			R1[i] = B;
-			R2[i] = 1;
-			continue;
-		}
-		
-		int r1 = 2; 
-		int r2 = B / r1;
-	    while(r2 > r1)
-	    {
-		   r1 *=2;
-		   r2 = B / r1;
-	    }
-		R1[i] = r1;
-		R2[i] = r2;
-	}	
+    int baseRadix = min(n, 128);
+
+    int numR = 0;
+    int N = n;
+    while(N > baseRadix)
+        {
+            N /= baseRadix;
+            numR++;
+        }
+
+    for(int i = 0; i < numR; i++)
+        radix[i] = baseRadix;
+
+    radix[numR] = N;
+    numR++;
+    *numRadices = numR;
+
+    for(int i = 0; i < numR; i++)
+        {
+            int B = radix[i];
+            if(B <= 8)
+                {
+                    R1[i] = B;
+                    R2[i] = 1;
+                    continue;
+                }
+
+            int r1 = 2;
+            int r2 = B / r1;
+            while(r2 > r1)
+                {
+                    r1 *=2;
+                    r2 = B / r1;
+                }
+            R1[i] = r1;
+            R2[i] = r2;
+        }
 }
 
 static void
 createGlobalFFTKernelString(cl_fft_plan *plan, int n, int BS, cl_fft_kernel_dir dir, int vertBS)
-{		
-	int i, j, k, t;
-	int radixArr[10] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
+{
+    int i, j, k, t;
+    int radixArr[10] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
     int R1Arr[10] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
     int R2Arr[10] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
-	int radix, R1, R2;
-	int numRadices;
-	
-	int maxThreadsPerBlock = plan->max_work_item_per_workgroup;
-	int maxArrayLen = plan->max_radix;
-	int batchSize = plan->min_mem_coalesce_width;	
-	clFFT_DataFormat dataFormat = plan->format;
-	int vertical = (dir == cl_fft_kernel_x) ? 0 : 1;	
-	
-	getGlobalRadixInfo(n, radixArr, R1Arr, R2Arr, &numRadices);
-		
-	int numPasses = numRadices;
-	
-	string localString(""), kernelName("");
-	string *kernelString = plan->kernel_string;
-	cl_fft_kernel_info **kInfo = &plan->kernel_info; 
-	int kCount = 0;
-	
-	while(*kInfo)
-	{
-		kInfo = &(*kInfo)->next;
-		kCount++;
-	}
-	
-	int N = n;
-	int m = (int)log2(n);
-	int Rinit = vertical ? BS : 1;
-	batchSize = vertical ? min(BS, batchSize) : batchSize;
-	int passNum;
-	
-	for(passNum = 0; passNum < numPasses; passNum++) 
-	{
-		
-		localString.clear();
-		kernelName.clear();
-		
-		radix = radixArr[passNum];
-		R1 = R1Arr[passNum];
-		R2 = R2Arr[passNum];
-				
-		int strideI = Rinit;
-		for(i = 0; i < numPasses; i++)
-			if(i != passNum)
-				strideI *= radixArr[i];
-		
-		int strideO = Rinit;
-		for(i = 0; i < passNum; i++)
-			strideO *= radixArr[i];
-		
-		int threadsPerXForm = R2;
-		batchSize = R2 == 1 ? plan->max_work_item_per_workgroup : batchSize;
-		batchSize = min(batchSize, strideI);
-		int threadsPerBlock = batchSize * threadsPerXForm;
-		threadsPerBlock = min(threadsPerBlock, maxThreadsPerBlock);
-		batchSize = threadsPerBlock / threadsPerXForm;
-		assert(R2 <= R1);
-		assert(R1*R2 == radix);
-		assert(R1 <= maxArrayLen);
-		assert(threadsPerBlock <= maxThreadsPerBlock);
-		
-		int numIter = R1 / R2;
-		int gInInc = threadsPerBlock / batchSize;
-		
-		
-		int lgStrideO = (int)log2(strideO);
-		int numBlocksPerXForm = strideI / batchSize;
-		int numBlocks = numBlocksPerXForm;
-		if(!vertical)
-			numBlocks *= BS;
-		else
-			numBlocks *= vertBS;
-		
-		kernelName = string("fft") + num2str(kCount);
-		*kInfo = (cl_fft_kernel_info *) malloc(sizeof(cl_fft_kernel_info));
-		(*kInfo)->kernel = 0;
-		if(R2 == 1)
-			(*kInfo)->lmem_size = 0;
-		else
-		{
-		    if(strideO == 1)
-		        (*kInfo)->lmem_size = (radix + 1)*batchSize;
-		    else
-			    (*kInfo)->lmem_size = threadsPerBlock*R1;
-		}
-		(*kInfo)->num_workgroups = numBlocks;
-        (*kInfo)->num_xforms_per_workgroup = 1;
-		(*kInfo)->num_workitems_per_workgroup = threadsPerBlock;
-		(*kInfo)->dir = dir;
-		if( (passNum == (numPasses - 1)) && (numPasses & 1) )
-		    (*kInfo)->in_place_possible = 1;
-		else
-			(*kInfo)->in_place_possible = 0;
-		(*kInfo)->next = NULL;
-		(*kInfo)->kernel_name = (char *) malloc(sizeof(char)*(kernelName.size()+1));
-		strcpy((*kInfo)->kernel_name, kernelName.c_str());
-		
-		insertVariables(localString, R1);
-						
-		if(vertical) 
-		{
-			localString += string("xNum = groupId >> ") + num2str((int)log2(numBlocksPerXForm)) + string(";\n");
-			localString += string("groupId = groupId & ") + num2str(numBlocksPerXForm - 1) + string(";\n");
-			localString += string("indexIn = mad24(groupId, ") + num2str(batchSize) + string(", xNum << ") + num2str((int)log2(n*BS)) + string(");\n");
-			localString += string("tid = mul24(groupId, ") + num2str(batchSize) + string(");\n");
-			localString += string("i = tid >> ") + num2str(lgStrideO) + string(";\n");
-			localString += string("j = tid & ") + num2str(strideO - 1) + string(";\n");
-			int stride = radix*Rinit;
-			for(i = 0; i < passNum; i++)
-				stride *= radixArr[i];
-			localString += string("indexOut = mad24(i, ") + num2str(stride) + string(", j + ") + string("(xNum << ") + num2str((int) log2(n*BS)) + string("));\n");
-			localString += string("bNum = groupId;\n");
-		}
-		else 
-		{
-			int lgNumBlocksPerXForm = (int)log2(numBlocksPerXForm);
-			localString += string("bNum = groupId & ") + num2str(numBlocksPerXForm - 1) + string(";\n");
-			localString += string("xNum = groupId >> ") + num2str(lgNumBlocksPerXForm) + string(";\n");
-			localString += string("indexIn = mul24(bNum, ") + num2str(batchSize) + string(");\n");
-			localString += string("tid = indexIn;\n");
-			localString += string("i = tid >> ") + num2str(lgStrideO) + string(";\n");
-			localString += string("j = tid & ") + num2str(strideO - 1) + string(";\n"); 
-			int stride = radix*Rinit;
-			for(i = 0; i < passNum; i++)
-				stride *= radixArr[i];
-			localString += string("indexOut = mad24(i, ") + num2str(stride) + string(", j);\n");			
-			localString += string("indexIn += (xNum << ") + num2str(m) + string(");\n");
-			localString += string("indexOut += (xNum << ") + num2str(m) + string(");\n");	
-		}
-		
-		// Load Data
-		int lgBatchSize = (int)log2(batchSize);
-		localString += string("tid = lId;\n");
-		localString += string("i = tid & ") + num2str(batchSize - 1) + string(";\n");
-		localString += string("j = tid >> ") + num2str(lgBatchSize) + string(";\n"); 
-		localString += string("indexIn += mad24(j, ") + num2str(strideI) + string(", i);\n");
+    int radix, R1, R2;
+    int numRadices;
+
+    int maxThreadsPerBlock = plan->max_work_item_per_workgroup;
+    int maxArrayLen = plan->max_radix;
+    int batchSize = plan->min_mem_coalesce_width;
+    clFFT_DataFormat dataFormat = plan->format;
+    int vertical = (dir == cl_fft_kernel_x) ? 0 : 1;
+
+    getGlobalRadixInfo(n, radixArr, R1Arr, R2Arr, &numRadices);
+
+    int numPasses = numRadices;
+
+    string localString(""), kernelName("");
+    string *kernelString = plan->kernel_string;
+    cl_fft_kernel_info **kInfo = &plan->kernel_info;
+    int kCount = 0;
+
+    while(*kInfo)
+        {
+            kInfo = &(*kInfo)->next;
+            kCount++;
+        }
+
+    int N = n;
+    int m = (int)log2(n);
+    int Rinit = vertical ? BS : 1;
+    batchSize = vertical ? min(BS, batchSize) : batchSize;
+    int passNum;
+
+    for(passNum = 0; passNum < numPasses; passNum++)
+        {
+
+            localString.clear();
+            kernelName.clear();
+
+            radix = radixArr[passNum];
+            R1 = R1Arr[passNum];
+            R2 = R2Arr[passNum];
+
+            int strideI = Rinit;
+            for(i = 0; i < numPasses; i++)
+                if(i != passNum)
+                    strideI *= radixArr[i];
+
+            int strideO = Rinit;
+            for(i = 0; i < passNum; i++)
+                strideO *= radixArr[i];
+
+            int threadsPerXForm = R2;
+            batchSize = R2 == 1 ? plan->max_work_item_per_workgroup : batchSize;
+            batchSize = min(batchSize, strideI);
+            int threadsPerBlock = batchSize * threadsPerXForm;
+            threadsPerBlock = min(threadsPerBlock, maxThreadsPerBlock);
+            batchSize = threadsPerBlock / threadsPerXForm;
+            assert(R2 <= R1);
+            assert(R1*R2 == radix);
+            assert(R1 <= maxArrayLen);
+            assert(threadsPerBlock <= maxThreadsPerBlock);
+
+            int numIter = R1 / R2;
+            int gInInc = threadsPerBlock / batchSize;
+
+
+            int lgStrideO = (int)log2(strideO);
+            int numBlocksPerXForm = strideI / batchSize;
+            int numBlocks = numBlocksPerXForm;
+            if(!vertical)
+                numBlocks *= BS;
+            else
+                numBlocks *= vertBS;
+
+            kernelName = string("fft") + num2str(kCount);
+            *kInfo = (cl_fft_kernel_info *) malloc(sizeof(cl_fft_kernel_info));
+            (*kInfo)->kernel = 0;
+            if(R2 == 1)
+                (*kInfo)->lmem_size = 0;
+            else
+                {
+                    if(strideO == 1)
+                        (*kInfo)->lmem_size = (radix + 1)*batchSize;
+                    else
+                        (*kInfo)->lmem_size = threadsPerBlock*R1;
+                }
+            (*kInfo)->num_workgroups = numBlocks;
+            (*kInfo)->num_xforms_per_workgroup = 1;
+            (*kInfo)->num_workitems_per_workgroup = threadsPerBlock;
+            (*kInfo)->dir = dir;
+            if( (passNum == (numPasses - 1)) && (numPasses & 1) )
+                (*kInfo)->in_place_possible = 1;
+            else
+                (*kInfo)->in_place_possible = 0;
+            (*kInfo)->next = NULL;
+            (*kInfo)->kernel_name = (char *) malloc(sizeof(char)*(kernelName.size()+1));
+            strcpy((*kInfo)->kernel_name, kernelName.c_str());
+
+            insertVariables(localString, R1);
+
+            if(vertical)
+                {
+                    localString += string("xNum = groupId >> ") + num2str((int)log2(numBlocksPerXForm)) + string(";\n");
+                    localString += string("groupId = groupId & ") + num2str(numBlocksPerXForm - 1) + string(";\n");
+                    localString += string("indexIn = mad24(groupId, ") + num2str(batchSize) + string(", xNum << ") + num2str((int)log2(n*BS)) + string(");\n");
+                    localString += string("tid = mul24(groupId, ") + num2str(batchSize) + string(");\n");
+                    localString += string("i = tid >> ") + num2str(lgStrideO) + string(";\n");
+                    localString += string("j = tid & ") + num2str(strideO - 1) + string(";\n");
+                    int stride = radix*Rinit;
+                    for(i = 0; i < passNum; i++)
+                        stride *= radixArr[i];
+                    localString += string("indexOut = mad24(i, ") + num2str(stride) + string(", j + ") + string("(xNum << ") + num2str((int) log2(n*BS)) + string("));\n");
+                    localString += string("bNum = groupId;\n");
+                }
+            else
+                {
+                    int lgNumBlocksPerXForm = (int)log2(numBlocksPerXForm);
+                    localString += string("bNum = groupId & ") + num2str(numBlocksPerXForm - 1) + string(";\n");
+                    localString += string("xNum = groupId >> ") + num2str(lgNumBlocksPerXForm) + string(";\n");
+                    localString += string("indexIn = mul24(bNum, ") + num2str(batchSize) + string(");\n");
+                    localString += string("tid = indexIn;\n");
+                    localString += string("i = tid >> ") + num2str(lgStrideO) + string(";\n");
+                    localString += string("j = tid & ") + num2str(strideO - 1) + string(";\n");
+                    int stride = radix*Rinit;
+                    for(i = 0; i < passNum; i++)
+                        stride *= radixArr[i];
+                    localString += string("indexOut = mad24(i, ") + num2str(stride) + string(", j);\n");
+                    localString += string("indexIn += (xNum << ") + num2str(m) + string(");\n");
+                    localString += string("indexOut += (xNum << ") + num2str(m) + string(");\n");
+                }
+
+            // Load Data
+            int lgBatchSize = (int)log2(batchSize);
+            localString += string("tid = lId;\n");
+            localString += string("i = tid & ") + num2str(batchSize - 1) + string(";\n");
+            localString += string("j = tid >> ") + num2str(lgBatchSize) + string(";\n");
+            localString += string("indexIn += mad24(j, ") + num2str(strideI) + string(", i);\n");
 
-		if(dataFormat == clFFT_SplitComplexFormat) 
-		{
-			localString += string("in_real += indexIn;\n");
-			localString += string("in_imag += indexIn;\n");			
-			for(j = 0; j < R1; j++)
-				localString += string("a[") + num2str(j) + string("].x = in_real[") + num2str(j*gInInc*strideI) + string("];\n");
-			for(j = 0; j < R1; j++) 
-				localString += string("a[") + num2str(j) + string("].y = in_imag[") + num2str(j*gInInc*strideI) + string("];\n");
-		}
-		else 
-		{
-			localString += string("in += indexIn;\n");
-			for(j = 0; j < R1; j++)
-				localString += string("a[") + num2str(j) + string("] = in[") + num2str(j*gInInc*strideI) + string("];\n");
-	    }
-		
-		localString += string("fftKernel") + num2str(R1) + string("(a, dir);\n");							  
-		
-		if(R2 > 1)
-		{
-		    // twiddle
-		    for(k = 1; k < R1; k++) 
-		    {
-			    localString += string("ang = dir*(2.0f*M_PI*") + num2str(k) + string("/") + num2str(radix) + string(")*j;\n");
-			    localString += string("w = (float2)(native_cos(ang), native_sin(ang));\n");
-			    localString += string("a[") + num2str(k) + string("] = complexMul(a[") + num2str(k) + string("], w);\n"); 
-		    }
-		
-		    // shuffle
-		    numIter = R1 / R2;	
-		    localString += string("indexIn = mad24(j, ") + num2str(threadsPerBlock*numIter) + string(", i);\n");
-		    localString += string("lMemStore = sMem + tid;\n");
-		    localString += string("lMemLoad = sMem + indexIn;\n");
-		    for(k = 0; k < R1; k++) 
-			    localString += string("lMemStore[") + num2str(k*threadsPerBlock) + string("] = a[") + num2str(k) + string("].x;\n");
-		    localString += string("barrier(CLK_LOCAL_MEM_FENCE);\n");	
-		    for(k = 0; k < numIter; k++)
-			    for(t = 0; t < R2; t++)
-				    localString += string("a[") + num2str(k*R2+t) + string("].x = lMemLoad[") + num2str(t*batchSize + k*threadsPerBlock) + string("];\n");
-		    localString += string("barrier(CLK_LOCAL_MEM_FENCE);\n");
-		    for(k = 0; k < R1; k++) 
-			    localString += string("lMemStore[") + num2str(k*threadsPerBlock) + string("] = a[") + num2str(k) + string("].y;\n");
-		    localString += string("barrier(CLK_LOCAL_MEM_FENCE);\n");	
-		    for(k = 0; k < numIter; k++)
-			    for(t = 0; t < R2; t++)
-				    localString += string("a[") + num2str(k*R2+t) + string("].y = lMemLoad[") + num2str(t*batchSize + k*threadsPerBlock) + string("];\n");
-		    localString += string("barrier(CLK_LOCAL_MEM_FENCE);\n");
-		
-		    for(j = 0; j < numIter; j++)
-			    localString += string("fftKernel") + num2str(R2) + string("(a + ") + num2str(j*R2) + string(", dir);\n");
-		}
-		
-		// twiddle
-		if(passNum < (numPasses - 1)) 
-		{
-			localString += string("l = ((bNum << ") + num2str(lgBatchSize) + string(") + i) >> ") + num2str(lgStrideO) + string(";\n");
-			localString += string("k = j << ") + num2str((int)log2(R1/R2)) + string(";\n"); 
-			localString += string("ang1 = dir*(2.0f*M_PI/") + num2str(N) + string(")*l;\n");
-			for(t = 0; t < R1; t++) 
-			{
-				localString += string("ang = ang1*(k + ") + num2str((t%R2)*R1 + (t/R2)) + string(");\n");
-				localString += string("w = (float2)(native_cos(ang), native_sin(ang));\n");
-				localString += string("a[") + num2str(t) + string("] = complexMul(a[") + num2str(t) + string("], w);\n");
-			}
-		}
-		
-		// Store Data
-		if(strideO == 1) 
-		{
-			
-			localString += string("lMemStore = sMem + mad24(i, ") + num2str(radix + 1) + string(", j << ") + num2str((int)log2(R1/R2)) + string(");\n");
-			localString += string("lMemLoad = sMem + mad24(tid >> ") + num2str((int)log2(radix)) + string(", ") + num2str(radix+1) + string(", tid & ") + num2str(radix-1) + string(");\n");
-			
-			for(i = 0; i < R1/R2; i++)
-				for(j = 0; j < R2; j++)
-					localString += string("lMemStore[ ") + num2str(i + j*R1) + string("] = a[") + num2str(i*R2+j) + string("].x;\n");
-			localString += string("barrier(CLK_LOCAL_MEM_FENCE);\n");
-			if(threadsPerBlock >= radix)
-            {
-                for(i = 0; i < R1; i++)
-                localString += string("a[") + num2str(i) + string("].x = lMemLoad[") + num2str(i*(radix+1)*(threadsPerBlock/radix)) + string("];\n");
-            }
+            if(dataFormat == clFFT_SplitComplexFormat)
+                {
+                    localString += string("in_real += indexIn;\n");
+                    localString += string("in_imag += indexIn;\n");
+                    for(j = 0; j < R1; j++)
+                        localString += string("a[") + num2str(j) + string("].x = in_real[") + num2str(j*gInInc*strideI) + string("];\n");
+                    for(j = 0; j < R1; j++)
+                        localString += string("a[") + num2str(j) + string("].y = in_imag[") + num2str(j*gInInc*strideI) + string("];\n");
+                }
             else
-            {
-                int innerIter = radix/threadsPerBlock;
-                int outerIter = R1/innerIter;
-                for(i = 0; i < outerIter; i++)
-                    for(j = 0; j < innerIter; j++)
-                        localString += string("a[") + num2str(i*innerIter+j) + string("].x = lMemLoad[") + num2str(j*threadsPerBlock + i*(radix+1)) + string("];\n");
-            }
-			localString += string("barrier(CLK_LOCAL_MEM_FENCE);\n");
-			
-			for(i = 0; i < R1/R2; i++)
-				for(j = 0; j < R2; j++)
-					localString += string("lMemStore[ ") + num2str(i + j*R1) + string("] = a[") + num2str(i*R2+j) + string("].y;\n");
-			localString += string("barrier(CLK_LOCAL_MEM_FENCE);\n");
-			if(threadsPerBlock >= radix)
-            {
-                for(i = 0; i < R1; i++)
-                    localString += string("a[") + num2str(i) + string("].y = lMemLoad[") + num2str(i*(radix+1)*(threadsPerBlock/radix)) + string("];\n");
-            }
+                {
+                    localString += string("in += indexIn;\n");
+                    for(j = 0; j < R1; j++)
+                        localString += string("a[") + num2str(j) + string("] = in[") + num2str(j*gInInc*strideI) + string("];\n");
+                }
+
+            localString += string("fftKernel") + num2str(R1) + string("(a, dir);\n");
+
+            if(R2 > 1)
+                {
+                    // twiddle
+                    for(k = 1; k < R1; k++)
+                        {
+                            localString += string("ang = dir*(2.0f*M_PI*") + num2str(k) + string("/") + num2str(radix) + string(")*j;\n");
+                            localString += string("w = (float2)(native_cos(ang), native_sin(ang));\n");
+                            localString += string("a[") + num2str(k) + string("] = complexMul(a[") + num2str(k) + string("], w);\n");
+                        }
+
+                    // shuffle
+                    numIter = R1 / R2;
+                    localString += string("indexIn = mad24(j, ") + num2str(threadsPerBlock*numIter) + string(", i);\n");
+                    localString += string("lMemStore = sMem + tid;\n");
+                    localString += string("lMemLoad = sMem + indexIn;\n");
+                    for(k = 0; k < R1; k++)
+                        localString += string("lMemStore[") + num2str(k*threadsPerBlock) + string("] = a[") + num2str(k) + string("].x;\n");
+                    localString += string("barrier(CLK_LOCAL_MEM_FENCE);\n");
+                    for(k = 0; k < numIter; k++)
+                        for(t = 0; t < R2; t++)
+                            localString += string("a[") + num2str(k*R2+t) + string("].x = lMemLoad[") + num2str(t*batchSize + k*threadsPerBlock) + string("];\n");
+                    localString += string("barrier(CLK_LOCAL_MEM_FENCE);\n");
+                    for(k = 0; k < R1; k++)
+                        localString += string("lMemStore[") + num2str(k*threadsPerBlock) + string("] = a[") + num2str(k) + string("].y;\n");
+                    localString += string("barrier(CLK_LOCAL_MEM_FENCE);\n");
+                    for(k = 0; k < numIter; k++)
+                        for(t = 0; t < R2; t++)
+                            localString += string("a[") + num2str(k*R2+t) + string("].y = lMemLoad[") + num2str(t*batchSize + k*threadsPerBlock) + string("];\n");
+                    localString += string("barrier(CLK_LOCAL_MEM_FENCE);\n");
+
+                    for(j = 0; j < numIter; j++)
+                        localString += string("fftKernel") + num2str(R2) + string("(a + ") + num2str(j*R2) + string(", dir);\n");
+                }
+
+            // twiddle
+            if(passNum < (numPasses - 1))
+                {
+                    localString += string("l = ((bNum << ") + num2str(lgBatchSize) + string(") + i) >> ") + num2str(lgStrideO) + string(";\n");
+                    localString += string("k = j << ") + num2str((int)log2(R1/R2)) + string(";\n");
+                    localString += string("ang1 = dir*(2.0f*M_PI/") + num2str(N) + string(")*l;\n");
+                    for(t = 0; t < R1; t++)
+                        {
+                            localString += string("ang = ang1*(k + ") + num2str((t%R2)*R1 + (t/R2)) + string(");\n");
+                            localString += string("w = (float2)(native_cos(ang), native_sin(ang));\n");
+                            localString += string("a[") + num2str(t) + string("] = complexMul(a[") + num2str(t) + string("], w);\n");
+                        }
+                }
+
+            // Store Data
+            if(strideO == 1)
+                {
+
+                    localString += string("lMemStore = sMem + mad24(i, ") + num2str(radix + 1) + string(", j << ") + num2str((int)log2(R1/R2)) + string(");\n");
+                    localString += string("lMemLoad = sMem + mad24(tid >> ") + num2str((int)log2(radix)) + string(", ") + num2str(radix+1) + string(", tid & ") + num2str(radix-1) + string(");\n");
+
+                    for(i = 0; i < R1/R2; i++)
+                        for(j = 0; j < R2; j++)
+                            localString += string("lMemStore[ ") + num2str(i + j*R1) + string("] = a[") + num2str(i*R2+j) + string("].x;\n");
+                    localString += string("barrier(CLK_LOCAL_MEM_FENCE);\n");
+                    if(threadsPerBlock >= radix)
+                        {
+                            for(i = 0; i < R1; i++)
+                                localString += string("a[") + num2str(i) + string("].x = lMemLoad[") + num2str(i*(radix+1)*(threadsPerBlock/radix)) + string("];\n");
+                        }
+                    else
+                        {
+                            int innerIter = radix/threadsPerBlock;
+                            int outerIter = R1/innerIter;
+                            for(i = 0; i < outerIter; i++)
+                                for(j = 0; j < innerIter; j++)
+                                    localString += string("a[") + num2str(i*innerIter+j) + string("].x = lMemLoad[") + num2str(j*threadsPerBlock + i*(radix+1)) + string("];\n");
+                        }
+                    localString += string("barrier(CLK_LOCAL_MEM_FENCE);\n");
+
+                    for(i = 0; i < R1/R2; i++)
+                        for(j = 0; j < R2; j++)
+                            localString += string("lMemStore[ ") + num2str(i + j*R1) + string("] = a[") + num2str(i*R2+j) + string("].y;\n");
+                    localString += string("barrier(CLK_LOCAL_MEM_FENCE);\n");
+                    if(threadsPerBlock >= radix)
+                        {
+                            for(i = 0; i < R1; i++)
+                                localString += string("a[") + num2str(i) + string("].y = lMemLoad[") + num2str(i*(radix+1)*(threadsPerBlock/radix)) + string("];\n");
+                        }
+                    else
+                        {
+                            int innerIter = radix/threadsPerBlock;
+                            int outerIter = R1/innerIter;
+                            for(i = 0; i < outerIter; i++)
+                                for(j = 0; j < innerIter; j++)
+                                    localString += string("a[") + num2str(i*innerIter+j) + string("].y = lMemLoad[") + num2str(j*threadsPerBlock + i*(radix+1)) + string("];\n");
+                        }
+                    localString += string("barrier(CLK_LOCAL_MEM_FENCE);\n");
+
+                    localString += string("indexOut += tid;\n");
+                    if(dataFormat == clFFT_SplitComplexFormat) {
+                        localString += string("out_real += indexOut;\n");
+                        localString += string("out_imag += indexOut;\n");
+                        for(k = 0; k < R1; k++)
+                            localString += string("out_real[") + num2str(k*threadsPerBlock) + string("] = a[") + num2str(k) + string("].x;\n");
+                        for(k = 0; k < R1; k++)
+                            localString += string("out_imag[") + num2str(k*threadsPerBlock) + string("] = a[") + num2str(k) + string("].y;\n");
+                    }
+                    else {
+                        localString += string("out += indexOut;\n");
+                        for(k = 0; k < R1; k++)
+                            localString += string("out[") + num2str(k*threadsPerBlock) + string("] = a[") + num2str(k) + string("];\n");
+                    }
+
+                }
             else
-            {
-                int innerIter = radix/threadsPerBlock;
-                int outerIter = R1/innerIter;
-                for(i = 0; i < outerIter; i++)
-                    for(j = 0; j < innerIter; j++)
-                        localString += string("a[") + num2str(i*innerIter+j) + string("].y = lMemLoad[") + num2str(j*threadsPerBlock + i*(radix+1)) + string("];\n");
-            }
-			localString += string("barrier(CLK_LOCAL_MEM_FENCE);\n");
-			
-			localString += string("indexOut += tid;\n");
-			if(dataFormat == clFFT_SplitComplexFormat) {
-				localString += string("out_real += indexOut;\n");
-				localString += string("out_imag += indexOut;\n");
-				for(k = 0; k < R1; k++)
-					localString += string("out_real[") + num2str(k*threadsPerBlock) + string("] = a[") + num2str(k) + string("].x;\n");
-				for(k = 0; k < R1; k++)
-					localString += string("out_imag[") + num2str(k*threadsPerBlock) + string("] = a[") + num2str(k) + string("].y;\n");
-			}
-			else {
-				localString += string("out += indexOut;\n");
-				for(k = 0; k < R1; k++)
-					localString += string("out[") + num2str(k*threadsPerBlock) + string("] = a[") + num2str(k) + string("];\n");				
-			}
-		 
-		}
-		else 
-		{
-			localString += string("indexOut += mad24(j, ") + num2str(numIter*strideO) + string(", i);\n");
-			if(dataFormat == clFFT_SplitComplexFormat) {
-				localString += string("out_real += indexOut;\n");
-				localString += string("out_imag += indexOut;\n");			
-				for(k = 0; k < R1; k++)
-					localString += string("out_real[") + num2str(((k%R2)*R1 + (k/R2))*strideO) + string("] = a[") + num2str(k) + string("].x;\n");
-				for(k = 0; k < R1; k++)
-					localString += string("out_imag[") + num2str(((k%R2)*R1 + (k/R2))*strideO) + string("] = a[") + num2str(k) + string("].y;\n");
-			}
-			else {
-				localString += string("out += indexOut;\n");
-				for(k = 0; k < R1; k++)
-					localString += string("out[") + num2str(((k%R2)*R1 + (k/R2))*strideO) + string("] = a[") + num2str(k) + string("];\n");
-			}
-		}
-		
-		insertHeader(*kernelString, kernelName, dataFormat);
-		*kernelString += string("{\n");
-		if((*kInfo)->lmem_size)
-			*kernelString += string("    __local float sMem[") + num2str((*kInfo)->lmem_size) + string("];\n");
-		*kernelString += localString;
-		*kernelString += string("}\n");		
-		
-		N /= radix;
-		kInfo = &(*kInfo)->next;
-		kCount++;
-	}
+                {
+                    localString += string("indexOut += mad24(j, ") + num2str(numIter*strideO) + string(", i);\n");
+                    if(dataFormat == clFFT_SplitComplexFormat) {
+                        localString += string("out_real += indexOut;\n");
+                        localString += string("out_imag += indexOut;\n");
+                        for(k = 0; k < R1; k++)
+                            localString += string("out_real[") + num2str(((k%R2)*R1 + (k/R2))*strideO) + string("] = a[") + num2str(k) + string("].x;\n");
+                        for(k = 0; k < R1; k++)
+                            localString += string("out_imag[") + num2str(((k%R2)*R1 + (k/R2))*strideO) + string("] = a[") + num2str(k) + string("].y;\n");
+                    }
+                    else {
+                        localString += string("out += indexOut;\n");
+                        for(k = 0; k < R1; k++)
+                            localString += string("out[") + num2str(((k%R2)*R1 + (k/R2))*strideO) + string("] = a[") + num2str(k) + string("];\n");
+                    }
+                }
+
+            insertHeader(*kernelString, kernelName, dataFormat);
+            *kernelString += string("{\n");
+            if((*kInfo)->lmem_size)
+                *kernelString += string("    __local float sMem[") + num2str((*kInfo)->lmem_size) + string("];\n");
+            *kernelString += localString;
+            *kernelString += string("}\n");
+
+            N /= radix;
+            kInfo = &(*kInfo)->next;
+            kCount++;
+        }
 }
 
 void FFT1D(cl_fft_plan *plan, cl_fft_kernel_dir dir)
-{	
+{
     unsigned int radixArray[10];
     unsigned int numRadix;
-    
-	switch(dir)
-	{
-		case cl_fft_kernel_x:
-		    if(plan->n.x > plan->max_localmem_fft_size)
-		    {
-		        createGlobalFFTKernelString(plan, plan->n.x, 1, cl_fft_kernel_x, 1);
-		    }
-		    else if(plan->n.x > 1)
-		    {
-		        getRadixArray(plan->n.x, radixArray, &numRadix, 0);
-		        if(plan->n.x / radixArray[0] <= plan->max_work_item_per_workgroup)
-		        {
-				    createLocalMemfftKernelString(plan);
-				}
-			    else
-			    {
-			        getRadixArray(plan->n.x, radixArray, &numRadix, plan->max_radix);
-			        if(plan->n.x / radixArray[0] <= plan->max_work_item_per_workgroup)
-			            createLocalMemfftKernelString(plan);
-			        else
-				        createGlobalFFTKernelString(plan, plan->n.x, 1, cl_fft_kernel_x, 1);
-				}
-		    }
-			break;
-			
-		case cl_fft_kernel_y:
-			if(plan->n.y > 1)
-			    createGlobalFFTKernelString(plan, plan->n.y, plan->n.x, cl_fft_kernel_y, 1);
-			break;
-			
-		case cl_fft_kernel_z:
-			if(plan->n.z > 1)
-			    createGlobalFFTKernelString(plan, plan->n.z, plan->n.x*plan->n.y, cl_fft_kernel_z, 1);
-		default:
-			return;
-	}
+
+    switch(dir)
+        {
+        case cl_fft_kernel_x:
+            if(plan->n.x > plan->max_localmem_fft_size)
+                {
+                    createGlobalFFTKernelString(plan, plan->n.x, 1, cl_fft_kernel_x, 1);
+                }
+            else if(plan->n.x > 1)
+                {
+                    getRadixArray(plan->n.x, radixArray, &numRadix, 0);
+                    if(plan->n.x / radixArray[0] <= plan->max_work_item_per_workgroup)
+                        {
+                            createLocalMemfftKernelString(plan);
+                        }
+                    else
+                        {
+                            getRadixArray(plan->n.x, radixArray, &numRadix, plan->max_radix);
+                            if(plan->n.x / radixArray[0] <= plan->max_work_item_per_workgroup)
+                                createLocalMemfftKernelString(plan);
+                            else
+                                createGlobalFFTKernelString(plan, plan->n.x, 1, cl_fft_kernel_x, 1);
+                        }
+                }
+            break;
+
+        case cl_fft_kernel_y:
+            if(plan->n.y > 1)
+                createGlobalFFTKernelString(plan, plan->n.y, plan->n.x, cl_fft_kernel_y, 1);
+
+
+            break;
+
+        case cl_fft_kernel_z:
+            if(plan->n.z > 1)
+                createGlobalFFTKernelString(plan, plan->n.z, plan->n.x*plan->n.y, cl_fft_kernel_z, 1);
+        default:
+            return;
+        }
 }
 
+
--- a/fft_Example/fft_setup.cc	Tue Feb 05 15:12:19 2013 +0900
+++ b/fft_Example/fft_setup.cc	Tue Feb 05 15:19:02 2013 +0900
@@ -50,69 +50,70 @@
 #include "fft_base_kernels.h"
 #include <stdlib.h>
 #include <string.h>
+#include <fcntl.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <iostream>
 #include <string>
 #include <sstream>
-
+#include <fstream>
 using namespace std;
 
 extern void getKernelWorkDimensions(cl_fft_plan *plan, cl_fft_kernel_info *kernelInfo, cl_int *batchSize, size_t *gWorkItems, size_t *lWorkItems);
 
-static void 
+static void
 getBlockConfigAndKernelString(cl_fft_plan *plan)
 {
-	plan->temp_buffer_needed = 0;
-	*plan->kernel_string += baseKernels;
-	
-	if(plan->format == clFFT_SplitComplexFormat)
-		*plan->kernel_string += twistKernelPlannar;
-	else
-		*plan->kernel_string += twistKernelInterleaved;
-	
-	switch(plan->dim) 
-	{
-		case clFFT_1D:
-			FFT1D(plan, cl_fft_kernel_x);
-			break;
-			
-		case clFFT_2D:
-			FFT1D(plan, cl_fft_kernel_x); 
-			FFT1D(plan, cl_fft_kernel_y);  
-			break;
-			
-		case clFFT_3D:
-			FFT1D(plan, cl_fft_kernel_x); 
-			FFT1D(plan, cl_fft_kernel_y); 
-			FFT1D(plan, cl_fft_kernel_z); 
-			break;
-			
-		default:
-			return;
-	}
-	
-	plan->temp_buffer_needed = 0;
-	cl_fft_kernel_info *kInfo = plan->kernel_info;
-	while(kInfo)
-	{
-		plan->temp_buffer_needed |= !kInfo->in_place_possible;
-		kInfo = kInfo->next;
-	}
+    plan->temp_buffer_needed = 0;
+    *plan->kernel_string += baseKernels;
+
+    if(plan->format == clFFT_SplitComplexFormat)
+        *plan->kernel_string += twistKernelPlannar;
+    else
+        *plan->kernel_string += twistKernelInterleaved;
+
+    switch(plan->dim)
+    {
+        case clFFT_1D:
+            FFT1D(plan, cl_fft_kernel_x);
+            break;
+
+        case clFFT_2D:
+            FFT1D(plan, cl_fft_kernel_x);
+            FFT1D(plan, cl_fft_kernel_y);
+            break;
+
+        case clFFT_3D:
+            FFT1D(plan, cl_fft_kernel_x);
+            FFT1D(plan, cl_fft_kernel_y);
+            FFT1D(plan, cl_fft_kernel_z);
+            break;
+
+        default:
+            return;
+    }
+
+    plan->temp_buffer_needed = 0;
+    cl_fft_kernel_info *kInfo = plan->kernel_info;
+    while(kInfo)
+    {
+        plan->temp_buffer_needed |= !kInfo->in_place_possible;
+        kInfo = kInfo->next;
+    }
 }
 
- 
+
 static void
 deleteKernelInfo(cl_fft_kernel_info *kInfo)
 {
-	if(kInfo)
-	{
-	    if(kInfo->kernel_name)
-		    free(kInfo->kernel_name);
-	    if(kInfo->kernel)
-		    clReleaseKernel(kInfo->kernel);
-		free(kInfo);
-	}	
+    if(kInfo)
+    {
+        if(kInfo->kernel_name)
+            free(kInfo->kernel_name);
+        if(kInfo->kernel)
+            clReleaseKernel(kInfo->kernel);
+        free(kInfo);
+    }
 }
 
 static void
@@ -120,102 +121,102 @@
 {
     cl_fft_kernel_info *kernel_info = Plan->kernel_info;
 
-	while(kernel_info)
-	{
-		cl_fft_kernel_info *tmp = kernel_info->next;
-		deleteKernelInfo(kernel_info);
-		kernel_info = tmp;
-	}
-	
-	Plan->kernel_info = NULL;
-		
-	if(Plan->kernel_string)
-	{
-		delete Plan->kernel_string;
-		Plan->kernel_string = NULL;
-	}			
-	if(Plan->twist_kernel)
-	{
-		clReleaseKernel(Plan->twist_kernel);
-		Plan->twist_kernel = NULL;
-	}
-	if(Plan->program)
-	{
-		clReleaseProgram(Plan->program);
-		Plan->program = NULL;
-	}
-	if(Plan->tempmemobj) 
-	{
-		clReleaseMemObject(Plan->tempmemobj);
-		Plan->tempmemobj = NULL;
-	}
-	if(Plan->tempmemobj_real)
-	{
-		clReleaseMemObject(Plan->tempmemobj_real);
-		Plan->tempmemobj_real = NULL;
-	}
-	if(Plan->tempmemobj_imag)
-	{
-		clReleaseMemObject(Plan->tempmemobj_imag);
-		Plan->tempmemobj_imag = NULL;
-	}
+    while(kernel_info)
+    {
+        cl_fft_kernel_info *tmp = kernel_info->next;
+        deleteKernelInfo(kernel_info);
+        kernel_info = tmp;
+    }
+
+    Plan->kernel_info = NULL;
+
+    if(Plan->kernel_string)
+    {
+        delete Plan->kernel_string;
+        Plan->kernel_string = NULL;
+    }
+    if(Plan->twist_kernel)
+    {
+        clReleaseKernel(Plan->twist_kernel);
+        Plan->twist_kernel = NULL;
+    }
+    if(Plan->program)
+    {
+        clReleaseProgram(Plan->program);
+        Plan->program = NULL;
+    }
+    if(Plan->tempmemobj)
+    {
+        clReleaseMemObject(Plan->tempmemobj);
+        Plan->tempmemobj = NULL;
+    }
+    if(Plan->tempmemobj_real)
+    {
+        clReleaseMemObject(Plan->tempmemobj_real);
+        Plan->tempmemobj_real = NULL;
+    }
+    if(Plan->tempmemobj_imag)
+    {
+        clReleaseMemObject(Plan->tempmemobj_imag);
+        Plan->tempmemobj_imag = NULL;
+    }
 }
 
 static int
-createKernelList(cl_fft_plan *plan) 
+createKernelList(cl_fft_plan *plan)
 {
-	cl_program program = plan->program;
-	cl_fft_kernel_info *kernel_info = plan->kernel_info;
-	
-	cl_int err;
-	while(kernel_info)
-	{
-		kernel_info->kernel = clCreateKernel(program, kernel_info->kernel_name, &err);
-		if(!kernel_info->kernel || err != CL_SUCCESS)
-			return err;
-		kernel_info = kernel_info->next;		
-	}
-	
-	if(plan->format == clFFT_SplitComplexFormat)
-		plan->twist_kernel = clCreateKernel(program, "clFFT_1DTwistSplit", &err);
-	else
-		plan->twist_kernel = clCreateKernel(program, "clFFT_1DTwistInterleaved", &err);
-	
-	if(!plan->twist_kernel || err)
-		return err;
+    cl_program program = plan->program;
+    cl_fft_kernel_info *kernel_info = plan->kernel_info;
 
-	return CL_SUCCESS;
+    cl_int err;
+    while(kernel_info)
+    {
+        kernel_info->kernel = clCreateKernel(program, kernel_info->kernel_name, &err);
+        if(!kernel_info->kernel || err != CL_SUCCESS)
+            return err;
+        kernel_info = kernel_info->next;
+    }
+
+    if(plan->format == clFFT_SplitComplexFormat)
+        plan->twist_kernel = clCreateKernel(program, "clFFT_1DTwistSplit", &err);
+    else
+        plan->twist_kernel = clCreateKernel(program, "clFFT_1DTwistInterleaved", &err);
+
+    if(!plan->twist_kernel || err)
+        return err;
+
+    return CL_SUCCESS;
 }
 
 int getMaxKernelWorkGroupSize(cl_fft_plan *plan, unsigned int *max_wg_size, unsigned int num_devices, cl_device_id *devices)
-{	
+{
     int reg_needed = 0;
     *max_wg_size = INT_MAX;
     int err;
     unsigned wg_size;
-    
+
     unsigned int i;
     for(i = 0; i < num_devices; i++)
     {
-	    cl_fft_kernel_info *kInfo = plan->kernel_info;
-	    while(kInfo)
-	    {
-		    err = clGetKernelWorkGroupInfo(kInfo->kernel, devices[i], CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &wg_size, NULL);
-		    if(err != CL_SUCCESS)
-		        return -1;
-		        
-		    if(wg_size < kInfo->num_workitems_per_workgroup)
-		        reg_needed |= 1;
-		    
-		    if(*max_wg_size > wg_size)
-		        *max_wg_size = wg_size;
-		        
-		    kInfo = kInfo->next;
-	    }
-	}
-	
-	return reg_needed;
-}	
+        cl_fft_kernel_info *kInfo = plan->kernel_info;
+        while(kInfo)
+        {
+            err = clGetKernelWorkGroupInfo(kInfo->kernel, devices[i], CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &wg_size, NULL);
+            if(err != CL_SUCCESS)
+                return -1;
+
+            if(wg_size < kInfo->num_workitems_per_workgroup)
+                reg_needed |= 1;
+
+            if(*max_wg_size > wg_size)
+                *max_wg_size = wg_size;
+
+            kInfo = kInfo->next;
+        }
+    }
+
+    return reg_needed;
+}
 
 #define ERR_MACRO(err) { \
                          if( err != CL_SUCCESS) \
@@ -223,179 +224,207 @@
                            if(error_code) \
                                *error_code = err; \
                            clFFT_DestroyPlan((clFFT_Plan) plan); \
-						   return (clFFT_Plan) NULL; \
+                           return (clFFT_Plan) NULL; \
                          } \
-					   }
+                       }
 
 clFFT_Plan
 clFFT_CreatePlan(cl_context context, clFFT_Dim3 n, clFFT_Dimension dim, clFFT_DataFormat dataFormat, cl_int *error_code )
 {
-	int i;
-	cl_int err;
-	int isPow2 = 1;
-	cl_fft_plan *plan = NULL;
-	ostringstream kString;
-	int num_devices;
-	int gpu_found = 0;
-	cl_device_id devices[16];
-	size_t ret_size;
-	cl_device_type device_type;
-	
+    int i;
+    cl_int err;
+    int isPow2 = 1;
+    cl_fft_plan *plan = NULL;
+    ostringstream kString;
+    int num_devices;
+    int gpu_found = 0;
+    cl_device_id devices[16];
+    size_t ret_size;
+    cl_device_type device_type;
+
     if(!context)
-		ERR_MACRO(CL_INVALID_VALUE);
-	
-	isPow2 |= n.x && !( (n.x - 1) & n.x );
-	isPow2 |= n.y && !( (n.y - 1) & n.y );
-	isPow2 |= n.z && !( (n.z - 1) & n.z );
-	
-	if(!isPow2)
-		ERR_MACRO(CL_INVALID_VALUE);
-	
-	if( (dim == clFFT_1D && (n.y != 1 || n.z != 1)) || (dim == clFFT_2D && n.z != 1) )
-		ERR_MACRO(CL_INVALID_VALUE);
+        ERR_MACRO(CL_INVALID_VALUE);
+
+    isPow2 |= n.x && !( (n.x - 1) & n.x );
+    isPow2 |= n.y && !( (n.y - 1) & n.y );
+    isPow2 |= n.z && !( (n.z - 1) & n.z );
+
+    if(!isPow2)
+        ERR_MACRO(CL_INVALID_VALUE);
+
+    if( (dim == clFFT_1D && (n.y != 1 || n.z != 1)) || (dim == clFFT_2D && n.z != 1) )
+        ERR_MACRO(CL_INVALID_VALUE);
 
-	plan = (cl_fft_plan *) malloc(sizeof(cl_fft_plan));
-	if(!plan)
-		ERR_MACRO(CL_OUT_OF_RESOURCES);
-	
-	plan->context = context;
-	clRetainContext(context);
-	plan->n = n;
-	plan->dim = dim;
-	plan->format = dataFormat;
-	plan->kernel_info = 0;
-	plan->num_kernels = 0;
-	plan->twist_kernel = 0;
-	plan->program = 0;
-	plan->temp_buffer_needed = 0;
-	plan->last_batch_size = 0;
-	plan->tempmemobj = 0;
-	plan->tempmemobj_real = 0;
-	plan->tempmemobj_imag = 0;
-	plan->max_localmem_fft_size = 2048;
-	plan->max_work_item_per_workgroup = 256;
-	plan->max_radix = 16;
-	plan->min_mem_coalesce_width = 16;
-	plan->num_local_mem_banks = 16;	
-	
+    plan = (cl_fft_plan *) malloc(sizeof(cl_fft_plan));
+    if(!plan)
+        ERR_MACRO(CL_OUT_OF_RESOURCES);
+
+    plan->context = context;
+    clRetainContext(context);
+    plan->n = n;
+    plan->dim = dim;
+    plan->format = dataFormat;
+    plan->kernel_info = 0;
+    plan->num_kernels = 0;
+    plan->twist_kernel = 0;
+    plan->program = 0;
+    plan->temp_buffer_needed = 0;
+    plan->last_batch_size = 0;
+    plan->tempmemobj = 0;
+    plan->tempmemobj_real = 0;
+    plan->tempmemobj_imag = 0;
+    plan->max_localmem_fft_size = 2048;
+    plan->max_work_item_per_workgroup = 256;
+    plan->max_radix = 16;
+    plan->min_mem_coalesce_width = 16;
+    plan->num_local_mem_banks = 16;
+
 patch_kernel_source:
 
-	plan->kernel_string = new string("");
-	if(!plan->kernel_string)
+    plan->kernel_string = new string("");
+    if(!plan->kernel_string)
         ERR_MACRO(CL_OUT_OF_RESOURCES);
 
-	getBlockConfigAndKernelString(plan);
-	
-	const char *source_str = plan->kernel_string->c_str();
-	plan->program = clCreateProgramWithSource(context, 1, (const char**) &source_str, NULL, &err);
+    getBlockConfigAndKernelString(plan);
+
+    char *source_str;// = plan->kernel_string->c_str(); // gen kernel
+#ifdef gen_kernel
+    char *kernel_name_buf = new char[10];
+    strcpy(kernel_name_buf,plan->kernel_info->kernel_name);
+    std::ofstream ofs(strcat(kernel_name_buf,".cl"));
+    ofs<<*plan->kernel_string<<std::endl;
+    delete kernel_name_buf;
+    // printf("%s \n",plan->kernel_string->c_str());
+    //    exit(0);
+#endif
+    int fd = open("./fft0.cl",O_RDONLY);
+    
+    if (fd<0) {
+        fprintf(stderr, "Failed to load kernel.\n");
+        exit(1);
+    }
+    
+    struct stat stats;
+    fstat(fd,&stats);
+    off_t size = stats.st_size;
+    if (size<=0) {
+        fprintf(stderr, "Failed to load kernel.\n");
+        exit(1);
+    }
+
+    source_str = (char*)alloca(size);
+    size_t source_size = read(fd, source_str, size);
+    close(fd);
+
+    plan->program = clCreateProgramWithSource(context, 1, (const char**) &source_str, (const size_t *)&source_size, &err);
     ERR_MACRO(err);
 
-	err = clGetContextInfo(context, CL_CONTEXT_DEVICES, sizeof(devices), devices, &ret_size);
-	ERR_MACRO(err);
-	
-	num_devices = (int)(ret_size / sizeof(cl_device_id));
-	
-	for(i = 0; i < num_devices; i++)
-	{
-		err = clGetDeviceInfo(devices[i], CL_DEVICE_TYPE, sizeof(device_type), &device_type, NULL);
-		ERR_MACRO(err);
-		
-		if(device_type == CL_DEVICE_TYPE_GPU)
-		{	
-			gpu_found = 1;
-	        err = clBuildProgram(plan->program, 1, &devices[i], "-cl-mad-enable", NULL, NULL);
-	        if (err != CL_SUCCESS)
-	        {
-		        char *build_log;				
-				char devicename[200];
-		        size_t log_size;
-				
-		        err = clGetProgramBuildInfo(plan->program, devices[i], CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
-				ERR_MACRO(err);
-				
-		        build_log = (char *) malloc(log_size + 1);
-				
-			    err = clGetProgramBuildInfo(plan->program, devices[i], CL_PROGRAM_BUILD_LOG, log_size, build_log, NULL);
-				ERR_MACRO(err);
-				
-				err = clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(devicename), devicename, NULL);
-				ERR_MACRO(err);
-				
-				fprintf(stdout, "FFT program build log on device %s\n", devicename);
-		        fprintf(stdout, "%s\n", build_log);
-		        free(build_log);
-				
-				ERR_MACRO(err);
-			}	
-		}	
-	}
-	
-	if(!gpu_found)
-		ERR_MACRO(CL_INVALID_CONTEXT);
-	
-	err = createKernelList(plan); 
+    err = clGetContextInfo(context, CL_CONTEXT_DEVICES, sizeof(devices), devices, &ret_size);
     ERR_MACRO(err);
-    
+
+    num_devices = (int)(ret_size / sizeof(cl_device_id));
+
+    for(i = 0; i < num_devices; i++)
+    {
+        err = clGetDeviceInfo(devices[i], CL_DEVICE_TYPE, sizeof(device_type), &device_type, NULL);
+        ERR_MACRO(err);
+
+        if(device_type == CL_DEVICE_TYPE_GPU)
+        {
+            gpu_found = 1;
+            err = clBuildProgram(plan->program, 1, &devices[i], "-cl-mad-enable", NULL, NULL);
+            if (err != CL_SUCCESS)
+            {
+                char *build_log;
+                char devicename[200];
+                size_t log_size;
+
+                err = clGetProgramBuildInfo(plan->program, devices[i], CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
+                ERR_MACRO(err);
+
+                build_log = (char *) malloc(log_size + 1);
+
+                err = clGetProgramBuildInfo(plan->program, devices[i], CL_PROGRAM_BUILD_LOG, log_size, build_log, NULL);
+                ERR_MACRO(err);
+
+                err = clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(devicename), devicename, NULL);
+                ERR_MACRO(err);
+
+                fprintf(stdout, "FFT program build log on device %s\n", devicename);
+                fprintf(stdout, "%s\n", build_log);
+                free(build_log);
+
+                ERR_MACRO(err);
+            }
+        }
+    }
+
+    if(!gpu_found)
+        ERR_MACRO(CL_INVALID_CONTEXT);
+
+    err = createKernelList(plan);
+    ERR_MACRO(err);
+
     // we created program and kernels based on "some max work group size (default 256)" ... this work group size
-    // may be larger than what kernel may execute with ... if thats the case we need to regenerate the kernel source 
-    // setting this as limit i.e max group size and rebuild. 
-	unsigned int max_kernel_wg_size; 
-	int patching_req = getMaxKernelWorkGroupSize(plan, &max_kernel_wg_size, num_devices, devices);
-	if(patching_req == -1)
-	{
-	    ERR_MACRO(err);
-	}
-	
-	if(patching_req)
-	{
-	    destroy_plan(plan);
-	    plan->max_work_item_per_workgroup = max_kernel_wg_size;
-	    goto patch_kernel_source;
-	}
-	
-	cl_fft_kernel_info *kInfo = plan->kernel_info;
-	while(kInfo)
-	{
-		plan->num_kernels++;
-		kInfo = kInfo->next;
-	}
-	
-	if(error_code)
-		*error_code = CL_SUCCESS;
-			
-	return (clFFT_Plan) plan;
+    // may be larger than what kernel may execute with ... if thats the case we need to regenerate the kernel source
+    // setting this as limit i.e max group size and rebuild.
+    unsigned int max_kernel_wg_size;
+    int patching_req = getMaxKernelWorkGroupSize(plan, &max_kernel_wg_size, num_devices, devices);
+    if(patching_req == -1)
+    {
+        ERR_MACRO(err);
+    }
+
+    if(patching_req)
+    {
+        destroy_plan(plan);
+        plan->max_work_item_per_workgroup = max_kernel_wg_size;
+        goto patch_kernel_source;
+    }
+
+    cl_fft_kernel_info *kInfo = plan->kernel_info;
+    while(kInfo)
+    {
+        plan->num_kernels++;
+        kInfo = kInfo->next;
+    }
+
+    if(error_code)
+        *error_code = CL_SUCCESS;
+
+    return (clFFT_Plan) plan;
 }
 
-void		 
+void
 clFFT_DestroyPlan(clFFT_Plan plan)
 {
     cl_fft_plan *Plan = (cl_fft_plan *) plan;
-	if(Plan) 
-	{	
-		destroy_plan(Plan);	
-		clReleaseContext(Plan->context);
-		free(Plan);
-	}		
+    if(Plan)
+    {
+        destroy_plan(Plan);
+        clReleaseContext(Plan->context);
+        free(Plan);
+    }
 }
 
 void clFFT_DumpPlan( clFFT_Plan Plan, FILE *file)
 {
-	size_t gDim, lDim;
-	FILE *out;
-	if(!file)
-		out = stdout;
-	else 
-		out = file;
-	
-	cl_fft_plan *plan = (cl_fft_plan *) Plan;
-	cl_fft_kernel_info *kInfo = plan->kernel_info;
-	
-	while(kInfo)
-	{
-		cl_int s = 1;
-		getKernelWorkDimensions(plan, kInfo, &s, &gDim, &lDim);
-		fprintf(out, "Run kernel %s with global dim = {%zd*BatchSize}, local dim={%zd}\n", kInfo->kernel_name, gDim, lDim);
-		kInfo = kInfo->next;
-	}
-	fprintf(out, "%s\n", plan->kernel_string->c_str());
-}
\ No newline at end of file
+    size_t gDim, lDim;
+    FILE *out;
+    if(!file)
+        out = stdout;
+    else
+        out = file;
+
+    cl_fft_plan *plan = (cl_fft_plan *) Plan;
+    cl_fft_kernel_info *kInfo = plan->kernel_info;
+
+    while(kInfo)
+    {
+        cl_int s = 1;
+        getKernelWorkDimensions(plan, kInfo, &s, &gDim, &lDim);
+        fprintf(out, "Run kernel %s with global dim = {%zd*BatchSize}, local dim={%zd}\n", kInfo->kernel_name, gDim, lDim);
+        kInfo = kInfo->next;
+    }
+    fprintf(out, "%s\n", plan->kernel_string->c_str());
+}
--- a/fft_Example/param.txt	Tue Feb 05 15:12:19 2013 +0900
+++ b/fft_Example/param.txt	Tue Feb 05 15:19:02 2013 +0900
@@ -45,13 +45,10 @@
 //
 ////////////////////////////////////////////////////////////////////////////////////////////////////
 
-
 -n 64 1 1 -batchsize 8192 -dir forward -dim 1D -format plannar -numiter 1000 -testtype out-of-place
 -n 1024 1 1 -batchsize 8192 -dir forward -dim 1D -format plannar -numiter 1000 -testtype out-of-place
--n 1048576 1 1 -batchsize 4 -dir inverse -dim 1D -format interleaved -numiter 1000 -testtype out-of-place
 -n 1024 512 1 -batchsize 8 -dir forward -dim 2D -format interleaved -numiter 1000 -testtype out-of-place
 -n 128 128 128 -batchsize 1 -dir inverse -dim 3D -format interleaved -numiter 1000 -testtype out-of-place
--n 16384 1 1 -batchsize 4 -dir forward -dim 1D -format interleaved -numiter 1 -testtype in-place
 -n 32 2048 1 -batchsize 8 -dir forward -dim 2D -format interleaved -numiter 1 -testtype in-place
 -n 4096 64 1 -batchsize 4 -dir inverse -dim 2D -format plannar -numiter 1 -testtype in-place
 -n 64 32 16 -batchsize 1 -dir inverse -dim 3D -format interleaved -numiter 1 -testtype out-of-place
--- a/fft_Example/param_small.txt	Tue Feb 05 15:12:19 2013 +0900
+++ b/fft_Example/param_small.txt	Tue Feb 05 15:19:02 2013 +0900
@@ -44,4 +44,5 @@
 // Copyright ( C ) 2008 Apple Inc. All Rights Reserved.
 //
 ////////////////////////////////////////////////////////////////////////////////////////////////////
--n 64 1 1 -batchsize 8192 -dir forward -dim 1D -format plannar -numiter 1000 -testtype out-of-place
+
+-n 1024 1 1 -batchsize 8192 -dir forward -dim 1D -format plannar -numiter 1000 -testtype out-of-place
Binary file fft_fixstart/sample2.jpg has changed
Binary file fft_fixstart/sample2.pgm has changed