Mercurial > hg > Papers > 2014 > kkb-sigos

--- a/benchmark.tex	Fri Apr 18 20:55:25 2014 +0900
+++ b/benchmark.tex	Mon Apr 21 03:30:37 2014 +0900
@@ -1,1 +1,73 @@
 \section{Benchmark}
+本章では、WordCount, FFT を例題として用い、本研究で実装した GpuScheduler および CudaScheduler の測定を行う。
+
+測定環境
+\begin{itemize}
+\item OS : MacOS 10.9.2
+\item CPU : 2*2.66GHz 6-Core Intel Xeon
+\item GPU : NVIDIA Quadro K5000 4096MB
+\item Memory : 16GB 1333MHz DDR3
+\item Compiler : Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
+\end{itemize}
+
+\section{WordCount}
+今回は 100MB のテキストファイルに対して WordCount を行なった。
+表:\ref{table:wordcount}は実行結果である。
+
+\begin{table}[!h]
+  \begin{center}
+    \small
+    \begin{tabular}[t]{c||r} \hline
+      & Run Time \\ \hline
+      1 CPU & 0.73s \\ \hline
+      2 CPU & 0.38s \\ \hline
+      4 CPU & 0.21s \\ \hline
+      8 CPU & 0.12s \\ \hline
+      OpenCL(no pipeline) & 48.32s \\ \hline
+      OpenCL(pipeline) & 46.74s \\ \hline
+      OpenCL Data Parallel & 0.38s \\ \hline
+      CUDA(no pipeline) & 55.71s \\ \hline
+      CUDA(pipeline) & 10.26s \\ \hline
+      CUDA Data Parallel & 0.71s \\ \hline
+    \end{tabular}
+    \caption{WordCount}
+    \label{table:wordcount}
+  \end{center}
+\end{table}
+
+パイプライン処理を行うことで CUDA では5.4倍の性能向上が見られた。
+しかし、OpenCL ではパイプライン処理による性能向上が見られなかった。
+OpenCL と CUDA を用いたそれぞれの Scheduler はほぼ同等な実装である。
+OpenCL でパイプライン処理を行うために実行機構を見直す必要がある。
+一方で、データ並列による実行は 1CPU に対して OpenCL では1.9倍、CUDA では1.02倍という結果になった。
+どちらもタスク並列による実行よりは優れた結果になっている。
+CUDA によるデータ並列実行の機構を見直す必要がある。
+
+\subsection{FFT}
+次に、フーリエ変換と周波数フィルタによる画像処理を行う例題を利用し測定を行う。
+使用する画像のサイズは512*512で、画像に対して High Pass Filter をかけて変換を行う。
+表:\ref{table:fft}は実行結果である。
+
+\begin{table}[!h]
+  \begin{center}
+    \small
+    \begin{tabular}[t]{c||r} \hline
+      & Run Time \\ \hline
+      1 CPU & 0.48s \\ \hline
+      2 CPU & 0.26s \\ \hline
+      4 CPU & 0.17s \\ \hline
+      8 CPU & 0.11s \\ \hline
+      OpenCL & 0.09s \\ \hline
+      CUDA & 0.21s \\ \hline
+    \end{tabular}
+    \label{table:fft}
+    \caption{FFT}
+  \end{center}
+\end{table}
+
+1CPU に対して OpenCL ではの5.3倍、CUDA では2.2倍の性能向上が見られた。
+しかし、WordCount の場合と同様に OpenCL と CUDA で差がある。
+WordCount と FFT の結果から CudaScheduler によるデータ並列実行機構を見直す必要がある。
+また、FFT の OpenCL の kernel は cl\_float2 というベクター型を用いている。
+CUDA では cl\_float2 を float に変換して演算している。
+OpenCL ではベクターの演算なので、その部分に最適化がかかっており結果が良くなっている可能性がある。
--- a/cerium.tex	Fri Apr 18 20:55:25 2014 +0900
+++ b/cerium.tex	Mon Apr 21 03:30:37 2014 +0900
@@ -28,11 +28,21 @@
   \label{fig:createTask}
 \end{figure}

-依存関係が解消され、実行可能になった Task は ActiveTaskList に移される。
-さらに、Scheduler に転送しやすい TaskList に変換してから cpy type に対応する Scheduler に転送される。
-Task が終了すると Scheduler から TaskManager に通知される。
-その通知に従って依存関係が解消され、再び TaskManager から Scheduler に Task が転送される。
 図:\ref{fig:createTask}は Cerium が Task を生成/実行する場合のクラスの構成である。
+TaskManager で依存関係が解消され、実行可能になった Task は ActiveTaskList に移される。
+ActiveTaskList に移された Task は依存関係が存在しないのでどのような順序で実行されても問題ない。
+Task は Scheduler に転送しやすい TaskList に変換してから cpy type に対応する Scheduler に Synchronized Queue である mail を通して転送される。
+Scheduler ではパイプラインで task が処理される(図:\ref{fig:task_scheduler})。
+Task が終了すると Scheduler から TaskManager に mail を通して通知される。
+その通知に従って依存関係が処理され、再び TaskManager から Scheduler に Task が転送される。
+
+\begin{figure}[!h]
+  \begin{center}
+    \includegraphics[scale=0.4]{./images/scheduler.pdf}
+  \end{center}
+  \caption{Task Scheduler}
+  \label{fig:task_scheduler}
+\end{figure}

 以下に Task を生成する例題を示す。
 表:\ref{table:TaskManager_api}は Task を生成に用いる API を示している。
@@ -50,8 +60,8 @@
       set\_param   & Task のパラメータ \\ \hline
       wait\_for    & Task の依存関係 \\ \hline
       set\_cpu     & Task を実行する Device の設定 \\ \hline
-      spawn        & Task を Queue に登録する \\ \hline
-      iterate      & データ並列で実行する Task として Queue に登録する \\ \hline
+      spawn        & Task を登録する \\ \hline
+      iterate      & データ並列で実行する Task として登録する \\ \hline
     \end{tabular}
     \caption{Task 生成に用いる API}
     \label{table:TaskManager_api}
@@ -86,7 +96,7 @@
   \begin{center}
     \small
     \begin{tabular}[t]{c||c|c|c|c} \hline
-      stage&CPU0& CPU1&CPU2&CPU3 \\ \hline
+      &CPU0& CPU1&CPU2&CPU3 \\ \hline
       1&0&1&2&3 \\ \hline
       2&4&5&6&7 \\ \hline
       3&8&9& & \\ \hline
--- a/cerium_gpu.tex	Fri Apr 18 20:55:25 2014 +0900
+++ b/cerium_gpu.tex	Mon Apr 21 03:30:37 2014 +0900
@@ -15,7 +15,7 @@

 \begin{figure}[htpd]
   \begin{center}
-    \includegraphics[scale=0.4]{./images/gpu_arch.pdf}
+    \includegraphics[scale=0.35]{./images/gpu_arch.pdf}
   \end{center}
   \caption{Gpu Architecture}
   \label{fig:gpuarch}
@@ -28,3 +28,27 @@
   \caption{Cpu Architecture}
   \label{fig:cpuarch}
 \end{figure}
+
+\subsection{OpenCL および CUDA を用いた Scheduler の実装}
+Scheduler と CpuThreads に対応させる形で OpenCL を用いた GpuScheduler, GpuThreads、CUDA を用いた CudaScheduler, CudaThreads を実装した。
+TaskManager から転送された TaskList の情報をもとに device 上のメモリ領域を確保する。
+その後、OpenCL ならば CommandQueue、CUDA ならば Stream に Operation を発行していく。
+Operation は発行された順序で実行されるので、host から device へのデータ転送、kernel の実行、device から host へのデータ転送の順に発行する。
+非同期 API を用いることでデータ転送や kernel の実行を並列に行うことができる。
+通常、非同期 API を用いる場合は依存関係を考慮した同期が必要になるが転送されてくる Task の依存関係は TaskManager ですべて解消されているので Scheduler 側では順番を考えず Task を実行して問題ない。
+host から device へのデータ転送は、OpenCL では clEnqueueWriteBuffer、CUDA では cuMempcyHtoDAsync を用いて行われる。
+clEnqueueWriteBuffer は第三引数に CL\_FALSE を指定することで非同期なデータ転送を行う。
+転送されてきた TaskList からデータ並列またはタスク並列で実行するか決定する。
+データ並列で実行する場合は、OpenCL では clEnqueueTaskNDRangeKernel、CUDA では cuLaunchKernel を用いる。
+タスク並列で実行する場合は、OpenCL では clEnqueueTask、CUDA では cuLaunckKernel の引数を1に設定することで実行することができる。
+device から host へのデータ転送は、OpenCL では clEnqueuReadBuffer、CUDA では cuMemcpyDtoHAsync を用いて行われる。
+clEnqueueReadBuffer も clEnqueueWriteBuffer と同様に第三引数に CL\_FALSE を指定することで非同期実行となる。
+転送されてきた Task がすべて終了すると Synchronized Queue である mail を通して TaskManager に Task の終了を通知する。
+終了が通知されると TaskManager で依存関係が解消し、再び TaskList を転送する。
+GpuScheduler および CudaScheduler は複数の CommandQueue および Stream を持っており、パイプラインで実行される。
+
+kernel の記述は以下のようになる。
+\lstinputlisting[caption=multiply(OpenCL),label=test]{./source/Multi.cl}
+\lstinputlisting[caption=multiply(CUDA),label=test]{./source/Multiply.cu}
+
+修飾子など若干の違いはあるが、ほぼ同じ記述で書くことができるが CPU, OpenCL, CUDA のどれか1つの記述から残りのコードも生成できるようにすることが望ましい。
--- a/conclusion.tex	Fri Apr 18 20:55:25 2014 +0900
+++ b/conclusion.tex	Mon Apr 21 03:30:37 2014 +0900
@@ -1,1 +1,6 @@
 \section{まとめ}
+本研究では並列プログラミングフレームワーク Cerium を OpenCL および CUDA に対応させた。
+OpenCL および CUDA に対応させたことで Cerium は単一の記述から CPU および GPU 上での実行が可能になった。
+WordCount, FFT を例題に用い、Scheduler の測定も行なった。
+OpenCL と CUDA で異なる結果が出たことからそれぞれで最適なチューニングの方法が違うことがわかる。
+どちらもチューニングを行えば同等な結果が出ると考えられるのでプロファイラなどを用いて、実装を見直すことが今後の課題となる。
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/images/scheduler.bb	Mon Apr 21 03:30:37 2014 +0900
@@ -0,0 +1,5 @@
+%%Title: ./images/scheduler.pdf
+%%Creator: extractbb 20100328
+%%BoundingBox: 0 0 505 345
+%%CreationDate: Tue Jan 10 13:32:02 2012
+
Binary file images/scheduler.graffle has changed
Binary file images/scheduler.pdf has changed
--- a/introduction.tex	Fri Apr 18 20:55:25 2014 +0900
+++ b/introduction.tex	Mon Apr 21 03:30:37 2014 +0900
@@ -1,12 +1,16 @@
 \section{はじめに}
 GPU の普及と高性能化にともない、GPU の演算資源を画像処理以外の目的にも利用する GPGPU(GPU による汎目的計算)が注目されている。
-しかし、GPU の利用には様々な制約があり、十分な性能を引き出すには GPU のアーキテクチャに適したプログラミングを行う必要がある。
-また、マルチコアプロセッサ上で実行する場合と同様に効率の良い並列プログラムを書くことは難しい。
-そこで、当研究室で開発・改良が行われている並列プログラミングフレームワーク Cerium を GPGPU に対応させた。
+GPU 以外にも Cell, SpursEngine, Xeon Phi など様々なプロセッサが存在する。
+GPU や Cell はメモリ空間が異なるたデータ転送がを効率化しなければ性能向上に繋がらない。
+また、それぞれのプロセッサを利用するにはそれぞれ異なる API を利用する必要があり、それらの対応に多くの時間を取られてしまいプログラムの性能改善に集中することができない。
+様々なプロセッサを統合して扱えるフレームワークとして StarPU などがある。
+StarPU はタスクベースの非同期フレームワークである。
+StarPU にはパイプラインでの実行機構は入ってなく、パイプライン処理を行いたい場合は自分で実装するしかない。
+しかし、パイプライン処理を書くことは非常に煩雑で難しい。
+
+そこで、当研究室で開発・改良が行われている並列プログラミングフレームワーク Cerium を OpenCL, CUDA を用い GPGPU に対応させた。
 Cerium は PS3 および Linux, MacOS X 上で動作する。
-MacOS X 上で GPGPU を行う場合、OpenCL または CUDA を用いる方法が考えられる。

-今回、Cerium 上に OpenCL, CUDA を用いた GPU 実行機構を実装した。

 本論文では、まず OpenCL, CUDA について説明する。
 その後、既存の Cerium の実装および新たに実装した GPU 実行の機構について説明する。
--- a/opencl.tex	Fri Apr 18 20:55:25 2014 +0900
+++ b/opencl.tex	Mon Apr 21 03:30:37 2014 +0900
@@ -1,4 +1,5 @@
-\section{OpenCL}
+\section{Parallel Computing Platform}
+\subsection{OpenCL}
 OpenCL とは、Multi Core CPU と GPU のようなヘテロジニアスな環境を利用した並列計算を支援するフレームワークである。
 演算用のプロセッサ(本研究では GPU)上で動作するプログラム OpenCL C という言語で記述する。
 OpenCL C で記述したプログラムを GPU 上で実行させるために OpenCL Runtime API を利用する。
@@ -6,7 +7,7 @@
 OpenCL Application は host 側のプログラムと device 側のプログラムが一体となって動作する。
 この device 上で動作するプログラムを特別に kernel と呼ぶ。

-\subsection{CommandQueue}
+\subsubsection{CommandQueue}
 OpenCL では、device の操作に CommandQueue を使用する。
 CommandQueue は device に Operation を送るための仕組みである。
 kernel の実行、input buffer の読み込み、output buffer への書き込みなどが Operation となる。
@@ -18,7 +19,7 @@
 out of order で実行する場合、データの依存関係を設定する必要がある。各 Operation を発行する関数には event\_wait\_list と event を指定することができ、これらを利用してデータの依存関係を設定することができる。
 out of order 実行を可能にするプロパティをサポートしている device が少ないため、今回は複数の CommandQueue を用いる方法で実装を行なった。

-\subsection{Data Parallel Execution}
+\subsubsection{OpenCL におけるデータ並列}
 3D グラフィックのような多次元のデータを処理する場合に高い並列度を保つには、データを分割して並列に実行する機能が必要である。
 これを OpenCL ではデータ並列と呼んでいる。
 OpenCL では次元数に対応する index があり、OpenCL は1つの記述から index の異なる複数の kernel を自動生成する。
@@ -30,3 +31,54 @@

 データ並列による kernel 実行の場合、clEnqueueNDRangeKernel を使用する。
 この関数の引数として WorkItem の数と次元数を指定することでデータ並列で実行することができる。
+
+\subsection{CUDA}
+CUDA とは、半導体メーカー NVIDIA 社が提供する GPU コンピューティング向けの総合開発環境でコンパイラ、ライブラリ、デバッガなどから構成される。プログラミング言語である CUDA C は C 言語ベースに拡張を加えたものである。
+
+CUDA には CUDA Runtime API と CUDA Driver API の2種類がある。
+Driver API は Runtime API に比べてプログラマが管理すべきリソースが多い。
+しかし、Runtime API より柔軟な処理を行うことができる。
+今回は Driver API を使用して実装した。
+
+CUDA も OpenCL と同様に、制御を行う CPU 側を host、GPU 側を device と定義している。
+また、device 上で動作するプログラムも OpenCL と同様に kernel と呼ぶ。
+
+\subsubsection{Stream}
+CUDA には OpenCL の CommandQueue と似たような仕組みとして Stream がある。
+Stream は host 側で発行された Operation を一連の動作として device で実行する。
+Stream に発行された Operation は発行された順序で実行されることが保証されている。
+異なる Stream に発行された Operation に依存関係が存在しない場合、Operation を並列に実行することができる。
+
+Stream は cuStreamCreate という Driver API で生成される。
+引数に Stream を指定しない API はすべて host 側をブロックする同期的な処理となる。
+複数の Stream を同時に走らせ Operation を並列に実行するためには非同期な処理を行う API を利用する必要がある。
+
+\subsubsection{CUDA におけるデータ並列}
+CUDA では OpenCL の WorkItem に相当する単位を thread と定義している。
+この thread をまとめたものを block と呼ぶ。
+CUDA でデータ並列による kernel 実行をする場合、cuLaunchKernel API を使用する。
+この関数は引数として各座標の block 数と各座標の block 1つ当たりの thread 数を指定することでデータ並列で実行できる。
+
+cuLaunckKernel で kernel を実行すると各 thread に対して block ID と thread ID が割り当てられる。
+CUDA には OpenCL とは異なり、ID を取得する API は存在しない。
+代わりに、kernel に組み込み変数が準備されており、それを参照し、対応するデータに対し処理を行うことでデータ並列を実現する。
+組み込み変数は以下の通りである。
+
+\begin{itemize}
+  \item uint3 blockDim
+  \item uint3 blockIdx
+  \item uint3 threadIdx
+\end{itemize}
+
+各組み込み変数はベクター型で、blockDim.x とすると x 座標の thread 数を参照することができる。
+blockIdx.x とすると x 座標の block ID が参照でき、threadIdx.x とすると x 座標の thread ID を参照することができる。
+blockDim.x * blockIdx.x + threadIdx.x で OpenCL の get\_global\_id(0) で取得できる ID に相当する ID を算出することができる。
+例として、ある kernel で get\_global\_id(0) の返り値が13の場合、CUDA では図:\ref{fig:calculate_index}のようになる。
+
+\begin{figure}[!h]
+  \begin{center}
+    \includegraphics[scale=0.4]{./images/culculate_index.pdf}
+  \end{center}
+  \caption{Calculate Index}
+  \label{fig:calculate_index}
+\end{figure}
--- a/sigos.aux	Fri Apr 18 20:55:25 2014 +0900
+++ b/sigos.aux	Mon Apr 21 03:30:37 2014 +0900
@@ -1,11 +1,22 @@
 \relax
 \newlabel{fig:calculate_index}{{1}{3}}
 \newlabel{fig:createTask}{{2}{3}}
+\newlabel{fig:task_scheduler}{{3}{3}}
 \newlabel{test}{{1}{3}}
 \@writefile{lol}{\contentsline {lstlisting}{\numberline {1}multiply}{3}}
-\newlabel{table:TaskManager_api}{{1}{3}}
-\newlabel{test}{{2}{3}}
-\@writefile{lol}{\contentsline {lstlisting}{\numberline {2}task}{3}}
+\newlabel{table:TaskManager_api}{{1}{4}}
+\newlabel{test}{{2}{4}}
+\@writefile{lol}{\contentsline {lstlisting}{\numberline {2}task}{4}}
+\newlabel{table:taskAPI}{{2}{4}}
+\newlabel{table:dpi}{{3}{4}}
+\newlabel{test}{{3}{4}}
+\@writefile{lol}{\contentsline {lstlisting}{\numberline {3}example}{4}}
+\newlabel{fig:gpuarch}{{4}{5}}
+\newlabel{fig:cpuarch}{{5}{5}}
+\newlabel{test}{{4}{5}}
+\@writefile{lol}{\contentsline {lstlisting}{\numberline {4}multiply(OpenCL)}{5}}
+\newlabel{test}{{5}{5}}
+\@writefile{lol}{\contentsline {lstlisting}{\numberline {5}multiply(CUDA)}{5}}
 \citation{toma:2012a}
 \citation{opencl}
 \citation{opencl:ref}
@@ -14,14 +25,10 @@
 \bibstyle{ipsjunsrt}
 \bibdata{sigos}
 \bibcite{toma:2012a}{1}
-\newlabel{table:taskAPI}{{2}{4}}
-\newlabel{table:dpi}{{3}{4}}
-\newlabel{test}{{3}{4}}
-\@writefile{lol}{\contentsline {lstlisting}{\numberline {3}example}{4}}
-\newlabel{fig:gpuarch}{{3}{4}}
 \bibcite{opencl}{2}
 \bibcite{opencl:ref}{3}
 \bibcite{opencl:applied}{4}
 \bibcite{yutaka:os}{5}
-\newlabel{fig:cpuarch}{{4}{5}}
-\gdef\ipsj@lastpage{5}
+\newlabel{table:wordcount}{{4}{6}}
+\newlabel{table:fft}{{6.1}{6}}
+\gdef\ipsj@lastpage{6}
Binary file sigos.dvi has changed
--- a/sigos.log	Fri Apr 18 20:55:25 2014 +0900
+++ b/sigos.log	Mon Apr 21 03:30:37 2014 +0900
@@ -1,4 +1,4 @@
-This is pTeX, Version 3.141592-p3.1.10 (utf8.euc) (Web2C 7.5.4) (format=platex 2011.8.15)  17 APR 2014 19:19
+This is pTeX, Version 3.141592-p3.1.10 (utf8.euc) (Web2C 7.5.4) (format=platex 2011.8.15)  21 APR 2014 03:28
 **sigos.tex
 (./sigos.tex
 pLaTeX2e <2006/11/10>+0 (based on LaTeX2e <2003/12/01> patch level 0)
@@ -148,6 +148,12 @@

 LaTeX Warning: Label `test' multiply defined.

+
+LaTeX Warning: Label `test' multiply defined.
+
+
+LaTeX Warning: Label `test' multiply defined.
+
 )
 \openout1 = `sigos.aux'.

@@ -186,43 +192,41 @@

 LaTeX Font Info:    External font `cmex10' loaded for size
 (Font)              <7> on input line 86.
-(./introduction.tex) (./opencl.tex [1
+(./introduction.tex) (./opencl.tex
+LaTeX Font Info:    Font shape `JT1/mc/bx/n' in size <9> not available
+(Font)              Font shape `JT1/gt/m/n' tried instead on input line 2.
+LaTeX Font Info:    Font shape `JY1/mc/bx/n' in size <9> not available
+(Font)              Font shape `JY1/gt/m/n' tried instead on input line 2.
+ [1


 ]
-LaTeX Font Info:    Font shape `JT1/mc/bx/n' in size <9> not available
-(Font)              Font shape `JT1/gt/m/n' tried instead on input line 9.
-LaTeX Font Info:    Font shape `JY1/mc/bx/n' in size <9> not available
-(Font)              Font shape `JY1/gt/m/n' tried instead on input line 9.
-) (./cuda.tex
 File: ./images/culculate_index.pdf Graphic file (type eps)
-
-<./images/culculate_index.pdf>
-Overfull \hbox (20.78407pt too wide) in paragraph at lines 46--47
+ <./images/culculate_index.pdf>
+Overfull \hbox (20.78407pt too wide) in paragraph at lines 80--81
  []
  []

 LaTeX Font Info:    Font shape `JT1/mc/bx/n' in size <7> not available
-(Font)              Font shape `JT1/gt/m/n' tried instead on input line 48.
+(Font)              Font shape `JT1/gt/m/n' tried instead on input line 82.
 LaTeX Font Info:    Font shape `JY1/mc/bx/n' in size <7> not available
-(Font)              Font shape `JY1/gt/m/n' tried instead on input line 48.
-
-LaTeX Warning: `!h' float specifier changed to `!ht'.
-
-) (./cerium.tex [2]
+(Font)              Font shape `JY1/gt/m/n' tried instead on input line 82.
+[2]) (./cerium.tex
 File: ./images/createTask.pdf Graphic file (type eps)
  <./images/createTask.pdf>
 Overfull \hbox (0.70938pt too wide) in paragraph at lines 25--26
  []
  []

-(./source/multiply.cc
+File: ./images/scheduler.pdf Graphic file (type eps)
+<./images/scheduler.pdf> (./source/multiply.cc
 LaTeX Font Info:    Font shape `JT1/mc/bx/n' in size <8> not available
 (Font)              Font shape `JT1/gt/m/n' tried instead on input line 1.
 LaTeX Font Info:    Font shape `JY1/mc/bx/n' in size <8> not available
 (Font)              Font shape `JY1/gt/m/n' tried instead on input line 1.
 LaTeX Font Info:    Try loading font information for OMS+cmr on input line 2.
- (/usr/local/share/texmf-dist/tex/latex/base/omscmr.fd
+
+(/usr/local/share/texmf-dist/tex/latex/base/omscmr.fd
 File: omscmr.fd 1999/05/25 v2.5h Standard LaTeX font definitions
 )
 LaTeX Font Info:    Font shape `OMS/cmr/m/n' in size <8> not available
@@ -234,49 +238,53 @@
 )
 LaTeX Font Info:    Font shape `OML/cmr/m/n' in size <8> not available
 (Font)              Font shape `OML/cmm/m/it' tried instead on input line 7.
-)
-Overfull \hbox (71.1005pt too wide) in paragraph at lines 41--41
+ [3])
+Overfull \hbox (71.1005pt too wide) in paragraph at lines 51--51
 []\OT1/cmr/m/n/9 multiply
  []


-Overfull \hbox (54.23717pt too wide) in paragraph at lines 46--56
+Overfull \hbox (17.26125pt too wide) in paragraph at lines 56--66
  []
  []

-(./source/multiply_task.cc [3])
-Overfull \hbox (53.93672pt too wide) in paragraph at lines 64--64
+(./source/multiply_task.cc)
+Overfull \hbox (53.93672pt too wide) in paragraph at lines 74--74
 []\OT1/cmr/m/n/9 task
  []

 (./source/multiply_task_dp.cc)
-Overfull \hbox (70.3366pt too wide) in paragraph at lines 102--102
+Overfull \hbox (70.3366pt too wide) in paragraph at lines 112--112
 []\OT1/cmr/m/n/9 example
  []

 ) (./cerium_gpu.tex
 File: ./images/gpu_arch.pdf Graphic file (type eps)
  <./images/gpu_arch.pdf>
-Overfull \hbox (18.77661pt too wide) in paragraph at lines 18--19
- []
+File: ./images/cpu_arch.pdf Graphic file (type eps)
+ <./images/cpu_arch.pdf> [4]
+(./source/Multi.cl)
+Overfull \hbox (112.5995pt too wide) in paragraph at lines 51--51
+[]\OT1/cmr/m/n/9 multiply(OpenCL)
  []

-File: ./images/cpu_arch.pdf Graphic file (type eps)
-<./images/cpu_arch.pdf>) (./benchmark.tex) (./conclusion.tex) (./sigos.bbl
-[4]) [5
+(./source/Multiply.cu)
+Overfull \hbox (105.65578pt too wide) in paragraph at lines 52--52
+[]\OT1/cmr/m/n/9 multiply(CUDA)
+ []

-] (./sigos.aux)
+) (./benchmark.tex [5]) (./conclusion.tex) (./sigos.bbl) [6] (./sigos.aux)

 LaTeX Warning: There were multiply-defined labels.

  )
 Here is how much of TeX's memory you used:
- 2765 strings out of 94681
- 36836 string characters out of 1169859
- 159801 words of memory out of 1500000
- 6012 multiletter control sequences out of 10000+50000
+ 2794 strings out of 94681
+ 37254 string characters out of 1169859
+ 155801 words of memory out of 1500000
+ 6035 multiletter control sequences out of 10000+50000
  18309 words of font info for 70 fonts, out of 1200000 for 2000
  566 hyphenation exceptions out of 8191
- 33i,10n,68p,250b,1708s stack positions out of 5000i,500n,6000p,200000b,5000s
+ 33i,10n,68p,250b,1809s stack positions out of 5000i,500n,6000p,200000b,5000s

-Output written on sigos.dvi (5 pages, 36532 bytes).
+Output written on sigos.dvi (6 pages, 50428 bytes).
Binary file sigos.pdf has changed
--- a/sigos.tex	Fri Apr 18 20:55:25 2014 +0900
+++ b/sigos.tex	Mon Apr 21 03:30:37 2014 +0900
@@ -89,7 +89,6 @@

 \input{introduction}
 \input{opencl}
-\input{cuda}
 \input{cerium}
 \input{cerium_gpu}
 \input{benchmark}
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/source/Multi.cl	Mon Apr 21 03:30:37 2014 +0900
@@ -0,0 +1,7 @@
+__kernel void
+multi(__global const long *params, __global const float* A, __global const float* B, __global float* C)
+{
+    // get index
+    long id = get_global_id(0);
+    C[id]=A[id]*B[id];
+}
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/source/multiply.cc	Mon Apr 21 03:30:37 2014 +0900
@@ -0,0 +1,24 @@
+void
+multi_init(TaskManager *manager)
+{
+    A = new float[length];
+    B = new float[length];
+    C = new float[length];
+    for(int i=0; i<length; i++) {
+        A[i]=(float)(i+1000);
+        B[i]=(float)(i+1)/10.f;
+    }
+
+    // create task
+    HTask* multiply = manager->create_task(MULTIPLY_TASK);
+    // set cputype
+    multiply->set_cpu(spe_cpu);
+    // set indata
+    multiply->set_inData(0,(memaddr)A, sizeof(float)*length);
+    multiply->set_inData(1,(memaddr)B, sizeof(float)*length);
+    // set outdata
+    multiply->set_outData(0,(memaddr)C, sizeof(float)*length);
+    multiply->set_param(0,(long)length);
+    // spawn task
+    multiply->iterate(length);
+}
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/source/multiply.cu	Mon Apr 21 03:30:37 2014 +0900
@@ -0,0 +1,7 @@
+extern "C" {
+    __global__ void multi(long* params, float* A, float* B, float* C) {
+        // calculate index
+        int id = blockIdx.x * blockDim.x + threadIdx.x;
+        C[id]=A[id]*B[id];
+    }
+}
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/source/multiply_task.cc	Mon Apr 21 03:30:37 2014 +0900
@@ -0,0 +1,16 @@
+static int
+run(SchedTask *s)
+{
+    // get input
+    float* A = (float*)s->get_input(0);
+    float* B = (float*)s->get_input(1);
+    // get output
+    float* C = (float*)s->get_output(0);
+    // get parameter
+    long length = (long)s->get_param(0);
+
+    for(int i=0;i<length;i++)
+        C[i]=A[i]*B[i];
+
+    return 0;
+}
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/source/multiply_task_dp.cc	Mon Apr 21 03:30:37 2014 +0900
@@ -0,0 +1,15 @@
+static int
+run(SchedTask *s)
+{
+    // get input
+    float* A = (float*)s->get_input(0);
+    float* B = (float*)s->get_input(1);
+    // get output
+    float* C = (float*)s->get_output(0);
+    // get index
+    long i = (long)s->x;
+
+    C[i]=A[i]*B[i];
+
+    return 0;
+}