% author: Daichi TOMA
% date: Sun, 22 Jul 2012 22:08:59 +0900

\lhead{\parpic{\includegraphics[height=1zw,clip,keepaspectratio]{pic/emblem-bitmap.eps}}Technical Reading \& Writing}

\setlength{\topmargin}{-1in \addtolength{\topmargin}{15mm}}
\setlength{\oddsidemargin}{-1in \addtolength{\oddsidemargin}{15mm}}
\setlength{\evensidemargin}{-1in \addtolength{\evensidemargin}{15mm}}

\title{Implementation of Cerium Parallel Task Manager on Multi-core}
\author{128569G Daichi TOMA}

We have developed Cerium Task Manager\cite{gongo:2008a}, a game framework for the PlayStation 3/Cell\cite{cell}.
Cerium Task Manager now supports parallel execution on Mac OS X and Linux.
In this paper, we describe the implementation of the existing Cerium Task Manager and of the new parallel execution mechanism.

\section{Cerium Task Manager}\label{section:cerium}

Cerium Task Manager is a game framework developed for the Cell, and it includes a rendering engine.
In Cerium Task Manager, a unit of parallel processing is described as a task.
A task usually consists of a function or subroutine, together with its data inputs, data outputs, and dependencies on other tasks.
Cerium Task Manager manages these tasks and executes them.
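
The task model described above can be illustrated with a small sketch.
This is not the Cerium API: the names \verb|Task|, \verb|wait_for|, and \verb|run_all| are hypothetical, and the scheduler below runs ready tasks one after another instead of in parallel.
It only shows how a task bundles a function with its data inputs, data outputs, and dependencies.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical task record (NOT the Cerium API, illustration only).
struct Task {
    // The task body: computes data outputs from data inputs.
    std::function<void(const std::vector<int>&, std::vector<int>&)> run;
    std::vector<int> input;             // data inputs
    std::vector<int> output;            // data outputs
    std::vector<std::size_t> wait_for;  // indices of tasks that must finish first
    bool done = false;
};

// Repeatedly run every task whose dependencies are all finished.
// Returns the order in which the tasks were executed; the result is
// shorter than tasks.size() if the dependencies contain a cycle.
std::vector<std::size_t> run_all(std::vector<Task>& tasks) {
    std::vector<std::size_t> order;
    bool progressed = true;
    while (order.size() < tasks.size() && progressed) {
        progressed = false;
        for (std::size_t i = 0; i < tasks.size(); ++i) {
            Task& t = tasks[i];
            if (t.done) continue;
            bool ready = true;
            for (std::size_t d : t.wait_for) {
                assert(d < tasks.size());
                if (!tasks[d].done) ready = false;
            }
            if (!ready) continue;
            t.run(t.input, t.output);
            t.done = true;
            order.push_back(i);
            progressed = true;
        }
    }
    return order;
}
```

A real task manager would hand every ready task to an idle worker instead of calling \verb|run| directly.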

Cerium Task Manager is available on PlayStation 3, Linux, and Mac OS X,
and the same programs run on each platform.
It is therefore possible to write programs that do not depend on the architecture.

Cerium Task Manager configures pipelines at various levels of the program,
which improves performance (Figure \ref{fig:scheduler}).

A task itself is very simple, because it only computes data outputs from data inputs.
Nevertheless, switching those data inputs and outputs by double buffering,
and generating tasks gradually so as to obtain concurrency, is very complicated.

Additionally, managing these data requires operations that are specific to the architecture used for parallel execution\cite{yutaka:2011b}.
Cerium Task Manager takes care of such operations,
so the programmer can concentrate on implementing the parallel computation.


% Add a description of the Cell

% \subsection{Mailbox}
% The Mailbox is one of the functions of the Cell.
% The Mailbox can pass 32-bit messages in both directions between the PPE and the SPE,
% and it is structured as a FIFO queue.

\section{Mechanism of Parallel Execution on Multi-core}\label{section:impl}

On the PlayStation 3, tasks are assigned to the SPEs and executed in parallel.
Cerium Task Manager can now also execute tasks in parallel on Mac OS X and Linux.

We implemented a synchronized queue on Mac OS X and Linux.
The synchronized queue corresponds to the Mailbox on the PlayStation 3.
So that only one thread uses the synchronized queue at a time, access to it is guarded by a binary semaphore.
Each thread has two synchronized queues, one for input and one for output,
and executes in parallel the tasks it receives from the management thread.
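
The following is a minimal sketch of such a synchronized queue, not the actual Cerium implementation.
The names \verb|BinarySemaphore|, \verb|SynchronizedQueue|, \verb|worker|, and \verb|sum_of_squares| are hypothetical, the semaphore is built from a standard mutex and condition variable, and a negative value is assumed as a shutdown message.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// Minimal binary semaphore built from a mutex and a condition variable.
class BinarySemaphore {
public:
    explicit BinarySemaphore(bool available) : available_(available) {}
    void acquire() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return available_; });
        available_ = false;
    }
    void release() {
        {
            std::lock_guard<std::mutex> lock(m_);
            available_ = true;
        }
        cv_.notify_one();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool available_;
};

// Hypothetical synchronized queue: the semaphore ensures that only one
// thread touches the underlying deque at a time.
template <typename T>
class SynchronizedQueue {
public:
    void send(const T& v) {
        guard_.acquire();
        items_.push_back(v);
        guard_.release();
    }
    // Polls until an item is available, then removes and returns it.
    T recv() {
        for (;;) {
            guard_.acquire();
            if (!items_.empty()) {
                T v = items_.front();
                items_.pop_front();
                guard_.release();
                return v;
            }
            guard_.release();
            std::this_thread::yield();  // queue empty: let other threads run
        }
    }
private:
    BinarySemaphore guard_{true};
    std::deque<T> items_;
};

// Worker thread with one input and one output queue. Squaring the
// received value stands in for executing a task.
void worker(SynchronizedQueue<int>& in, SynchronizedQueue<int>& out) {
    for (;;) {
        int n = in.recv();
        if (n < 0) break;  // assumed shutdown message
        out.send(n * n);
    }
}

// Management side: send the work, collect the results, stop the worker.
int sum_of_squares(int count) {
    assert(count >= 0);
    SynchronizedQueue<int> in, out;
    std::thread t(worker, std::ref(in), std::ref(out));
    for (int i = 1; i <= count; ++i) in.send(i);
    int sum = 0;
    for (int i = 0; i < count; ++i) sum += out.recv();
    in.send(-1);
    t.join();
    return sum;
}
```

The polling loop in \verb|recv| keeps the sketch short; a blocking wait on a second semaphore or condition variable would avoid spinning on an empty queue.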

Furthermore, unlike the PlayStation 3, the cores of a multi-core machine share the same memory space,
so we replaced the places that used DMA transfer with pointer passing in order to improve speed.
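
The difference between the two ways of handing data to a task can be sketched as follows.
This is an illustration under assumptions, not Cerium code: \verb|TaskInput|, \verb|sum_with_copy|, and \verb|sum_with_pointer| are hypothetical names, and \verb|memcpy| merely stands in for a DMA transfer into a local store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical message describing the input of one task.
struct TaskInput {
    const int* data;    // start of the input data
    std::size_t count;  // number of elements
};

// Cell-style: the worker has its own local store, so the input is first
// copied in (memcpy stands in for the DMA transfer here).
long sum_with_copy(const TaskInput& msg) {
    std::vector<int> local(msg.count);
    if (msg.count > 0)
        std::memcpy(local.data(), msg.data, msg.count * sizeof(int));
    long sum = 0;
    for (int v : local) sum += v;
    return sum;
}

// Shared-memory style: the worker reads the input directly through the
// pointer, so no copy is made.
long sum_with_pointer(const TaskInput& msg) {
    long sum = 0;
    for (std::size_t i = 0; i < msg.count; ++i) sum += msg.data[i];
    return sum;
}
```

Both functions return the same result; the second one avoids the copy, which is the saving that pointer passing aims at.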


Performance was measured using three examples: Word Count, Sort, and Prime Counter.
Word Count counts the number of words in a 100 MB text file.
Sort sorts one hundred thousand numbers.
Prime Counter enumerates all the prime numbers up to one million.
For comparison, performance was also measured using the same examples on the PlayStation 3.
In both environments, the compiler optimization level was set to the maximum.
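
To give an idea of the kind of work such a benchmark divides among CPUs, the following sketch counts primes with several threads.
It is not the benchmark program used in the measurement: \verb|is_prime| and \verb|count_primes| are hypothetical names, and plain threads with trial division are used instead of Cerium tasks.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Trial division; sufficient for a sketch with limits around one million.
bool is_prime(unsigned n) {
    if (n < 2) return false;
    for (unsigned d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Count the primes in [2, limit] using `workers` threads. The range is cut
// into `workers` contiguous chunks, one per thread.
unsigned count_primes(unsigned limit, unsigned workers) {
    assert(workers > 0);
    // Number of candidates 2, 3, ..., limit.
    const unsigned long long total = (limit >= 2) ? limit - 1ULL : 0ULL;
    std::vector<unsigned> partial(workers, 0);
    std::vector<std::thread> threads;
    for (unsigned w = 0; w < workers; ++w) {
        threads.emplace_back([&partial, w, workers, total] {
            const unsigned long long lo = 2 + total * w / workers;
            const unsigned long long hi = 2 + total * (w + 1) / workers;
            unsigned local = 0;
            for (unsigned long long n = lo; n < hi; ++n)
                if (is_prime(static_cast<unsigned>(n))) ++local;
            partial[w] = local;  // each thread writes only its own slot
        });
    }
    for (std::thread& t : threads) t.join();
    unsigned sum = 0;
    for (unsigned c : partial) sum += c;
    return sum;
}
```

Because larger numbers take longer to test, contiguous chunks are not perfectly balanced; finer-grained tasks would spread the load more evenly.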

The results are shown in Table \ref{table:benchmark}.

{\bf Experiment environment}

\begin{itemize}
\item OS : CentOS 6.0
\item CPU : Intel\textregistered{} Xeon\textregistered{} X5650 @ 2.67GHz $\times$ 2
\item Memory : 128GB
\item Compiler : GCC 4.4.4
\end{itemize}

PlayStation 3/Cell
\begin{itemize}
\item OS : Yellow Dog Linux 6.1
\item CPU : Cell Broadband Engine @ 3.2GHz
\item Memory : 256MB
\item Compiler : GCC 4.1.2
\end{itemize}

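\begin{table}[htbp]
\begin{center}
\caption{Benchmark results}
\label{table:benchmark}
\begin{tabular}{|l|r|r|r|}
\hline
 & Word Count & Sort & Prime Counter \\
\hline
1 CPU (Cell) & 2381 ms & 6244 ms & 2081 ms \\
6 CPU (Cell) & 1268 ms & 1111 ms & 604 ms \\
1 CPU (Xeon) & 354 ms & 846 ms & 266 ms \\
6 CPU (Xeon) & 70 ms & 163 ms & 50 ms \\
12 CPU (Xeon) & 48 ms & 127 ms & 36 ms \\
24 CPU (Xeon) & 40 ms & 100 ms & 31 ms \\
\hline
\end{tabular}
\end{center}
\end{table}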

% Word Count 	354 / 70 = 5.0571
% Sort		846 / 163 = 5.1901
% Prime Counter 266 / 50 = 5.32

With 6 CPUs on CentOS, compared with 1 CPU,
the speedup is about 5.1 times for Word Count,
about 5.2 times for Sort,
and about 5.3 times for Prime Counter.
With 24 CPUs the speed is higher than with 12 CPUs; however, the rate of improvement drops.
This is probably because the degree of parallelism is low, so that the speedup levels off as predicted by Amdahl's law\cite{amdahl}.
Improving the parallelization rate is future work.
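
As a rough check of this explanation, the parallel fraction can be estimated from a measured speedup.
The symbols $p$ (fraction of the run time that can be parallelized), $N$ (number of CPUs), and $S(N)$ (speedup on $N$ CPUs) are introduced here only for this estimate, and the model ignores scheduling and memory effects.
\begin{equation}
S(N) = \frac{1}{(1 - p) + p/N},
\qquad
p = \frac{1 - 1/S(N)}{1 - 1/N}
\end{equation}
For Word Count on the Xeon, $S(6) = 354/70 \approx 5.06$, which gives $p \approx (1 - 0.198)/(1 - 0.167) \approx 0.96$.
Under this model the speedup can never exceed $1/(1 - p) \approx 27$, however many CPUs are used.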

% Also, the scaling effect with the number of processors can be confirmed from Figure \ref{fig:multi_result}.

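In this paper, we described the implementation of the existing Cerium Task Manager and the new mechanism for parallel execution.
By using the newly implemented parallel execution mechanism, Cerium Task Manager can support multiprocessor environments on Mac OS X and Linux.

As future work, we will raise the parallelization rate and improve the speedup obtained as the number of processors grows.
In addition, the current Cerium Task Manager has drawbacks: the number of kinds of Task keeps growing, and writing programs with it is cumbersome even compared with OpenCL\cite{opencl}.
We believe this can be solved by having the system, rather than the user, describe the dependencies between Tasks.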

% \nocite{yutaka:2010a, cell_abi, cell_cpp, cell_sdk, libspe2, ydl, clay200912, fix200609}