\date{July 23, 2012}

\lhead{\parpic{\includegraphics[height=1zw,clip,keepaspectratio]{pic/emblem-bitmap.eps}}Technical Reading \& Writing}

\setlength{\topmargin}{-1in \addtolength{\topmargin}{20mm}}
\setlength{\oddsidemargin}{-1in \addtolength{\oddsidemargin}{20mm}}
\setlength{\evensidemargin}{-1in \addtolength{\evensidemargin}{20mm}}

\title{Implementation of Cerium Parallel Task Manager on Multi-core}
\author{128569G Daichi TOMA}

We have developed the Cerium Task Manager\cite{gongo:2008a}, a game framework for the PlayStation 3/Cell\cite{cell}.
Cerium Task Manager now supports parallel execution on Mac OS X and Linux as well.
In this paper, we describe the implementation of the existing Cerium Task Manager and the new parallel execution mechanism.

\section{Cerium Task Manager}\label{section:cerium}

Cerium Task Manager is a game framework developed for the Cell, and it includes a Rendering Engine.
In Cerium Task Manager, parallel processing is described in terms of tasks.
A task usually consists of a function or subroutine, and each task is assigned data inputs, data outputs, and dependencies.
Cerium Task Manager manages these tasks and executes them.

Cerium Task Manager is available on PlayStation 3, Linux, and Mac OS X,
and the same programs run on each platform.
It is therefore possible to write programs that do not depend on the architecture.

Cerium Task Manager configures pipelines at various levels of the program,
which improves performance (Figure \ref{fig:scheduler}).

A task itself is very simple because it only computes data outputs from data inputs;
nevertheless, switching those data inputs and outputs as double buffers,
and generating tasks gradually so as to obtain concurrency, is very complicated.

Additionally, managing these data requires operations specialized to the architecture used for parallel execution\cite{yutaka:2011b}.
Cerium Task Manager takes care of such operations,
so the programmer can concentrate on the implementation of the parallel computation.


\subsection{Cell Broadband Engine}
Cell Broadband Engine is a microprocessor architecture jointly developed by Sony, Sony Computer Entertainment, Toshiba, and IBM.
The first major commercial application of Cell Broadband Engine was in Sony's PlayStation 3 game console.
The Cell processor can be split into four components:
external input and output structures, the main processor called the Power Processing Element (PPE),
eight fully functional co-processors called the Synergistic Processing Elements or SPEs,
and a specialized high-bandwidth circular data bus connecting the PPE, input/output elements and the SPEs,
called the Element Interconnect Bus or EIB (Figure \ref{fig:cell_arch}).

\begin{figure}[htb]
  \centering
  \caption{Cell Broadband Engine Architecture}
  \label{fig:cell_arch}
\end{figure}

The Cell processor marries the SPEs and the PPE via EIB to give access,
via fully cache coherent DMA (direct memory access), to both main memory and to other external data storage. 
To make the best of EIB, and to overlap computation and data transfer,
each of the nine processing elements (PPE and SPEs) is equipped with a DMA engine.
Since the SPE's load/store instructions can only access its own local memory,
each SPE entirely depends on DMAs to transfer data to and from the main memory and other SPEs' local memories.
A DMA operation can transfer either a single block area of size up to 16KB, or a list of 2 to 2048 such blocks.
One of the major design decisions in the architecture of Cell is the use of DMAs as a central means of intra-chip data transfer,
with a view to enabling maximal asynchrony and concurrency in data processing inside a chip\cite{2006:CMC}.

The PPE, which is capable of running a conventional operating system, 
has control over the SPEs and can start, stop, interrupt, and schedule processes running on the SPEs.
To this end the PPE has additional instructions relating to control of the SPEs. 
Unlike SPEs, the PPE can read and write the main memory and the local memories of SPEs through the standard load/store instructions.
Despite having Turing complete architectures, 
the SPEs are not fully autonomous and require the PPE to prime them before they can do any useful work.
Though most of the "horsepower" of the system comes from the synergistic processing elements,
the use of DMA as a method of data transfer and the limited local memory footprint of each SPE pose a major challenge
to software developers who wish to make the most of this horsepower,
demanding careful hand-tuning of programs to extract maximal performance from this CPU.

The PPE and bus architecture includes various modes of operation giving different levels of memory protection,
allowing areas of memory to be protected from access by specific processes running on the SPEs or the PPE.

Both the PPE and SPE are RISC architectures with a fixed-width 32-bit instruction format.
The PPE contains a 64-bit general purpose register set (GPR), a 64-bit floating point register set (FPR),
and a 128-bit Altivec register set. The SPE contains 128-bit registers only.
These can be used for scalar data types ranging from 8-bits to 128-bits 
in size or for SIMD computations on a variety of integer and floating point formats.
System memory addresses for both the PPE and SPE are expressed as 64-bit values
for a theoretical address range of $2^{64}$ bytes (16 exabytes or 16,777,216 terabytes).
In practice, not all of these bits are implemented in hardware.
Local store addresses internal to the SPU processor are expressed as a 32-bit word.
In documentation relating to Cell a word is always taken to mean 32 bits, a doubleword means 64 bits, and a quadword means 128 bits.

\subsubsection{Power Processor Element (PPE)}
The PPE (Figure \ref{fig:ppe}) is a Power Architecture-based,
two-way multithreaded core acting as the controller for the eight SPEs,
which handle most of the computational workload. The PPE will work 
with conventional operating systems due to its similarity to other 64-bit PowerPC processors, 
while the SPEs are designed for vectorized floating point code execution. 
The PPE contains a 64 KiB Level 1 cache (32 KiB instruction and 32 KiB data) and a 512 KiB Level 2 cache.
The size of a cache line is 128 bytes.
Each PPE can complete two double precision operations per clock cycle using a scalar-fused multiply-add instruction,
which translates to 6.4 GFLOPS at 3.2 GHz;
or eight single precision operations per clock cycle with a vector fused-multiply-add instruction,
which translates to 25.6 GFLOPS at 3.2 GHz.
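The quoted peak figures are simply operations per cycle multiplied by the clock rate:

\[
2 \times 3.2\,\mathrm{GHz} = 6.4\ \mathrm{GFLOPS},
\qquad
8 \times 3.2\,\mathrm{GHz} = 25.6\ \mathrm{GFLOPS}.
\]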

\begin{figure}[htb]
  \centering
  \caption{PPE (Power Processor Element)}
  \label{fig:ppe}
\end{figure}

\subsubsection{Synergistic Processing Elements (SPE)}
Each SPE (Figure \ref{fig:spe}) is composed of a "Synergistic Processing Unit", SPU, and a "Memory Flow Controller", MFC (DMA, MMU, and bus interface)\cite{cell-ibm}.
An SPE is a RISC processor with 128-bit SIMD organization\cite{cell-ieee} for single and double precision instructions.
With the current generation of the Cell, each SPE contains a 256 KiB embedded SRAM for instruction and data,
called "Local Storage" (not to be mistaken for "Local Memory" in Sony's documents that refer to the VRAM) 
which is visible to the PPE and can be addressed directly by software. Each SPE can support up to 4 GiB of local store memory. 
The local store does not operate like a conventional CPU cache since it is neither transparent 
to software nor does it contain hardware structures that predict which data to load. Each SPE contains a 128-bit,
128-entry register file and measures 14.5 mm$^2$ on a 90 nm process.
An SPE can operate on sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, 
or four single-precision floating-point numbers in a single clock cycle, as well as a memory operation. 
Note that the SPU cannot directly access system memory; 
the 64-bit virtual memory addresses formed by the SPU must be passed from the SPU 
to the SPE memory flow controller (MFC) to set up a DMA operation within the system address space.
At 3.2 GHz, each SPE gives a theoretical 25.6 GFLOPS of single precision performance.

\begin{figure}[htb]
  \centering
  \caption{SPE (Synergistic Processing Element)}
  \label{fig:spe}
\end{figure}

% Insert a description of the Cell here

% \subsection{Mailbox}
% The Mailbox is one of the features of the Cell.
% The Mailbox allows bidirectional passing of 32-bit messages between the PPE and the SPEs,
% and is structured as a FIFO queue.

\section{Mechanism of Parallel Execution on Multi-core}\label{section:impl}

On the PlayStation 3, tasks are assigned to each SPE and executed in parallel.
Cerium Task Manager can now also execute tasks in parallel on Mac OS X and Linux.

We implemented a synchronized queue for Mac OS X and Linux.
The synchronized queue corresponds to the Mailbox on the PlayStation 3.
It is guarded by a binary semaphore so that only one thread uses it at a time.
Each thread has two synchronized queues, one for input and one for output,
so that tasks received from the manager thread can be executed in parallel.

Furthermore, because all cores of a multi-core machine share the same memory space, unlike the PlayStation 3,
we modified the places that had used DMA transfers to pass a pointer instead, aiming to improve speed.


Performance was measured using the Word Count, Sort, and Prime Counter examples.
Word Count counts the number of words in a 100 MB text file.
Sort sorts one hundred thousand numbers.
Prime Counter enumerates all prime numbers up to one million.
For comparison, performance was also measured with the same examples on the PlayStation 3.
In both cases, the optimization level was set to the maximum.

The results are shown in Table \ref{table:benchmark}.

{\bf Experiment environment}

Intel Xeon
\begin{itemize}
\item OS : CentOS 6.0
\item CPU : Intel\textregistered Xeon\textregistered X5650 @ 2.67GHz * 2
\item Memory : 128GB
\item Compiler : GCC 4.4.4
\end{itemize}

PlayStation 3/Cell
\begin{itemize}
\item OS : Yellow Dog Linux 6.1
\item CPU : Cell Broadband Engine @ 3.2GHz
\item Memory : 256MB
\item Compiler : GCC 4.1.2
\end{itemize}

\begin{table}[htb]
\centering
\begin{tabular}{|l|r|r|r|} \hline
 & Word Count & Sort & Prime Counter \\ \hline
1 CPU (Cell) & 2381 ms & 6244 ms & 2081 ms \\ \hline
6 CPU (Cell) & 1268 ms & 1111 ms & 604 ms \\ \hline
1 CPU (Xeon) & 354 ms & 846 ms & 266 ms \\ \hline
6 CPU (Xeon) & 70 ms & 163 ms & 50 ms \\ \hline
12 CPU (Xeon) & 48 ms & 127 ms & 36 ms \\ \hline
24 CPU (Xeon) & 40 ms & 100 ms & 31 ms \\ \hline
\end{tabular}
\caption{Benchmark results}
\label{table:benchmark}
\end{table}

% Word Count 	354 / 70 = 5.0571
% Sort		846 / 163 = 5.1901
% Prime Counter 266 / 50 = 5.32

With 6 CPUs on CentOS, compared with the 1 CPU case,
we obtained a speedup of about 5.1 times for Word Count,
about 5.2 times for Sort,
and about 5.3 times for Prime Counter.
With 24 CPUs the speed still improves compared with 12 CPUs, but the rate of improvement drops.
This is probably because the parallelized fraction of the program is limited, so the speedup levels off as predicted by Amdahl's law\cite{amdahl}.
Improving the parallelization rate is a challenge for the future.
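This behavior follows Amdahl's law: if a fraction $P$ of the program is parallelizable, the speedup on $N$ processors is bounded by

\[
S(N) = \frac{1}{(1 - P) + P/N}.
\]

As a back-of-the-envelope estimate from Table \ref{table:benchmark}, the 5.1-fold speedup of Word Count on 6 CPUs corresponds to $P \approx 0.96$, which caps the achievable speedup at $1/(1-P) \approx 28$ no matter how many processors are added.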

% Also, Figure \ref{fig:multi_result} confirms the scaling with the number of cores.

In this paper, we described the implementation of the existing Cerium Task Manager and a new mechanism for parallel execution.
With the new mechanism, Cerium Task Manager can run in multi-core processor environments on Mac OS X and Linux.

Future work includes improving the speedup as the number of processors increases.
In addition, Cerium Task Manager requires many types of tasks, which makes programs cumbersome to describe.
This could be solved by having the system, rather than the user, describe the dependencies between tasks.

\nocite{cell_abi, opencl, clay200912, cell_wiki, cell_cpp, cell_sdk, libspe2}
% \nocite{yutaka:2010a, cell_abi, cell_cpp, cell_sdk, libspe2, ydl, clay200912, fix200609}