% Paper/paper.tex, revision 5:17c01f69db69 (commit message: "finish")
% Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>, Mon, 23 Jul 2012 11:58:20 +0900
\documentclass[twocolumn,twoside,11pt]{article}
\usepackage[dvipdfmx]{graphicx}
\usepackage{url}
\usepackage{picins}
\usepackage{fancyhdr}
\pagestyle{fancy}
\lhead{\parpic{\includegraphics[height=1zw,clip,keepaspectratio]{pic/emblem-bitmap.eps}}Technical Reading \& Writing}
\rhead{}
\cfoot{}

\setlength{\topmargin}{-1in \addtolength{\topmargin}{20mm}}
\setlength{\headheight}{0mm}
\setlength{\headsep}{5mm}
\setlength{\oddsidemargin}{-1in \addtolength{\oddsidemargin}{20mm}}
\setlength{\evensidemargin}{-1in \addtolength{\evensidemargin}{20mm}}
\setlength{\textwidth}{171mm}
\setlength{\textheight}{256mm}
\setlength{\footskip}{0mm}
\pagestyle{empty}

\begin{document}
\title{Implementation of Cerium Parallel Task Manager on Multi-core}
\author{128569G Daichi TOMA}
\date{}
\maketitle
\thispagestyle{fancy}

\section{Introduction}
We have developed Cerium Task Manager\cite{gongo:2008a}, a game framework for the PlayStation 3/Cell\cite{cell}.
Cerium Task Manager now supports parallel execution on Mac OS X and Linux.
In this paper, we describe the implementation of the existing Cerium Task Manager and of the new parallel execution mechanism.

\section{Cerium Task Manager}\label{section:cerium}

Cerium Task Manager is a game framework that has been developed for the Cell, and it includes a rendering engine.
In Cerium Task Manager, parallel processing is described in terms of tasks.
A task usually consists of a function or subroutine, together with its data inputs, data outputs, and dependencies on other tasks.
Cerium Task Manager manages these tasks and executes them.

Cerium Task Manager is available on PlayStation 3, Linux, and Mac OS X,
and the same programs run on each platform.
Therefore, it is possible to write programs that do not depend on the architecture.
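
The task model described above can be illustrated with a small, serial C++ sketch. It is not Cerium's actual API: the names \verb|Task|, \verb|wait_for|, and \verb|run_all| are hypothetical, and a real task manager would hand ready tasks to worker cores instead of running them in a loop.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical task record (not Cerium's real API): a function plus its
// data inputs, data outputs, and the indices of the tasks it depends on.
struct Task {
    std::function<void(const std::vector<int>&, std::vector<int>&)> run;
    std::vector<int> input;
    std::vector<int> output;
    std::vector<std::size_t> wait_for;  // indices of prerequisite tasks
    bool done = false;
};

// Repeatedly run every task whose dependencies have finished.
// Returns the order in which the tasks were executed.
// If the dependencies contain a cycle, the loop stops early.
inline std::vector<std::size_t> run_all(std::vector<Task>& tasks) {
    std::vector<std::size_t> order;
    while (order.size() < tasks.size()) {
        bool progressed = false;
        for (std::size_t i = 0; i < tasks.size(); ++i) {
            Task& t = tasks[i];
            if (t.done) continue;
            bool ready = true;
            for (std::size_t dep : t.wait_for) ready = ready && tasks[dep].done;
            if (!ready) continue;
            t.run(t.input, t.output);  // compute outputs from inputs
            t.done = true;
            order.push_back(i);
            progressed = true;
        }
        if (!progressed) break;  // no runnable task left: cyclic dependency
    }
    return order;
}
```

In this sketch a task only becomes runnable once every task listed in \verb|wait_for| is done, which is the role the dependencies play in the description above.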

Cerium Task Manager builds pipelines at various levels of the program
to improve performance (Figure \ref{fig:scheduler}).

A task itself is very simple because it only computes its data outputs from its data inputs.
Nevertheless, switching those data inputs and outputs with double buffering,
and generating tasks gradually so as to obtain concurrency, is very complicated.

Additionally, this data management requires operations that are specific to the architecture on which the parallel execution runs\cite{yutaka:2011b}.
Cerium Task Manager takes care of such operations,
so the programmer can concentrate on implementing the parallel computation.
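
To make the double-buffering idea concrete, the following is a minimal C++ sketch under stated assumptions. The functions \verb|load_chunk| and \verb|sum_double_buffered| are hypothetical names, and the copy in \verb|load_chunk| is a synchronous stand-in for what would be an asynchronous DMA transfer on the Cell, so the sketch shows only the buffer switching, not real overlap of transfer and computation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a DMA load: copy up to `chunk` elements of `src`, starting at
// `begin`, into `buf`. Returns the number of elements actually copied.
inline std::size_t load_chunk(const std::vector<int>& src, std::size_t begin,
                              std::size_t chunk, std::vector<int>& buf) {
    std::size_t n = (begin < src.size()) ? std::min(chunk, src.size() - begin) : 0;
    std::copy(src.begin() + begin, src.begin() + begin + n, buf.begin());
    return n;
}

// Sum `src` chunk by chunk using two buffers: the next chunk is requested
// into the idle buffer before the current buffer is processed.
inline long sum_double_buffered(const std::vector<int>& src, std::size_t chunk) {
    std::vector<int> buf[2] = {std::vector<int>(chunk), std::vector<int>(chunk)};
    int cur = 0;
    std::size_t pos = 0;
    std::size_t n = load_chunk(src, pos, chunk, buf[cur]);  // load first chunk
    long total = 0;
    while (n > 0) {
        pos += n;
        // Request the next chunk into the other buffer.
        std::size_t next_n = load_chunk(src, pos, chunk, buf[1 - cur]);
        // Process the current buffer.
        for (std::size_t i = 0; i < n; ++i) total += buf[cur][i];
        cur = 1 - cur;  // switch buffers
        n = next_n;
    }
    return total;
}
```

Even in this small form, the bookkeeping (two buffers, a position, and a pending length) is noticeably more involved than the computation itself, which is the complication that the text attributes to the task manager rather than to the task.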

\begin{figure}[h]
\begin{center}
\includegraphics[scale=0.4]{./pic/scheduler.pdf}
\end{center}
\caption{Scheduler}
\label{fig:scheduler}
\end{figure}

\subsection{Cell Broadband Engine}
Cell Broadband Engine is a microprocessor architecture jointly developed by Sony, Sony Computer Entertainment, Toshiba, and IBM.
The first major commercial application of Cell Broadband Engine was in Sony's PlayStation 3 game console.
The Cell processor can be split into four components:
external input and output structures, the main processor called the Power Processing Element (PPE),
eight fully functional co-processors called the Synergistic Processing Elements or SPEs,
and a specialized high-bandwidth circular data bus connecting the PPE, input/output elements and the SPEs,
called the Element Interconnect Bus or EIB (Figure \ref{fig:cell_arch}).

\begin{figure}[htb]
\begin{center}
\includegraphics[scale=0.4]{./pic/cell-main.pdf}
\end{center}
\caption{Cell Broadband Engine Architecture}
\label{fig:cell_arch}
\end{figure}

The Cell processor marries the SPEs and the PPE via EIB to give access,
via fully cache coherent DMA (direct memory access), to both main memory and to other external data storage.
To make the best of EIB, and to overlap computation and data transfer,
each of the nine processing elements (PPE and SPEs) is equipped with a DMA engine.
Since the SPE's load/store instructions can only access its own local memory,
each SPE entirely depends on DMAs to transfer data to and from the main memory and other SPEs' local memories.
A DMA operation can transfer either a single block area of size up to 16KB, or a list of 2 to 2048 such blocks.
One of the major design decisions in the architecture of Cell is the use of DMAs as a central means of intra-chip data transfer,
with a view to enabling maximal asynchrony and concurrency in data processing inside a chip\cite{2006:CMC}.
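
As a worked check of the DMA limits quoted above, and assuming that KB denotes 1024 bytes, the largest amount of data that a single DMA list of 2048 blocks can describe is
\[
  2048 \times 16\,\mathrm{KB} = 32768\,\mathrm{KB} = 32\,\mathrm{MB}
  \quad (= 2^{25}\ \mathrm{bytes} = 33{,}554{,}432\ \mathrm{bytes}).
\]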

The PPE, which is capable of running a conventional operating system,
has control over the SPEs and can start, stop, interrupt, and schedule processes running on the SPEs.
To this end, the PPE has additional instructions relating to control of the SPEs.
Unlike the SPEs, the PPE can read and write the main memory and the local memories of the SPEs through standard load/store instructions.
Despite having Turing-complete architectures,
the SPEs are not fully autonomous and require the PPE to prime them before they can do any useful work.
Though most of the ``horsepower'' of the system comes from the synergistic processing elements,
the use of DMA as a method of data transfer and the limited local memory footprint of each SPE pose a major challenge
to software developers who wish to make the most of this horsepower,
demanding careful hand-tuning of programs to extract maximal performance from this CPU.

The PPE and bus architecture includes various modes of operation giving different levels of memory protection,
allowing areas of memory to be protected from access by specific processes running on the SPEs or the PPE.

Both the PPE and SPE are RISC architectures with a fixed-width 32-bit instruction format.
The PPE contains a 64-bit general purpose register set (GPR), a 64-bit floating point register set (FPR),
and a 128-bit Altivec register set. The SPE contains 128-bit registers only.
These can be used for scalar data types ranging from 8 bits to 128 bits
in size or for SIMD computations on a variety of integer and floating point formats.
System memory addresses for both the PPE and SPE are expressed as 64-bit values
for a theoretical address range of $2^{64}$ bytes (16 exabytes or 16,777,216 terabytes).
In practice, not all of these bits are implemented in hardware.
Local store addresses internal to the SPU processor are expressed as a 32-bit word.
In documentation relating to Cell, a word is always taken to mean 32 bits, a doubleword means 64 bits, and a quadword means 128 bits.
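
The address-range figures above follow directly from powers of two, assuming binary prefixes (one terabyte taken as $2^{40}$ bytes and one exabyte as $2^{60}$ bytes):
\[
  2^{64} = 2^{4} \times 2^{60} = 16 \times 2^{60}
  \quad\mbox{and}\quad
  2^{64} = 2^{24} \times 2^{40} = 16{,}777{,}216 \times 2^{40}.
\]
By the same arithmetic, a 32-bit local store address can distinguish $2^{32} = 4 \times 2^{30}$ bytes, that is, 4 GiB.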


\subsubsection{Power Processor Element (PPE)}
The PPE (Figure \ref{fig:ppe}) is the Power Architecture-based,
two-way multithreaded core acting as the controller for the eight SPEs,
which handle most of the computational workload. The PPE will work
with conventional operating systems due to its similarity to other 64-bit PowerPC processors,
while the SPEs are designed for vectorized floating point code execution.
The PPE contains a 64 KiB level 1 cache (32 KiB instruction and 32 KiB data) and a 512 KiB level 2 cache.
The size of a cache line is 128 bytes.
Each PPE can complete two double precision operations per clock cycle using a scalar fused multiply-add instruction,
which translates to 6.4 GFLOPS at 3.2 GHz;
or eight single precision operations per clock cycle with a vector fused multiply-add instruction,
which translates to 25.6 GFLOPS at 3.2 GHz.
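
The peak figures quoted above are simply the number of operations per clock cycle multiplied by the clock frequency. Here a fused multiply-add is counted as two operations, and the vector case assumes four 32-bit lanes in a 128-bit register ($4 \times 2 = 8$ operations per cycle):
\[
  \begin{array}{rcl}
  2 \times 3.2 \times 10^{9}\ \mathrm{s}^{-1} & = & 6.4\ \mathrm{GFLOPS} \quad \mbox{(double precision)},\\
  8 \times 3.2 \times 10^{9}\ \mathrm{s}^{-1} & = & 25.6\ \mathrm{GFLOPS} \quad \mbox{(single precision)}.
  \end{array}
\]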

\begin{figure}[htb]
\begin{center}
\includegraphics[scale=0.4]{./pic/PPE.pdf}
\end{center}
\caption{PPE (Power Processor Element)}
\label{fig:ppe}
\end{figure}

\subsubsection{Synergistic Processing Elements (SPE)}
Each SPE (Figure \ref{fig:spe}) is composed of a ``Synergistic Processing Unit'' (SPU) and a ``Memory Flow Controller'' (MFC), which comprises DMA, MMU, and bus interface\cite{cell-ibm}.
An SPE is a RISC processor with 128-bit SIMD organization\cite{cell-ieee} for single and double precision instructions.
With the current generation of the Cell, each SPE contains a 256 KiB embedded SRAM for instructions and data,
called ``Local Storage'' (not to be mistaken for ``Local Memory'' in Sony's documents, which refers to the VRAM),
which is visible to the PPE and can be addressed directly by software. Each SPE can support up to 4 GiB of local store memory.
The local store does not operate like a conventional CPU cache, since it is neither transparent
to software nor does it contain hardware structures that predict which data to load. Each SPE contains a 128-bit,
128-entry register file and measures 14.5 mm$^2$ on a 90 nm process.
An SPE can operate on sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers,
or four single-precision floating-point numbers in a single clock cycle, as well as a memory operation.
Note that the SPU cannot directly access system memory;
the 64-bit virtual memory addresses formed by the SPU must be passed from the SPU
to the SPE memory flow controller (MFC) to set up a DMA operation within the system address space.
At 3.2 GHz, each SPE gives a theoretical 25.6 GFLOPS of single precision performance.
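
The element counts quoted above are the number of lanes that fit into one 128-bit register:
\[
  \frac{128}{8} = 16, \qquad \frac{128}{16} = 8, \qquad \frac{128}{32} = 4.
\]
The single precision peak can be reproduced from the four 32-bit lanes under the assumption, not stated explicitly in the text, that each lane performs one fused multiply-add per cycle and that this is counted as two floating-point operations:
\[
  4 \times 2 \times 3.2 \times 10^{9}\ \mathrm{s}^{-1} = 25.6\ \mathrm{GFLOPS}.
\]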

\begin{figure}[htb]
\begin{center}
\includegraphics[scale=0.5]{./pic/SPE.pdf}
\end{center}
\caption{SPE (Synergistic Processing Element)}
\label{fig:spe}
\end{figure}

% TODO: add a description of the Cell

% \subsection{Mailbox}
% Mailbox is one of the features of the Cell.
% A Mailbox can pass 32-bit messages in both directions between the PPE and the SPEs,
% and it is structured as a FIFO queue.

\section{Mechanism of Parallel Execution on Multi-core}\label{section:impl}

On the PlayStation 3, tasks are assigned to the SPEs and executed in parallel.
Cerium Task Manager can now also execute tasks in parallel on Mac OS X and Linux.

We implemented a synchronized queue on Mac OS X and Linux.
The synchronized queue corresponds to the Mailbox on the PlayStation 3.
The queue is guarded by a binary semaphore so that only one thread uses it at a time.
Each thread has two synchronized queues, one for input and one for output,
and executes in parallel the tasks that it receives from the management thread.

Furthermore, unlike the PlayStation 3, the cores of a multi-core machine share the same memory space,
so we changed the places that used DMA transfers to pass pointers instead, with the aim of improving speed.
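
The mechanism described above can be sketched in portable C++11. This is an illustrative sketch, not the Cerium source: the class and function names (\verb|BinarySemaphore|, \verb|SyncQueue|, \verb|worker|) are hypothetical, the semaphore is built from a mutex and a condition variable, and the use of a null pointer as a stop message is an assumption made for the example.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>

// Minimal binary semaphore (its state is "available" or "taken").
class BinarySemaphore {
public:
    explicit BinarySemaphore(bool available) : available_(available) {}
    void acquire() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return available_; });
        available_ = false;
    }
    void release() {
        {
            std::lock_guard<std::mutex> lock(m_);
            available_ = true;
        }
        cv_.notify_one();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool available_;
};

// Mailbox-like FIFO queue: the semaphore admits one thread at a time.
template <typename T>
class SyncQueue {
public:
    SyncQueue() : guard_(true) {}
    void send(const T& v) {
        guard_.acquire();
        items_.push_back(v);
        guard_.release();
    }
    // Non-blocking receive; returns false when the queue is empty.
    bool try_recv(T& out) {
        guard_.acquire();
        bool ok = !items_.empty();
        if (ok) {
            out = items_.front();
            items_.pop_front();
        }
        guard_.release();
        return ok;
    }
private:
    BinarySemaphore guard_;
    std::deque<T> items_;
};

// Worker loop with one input and one output queue. Data is passed by
// pointer into shared memory instead of being copied.
inline void worker(SyncQueue<int*>& in, SyncQueue<int*>& out) {
    for (;;) {
        int* p = nullptr;
        if (!in.try_recv(p)) {
            std::this_thread::yield();  // nothing to do yet
            continue;
        }
        if (p == nullptr) return;  // assumed stop message
        *p *= 2;                   // the "task": double the value in place
        out.send(p);
    }
}
```

In a multi-threaded setting, the management thread would start each \verb|worker| on its own \verb|std::thread| and give it its own pair of queues. Because only pointers travel through the queues, no data is copied between the management thread and the workers, which corresponds to replacing DMA transfers with pointer passing.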

\section{Benchmark}

We measured performance using three examples: Word Count, Sort, and Prime Counter.
Word Count counts the number of words in a 100 MB text file.
Sort sorts one hundred thousand numbers.
Prime Counter enumerates all the prime numbers up to one million.
For comparison, we measured performance using the same examples on the PlayStation 3.
In both environments, the compiler optimization level was set to the maximum.

The results are shown in Table \ref{table:benchmark}.

{\bf Experimental environment}

CentOS/Xeon
\begin{small}
\begin{itemize}\small
\item OS : CentOS 6.0
\item CPU : Intel\textregistered{} Xeon\textregistered{} X5650 @ 2.67GHz $\times$ 2
\item Memory : 128GB
\item Compiler : GCC 4.4.4
\end{itemize}
\end{small}

PlayStation 3/Cell
\begin{small}
\begin{itemize}\small
\item OS : Yellow Dog Linux 6.1
\item CPU : Cell Broadband Engine @ 3.2GHz
\item Memory : 256MB
\item Compiler : GCC 4.1.2
\end{itemize}
\end{small}

\begin{table}[h]
\caption{Benchmark}
\label{table:benchmark}
{\scriptsize
\begin{tabular}[t]{c||r|r|r}
\hline
& Word Count & Sort & Prime Counter\\
\hline\hline
1 CPU (Cell)& 2381 ms & 6244 ms & 2081 ms \\
\hline
6 CPU (Cell)& 1268 ms & 1111 ms & 604 ms\\
\hline
1 CPU (Xeon)& 354 ms & 846 ms & 266 ms\\
\hline
6 CPU (Xeon)& 70 ms & 163 ms & 50 ms\\
\hline
12 CPU (Xeon)& 48 ms & 127 ms & 36 ms\\
\hline
24 CPU (Xeon)& 40 ms & 100 ms & 31 ms\\
\hline
\end{tabular}}
\end{table}
c0689037215f first commit
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents:
diff changeset
239
c0689037215f first commit
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents:
diff changeset
240 % Word Count 354 / 70 = 5.0571
c0689037215f first commit
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents:
diff changeset
241 % Sort 846 / 163 = 5.1901
c0689037215f first commit
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents:
diff changeset
242 % Prime Counter 266 / 50 = 5.32
2
7efb3ef94295 add a section of benchmark
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents: 1
diff changeset
243
7efb3ef94295 add a section of benchmark
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents: 1
diff changeset
244 On CentOS, using 6 CPUs instead of 1 CPU (the Xeon rows of Table~\ref{table:benchmark}) yields
7efb3ef94295 add a section of benchmark
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents: 1
diff changeset
245 a speedup of about 5.1 times for Word Count,
7efb3ef94295 add a section of benchmark
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents: 1
diff changeset
246 about 5.2 times for Sort,
7efb3ef94295 add a section of benchmark
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents: 1
diff changeset
247 and about 5.3 times for Prime Counter.
7efb3ef94295 add a section of benchmark
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents: 1
diff changeset
248 With 24 CPUs, execution is still faster than with 12 CPUs, but the rate of improvement drops.
7efb3ef94295 add a section of benchmark
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents: 1
diff changeset
249 This is probably because the degree of parallelism is low, so the speedup levels off as predicted by Amdahl's law\cite{amdahl}.
7efb3ef94295 add a section of benchmark
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents: 1
diff changeset
250 Improving the parallelization rate is a challenge for the future.
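Amdahl's law can be made concrete with the Xeon rows of Table~\ref{table:benchmark}.
As a rough sketch, assuming a fixed parallel fraction $p$ (a symbol introduced here only for illustration), the speedup on $n$ CPUs is
\begin{equation}
S(n) = \frac{T(1)}{T(n)} = \frac{1}{(1-p) + p/n}, \qquad \lim_{n\to\infty} S(n) = \frac{1}{1-p}.
\end{equation}
For Word Count, for example, the measured times give $S(6) = 354/70 \approx 5.1$, $S(12) = 354/48 \approx 7.4$, and $S(24) = 354/40 = 8.85$, so doubling the CPUs from 12 to 24 raises the speedup by only about 20\%.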
0
c0689037215f first commit
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents:
diff changeset
251
c0689037215f first commit
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents:
diff changeset
252 % In addition, Figure~\ref{fig:multi_result} confirms the speedup gained by increasing the number of processors.
c0689037215f first commit
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents:
diff changeset
253
3
4fc34730ac45 add section of conclusions
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents: 2
diff changeset
254 \section{Conclusions}
4fc34730ac45 add section of conclusions
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents: 2
diff changeset
255 In this paper, we described a new parallel execution mechanism and its implementation in the existing Cerium Task Manager.
4fc34730ac45 add section of conclusions
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents: 2
diff changeset
256 With this new parallel execution mechanism, Cerium Task Manager supports multi-core processor environments on Mac OS X and Linux.
0
c0689037215f first commit
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents:
diff changeset
257
3
4fc34730ac45 add section of conclusions
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents: 2
diff changeset
258 Improving the speedup rate as the number of processors increases is future work.
4fc34730ac45 add section of conclusions
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents: 2
diff changeset
259 In addition, Cerium Task Manager has many types of tasks, which makes task descriptions complicated; this is a drawback.
4fc34730ac45 add section of conclusions
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents: 2
diff changeset
260 This can be solved by having the system, rather than the user, describe the dependencies between tasks.
0
c0689037215f first commit
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents:
diff changeset
261
5
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents: 4
diff changeset
262 \nocite{cell_abi, opencl, clay200912, cell_wiki, cell_cpp, cell_sdk, libspe2}
0
c0689037215f first commit
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents:
diff changeset
263 % \nocite{yutaka:2010a, cell_abi, cell_cpp, cell_sdk, libspe2, ydl, clay200912, fix200609}
c0689037215f first commit
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents:
diff changeset
264 \bibliographystyle{junsrt}
c0689037215f first commit
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents:
diff changeset
265 \bibliography{cerium.bib,book.bib}
c0689037215f first commit
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents:
diff changeset
266
c0689037215f first commit
Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
parents:
diff changeset
267 \end{document}