comparison Paper/paper.tex @ 5:17c01f69db69 draft default tip

finish
author Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp>
date Mon, 23 Jul 2012 11:58:20 +0900
parents 03e644cc3366
4:03e644cc3366 5:17c01f69db69
1 \documentclass[twocolumn,twoside,9.5pt]{article} 1 \documentclass[twocolumn,twoside,11pt]{article}
2 \usepackage[dvipdfmx]{graphicx} 2 \usepackage[dvipdfmx]{graphicx}
3 \usepackage{url} 3 \usepackage{url}
4 \usepackage{picins} 4 \usepackage{picins}
5 \usepackage{fancyhdr} 5 \usepackage{fancyhdr}
6 \pagestyle{fancy} 6 \pagestyle{fancy}
7 \lhead{\parpic{\includegraphics[height=1zw,clip,keepaspectratio]{pic/emblem-bitmap.eps}}Technical Reading \& Writing} 7 \lhead{\parpic{\includegraphics[height=1zw,clip,keepaspectratio]{pic/emblem-bitmap.eps}}Technical Reading \& Writing}
8 \rhead{} 8 \rhead{}
9 \cfoot{} 9 \cfoot{}
10 10
11 \setlength{\topmargin}{-1in \addtolength{\topmargin}{15mm}} 11 \setlength{\topmargin}{-1in \addtolength{\topmargin}{20mm}}
12 \setlength{\headheight}{0mm} 12 \setlength{\headheight}{0mm}
13 \setlength{\headsep}{5mm} 13 \setlength{\headsep}{5mm}
14 \setlength{\oddsidemargin}{-1in \addtolength{\oddsidemargin}{15mm}} 14 \setlength{\oddsidemargin}{-1in \addtolength{\oddsidemargin}{20mm}}
15 \setlength{\evensidemargin}{-1in \addtolength{\evensidemargin}{15mm}} 15 \setlength{\evensidemargin}{-1in \addtolength{\evensidemargin}{20mm}}
16 \setlength{\textwidth}{181mm} 16 \setlength{\textwidth}{171mm}
17 \setlength{\textheight}{261mm} 17 \setlength{\textheight}{256mm}
18 \setlength{\footskip}{0mm} 18 \setlength{\footskip}{0mm}
19 \pagestyle{empty} 19 \pagestyle{empty}
20 20
21 \begin{document} 21 \begin{document}
22 \title{Implementation of Cerium Parallel Task Manager on Multi-core} 22 \title{Implementation of Cerium Parallel Task Manager on Multi-core}
74 \begin{center} 74 \begin{center}
75 \includegraphics[scale=0.4]{./pic/cell-main.pdf} 75 \includegraphics[scale=0.4]{./pic/cell-main.pdf}
76 \end{center} 76 \end{center}
77 \caption{Cell Broadband Engine Architecture} 77 \caption{Cell Broadband Engine Architecture}
78 \label{fig:cell_arch} 78 \label{fig:cell_arch}
79 \end{figure}
80
81 The Cell processor marries the SPEs and the PPE via EIB to give access,
82 via fully cache coherent DMA (direct memory access), to both main memory and to other external data storage.
83 To make the best of EIB, and to overlap computation and data transfer,
84 each of the nine processing elements (PPE and SPEs) is equipped with a DMA engine.
85 Since the SPE's load/store instructions can only access its own local memory,
86 each SPE entirely depends on DMAs to transfer data to and from the main memory and other SPEs' local memories.
87 A DMA operation can transfer either a single block of up to 16 KB or a list of 2 to 2048 such blocks.
88 One of the major design decisions in the architecture of Cell is the use of DMAs as a central means of intra-chip data transfer,
89 with a view to enabling maximal asynchrony and concurrency in data processing inside a chip\cite{2006:CMC}.
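As a rough worked bound from the figures above (assuming that ``KB'' here means $2^{10}$ bytes and that every block in a list has the maximum size), a single list DMA can move at most
\[
  2048 \times 16\,\mathrm{KB} = 32768\,\mathrm{KB} = 32\,\mathrm{MB}.
\]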
90
91 The PPE, which is capable of running a conventional operating system,
92 has control over the SPEs and can start, stop, interrupt, and schedule processes running on the SPEs.
93 To this end the PPE has additional instructions relating to control of the SPEs.
94 Unlike SPEs, the PPE can read and write the main memory and the local memories of SPEs through the standard load/store instructions.
95 Despite having Turing complete architectures,
96 the SPEs are not fully autonomous and require the PPE to prime them before they can do any useful work.
97 Though most of the ``horsepower'' of the system comes from the synergistic processing elements,
98 the use of DMA as a method of data transfer and the limited local memory footprint of each SPE pose a major challenge
99 to software developers who wish to make the most of this horsepower,
100 demanding careful hand-tuning of programs to extract maximal performance from this CPU.
101
102 The PPE and bus architecture includes various modes of operation giving different levels of memory protection,
103 allowing areas of memory to be protected from access by specific processes running on the SPEs or the PPE.
104
105 Both the PPE and SPE are RISC architectures with a fixed-width 32-bit instruction format.
106 The PPE contains a 64-bit general purpose register set (GPR), a 64-bit floating point register set (FPR),
107 and a 128-bit AltiVec register set. The SPE contains 128-bit registers only.
108 These can be used for scalar data types ranging from 8 bits to 128 bits
109 in size or for SIMD computations on a variety of integer and floating point formats.
110 System memory addresses for both the PPE and SPE are expressed as 64-bit values
111 for a theoretic address range of $2^{64}$ bytes (16 exabytes or 16,777,216 terabytes).
112 In practice, not all of these bits are implemented in hardware.
113 Local store addresses internal to the SPU processor are expressed as a 32-bit word.
114 In documentation relating to Cell a word is always taken to mean 32 bits, a doubleword means 64 bits, and a quadword means 128 bits.
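For reference, the quoted size of the 64-bit address range can be checked with binary prefixes (an assumption, since the text says only ``exabytes'' and ``terabytes''):
\[
  2^{64}\,\mathrm{B} = 2^{4} \times 2^{60}\,\mathrm{B} = 16\,\mathrm{EiB}
  = 2^{24} \times 2^{40}\,\mathrm{B} = 16{,}777{,}216\,\mathrm{TiB}.
\]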
115
116
117 \subsubsection{Power Processor Element (PPE)}
118 The PPE (Figure~\ref{fig:ppe}) is the Power Architecture-based,
119 two-way multithreaded core acting as the controller for the eight SPEs,
120 which handle most of the computational workload. The PPE will work
121 with conventional operating systems due to its similarity to other 64-bit PowerPC processors,
122 while the SPEs are designed for vectorized floating point code execution.
123 The PPE contains a 64 KiB level 1 cache (32 KiB instruction and 32 KiB data) and a 512 KiB level 2 cache.
124 The size of a cache line is 128 bytes.
125 The PPE can complete two double precision operations per clock cycle using a scalar fused multiply-add instruction,
126 which translates to 6.4 GFLOPS at 3.2 GHz;
127 or eight single precision operations per clock cycle with a vector fused multiply-add instruction,
128 which translates to 25.6 GFLOPS at 3.2 GHz.
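Both peak figures above are simply operations per clock cycle multiplied by the clock rate:
\[
  2 \times 3.2\,\mathrm{GHz} = 6.4\,\mathrm{GFLOPS}, \qquad
  8 \times 3.2\,\mathrm{GHz} = 25.6\,\mathrm{GFLOPS}.
\]
One plausible reading of the eight single precision operations, which the text does not spell out, is four 32-bit lanes in a 128-bit vector register with each fused multiply-add counted as two operations ($4 \times 2 = 8$).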
129
130 \begin{figure}[htb]
131 \begin{center}
132 \includegraphics[scale=0.4]{./pic/PPE.pdf}
133 \end{center}
134 \caption{PPE (Power Processor Element)}
135 \label{fig:ppe}
136 \end{figure}
137
138 \subsubsection{Synergistic Processing Elements (SPE)}
139 Each SPE (Figure~\ref{fig:spe}) is composed of a ``Synergistic Processing Unit'' (SPU) and a ``Memory Flow Controller'' (MFC), which comprises the DMA, MMU, and bus interface\cite{cell-ibm}.
140 An SPE is a RISC processor with 128-bit SIMD organization\cite{cell-ieee} for single and double precision instructions.
141 With the current generation of the Cell, each SPE contains a 256 KiB embedded SRAM for instructions and data,
142 called ``Local Storage'' (not to be mistaken for ``Local Memory'' in Sony's documents, which refers to the VRAM)
143 which is visible to the PPE and can be addressed directly by software. Each SPE can support up to 4 GiB of local store memory.
144 The local store does not operate like a conventional CPU cache since it is neither transparent
145 to software nor does it contain hardware structures that predict which data to load. Each SPE contains a 128-bit,
146 128-entry register file and measures 14.5~mm$^2$ on a 90 nm process.
147 An SPE can operate on sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers,
148 or four single-precision floating-point numbers in a single clock cycle, as well as a memory operation.
149 Note that the SPU cannot directly access system memory;
150 the 64-bit virtual memory addresses formed by the SPU must be passed from the SPU
151 to the SPE memory flow controller (MFC) to set up a DMA operation within the system address space.
152 At 3.2 GHz, each SPE gives a theoretical 25.6 GFLOPS of single precision performance.
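A minimal sketch of where this figure comes from, assuming one four-lane single precision fused multiply-add is issued per cycle and counted as two floating point operations per lane (the text above states only the lane count and the result):
\[
  4\ \mathrm{lanes} \times 2\ \frac{\mathrm{operations}}{\mathrm{lane} \cdot \mathrm{cycle}} \times 3.2\,\mathrm{GHz} = 25.6\,\mathrm{GFLOPS}.
\]
Under the same assumption, the eight SPEs together would peak at $8 \times 25.6 = 204.8$ GFLOPS in single precision.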
153
154 \begin{figure}[htb]
155 \begin{center}
156 \includegraphics[scale=0.5]{./pic/SPE.pdf}
157 \end{center}
158 \caption{SPE (Synergistic Processing Element)}
159 \label{fig:spe}
79 \end{figure} 160 \end{figure}
80 161
81 % Add a description of Cell 162 % Add a description of Cell
82 163
83 % \subsection{Mailbox} 164 % \subsection{Mailbox}
131 \item Compiler : GCC 4.1.2 212 \item Compiler : GCC 4.1.2
132 \end{itemize} 213 \end{itemize}
133 \end{small} 214 \end{small}
134 215
135 216
136 \begin{tiny}
137 \begin{table}[h] 217 \begin{table}[h]
138 \caption{Benchmark} 218 \caption{Benchmark}
139 \label{table:benchmark} 219 \label{table:benchmark}
140 \small 220 {\scriptsize
141 \begin{tabular}[t]{c||r|r|r} 221 \begin{tabular}[t]{c||r|r|r}
142 \hline 222 \hline
143 & Word Count & Sort & Prime Counter\\ 223 & Word Count & Sort & Prime Counter\\
144 \hline\hline 224 \hline\hline
145 1 CPU (Cell)& 2381 ms & 6244 ms & 2081 ms \\ 225 1 CPU (Cell)& 2381 ms & 6244 ms & 2081 ms \\
152 \hline 232 \hline
153 12 CPU (Xeon)& 48 ms & 127 ms & 36 ms\\ 233 12 CPU (Xeon)& 48 ms & 127 ms & 36 ms\\
154 \hline 234 \hline
155 24 CPU (Xeon)& 40 ms & 100 ms & 31 ms\\ 235 24 CPU (Xeon)& 40 ms & 100 ms & 31 ms\\
156 \hline 236 \hline
157 \end{tabular} 237 \end{tabular}}
158 \end{table} 238 \end{table}
159 \end{tiny}
160 239
161 % Word Count 354 / 70 = 5.0571 240 % Word Count 354 / 70 = 5.0571
162 % Sort 846 / 163 = 5.1901 241 % Sort 846 / 163 = 5.1901
163 % Prime Counter 266 / 50 = 5.32 242 % Prime Counter 266 / 50 = 5.32
164 243
178 257
179 Improving the speedup as the number of processors increases is future work. 258 Improving the speedup as the number of processors increases is future work.
180 In addition, Cerium Task Manager has many types of tasks, and the resulting complexity of task descriptions is a drawback. 259 In addition, Cerium Task Manager has many types of tasks, and the resulting complexity of task descriptions is a drawback.
181 This can be solved by having the system, rather than the user, describe the dependencies between tasks. 260 This can be solved by having the system, rather than the user, describe the dependencies between tasks.
182 261
183 \nocite{cell_abi, opencl, clay200912} 262 \nocite{cell_abi, opencl, clay200912, cell_wiki, cell_cpp, cell_sdk, libspe2}
184 % \nocite{yutaka:2010a, cell_abi, cell_cpp, cell_sdk, libspe2, ydl, clay200912, fix200609} 263 % \nocite{yutaka:2010a, cell_abi, cell_cpp, cell_sdk, libspe2, ydl, clay200912, fix200609}
185 \bibliographystyle{junsrt} 264 \bibliographystyle{junsrt}
186 \bibliography{cerium.bib,book.bib} 265 \bibliography{cerium.bib,book.bib}
187 266
188 \end{document} 267 \end{document}