comparison Paper/paper.tex @ 5:17c01f69db69 draft default tip
finish
author | Daichi TOMA <toma@cr.ie.u-ryukyu.ac.jp> |
---|---|
date | Mon, 23 Jul 2012 11:58:20 +0900 |
parents | 03e644cc3366 |
children |
4:03e644cc3366 | 5:17c01f69db69 |
---|---|
1 \documentclass[twocolumn,twoside,9.5pt]{article} | 1 \documentclass[twocolumn,twoside,11pt]{article} |
2 \usepackage[dvipdfmx]{graphicx} | 2 \usepackage[dvipdfmx]{graphicx} |
3 \usepackage{url} | 3 \usepackage{url} |
4 \usepackage{picins} | 4 \usepackage{picins} |
5 \usepackage{fancyhdr} | 5 \usepackage{fancyhdr} |
6 \pagestyle{fancy} | 6 \pagestyle{fancy} |
7 \lhead{\parpic{\includegraphics[height=1zw,clip,keepaspectratio]{pic/emblem-bitmap.eps}}Technical Reading \& Writing} | 7 \lhead{\parpic{\includegraphics[height=1zw,clip,keepaspectratio]{pic/emblem-bitmap.eps}}Technical Reading \& Writing} |
8 \rhead{} | 8 \rhead{} |
9 \cfoot{} | 9 \cfoot{} |
10 | 10 |
11 \setlength{\topmargin}{-1in \addtolength{\topmargin}{15mm}} | 11 \setlength{\topmargin}{-1in \addtolength{\topmargin}{20mm}} |
12 \setlength{\headheight}{0mm} | 12 \setlength{\headheight}{0mm} |
13 \setlength{\headsep}{5mm} | 13 \setlength{\headsep}{5mm} |
14 \setlength{\oddsidemargin}{-1in \addtolength{\oddsidemargin}{15mm}} | 14 \setlength{\oddsidemargin}{-1in \addtolength{\oddsidemargin}{20mm}} |
15 \setlength{\evensidemargin}{-1in \addtolength{\evensidemargin}{15mm}} | 15 \setlength{\evensidemargin}{-1in \addtolength{\evensidemargin}{20mm}} |
16 \setlength{\textwidth}{181mm} | 16 \setlength{\textwidth}{171mm} |
17 \setlength{\textheight}{261mm} | 17 \setlength{\textheight}{256mm} |
18 \setlength{\footskip}{0mm} | 18 \setlength{\footskip}{0mm} |
19 \pagestyle{empty} | 19 \pagestyle{empty} |
20 | 20 |
21 \begin{document} | 21 \begin{document} |
22 \title{Implementation of Cerium Parallel Task Manager on Multi-core} | 22 \title{Implementation of Cerium Parallel Task Manager on Multi-core} |
74 \begin{center} | 74 \begin{center} |
75 \includegraphics[scale=0.4]{./pic/cell-main.pdf} | 75 \includegraphics[scale=0.4]{./pic/cell-main.pdf} |
76 \end{center} | 76 \end{center} |
77 \caption{Cell Broadband Engine Architecture} | 77 \caption{Cell Broadband Engine Architecture} |
78 \label{fig:cell_arch} | 78 \label{fig:cell_arch} |
79 \end{figure} | |
80 | |
81 The Cell processor marries the SPEs and the PPE via EIB to give access, | |
82 via fully cache coherent DMA (direct memory access), to both main memory and to other external data storage. | |
83 To make the best of EIB, and to overlap computation and data transfer, | |
84 each of the nine processing elements (PPE and SPEs) is equipped with a DMA engine. | |
85 Since the SPE's load/store instructions can only access its own local memory, | |
86 each SPE entirely depends on DMAs to transfer data to and from the main memory and other SPEs' local memories. | |
87 A DMA operation can transfer either a single block area of size up to 16 KB, or a list of 2 to 2048 such blocks. | |
88 One of the major design decisions in the architecture of Cell is the use of DMAs as a central means of intra-chip data transfer, | |
89 with a view to enabling maximal asynchrony and concurrency in data processing inside a chip\cite{2006:CMC}. | |
90 | |
91 The PPE, which is capable of running a conventional operating system, | |
92 has control over the SPEs and can start, stop, interrupt, and schedule processes running on the SPEs. | |
93 To this end the PPE has additional instructions relating to control of the SPEs. | |
94 Unlike SPEs, the PPE can read and write the main memory and the local memories of SPEs through the standard load/store instructions. | |
95 Despite having Turing-complete architectures, | |
96 the SPEs are not fully autonomous and require the PPE to prime them before they can do any useful work. | |
97 Though most of the ``horsepower'' of the system comes from the synergistic processing elements, | |
98 the use of DMA as a method of data transfer and the limited local memory footprint of each SPE pose a major challenge | |
99 to software developers who wish to make the most of this horsepower, | |
100 demanding careful hand-tuning of programs to extract maximal performance from this CPU. | |
101 | |
102 The PPE and bus architecture includes various modes of operation giving different levels of memory protection, | |
103 allowing areas of memory to be protected from access by specific processes running on the SPEs or the PPE. | |
104 | |
105 Both the PPE and SPE are RISC architectures with a fixed-width 32-bit instruction format. | |
106 The PPE contains a 64-bit general purpose register set (GPR), a 64-bit floating point register set (FPR), | |
107 and a 128-bit Altivec register set. The SPE contains 128-bit registers only. | |
108 These can be used for scalar data types ranging from 8 bits to 128 bits | |
109 in size or for SIMD computations on a variety of integer and floating point formats. | |
110 System memory addresses for both the PPE and SPE are expressed as 64-bit values | |
111 for a theoretical address range of $2^{64}$ bytes (16 exabytes or 16,777,216 terabytes). | |
112 In practice, not all of these bits are implemented in hardware. | |
113 Local store addresses internal to the SPU are expressed as a 32-bit word. | |
114 In documentation relating to Cell a word is always taken to mean 32 bits, a doubleword means 64 bits, and a quadword means 128 bits. | |
115 | |
116 | |
117 \subsubsection{Power Processor Element (PPE)} | |
118 The PPE (Figure~\ref{fig:ppe}) is the Power Architecture-based, | |
119 two-way multithreaded core acting as the controller for the eight SPEs, | |
120 which handle most of the computational workload. The PPE will work | |
121 with conventional operating systems due to its similarity to other 64-bit PowerPC processors, | |
122 while the SPEs are designed for vectorized floating point code execution. | |
123 The PPE contains a 64 KiB level 1 cache (32 KiB instruction and 32 KiB data) and a 512 KiB level 2 cache. | |
124 The size of a cache line is 128 bytes. | |
125 The PPE can complete two double precision operations per clock cycle using a scalar fused multiply-add instruction, | |
126 which translates to 6.4 GFLOPS at 3.2 GHz; | |
127 or eight single precision operations per clock cycle with a vector fused multiply-add instruction, | |
128 which translates to 25.6 GFLOPS at 3.2 GHz. | |
129 | |
130 \begin{figure}[htb] | |
131 \begin{center} | |
132 \includegraphics[scale=0.4]{./pic/PPE.pdf} | |
133 \end{center} | |
134 \caption{PPE (Power Processor Element)} | |
135 \label{fig:ppe} | |
136 \end{figure} | |
137 | |
138 \subsubsection{Synergistic Processing Elements (SPE)} | |
139 Each SPE (Figure~\ref{fig:spe}) is composed of a ``Synergistic Processing Unit'' (SPU) and a ``Memory Flow Controller'' (MFC: DMA, MMU, and bus interface)\cite{cell-ibm}. | |
140 An SPE is a RISC processor with 128-bit SIMD organization\cite{cell-ieee} for single and double precision instructions. | |
141 With the current generation of the Cell, each SPE contains a 256 KiB embedded SRAM for instruction and data, | |
142 called ``Local Storage'' (not to be mistaken for ``Local Memory'' in Sony's documents, which refers to the VRAM), | |
143 which is visible to the PPE and can be addressed directly by software. Each SPE can support up to 4 GiB of local store memory. | |
144 The local store does not operate like a conventional CPU cache since it is neither transparent | |
145 to software nor does it contain hardware structures that predict which data to load. Each SPE contains a 128-bit, | |
146 128-entry register file and measures 14.5 mm$^2$ on a 90 nm process. | |
147 An SPE can operate on sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, | |
148 or four single-precision floating-point numbers in a single clock cycle, as well as a memory operation. | |
149 Note that the SPU cannot directly access system memory; | |
150 the 64-bit virtual memory addresses formed by the SPU must be passed from the SPU | |
151 to the SPE memory flow controller (MFC) to set up a DMA operation within the system address space. | |
152 At 3.2 GHz, each SPE gives a theoretical 25.6 GFLOPS of single precision performance. | |
153 | |
154 \begin{figure}[htb] | |
155 \begin{center} | |
156 \includegraphics[scale=0.5]{./pic/SPE.pdf} | |
157 \end{center} | |
158 \caption{SPE (Synergistic Processing Element)} | |
159 \label{fig:spe} | |
79 \end{figure} | 160 \end{figure} |
80 | 161 |
81 % Insert a description of Cell | 162 % Insert a description of Cell |
82 | 163 |
83 % \subsection{Mailbox} | 164 % \subsection{Mailbox} |
131 \item Compiler : GCC 4.1.2 | 212 \item Compiler : GCC 4.1.2 |
132 \end{itemize} | 213 \end{itemize} |
133 \end{small} | 214 \end{small} |
134 | 215 |
135 | 216 |
136 \begin{tiny} | |
137 \begin{table}[h] | 217 \begin{table}[h] |
138 \caption{Benchmark} | 218 \caption{Benchmark} |
139 \label{table:benchmark} | 219 \label{table:benchmark} |
140 \small | 220 {\scriptsize |
141 \begin{tabular}[t]{c||r|r|r} | 221 \begin{tabular}[t]{c||r|r|r} |
142 \hline | 222 \hline |
143 & Word Count & Sort & Prime Counter\\ | 223 & Word Count & Sort & Prime Counter\\ |
144 \hline\hline | 224 \hline\hline |
145 1 CPU (Cell)& 2381 ms & 6244 ms & 2081 ms \\ | 225 1 CPU (Cell)& 2381 ms & 6244 ms & 2081 ms \\ |
152 \hline | 232 \hline |
153 12 CPU (Xeon)& 48 ms & 127 ms & 36 ms\\ | 233 12 CPU (Xeon)& 48 ms & 127 ms & 36 ms\\ |
154 \hline | 234 \hline |
155 24 CPU (Xeon)& 40 ms & 100 ms & 31 ms\\ | 235 24 CPU (Xeon)& 40 ms & 100 ms & 31 ms\\ |
156 \hline | 236 \hline |
157 \end{tabular} | 237 \end{tabular}} |
158 \end{table} | 238 \end{table} |
159 \end{tiny} | |
160 | 239 |
161 % Word Count 354 / 70 = 5.0571 | 240 % Word Count 354 / 70 = 5.0571 |
162 % Sort 846 / 163 = 5.1901 | 241 % Sort 846 / 163 = 5.1901 |
163 % Prime Counter 266 / 50 = 5.32 | 242 % Prime Counter 266 / 50 = 5.32 |
164 | 243 |
178 | 257 |
179 As future work, we will improve the speedup ratio as the number of processors increases. | 258 As future work, we will improve the speedup ratio as the number of processors increases. |
180 In addition, Cerium Task Manager has many types of tasks, and the drawback is that their descriptions become complicated. | 259 In addition, Cerium Task Manager has many types of tasks, and the drawback is that their descriptions become complicated. |
181 This can be solved by having the system, rather than the user, describe the dependencies between tasks. | 260 This can be solved by having the system, rather than the user, describe the dependencies between tasks. |
182 | 261 |
183 \nocite{cell_abi, opencl, clay200912} | 262 \nocite{cell_abi, opencl, clay200912, cell_wiki, cell_cpp, cell_sdk, libspe2} |
184 % \nocite{yutaka:2010a, cell_abi, cell_cpp, cell_sdk, libspe2, ydl, clay200912, fix200609} | 263 % \nocite{yutaka:2010a, cell_abi, cell_cpp, cell_sdk, libspe2, ydl, clay200912, fix200609} |
185 \bibliographystyle{junsrt} | 264 \bibliographystyle{junsrt} |
186 \bibliography{cerium.bib,book.bib} | 265 \bibliography{cerium.bib,book.bib} |
187 | 266 |
188 \end{document} | 267 \end{document} |
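
The peak figures quoted in the Cell description added by this revision can be checked with a short calculation. The block below is only a sketch and is not part of the committed paper.tex. It assumes that one fused multiply-add counts as two floating point operations, and that each SPE issues one four-wide single precision fused multiply-add per cycle (the text gives the four lanes and the 25.6 GFLOPS total, but not this step). It also assumes binary prefixes for the address range and the DMA block size.

```latex
% Sanity check of the quoted peak figures (plain LaTeX, no extra packages).
% Assumption: one fused multiply-add = 2 floating point operations.
% Assumption: sizes use binary prefixes (KiB, MiB, TiB, EiB).
\[
\begin{array}{lrcl}
\mbox{PPE, double precision:} & 2 \times 3.2\,\mbox{GHz} & = & 6.4\,\mbox{GFLOPS} \\
\mbox{PPE, single precision:} & 8 \times 3.2\,\mbox{GHz} & = & 25.6\,\mbox{GFLOPS} \\
\mbox{SPE, single precision:} & (4\ \mbox{lanes} \times 2) \times 3.2\,\mbox{GHz} & = & 25.6\,\mbox{GFLOPS} \\
\mbox{Address range:} & 2^{64}\,\mbox{B} = 2^{4} \times 2^{60}\,\mbox{B} & = & 16\,\mbox{EiB} = 2^{24}\,\mbox{TiB} = 16{,}777{,}216\,\mbox{TiB} \\
\mbox{Largest DMA list:} & 2048 \times 16\,\mbox{KiB} & = & 32\,\mbox{MiB}
\end{array}
\]
```

The last row follows from the stated limits of 2048 list entries and 16 KB per block. It is an upper bound on a single list transfer, not a measured value.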