-title: Programming Scalable Services with Code Segments and Data Segments

--author: Shosi TAMAKI and Shinji KONO

-abstract:

To implement scalable services, not only high-level software design
but also low-level implementation matters for performance
and reliability.
A combination of a fine-grained task manager and a continuation based
language is well suited to building scalable services on many-core architectures.
A code segment is a
small piece of execution code written in a lower-level variant of C.
Data segments are fragments of memory that are passed
among code segments and CPU cores. We discuss the pros and cons
of our method.

--Tools for Implementing Distributed Applications

To meet the demand for scalable services such as Twitter,
Facebook or network-based book publishing, we need a new generation of
programming tools. Based on our experience, we designed and implemented
two major tools for building scalable services: code segments and
data segments.

As SEDA \cite{SEDA2001} demonstrates, scalable services
require highly distributed servers and highly multi-threaded
programs on each server. This type of implementation works
fine in theory, but it depends heavily on low-level mechanisms such
as threads, synchronized queues and CAS operations.

We have successfully implemented
WWW services using classical tools such as C++, Java, or even C.
Scripting languages such as Perl, PHP or Python are used on the front end,
but on the heavy-duty database side, careful implementation
is necessary to achieve good performance.

Now that some services have more than 10 million users,
they balance load among several WWW front ends and many
memcached\cite{memcached04} servers to replicate accesses to
classical databases such as Oracle, MySQL or PostgreSQL. These
scale so badly that Internet companies have had to
create key value store systems by themselves, such as BigTable
\cite{Chang06bigtable:a}
or Cassandra\cite{cassandra09}.
This situation is sometimes discussed in the context
of the ACID vs. BASE database scheme \cite{Brewer:2000:TRD:343477.343502}.

Based on our work on Internet programming and on Sony PS3
programming, that is, the Cell architecture\cite{Cell},
we are now sure that we need more suitable tools to implement the various
components of scalable services, such as database servers,
web servers, or HTTP response generators on the front end.

We separate the difficulties into two points of view: code and data.
This sounds very basic, but since our history started from
a single CPU with little memory, current tools are somewhat obsolete
now, and we have to reconsider the situation
(Fig. \ref{Data and Code in Internet Service}).

<center><img src="fig/two-side.pdf" alt="Data and Code in Internet Service"></center>

We are working on a combination of Continuation based C\cite{kono08f,cbc-sourceforge}
and
the Cerium Engine\cite{cerium-sourceforge}.
The former is a lower-level variant of C implemented
in GCC\cite{gcc}, and the latter is an OpenCL\cite{opencl}-like fine-grained task manager for the
PS3 Cell architecture, which supports data segment management on
the SPE (Synergistic Processing Element). Since an SPE has only 256 Kbytes of
local memory, careful management is necessary, so we had to invent
our own memory manager. We can use 6 SPEs connected by a 2 Tbit/s ring bus under
PS3 Linux (Fig. \ref{Cell Architecture}).

<center><img src="fig/Cell.pdf" alt="Cell Architecture"></center>

In this paper, we first analyze the problems in scalable systems. Then
we introduce two new concepts: code segments and data segments. Code
segments are implemented in Continuation based C (hereafter called CbC), and data segments
are implemented in the memory management module of a many-core task manager called
the Cerium Engine.

The basic idea is this. We pass control among module layers without function
calls. We cannot use a conventional language here because function
calls are built in and cannot be removed. These module layer segments are
called code segments.
All data is stored in a message
packet form, which we call a data segment, and which is managed uniformly.
Instead of using direct pointer access,
data segments are copied among modules and CPU cores in a way that is carefully
adapted to the cache or to interconnect communication such as DMA.
All data segments are hashed in $2^n$-sized memory pools, similar to the
Unix malloc mechanism. The pool lives in a 64-bit address space, which
makes data segment communication far simpler.
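
As a rough sketch of this allocation scheme (the names and constants here are ours, not the actual Cerium implementation), each request is rounded up to the next power of two and served from a per-size free list:

    #include <stdint.h>
    #include <stdlib.h>

    #define MIN_SHIFT 5    /* smallest block: 2^5 = 32 bytes */
    #define MAX_SHIFT 24   /* largest block: 2^24 = 16 Mbytes */

    typedef struct block {
        struct block *next;              /* free-list link */
    } Block;

    static Block *free_list[MAX_SHIFT + 1];

    /* round a request up to the next power of two, return the exponent */
    static int size_class(size_t size) {
        int shift = MIN_SHIFT;
        while (((size_t)1 << shift) < size) shift++;
        return shift;
    }

    /* allocate a 2^n-sized block, reusing a freed block of the same class */
    void *ds_alloc(size_t size) {
        int shift = size_class(size);
        Block *b = free_list[shift];
        if (b) { free_list[shift] = b->next; return b; }
        return malloc((size_t)1 << shift);
    }

    /* return a block to its size class; blocks are never split or merged */
    void ds_free(void *p, size_t size) {
        int shift = size_class(size);
        Block *b = p;
        b->next = free_list[shift];
        free_list[shift] = b;
    }

Because every block has a $2^n$ size, a freed block can always be reused for any later request of the same class, which is why such pools fragment very little.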

This may sound completely different from the current Internet service scheme;
we overcome the difference in the following way, using code transformation.
First we write the entire program in a conventional way. We divide it
into code segments and explicit stack manipulation. During this stage,
communications are reformatted into data segments passed among
code segments. The program still works exactly as it did before the
transformation, and we may use automatic conversion here.
We then reorganize it using data-parallel and pipelined execution. At this
stage, automatic conversion is not suitable in many cases, so we have
to translate by hand, but we can use a possible equivalence checker
for program correctness. In order to enable pipelined execution,
destructive modification of the contents of data segments, as in
classical object-oriented programming, is not allowed.
We have to make copies.

In the last section, we present some of our achievements and
a comparison with other tools, such as SEDA and OpenCL.

--Problems

Suppose we are going to build a network game.
A PS3 (6 SPUs and 2 PPU threads) may be used on the client side and a
32-CPU machine (16 cores x 2 hyper-threads) on the server side.
We have to use highly pipelined threads and data-parallel execution
on both the client side and the server side, something like the SEDA architecture.
We will demonstrate several problems based on our experiences.

---Our Experiences

Our PS3 implementation is a SPURS\cite{spurs}-like pipelined task manager
called Cerium. (We had to write a software renderer because SCEI did
not open the GPU specifications.) Since the PS3's SPU has only 256 Kbytes of memory,
we have to handle memory usage carefully for both code and data.

Data segments are copied from the PPU to the SPU via DMA, whose overhead is
hidden by the task pipeline. But we have to avoid too much synchronization
of these copies.

On the Xeon architecture, CPU cores share all the memory,
but each core actually has a local cache, interconnected via QuickPath.
The cache size is 256 Kbytes per core. Implicit copying is done between
a cache and main memory or between one cache and another.
The situation is basically the same on the PS3 and the Xeon.

---Module Layers

Complex systems such as operating systems, database systems,
network systems or game libraries have several module layers.
For example, a database system has a message packing module,
a query analysis module and an execution module. A network system
may have ISO standard layers such as presentation, transport
and network. An operating system has a v-node file system and
device drivers.

Each module may have 1-5 nested function calls, so we can have
10-30 nested function calls in a complex system.
This can be implemented with normal function calls (Fig.\ref{Layer by call}).
<center><img src="fig/layer-func.pdf" alt="Layer by call"></center>

Using our Continuation based C, layering can be implemented with
goto statements. Since there are no stack operations, this works very fast (Fig.\ref{Layer by goto}).
<center><img src="fig/layer-continuation.pdf" alt="Layer by goto"></center>
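
As a minimal sketch (the module and helper names here are ours), each layer is a code segment that ends by jumping to the next layer, so no stack frames accumulate:

    /* each layer hands the packet to the next with goto; no return address is saved */
    __code network_layer(struct packet *p) {
        decode_header(p);           /* hypothetical helper */
        goto transport_layer(p);    /* a jump, not a call */
    }

    __code transport_layer(struct packet *p) {
        reassemble(p);              /* hypothetical helper */
        goto presentation_layer(p);
    }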

If we have several tasks to do, the
processing in each module can be executed in a pipelined way.
In order to implement the pipeline, we assign threads from
thread pools to each module layer.
The threads are interconnected by synchronized queues, which have
a certain overhead, but if they are carefully implemented, parallel
processing hides their cost.

Conversion from sequential execution to pipelined execution
is straightforward, but if there is a race condition, correcting
the problem is very difficult.

---SEDA 

This combination of pipeline staging and data-parallel execution
is the heart of SEDA. But it requires very complex programming.
First we have to design the communication queues among all the pipeline
stages. In C++, we have to manage all the queues manually
because the language lacks garbage collection. This is not easy and requires
complex memory pools (or a conservative GC), which is bug prone
(Fig.\ref{Layer by Thread}).
<center><img src="fig/layer-continuation.pdf" alt="Layer by Thread"></center>

---Thread Implementation

Theoretically the SEDA architecture works fine, but it assumes very
fast thread execution with blocking queues. The Cassandra key value store
uses Java to implement this architecture. Java 6 has far better
thread execution, but since it combines user-level threads and
kernel-level threads, it is not so easy to optimize its execution.

If we use a scripting language to implement this, things become worse. For
example, the Python thread implementation has very poor concurrency\cite{python-gil},
and Ruby does not support kernel threads. Moreover, their GC mechanisms always
interfere with thread execution.

---Blocking queue

Each thread is executed in an event-driven way. A task is
put into a blocking queue, which wakes up a thread. The thread
reads from the queue atomically.

We can write this operation in the following way.

    Task p;
    while ((p = waitingQueue.get()) != null) {
        p.run();
    }

This looks good, but {\tt p} is determined just before its execution,
which is bad in terms of branch prediction.
This delay may not look important, but
it carries a penalty of around 10 clocks. If we have many small tasks,
this is a real problem. What we need is to know the call target 10 to 20
instruction cycles before executing the indirect call.

Besides the blocking queue's CAS costs, queue operations include
allocation/deallocation
costs. In Java, to avoid GC penalties, link nodes are not reused;
the old node is simply deleted and a new one created. If the allocation is
shared among threads (unlikely), it requires another CAS; otherwise
it requires a separate memory pool for each thread, which consumes a lot
of memory.
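
A per-thread pool can be sketched as follows (the names are ours); since each thread touches only its own free list, the allocation path needs no CAS:

    #include <stdlib.h>

    typedef struct node {
        struct node *next;
        void *task;
    } Node;

    /* one free list per thread: only its owner touches it, so no CAS */
    static __thread Node *local_free;

    Node *node_alloc(void) {
        Node *n = local_free;
        if (n) { local_free = n->next; return n; }
        return malloc(sizeof(Node));
    }

    void node_free(Node *n) {
        n->next = local_free;       /* thread-local push, no synchronization */
        local_free = n;
    }

The memory cost mentioned above comes from the fact that a node freed by one thread stays in that thread's pool, so every thread ends up holding its own reserve of nodes.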

---Scheduling

If the cost of blocking queue operations were negligible, simple
FIFO scheduling would be fine in SEDA from the throughput point of view.
But a blocking queue costs 10 to 20 instruction cycles even with
no contention. The ideal thread pool size depends heavily on
the architecture, that is, the number of CPUs, the number of requests and the
execution time of the tasks.

Sometimes it is better to reduce concurrency and skip these
synchronization costs altogether. In such cases, thread synchronization
is pure overhead.

---Garbage Collection

Basically, communication between layers creates no garbage, because
messages are generated and destroyed in fixed-size amounts. But in a
programming language with GC, a memory-pool-like technique creates many
references that the GC has to track, which makes the GC very slow. So
it is better to simply allocate and free.

The Apache web server uses the memory pool approach extensively, but it
is only a convention; some modules call the malloc library directly.

---Programming Correctness

The SEDA and SPURS architectures are very complex to implement,
and programs running on them are also very difficult to write. They are
very difficult to test.

For example, a message packet between pipeline stages is created and destroyed
at an exact moment. If we get the timing wrong, a bug may or may not
appear, depending on luck. This is odd, because even if the program itself is
deterministic, it behaves non-deterministically due to pipeline execution timing.

---Many Core Awareness

OpenCL \cite{opencl} is a standard library for many-core
architectures, but it has very complex interfaces. We have to
write the per-core program in a string, which is compiled by
LLVM \cite{llvm}. The data transfer APIs are varied and complex, and
require a large amount of code.

In Java or a scripting language, we cannot directly control
the copying between cores, which means we cannot explicitly hide
the copying cost. We also have to mind the SPU's local memory size, or the
cache size, which is 256 Kbytes in our case.

The same careful management is necessary for the executed code,
which is also data on a core. We have to transfer code segments
using cost-hiding techniques such as pipelined execution.

This kind of higher-level pipeline optimization is very difficult and is not
handled well by compilers. Compiler techniques work well on streams of
instructions, so this is somewhat contradictory; it should be designed by hand.

--New Tools

We introduce two main tools: one is Continuation based C and the other
is data segment management based on the Cerium Task Manager.

--Code Segment 

Continuation based C is a variant of C in which every function
is forced to use tail call elimination. It is implemented on
GCC 4.x, and the modification is not large. We also force the
FASTCALL option, which passes arguments in registers. This
makes it faster.

The CbC syntax is very simple.

    struct interface1 { DataSegment<Data> *i; };
    struct interface2 { DataSegment<Data> *o; };

    __code f(struct interface1 *a,
             struct interface2 *b) {
        b->o = a->i;
        goto g(b);
    }

In this example, a code segment
\verb+f+ takes input \verb+a+ and sends output \verb+b+ to a code segment \verb+g+.
There is no return from code segment \verb+g+; \verb+g+ must invoke another
continuation using \verb+goto+. Any control structure of C is allowed in the CbC
language, but we may restrict ourselves to the \verb+if+ statement
only, because it is sufficient to implement the C-to-CbC translation. In this case,
a code segment has one input interface and several output interfaces (fig.\ref{Code Segment}).
<center><img src="fig/code.pdf" alt="Code Segment"></center>
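
For instance, restricting ourselves to \verb+if+, a code segment with one input interface and two output continuations can be sketched like this (the segment names are ours):

    __code check(struct interface1 *in) {
        if (in->i == NULL)
            goto on_empty(in);     /* first output continuation */
        else
            goto on_data(in);      /* second output continuation */
    }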

\verb+__code+ and the parameterized global goto statement are the extensions of
Continuation based C. Unlike the parameterized goto of \verb+C--+ \cite{cminusminus},
we cannot goto into a normal C function, because of the forced FASTCALL option.

---Continuation

Since a code segment has no stack, the continuation of a code segment is merely
the entry address of the code segment. We can call it a lightweight continuation.
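
Because a continuation is a mere address, it can be held in a variable and jumped to indirectly; a minimal sketch (the names are ours):

    __code g(int v);                  /* some other code segment */

    __code f(int v) {
        __code (*cont)(int) = g;      /* a continuation is just an entry address */
        goto cont(v + 1);             /* indirect jump, no stack frame is created */
    }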

We also support the full continuation of a normal C function, using GCC nested
functions and statement expressions. Inside the GCC compiler it is implemented
something like this, in pseudo code with GCC extensions.

    void (*__return)(int retval_, void *_envp);
    __return = ({
        __label__ _cbc_exit0;
        void __return_func(int retval_,
                void *_envp) {
            retval = retval_;
            goto _cbc_exit0;
        }
        if (0) {
          _cbc_exit0:
            return retval;
        }
        __return_func;   /* the value of the statement expression */
    });

    void *__environment =
        __builtin_frame_address(0);

We have an environment pointer, which is usually the frame pointer, but
it is not used here, because this is a closure with a hidden environment.
Such a closure is usually implemented with a trampoline, that is, executable
code on the stack, so it will not work where executing code on the stack is
prohibited; it does work on Linux and Mac OS X. On Windows we cannot use the closure,
so we have to assign the frame pointer explicitly. If we do not have to
handle the frame pointer directly, the generation of the continuation is done in the
parsing phase. This is important for keeping the GCC modification minimal.

This can be used as follows.

    int main() {
        goto f(1, __environment, __return);
        ....
    }

    __code f(int i, void *env,
            __code (*continuation)(int retval_, void *fp)) {
        goto (continuation)(-1, env);
    }

In this example, \verb+main+ returns -1. When you want to return to
the middle of a normal function or code segment, insert an extra function
call there.


--Data Segment

We have an OpenCL-like task manager with data segments.

A data segment is a set of doubly linked fixed-size blocks, which are also
hashed by their 64-bit addresses. The blocks have $2^n$ sizes, so they are
allocated efficiently.

Each site, CPU or core is expected to have a separate data segment pool.
Data segment addresses are unique across all CPUs. On the PS3, an SPU has
a local store, a separate 256-Kbyte address space, whose
local addresses differ from the data segment global addresses.
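
A block header for such a pool might look like the following sketch (the field names are ours, not the actual Cerium layout):

    #include <stdint.h>

    typedef struct ds_block {
        uint64_t global_addr;          /* unique 64-bit global address (the key) */
        uint32_t size_shift;           /* block size is 1 << size_shift */
        struct ds_block *prev, *next;  /* doubly linked list inside the pool */
        struct ds_block *hash_next;    /* chain of the global-address hash table */
        /* the payload follows the header */
    } DsBlock;

    #define DS_HASH_SIZE 1024

    /* hash a 64-bit global address into the per-site lookup table */
    static inline unsigned ds_hash(uint64_t addr) {
        return (unsigned)((addr >> 6) % DS_HASH_SIZE);
    }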

A code segment does not use global addresses directly; instead it uses
offsets into the data segments in its input interface. So we can use the
same code segment both on a 64-bit Xeon and on an SPU with 256 Kbytes of memory.
This means the data segment size is normally limited by the hardware,
typically to 16 Kbytes (fig.\ref{Pipeline buffered data segment}).

<center><img src="fig/pipeline.pdf" alt="Pipeline buffered data segment"></center>

Each core has to have two input segments and two output segments
to run the pipeline correctly. Two extra segments are necessary
for the task arrays themselves, so we have 6 segments in total.
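
The two pairs of segments implement double buffering: while the code segment works on one pair, the task manager transfers the other. A sketch with hypothetical DMA helpers (a real SPU port would use the MFC intrinsics):

    extern void dma_load(void *ls_buf, int index);   /* hypothetical helpers */
    extern void dma_store(void *ls_buf, int index);
    extern void dma_wait(int buffer);
    extern void run_code_segment(void *in, void *out);
    extern void *in_buf[2], *out_buf[2];

    /* compute on buffer `cur` while buffer `1 - cur` is in flight */
    void pipeline_loop(int n_tasks) {
        int cur = 0;
        dma_load(in_buf[cur], 0);
        for (int i = 0; i < n_tasks; i++) {
            if (i + 1 < n_tasks)
                dma_load(in_buf[1 - cur], i + 1);  /* prefetch the next input */
            dma_wait(cur);                         /* current input has arrived */
            run_code_segment(in_buf[cur], out_buf[cur]);
            dma_store(out_buf[cur], i);            /* write back in the background */
            cur = 1 - cur;                         /* swap the buffer pair */
        }
    }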

---Data Segment operations

A data segment has several operations:

   get with no global address
   get with copy 
   get with no copy
   get with copy with write back
   get with no copy with write

Allocation and deallocation are not handled directly by the code segment;
they are handled by the Task Manager in a pipelined way.

The API can be called from a task like this:

    Datasegment tile = 
      smanager->get_segment(addr);

but usually it is not visible from the task, because the read
operations are done before the task's execution and the write
operations are done after it.

A data segment may contain other data segments' global addresses, but
these may be invalid. A global address is a kind of key, as in a key value store.
The consistency of data segment global addresses is maintained by the Cerium task manager.

--Typical usage

A code segment is provided with an input interface, which contains an array of
data segments in the local address space or cache. Usually the availability
of the data segments is assured by the task manager. If one is not ready,
the code segment waits and other ready-to-run code segments are executed.

Loading the necessary data segments into the input interfaces is done prior
to execution, possibly in the background of other code segments'
execution.

In the following example, \verb+t_exec+ is created with one input
data segment and one output data segment. It can be executed on
any SPU (a PS3 CPU core), and the \verb+t_print+ task has to wait for
its completion. Finally it is spawned.

    HTask *t_exec = 
       manager->create_task(TASK_EXEC); 
    t_exec->add_input_datasegment(i_data);
    t_exec->add_output_datasegment(o_data);
    t_exec->set_cpu(SPE_ANY); 
    t_print->wait_for(t_exec); 
    t_exec->spawn();

When all its data segments are ready, the code segment is executed. During
its execution, the next input interfaces may be loading.

After the execution, the output interfaces are written back to their global
addresses if necessary. This is also done in a pipelined way.

---Task dependency and Task array

The Cerium task manager has a very simple FIFO scheduler. It is sufficient
when only throughput matters, which is the usual case.

All tasks are stored in data segments and connected by wait-for links.

After a task finishes, the task manager resolves these dependencies,
which is rather heavy work. If tasks are grouped in terms of their
dependencies, we can reduce this phase. Such a group is called a task array.

All data is stored in data segments and managed in a separate data segment
pool on each CPU, so we need no lock for allocation.
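
A task array can be sketched as one data segment holding many task descriptors with a single wait-for link for the whole group (the layout is ours, not the exact Cerium one):

    #include <stdint.h>

    typedef struct task_entry {
        uint32_t command;          /* which code segment to run */
        uint64_t input_addr;       /* global address of the input data segment */
        uint64_t output_addr;      /* global address of the output data segment */
    } TaskEntry;

    typedef struct task_array {
        uint32_t  n_tasks;         /* entries executed in order on one core */
        uint64_t  wait_for;        /* one dependency link for the whole array */
        TaskEntry tasks[];         /* the grouped task descriptors */
    } TaskArray;

With this grouping, the task manager resolves dependencies once per array instead of once per task.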

---Task execution

After the input interface is loaded, if we have a next task,
we know where it will be executed. It can be passed to the current
task.

    __code task_a(HTask *next_task, interface input,
            interface output) {
        ....  /* task processing */
        goto next_task->code(next_task,
            next_task->input, next_task->output);
    }

If we have no more tasks to execute, we can put the mail-waiting task in
\verb+next_task+.
In this way, the \verb+next_task+ call address is determined well before the call.

---Data segment deallocation timing

There are two types of data segment. 

One kind stays in main memory
indefinitely, possibly replicated to a more reliable storage tier such
as SSDs or hard disks. Its global address is persistent. It is basically
write-only and remains for the life of the Internet service; in
other words, it is never deallocated. We call this a persistent
data segment.

The other kind stays in the local cores. It is limited in size and
temporary. It is copied from a persistent data segment. After the
code segment's execution, the temporary data segment may be copied back into
a persistent data segment.

A task itself does not care about read/write race conditions. They have
to be controlled in terms of task dependencies, or by the task
manager (fig.\ref{Global and local data segments}).

<center><img src="fig/global.pdf" alt="Global and local data segments"></center>

---Where is the synchronization?

In the Cerium Task Manager, each core has a single-threaded scheduler. It accepts
an array of tasks, as a data segment, in the form of a mail. There is a main task
manager that waits for mails from the schedulers on the cores. Synchronization only
happens in the mail transfer between the main scheduler and the sub-schedulers on the cores.
This means synchronization itself can be delayed significantly in this scheme.
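
Since each mailbox has exactly one writer (a core scheduler) and one reader (the main scheduler), a single-producer single-consumer ring needs no CAS at all; a sketch (the names are ours):

    #include <stdint.h>

    #define MAIL_SIZE 64                 /* must be a power of two */

    typedef struct mailbox {
        volatile uint32_t head;          /* advanced only by the reader */
        volatile uint32_t tail;          /* advanced only by the writer */
        uint64_t mail[MAIL_SIZE];        /* global addresses of task arrays */
    } Mailbox;

    /* writer side: a core scheduler posts a finished task array */
    int mail_send(Mailbox *mb, uint64_t task_array_addr) {
        uint32_t t = mb->tail;
        if (t - mb->head == MAIL_SIZE) return 0;   /* full, retry later */
        mb->mail[t % MAIL_SIZE] = task_array_addr;
        /* a memory barrier is needed here on weakly ordered hardware */
        mb->tail = t + 1;
        return 1;
    }

    /* reader side: the main scheduler drains the mails */
    int mail_recv(Mailbox *mb, uint64_t *out) {
        uint32_t h = mb->head;
        if (h == mb->tail) return 0;               /* empty */
        *out = mb->mail[h % MAIL_SIZE];
        mb->head = h + 1;
        return 1;
    }

The reader may lag behind the writer by up to the ring size, which is exactly the delayed synchronization described above.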

If a service needs a very fast response, a special task that dominates a core is
necessary, for example one that waits for events using spin locks or hardware interrupts.


---Hardware support

This architecture requires explicit cache control. Nowadays, most
architectures have this kind of cache control, such as memory barriers.
Unfortunately these are not standardized.
The Cerium task manager hides these differences.

---With Conventional Operating System

The task manager itself runs in user space. Since tasks are in
data segments, they can be transferred to other user spaces, for
example on other clusters.
We actually built our task manager in user space.

Operating system support for this task manager is possible; for example, the OS
could provide memory space management for code segments and data segments.

---Object Orientation

There have been many object-oriented programming styles since Smalltalk-80\cite{smalltalk}.
C++ and Java place an object at a fixed memory address. When a
field of an object is updated, the contents of that fixed memory are updated.
In highly pipelined execution, updating memory contents requires
synchronization when the object is shared.

In our scheme, the input interface and the output interface usually point
to different data segments, to avoid synchronization.

In ACT3\cite{actor87}, an actor has a become operator: an object is replaced by
a newly created object. This means an object has multiple memory addresses
according to its update history. Smalltalk-80 has an object
table, and its become operator replaces a pointer in that table.
In our scheme, the table should be kept in a data segment and updated by a
single-threaded task.

We can build an actor-like object-oriented system on top of the data segment
pools.
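
A become operator over such a table might look like this sketch (the names are ours); only the single-threaded table task performs the update:

    #include <stdint.h>

    /* the object table itself lives in one data segment */
    typedef struct object_table {
        uint64_t entries[1024];    /* object id -> global address of the current version */
    } ObjectTable;

    /* become: run only by the single-threaded table task, so no lock is needed */
    void become(ObjectTable *tbl, uint32_t id, uint64_t new_version_addr) {
        /* old versions stay readable at their own global addresses */
        tbl->entries[id] = new_version_addr;
    }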

---Verification

Basically, pipelined tasks are in fact a series of applications of
tasks to requests. We can simply write this using an iterator.
In the case of word count over a file:

    foreach data segment d in (file),
            out in (partial_result) {
        task_word_count(d, out);
    }
    task_sum_up(partial_result);

If the pipelined execution itself is correct, we do not have to verify the
pipeline; we only have to check the correctness of this sequential execution,
which is much easier.

Once we have a verified sequential execution, we can put a checking
stage on each pipeline stage.

--Comparison

Our architecture is a variant of SEDA, but it can reduce
synchronization costs. By copying data segments, shared
data is reduced. Copying costs are hidden using DMA or
cache management instructions.

OpenCL was introduced recently, but its data transfer
operations are very complex; it is an assembler-level description
aimed at the best performance. Data segment handling makes this
simpler, both syntactically and in memory management efficiency.

Tasks in OpenCL are stored as C strings. In our scheme
tasks are all written as code segments, which can run on
any architecture. We actually run Cerium on both Mac OS X and
PS3 Linux using the same code. If a task contains code optimized
for the SPU, we can run non-optimized code with the same behavior elsewhere.

Scripting languages are easy to write in, but they work sequentially
from the beginning. Python cannot gain a parallel processing
advantage because of interpreter restrictions. Our approach is to
divide a problem into code segments and data segments and execute
it with an iterator.

This is something like FP\cite{key7}, but with data segments its execution
becomes suitable for many-core architectures.

A KVS (key value store) is a distributed database separated
from the main service program. In our scheme, the management of global
data segments can be done in a KVS. We can also use our architecture
to implement a KVS.

In persistent programming, records and transactions were introduced
into programming languages, but at that time parallel execution was
not well considered. In our scheme, data segments behave as records,
which have several versions.

This architecture uses a new language, CbC. It is a lower-level
variant of C, but programmers still have to learn it, and its
programming style is very different. We do not expect it to be
easily accepted by everyone.

In structured programming and data flow diagrams \cite{Bruza93thesemantics},
everything has a record-like structure, which is called a container. In our
architecture, containers have common operations and are managed on the
many-core architecture. They are also executed in CbC.

There is a conversion algorithm from C to CbC, so we hope
some kind of semi-automatic conversion of a sequential implementation
of an Internet service is possible.

--Conclusions

We are developing a SEDA-like architecture for software
services. It has a code segment system based on Continuation
based C and a data segment system based on the
Cerium Engine.

The combination of data segments and code segments provides better
many-core programming than OpenCL. Execution proceeds in a multi-stage
pipeline, and code segments provide a good implementation
technique for the pipeline scheduler.

Data segment copying makes garbage collection unnecessary in the
computation pipeline, something which cannot be avoided in a GC-based
language such as Java.

The Cerium task engine and the Continuation based C compiler are
developed in the open and are working.
The combination of code segments and data segments is under construction.