Wednesday, February 24, 2016

Parameter Server and Petuum: specialized systems for large-scale machine learning algorithms.

Parameter Server and Petuum are two distributed computation frameworks for large-scale machine learning tasks. Both were created at CMU. The links for them are: Petuum, Parameter Server. In my previous post (Introduction to Petuum), I already made an introduction to Petuum. In this post, I will introduce Parameter Server. This post is not a short summary of each part of the Parameter Server paper. Instead, I will mainly talk about some important aspects of designing this kind of system for large-scale machine learning tasks, with Parameter Server being a good example.
  • 1,  Iterative-convergence algorithms and Error-Tolerant Convergence
A large number of machine learning problems can be solved by iterative-convergence algorithms. By iterative-convergence I mean that these algorithms iteratively converge to an optimal state, which may be a global or a local optimum. For this kind of algorithm, there are some key properties that we can exploit to facilitate the design and implementation of a distributed computing system. One of the key properties is error-tolerant convergence, which I will illustrate in the following.
  • 2,  Stale Synchronous Parallel (SSP)----trade-off between efficiency and consistency
It's hard to design a distributed system with both high efficiency and strong consistency guarantees. Most people in the distributed systems area are familiar with the MapReduce framework. It employs the Bulk Synchronous Parallel (BSP) model and provides a strong consistency guarantee. For some problems, such as sorting, strong consistency is necessary. But when using MapReduce to solve machine learning problems, efficiency suffers because MapReduce's consistency model requires significant communication overhead. Can we sacrifice consistency to gain an efficiency improvement when solving large-scale machine learning problems? The answer is: of course!

We can treat the SSP model as a trade-off between a fully synchronous system and a fully asynchronous system. SSP brings inconsistency into a system, but this inconsistency is bounded. For some machine learning algorithms, especially iterative-convergence algorithms, bounded inconsistency does not harm the result, and the correctness of executing these algorithms under SSP is guaranteed. Both Parameter Server and Petuum employ the SSP model (actually, the consistency mechanism of Parameter Server can be configured by users; it can be set to a fully asynchronous or fully synchronous model). Why can these iterative-convergence machine learning algorithms, to some extent, tolerate inconsistency? The reason is that these algorithms usually have the capacity to tolerate bounded error. Linear regression is a classic model in machine learning, and we can use the gradient descent algorithm to estimate its parameters. In this post, I take gradient descent as an example to illustrate the property of error-tolerant convergence. (I omit the basics of linear regression and gradient descent; if you are not familiar with them, you can refer to Andrew Ng's online courses.)

In the gradient descent algorithm, we usually want to minimize a cost function $J(\Theta)$, where $\Theta \in \mathbb{R}^{n}$ is an $n$-dimensional vector. The gradient descent algorithm updates $\Theta$ iteratively until $J(\Theta)$ reaches its local/global minimum (convergence). In each iteration, $\Theta$ is updated in the following way (note the minus sign: to minimize $J$ we step against the gradient):
$\Theta :=\Theta-\alpha \cdot \nabla_\Theta  J(\Theta)$             $(1)$
Formula $(1)$ is equivalent to the following per-coordinate update:
$\Theta_j:=\Theta_j-\alpha \cdot \frac{\partial J(\Theta )}{\partial \Theta_j}  \quad j \in \{1,2,...,n\}$      $(2)$
where $\alpha$ is called the learning rate and $\frac{\partial J(\Theta )}{\partial \Theta_j}$ is the gradient with respect to $\Theta_j$. Formula $(1)$ can also be written as:
$\Theta :=\Theta+u$                                    $(3)$
where $u=-\alpha \cdot \nabla_\Theta  J(\Theta)$ is the update in each iteration. Since the training samples for this kind of machine learning algorithm are usually independent and identically distributed, we can partition the whole training set into several parts and assign each part to a worker machine. Each worker is then responsible for computing a partial update based on its assigned data subset. In a strongly consistent system like MapReduce, if we have $r$ workers, $r$ sub-updates need to be aggregated in each iteration, and then the parameter $\Theta$ is updated based on the aggregated gradient (the sum of the sub-updates). This means the update of parameter $\Theta$ can only be triggered when all workers finish their computation in each iteration. In this situation, we trade efficiency for strong consistency.
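This fully synchronous scheme can be sketched as follows. It is only a toy simulation (the $r$ workers run serially in one process, on synthetic linear-regression data); the "barrier" is the point where all $r$ sub-updates are summed before $\Theta$ changes:

```python
import numpy as np

def partial_gradient(theta, X_part, y_part):
    # Each worker computes the gradient of the squared-error cost
    # on its own data shard only.
    residual = X_part @ theta - y_part
    return X_part.T @ residual

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta

theta = np.zeros(3)
alpha = 0.1
r = 3  # number of workers
X_shards = np.array_split(X, r)
y_shards = np.array_split(y, r)

for it in range(200):
    # Barrier: wait for all r sub-updates, then aggregate their sum.
    grad = sum(partial_gradient(theta, Xs, ys)
               for Xs, ys in zip(X_shards, y_shards))
    theta -= alpha * grad / len(X)   # u = -alpha * (average gradient)

print(theta)  # converges close to [1.0, -2.0, 0.5]
```

Note that the sum of per-shard gradients equals the full-batch gradient exactly, which is why this distributed scheme reproduces the single-machine iteration.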

The SSP model employed by Parameter Server and Petuum sacrifices, to some extent, consistency to gain an efficiency improvement, while algorithm correctness is still guaranteed. SSP allows a faster worker to go a little further than the others, but guarantees that the fastest and slowest workers are no more than $\tau$ apart. In Parameter Server and Petuum, $\tau$ is measured in iterations. So any worker's sub-gradient may trigger an update of parameter $\Theta$ even though some other workers may not have finished computing their sub-gradients. This means a faster worker does not have to wait for a slower one, as long as the slower worker is not too slow (more than $\tau$ iterations behind). This process does introduce errors into the iterations, but the errors are bounded. There are theoretical proofs of this in the SSP papers (to be honest, I am not yet clear about every detail of the proofs).
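The bounded-staleness rule can be illustrated with a toy simulation (this is my own sketch, not code from either system): a worker is allowed to advance its iteration clock only while it is fewer than $\tau$ iterations ahead of the slowest worker:

```python
import random

def ssp_run(num_workers=4, tau=2, total_iters=20, seed=1):
    """Simulate SSP clocks: a worker may advance only while it is
    fewer than tau iterations ahead of the slowest worker."""
    random.seed(seed)
    clocks = [0] * num_workers
    blocked_events = 0
    while min(clocks) < total_iters:
        w = random.randrange(num_workers)     # some worker becomes ready
        if clocks[w] - min(clocks) >= tau:    # too far ahead: must wait
            blocked_events += 1
            continue
        clocks[w] += 1                        # otherwise it advances
        # Invariant: fastest and slowest are never more than tau apart.
        assert max(clocks) - min(clocks) <= tau
    return clocks, blocked_events

clocks, blocked = ssp_run()
print(clocks, blocked)
```

Setting `tau=0` forces lockstep execution (BSP-like behavior), while a very large `tau` never blocks anyone (fully asynchronous behavior).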


  • 3,  The architecture of Parameter Server
The previous sections provide the necessary background. Now let's take a look at Parameter Server (PS).
   Figure 1,  the architecture of PS
The PS employs a client-server model.
Server group: stores the globally shared parameters, such as $\Theta$ in the gradient descent algorithm. The nodes in the server group can communicate with each other to replicate and/or migrate parameters for reliability and scaling.
---Server manager node: maintains a consistent view of the server nodes (the metadata of the servers).
Worker group: the workers in a group are responsible for one application.
---Task scheduler: within each worker group, assigns tasks to each worker and monitors their progress. If workers are added or removed, it reschedules unfinished tasks.
Workers only communicate with the server group, not with other workers in their group. Worker groups can either work together on one big task or work individually on separate tasks.

  • 4,  Data structure and communication in Parameter Server
The shared parameters in PS are represented as a set of (key, value) pairs. In addition, in order to optimize linear algebra operations, the keys are ordered. This lets us treat the parameters as (key, value) pairs while endowing them with vector and matrix semantics. For example, in the gradient descent algorithm the parameter $\Theta$ can be seen as a vector, with each entry being a pair ($ID$, $\Theta_i$).
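A minimal illustration of this dual view (plain Python, not the PS implementation): keeping the keys ordered lets us flip between the (key, value) representation and a dense vector on which updates have ordinary vector semantics:

```python
# Parameters stored as (key, value) pairs; insertion order is irrelevant
# because we always work with the sorted key order.
theta = {2: 0.5, 0: 1.0, 1: -2.0}

keys = sorted(theta)                     # ordered keys: [0, 1, 2]
vec = [theta[k] for k in keys]           # dense-vector view of the values

u = [0.25, 0.25, 0.25]                   # an update with vector semantics
vec = [a + b for a, b in zip(vec, u)]    # plain vector addition

# Write the result back as (key, value) pairs.
theta = dict(zip(keys, vec))
print(theta)  # {0: 1.25, 1: -1.75, 2: 0.75}
```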

The PS supports range-based Push and Pull operations. Suppose a key range is denoted by $R$. Parameter Server provides the following two operations:
$u.push(R,dest)$: send all existing entries of $u$ in key range $R$ to the destination.
$u.pull(R,dest)$:  read all  existing entries of $u$ in key range $R$ from the destination.
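A toy sketch of range push and pull (the class and method names here are my own invention, not the real PS API): the worker sends or reads only the entries of $u$ whose keys fall in the range $R$:

```python
class KVServer:
    """A toy in-memory stand-in for a parameter-server node
    (hypothetical API, not the real system's interface)."""
    def __init__(self):
        self.store = {}

    def handle_push(self, entries):
        # Aggregate pushed updates by summing into the stored values.
        for k, v in entries.items():
            self.store[k] = self.store.get(k, 0.0) + v

    def handle_pull(self, key_range):
        lo, hi = key_range
        return {k: v for k, v in self.store.items() if lo <= k < hi}


class Worker:
    def __init__(self, server):
        self.server = server
        self.u = {}          # local update, keyed like the parameters

    def push(self, key_range):
        lo, hi = key_range   # send only entries whose key falls in R
        self.server.handle_push(
            {k: v for k, v in self.u.items() if lo <= k < hi})

    def pull(self, key_range):
        return self.server.handle_pull(key_range)


server = KVServer()
w = Worker(server)
w.u = {0: 0.5, 1: -0.25, 7: 2.0}
w.push((0, 4))                 # only keys 0 and 1 lie in the range [0, 4)
print(w.pull((0, 4)))          # {0: 0.5, 1: -0.25}
```

Because keys are ordered, a range request maps naturally to a contiguous slice of the underlying storage, which is what makes these batched operations efficient.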

  • 5,  An algorithm example:  Distributed Subgradient Descent


Algorithm 1


This example differs slightly from the original gradient descent algorithm (a penalty function $\Omega$ is added to the original cost function). However, the method for estimating the model parameters and the property of error-tolerant convergence are almost the same.
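Since Algorithm 1 itself is not reproduced here, below is a rough sketch of its structure under my own naming: workers compute subgradients of the un-penalized cost on their data shards, and the server aggregates them, adds the subgradient of an assumed $\ell_1$ penalty $\Omega(w)=\lambda\|w\|_1$, and updates the parameters:

```python
import numpy as np

def worker_subgradient(theta, X_part, y_part):
    # WorkerIterate: compute a subgradient on the local shard (then "push").
    return X_part.T @ (X_part @ theta - y_part)

def server_update(theta, subgrads, n, alpha, lam):
    # ServerIterate: aggregate all pushed subgradients, add the
    # subgradient of the l1 penalty, and update the parameters.
    g = sum(subgrads) / n + lam * np.sign(theta)
    return theta - alpha * g

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
true_theta = np.array([2.0, 0.0, -1.0, 0.0])
y = X @ true_theta

theta = np.zeros(4)
for it in range(300):
    shards = zip(np.array_split(X, 3), np.array_split(y, 3))
    grads = [worker_subgradient(theta, Xs, ys) for Xs, ys in shards]
    theta = server_update(theta, grads, len(X), alpha=0.1, lam=0.01)

print(np.round(theta, 2))
```

With a small penalty weight the estimate lands near the true parameters, slightly shrunk toward zero, which is the expected effect of the $\ell_1$ term.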

  • 6,  Task dependence and flexible consistency
In PS, we are able to impose specific task-dependency constraints. By default, tasks are executed in parallel. If we want a task to be executed only after some other tasks have finished, we can set such constraints in PS, and the system is responsible for maintaining them. For example, the aggregation logic in $ServerIterate$ of Algorithm 1 updates $w$ only after all worker gradients have been aggregated. This can be implemented by having the updating task depend on the push tasks of all workers. (A note here: there are some variants of the gradient descent algorithm; one of them is stochastic gradient descent (SGD). SGD does not scan all the data points in each iteration; instead, it relies on only one sample from the data set per iteration. In the case of SGD, we do not need the constraints used in Algorithm 1.)
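This dependency pattern can be mimicked with ordinary futures (a stand-in for PS's dependency mechanism, not its actual API): the update task simply blocks until every push task has completed:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tasks: each worker pushes its sub-update; the server's
# update task depends on all push tasks having finished.
def push_gradient(worker_id):
    return {worker_id: float(worker_id)}   # placeholder sub-update

def server_update(sub_updates):
    merged = {}
    for u in sub_updates:
        merged.update(u)
    return merged

with ThreadPoolExecutor(max_workers=4) as pool:
    # Push tasks run in parallel (the default in PS).
    push_tasks = [pool.submit(push_gradient, w) for w in range(4)]
    # The dependency: .result() blocks until each push is done, so the
    # update runs only after all worker sub-updates are available.
    aggregated = server_update(t.result() for t in push_tasks)

print(aggregated)  # {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0}
```

Dropping the dependency (updating as each future completes) would give the SGD-like, constraint-free behavior mentioned above.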

Instead of providing only one consistency model, PS gives users the flexibility to define their own. Three kinds of consistency models are supported by PS.

  1. Bulk Synchronous Parallel (BSP) 
  2. Stale Synchronous Parallel (SSP)
  3. Asynchronous Model
Actually, BSP and the Asynchronous Model can be seen as special cases of SSP (with staleness $\tau=0$ and $\tau=\infty$, respectively).


  • 7,  Distributed Hash Table in Server Group

Parameters are stored in the server group. PS provides the following mechanisms for fault tolerance and scalability.
The Parameter Server partitions keys much as a conventional distributed hash table does: keys and server node IDs are both inserted into a hash ring. Note that each physical node is inserted multiple times, in the form of virtual nodes, to facilitate load balancing.
Figure 2

As shown in Figure 2, each node manages the key segment starting at its own insertion point and extending to the next insertion point of another node in the anti-clockwise direction. In Figure 2, each server node manages the segment of the same color. The mapping from key segments to nodes is stored in Zookeeper. Each key segment is also replicated to the $k$ anti-clockwise neighbor server nodes for fault tolerance.
PS also provides effective mechanisms for failure recovery.
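A toy version of this partitioning scheme (my own sketch; I walk the ring toward the first node at or after the key rather than anti-clockwise as in the paper's figure, but the idea is symmetric): physical servers are inserted as several virtual nodes, and each key gets one primary owner plus $k$ replica owners:

```python
import bisect
import hashlib

def h(s):
    # Map a string to a position on the hash ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    """Toy consistent-hash ring with virtual nodes; each key segment is
    additionally replicated to its k neighboring server nodes."""
    def __init__(self, servers, vnodes=4, k=2):
        self.k = k
        self.ring = sorted(
            (h(f"{s}#{v}"), s) for s in servers for v in range(vnodes))

    def owners(self, key):
        # Primary owner: the first virtual node at or after the key's
        # position; the next k distinct servers hold the replicas.
        points = [p for p, _ in self.ring]
        i = bisect.bisect_left(points, h(key)) % len(self.ring)
        found = []
        while len(found) < self.k + 1:
            s = self.ring[i % len(self.ring)][1]
            if s not in found:
                found.append(s)
            i += 1
        return found

ring = HashRing(["server-a", "server-b", "server-c"])
print(ring.owners("theta_42"))   # primary owner + 2 replica owners
```

The virtual nodes spread each physical server's load over several ring segments, so adding or removing a server only moves a small fraction of the keys.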

Finally, there are some points not discussed in this post, such as message passing and compression, (key, value) pairs with vector clocks, and user-defined filters.