This post is a brief note on Petuum (from the paper "Petuum: A New Platform for Distributed Machine Learning on Big Data"). The paper comes from an academic environment (CMU).
- What's the purpose of creating Petuum?
In the machine learning domain, it is common that an algorithm works well on a small data set but cannot be applied directly to a large-scale one. When facing a large-scale data set, the first thing that comes to mind is parallel processing, but in a parallel environment unexpected obstacles arise: 1) we need a decent programming model/abstraction/framework that lets the code run in a distributed environment; 2) it is hard to guarantee that an algorithm works correctly in a distributed system. By "correctly", I mean the system can guarantee fault tolerance (for example, even if one or more servers crash, the system can still make the algorithm run as intended). Petuum is designed as a platform applicable to a wide range of different ML algorithms at scale. I think the most significant feature of Petuum is that it takes both the System and the Algorithm into consideration and proposes the Model Parallelism concept with theoretical guarantees. This is a new angle and perspective for designing distributed systems for ML on large-scale data sets.
- How does Petuum work?
The inspiration for Petuum comes from the observation that many ML algorithms are optimization-based. For these algorithms, we usually have a model with parameters to be estimated. To find proper parameters, we choose an appropriate loss function $L$ (RMS loss, likelihood, margin) and a stopping criterion (for example, $L$ reaches its minimum). Basically, such an algorithm iteratively updates the parameters until the stopping criterion is met. Given this observation, the authors propose two parallel mechanisms for Petuum: Data Parallelism and Model Parallelism.
Iterative-Convergent ML Algorithm: Petuum is designed for this class of ML algorithms, whose general update can be written as:
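$$A^{(t)} = F\left(A^{(t-1)},\ \Delta_L\left(A^{(t-1)}, D\right)\right)$$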
Here $D$ is the data set; the loss $L$ is a fitness function such as RMS loss, likelihood, or margin; $(t)$ denotes the iteration; $A$ is the model state (i.e., parameters and/or latent variables). $\Delta_L$ performs computation on the data $D$ and model state $A$, and outputs intermediate results to be aggregated by $F()$. For simplicity, the subscript $L$ is omitted in the rest of this post.
Data Parallelism: the data set $D$ is partitioned and assigned to workers, with the $p$-th subset $D_p$ handled by the $p$-th worker. Each worker iteratively computes updates ($\Delta$) on its own data partition, and the outputs from all workers are aggregated via summation. The key to the validity of this "additive update" is the assumption of independent and identically distributed (i.i.d.) data, which many ML algorithms make. The data-parallel update equation is:
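$$A^{(t)} = F\left(A^{(t-1)},\ \sum_{p=1}^{P} \Delta\left(A^{(t-1)}, D_p\right)\right)$$

To make this concrete, below is a minimal single-process sketch of the additive data-parallel update, with a least-squares gradient step standing in for $\Delta()$; the function names and learning rate are my own illustration, not Petuum's API:

```python
import numpy as np

# Single-process sketch of A^(t) = F(A^(t-1), sum_p Delta(A^(t-1), D_p)),
# using a gradient step on least squares as a stand-in for Delta().

def delta(A, X_p, y_p):
    """Per-worker intermediate result: negative gradient of the
    squared loss on partition D_p."""
    return -X_p.T @ (X_p @ A - y_p) / len(y_p)

def F(A, aggregated, lr=0.1):
    """Aggregation: apply the summed worker outputs to the model state."""
    return A + lr * aggregated

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
A = np.zeros(5)

P = 4  # number of workers; worker p holds partition D_p
X_parts, y_parts = np.array_split(X, P), np.array_split(y, P)

for t in range(100):
    # In Petuum each worker would compute delta() concurrently;
    # summing the results is valid under the i.i.d. assumption.
    A = F(A, sum(delta(A, Xp, yp) for Xp, yp in zip(X_parts, y_parts)))
```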
Model Parallelism: the model $A$ is partitioned and assigned to workers. Unlike data parallelism, each update takes a scheduling function $S_p^{(t-1)}$, which restricts $\Delta()$ to operate on a subset of the model parameters $A$. This process can be represented as:
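$$A^{(t)} = F\left(A^{(t-1)},\ \sum_{p=1}^{P} \Delta\left(A^{(t-1)},\ S_p^{(t-1)}(A^{(t-1)})\right)\right)$$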
where $D$ is omitted for simplicity. $S_p^{(t-1)}$ outputs a set of indices $\{j_1, j_2, \dots\}$ so that $\Delta()$ only performs updates on $A_{j_1}, A_{j_2}, \dots$. One thing we should bear in mind is that the $A_j$ are not, in general, independent of each other. In my opinion, the core problem for model parallelism in Petuum is how to implement the schedule function $S_p()$; a sketch of one possible policy follows.
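To give a flavor of what $S_p()$ might look like, here is a sketch of a dependency-aware scheduler in the spirit of the paper's model-parallel Lasso example: coordinates $j$ and $k$ interact through the correlation of their features, so the scheduler greedily picks a set of mutually weakly-correlated coordinates that are safe to update in parallel. The greedy policy and the threshold value are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

# Sketch of a dependency-aware schedule(): select up to k coordinates
# of A whose pairwise feature correlations all stay below `threshold`,
# so that concurrent updates interfere as little as possible.

def schedule_independent(X, k, threshold=0.1, seed=None):
    rng = np.random.default_rng(seed)
    corr = np.abs(np.corrcoef(X, rowvar=False))  # |corr(x_j, x_k)|
    chosen = []
    for j in rng.permutation(X.shape[1]):
        if all(corr[j, c] < threshold for c in chosen):
            chosen.append(j)
        if len(chosen) == k:
            break
    return chosen  # the index set {j_1, j_2, ...} handed to Delta()

X = np.random.default_rng(0).normal(size=(500, 20))
print(schedule_independent(X, k=5, seed=1))  # five nearly-independent coordinates
```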
- Implementing Data- and Model-Parallel Programs: Petuum
Scheduler: the scheduler enables model parallelism by allowing users to control which model parameters are updated by worker machines. This is done through a user-defined scheduling function $schedule()$ (corresponding to $S_p^{(t-1)}$).
Workers: each worker $p$ receives the parameters to be updated from $schedule()$, and then runs the parallel update function $push()$ (corresponding to $\Delta()$) on data $D$. While $push()$ is being executed, the model state $A$ is automatically synchronized with the parameter server via the parameter exchange channel.
Parameter Server: the parameter servers provide global access to the model parameters $A$ via convenient distributed shared memory. A toy sketch of how these three components fit together is given below.
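Here is a toy single-process skeleton of the $schedule()$/$push()$ pattern around a parameter-server-like object. The function names mirror the paper's pseudocode, but everything else (the class, the scheduling policy, the least-squares update) is my own illustrative assumption; the real Petuum API is C++-based and differs:

```python
import numpy as np

class ParameterServer:
    """Stand-in for Petuum's distributed shared memory holding A."""
    def __init__(self, dim):
        self.A = np.zeros(dim)

    def inc(self, j, delta):
        self.A[j] += delta  # increments commute, so update order is flexible

def schedule(ps, num_workers, block=2):
    """S_p^(t-1): give each worker a disjoint block of parameter indices,
    prioritizing the largest-magnitude entries of A (illustrative policy)."""
    order = np.argsort(-np.abs(ps.A))
    return [order[p * block:(p + 1) * block] for p in range(num_workers)]

def push(ps, idx, D_p, lr=0.1):
    """Delta(): a least-squares gradient step restricted to the scheduled
    indices; reads A from, and writes increments to, the parameter server."""
    X_p, y_p = D_p
    grad = X_p.T @ (X_p @ ps.A - y_p) / len(y_p)
    for j in idx:
        ps.inc(j, -lr * grad[j])

rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 6)), rng.normal(size=300)
ps = ParameterServer(dim=6)
partitions = list(zip(np.array_split(X, 3), np.array_split(y, 3)))

for t in range(50):
    for p, idx in enumerate(schedule(ps, num_workers=3)):
        push(ps, idx, partitions[p])  # workers would run these concurrently
```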
The rest of the paper provides theoretical proofs and application examples of Petuum, as well as experimental results. The main purpose of this post is to introduce Petuum briefly, so I won't go through that part; if you are interested, please refer to the Petuum paper.
- Question
Which approach is proper for designing a large-scale distributed system for machine learning algorithms: bottom-up, top-down, or something else?