Tuesday, August 1, 2017

Data parallel and model parallel distributed training with TensorFlow

Google's TensorFlow is a popular platform for distributed training of machine learning/deep learning applications. In this post, I will present several ways of performing distributed training with TensorFlow, focusing on data parallel and model parallel training. By distributed training, I mean training that runs on multiple devices, and in particular on multiple nodes/machines.

About data parallelism and model parallelism: 
Distributed training comes into play as a solution to the big-data and big-model problems: it increases the degree of parallelism. There are usually two kinds of parallelism in distributed training with TensorFlow: data parallelism and model parallelism. For more explanation, please refer to this Quora answer: What is the difference between model parallelism and data parallelism?

About parameter server concept:
TensorFlow inherits the abstraction of a "parameter server" ("ps") from its predecessor DistBelief. Other platforms such as MXNet and Petuum have the same abstraction. Basically, a parameter server (ps) is a collection of computing units (devices, nodes, or application processes) responsible for storing and updating the parameters of a machine learning/deep learning model. As a common way to handle mutable state (the model parameters), most distributed machine learning/deep learning platforms support parameter server functionality.
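To make the ps abstraction concrete, here is a minimal sketch (TensorFlow 1.x API, with placeholder host names of my own choosing) of how a cluster with one ps task and two worker tasks can be described, and how the ps process itself is brought up. The ps process does nothing but host variables and serve the workers' read/update requests:

# A minimal sketch of starting a parameter-server process (TensorFlow 1.x).
# The host names and ports below are placeholders, not real machines.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],            # parameter server task(s)
    "worker": ["worker0.example.com:2222",
               "worker1.example.com:2222"]         # worker tasks
})

# Every process in the cluster runs with its own job name and task index;
# this one is the (single) ps task.
server = tf.train.Server(cluster, job_name="ps", task_index=0)
server.join()   # block forever: the ps just stores/updates parameters on request

The worker processes use the same ClusterSpec but start their servers with job_name="worker"; their side is sketched in the data parallelism example below.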

I am assuming you have a basic understanding of TensorFlow, so I will omit the basic concepts. The following shows the ways we can implement distributed training with TensorFlow:
Fig 1  


  • Data Parallelism:
Example 1 (Figure 1):
Each worker node (corresponding to a worker task process) executes the same copy of the compute-intensive part of the TensorFlow computation graph. The model parameters are divided across multiple PS nodes (corresponding to ps task processes), and there is a separate client for each worker task. In each training iteration, each worker task takes in a batch of training data and performs computation (e.g., computing the gradient in SGD training) with the model parameters fetched from the PS. The resulting gradients from each worker task are then aggregated and applied to the PS in either a synchronous or an asynchronous fashion. A minimal code sketch follows the notes below.

Some notes: 
1, In this scheme, the PS holds only a single copy of the model. However, that single copy of the model parameters can be split across different nodes.
2, This scheme supports both synchronous and asynchronous training.
3, This is a typical data parallelism scenario: in each iteration, every worker runs the same computation logic and needs access to the whole set of model parameters, but each worker only needs a subset of the training data for that iteration's computation.
4, There is no application-level data (e.g., parameters or activations) communication among workers.
5, The example employs the between-graph replication training mechanism. A similar scheme is in-graph replication training; "in-graph" means there is only a single client in the training cluster, and all copies of the compute-intensive part live in the single computation graph built by that client. In-graph replication is less popular than between-graph replication.
For more information about between-graph replication training, please refer to the tutorial.
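As a concrete illustration of between-graph replication, here is a minimal sketch of what each worker process runs. It is not the exact code from my repository: the toy linear model, the host names, and the hard-coded task_index are assumptions for illustration, and the ps processes are started as in the earlier parameter server sketch:

# A minimal between-graph replication sketch (TensorFlow 1.x), run by every
# worker process. The model and cluster addresses are placeholders.
import tensorflow as tf

task_index = 0   # each worker process would set its own index, e.g., via a command-line flag

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0:2222", "ps1:2222"],          # the model parameters are split across two ps tasks
    "worker": ["worker0:2222", "worker1:2222"]
})
server = tf.train.Server(cluster, job_name="worker", task_index=task_index)

# replica_device_setter places the variables round-robin on the ps tasks and
# the compute-intensive ops on this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % task_index,
        cluster=cluster)):
    x = tf.random_normal([32, 10])               # stand-in for a real input batch
    y = tf.random_normal([32, 1])
    w = tf.get_variable("w", [10, 1])            # stored on a ps task
    b = tf.get_variable("b", [1], initializer=tf.zeros_initializer())
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w) + b - y))
    global_step = tf.train.get_or_create_global_step()
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=global_step)           # asynchronous updates by default

# Each worker runs its own training loop through its own client/session.
hooks = [tf.train.StopAtStepHook(last_step=1000)]
with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=(task_index == 0),
                                       hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)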

Example 2 (Fig 2)
In example 1, I assumed that each computing node in the cluster contains only CPU devices. With GPU-powered nodes, data parallel distributed training might look like the following example 2 (Fig 2) (you can find this figure here):
Fig 2
Example 2 shows the view of a single node (with one CPU and two GPUs) in data parallel distributed training. Each GPU executes a worker task. The difference from the previous example is that in example 2 each node's CPU first aggregates the gradients from all of its local GPUs and then applies the aggregated gradients to the PS. The parameters are fetched from the PS by each node's CPU, which then copies them to the local GPUs.
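The following is a minimal single-node sketch of this "tower" pattern. The two-GPU setup and the toy linear model are assumptions for illustration, and in the full distributed setting the variables would live on the ps tasks rather than on /cpu:0:

# A minimal single-node sketch of the tower pattern in Fig 2 (TensorFlow 1.x).
import tensorflow as tf

NUM_GPUS = 2
optimizer = tf.train.GradientDescentOptimizer(0.01)

# The CPU holds the parameters (standing in for the fetch-from-PS step).
with tf.device("/cpu:0"):
    w = tf.get_variable("w", [10, 1])
    b = tf.get_variable("b", [1], initializer=tf.zeros_initializer())

def tower_loss():
    # Toy linear model; each GPU would receive its own slice of the input batch.
    x = tf.random_normal([16, 10])
    y = tf.random_normal([16, 1])
    return tf.reduce_mean(tf.square(tf.matmul(x, w) + b - y))

# Each local GPU computes the gradients on its own data.
tower_grads = []
for i in range(NUM_GPUS):
    with tf.device("/gpu:%d" % i):
        tower_grads.append(optimizer.compute_gradients(tower_loss(), var_list=[w, b]))

# The CPU averages the per-GPU gradients and applies them once
# (in the distributed case it would push them to the PS instead).
with tf.device("/cpu:0"):
    averaged = []
    for grads_and_vars in zip(*tower_grads):     # one (grad, var) pair per GPU, per variable
        grads = [g for g, _ in grads_and_vars]
        averaged.append((tf.add_n(grads) / len(grads), grads_and_vars[0][1]))
    train_op = optimizer.apply_gradients(averaged)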

You can find code examples for data parallel distributed training on my GitHub: code examples of distributed training with TensorFlow

  • Model Parallelism
When a big model cannot fit into a single node's memory, model parallel training can be employed to handle it. Model parallel training has two key features:
1, Each worker task is responsible for estimating a different part of the model parameters, so the computation logic differs from worker to worker.
2, There is application-level data communication between workers.
The following Fig 3 shows a model parallel training example:
Fig 3

This example shows how to use model parallelism to train a neural network (with three hidden layers) with the back-propagation algorithm. (Here is a wonderful tutorial for understanding the back-propagation algorithm.) There are four computing units (here, four worker nodes). Each worker is responsible for storing and updating a subset of the network parameters. In the forward phase (blue arrows), each worker passes its activations to its successor worker; in the backward phase (red arrows), each worker passes the "error" to its predecessor worker. Once a worker receives the errors from its successor, it is ready to update its corresponding parameters. It is worth mentioning that in this model parallel scheme the workers cannot all compute simultaneously, since a worker cannot start computing until it receives the necessary inputs, either activations or errors, from its neighbor.
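To make the placement concrete, here is a minimal sketch of how such a layer-per-worker split can be expressed with tf.device. The cluster addresses, layer sizes, and loss function are assumptions for illustration; TensorFlow inserts the transfers of activations (forward) and errors/gradients (backward) between tasks automatically when the graph spans several devices:

# A minimal sketch of the layer-per-worker placement in Fig 3 (TensorFlow 1.x).
import tensorflow as tf

# Passed to tf.train.Server in each worker process, as in the earlier sketches.
cluster = tf.train.ClusterSpec({
    "worker": ["worker0:2222", "worker1:2222",
               "worker2:2222", "worker3:2222"]
})

x = tf.random_normal([32, 100])      # stand-in input batch
y = tf.random_normal([32, 10])       # stand-in targets

# Each worker stores (and updates) the parameters of one layer.
with tf.device("/job:worker/task:0"):
    h1 = tf.layers.dense(x, 256, activation=tf.nn.relu)
with tf.device("/job:worker/task:1"):
    h2 = tf.layers.dense(h1, 256, activation=tf.nn.relu)
with tf.device("/job:worker/task:2"):
    h3 = tf.layers.dense(h2, 256, activation=tf.nn.relu)
with tf.device("/job:worker/task:3"):
    logits = tf.layers.dense(h3, 10)
    loss = tf.reduce_mean(tf.square(logits - y))

# minimize() builds the backward pass; colocating the gradient ops with the
# forward ops keeps each layer's backward computation and parameter update on
# the worker that owns that layer.
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
    loss, colocate_gradients_with_ops=True)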
This Stack Overflow question is about a simple implementation of model parallel training in TensorFlow. You may find it useful.