Tuesday, November 10, 2015

Google Dataflow: A unified model for batch and streaming data processing

  • 1. Why do we need the Google Dataflow model? (Motivation)


In the real world, we often face data processing tasks with different requirements. The following table is an example that illustrates what we regard as important, and how important it is, for a given large-scale data processing task.
As the table above shows, different tasks have different computation requirements. That is why Google needed to improve on existing frameworks or create a new data processing framework. Inside Google, MapReduce is widely used for batch processing, but when facing unbounded data sets (i.e., infinite data sets / streaming data), MapReduce does not work well because of its latency. Given this shortcoming of MapReduce, Google developed MillWheel, which works well for processing streaming data sets.

Engineers want a single framework that can apply to as many real tasks as possible. That is the reason for developing the Google Dataflow system. The Google Dataflow model unifies batch and streaming data processing into a single model/framework. It allows us the flexibility to balance latency, correctness, and cost in a variety of real use cases.
  • 2. Some useful concepts in the Google Dataflow Model (GDM)
We may already be familiar with stream processing systems. In a stream processing system, the data put into the system is usually unbounded, which means you don't know how many records you will process or which one is the last. A common case is that data comes into a streaming system continuously. But what if we want to process the data associated with a given time period, say one day or one hour? The solution is to introduce a logical concept called a window.

With windows, we can judge whether all the records needed by a computation/stage have already arrived. A window "splits" an unbounded/infinite data set into bounded/finite data chunks. To decide whether a record belongs to a window, we must know a "timestamp" for that record. There are two types of time associated with a data record in GDM: event time, which is the time the record was generated, and processing time, which is the time the record is observed at any given point during processing within the pipeline. So we have two time domains: event time and processing time. During processing, the realities of the system in use result in a dynamically changing amount of skew between the two domains. To track this skew, GDM uses a global progress metric, such as the watermark in MillWheel. For most real-world distributed systems, it is very hard to establish a 100% correct watermark, so it is possible that some records "come late". Ideally, the watermark provides a useful notion of when the system can consider it likely that all records up to a given point in event time have been observed. The details of how a system deals with records that "come late" are not the focus of this blog, so I omit them. The following figure shows the time-domain skew:
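To make the two time domains concrete, here is a minimal, self-contained sketch in plain Java (this is my own illustration, not Dataflow SDK code; the Record class, the timestamps, and the watermark value are all invented for the example):

    import java.time.Duration;
    import java.time.Instant;

    public class TimeDomains {
        // A record carries its event time; its processing time is whenever
        // the pipeline actually observes it.
        static class Record {
            final Instant eventTime;      // when the event really happened
            final Instant processingTime; // when the system saw it
            Record(Instant eventTime, Instant processingTime) {
                this.eventTime = eventTime;
                this.processingTime = processingTime;
            }
        }

        public static void main(String[] args) {
            Instant now = Instant.now();
            // An event generated 90 seconds ago that only arrives now:
            Record r = new Record(now.minusSeconds(90), now);

            // The skew between the two domains for this record:
            Duration skew = Duration.between(r.eventTime, r.processingTime);
            System.out.println("skew = " + skew.getSeconds() + "s");

            // A heuristic watermark claims that all events up to this point
            // in event time have (probably) been observed already.
            Instant watermark = now.minusSeconds(60);

            // A record whose event time is behind the watermark "comes late".
            boolean late = r.eventTime.isBefore(watermark);
            System.out.println("late = " + late); // true here: 90s-old event, 60s watermark
        }
    }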
  • 3. Striking a balance between latency and correctness

As we mentioned in part 2, it is often hard to establish a 100% correct watermark for a real system. In practice, the watermark is often established by a heuristic algorithm, which means we have to make a tradeoff. Let's consider two cases. If the watermark moves too slowly, we wait for records longer than we need to and our results are delayed: latency suffers. If the watermark moves too fast, some records will "come late", which means we miss some data for a given time period and our results will be incomplete: correctness suffers. Therefore we have to strike a balance between latency and correctness. The right tradeoff between latency and correctness (and cost) depends on the use case at hand. The Google Dataflow Model facilitates this tradeoff, which is a key difference from traditional batch or stream processing systems.
  • 4. The Google Dataflow Model (GDM)

The best way to understand GDM is to decompose it across four related dimensions. The answers to four questions specify what a Google Dataflow pipeline does. They are:
  1. What are you computing?
  2. Where in event time?
  3. When in processing time?
  4. How do refinements relate?

  What are you computing?
A data pipeline may look like this:
A pipeline represents a graph of data processing transformations. The nodes/vertices in the graph are the transformations themselves, and the edges are parallel collections (PCollections) flowing through the pipeline. A PCollection<T> is a collection of data of type T; it may be bounded or unbounded in size. Each element in a PCollection has an implicit timestamp. A PCollection can be transformed into other PCollections (as the figure below shows).
Let's look at an example:
We have a large number of data records, each of the form "user name, team, score". We want to compute the total score for each team. Here is pseudo-code for this task:
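(The snippet below is my reconstruction, modeled on the Dataflow paper's pseudo-code style; IO.read, ParDo, ParseFn, KV, and Sum.integersPerKey follow the paper's conventions, so treat it as a sketch rather than runnable SDK code.)

    // Raw input: one line per record, of the form "user name, team, score".
    PCollection<String> raw = IO.read(...);

    // Parse each line into a (team, score) key/value pair.
    PCollection<KV<String, Integer>> input =
        raw.apply(ParDo.of(new ParseFn()));

    // Classic batch computation: one total score per team.
    PCollection<KV<String, Integer>> output =
        input.apply(Sum.integersPerKey());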
The example program works in this way:
First we have the raw records in a PCollection<String>; then a transformation turns the raw PCollection<String> into a PCollection of (team, score) pairs. Finally, a summed score is computed for each unique team. Figure 6 below shows what happens for a specific team.
Figure 6
From Figure 6 we can see all the records that came in for that team.


Figure 7
In Figure 7, the white dashed line represents the ideal watermark. From Figure 7 we can see that some records come into the system slightly late (the record with score 3), and one record comes rather late (the record with score 9). In a batch processing system (such as MapReduce), we simply wait until all the records have been observed, then accumulate all the scores, and eventually we get a total score for the team. But don't forget that in GDM there can be windows. We will now show how to incorporate windows into a GDM computation.

Where in event time?

In GDM, windowing divides data into event-time-based finite chunks. There are three common windowing patterns: fixed windows, sliding windows, and session windows (Figure 8).
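In the same paper-style pseudo-code, the three patterns look roughly like this (the window builders are named as in the Dataflow paper; a sketch, not exact SDK syntax):

    // Fixed windows: non-overlapping, e.g. one window every 2 minutes.
    input.apply(Window.into(FixedWindows.of(2, MINUTES)));

    // Sliding windows: overlapping, e.g. 2-minute windows starting every minute.
    input.apply(Window.into(SlidingWindows.of(2, MINUTES).every(1, MINUTE)));

    // Session windows: per-key bursts of activity, closed after a gap of inactivity.
    input.apply(Window.into(Sessions.withGapDuration(30, MINUTES)));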
Windowing is necessary for processing unbounded data, since we must define a time period within which our computation makes sense. Looking back at the example above, with windowing added the pseudo-code takes a new form (Figure 8a; a sketch follows the caption below). Figure 8b shows the process of computing the total score for a specific team. Fixed 2-minute windows are used, and each window contains a few records (assigning a record to a window is event-time-based). Within each window the computation is batch-based: it does not emit a result until a "safe" point in time at which all the records belonging to the window are guaranteed to have been observed (in Figure 8b, all windows wait until 12:10), and then all the records (scores) are accumulated (added). Finally, we add the four windows' sums together to get the total score for that specific team (14 + 22 + 3 + 12 = 51).
Figure 8a
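A sketch of the windowed pseudo-code from Figure 8a (again in the paper's style; the only change from the earlier snippet is the added Window.into step):

    // The same per-team sum as before, but now computed independently
    // inside each 2-minute event-time window.
    PCollection<KV<String, Integer>> output = input
        .apply(Window.into(FixedWindows.of(2, MINUTES)))
        .apply(Sum.integersPerKey());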
Figure 8b
Can we reduce the latency of the computation process above? The answer is yes, and that is what distinguishes the Google Dataflow Model from its counterparts.

When in processing time?

We want the arrival time of records to have an effect on the processing we do. A notion called triggers is used to control this. A trigger tells us when to start producing output for a window, instead of waiting for the entire computation over all the data to finish. One of the most common ways to trigger is relative to the watermark. Figures 9a and 9b show the same example as above, but with a trigger involved (see the sketch after the figures):
Figure 9a
Figure 9b
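A sketch of the triggered version (paper-style pseudo-code; AtWatermark names a trigger that fires once the watermark passes the end of a window, and the exact builder syntax is illustrative):

    // Emit each window's sum as soon as the watermark passes the end of
    // that window, instead of waiting until all input has been observed.
    PCollection<KV<String, Integer>> output = input
        .apply(Window.into(FixedWindows.of(2, MINUTES))
                     .triggering(AtWatermark()))
        .apply(Sum.integersPerKey());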
When the watermark passes the end of a window, we know we should start producing output for that window. In this way we get lower latency than in the case without a trigger. However, a drawback of this method is obvious: we may miss some records (in this example, the record with score 9).

...
I will continue writing and add more to this blog post.







