Saturday, June 27, 2015

When machine replace human

Recently, a good friend sent me an article from Harvard Business Review called "Beyond Automation", written by Thomas H. Davenport and Julia Kirby.  The article talked about how automation affects our job forces and displacing values from human workers.  It proposed 5 strategies in how we can get prepared to retain competitiveness in the automation era.  This is a very good article and triggers me a lot of thoughts.

I want to explore a fundamental question:  "Can machine replace a human in future ?"

Lets start looking at what machines are doing and not doing today.  Machines are operating under a human's program, and therefore it can only solve those problems that we, human can express or codified in a structural form.  Don't underestimate its power underneath.  With good abstract thinking, smartest human in the world has partitioned large number of problems (by its problem nature) into different problem categories.  Each category is expressed in form od a "generic problem" and subsequently a "general solution" is developed.  Notice that computer scientist has been doing this for many decades, and come up with the powerful algorithm such as "Sorting", "Finding shortest path", "Heuristic search" ... etc.

By grouping concrete problems by their nature into a "generic, abstract problem", we can significantly reduce the volume of cases/scenarios while still covers a large area of ground.  The "generic solution" we developed can also be specialized for each concrete problem scenario.  After that we can develop a software program which can be executed in a large cluster of machines equipped with fast CPU and a lot of memory.  Compare this automated solution with what a human can do in a manual fashion.  In these areas, once problems are well-defined and solutions are automated by software program, computers with much powerful CPU and memory will always beat human in many many orders of magnitude.  There is no question that the human job in these areas will be eliminated.

In terms of capturing our experience using a abstract data structure and algorithm, computer scientist are very far from done.  There are still a very large body of problems that even the smartest human haven't completely figured out how to put them in a structural form yet.  Things that involve "perception", "intuition", "decision making", "estimation", "creativity" are primarily done today by human.  I believe these type of jobs will continue to be done by human workers in our next decade.  On the other hand, with our latest technology research, we continuously push our boundary of automation into some of these areas.  "Face recognition", "Voice recognition" that involves high degree of perception can now be done very accurately by software program.  With "machine learning" technology, we can do "prediction" and make judgement in a more objective way than a human.  Together with "planning" and "optimization" algorithm, large percentage of decision making can be automated, and the result is usually better because of a less biased and data-driven manner.

However, in these forefront areas where latest software technology is unable to automate every steps, the human is need in the path to make a final decision, or interven in those exceptional situation that the software is not programmed to handled.  There are jobs that a human and machine can working together to make better outcome.  This is what is called "augmentation" in the article.  Some job examples are artists are using advanced software to touchup their photos, using computer graphics to create movies, using machine learning to do genome sequence processing, using robots to perform surgery, driver-less vehicles ... etc.

Whether computer programming can replace human completely remains to be seen, but I don't think this will happen in the next 2 decades.  We humans are unique and good at perceiving things with multiple level of abstractions from different angles.  We are good at connecting the dots between unrelated areas.  We can invent new things.  These are things that machine will be very hard to do, or at least will take a long time if at all possible.

"When can we program a machine that can write program ?"

The HBR article suggests a person can consider five strategies (step up, step aside, step in, step narrowly and step forward) to retain value in the automation era.  I favor the "step forward" strategy because the person is driving the trend rather than passively reacting to the trend.  Date back to our history, human's value system has been shifted across industry revolution, internet revolution etc.   At the end of the day, it is more-sophisticated human who take away jobs (and value) from other less-sophisticated human.  And it is always the people who drives the movement to be the winner of this value shift.  It happens in the past and will continue into future.

Sunday, February 22, 2015

Big Data Processing in Spark

In the traditional 3-tier architecture, data processing is performed by the application server where the data itself is stored in the database server.  Application server and database server are typically two different machine.  Therefore, the processing cycle proceeds as follows
  1. Application server send a query to the database server to retrieve the necessary data
  2. Application server perform processing on the received data
  3. Application server will save the changed data to the database server
In the traditional data processing paradigm, we move data to the code.
It can be depicted as follows ...

Then big data phenomenon arrives.  Because the data volume is huge, it cannot be hold by a single database server.  Big data is typically partitioned and stored across many physical DB server machines.  On the other hand, application servers need to be added to increase the processing power of big data.

However, as we increase the number of App servers and DB servers for storing and processing the big data, more data need to be transfer back and forth across the network during the processing cycle, up to a point where the network becomes a major bottleneck.

Moving code to data

To overcome the network bottleneck, we need a new computing paradigm.  Instead of moving data to the code, we move the code to the data and perform the processing at where the data is stored.

Notice the change of the program structure
  • The program execution starts at a driver, which orchestrate the execution happening remotely across many worker servers within a cluster.
  • Data is no longer transferred to the driver program, the driver program holds a data reference in its variable rather than the data itself.  The data reference is basically an id to locate the corresponding data residing in the database server
  • Code is shipped from the program to the database server, where the execution is happening, and data is modified at the database server without leaving the server machine.
  • Finally the program request a save of the modified data.  Since the modified data resides in the database server, no data transfer happens over the network.
By moving the code to the data, the volume of data transfer over network is significantly reduced.  This is an important paradigm shift for big data processing.

In the following session, I will use Apache Spark to illustrate how this big data processing paradigm is implemented.


Resilient Distributed Dataset (RDD) is how Spark implements the data reference concept.  RDD is a logical reference of a dataset which is partitioned across many server machines in the cluster.

To make a clear distinction between data reference and data itself, a Spark program is organized as a sequence of execution steps, which can either be a "transformation" or an "action".

Programming Model

A typical program is organized as follows
  1. From an environment variable "context", create some initial data reference RDD objects
  2. Transform initial RDD objects to create more RDD objects.  Transformation is expressed in terms of functional programming where a code block is shipped from the driver program to multiple remote worker server, which hold a partition of the RDD.  Variable appears inside the code block can either be an item of the RDD or a local variable inside the driver program which get serialized over to the worker machine.  After the code (and the copy of the serialized variables) is received by the remote worker server, it will be executed there by feeding the items of RDD residing in that partition.  Notice that the result of a transformation is a brand new RDD (the original RDD is not mutated)
  3. Finally, the RDD object (the data reference) will need to be materialized.  This is achieved through an "action", which will dump the RDD into a storage, or return its value data to the driver program.
Here is a word count example

# Get initial RDD from the context
file = spark.textFile("hdfs://...")

# Three consecutive transformation of the RDD
counts = file.flatMap(lambda line: line.split(" "))
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)

# Materialize the RDD using an action

When the driver program starts its execution, it builds up a graph where nodes are RDD and edges are transformation steps.  However, no execution is happening at the cluster until an action is encountered.  At that point, the driver program will ship the execution graph as well as the code block to the cluster, where every worker server will get a copy.

The execution graph is a DAG.
  • Each DAG is a atomic unit of execution. 
  • Each source node (no incoming edge) is an external data source or driver memory
  • Each intermediate node is a RDD
  • Each sink node (no outgoing edge) is an external data source or driver memory
  • Green edge connecting to RDD represents a transformation.  Red edge connecting to a sink node represents an action

Data Shuffling

Although we ship the code to worker server where the data processing happens, data movement cannot be completely eliminated.  For example, if the processing requires data residing in different partitions to be grouped first, then we need to shuffle data among worker server.

Spark carefully distinguish "transformation" operation in two types.
  • "Narrow transformation" refers to the processing where the processing logic depends only on data that is already residing in the partition and data shuffling is unnecessary.  Examples of narrow transformation includes filter(), sample(), map(), flatMap() .... etc.
  • "Wide transformation" refers to the processing where the processing logic depends on data residing in multiple partitions and therefore data shuffling is needed to bring them together in one place.  Example of wide transformation includes groupByKey(), reduceByKey() ... etc.

Joining two RDD can also affect the amount of data being shuffled.  Spark provides two ways to join data.  In a shuffle join implementation, data of two RDD with the same key will be redistributed to the same partition.  In other words, each of the items in each RDD will be shuffled across worker servers.

Beside shuffle join, Spark provides another alternative call broadcast join.  In this case, one of the RDD will be broadcasted and copied over to every partition.  Imagine the situation when one of the RDD is significantly smaller relative to the other, then broadcast join will reduce the network traffic because only the small RDD need to be copied to all worker servers while the large RDD doesn't need to be shuffled at all.

In some cases, transformation can be re-ordered to reduce the amount of data shuffling.  Below is an example of a JOIN between two huge RDDs followed by a filtering.

Plan1 is a naive implementation which follows the given order.  It first join the two huge RDD and then apply the filter on the join result.  This ends up causing a big data shuffling because the two RDD is huge, even though the result after filtering is small.

Plan2 offers a smarter way by using the "push-down-predicate" technique where we first apply the filtering in both RDDs before joining them.  Since the filtering will reduce the number of items of each RDD significantly, the join processing will be much cheaper.

Execution planning

As explain above, data shuffling incur the most significant cost in the overall data processing flow.  Spark provides a mechanism that generate an execute plan from the DAG that minimize the amount of data shuffling.
  1. Analyze the DAG to determine the order of transformation.  Notice that we starts from the action (terminal node) and trace back to all dependent RDDs.
  2. To minimize data shuffling, we group the narrow transformation together in a "stage" where all transformation tasks can be performed within the partition and no data shuffling is needed.  The transformations becomes tasks that are chained together within a stage
  3. Wide transformation sits at the boundary of two stages, which requires data to be shuffled to a different worker server.  When a stage finishes its execution, it persist the data into different files (one per partition) of the local disks.  Worker nodes of the subsequent stage will come to pickup these files and this is where data shuffling happens
Below is an example how the execution planning turns the DAG into an execution plan involving stages and tasks.

Reliability and Fault Resiliency

Since the DAG defines a deterministic transformation steps between different partitions of data within each RDD RDD, fault recovery is very straightforward.  Whenever a worker server crashes during the execution of a stage, another worker server can simply re-execute the stage from the beginning by pulling the input data from its parent stage that has the output data stored in local files.  In case the result of the parent stage is not accessible (e.g. the worker server lost the file), the parent stage need to be re-executed as well.  Imagine this is a lineage of transformation steps, and any failure of a step will trigger a restart of execution from its last step.

Since the DAG itself is an atomic unit of execution, all the RDD values will be forgotten after the DAG finishes its execution.  Therefore, after the driver program finishes an action (which execute a DAG to its completion), all the RDD value will be forgotten and if the program access the RDD again in subsequent statement, the RDD needs to be recomputed again from its dependents.  To reduce this repetitive processing, Spark provide a caching mechanism to remember RDDs in worker server memory (or local disk).  Once the execution planner finds the RDD is already cache in memory, it will use the RDD right away without tracing back to its parent RDDs.  This way, we prune the DAG once we reach an RDD that is in the cache.

Overall speaking, Apache Spark provides a powerful framework for big data processing.  By the caching mechanism that holds previous computation result in memory, Spark out-performs Hadoop significantly because it doesn't need to persist all the data into disk for each round of parallel processing.  Although it is still very new, I think Spark will take off as the main stream approach to process big data.

Friday, November 28, 2014

Spark Streaming

In this post, we'll discuss another important topic of big data processing: real-time stream processing area.  This is an area where Hadoop falls short because of its high latency, and another open source framework Storm is developed to cover the need in real-time processing.  Unfortunately, Hadoop and Storm provides quite different programming model, resulting in high development and maintenance cost.

Continue from my previous post on Spark, which provides a highly efficient parallel processing framework.  Spark streaming is a natural extension of its core programming paradigm to provide large-scale, real-time data processing.  The biggest benefits of using Spark Streaming is that it is based on a similar programming paradigm of its core and there is no need to develop and maintain a completely different programming paradigm for batch and realtime processing.

Spark Core Programming Paradigm Recap

The core Spark programming paradigm consists of the following steps ...
  1. Taking input data from an external data source and create an RDD (a distributed data set across many servers)
  2. Transform the RDD to another RDD (these transformation defines a direct acyclic graph of dependencies between RDD)
  3. Output the final RDD to an external data source

Notice that the RDD is immutable, therefore the sequence of transformations is deterministic and therefore recovery from intermediate processing failure is simply by tracing back to the parent of the failure node (in the DAG) and redo the processing from there.

Spark Streaming

Spark Streaming introduce a data structure call DStream which is basically a sequence of RDD where each RDD contains data associated with a time interval.  DStream is created with a frequency parameters which defines the frequency RDD creation into the sequence.

Transformation of a DStream boils down to transformation of each RDD (within the sequence of RDD that the DStream contains).  Within the transformation, the RDD inside the DStream can "join" with another RDD (outside the DStream), hence provide a mix processing paradigm between DStream and other RDDs.  Also, since each transformation produces an output RDD, the result of transforming a DStream results in another sequence of RDDs that defines an output DStream.

Here is the basic transformation where each RDD in the output DStream has a one to one correspondence with each RDD in the input DStream. 

Instead of performing a 1 to 1 transformation of each RDD in the DStream.  Spark streaming enable a sliding window operation by defining a WINDOW which groups consecutive RDDs along the time dimension.  There are 2 parameters that the window is defined ...
  1. Window length: defines how many consecutive RDDs will be combined for performing the transformation.
  2. Slide interval: defines how many RDD will be skipped before the next transformation executes.

By providing a similar set of transformation operation for both RDD and DStream, Spark enable a unified programming paradigm across both batch and real-time processing, and hence reduce the corresponding development and maintenance cost.

Wednesday, November 5, 2014

Common data science project flow

As working across multiple data science projects, I observed a similar pattern across a group of strategic data science projects where a common methodology can be used.  In this post, I want to sketch this methodology at a high level.

First of all, "data science" itself is a very generic term that means different things to different people.  For the projects I involved, many of them target to solve a very tactical and specific problem.  However, over the last few years more and more enterprises start to realize the strategic value of data.  I observed a growing number of strategic data science projects were started from a very broad scope and took a top-down approach to look at the overall business operation.  Along the way, the enterprise prioritize the most critical areas within their operation cycle and build sophisticated models to guide and automate the decision process.

Usually, my engagement started as a data scientist / consultant, with very little (or even no) domain knowledge.  Being unfamiliar with the domain is nothing to be proud of and often slow down my initial discussion.  Therefore, within a squeezed time period I need to quickly learn enough "basic domain knowledge" to facilitate the discussion smooth.  On the other hand, lacking a per-conceived model enables me (or you can say force me) to look from a fresh-eye view, from which I can trim off unnecessary details from the legacies and only focus on those essential elements that contributes to the core part of the data model.  It is also fun to go through the concept blending process between a data scientist and a domain expert.  I force them to think in my way and they force me to think in their way.  This is by far the most effective way for me to learn any new concepts.

Recently I had a discussion with a company who has a small, but very sophisticated data science team that build pricing model, and demand forecasting for their product line.  I am, by no means an expert in their domain.  But their problem (how to predict demand, and how to set price) is general enough across many industries.  Therefore, I will use this problem as an example to illustration the major steps in the common pattern that I describe above.

Problem Settings

Lets say a car manufacturer starts its quarterly planning process.  Here are some key decisions that need to be made by the management.
  • How many cars the company should produce for next year ?
  • What should be the renew price of the cars ?
First of all, we need to identify the ultimate "goal" of these decisions.  Such goal is usually easy to find as it usually in the company's mission statement.

In this problem, the goal is to ...
maximize: "Profit_2015"

In general, I find it is a good start to look at the problem from an "optimization" angle, from which we define our goal in terms of an objective function as well as a set of constraints.

Step 1:  Identify variables and define its dependency graph

Build the dependency graph between different variables starting from the Objective function.  Separate between the decision variables (where you have control) and environment variable (where you have no control).

As an illustrative example, we start from our objective function "Profit_2015" and define the dependency relationship below. Decision variable is highlighted in blue.

Profit_2015 = F(UnitSold_2015, UnitProduced_2015, Price_2015, Cost_2015)
UnitSold_2015 = G(Supply_2015, Demand_2015, Price_2015, CompetitorPrice_2015)
Demand_2015 = H(GDP_2014, PhoneSold_2014)
GDP_2015 = T(GDP_2014, GDP_2013, GDP_2012, GDP_2011 ...)

Identifying these variable and their potential dependencies typically come from a well-studied theory from University, or domain experts in the industry.  At this stage, we don't need to know the exact formula of the function F/G/H.  We only need to capture the links between the variables.  It is also ok to include a link that shouldn't have exist (ie: there is no relationship between the 2 variables in reality).  However, it is not good if we miss a link (ie: fail to capture a strong, existing dependency).

This round usually involves 4 to 5 half day brainstorming sessions with the domain experts, facilitated by the data scientist/consultant who is familiar with the model building process.  There may be additional investigation, background studies if the subject matter experts doesn't exist.  Starting from scratch, this round can take somewhere between couple weeks to couple months

Step 2:  Define the dependency function

In this round,  we want to identify the relationship between variable using formula of F(), G(), H().

Well-Known Function
For some relationship that is well-studied, we can use a known mathematical model.

For example, in the relationship
Profit_2015 = F(UnitSold_2015, UnitProduced_2015, Price_2015, Cost_2015)

We can use the following Mathematical formula in a very straightforward manner
Profit = (UnitSold * Price) - (UnitProduced * Cost)

Semi-Known Function
However, some of the relationship is not as straightforward as that.  For those relationship that we don't exactly know the formula, but can make a reasonable assumption on the shape of the formula, we can assume the relationship follows a family of models (e.g. Linear, Quadratic ... etc.), and then figure out the optimal parameters that best fit the historical data.

For example, in the relationship
Demand_2015 = H(GDP_2014, PhoneSold_2014)

Lets assume the "demand" is a linear combination of "GDP" and "Phone sold", which seems to be a reasonable assumption.

For the linear model we assume
Demand = w0 + (w1 * GDP) + (w2 * PhoneSold)

Then we feed the historical training data to a build a linear regression model and figure out what the fittest value of w0, w1, w2 should be.

Time-Series Function
In some cases, a variable depends only on its own past value but not other variables, here we can train a Time Series model to predict the variable based on its own past values.  Typically, the model is decomposed into 3 components; Noise, Trend and Seasonality.  One popular approach is to use exponential smoothing techniques such as Holt/Winters model.  Another popular approach is to use the ARIMA model which decomposed the value into "Auto-Regression" and "Moving-Average".

For example, in the relationship
GDP_2015 = T(GDP_2014, GDP_2013, GDP_2012, GDP_2011 ...)

We can use TimeSeries model to learn the relationship between the historical data to its future value.

Completely Unknown Function
But if we cannot even assume the model family, we can consider using "k nearest neighbor" approach to interpolate the output from its input.  We need to define the "distance function" between data points based on domain knowledge and also to figure out what the optimal value of k should be.  In many case, using a weighted average of the k-nearest neighbor is a good interpolation.

For example, in the relationship
UnitSold_2015 = G(Supply_2015, Demand_2015, Price_2015, CompetitorPrice_2015)
 It is unclear what model to be used in representing UnitSold as a function of Supply, Demand, Price and CompetitorPrice.  So we go with a nearest neighbor approach.

Based on monthly sales of past 3 years, we can use "Euclidean distance" (we can also consider scaling the data to a comparable range by minus its mean and divide by its standard deviation) to find out the closest 5 neighbors, and then using the weighted average to predict the unit sold.

Step 3: Optimization

At this point, we have the following defined
  • A goal defined by maximizing (or minimizing) an objective function
  • A set of variables (including the decision and environment variables)
  • A set of functions that define how these variables are inter-related to each other.  Some of them is defined by a mathematical formula and some of them is defined as a black-box (base on a predictive model)
Our goal is to figure out what the decision variables (which we have control) should be set such that the objective function is optimized (maximized or minimized).

Determine the value of environment variables
For those environment variables that has no dependencies on other variables, we can acquire their value from external data sources.  For those environment variables that has dependencies on other environment variables (but not decision variables), we can estimate their value using the corresponding dependency function (of course, we need to estimate all its depending variables first).  For those environment variables that has dependencies (direct or indirect) on decision variables, leave it as undefined.

Determine the best value of decision variables
Once we formulate the dependency function, depends on the format of these function, we can employ different optimization methods.  Here is how I choose the appropriate method based on the formulation of dependency functions.

Additional Challenges

To summarize, we have following the process below
  1. Define an objective function, constraints, decision variables and environment variables
  2. Identify the relationship between different variables
  3. Collect or predict those environment variables
  4. Optimize those decision variables based on the objective functions
  5. Return the optimal value of decision variables as the answer
So far, our dependency graph is acyclic where our decision won't affect the underlying variables.  Although this is reasonably true if the enterprise is an insignificant small player in the market, it is no longer true if the enterprise is one of the few major players.  For example, their pricing strategy may causes other competitors to change their own pricing strategy as well.  And how the competitors would react is less predictable and historical data play a less important role here.  At some point, human judgement will get involved to fill the gaps.

Sunday, August 17, 2014

Lambda Architecture Principles

"Lambda Architecture" (introduced by Nathan Marz) has gained a lot of traction recently.  Fundamentally, it is a set of design patterns of dealing with Batch and Real time data processing workflow that fuel many organization's business operations.  Although I don't realize any novice ideas has been introduced, it is the first time these principles are being outlined in such a clear and unambiguous manner.

In this post, I'd like to summarize the key principles of the Lambda architecture, focus more in the underlying design principles and less in the choice of implementation technologies, which I may have a different favors from Nathan.

One important distinction of Lambda architecture is that it has a clear separation between the batch processing pipeline (ie: Batch Layer) and the real-time processing pipeline (ie: Real-time Layer).  Such separation provides a means to localize and isolate complexity for handling data update.  To handle real-time query, Lambda architecture provide a mechanism (ie: Serving Layer) to merge/combine data from the Batch Layer and Real-time Layer and return the latest information to the user.

Data Source Entry

At the very beginning, data flows in Lambda architecture as follows ...
  • Transaction data starts streaming in from OLTP system during business operations.  Transaction data ingestion can be materialized in the form of records in OLTP systems, or text lines in App log files, or incoming API calls, or an event queue (e.g. Kafka)
  • This transaction data stream is replicated and fed into both the Batch Layer and Realtime Layer
Here is an overall architecture diagram for Lambda.

Batch Layer

For storing the ground truth, "Master dataset" is the most fundamental DB that captures all basic event happens.  It stores data in the most "raw" form (and hence the finest granularity) that can be used to compute any perspective at any given point in time.  As long as we can maintain the correctness of master dataset, every perspective of data view derived from it will be automatically correct.

Given maintaining the correctness of master dataset is crucial, to avoid the complexity of maintenance, master dataset is "immutable".  Specifically data can only be appended while update and delete are disallowed.  By disallowing changes of existing data, it avoids the complexity of handling the conflicting concurrent update completely.

Here is a conceptual schema of how the master dataset can be structured.  The center green table represents the old, traditional-way of storing data in RDBMS.  The surrounding blue tables illustrates the schema of how the master dataset can be structured, with some key highlights
  • Data are partitioned by columns and stored in different tables.  Columns that are closely related can be stored in the same table
  • NULL values are not stored
  • Each data record is associated with a time stamp since then the record is valid

Notice that every piece of data is tagged with a time stamp at which the data is changed (or more precisely, a change record that represents the data modification is created).  The latest state of an object can be retrieved by extracting the version of the object with the largest time stamp.

Although master dataset stores data in the finest granularity and therefore can be used to compute result of any query, it usually take a long time to perform such computation if the processing starts with such raw form.  To speed up the query processing, various data at intermediate form (called Batch View) that aligns closer to the query will be generated in a periodic manner.  These batch views (instead of the original master dataset) will be used to serve the real-time query processing. 

To generate these batch views, the "Batch Layer" use a massively parallel, brute force approach to process the original master dataset.  Notice that since data in master data set is timestamped, the data candidate can be identified simply from those that has the time stamp later than the last round of batch processing.  Although less efficient, Lambda architecture advocates that at each round of batch view generation, the previous batch view should just be simply discarded and the new batch view is computed  from master dataset.  This simple-mind, compute-from-scratch approach has some good properties in stopping error propagation (since error cannot be accumulated), but the processing may not be optimized and may take a longer time to finish.  This can increase the "staleness" of the batch view.

Real time Layer

As discussed above, generating the batch view requires scanning a large volume of master dataset that takes few hours.  The batch view will therefore be stale for at least the processing time duration (ie: between the start and end of the Batch processing).  But the maximum staleness can be up to the time period between the end of this Batch processing and the end of next Batch processing (ie: the batch cycle).  The following diagram illustrate this staleness.

Even the batch view is stale period, business operates as usual and transaction data will be streamed in continuously.  To answer user's query with the latest, up-to-date information.  The business transaction records need to be captured and merged into the real-time view.  This is the responsibility of the Real-time Layer.  To reduce the latency of latest information availability close to zero, the merge mechanism has to be done in an incremental manner such that no batching delaying the processing will be introduced.  This requires the real time view update to be very different from the batch view update, which can tolerate a high latency.  The end goal is that the latest information that is not captured in the Batch view will be made available in the Realtime view.

The logic of doing the incremental merge on Realtime view is application specific.  As a common use case, lets say we want to compute a set of summary statistics (e.g. mean, count, max, min, sum, standard deviation, percentile) of the transaction data since the last batch view update.  To compute the sum, we can simply add the new transaction data to the existing sum and then write the new sum back to the real-time view.  To compute the mean, we can multiply the existing count with existing mean, adding the transaction sum and then divide by the existing count plus one.  To implement this logic, we need to READ data from the Realtime view, perform the merge and WRITE the data back to the Realtime view.  This requires the Realtime serving DB (which host the Realtime view) to support both random READ and WRITE.  Fortunately, since the realtime view only need to store the stale data up to one batch cycle, its scale is limited to some degree.
Once the batch view update is completed, the real-time layer will discard the data from the real time serving DB that has time stamp earlier than the batch processing.  This not only limit the data volume of Realtime serving DB, but also allows any data inconsistency (of the realtime view) to be clean up eventually.  This drastically reduce the requirement of sophisticated multi-user, large scale DB.  Many DB system support multiple user random read/write and can be used for this purpose.

Serving Layer

The serving layer is responsible to host the batch view (in the batch serving database) as well as hosting the real-time view (in the real-time serving database).  Due to very different accessing pattern, the batch serving DB has a quite different characteristic from the real-time serving DB.

As mentioned in above, while required to support efficient random read at large scale data volume, the batch serving DB doesn't need to support random write because data will only be bulk-loaded into the batch serving DB.  On the other hand, the real-time serving DB will be incrementally (and continuously) updated by the real-time layer, and therefore need to support both random read and random write.

To maintain the batch serving DB updated, the serving layer need to periodically check the batch layer progression to determine whether a later round of batch view generation is finished.  If so, bulk load the batch view into the batch serving DB.  After completing the bulk load, the batch serving DB has contained the latest version of batch view and some data in the real-time view is expired and therefore can be deleted.  The serving layer will orchestrate these processes.  This purge action is especially important to keep the size of the real-time serving DB small and hence can limit the complexity for handling real-time, concurrent read/write.

To process a real-time query, the serving layer disseminates the incoming query into 2 different sub-queries and forward them to both the Batch serving DB and Realtime serving DB, apply application-specific logic to combine/merge their corresponding result and form a single response to the query.  Since the data in the real-time view and batch view are different from a timestamp perspective, the combine/merge is typically done by concatenate the results together.  In case of any conflict (same time stamp), the one from Batch view will overwrite the one from Realtime view.

Final Thoughts

By separating different responsibility into different layers, the Lambda architecture can leverage different optimization techniques specifically designed for different constraints.  For example, the Batch Layer focuses in large scale data processing using simple, start-from-scratch approach and not worrying about the processing latency.  On the other hand, the Real-time Layer covers where the Batch Layer left off and focus in low-latency merging of the latest information and no need to worry about large scale.  Finally the Serving Layer is responsible to stitch together the Batch View and Realtime View to provide the final complete picture.

The clear demarcation of responsibility also enable different technology stacks to be utilized at each layer and hence can tailor more closely to the organization's specific business need.  Nevertheless, using a very different mechanism to update the Batch view (ie: start-from-scratch) and Realtime view (ie: incremental merge) requires two different algorithm implementation and code base to handle the same type of data.  This can increase the code maintenance effort and can be considered to be the price to pay for bridging the fundamental gap between the "scalability" and "low latency" need.

Nathan's Lambda architecture also introduce a set of candidate technologies which he has developed and used in his past projects (e.g. Hadoop for storing Master dataset, Hadoop for generating Batch view, ElephantDB for batch serving DB, Cassandra for realtime serving DB, STORM for generating Realtime view).  The beauty of Lambda architecture is that the choice of technologies is completely decoupled so I intentionally do not describe any of their details in this post.  On the other hand, I have my own favorite which is different and that will be covered in my future posts.

Sunday, July 27, 2014

Incorporate domain knowledge into predictive model

As a data scientist / consultant, in many cases we are being called in to work with domain experts who has in-depth business knowledge of industry settings.  The main objective is to help our clients to validate and quantify the intuition of existing domain knowledge based on empirical data, and remove any judgement bias.  In many cases, customers will also want to build a predictive model to automate their business decision making process.

To create a predictive model, feature engineering (defining the set of input) is a key part if not the most important.  In this post, I'd like to share my experience in how to come up with the initial set of features and how to evolve it as we learn more.

Firstly, we need to acknowledge two forces in this setting
  1. Domain experts tends to be narrowly focused (and potentially biased) towards their prior experience.  Their domain knowledge can usually encoded in terms of "business rules" and tends to be simple and obvious (if it is too complex and hidden, human brain is not good at picking them up).
  2. Data scientist tends to be less biased and good at mining through a large set of signals to determine how relevant they are in an objective and quantitative manner.  Unfortunately, raw data rarely gives strong signals.  And lacking the domain expertise, data scientist alone will not even be able to come up with a good set of features (usually requires derivation from the combination of raw data).  Notice that trying out all combinations are impractical because there are infinite number of ways to combine raw data.  Also, when you have too many features in the input, the training data will not be enough and resulting in model with high variance.
Maintain a balance between these forces is a critical success factor of many data science project.

This best project settings (in my opinion) is to let the data scientist to take control in the whole exercise (as less bias has an advantage) while guided by input from domain experts.

Indicator Feature

This is a binary variable based on a very specific boolean condition (ie: true or false) that the domain expert believe to be highly indicative to the output.  For example, for predicting stock, one indicator feature is whether the stock has been drop more than 15 % in a day.

Notice that indicator features can be added at any time once a new boolean condition is discovered by the domain expert.  Indicators features doesn't need to be independent to each other and in fact most of the time they are highly inter-correlated.

After fitting these indicator features into the predictive model, we can see how many influence each of these features is asserting in the final prediction and hence providing a feedback to the domain experts about the strength of these signals.

Derived Feature

This is a numeric variable (ie: quantity) that the domain expert believe to be important to predicting the output.  The idea is same as indicator feature except it is numeric in nature.

Expert Stacking

Here we build a predictive model whose input features are taking from each of the expert's prediction output.  For example, to predict the stock, our model takes 20 analyst's prediction as its input.

The strength of this approach is that it can incorporate domain expertise very easily because it treat them as a blackbox (without needing to understand their logic).  The model we training will take into account the relative accuracy of each expert's prediction and adjust its weighting accordingly.  On the other hand, one weakness is the reliance of domain expertise during the prediction, which may or may not be available in an on-going manner.

Saturday, June 28, 2014

Interactive Data Visualization

Recently, "interactive report" is becoming a hot topic in data visualization.  I believe it is becoming the next generation UI paradigm for KPI reports.

Interactive report is sitting somewhere in between static report and BI tools …

Executive KPI report today

Today most executive reports are "static report" provided by financial experts by pulling data from various ERP system on the regular basis, summarize these raw data in a highly condensed and simplified form, then generate a static report for the execs.  When the exec gets the report, it is already in a summarized form that is customized based on his/her prior requirement.  There is no way to ask any other question that the report is not already showing.  Of course, the exec can ask for a separate report which requires additional development time and effort on his/her staff, but also need to wait for the new report to be developed.

This is a suboptimal situation.  In order to survive or maintain leadership in today's highly competitive business environment, execs not just need a much broader perspective (from wide variety of operation data) to make his/her decision, but also he/she has to make the decision fast.  Static report cannot fulfill this need.

Business Intelligence Tools

On the other hand, BI tools (such as Tableau) or OLAP tools can do very detail analysis in wide range of data sources.  However, using these tools to perform more detail analysis (such as slice/dice/rollup/drilldown) typically requires specially trained data analysis skills.  In reality, very few execs use these tools directly.  What they do is to ask their data analyst to prepare a static report for them using these BI tools.  The exec still get a "static report" although it is provided by the BI tools.  Whenever they need to ask a different question, they need to go back to the data analyst and ask to prepare a separate report.

There is a gap between the static report and BI tool.

Interactive Report

"Interactive Report" provides a new paradigm to fill this gap.  It has the following characteristics …
  • Like a static report, "Interactive Report" is still based on "static data", which is a fixed set of data generated in a periodic batch fashion.
  • Unlike static report, this pre-generated "static data" is much larger and wider that covers a broader scope of questions that the execs may ask.
  • Because the "static data" is large and wide, it is impossible to visualize all aspects in the report.  Therefore, only one perspective of the static data (based on the exec's pre-specified requirement) is shown in the report.
  • However, if the exec wants to ask a different question, he/she can switch to a different perspective of the same "static data".

By providing a much large volume of static data, "interactive report" provides a more dynamic data navigation experience to the execs to find out the answer of their ad/hoc unplanned questions.

There are many open source technologies (such as Googlevis...) to support interactive data visualization from which the "interactive report" can be built.  And many of them provides a programmatic interface with R so now data scientist without much Javascript experience can produce highly interactive web pages.