Tuesday, May 29, 2012

Predictive Analytics: Generalized Linear Regression

In the previous two posts, we covered how to visualize input data to explore strong signals, as well as how to prepare input data into a form that is suitable for learning.  In this and subsequent posts, I'll go through various machine learning techniques to build our predictive model.
  1. Linear regression
  2. Logistic regression
  3. Linear and Logistic regression with regularization
  4. Neural network
  5. Support Vector Machine
  6. Naive Bayes
  7. Nearest Neighbor
  8. Decision Tree
  9. Random Forest
  10. Gradient Boosted Trees
There are two general types of problems that we are interested in for this discussion: classification is about predicting a category (a value that is discrete and finite with no ordering implied), while regression is about predicting a numeric quantity (a value that is continuous with an ordering).

For the classification problem, we use the "iris" data set and predict its "Species" from the "width" and "length" measurements of its sepals and petals.  Here is how we set up our training and testing data.

> summary(iris)
  Sepal.Length       Sepal.Width        Petal.Length    Petal.Width      
 Min.   :4.300000   Min.   :2.000000   Min.   :1.000   Min.   :0.100000  
 1st Qu.:5.100000   1st Qu.:2.800000   1st Qu.:1.600   1st Qu.:0.300000  
 Median :5.800000   Median :3.000000   Median :4.350   Median :1.300000  
 Mean   :5.843333   Mean   :3.057333   Mean   :3.758   Mean   :1.199333  
 3rd Qu.:6.400000   3rd Qu.:3.300000   3rd Qu.:5.100   3rd Qu.:1.800000  
 Max.   :7.900000   Max.   :4.400000   Max.   :6.900   Max.   :2.500000  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> 
# Prepare training and testing data
> testidx <- which(1:length(iris[,1])%%5 == 0)
> iristrain <- iris[-testidx,]
> iristest <- iris[testidx,]


For the regression problem, we use the "Prestige" data set (from the "car" R package) and predict the degree of "prestige" from a set of input variables such as "income", "education", job "type", etc.  Here is how we set up our training and testing data.

> library(car)
> summary(Prestige)
   education            income              women         
 Min.   : 6.38000   Min.   :  611.000   Min.   : 0.00000  
 1st Qu.: 8.44500   1st Qu.: 4106.000   1st Qu.: 3.59250  
 Median :10.54000   Median : 5930.500   Median :13.60000  
 Mean   :10.73804   Mean   : 6797.902   Mean   :28.97902  
 3rd Qu.:12.64750   3rd Qu.: 8187.250   3rd Qu.:52.20250  
 Max.   :15.97000   Max.   :25879.000   Max.   :97.51000  
    prestige            census           type   
 Min.   :14.80000   Min.   :1113.000   bc  :44  
 1st Qu.:35.22500   1st Qu.:3120.500   prof:31  
 Median :43.60000   Median :5135.000   wc  :23  
 Mean   :46.83333   Mean   :5401.775   NA's: 4  
 3rd Qu.:59.27500   3rd Qu.:8312.500            
 Max.   :87.20000   Max.   :9517.000            
> head(Prestige)
                    education income women prestige census type
gov.administrators      13.11  12351 11.16     68.8   1113 prof
general.managers        12.26  25879  4.02     69.1   1130 prof
accountants             12.77   9271 15.70     63.4   1171 prof
purchasing.officers     11.42   8865  9.11     56.8   1175 prof
chemists                14.62   8403 11.68     73.5   2111 prof
physicists              15.64  11030  5.13     77.6   2113 prof
> testidx <- which(1:nrow(Prestige)%%4==0)
> prestige_train <- Prestige[-testidx,]
> prestige_test <- Prestige[testidx,]


To measure the performance of our predictions, we use some simple metrics to get through the basic ideas.  For the regression problem, we'll use the correlation between our prediction and the testing result.  For the classification problem, we'll use a contingency table to count the true/false positives/negatives in our predictions.
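For instance, with hypothetical vectors (not from the original walkthrough), these two checks are just cor() and table():

> # Correlation for regression, contingency table for classification (toy vectors)
> cor(c(2.1, 3.9, 6.2), c(2.0, 4.0, 6.0))
> table(predicted=c('a', 'b', 'b'), actual=c('a', 'b', 'a'))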

Now that we have set the stage, let's get started.

Linear Regression

Linear regression has the longest history in statistics; it is well understood and is the most popular machine learning model.  It is based on the assumption that a linear relationship exists between the input variables x₁, x₂, ... and the (numeric) output variable y.

y = θ₀ + θ₁x₁ + θ₂x₂ + …


The learning algorithm learns the set of parameters θ that minimizes the objective function, the sum of squared errors: ∑(y_actual - y_estimate)².
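To make the objective function concrete, here is a minimal sketch (my own illustration, not the post's code) that minimizes the sum of squared errors numerically with optim() on two scaled predictors; the lm() call below does the same job analytically.

> # Sketch: minimize SSE for theta0 + theta1*education + theta2*income
> X <- cbind(1, scale(prestige_train$education), scale(prestige_train$income))
> y <- prestige_train$prestige
> sse <- function(theta) sum((y - X %*% theta)^2)   # sum of squared errors
> fit <- optim(rep(0, 3), sse, method="BFGS")
> fit$par
> # Compare with the analytic least-squares solution
> coef(lm(prestige ~ scale(education) + scale(income), data=prestige_train))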

Here is the example code in R, using lm().

> model <- lm(prestige~., data=prestige_train)
> summary(model)
Call:
lm(formula = prestige ~ ., data = prestige_train)

Residuals:
        Min          1Q      Median          3Q         Max 
-13.9078951  -5.0335742   0.3158978   5.3830764  17.8851752 

Coefficients:
                  Estimate     Std. Error  t value     Pr(>|t|)    
(Intercept) -20.7073113585  11.4213272697 -1.81304    0.0743733 .  
education     4.2010288017   0.8290800388  5.06710 0.0000034862 ***
income        0.0011503739   0.0003510866  3.27661    0.0016769 ** 
women         0.0363017610   0.0400627159  0.90612    0.3681668    
census        0.0018644881   0.0009913473  1.88076    0.0644172 .  
typeprof     11.3129416488   7.3932217287  1.53018    0.1307520    
typewc        1.9873305448   4.9579992452  0.40083    0.6898376    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 7.41604 on 66 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared: 0.820444,   Adjusted R-squared: 0.8041207 
F-statistic: 50.26222 on 6 and 66 DF,  p-value: < 0.00000000000000022204 

> # Use the model to predict the output of test data
> prediction <- predict(model, newdata=prestige_test)
> # Check for the correlation with actual result
> cor(prediction, prestige_test$prestige)
[1] 0.9376719009


The Estimate column gives an estimate of each θᵢ, and the associated p-value gives the confidence of each estimated θᵢ.  For example, features not marked with at least one * can be safely ignored.

In the above model, education and income have a high influence on prestige.

Linear regression makes certain assumptions about the underlying distribution of the data, which we need to validate.  These include that the residuals (errors) are normally distributed with zero mean and constant variance.

> # verify if the residuals are normally distributed
> rs <- residuals(model)
> qqnorm(rs)
> qqline(rs)
> shapiro.test(rs)
        Shapiro-Wilk normality test
data:  rs 
W = 0.9744, p-value = 0.1424


Here is the plot of how the residuals align with the normal distribution.

Logistic Regression

Logistic regression basically applies a transformation to the output of a linear regression so that it fits into the range 0 to 1, and hence mimics the probability of a binary output.  Logistic regression is the most popular machine learning technique for solving classification problems.

In a classification problem the output is binary rather than numeric, so we can imagine doing a linear regression and then compressing the numeric output into the 0..1 range using the logistic function
1/(1 + e^(-t))

y = 1/(1 + e^(-(θ₀ + θ₁x₁ + θ₂x₂ + …)))


The objective function in logistic regression is different.  It minimizes the cross-entropy:
-∑ (y_actual * log(y_estimate) + (1 - y_actual) * log(1 - y_estimate))
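As a small sketch of the math (my own helper functions sigmoid and cross_entropy, not part of the post), the transformation and the objective can be written out directly; glm() below does this fitting for us.

> sigmoid <- function(t) 1 / (1 + exp(-t))
> cross_entropy <- function(theta, X, y) {
    p <- sigmoid(X %*% theta)                  # predicted probabilities
    -sum(y * log(p) + (1 - y) * log(1 - p))    # cross-entropy (negative log-likelihood)
  }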

Here is the example R code to classify iris data.

> newcol = data.frame(isSetosa=(iristrain$Species == 'setosa'))
> traindata <- cbind(iristrain, newcol)
> head(traindata)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species isSetosa
1          5.1         3.5          1.4         0.2  setosa     TRUE
2          4.9         3.0          1.4         0.2  setosa     TRUE
3          4.7         3.2          1.3         0.2  setosa     TRUE
4          4.6         3.1          1.5         0.2  setosa     TRUE
6          5.4         3.9          1.7         0.4  setosa     TRUE
7          4.6         3.4          1.4         0.3  setosa     TRUE
> formula <- isSetosa ~ Sepal.Length + Sepal.Width +
                        Petal.Length + Petal.Width
> logisticModel <- glm(formula, data=traindata, 
                       family="binomial")
> # Predict the probability for test data
> prob <- predict(logisticModel, newdata=iristest, type='response')
> round(prob, 3)
  5  10  15  20  25  30  35  40  45  50  55  60  65  70  75 
  1   1   1   1   1   1   1   1   1   1   0   0   0   0   0 
 80  85  90  95 100 105 110 115 120 125 130 135 140 145 150 
  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
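As a possible follow-up (my own addition, not in the original post), we can threshold the probabilities at 0.5 and compare against the true labels with a contingency table, the classification metric mentioned earlier:

> # Contingency table of thresholded predictions vs. actual labels
> table(predicted=(prob > 0.5), actual=(iristest$Species == 'setosa'))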

Regression with Regularization

Notice that our assumption of linearity between the input and output variables above is not necessarily true.  One technique is to combine existing features by multiplying them together and hope we are lucky enough to hit some useful combination.  As a result, we may end up with a large set of input features in our machine learning.

When we have a large number of input variables but only a moderate amount of training data, we are subject to the overfitting problem: the model fits the training data too specifically and does not generalize to data we haven't seen.  Regularization is the technique of preventing the model from fitting too specifically to the training data.

In linear regression, overfitting tends to happen when the θ values become large.  So we can add a penalty that is proportional to the magnitude of θ.  In L2 regularization (also known as Ridge regression), Σ θᵢ² is added to the cost function, while in L1 regularization (also known as Lasso regression), Σ |θᵢ| is added to the cost function.

Both L1 and L2 shrink the magnitude of θᵢ.  L2 tends to give linearly dependent input variables similar coefficients, while L1 tends to pick one of their coefficients to be non-zero and set the others to zero.  In other words, L1 regularization penalizes the coefficients of redundant variables that are linearly dependent and is frequently used to remove redundant features.

Combining L1 and L2, the general form of the cost function becomes
Cost = Unregularized cost + λ * (α * Σ |θᵢ| + (1 - α) * Σ θᵢ²)

Notice there are two tunable parameters, lambda (λ) and alpha (α).  Lambda controls the degree of regularization (0 means no regularization; infinity means ignoring all input variables because all of their coefficients will be zero).  Alpha controls the mix between L1 and L2 (0 means pure L2 and 1 means pure L1).
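A quick sketch of the combined penalty term, using a hypothetical coefficient vector and a helper function of my own (elastic_penalty, not from the post), may make the formula concrete:

> elastic_penalty <- function(theta, lambda, alpha) {
    lambda * (alpha * sum(abs(theta)) + (1 - alpha) * sum(theta^2))
  }
> elastic_penalty(c(1.5, -0.2, 0), lambda=0.1, alpha=0.7)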

glmnet is a popular regularization package.  The alpha parameter needs to be supplied based on the application's needs (for selecting a reduced set of variables, alpha=1 is preferred).  The library provides a cross-validation function to automatically figure out a good lambda value.

Let's repeat the linear regression on the "Prestige" data set and use regularization this time; we pick alpha = 0.7 to favor L1 regularization.

> library(glmnet)
> cv.fit <- cv.glmnet(as.matrix(prestige_train[,c(-4, -6)]), 
                      as.vector(prestige_train[,4]), 
                      nlambda=100, alpha=0.7, 
                      family="gaussian")
> plot(cv.fit)
> coef(cv.fit)
5 x 1 sparse Matrix of class "dgCMatrix"
                          1
(Intercept) 6.3876684930151
education   3.2111461944976
income      0.0009473793366
women       0.0000000000000
census      0.0000000000000
> prediction <- predict(cv.fit, 
                        newx=as.matrix(prestige_test[,c(-4, -6)]))
> cor(prediction, as.vector(prestige_test[,4]))
          [,1]
1 0.9291181193


Here is the cross-validation plot, which shows the best lambda with minimal mean square error.


We can also repeat the logistic regression on the "iris" data set with regularization; this time we pick alpha = 0.8 to favor L1 regularization even more.

> library(glmnet)
> cv.fit <- cv.glmnet(as.matrix(iristrain[,-5]), 
                      iristrain[,5], alpha=0.8, 
                      family="multinomial")
> prediction <- predict(cv.fit, 
                        newx=as.matrix(iristest[,-5]),
                        type="class")
> table(prediction, iristest$Species)
prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         2
  virginica       0          0         8


Instead of picking lambda (the degree of regularization) based on cross-validation, we can also base it on the number of input variables we want to retain.  We can plot the "regularization path", which shows how the coefficient of each input variable changes as lambda changes, and pick the lambda that filters out the desired number of input variables for us.

> # try alpha = 0, Ridge regression
> fit <- glmnet(as.matrix(iristrain[,-5]), 
                iristrain[,5], alpha=0, 
                family="multinomial")
> plot(fit)
> # try alpha = 0.8, Elastic net
> fit <- glmnet(as.matrix(iristrain[,-5]), 
                iristrain[,5], alpha=0.8, 
                family="multinomial")
> plot(fit)
> # try alpha = 1, Lasso regression
> fit <- glmnet(as.matrix(iristrain[,-5]), 
                iristrain[,5], alpha=1, 
                family="multinomial")
> plot(fit)


Here is the output of these plots.  Notice that as alpha gets closer to 1 (L1 regularization), it filters out input variables more aggressively.

In my next post, I will cover the other machine learning techniques.

Thursday, May 17, 2012

Predictive Analytics: Data Preparation

As a continuation of my last post on predictive analytics, in this post I will focus on describing how to prepare data for training the predictive model.  I will cover how to perform the necessary sampling to ensure the training data is representative and fits into the machine's processing capacity.  Then we validate the input data, perform the necessary cleanup of format errors, fill in missing values, and finally transform the collected data into our defined set of input features.

Different machine learning models have their own requirements on input and output data types.  Therefore, we may need to perform additional transformations to fit the model's requirements.

Sample the data

If the amount of raw data is huge, processing all of it may require an extensive amount of processing power, which may not be practical.  In this case it is quite common to sample the input data to reduce the size of the data that needs to be processed.

There are many sampling models.  Random sampling is by far the most common; it assigns a uniform probability to each record of being picked.  Stratified sampling, on the other hand, allows different probabilities to be assigned to different record types.  It is very useful in the case of highly unbalanced class occurrence, so that the high-frequency class will be down-sampled.  Cluster sampling is about sampling at a higher level of granularity (i.e. the cluster) and then picking all members within that cluster.  An example of cluster sampling is selecting families (rather than individuals) to conduct a survey.

Sampling can also be done without replacement (each record can be picked at most once) or with replacement (the same record can be picked more than once).

Here is a simple example of performing sampling with replacement (notice record 36 has been selected twice).

> # select 10 records out from iris with replacement
> index <- sample(1:nrow(iris), 10, replace=T)
> index
 [1] 133  36 107 140  66  67  36   3  97  37
> irissample <- iris[index,]
> irissample
     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
133           6.4         2.8          5.6         2.2  virginica
36            5.0         3.2          1.2         0.2     setosa
107           4.9         2.5          4.5         1.7  virginica
140           6.9         3.1          5.4         2.1  virginica
66            6.7         3.1          4.4         1.4 versicolor
67            5.6         3.0          4.5         1.5 versicolor
36.1          5.0         3.2          1.2         0.2     setosa
3             4.7         3.2          1.3         0.2     setosa
97            5.7         2.9          4.2         1.3 versicolor
37            5.5         3.5          1.3         0.2     setosa
> 
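For the stratified sampling mentioned above, here is a rough sketch (my own, not from the original post) that draws an equal number of records from each Species:

> # Sample 5 records per Species (equal-frequency strata)
> perclass <- 5
> stratidx <- unlist(lapply(split(1:nrow(iris), iris$Species),
                            function(idx) sample(idx, perclass)))
> irisstrat <- iris[stratidx,]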

 

Impute missing data

It is quite common that some of the input records are incomplete, in the sense that certain fields are missing or contain input errors.  In a typical tabular data format, we need to validate that each record contains the same number of fields and that each field contains the data type we expect.

In case a record has some fields missing, we have the following choices:
  • Discard the whole record if it is incomplete.
  • Infer the missing value based on the data from other records.  A common approach is to fill the missing data with the average or the median.

> # Create some missing data
> irissample[10, 1] <- NA
> irissample
     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
133           6.4         2.8          5.6         2.2  virginica
36            5.0         3.2          1.2         0.2     setosa
107           4.9         2.5          4.5         1.7  virginica
140           6.9         3.1          5.4         2.1  virginica
66            6.7         3.1          4.4         1.4 versicolor
67            5.6         3.0          4.5         1.5 versicolor
36.1          5.0         3.2          1.2         0.2     setosa
3             4.7         3.2          1.3         0.2     setosa
97            5.7         2.9          4.2         1.3 versicolor
37             NA         3.5          1.3         0.2     setosa
> library(e1071)
Loading required package: class
Warning message:
package ‘e1071’ was built under R version 2.14.2 
> fixIris1 <- impute(irissample[,1:4], what='mean')
> fixIris1
     Sepal.Length Sepal.Width Petal.Length Petal.Width
133      6.400000         2.8          5.6         2.2
36       5.000000         3.2          1.2         0.2
107      4.900000         2.5          4.5         1.7
140      6.900000         3.1          5.4         2.1
66       6.700000         3.1          4.4         1.4
67       5.600000         3.0          4.5         1.5
36.1     5.000000         3.2          1.2         0.2
3        4.700000         3.2          1.3         0.2
97       5.700000         2.9          4.2         1.3
37       5.655556         3.5          1.3         0.2
> fixIris2 <- impute(irissample[,1:4], what='median')
> fixIris2
     Sepal.Length Sepal.Width Petal.Length Petal.Width
133           6.4         2.8          5.6         2.2
36            5.0         3.2          1.2         0.2
107           4.9         2.5          4.5         1.7
140           6.9         3.1          5.4         2.1
66            6.7         3.1          4.4         1.4
67            5.6         3.0          4.5         1.5
36.1          5.0         3.2          1.2         0.2
3             4.7         3.2          1.3         0.2
97            5.7         2.9          4.2         1.3
37            5.6         3.5          1.3         0.2
> 

 

Normalize numeric value

Normalizing data is about transforming numeric data into a uniform range.  Numeric attributes can have different magnitudes based on different measurement units.  To compare numeric attributes on the same scale, we normalize the data by subtracting the average and then dividing by the standard deviation.

> # scale the columns
> # (x - mean(x)) / standard deviation
> scaleiris <- scale(iris[, 1:4])
> head(scaleiris)
     Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,]   -0.8976739  1.01560199    -1.335752   -1.311052
[2,]   -1.1392005 -0.13153881    -1.335752   -1.311052
[3,]   -1.3807271  0.32731751    -1.392399   -1.311052
[4,]   -1.5014904  0.09788935    -1.279104   -1.311052
[5,]   -1.0184372  1.24503015    -1.335752   -1.311052
[6,]   -0.5353840  1.93331463    -1.165809   -1.048667
>

 

Reduce dimensionality

High dimensionality is a problem for machine learning tasks.  There are two ways to reduce the number of input attributes: one is to remove irrelevant input variables, the other is to remove redundant input variables.

Looked at from the other angle, removing irrelevant features is the same as selecting relevant features.  The general approach is to try different subsets of features and see which combination has the best performance (how well it predicts hold-out test data).  One approach is to start with zero features and pick one feature at a time to see which one gives the best prediction, then add a second feature to the best one to find the best pair of features, and so on.  Another approach goes in the opposite direction: start with the full set of features and throw away one feature at a time to see which removal retains the best performance.
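Here is a rough sketch of the first round of forward selection (my own illustration, not from the post), scoring each single feature of the Prestige data by how well it predicts prestige on a held-out subset:

> library(car)
> testidx <- which(1:nrow(Prestige) %% 4 == 0)
> candidates <- c('education', 'income', 'women', 'census')
> for (f in candidates) {
    m <- lm(as.formula(paste('prestige ~', f)), data=Prestige[-testidx,])
    p <- predict(m, newdata=Prestige[testidx,])
    cat(f, cor(p, Prestige[testidx, 'prestige']), "\n")
  }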

In my experience, having irrelevant features sitting around increases the overall processing workload but usually won't affect the accuracy of the prediction.  This is a lesser problem than redundant features, which give unequal weight to information that is duplicated.  Removing redundant features is the higher priority in my opinion.

Principal Component Analysis is a common technique that looks only at the numeric input features to measure the linear dependency among them.  It transforms the input feature set into a lower-dimensional space while retaining most of the fidelity of the data.

> # Use iris data set
> cor(iris[, -5])
              Sepal.Length   Sepal.Width  Petal.Length   Petal.Width
Sepal.Length  1.0000000000 -0.1175697841  0.8717537759  0.8179411263
Sepal.Width  -0.1175697841  1.0000000000 -0.4284401043 -0.3661259325
Petal.Length  0.8717537759 -0.4284401043  1.0000000000  0.9628654314
Petal.Width   0.8179411263 -0.3661259325  0.9628654314  1.0000000000
> # Some attributes shows high correlation, compute PCA
> pca <- prcomp(iris[,-5], scale=T)
> summary(pca)
Importance of components:
                            PC1       PC2       PC3       PC4
Standard deviation     1.708361 0.9560494 0.3830886 0.1439265
Proportion of Variance 0.729620 0.2285100 0.0366900 0.0051800
Cumulative Proportion  0.729620 0.9581300 0.9948200 1.0000000
> # Notice PC1 and PC2 covers most variation
> plot(pca)
> pca$rotation
                       PC1            PC2           PC3           PC4
Sepal.Length  0.5210659147 -0.37741761556  0.7195663527  0.2612862800
Sepal.Width  -0.2693474425 -0.92329565954 -0.2443817795 -0.1235096196
Petal.Length  0.5804130958 -0.02449160909 -0.1421263693 -0.8014492463
Petal.Width   0.5648565358 -0.06694198697 -0.6342727371  0.5235971346
> # Project first 2 records in PCA direction
> predict(pca)[1:2,]
              PC1           PC2          PC3           PC4
[1,] -2.257141176 -0.4784238321 0.1272796237 0.02408750846
[2,] -2.074013015  0.6718826870 0.2338255167 0.10266284468
> # plot all points in top 2 PCA direction
> biplot(pca)




Add derived attributes

In some cases, we may need to compute additional attributes from existing attributes.

> iris2 <- transform(iris, ratio=round(Sepal.Length/Sepal.Width, 2))
> head(iris2)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species ratio
1          5.1         3.5          1.4         0.2  setosa  1.46
2          4.9         3.0          1.4         0.2  setosa  1.63
3          4.7         3.2          1.3         0.2  setosa  1.47
4          4.6         3.1          1.5         0.2  setosa  1.48
5          5.0         3.6          1.4         0.2  setosa  1.39
6          5.4         3.9          1.7         0.4  setosa  1.38


Another common use of derived attributes is to generalize an attribute to a coarser grain, such as converting a geo-location to a zip code, or converting an age to an age group.
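For example, a tiny sketch with hypothetical ages (not from the post) that generalizes a numeric age into age groups using cut(), a special case of the discretization described next:

> age <- c(23, 37, 44, 58, 71)
> cut(age, breaks=c(0, 30, 50, 120), labels=c('young', 'middle', 'senior'))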

Discretize numeric value into categories

Discretizing data is about cutting a continuous value into ranges and assigning the numeric value to the bucket corresponding to the range it falls in.  For a numeric attribute, a common way to generalize it is to discretize it into ranges, which can be either constant width (variable height/frequency) or variable width (constant height/frequency).

> # Equal width cuts
> segments <- 10
> maxL <- max(iris$Petal.Length)
> minL <- min(iris$Petal.Length)
> theBreaks <- seq(minL, maxL, 
                   by=(maxL-minL)/segments)
> cutPetalLength <- cut(iris$Petal.Length, 
                        breaks=theBreaks, 
                        include.lowest=T)
> newdata <- data.frame(orig.Petal.Len=iris$Petal.Length, 
                        cut.Petal.Len=cutPetalLength)
> head(newdata)
  orig.Petal.Len cut.Petal.Len
1            1.4      [1,1.59]
2            1.4      [1,1.59]
3            1.3      [1,1.59]
4            1.5      [1,1.59]
5            1.4      [1,1.59]
6            1.7   (1.59,2.18]
>
> # Constant frequency / height
> myBreaks <- quantile(iris$Petal.Length, 
                       probs=seq(0,1,1/segments))
> cutPetalLength2 <- cut(iris$Petal.Length, 
                         breaks=myBreaks,
                         include.lowest=T)
> newdata2 <- data.frame(orig.Petal.Len=iris$Petal.Length, 
                         cut.Petal.Len=cutPetalLength2)
> head(newdata2)
  orig.Petal.Len cut.Petal.Len
1            1.4       [1,1.4]
2            1.4       [1,1.4]
3            1.3       [1,1.4]
4            1.5     (1.4,1.5]
5            1.4       [1,1.4]
6            1.7     (1.7,3.9]
> 

 

Binarize categorical attributes

Certain machine learning models only take binary (or numeric) input.  In this case, we need to convert a categorical attribute into multiple binary attributes, where each binary attribute corresponds to a particular value of the category (e.g. sunny/rainy/cloudy can be encoded as sunny == 100 and rainy == 010).

> cat <- levels(iris$Species)
> cat
[1] "setosa"     "versicolor" "virginica" 
> binarize <- function(x) {return(iris$Species == x)}
> newcols <- sapply(cat, binarize)
> colnames(newcols) <- cat
> data <- cbind(iris[,c('Species')], newcols)
> data[45:55,]
        setosa versicolor virginica
 [1,] 1      1          0         0
 [2,] 1      1          0         0
 [3,] 1      1          0         0
 [4,] 1      1          0         0
 [5,] 1      1          0         0
 [6,] 1      1          0         0
 [7,] 2      0          1         0
 [8,] 2      0          1         0
 [9,] 2      0          1         0
[10,] 2      0          1         0
[11,] 2      0          1         0

 

Select, combine, aggregate data

Designing the form of the training data is, in my opinion, the most important part of the whole predictive modeling exercise, because the accuracy largely depends on whether the input features are structured in an appropriate form that provides strong signals to the learning algorithm.

Rather than using the raw data as is, it is quite common that multiple pieces of raw data need to be combined, or that multiple raw data records need to be aggregated along some dimensions.

In this section, let's use a different data source, CO2, which records the carbon dioxide uptake of grass plants.

> head(CO2)
  Plant   Type  Treatment conc uptake
1   Qn1 Quebec nonchilled   95   16.0
2   Qn1 Quebec nonchilled  175   30.4
3   Qn1 Quebec nonchilled  250   34.8
4   Qn1 Quebec nonchilled  350   37.2
5   Qn1 Quebec nonchilled  500   35.3
6   Qn1 Quebec nonchilled  675   39.2
>

To select the records that meet certain criteria:

> data <- CO2[CO2$conc>400 & CO2$uptake>40,]
> head(data)
   Plant   Type  Treatment conc uptake
12   Qn2 Quebec nonchilled  500   40.6
13   Qn2 Quebec nonchilled  675   41.4
14   Qn2 Quebec nonchilled 1000   44.3
19   Qn3 Quebec nonchilled  500   42.9
20   Qn3 Quebec nonchilled  675   43.9
21   Qn3 Quebec nonchilled 1000   45.5

To sort the records, let's say we want to sort by conc (in ascending order) and then by uptake (in descending order):

> # ascend sort on conc, descend sort on uptake
> CO2[order(CO2$conc, -CO2$uptake),][1:20,]
   Plant        Type  Treatment conc uptake
15   Qn3      Quebec nonchilled   95   16.2
1    Qn1      Quebec nonchilled   95   16.0
36   Qc3      Quebec    chilled   95   15.1
22   Qc1      Quebec    chilled   95   14.2
8    Qn2      Quebec nonchilled   95   13.6
50   Mn2 Mississippi nonchilled   95   12.0
57   Mn3 Mississippi nonchilled   95   11.3
43   Mn1 Mississippi nonchilled   95   10.6
78   Mc3 Mississippi    chilled   95   10.6
64   Mc1 Mississippi    chilled   95   10.5
29   Qc2      Quebec    chilled   95    9.3
71   Mc2 Mississippi    chilled   95    7.7
16   Qn3      Quebec nonchilled  175   32.4
2    Qn1      Quebec nonchilled  175   30.4
9    Qn2      Quebec nonchilled  175   27.3
30   Qc2      Quebec    chilled  175   27.3
23   Qc1      Quebec    chilled  175   24.1
51   Mn2 Mississippi nonchilled  175   22.0
37   Qc3      Quebec    chilled  175   21.0
58   Mn3 Mississippi nonchilled  175   19.4
> 


To look at each plant rather than each raw record, let's compute the average uptake per plant.

> aggregate(CO2[,c('uptake')], data.frame(CO2$Plant), mean)
   CO2.Plant        x
1        Qn1 33.22857
2        Qn2 35.15714
3        Qn3 37.61429
4        Qc1 29.97143
5        Qc3 32.58571
6        Qc2 32.70000
7        Mn3 24.11429
8        Mn2 27.34286
9        Mn1 26.40000
10       Mc2 12.14286
11       Mc3 17.30000
12       Mc1 18.00000
> 

We can also group by the combination of type and treatment:

> aggregate(CO2[,c('conc', 'uptake')],
            data.frame(CO2$Type, CO2$Treatment), 
            mean)

     CO2.Type CO2.Treatment conc   uptake
1      Quebec    nonchilled  435 35.33333
2 Mississippi    nonchilled  435 25.95238
3      Quebec       chilled  435 31.75238
4 Mississippi       chilled  435 15.81429


To join multiple data sources by a common key, we can use the merge() function.

> head(CO2)
  Plant   Type  Treatment conc uptake
1   Qn1 Quebec nonchilled   95   16.0
2   Qn1 Quebec nonchilled  175   30.4
3   Qn1 Quebec nonchilled  250   34.8
4   Qn1 Quebec nonchilled  350   37.2
5   Qn1 Quebec nonchilled  500   35.3
6   Qn1 Quebec nonchilled  675   39.2
> # Let's create some artificial data
> state <- c('California', 'Mississippi', 'Fujian',
                   'Shandong', 'Quebec', 'Ontario')
> country <- c('USA', 'USA', 'China', 'China', 
                     'Canada', 'Canada')
> geomap <- data.frame(country=country, state=state)
> geomap
  country       state
1     USA  California
2     USA Mississippi
3   China      Fujian
4   China    Shandong
5  Canada      Quebec
6  Canada     Ontario
> # Need to match the column name in join
> colnames(geomap) <- c('country', 'Type')
> joinCO2 <- merge(CO2, geomap, by=c('Type'))
> head(joinCO2)
         Type Plant  Treatment conc uptake country
1 Mississippi   Mn1 nonchilled   95   10.6     USA
2 Mississippi   Mn1 nonchilled  175   19.2     USA
3 Mississippi   Mn1 nonchilled  250   26.2     USA
4 Mississippi   Mn1 nonchilled  350   30.0     USA
5 Mississippi   Mn1 nonchilled  500   30.9     USA
6 Mississippi   Mn1 nonchilled  675   32.4     USA
> 

 

Power and Log transformation

A large portion of machine learning models are based on the assumption of a linear relationship (i.e. the output is linearly dependent on the input), as well as normally distributed errors with constant standard deviation.  However, reality does not exactly align with these assumptions in many cases.

In order to use these machine learning models effectively, it is common practice to transform the input or output variables so that they approximate a normal distribution (and in the multivariate case, a multivariate normal distribution).

A commonly used transformation is the Box-Cox transformation, which transforms a variable x to (x^k - 1) / k, where k is a parameter that we need to determine.  (When k = 2, this is a square transform; when k = 0.5, a square-root transform; when k = 0, it becomes log x; and when k = 1, there is no transform.)
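As a quick sketch (my own helper boxcox_transform, not the car package's implementation), the transform itself is a one-liner:

> boxcox_transform <- function(x, k) {
    if (k == 0) log(x) else (x^k - 1) / k
  }
> boxcox_transform(c(1, 10, 100, 1000), 0.18)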

Here is how we determine what k should be for input variable x and then transform the variable.

> # Use the data set Prestige
> library(car)
> head(Prestige)
                    education income women prestige census type
gov.administrators      13.11  12351 11.16     68.8   1113 prof
general.managers        12.26  25879  4.02     69.1   1130 prof
accountants             12.77   9271 15.70     63.4   1171 prof
purchasing.officers     11.42   8865  9.11     56.8   1175 prof
chemists                14.62   8403 11.68     73.5   2111 prof
physicists              15.64  11030  5.13     77.6   2113 prof
> plot(density(Prestige$income))
> qqnorm(Prestige$income)
> qqline(Prestige$income)
> summary(box.cox.powers(cbind(Prestige$income)))
bcPower Transformation to Normality 

   Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
Y1    0.1793   0.1108          -0.0379           0.3965

Likelihood ratio tests about transformation parameters
                            LRT df         pval
LR test, lambda = (0)  2.710304  1 9.970200e-02
LR test, lambda = (1) 47.261001  1 6.213585e-12
> transformdata <- box.cox(Prestige$income, 0.18)
> plot(density(transformdata))
> qqnorm(transformdata)
> qqline(transformdata)



Hopefully I have covered the primary data transformation tasks.  This sets the stage for my next post, which is about training the predictive model.

Monday, May 14, 2012

Predictive Analytics: Overview and Data visualization

I plan to start a series of blog posts on predictive analytics, as there is an increasing demand for applying machine learning techniques to analyze large amounts of raw data.  This set of techniques is very useful to me, and I think it should be useful to other people as well.  I will also go through some coding examples in R.  R is a statistical programming language that is very useful for performing predictive analytics tasks.  In case you are not familiar with R, here is a very useful link to get some familiarity with R.

Predictive analytics is a specialized set of data processing techniques focused on the problem of predicting future outcomes by analyzing previously collected data.  The processing cycle typically involves two phases:
  1. Training phase: Learn a model from training data
  2. Predicting phase: Deploy the model to production and use that to predict the unknown or future outcome
The whole training lifecycle involves the following steps.

Determine the input and output

At this step, we define the output (what we predict) as well as the input data (what we use) in this exercise.  If the output is predicting a continuous numeric quantity, we call this exercise a “regression”.  If the output is predicting a discrete category, we call it a “classification”.  Similarly, input can also be a number, a category, or a mix of both.

Determining the ultimate output is largely a business requirement and is usually well-defined (e.g. predicting the quarterly revenue).  However, there are many intermediate outputs that are related (in fact, they are inputs) to the final output.  In my experience, determining this set of intermediate outputs usually goes through a back-tracking exploratory process as follows.
  • Start at the final output that we aim to predict (e.g. quarterly revenue).
  • Identify all input variables that are useful for predicting that output, and look into the sources of these input data.  If there is no data source corresponding to an input variable, that variable becomes a candidate intermediate output.
  • Repeat this process for the intermediate outputs.  Effectively we build multiple layers of predictive analytics so that we can move from raw input data to intermediate outputs and eventually to the final output.

Feature engineering

At this step, we determine how to extract useful input information from the raw data that will be influential to the output.  This is an exploratory exercise guided by domain experts.  Finally, a set of input features (derived from the raw input data) is defined.

Visualizing existing data is a very useful way to come up with ideas about which features should be included.  The "data frame" in R is a common structure where data samples are organized in tabular form, and we'll be using some data frames that come with R.  Specifically, the data frame "iris" represents different types of iris flowers along with their sepal and petal measurements.

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> nrow(iris)
[1] 150
> table(iris$Species)

    setosa versicolor  virginica 
        50         50         50 
>


Single Variable Frequency Plot

For numeric data, it is good to get some idea of the frequency distribution.  A histogram together with a smoothed density plot gives a good picture.

> # Plot the histogram
> hist(iris$Sepal.Length, breaks=10, prob=T)
> # Plot the density curve
> lines(density(iris$Sepal.Length))
>

For categorical data, a bar plot is a good choice.

> categories <- table(iris$Species)
> barplot(categories, col=c('red', 'green', 'blue'))
>

 

Two variable Plot

A box plot can be used to visualize the distribution of a numeric value across different categories.

> boxplot(Sepal.Length~Species, data=iris)
>
When there are multiple numeric input variables, it is useful to get an idea of their correlations to identify whether there is any redundancy in our input.  Certain models are sensitive to such redundancy and won't give accurate predictions if their input is highly redundant.  This is known as the multicollinearity problem.

To get an idea of the correlations, we can use a scatter plot matrix across all pairs of input variables.  We can also use a correlation matrix for the same purpose.

> # Scatter plot for all pairs
> pairs(iris[,c(1,2,3,4)])
> # Compute the correlation matrix
> cor(iris[,c(1,2,3,4)])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1170695    0.8716902   0.8179410
Sepal.Width    -0.1170695   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8716902  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179410  -0.3661259    0.9628654   1.0000000
> 

From the observation, we can see that Petal.Length is highly correlated with Petal.Width, and there is also some correlation between Sepal.Length and Petal.Width.

We can further drill down into the relationship between two numeric variables by fitting a regression line or a regression curve (by performing locally weighted linear regression).

> # First plot the 2 variables
> plot(Petal.Width~Sepal.Length, data=iris)
> # Learn the regression model
> model <- lm(Petal.Width~Sepal.Length, data=iris)
> # Plot the regression line
> abline(model)
> # Now learn the local linear model
> model2 <- lowess(iris$Petal.Width~iris$Sepal.Length)
> lines(model2, col="red")
>

 

OLAP processing

If the data is multi-dimensional (has multiple categorical attributes), OLAP-style manipulation can give a very good summary.

In this section, let's use a different data source, CO2, which records the carbon dioxide uptake of grass plants.

> head(CO2)
  Plant   Type  Treatment conc uptake
1   Qn1 Quebec nonchilled   95   16.0
2   Qn1 Quebec nonchilled  175   30.4
3   Qn1 Quebec nonchilled  250   34.8
4   Qn1 Quebec nonchilled  350   37.2
5   Qn1 Quebec nonchilled  500   35.3
6   Qn1 Quebec nonchilled  675   39.2
>

Let's look at the counts summarized along different dimensions.

> cube <- xtabs(~Plant+Type+Treatment, data=CO2)
> cube
, , Treatment = nonchilled

     Type
Plant Quebec Mississippi
  Qn1      7           0
  Qn2      7           0
  Qn3      7           0
  Qc1      0           0
  Qc3      0           0
  Qc2      0           0
  Mn3      0           7
  Mn2      0           7
  Mn1      0           7
  Mc2      0           0
  Mc3      0           0
  Mc1      0           0

, , Treatment = chilled

     Type
Plant Quebec Mississippi
  Qn1      0           0
  Qn2      0           0
  Qn3      0           0
  Qc1      7           0
  Qc3      7           0
  Qc2      7           0
  Mn3      0           0
  Mn2      0           0
  Mn1      0           0
  Mc2      0           7
  Mc3      0           7
  Mc1      0           7

> apply(cube, c("Plant"), sum)
Qn1 Qn2 Qn3 Qc1 Qc3 Qc2 Mn3 Mn2 Mn1 Mc2 Mc3 Mc1 
  7   7   7   7   7   7   7   7   7   7   7   7 
> apply(cube, c("Type"), sum)
     Quebec Mississippi 
         42          42 
> apply(cube, c("Plant", "Type"), mean)
     Type
Plant Quebec Mississippi
  Qn1    3.5         0.0
  Qn2    3.5         0.0
  Qn3    3.5         0.0
  Qc1    3.5         0.0
  Qc3    3.5         0.0
  Qc2    3.5         0.0
  Mn3    0.0         3.5
  Mn2    0.0         3.5
  Mn1    0.0         3.5
  Mc2    0.0         3.5
  Mc3    0.0         3.5
  Mc1    0.0         3.5
>

Prepare training data

At this step, the purpose is to transform the raw data into a form that fits the machine learning model.  Some common steps include ...
  • Data sampling
  • Data validation and handle missing data
  • Normalize numeric value into a uniform range
  • Compute aggregated value (a special case is to compute frequency counts)
  • Expand categorical field to binary fields
  • Discretize numeric value into categories
  • Create derived fields from existing fields
  • Reduce dimensionality
  • Power and Log transformation
Data preparation is the bulk of the effort for most predictive analytics tasks and usually consumes 60 to 80% of the time.  This is an important area that deserves a dedicated post, so I'll cover data preparation in my next blog post.

Train a model

At this step, we pick a machine learning model based on the characteristics of the input and output features.  Then we feed the training data into the learning algorithm, which produces a set of parameters that explains the training data.

There are many machine learning models that we can choose from; some common ones include ...
  1. Linear regression
  2. Logistic regression
  3. Linear and Logistic regression with regularization
  4. Neural network
  5. Support Vector Machine
  6. Naive Bayes
  7. Nearest Neighbor
  8. Decision Tree
  9. Random Forest
  10. Gradient Boosted Trees
Each of the above models has its own characteristics and fits better in certain types of problems.  I'll cover each of them in more detail in future posts.

Validate the model

After we train the model, how do we validate that it does a good job of prediction?  Typically, we withhold a subset of the training data for testing purposes.  Through this testing, we quantitatively measure the prediction performance of our model.  I'll cover this topic in more detail in a future post.
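As a minimal sketch, the same hold-out split used in the modeling post above can be done by reserving every 5th record:

> # Hold out every 5th record of iris for testing
> testidx <- which(1:nrow(iris) %% 5 == 0)
> iristrain <- iris[-testidx,]
> iristest <- iris[testidx,]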

Deploy the model

When we are satisfied with the model, we deploy it to the production environment and use it to predict real-life events.  We should also closely monitor whether the real-life accuracy aligns with our testing results.

Usually, the quality of the model's predictions degrades over time because the stationarity assumption (that future data patterns are similar to the data patterns at training time) no longer holds as time passes.  On the other hand, we may acquire additional input signals that can improve the prediction accuracy.  Therefore, a model should have an expiration time and needs to be retrained after that.

Hopefully, this post gave a high-level overview of the predictive analytics process.  I'll drill down into each item in more detail in my future posts.