Performance differences between RevoScaleR, ColumnStore Table and In-Memory OLTP Table

The focus of this blog post is a comparison between running RevoScaleR computational functions against *.XDF files and having the dataset available in a ColumnStore table or an In-Memory OLTP table.

For this test, I will use the AirLines dataset, available here. I have deliberately picked a 200 MB sample (of the 13 GB dataset) in order to properly test the differences and see which approach works best.

After unzipping the file, I will use the following T-SQL query to import it into SQL Server.

With this example, you can import an XDF file directly into a SQL Server table (note that I first transform the CSV file into XDF and then import the XDF file into the SQL table):

-- must have a write permissions on folder: C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                rxOptions(sampleDataDir = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData")
                inFile <- file.path(rxGetOption("sampleDataDir"), "airsample.csv")
                of <-  rxDataStep(inData = inFile, outFile = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData/airline20170428_2.xdf", 
                             transformVars = c("ArrDelay", "CRSDepTime","DayOfWeek")
                            ,transforms = list(ArrDelay = as.integer(ArrDelay), CRSDepTime = as.numeric(CRSDepTime), DayOfWeek = as.character(DayOfWeek))
                            ,overwrite = TRUE
                            ,maxRowsByCols = 10000000
                            ,rowsPerRead = 200000)
                OutputDataSet <- rxXdfToDataFrame(of)'

DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT 1 AS N'

EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((ArrDelay INT
                    ,CRSDepTime DECIMAL(6,4)
                    ,DofWeek NVARCHAR(20)))
GO

 

So the whole process consists of creating a table, wrapping the above sp_execute_external_script call into a stored procedure, and inserting the results of the external procedure into the table.

--Complete process
CREATE TABLE AirFlights_small 
(id INT IDENTITY(1,1)
,ArrDelay INT
,CRSDepTime DECIMAL(6,4)
,DofWeek NVARCHAR(20) 
);
GO

CREATE Procedure ImportXDFtoSQLTable
AS
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                rxOptions(sampleDataDir = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData")
                inFile <- file.path(rxGetOption("sampleDataDir"), "airsample.csv")
                of <-  rxDataStep(inData = inFile, outFile = "airline20170428_2.xdf", 
                transformVars = c("ArrDelay", "CRSDepTime","DayOfWeek")
            ,transforms = list(ArrDelay = as.integer(ArrDelay), CRSDepTime = as.numeric(CRSDepTime), DayOfWeek = as.character(DayOfWeek))
            ,overwrite = TRUE
            ,maxRowsByCols = 10000000)
             OutputDataSet <- data.frame(rxReadXdf(file=of, varsToKeep=c("ArrDelay", "CRSDepTime","DayOfWeek")))'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT 1 AS N'
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((ArrDelay INT,CRSDepTime DECIMAL(6,4),DofWeek NVARCHAR(20)));
GO

INSERT INTO AirFlights_small
EXECUTE ImportXDFtoSQLTable;
GO

 

There you go. The data is now in a T-SQL table and we can start with the comparisons. I will be measuring the time it takes to get the average arrival delay per day of the week.


RevoScaleR

Using the RevoScaleR package, I will use the rxCrossTabs function with the transforms argument to convert the day of the week into a factor:

# importing data ('inFile' points to airsample.csv, set up as in the T-SQL example above)
outFile2 <- rxDataStep(inData = inFile, outFile = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData/airline20170428_2.xdf", 
            transformVars = c("ArrDelay", "CRSDepTime","DayOfWeek")
           ,transforms = list(ArrDelay = as.integer(ArrDelay), CRSDepTime = as.numeric(CRSDepTime), DayOfWeek = as.character(DayOfWeek))
           ,overwrite = TRUE
           ,maxRowsByCols = 10000000)

of2 <- data.frame(rxReadXdf(file=outFile2, varsToKeep=c("ArrDelay", "CRSDepTime","DayOfWeek")))

# 'transforms' converts DayOfWeek into a factor; an assumed definition (not shown in the original) would be:
# transforms <- list(DayOfWeek = factor(DayOfWeek))
summary(rxCrossTabs(ArrDelay~DayOfWeek
                    ,data = of2  #outFile2
                    ,transforms = transforms
                    ,blocksPerRead=300000), output="means")

Now get those times:

# Getting times
system.time({ 
  summary(rxCrossTabs(ArrDelay~DayOfWeek
                      ,data = of2
                      ,transforms = transforms
                      ,blocksPerRead=300000), output="means")
  })

The results show an elapsed time of 7.8 seconds and a computation time of 3.8 seconds.

Rows Read: 8400013, Total Rows Processed: 8400013, Total Chunk Time: 3.825 seconds 
Computation time: 3.839 seconds.
   user  system elapsed 
   2.89    0.37    7.89 

 

T-SQL query without any specifics

To have a baseline, let’s run the following query:

SET STATISTICS TIME ON;
SELECT 
[DofWeek]
,AVG(ArrDelay) AS [means]
FROM
    AirFlights_small
GROUP BY 
    [DofWeek]
SET STATISTICS TIME OFF;

And check the time statistics:

 SQL Server Execution Times:
CPU time = 6124 ms,  elapsed time = 2019 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.

Obviously the CPU / computation time is higher, although the elapsed time is shorter.

ColumnStore Table

Let’s create a copy of the table with a nonclustered columnstore index.

CREATE TABLE AirFlights_CS
(id INT IDENTITY(1,1)
,ArrDelay INT
,CRSDepTime DECIMAL(6,4)
,DofWeek NVARCHAR(20) 
);
GO
INSERT INTO AirFlights_CS(ArrDelay, CRSDepTime, DofWeek)
SELECT ArrDelay, CRSDepTime, DofWeek FROM AirFlights_small 

CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_AirFlight
ON AirFlights_CS
(id, ArrDelay, CRSDepTime, DofWeek);
GO

Let’s execute the same query:

SET STATISTICS TIME ON;
SELECT 
[DofWeek]
,AVG(ArrDelay) AS [means]
FROM
  AirFlights_CS
GROUP BY 
    [DofWeek];
SET STATISTICS TIME OFF;

The following time statistics come in:

 SQL Server Execution Times:
CPU time = 202 ms,  elapsed time = 109 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.

 

In-Memory OLTP

To get a memory-optimized table, we need to add a memory-optimized filegroup to the database and then create the table with MEMORY_OPTIMIZED turned on.
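The filegroup step itself is not shown in the original code; a minimal sketch (the database name and container path below are placeholders of my choosing, not from the original post) could look like this:

ALTER DATABASE AirlineDemo
ADD FILEGROUP AirlineDemo_MO CONTAINS MEMORY_OPTIMIZED_DATA;
GO
ALTER DATABASE AirlineDemo
ADD FILE (NAME = 'AirlineDemo_MO_Container', FILENAME = 'C:\Data\AirlineDemo_MO_Container')
TO FILEGROUP AirlineDemo_MO;
GO

With the filegroup in place, the table itself: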

CREATE TABLE dbo.AirFlight_M   
(  
  id INT NOT NULL PRIMARY KEY NONCLUSTERED
 ,ArrDelay INT
 ,CRSDepTime DECIMAL(6,4) 
 ,DofWeek NVARCHAR(20)
) WITH (MEMORY_OPTIMIZED=ON, DURABILITY = SCHEMA_AND_DATA);
GO

And insert the data

INSERT INTO AirFlight_M
SELECT * FROM AirFlights_small

Running the simple query

SET STATISTICS TIME ON;
SELECT 
[DofWeek]
,AVG(ArrDelay) AS [means]
FROM
    AirFlight_M
GROUP BY 
    [DofWeek]
SET STATISTICS TIME OFF;

The results are:

 SQL Server Execution Times:
CPU time = 6186 ms,  elapsed time = 1627 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.

These results were somewhat expected, mostly because the ColumnStore table is the only one with an index, so its reads (as the execution plans also show) are optimized compared to the others. The degree of parallelism and clustered/nonclustered indexes could also be pushed further, but the idea was to keep the tests similar to the one in the RevoScaleR / R environment, where we cannot put any index on the XDF file.
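Just to illustrate what such tuning could look like (a hedged sketch, not part of the original test; the index name is mine), a covering nonclustered index and a parallelism hint on the plain table might be:

-- not part of the original test: example of an index and a parallelism hint
CREATE NONCLUSTERED INDEX IX_AirFlights_small_DofWeek
ON AirFlights_small (DofWeek) INCLUDE (ArrDelay);
GO
SELECT
    [DofWeek]
    ,AVG(ArrDelay) AS [means]
FROM AirFlights_small
GROUP BY [DofWeek]
OPTION (MAXDOP 4);
GO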

In R we run:

system.time({ 
LMResults <- rxLinMod(ArrDelay ~ DayOfWeek, data = outFile2, transforms = transforms)
LMResults$coefficients
})

And in SSMS we run:

SET STATISTICS TIME ON;
-- 1. T-SQL
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                LMResults <- rxLinMod(ArrDelay ~ DofWeek, data = InputDataSet)
                OutputDataSet <- data.frame(LMResults$coefficients)'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT ArrDelay, DofWeek FROM [dbo].[AirFlights_small]'
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((
            --DofWeek NVARCHAR(20)
        --    ,
            Coefficient DECIMAL(10,5)
            ));
GO
SET STATISTICS TIME OFF;


SET STATISTICS TIME ON;
-- 2. ColumnStore
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                LMResults <- rxLinMod(ArrDelay ~ DofWeek, data = InputDataSet)
                OutputDataSet <- data.frame(LMResults$coefficients)'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT ArrDelay, DofWeek FROM [dbo].[AirFlights_CS]'
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((
            --DofWeek NVARCHAR(20)
        --    ,
            Coefficient DECIMAL(10,5)
            ));
GO
SET STATISTICS TIME OFF;


SET STATISTICS TIME ON;
-- 3. Memory optimized
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                LMResults <- rxLinMod(ArrDelay ~ DofWeek, data = InputDataSet)
                OutputDataSet <- data.frame(LMResults$coefficients)'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT ArrDelay, DofWeek FROM [dbo].[AirFlight_M]'
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((
            --DofWeek NVARCHAR(20)
        --    ,
            Coefficient DECIMAL(10,5)
            ));
GO
SET STATISTICS TIME OFF;

 

Conclusion

Gathering the statistics on CPU time and elapsed time when running the simple linear regression, this is the comparison:

df_LR_comparison <- data.frame (
  method = c("T-SQL", "ColumnStore", "Memory Optimized", "RevoScaleR")
  ,CPUtime = c(3000,1625,2156,7689)
  ,ElapsedTime = c(14323,10851,10600,7760)
  )
library(ggplot2)

ggplot(df_LR_comparison, aes(method, fill=method)) + 
  geom_bar(aes(y=ElapsedTime), stat="identity") +
  geom_line(aes(y=CPUtime, group=1), color="white", size=3) +
  scale_colour_manual(" ", values=c("d1" = "blue", "d2" = "red"))+
  #scale_fill_manual("",values="red")+
  theme(legend.position="none")

It shows that the elapsed time for the R environment with RevoScaleR (reading data from XDF) is the fastest, whereas the plain T-SQL table queried through sp_execute_external_script with RevoScaleR gives the slowest response.

(Figure: elapsed time (bars) and CPU time (white line) per method)

In terms of CPU time (the white line), the ColumnStore table with the RevoScaleR call through the external procedure outperforms all the others.

Final conclusion: when running statistical analysis (using RevoScaleR or any other R library), use columnstore and index-optimized tables/views to get the best CPU and elapsed times. It is also important to remember that any aggregations and calculations that can be done within SQL Server are better performed there.
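As an illustration of that last point (a hedged sketch, not from the original post), the aggregation can be pushed into the input query of sp_execute_external_script so that R only receives the already-aggregated rows:

-- sketch: aggregate in SQL Server, hand only the result to R
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = N'OutputDataSet <- InputDataSet'
    ,@input_data_1 = N'SELECT DofWeek, AVG(ArrDelay) AS means
                       FROM dbo.AirFlights_CS
                       GROUP BY DofWeek'
WITH RESULT SETS ((DofWeek NVARCHAR(20), means INT));
GO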

 

As always, code is available at GitHub.

 

Happy coding! 🙂


Playing with Regression prediction and MSE measure

In the previous post, I discussed how to create a sample dataset in R. So let’s use the dataset created there and start playing with regression predictions.

With each prediction we want to measure which model scores the new values better and where over-fitting starts.

First we will create training and test subsets for the model. We will use the training dataset to train the model and later use the test dataset to actually test it.
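The dat_set data frame comes from the previous post; purely as an illustrative stand-in (the generating process, coefficients and noise level below are my assumptions, not the original data), something along these lines gives a comparable shape of 1,000 rows:

# purely illustrative stand-in for dat_set from the previous post;
# the generating process and coefficients are assumptions, not the original
set.seed(2908)
x <- runif(1000, min = -5, max = 30)             # x roughly spans the endpoint range used later
y <- 5 + 2 * x + rnorm(1000, mean = 0, sd = 10)  # noisy, roughly linear relationship
dat_set <- data.frame(x = x, y = y)

With a 1,000-row dat_set in place, we split it into 400 training and 600 test rows: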

indices <- sample(1000,400)
train <- dat_set[indices, ]
test <-dat_set[-indices, ]

Once we have our subsets, we can draw a simple scatter plot to get an idea of the point dispersion and what kind of MSE we will be dealing with. For the scatter plot we will use ggplot2 with the geom_point function.

library(ggplot2)
ggplot()+geom_point(data=train, aes(x=x, y=y))+ggtitle("TRAINING SET")
ggplot()+geom_point(data=test, aes(x=x, y=y))+ggtitle("TEST SET")

Now we can use the lm function on the training dataset to fit the model. We will be predicting the dependent variable y with the independent variable x.

model <- lm(y~x,  data=train)
model

The lm function shows (run model or summary(model)):

lm(formula = y ~ x, data = train)

Coefficients:
(Intercept)            x  
      5.267        2.065

This can easily be tested. Let us define x = 5 and observe the predicted value:

x=5
predict(model, data.frame(x))

#results: 
       1 
15.59432

This means that the trained model will predict y ≈ 15.59 when the new value x = 5 is introduced, which matches the coefficients: 5.267 + 2.065 × 5 ≈ 15.59.

I will add two endpoints to draw the linear regression line. This line will serve as a reference for how the linear, quadratic, cubic, etc. functions behave as we look for better and more accurate predictions.

x <-  c(-5,30)
x_predict <- predict(model, data.frame(x))
endpoints <- data.frame(x, x_predict)

I hard-coded the arbitrary endpoints based on the TRAIN graph (see above), in this case -5 and 30 as the two x values (the min and max endpoints), and let the prediction supply the corresponding y values.

Now let’s plot the scatter plot together with the linear regression line and compare the points with the line.

ggplot()+
  geom_point(data=train, aes(x=x, y=y))+
  geom_line(data=endpoints, aes(x=x, y=x_predict), color='brown', size=1)+
  ggtitle("TRAINING SET")

Looking at this, one quickly gets the idea that the “problem”, i.e. the dispersion of the dots, is not really linear.

(Figure: training set with the fitted regression line)

Now let’s play with the prediction model and find the polynomial line that best fits the dots. Along the way we will also look into the problem of over-fitting the regression function.

We will introduce the MSE measure. MSE, the mean squared error, measures the average of the squares of the “errors”, that is, the differences between the estimated (predicted) values and the observed values.
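Written out for $n$ test observations with observed values $y_i$ and model predictions $\hat{y}_i$:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

which is exactly what mean((test$y - predict(model, test))^2) computes below.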

For this purpose we will run our test x values through the model to get its predictions.

#Calculating MSE
x <- test$x
p <- predict(model, data.frame(x)) # or: predict(model, test)

Based on this, we can see the predicted values for each x, which we will use for calculating the MSE.

(Figure: predicted values for each x in the test set)

Continuing, we take the mean of the squared differences between the vector of observed values (the actual y values in the test set) and the vector of predictions the model generates from the corresponding x inputs.

#MSE
sum((test$y - predict(model,data.frame(x)))^2)
#AVG - test mean square error
mse_test_value <- mean((test$y - predict(model,data.frame(x)))^2)

 

Now that we have this value stored (mse_test_value), we can search for the minimal MSE among the polynomial regression functions that follow.

In the first step we created a linear model; now let’s fit a quadratic model on the training dataset.

# Quadratic
model_Q <- lm(y~x+I(x^2), data=train)
model_Q

For this quadratic model we create a helper function for ggplot:

#function for quadratic
f_Q <- function(x) {
      return(predict(model_Q,data.frame(x)))
      }


This function is added to the previous scatter plot to draw the quadratic curve.

#plotting training set with the linear and quadratic models
ggplot()+
  geom_point(data=train, aes(x=x, y=y))+
  geom_line(data=endpoints, aes(x=x, y=x_predict), color='brown', size=1)+
  stat_function(data=data.frame(x=c(-5,30)), aes(x=x), fun=f_Q, color='blue', size=1)+
  ggtitle("TRAINING SET")

The plot shows that, on the training data, the quadratic function sways away from the linear one.

(Figure: training set with the linear and quadratic fits)

But before we get ahead of ourselves, let’s repeat the MSE calculation for the linear vs. the quadratic model and see which one scores better.

#calculate test MSE for quadratic
mean((test$y-predict(model_Q, test))^2)

The MSE calculation gives (for my dataset):

94.35723

whereas for the linear model:

mse_test <- mean((test$y - predict(model,data.frame(x)))^2)

#result 
93.16901

The MSE is slightly better for the linear model than for the quadratic one. For the cubic model (degree 3) we use the following:

model_3 <- lm(formula=y~poly(x,3, raw=T), data=train)
mse_3 <- mean((test$y-predict(model_3,test))^2)

with result of:

95.07303

We see that the cubic fit is not as good as the quadratic or the linear one, and we may already be starting to over-fit the model. So let’s loop from the linear model up to a polynomial of degree 13 and look at the fit scores (MSE), using the following code:

for(i in 1:13) {
  model <- lm(formula=y~poly(x,i, raw=T), data=train)
  mse <- mean((test$y-predict(model,test))^2)
  print(mse)
}

This returns the following results:

[1] 93.16901
[1] 94.35723
[1] 95.07303
[1] 95.23098
[1] 94.79259
[1] 96.55435
[1] 108.6518
[1] 132.8873
[1] 130.5214
[1] 212.9898
[1] 169.7865
[1] 7321.596
[1] 226.708

We see that at degree 5 the fit is 94.79 compared to 94.35 at degree 2, but the linear model still outperforms all the other functions. What happens at degree 7 and higher is what we call over-fitting.

 

For better visualization I have reduced the loop to only 10 iterations, mainly because at degree 11 and higher we get extreme outliers and very high values.

mse_v<-numeric()
for(i in 1:10) {
  model <- lm(formula=y~poly(x,i, raw=T), data=train)
  mse_v[i] <- mean((test$y-predict(model,test))^2)
}
mse_v


#visualize MSE
y_m <- mse_v
x_m <- 1:10
mse_p <- data.frame(x_m, y_m)

ggplot()+
  geom_point(data=mse_p, aes(x=x_m, y=y_m), size=2)+
  geom_line(data=mse_p, aes(x=x_m, y=y_m), size=1)

The graph shows how the MSE grows with each higher degree.

(Figure: test MSE by polynomial degree)

For the last part, let’s compare the training data to the test data with all the models, from the linear one up to degree 10.

Therefore we introduce the following function:

mse_calc <- function(train, test){
            mse <- numeric(10)   # initialize the result vector
            for(i in 1:10) {
              model <- lm(formula=y~poly(x,i, raw=T), data=train)
              mse[i] <- mean((test$y-predict(model,test))^2)
            }
        return(mse)
}

and make the following visualization even better with a for loop:

x <- 1:10
plot <- ggplot()
for(i in 1:10){
  ind <- sample(1000,500)
  train <- dat_set[ind,]
  test <- dat_set[-ind,]
  y <- mse_calc(train,test)
  mse_poly <- data.frame(x,y)
  plot <- plot + geom_point(data=mse_poly, aes(x,y), size=3)
  plot <- plot + geom_line(data=mse_poly, aes(x,y))
}
plot

This produces the following plot:

(Figure: test MSE by polynomial degree across 10 random train/test splits)

A couple of words on the for loop. In each iteration we generate a random split into training and test data, fit every polynomial degree on the training subset and score it on the test subset, resulting in an MSE per degree. So the linear MSE fit is calculated 10 times on random splits, the quadratic MSE fit 10 times, and so on for degrees 3 to 10. We see that over-fitting already starts at degree 4 and at degree 5 it just explodes.