Native scoring in SQL Server 2017 using R

Native scoring is a much-overlooked feature of SQL Server 2017 (available only on Windows and only on-premises) that provides scoring and prediction against pre-built machine learning models stored in the database, in near real-time.


What real-time means depends on your line of business, so I will not try to define it here; but scoring 10,000 rows in a second from a mediocre client computer (similar to mine) certainly comes close.

Native scoring in SQL Server 2017 comes with a couple of limitations, but also with a lot of benefits. The limitations are:

  • it is currently supported only on SQL Server 2017 and the Windows platform
  • the trained model must not exceed 100 MiB in size
  • native scoring with the PREDICT function supports only the following algorithms from the RevoScaleR library:
    • rxLinMod (linear models such as linear regression)
    • rxLogit (logistic regression)
    • rxBTrees (parallel external-memory algorithm for stochastic gradient boosted decision trees)
    • rxDTree (external-memory algorithm for classification and regression trees)
    • rxDForest (external-memory algorithm for classification and regression decision forests)

The benefits of using the PREDICT function for native scoring are:

  • no configuration of R or the ML environment is needed (assuming the trained models are already stored in the database),
  • the code is cleaner and more readable, and no additional R code is needed to perform scoring,
  • no R engine is called at run-time, so there is a tremendous reduction in CPU and I/O costs, and no external calls,
  • the client or server running native scoring with the PREDICT function does not need the R engine installed, because it uses Microsoft's C++ libraries, which read the serialized model stored in a table, deserialize it, and generate predictions, all without R.

Overall, if you are looking for faster predictions in your enterprise and want quicker code and solution deployment, especially for integration with other applications or for building APIs in your ecosystem, native scoring with the PREDICT function will surely be an advantage. Although not all predictions/scores are supported, the majority can be produced with regression or decision-tree models (it is estimated that these two families, including derivatives of regression models and ensemble methods, cover about 85% of predictive-analytics use cases).

To put the PREDICT function to the test, I have deliberately taken a somewhat larger dataset available with the RevoScaleR package in R: AirlineDemoSmall.csv. Using a simple BULK INSERT, we get the data into the database:

-- assumes the target table dbo.ArrDelay (ArrDelay INT, CRSDepTime NUMERIC(16,5), DayOfWeek INT) already exists
BULK INSERT ArrDelay
FROM 'C:\Program Files\Microsoft SQL Server\140\R_SERVER\library\RevoScaleR\SampleData\AirlineDemoSmall.csv'
WITH
( FIELDTERMINATOR = ',', ROWTERMINATOR = '0x0a', FIRSTROW = 2, CODEPAGE = 'RAW');
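
If you first want to peek at the sample file from R, here is a minimal sketch (assuming the RevoScaleR package and its bundled sample data are installed):

library(RevoScaleR)

# locate the sample CSV that ships with RevoScaleR and inspect its variables and first rows
csvFile <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.csv")
rxGetInfo(RxTextData(csvFile), getVarInfo = TRUE, numRows = 5)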

Once the data is in the database, I split it into training and test subsets.

SELECT TOP 20000 *
INTO ArrDelay_Train
FROM ArrDelay ORDER BY NEWID()
-- (20000 rows affected)

SELECT *
INTO ArrDelay_Test
FROM ArrDelay AS AR
WHERE NOT EXISTS (SELECT * FROM ArrDelay_Train as ATR
                     WHERE
                       ATR.arrDelay = AR.arrDelay
                   AND ATR.[DayOfWeek] = AR.[DayOfWeek]
                   AND ATR.CRSDepTime = AR.CRSDepTime
                 )
-- (473567 rows affected)

The dataset itself is relatively simple:

ArrDelay  CRSDepTime  DayOfWeek
1          9.383332   3
4         18.983334   4
0         13.883333   4
65        21.499998   7
-3         6.416667   1
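
For completeness, the same table can also be read directly from R via RevoScaleR (the connection string below is hypothetical; adjust server, database, and authentication to your environment):

library(RevoScaleR)

# hypothetical connection string and the table name used in the examples above
conStr  <- "Driver=SQL Server;Server=MYSERVER;Database=TestDB;Trusted_Connection=Yes"
sqlData <- RxSqlServerData(connectionString = conStr, table = "ArrDelay")
rxGetInfo(sqlData, getVarInfo = TRUE, numRows = 5)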

Creating models

We will create two essentially identical models using the rxLinMod function with the same formula, except that the second one sets the additional real-time scoring parameter to TRUE.

-- regular model creation
-- assumes table dbo.arrModels(model_name, native_model VARBINARY(MAX)) exists
DECLARE @model VARBINARY(MAX);

EXECUTE sp_execute_external_script
  @language = N'R'
 ,@script = N'
arrDelay.LM <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime,
                        data = InputDataSet)
model <- rxSerializeModel(arrDelay.LM)'
 ,@input_data_1 = N'SELECT * FROM ArrDelay_Train'
 ,@params = N'@model varbinary(max) OUTPUT'
 ,@model = @model OUTPUT;

INSERT [dbo].arrModels([model_name], [native_model])
VALUES ('arrDelay.LM.V1', @model);
GO

-- Model for Native scoring
DECLARE @model VARBINARY(MAX);

EXECUTE sp_execute_external_script
  @language = N'R'
 ,@script = N'
arrDelay.LM <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime,
                        data = InputDataSet)
model <- rxSerializeModel(arrDelay.LM, realtimeScoringOnly = TRUE)'
 ,@input_data_1 = N'SELECT * FROM ArrDelay_Train'
 ,@params = N'@model varbinary(max) OUTPUT'
 ,@model = @model OUTPUT;

INSERT [dbo].arrModels([model_name], [native_model])
VALUES ('arrDelay.LM.NativeScoring.V1', @model);

Both models are trained on the same training set and stored in a table for future scoring. Upon first inspection, we can see a difference in model size:

[Figure: comparing the stored sizes of the two serialized models]

Scoring Models

Both models took roughly the same amount of time to train and to store in the table. Both can also be created on an R Machine Learning Server and stored in the same way (with or without the realtimeScoringOnly argument). The model size gives you an idea of why and how real-time scoring can be achieved: keep your model as small as possible. Both models give you exactly the same prediction scores; the native-scoring one is just much faster. Note also, if you are planning any text analysis with real-time scoring, keep the 100 MiB limitation in mind, as text-prediction models often exceed it.
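
To see the size difference locally, here is a minimal R sketch comparing the two serialization modes (the data frame below is a hypothetical stand-in for the training set, not the author's exact data):

library(RevoScaleR)

# hypothetical data frame standing in for the training set
d <- data.frame(ArrDelay   = rnorm(20000, 10, 30),
                CRSDepTime = runif(20000, 0, 24),
                DayOfWeek  = sample(1:7, 20000, replace = TRUE))

fit  <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime, data = d)
full <- rxSerializeModel(fit)                              # full serialized model (raw vector)
slim <- rxSerializeModel(fit, realtimeScoringOnly = TRUE)  # keeps only what scoring needs

c(full_bytes = length(full), slim_bytes = length(slim))    # the scoring-only model is smaller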

To compare the execution of the scoring models, I will score the test set the "traditional" way, using the external procedure sp_execute_external_script, and then with the PREDICT function.

------------------------------------
-- Using sp_execute_external_script
------------------------------------
DECLARE @model VARBINARY(MAX) = (SELECT native_model FROM arrModels
                                 WHERE model_name = 'arrDelay.LM.V1');

EXEC sp_execute_external_script
  @language = N'R'
 ,@script = N'
modelLM <- rxUnserializeModel(model)
OutputDataSet <- rxPredict(model = modelLM,
                data = ArrDelay_Test,
                type = "link",
                predVarNames = "ArrDelay_Pred",
                extraVarsToWrite = c("ArrDelay","CRSDepTime","DayOfWeek")
)'
 ,@input_data_1 = N'SELECT * FROM dbo.ArrDelay_Test'
 ,@input_data_1_name = N'ArrDelay_Test'
 ,@params = N'@model VARBINARY(MAX)'
 ,@model = @model
WITH RESULT SETS
((
  ArrDelay_Pred FLOAT
 ,ArrDelay INT
 ,CRSDepTime NUMERIC(16,5)
 ,[DayOfWeek] INT
));
-- (473567 rows affected)
-- Duration 00:00:08

---------------------------
-- Using Real Time Scoring
---------------------------
DECLARE @model VARBINARY(MAX) = (SELECT native_model FROM arrModels
                                 WHERE model_name = 'arrDelay.LM.NativeScoring.V1');

SELECT
  newData.*
 ,p.*
FROM PREDICT(MODEL = @model, DATA = dbo.ArrDelay_Test AS newData)
WITH (ArrDelay_Pred FLOAT) AS p;
GO
-- (473567 rows affected)
-- Duration 00:00:04

The two queries look quite different, but the PREDICT function is much more readable and neater. Time performance is also on the PREDICT function's side: it returns the predictions in half the time (4 seconds versus 8 seconds for the same 473,567 rows on my machine).

In addition, I mentioned that the PREDICT function needs neither the R engine nor the Launchpad service running in the environment where the code executes. To put this to the test, I will simply stop the SQL Server Launchpad service:

[Screenshot: stopping the SQL Server Launchpad service]

Executing the first set of predictions using sp_execute_external_script now fails, with SQL Server notifying you that it cannot communicate with the Launchpad service:

[Screenshot: error message - SQL Server failed to communicate with the Launchpad service]

whereas the PREDICT function continues to work flawlessly.

Verdict

Faster predictions are certainly welcome in the gaming, transport, utility, metal, and financial industries, and in any other business where real-time predictions against OLTP systems are appreciated. With lightweight models and good algorithm support, native scoring deserves serious consideration, especially if you see potential in faster, near real-time predictions.

As always, the complete code and data sample are available on GitHub. Happy coding! 🙂


Playing with Regression prediction and MSE measure

In a previous post I discussed how to create a sample dataset in R. Let's use the dataset created there and start playing with regression predictions.

For each model we want to measure which one scores new values better and where over-fitting starts.

First we create training and test subsets: the model learns on the training dataset, and later we use the test dataset to actually evaluate it.

# split the 1,000 observations into 400 training and 600 test rows
# (without a fixed seed, the sampled rows and the MSE values below will vary between runs)
indices <- sample(1000, 400)
train <- dat_set[indices, ]
test  <- dat_set[-indices, ]

Once we have our subsets, we can draw a simple scatter plot to get an idea of the point dispersion and of what kind of MSE we will be dealing with. For the scatter plot we use ggplot2 with the simple geom_point function.

library(ggplot2)

ggplot() + geom_point(data=train, aes(x=x, y=y)) + ggtitle("TRAINING SET")
ggplot() + geom_point(data=test,  aes(x=x, y=y)) + ggtitle("TEST SET")

Now we can fit the lm function on the training dataset to train the model. We will predict the dependent variable y from the independent variable x.

model <- lm(y~x,  data=train)
model

Running model (or summary(model)) shows:

lm(formula = y ~ x, data = train)

Coefficients:
(Intercept)            x  
      5.267        2.065

This can easily be tested; let us set x = 5 and observe the predicted value:

x=5
predict(model, data.frame(x))

#results: 
       1 
15.59432

This means that the trained model predicts y ≈ 15.59 when the new value x = 5 is introduced into the model.
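
A quick sanity check: recomputing the same prediction by hand from the fitted coefficients gives the same number.

# intercept + slope * 5 reproduces the predict() output
coef(model)[1] + coef(model)[2] * 5
# ~ 15.59432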

I will add some endpoints to draw the linear regression line. This line will serve as a reference for how the linear, quadratic, cubic, etc. functions behave when predicting.

x <-  c(-5,30)
x_predict <- predict(model, data.frame(x))
endpoints <- data.frame(x, x_predict)

I hard-coded the arbitrary endpoints based on the TRAINING SET graph above, in this case -5 and 30 as the min and max x values, and let the model predict the corresponding y values.

Now let's draw the scatter plot together with the linear regression line and compare the points against the line.

ggplot() +
  geom_point(data=train, aes(x=x, y=y)) +
  geom_line(data=endpoints, aes(x=x, y=x_predict), color='brown', size=1) +
  ggtitle("TRAINING SET")

Seeing this, one quickly gets the idea that the dispersion of the points is far from linear.

[Figure: training set scatter plot with the fitted linear regression line]

Now let's play with the prediction model and find the best polynomial line fitting the points, keeping an eye on the problem of over-fitting the regression function throughout.

We will introduce the MSE measure. MSE (mean squared error) measures the average of the squares of the errors, that is, the differences between the predicted and the actual values.
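
Formally, for n test observations with actual values y_i and model predictions ŷ_i:

\[ \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \]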

For this purpose we run the test set's x values through the model to obtain predictions.

#Calculating MSE
x <- test$x
p <- predict(model, data.frame(x)) # or: predict(model, test)

Based on this, we can see the predicted values for each x, which we will use for calculating the MSE.

[Figure: predicted values for each x in the test set]

Continuing, we take the mean of the squared differences between the vector of observed values (the actual y values in the test set) and the vector of predictions the model generates for the corresponding x inputs.

#SSE - sum of squared errors
sum((test$y - predict(model, data.frame(x)))^2)
#MSE - test mean squared error (the average of the squared errors)
mse_test_value <- mean((test$y - predict(model, data.frame(x)))^2)

Now that we have this value stored in a variable (mse_test_value), we can search for the minimal MSE among the higher-degree polynomial regression functions.

In the first step we created a linear model; now let's fit a quadratic one on the training dataset.

# Quadratic
model_Q <- lm(y~x+I(x^2), data=train)
model_Q

For this quadratic model we create a helper function to use with ggplot:

#helper for stat_function: returns model_Q predictions for the given x values
f_Q <- function(x) {
  return(predict(model_Q, data.frame(x)))
}

This helper is added to the previous scatter plot to draw the quadratic fit:

#plotting the training set with the linear and quadratic fits
ggplot() +
  geom_point(data=train, aes(x=x, y=y)) +
  geom_line(data=endpoints, aes(x=x, y=x_predict), color='brown', size=1) +
  stat_function(data=data.frame(x=c(-5,30)), aes(x=x), fun=f_Q, color='blue', size=1) +
  ggtitle("TRAINING SET")

The plot shows that, on the training data, the quadratic function sways away from the linear one.

[Figure: training set with the linear (brown) and quadratic (blue) fits]

But before we get ahead of ourselves, let's repeat the MSE calculation for the linear vs. the quadratic function and see which one measures better.

#calculate test MSE for quadratic
mean((test$y-predict(model_Q, test))^2)

The MSE calculation gives (for my dataset):

94.35723

whereas for the linear function:

mse_test_value   # stored earlier for the linear model

#result 
93.16901

The MSE is slightly lower (better) for the linear model than for the quadratic one. For the cubic function (degree 3) we use the following:

model_3 <- lm(formula=y~poly(x,3, raw=T), data=train)
mse_3 <- mean((test$y-predict(model_3,test))^2)

with the result:

95.07303

We see the cubic fit is not as good as the quadratic or the linear one, and we may already be starting to over-fit the model. So let's loop from the linear model up to a polynomial of degree 13 and look at the fit scores (MSE), using the following code:

for(i in 1:13) {
  model <- lm(formula=y~poly(x,i, raw=T), data=train)
  mse <- mean((test$y-predict(model,test))^2)
  print(mse)
}

This returns the following results:

[1] 93.16901
[1] 94.35723
[1] 95.07303
[1] 95.23098
[1] 94.79259
[1] 96.55435
[1] 108.6518
[1] 132.8873
[1] 130.5214
[1] 212.9898
[1] 169.7865
[1] 7321.596
[1] 226.708

We see that at degree 5 the fit is 94.79, against 94.35 at degree 2, but the linear model still outperforms all the other functions. What happens at degree 7 and higher is what we call over-fitting.

For better visualization I have reduced the loop to 10 degrees, mainly because at degree 11 and higher we get extreme outliers and very high values.

mse_v <- numeric()
for(i in 1:10) {
  model <- lm(formula=y~poly(x,i, raw=T), data=train)
  mse_v[i] <- mean((test$y-predict(model,test))^2)
}
mse_v


#visualize MSE
y_m <- mse_v
x_m <- 1:10
mse_p <- data.frame(x_m, y_m)

ggplot()+
  geom_point(data=mse_p, aes(x=x_m, y=y_m), size=2)+
  geom_line(data=mse_p, aes(x=x_m, y=y_m), size=1)

The graph shows how the MSE grows with each higher degree.

[Figure: MSE by polynomial degree, growing with each higher degree]

For the last part, let's compare training and test data across all the functions, from linear up to degree 10.

To do so, we introduce the following function:

mse_calc <- function(train, test){
  mse <- numeric(10)  # initialize the result vector inside the function
  for(i in 1:10) {
    model <- lm(formula=y~poly(x,i, raw=T), data=train)
    mse[i] <- mean((test$y-predict(model,test))^2)
  }
  return(mse)
}

and build the visualization with a for loop:

x <- 1:10
plot <- ggplot()
for(i in 1:10){
  ind <- sample(1000, 500)
  train <- dat_set[ind, ]
  test  <- dat_set[-ind, ]
  y <- mse_calc(train, test)
  mse_poly <- data.frame(x, y)
  plot <- plot + geom_point(data=mse_poly, aes(x, y), size=3)
  plot <- plot + geom_line(data=mse_poly, aes(x, y))
}
plot

producing the following plot:

[Figure: MSE curves for 10 random train/test splits, polynomial degrees 1 to 10]

A couple of words on the for loop: in each iteration we draw a new random train/test split, fit polynomials of degree 1 through 10 on the training half, and calculate the MSE of each on the test half. So the linear MSE is computed 10 times on random splits, the quadratic MSE 10 times, and so on up to degree 10. We see that over-fitting starts already at degree 4, and at degree 5 it just explodes.
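
One final note: because each run of the loop draws fresh random splits, the exact MSE values (and the degree at which the explosion happens) will differ between runs. To reproduce a particular plot, fix the seed before the loop (a hypothetical constant; any fixed value works):

set.seed(2908)  # makes the sample() draws, and hence the MSE curves, repeatable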