Native scoring in SQL Server 2017 using R

Native scoring is a much-overlooked feature in SQL Server 2017 (available only on Windows and only on-premises) that provides scoring and prediction against pre-built, stored machine learning models in near real-time.

The definition of real-time depends on your line of business, so I will not go into it here; but we can certainly say that scoring 10,000 rows in a second from a mediocre client computer (similar to mine) comes close.

Native scoring in SQL Server 2017 comes with a couple of limitations, but also with a lot of benefits. The limitations are:

  • currently supported only on SQL Server 2017 and the Windows platform
  • the trained model must not exceed 100 MiB in size
  • native scoring with the PREDICT function supports only the following algorithms from the RevoScaleR library:
    • rxLinMod (linear models, such as linear regression)
    • rxLogit (logistic regression)
    • rxBTrees (parallel external memory algorithm for stochastic gradient boosted decision trees)
    • rxDTree (external memory algorithm for classification and regression trees)
    • rxDForest (external memory algorithm for classification and regression decision forests)

The benefits of using the PREDICT function for native scoring are:

  • No configuration of an R or ML environment is needed (assuming the trained models are already stored in the database),
  • Code is cleaner and more readable, and no additional R code is needed when performing the scoring,
  • No R engine is called at run-time, which means a tremendous reduction of CPU and I/O costs, as well as no external calls,
  • A client or server running native scoring with the PREDICT function does not need an R engine installed; it uses C++ libraries from Microsoft that read the serialized model stored in a table, deserialize it and generate the predictions, all without R.

Overall, if you are looking for faster predictions in your enterprise and would love faster code and solution deployment, especially for integration with other applications or for building an API in your ecosystem, native scoring with the PREDICT function will surely be an advantage for you. Although not all types of predictions/scores are supported, the majority can be done using regression or decision tree models (it is estimated that these two families of algorithms, with their derivatives and ensemble methods, are used in 85% of predictive analytics).

To put the PREDICT function to the test, I have deliberately taken a somewhat larger dataset, available with the RevoScaleR package in R – AirlineDemoSmall.csv. Using a simple BULK INSERT, we get the data into the database:
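(Note: the target table must exist first. Its definition is not in the post; a minimal sketch, with column types assumed from the sample rows shown further below, could be:)

-- hypothetical table definition; types assumed from the sample rows below
CREATE TABLE dbo.ArrDelay (
   ArrDelay    INT
  ,CRSDepTime  NUMERIC(16,5)
  ,[DayOfWeek] INT
);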

BULK INSERT ArrDelay
 FROM 'C:\Program Files\Microsoft SQL Server\140\R_SERVER\library\RevoScaleR\SampleData\AirlineDemoSmall.csv'
 WITH 
 ( FIELDTERMINATOR =',', ROWTERMINATOR = '0x0a', FIRSTROW = 2, CODEPAGE = 'RAW');

Once the data is in the database, I will split it into training and test subsets.

SELECT TOP 20000 *
INTO ArrDelay_Train
FROM ArrDelay ORDER BY NEWID()
-- (20000 rows affected)

SELECT *
INTO ArrDelay_Test
FROM ArrDelay AS AR
WHERE NOT EXISTS (SELECT * FROM ArrDelay_Train as ATR
                     WHERE
                       ATR.arrDelay = AR.arrDelay
                   AND ATR.[DayOfWeek] = AR.[DayOfWeek]
                   AND ATR.CRSDepTime = AR.CRSDepTime
                 )
-- (473567 rows affected)

And the layout of the dataset is relatively simple:

ArrDelay  CRSDepTime  DayOfWeek
1         9.383332    3
4         18.983334   4
0         13.883333   4
65        21.499998   7
-3        6.416667    1
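Such a preview can be taken with any ad-hoc query, for example:

-- ad-hoc preview of a few loaded rows
SELECT TOP (5) ArrDelay, CRSDepTime, [DayOfWeek]
FROM dbo.ArrDelay;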

Creating models

So we will create two essentially identical models using the rxLinMod function with the same formula, but one with an additional parameter for real-time scoring set to TRUE.
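The INSERT statements below write into a model table whose definition is not in the post; a minimal sketch could be:

-- hypothetical model store matching the INSERT statements below
CREATE TABLE dbo.arrModels (
   model_name    VARCHAR(100)   NOT NULL PRIMARY KEY  -- e.g. 'arrDelay.LM.V1'
  ,native_model  VARBINARY(MAX) NOT NULL              -- serialized model blob
);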

-- regular model creation
DECLARE @model VARBINARY(MAX);
EXECUTE sp_execute_external_script
 @language = N'R'
 ,@script = N'
 arrDelay.LM <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime, 
                         data = InputDataSet)
 model <- rxSerializeModel(arrDelay.LM)'
 ,@input_data_1 = N'SELECT * FROM ArrDelay_Train'
 ,@params = N'@model varbinary(max) OUTPUT'
 ,@model = @model OUTPUT
 INSERT [dbo].arrModels([model_name], [native_model])
 VALUES('arrDelay.LM.V1', @model) ;

-- Model for Native scoring
DECLARE @model VARBINARY(MAX);

EXECUTE sp_execute_external_script
 @language = N'R'
 ,@script = N'
 arrDelay.LM <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime, 
                             data = InputDataSet)
 model <- rxSerializeModel(arrDelay.LM, realtimeScoringOnly = TRUE)'
 ,@input_data_1 = N'SELECT * FROM ArrDelay_Train'
 ,@params = N'@model varbinary(max) OUTPUT'
 ,@model = @model OUTPUT
 INSERT [dbo].arrModels([model_name], [native_model])
 VALUES('arrDelay.LM.NativeScoring.V1', @model) ;

Both models use the same training set and are stored in a table for future scoring. Upon first inspection, we can see that there is a difference in model size:

[Screenshot: sizes of the two stored models – the native-scoring model is noticeably smaller]
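The difference can also be verified with a simple query against the model table, e.g.:

-- compare the sizes of the serialized models
SELECT model_name
      ,DATALENGTH(native_model) AS model_size_bytes
FROM dbo.arrModels;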

Scoring models

Both models took roughly the same amount of time to train and to store in the table. Both can also be created on an R Machine Learning Server and stored in the same way (with or without the realtimeScoringOnly argument). The model size gives you an idea of why and how real-time scoring can be achieved: keep your model as small as possible. Both models will give you the exact same prediction scores; native scoring will simply be much faster. Note also, if you are planning to do any text analysis with real-time scoring, keep in mind the 100 MiB limitation, as text prediction models often exceed it.

To compare the execution of the scoring models, I will run the predictions the “traditional” way, using the external procedure sp_execute_external_script, and then using the PREDICT function.

------------------------------------
-- Using sp_execute_external_script
------------------------------------
DECLARE @model VARBINARY(MAX) = (SELECT native_model FROM arrModels 
WHERE model_name = 'arrDelay.LM.V1')

EXEC sp_execute_external_script
 @language = N'R'
 ,@script = N'
 modelLM <- rxUnserializeModel(model)
 OutputDataSet <- rxPredict( model=modelLM,
                 data = ArrDelay_Test,
                 type = "link",
                 predVarNames = "ArrDelay_Pred",
                 extraVarsToWrite = c("ArrDelay","CRSDepTime","DayOfWeek")
 )'
 ,@input_data_1 = N'SELECT * FROM dbo.ArrDelay_Test'
 ,@input_data_1_name = N'ArrDelay_Test'
 ,@params = N'@model VARBINARY(MAX)'
 ,@model = @model
WITH RESULT SETS
((
 ArrDelay_Pred FLOAT
,ArrDelay INT 
,CRSDepTime NUMERIC(16,5)
,[DayOfWeek] INT
))
-- (473567 rows affected)
-- Duration 00:00:08

---------------------------
-- Using Real Time Scoring
---------------------------
DECLARE @model varbinary(max) = ( SELECT native_model FROM arrModels 
WHERE model_name = 'arrDelay.LM.NativeScoring.V1');

SELECT 
 newData.*
 ,p.*
 FROM PREDICT(MODEL = @model, DATA = dbo.ArrDelay_Test AS newData)
 WITH(ArrDelay_Pred FLOAT) AS p;
GO
-- (473567 rows affected)
-- Duration 00:00:04

The two examples differ in shape, but the PREDICT function is clearly the more readable and neater of the two. Time performance is also on the PREDICT function's side: in this test it returns the predictions in half the time.

In addition, I have mentioned that the PREDICT function does not need the R engine or the Launchpad service to be running in the environment where the code executes. To put this to the test, I will simply stop the SQL Server Launchpad service:

[Screenshot: stopping the SQL Server Launchpad service]
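If you prefer T-SQL, the services DMV should show the Launchpad service as stopped (assuming Launchpad is listed there on your installation):

-- check service states from T-SQL; Launchpad should appear as stopped
SELECT servicename, status_desc
FROM sys.dm_server_services;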

After executing the first set of predictions using sp_execute_external_script, SQL Server will notify you that it was unable to communicate with the service:

[Screenshot: error message – SQL Server was unable to communicate with the Launchpad service]

whereas the PREDICT function will keep working flawlessly.

Verdict

Faster predictions are certainly something very welcome in the gaming, transport, utility, metal and financial industries, as well as in any other field where real-time predictions against OLTP systems are much appreciated. With light-weight models and good algorithm support, I would for sure give native scoring an additional thought, especially if you see good potential in faster, near real-time predictions.

As always, the complete code and data sample are available on GitHub. Happy coding! 🙂

Using R in SQL Server Reporting Services (SSRS)

SQL Server Reporting Services (SSRS) is an outstanding tool for creating, deploying and managing paginated, mobile and KPI reports, as well as Power BI reports. The tool provides a simple way to share data and insights in your corporate environment.

[Figure: overview of the SSRS report types (photo source: Microsoft docs)]

Using the power of the R language to enrich your data, your statistical analyses or your visualizations is a simple way to get more out of your reports.

The best practice for embedding R code into an SSRS report is to create a stored procedure and output its results to the report. To demonstrate this, we will create two reports: one that takes two input parameters, and a second one that demonstrates the usage of R visualization.

First, we will create a designated database and generate some sample data:

CREATE DATABASE R_SSRS;
GO
USE R_SSRS;
GO

CREATE TABLE R_data (
 v1 INT, v2 INT, v_low INT, v_high INT, letter CHAR(1));
GO

CREATE OR ALTER PROCEDURE generate_data
AS
BEGIN
  -- insert 10 pseudo-random rows per execution
  INSERT INTO R_data(v1,v2,v_low,v_high,letter)
  SELECT TOP 10
    CAST(v1.number*RAND() AS INT) AS v1
   ,CAST(v2.number*RAND() AS INT) AS v2
   ,v1.low AS v_low
   ,v1.high AS v_high
   ,SUBSTRING(CONVERT(varchar(40), NEWID()),0,2) AS letter -- first hex character of a GUID
  FROM master..spt_values AS v1
  CROSS JOIN master..spt_values AS v2
  WHERE  v1.[type] = 'P' AND v2.[type] = 'P'
 ORDER BY NEWID() ASC;
END
GO
EXEC generate_data;
GO 10
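A quick sanity check: ten executions of ten rows each should leave 100 rows in the table.

-- expecting 100 rows: 10 rows per execution, run 10 times via GO 10
SELECT COUNT(*) AS row_cnt
FROM R_data;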

For the first report we will create a dynamic report, where the user will be able to select two parameters. Based on the selected parameters, the resulting dataset will be passed to R, where additional R code will generate summary statistics.

The idea is to have the values “v1” and “letter” parametrized:

SELECT * FROM R_data
WHERE 
 v1 > 400 -- replace this with parameter; e.g.: @v1
AND letter IN ('1','2','3') -- replace this with parameter; e.g.: @letters

So in this case we will create a stored procedure that builds the data query in T-SQL and calls the external procedure to execute the R code. The procedure looks like this:

CREATE OR ALTER PROCEDURE sp_R1(
   @v1 INT
  ,@lett VARCHAR(20)
) AS
BEGIN
 DECLARE @myQuery NVARCHAR(1000)

  CREATE TABLE #t (let CHAR(1))
  INSERT INTO #t
  SELECT value
  FROM STRING_SPLIT(@lett,',')
 
 SET @myQuery = N'
    SELECT * FROM R_data
    WHERE 
    v1 > '+CAST(@v1 AS VARCHAR(10))+'
    AND letter IN (SELECT * FROM #t)'

EXEC sp_execute_external_script
 @language = N'R'
 ,@script = N'
   df <- InputDataSet 
   OutputDataSet <- data.frame(summary(df))'
  ,@input_data_1 = @myQuery
WITH RESULT SETS
((v1 NVARCHAR(100)
 ,v2 NVARCHAR(100)
 ,freq NVARCHAR(100)))
END;
GO

By testing, we can see that passing both parameters (one single-valued and one multi-valued) returns the result of the R code:

EXEC sp_R1 @v1 = 200, @lett = '1,2,3'

Both of these parameters will be used in the report for selecting (or filtering) the data. In SQL Server 2016 Report Builder, we create a blank report, add a data source and create a new dataset using the stored procedure:

[Screenshot: adding a dataset based on the sp_R1 stored procedure in Report Builder]

Once the dataset is added and the stored procedure tested, Report Builder will automatically create the report parameters:

[Screenshot: the automatically created report parameters]

In addition, we need to set up the parameters: we add a further dataset for the parameters, to have a controlled set of available values, and attach this pool of values to each of the parameters by selecting the dataset the values should be taken from; a sketch of such value queries is shown below.
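These value datasets are not spelled out in the post; hypothetical queries feeding the parameter value lists could be as simple as:

-- hypothetical datasets for the available parameter values
SELECT DISTINCT letter FROM R_data ORDER BY letter;  -- values for @lett
SELECT DISTINCT v1 FROM R_data ORDER BY v1;          -- values for @v1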

[Screenshot: binding the available parameter values to the dataset]

For the multi-valued parameter we also need to specify that the user can select more than one value at a time. To let Report Builder know that the “lett” parameter will be passed back as a string of multiple values, we must select “Allow multiple values”.

[Screenshot: enabling “Allow multiple values” for the lett parameter]

Once this is completed, both parameters will be shown as drop-down selection boxes, and the user will get the results back based upon the selection made.

[Screenshot: both parameters rendered as drop-down selection boxes]

In the end, the report should look very similar to this, where the table is the result generated by the R code, and the rest are T-SQL results presented by the beautiful Reporting Services.

[Screenshot: the finished report, combining T-SQL results with the R-generated summary table]

For the second report, we will create a new stored procedure that generates a boxplot for the values retrieved from the user selection.

The procedure will remain roughly the same; only the R code changes in order to return the graph (the complete procedure is in the code on GitHub):

EXEC sp_execute_external_script
 @language = N'R'
 ,@script = N'
  df <- InputDataSet 
  image_file <- tempfile()
  # render the boxplot into a temporary JPEG file
  jpeg(filename = image_file, width = 400, height = 400)
  boxplot(df$v1~df$letter)
  dev.off()
  # return the raw bytes of the image as a single-column data frame
  OutputDataSet <- data.frame(data=readBin(file(image_file, "rb"), what=raw(), n=1e6))'
 ,@input_data_1 = @myQuery  -- same dynamic query as in the first procedure
WITH RESULT SETS
((plot VARBINARY(MAX)));    -- completion assumed here; the full procedure is on GitHub

In addition, on our new report we will need to show the value returned by this procedure (the binary image) as a graph.

On the report canvas, drag in an image and change the settings of this image as follows, using the database field returned by the procedure as the image source, with MIME type image/jpeg:

[Screenshot: image properties for displaying the returned graph]

Users will now be able to select the values, and based on their selection the R graph generated inside the SSRS report will change accordingly.

The next figure shows just this: how simple it is for the user to change the selection (adding or removing values), with the boxplot redrawn from the updated values.

[Screenshot: the boxplot redrawn after changing the parameter selection]

Both demos should help you leverage R scripts and R visualizations in your corporate reporting service and bring some additional user-interface dynamics into it.

As always, the complete code and reports are available on GitHub. The code also includes my randomly generated dataset, in case you want to get the same results as in this blog post.

Happy RCoding 🙂