Using R sp_execute_external_script with JSON

JSON support was added to SQL Server in the same version that introduced R integration (SQL Server 2016). Both features were highly anticipated by the community.

SQL Server has powerful statements for converting data to and from JSON (FOR JSON, JSON_VALUE, OPENJSON, etc.). Since JSON is gaining popularity as a data exchange format, I was curious to try it in combination with R.

I will simply convert a system table into a JSON array using the FOR JSON clause.

SELECT top 10 object_id  FROM sys.objects FOR JSON AUTO;

and it gives back the result:

[{"object_id":3},{"object_id":5},{"object_id":6},{"object_id":7},{"object_id":8},
{"object_id":9},{"object_id":17},{"object_id":18},{"object_id":19},{"object_id":20}]

A sp_execute_external_script query without JSON would look like this:

EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script=N'OutputDataSet <- InputDataSet'
      ,@input_data_1 = N'SELECT top 10 object_id  FROM sys.objects'
WITH RESULT SETS ((nr INT));

Now, let's suppose we want to use the JSON result directly in T-SQL through sp_execute_external_script. Imagine getting results from an API and wanting to push them straight into R for analysis. A very straightforward R package for this is jsonlite (rjson is also available). The query would be as follows:

EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script=N'library(jsonlite)
                OutputDataSet <- data.frame(fromJSON(InputDataSet))'
      ,@input_data_1 = N'SELECT top 10 object_id  FROM sys.objects FOR JSON AUTO'
WITH RESULT SETS ((nr INT));

Nope!

Msg 39004, Level 16, State 20, Line 15
A 'R' script error occurred during execution of 'sp_execute_external_script' 
with HRESULT 0x80004004.
Msg 39019, Level 16, State 1, Line 15
An external script error occurred: 
Error: Argument 'txt' must be a JSON string, URL or file.

So the argument 'txt' must be a JSON string, URL or file. Khm… a very "useful" error message. The problem is that the data coming from T-SQL is presented to the R environment (Launchpad) as a data.frame, because what is passed to R is an array of objects. It would look something like this:

[Screenshot: the JSON input as a one-column data.frame in RStudio]
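If you are unsure what exactly the R runtime receives, a quick diagnostic can be added at the top of the @script body; this is only a sketch for inspection and is not part of the final solution:

# inspect what Launchpad hands to the R session
print(class(InputDataSet))   # "data.frame"
str(InputDataSet)            # a single column holding the JSON text
print(nrow(InputDataSet))    # one row for this small result

The printed output shows up as STDOUT messages in SSMS.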

Running the same input in a native R environment at least gives us an idea of where and how to tackle the problem: the data.frame first has to be converted back to a character string, using toJSON and as.character, so that the final T-SQL query looks like this:

EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script=N'
                library(jsonlite)
                js <- InputDataSet
                js2 <- as.character(toJSON(js))
                OutputDataSet <- data.frame(fromJSON(js2))'
      ,@input_data_1 = N'SELECT top 10 object_id  FROM sys.objects FOR JSON AUTO'
WITH RESULT SETS ((nr INT));

Now we get the correct results (the same as if we had not used JSON):

[Screenshot: query results in SSMS]

So R is ready for JSON and JSON is also ready for R.

Happy R+JSON+SQLing!

Association Rules on WideWorldImporters and SQL Server R Services

Association rules are very handy for analyzing retail data, and the WWI database has a really neat set of invoices that can be used as a primer.

Starting with the following T-SQL query:

USE WideWorldImportersDW;
GO

;WITH PRODUCT
AS
(
SELECT 
  [Stock Item Key]
 ,[WWI Stock Item ID]
 ,[Stock Item] 
 ,LEFT([Stock Item], 8) AS L8DESC 
 ,ROW_NUMBER() OVER (PARTITION BY LEFT([Stock Item], 8) ORDER BY ([Stock Item])) AS RN_ID_PR
 ,DENSE_RANK() OVER (ORDER BY (LEFT([Stock Item], 8))) AS PRODUCT_GROUP
 FROM [Dimension].[Stock Item]
)

SELECT
 O.[WWI Order ID]
,O.[Order Key]
,O.[Stock Item Key]
,P.PRODUCT_GROUP
,O.[Description]

FROM [Fact].[Order] AS O
JOIN PRODUCT AS P
    ON P.[Stock Item Key] = O.[Stock Item Key]

ORDER BY 
    O.[WWI Order ID]
    ,O.[Order Key]

 

I have created a very simple product group that neglects the distinction between product variants and treats them as one. For example:

Stock Item Key    WWI Stock Item ID    Stock Item
54                166                  10 mm Anti static bubble wrap (Blue) 20m
53                167                  10 mm Anti static bubble wrap (Blue) 50m

Both products are essentially the same; only the product variant changes (color, size, cap, volume, etc.). The product group denotes the main product, without the variants. I am making this simplification for practical reasons, to keep the dataset smaller.

The new version of the product groups (variable ProductGroup) would look like this:

Stock Item Key    WWI Stock Item ID    Stock Item      ProductGroup
54                166                  10 mm Anti      2
53                167                  10 mm Anti      2

The following code incorporates the R code for analyzing association rules into sp_execute_external_script:

-- Getting Association Rules into T-SQL
DECLARE @TSQL AS NVARCHAR(MAX)
SET @TSQL = N'WITH PRODUCT
                        AS
                      (
                      SELECT
                      [Stock Item Key]
                      ,[WWI Stock Item ID]
                      ,[Stock Item] 
                      ,LEFT([Stock Item], 8) AS L8DESC 
                      ,ROW_NUMBER() OVER (PARTITION BY LEFT([Stock Item], 8) ORDER BY ([Stock Item])) AS RN_ID_PR
                      ,DENSE_RANK() OVER (ORDER BY (LEFT([Stock Item], 8))) AS PRODUCT_GROUP
                      FROM [Dimension].[Stock Item]
                      )
                      
                      SELECT
                      O.[WWI Order ID] AS OrderID
                      -- ,O.[Order Key]   AS OrderLineID
                      -- ,O.[Stock Item Key] AS ProductID
                      ,P.PRODUCT_GROUP AS ProductGroup
                      -- ,O.[Description] AS ProductDescription
                      ,LEFT([Stock Item],8) AS ProductDescription
                      
                      FROM [Fact].[Order] AS O
                      JOIN PRODUCT AS P
                      ON P.[Stock Item Key] = O.[Stock Item Key]
                      GROUP BY
                       O.[WWI Order ID]
                      ,P.PRODUCT_GROUP 
                      ,LEFT([Stock Item],8) 
                      ORDER BY 
                      O.[WWI Order ID]'

DECLARE @RScript AS NVARCHAR(MAX)
SET @RScript = N'
                library(arules)
                cust.data <- InputDataSet
                cd_f <- data.frame(OrderID=as.factor(cust.data$OrderID),
ProductGroup=as.factor(cust.data$ProductGroup))
                cd_f2_tran  <- as(split(cd_f[,"ProductGroup"], cd_f[,"OrderID"]), 
"transactions")
                rules <- apriori(cd_f2_tran, parameter=list(support=0.01, 
confidence=0.1))
                OutputDataSet <- data.frame(inspect(rules))'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RScript
          ,@input_data_1 = @TSQL
          
WITH RESULT SETS ((
     lhs NVARCHAR(500)
    ,[Var.2] NVARCHAR(10)
    ,rhs NVARCHAR(500)
    ,support DECIMAL(18,3)
    ,confidence DECIMAL(18,3)
    ,lift DECIMAL(18,3)
                 ));

 

The result is a set of association rules between products, mined from the transactions, each with its support, confidence and lift.
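To focus on the strongest rules, the rules object can also be sorted and subset before inspecting it; a small sketch that could be run inside the same R script (not part of the original procedure):

# show the five rules with the highest lift
top_rules <- sort(rules, by = "lift")[1:5]
inspect(top_rules)
# or keep them as a data.frame for the OutputDataSet
top_df <- as(top_rules, "data.frame")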

By executing this R code:

# chart if needed
plot(rules, method="grouped", control=list(k=20));

one can also generate a graphical view of the rules and associations between products.

[Figure: grouped matrix plot of the association rules]

And finally, to retrieve the support for each ProductGroup (which is what I need in my case), I would execute this R code embedded in T-SQL:

DECLARE @TSQL AS NVARCHAR(MAX)
SET @TSQL = N'WITH PRODUCT
                        AS
                      (
                      SELECT
                      [Stock Item Key]
                      ,[WWI Stock Item ID]
                      ,[Stock Item] 
                      ,LEFT([Stock Item], 8) AS L8DESC 
                      ,ROW_NUMBER() OVER (PARTITION BY LEFT([Stock Item], 8) ORDER BY ([Stock Item])) AS RN_ID_PR
                      ,DENSE_RANK() OVER (ORDER BY (LEFT([Stock Item], 8))) AS PRODUCT_GROUP
                      FROM [Dimension].[Stock Item]
                      )
                      
                      SELECT
                      O.[WWI Order ID] AS OrderID
                      -- ,O.[Order Key]   AS OrderLineID
                      -- ,O.[Stock Item Key] AS ProductID
                      ,P.PRODUCT_GROUP AS ProductGroup
                      -- ,O.[Description] AS ProductDescription
                      ,LEFT([Stock Item],8) AS ProductDescription
                      
                      FROM [Fact].[Order] AS O
                      JOIN PRODUCT AS P
                      ON P.[Stock Item Key] = O.[Stock Item Key]
                      GROUP BY
                       O.[WWI Order ID]
                      ,P.PRODUCT_GROUP 
                      ,LEFT([Stock Item],8) 
                      ORDER BY 
                      O.[WWI Order ID]'

DECLARE @RScript AS NVARCHAR(MAX)
SET @RScript = N'
                library(arules)
                cust.data <- InputDataSet
                cd_f <- data.frame(OrderID=as.factor(cust.data$OrderID),
ProductGroup=as.factor(cust.data$ProductGroup))
                cd_f2_tran  <- as(split(cd_f[,"ProductGroup"], cd_f[,"OrderID"]),
 "transactions")
                PgroupSets <- eclat(cd_f2_tran, parameter = list(support = 0.05), 
control = list(verbose=FALSE))
                normalizedGroups <- PgroupSets[size(items(PgroupSets)) == 1]
                eachSupport <- quality(normalizedGroups)$support
                GroupName <- unlist(LIST(items(normalizedGroups), decode = FALSE))
                OutputDataSet <- data.frame(GroupName, eachSupport);'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RScript
          ,@input_data_1 = @TSQL
          
WITH RESULT SETS ((
     ProductGroup NVARCHAR(500)
    ,support DECIMAL(18,3)
                 ));

[Screenshot: support per ProductGroup returned in SSMS]

This ProductGroup ID can then be joined in T-SQL in order to retrieve the labels:

 SELECT 
 LEFT([Stock Item], 8) AS L8DESC 
 ,DENSE_RANK() OVER (ORDER BY (LEFT([Stock Item], 8))) AS PRODUCT_GROUP
 FROM [Dimension].[Stock Item]

GROUP BY  LEFT([Stock Item], 8)

Pros and cons

The biggest pro is the ability to integrate association rules with T-SQL and have all the R code working as it should. This gives data wranglers, data scientists and data managers the ability to work out the rules that are hidden in transactional/basket data. Working with the different types of output (support, confidence, lift), the user immediately sees what goes with what. In my case you can also tell that the amount of original data (a little over 73K transactions and a little over 200K rows) is sometimes not enough to generate meaningful rules with relevant content. If the dataset were 100x bigger, I am sure this would not be the case.

Data size falls under the cons. With a larger dataset to analyse, there would be a performance drawback in terms of memory consumption and speed, since this computation in sp_execute_external_script cannot make use of the RevoScaleR package and *.xdf data files. If the RevoScaleR package had a function to support this calculation, I am confident there would only be pros to the association rules learning algorithm.

To sum up, association rules is a great and powerful algorithm for finding correlations between items, and the fact that you can use it straight from SSMS gives me goosebumps. Currently only the performance is a bit of a drawback. Comparing this algorithm to the Analysis Services (SSAS) association rules, the R side has many advantages in terms of maneuverability and extracting the data back to T-SQL; but keep in mind that SSAS is still a very powerful tool for statistical analysis and data predictions.

Code is available at Github.

Happy R-TSQLing!

Comparing performance on dplyr package, RevoScaleR package and T-SQL on simple data manipulation tasks

For a long time I have wanted to test simple data manipulation tasks and compare execution time, ease of writing the code and simplicity between T-SQL and R packages for data manipulation.

A couple of R packages for data manipulation are plyr, dplyr and data.table; these can be compared for execution time, simplicity and ease of writing against general T-SQL code and the RevoScaleR package. For this blog post I will use the dplyr package and T-SQL, with the possibilities offered by RevoScaleR computation functions.

My initial query, run against the WideWorldImportersDW database, is below. No other alterations have been made to the underlying tables (Fact.Sale or Dimension.City).

USE WideWorldImportersDW;
GO

SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory]
        ,fs.[Customer Key]
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN [Dimension].[City] AS c
        ON c.[City Key] = fs.[City Key];

For the comparison metrics I will be using simple time statistics (metric: elapsed time), and every piece of code will be run from SSMS.

-- Example
SET STATISTICS TIME ON
SELECT GETDATE()
SET STATISTICS TIME OFF

The first simple test is just printing the data.

PRINT ' '
PRINT 'STATISTICS WITH T-SQL'
PRINT ' '

SET STATISTICS TIME ON
SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory]
        ,fs.[Customer Key]
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]

SET STATISTICS TIME OFF

PRINT ' '
PRINT 'STATISTICS WITH R'
PRINT ' '

SET STATISTICS TIME ON
DECLARE @TSQL AS NVARCHAR(MAX)
SET @TSQL = N'SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory]
        ,fs.[Customer Key]
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]'

DECLARE @RScript AS NVARCHAR(MAX)
SET @RScript = N'OutputDataSet <- InputDataSet'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RScript
          ,@input_data_1 = @TSQL
          
WITH RESULT SETS ((
     [City Key]  INT
    ,[City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,[Customer Key]  INT
    ,[Invoice Date Key]  DATE
    ,[Quantity]  INT
    ,[Unit Price]  DECIMAL(18,3)
    ,[Tax Amount]  DECIMAL(18,3)
    ,[Total Including Tax]  DECIMAL(18,2)
    ,[Profit]  DECIMAL(18,2)
                 )); 
SET STATISTICS TIME OFF

This test is very simple: outputting the data. In the T-SQL case it is just a plain SELECT statement, and in the R-TSQL case the data is passed to the R engine and returned as a data frame. The test itself is trivial and somewhat "absurd", but it is important for getting a baseline and time-related results.

[Screenshot: elapsed-time statistics for the plain output test]

Running the query a couple of times on my client, I get a baseline comparison. T-SQL outperforms R-TSQL and is almost two times faster. I would have expected an even bigger difference, since the R version involves a lot of data movement and no computation.

Filtering / WHERE

Now let’s perform a simple WHERE clause or some simple filtering.

PRINT ' '
PRINT 'STATISTICS WITH T-SQL'
PRINT ' '

-- SIMPLE T-SQL
SET STATISTICS TIME ON
SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory]
        ,fs.[Customer Key]
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]
    WHERE
        [Sales Territory] <> 'Southeast'
    AND fs.[Customer Key] <> 0
    AND Profit <= 1000
SET STATISTICS TIME OFF


PRINT ' '
PRINT 'STATISTICS WITH R'
PRINT ' '

-- R Package dplyr and T-SQL
SET STATISTICS TIME ON

DECLARE @TSQL AS NVARCHAR(MAX)
SET @TSQL = N'SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory] AS SalesTerritory
        ,fs.[Customer Key] AS CustomerKey
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]'

DECLARE @RScript AS NVARCHAR(MAX)
SET @RScript = N'
                library(dplyr)
                OutputDataSet  <- InputDataSet %>% 
           filter(SalesTerritory != "Southeast", CustomerKey != 0, Profit <= 1000)'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RScript
          ,@input_data_1 = @TSQL
          
WITH RESULT SETS ((
     [City Key]  INT
    ,[City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,[Customer Key]  INT
    ,[Invoice Date Key]  DATETIME
    ,[Quantity]  INT
    ,[Unit Price]  DECIMAL(18,3)
    ,[Tax Amount]  DECIMAL(18,3)
    ,[Total Including Tax]  DECIMAL(18,2)
    ,[Profit]  DECIMAL(18,2)
                 )); 

SET STATISTICS TIME OFF

[Screenshot: elapsed-time statistics for the filtering test]

Again the results of both executions are the same, but the difference is getting bigger and bigger. The main reason is that all the data must be transferred to the R engine before it can be manipulated and the results returned.

Now I can make a direct comparison between the dplyr package and the RevoScaleR package; this way both have the same common ground.

PRINT ' '
PRINT 'STATISTICS WITH R dpylr'
PRINT ' '

SET STATISTICS TIME ON

DECLARE @TSQL AS NVARCHAR(MAX)
SET @TSQL = N'SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory] AS SalesTerritory
        ,fs.[Customer Key] AS CustomerKey
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]'

DECLARE @RScript AS NVARCHAR(MAX)
SET @RScript = N'
                library(dplyr)
                OutputDataSet  <- InputDataSet %>% 
filter(SalesTerritory != "Southeast", CustomerKey != 0, Profit <= 1000)'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RScript
          ,@input_data_1 = @TSQL
          
WITH RESULT SETS ((
     [City Key]  INT
    ,[City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,[Customer Key]  INT
    ,[Invoice Date Key]  DATETIME
    ,[Quantity]  INT
    ,[Unit Price]  DECIMAL(18,3)
    ,[Tax Amount]  DECIMAL(18,3)
    ,[Total Including Tax]  DECIMAL(18,2)
    ,[Profit]  DECIMAL(18,2)
                 )); 

SET STATISTICS TIME OFF

PRINT ' '
PRINT 'STATISTICS WITH R RevoScaleR'
PRINT ' '

SET STATISTICS TIME ON

DECLARE @TSQL1 AS NVARCHAR(MAX)
SET @TSQL1 = N'SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory] AS SalesTerritory
        ,fs.[Customer Key] AS CustomerKey
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]'

DECLARE @RevoRScript AS NVARCHAR(MAX)
SET @RevoRScript = N'
                    library(RevoScaleR)
                    OutputDataSet <- rxXdfToDataFrame(rxDataStep(
inData = InputDataSet, outFile="Cust_data_filter.xdf", 
overwrite=TRUE, rowsPerRead = 100000, 
rowSelection =SalesTerritory != "Southeast" & CustomerKey != 0 & Profit <= 1000))'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RevoRScript
          ,@input_data_1 = @TSQL1
          
WITH RESULT SETS ((
     [City Key]  INT
    ,[City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,[Customer Key]  INT
    ,[Invoice Date Key]  DATETIME
    ,[Quantity]  INT
    ,[Unit Price]  DECIMAL(18,3)
    ,[Tax Amount]  DECIMAL(18,3)
    ,[Total Including Tax]  DECIMAL(18,2)
    ,[Profit]  DECIMAL(18,2)
                 )); 

SET STATISTICS TIME OFF

With printed output results.

[Screenshot: elapsed-time statistics for dplyr versus RevoScaleR filtering]

We can see from the printout that the dplyr package is in this case slightly ahead of the RevoScaleR package, by about 0.3 seconds in favour of dplyr. Going through the code, I am creating the *.xdf file on the fly and still get almost the same timing.

Aggregating Data

Now to test the results for aggregating data for all three scenarios.

PRINT ' '
PRINT 'STATISTICS WITH T-SQL'
PRINT ' '

-- SIMPLE T-SQL
SET STATISTICS TIME ON
SELECT 
         c.City
        ,c.[Sales Territory]
        ,SUM(fs.[Total Including Tax]) AS Total_turnover
        ,MAX(fs.[Total Including Tax]) AS Max_turnover
        ,MIN(fs.[Total Including Tax]) AS Min_turnover
        --,(fs.[Total Including Tax]) AS Median_turnover
        ,VAR(fs.[Total Including Tax]) AS Var_turnover
        ,STDEV(fs.[Total Including Tax]) AS stdev_turnover
        ,AVG(fs.[Total Including Tax]/fs.Profit) AS KPI_avg_Turnover2profit
        
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]
    WHERE
        [Sales Territory] <> 'Southeast'
    AND fs.[Customer Key] <> 0
    AND Profit <= 1000
    GROUP BY
         c.[Sales Territory]
        ,c.City

SET STATISTICS TIME OFF


PRINT ' '
PRINT 'STATISTICS WITH R dpylr'
PRINT ' '

SET STATISTICS TIME ON

-- Difference with T-SQL, I Have to pass all the values needed to filter out and 
-- aggregate data
DECLARE @TSQL1 AS NVARCHAR(MAX)
SET @TSQL1 = N'SELECT 
                     c.City
                    ,c.[Sales Territory] AS SalesTerritory
                    ,fs.[Customer Key] AS CustomerKey
                    ,fs.[Total Including Tax] AS TotalIncludingTax
                    ,fs.Profit
                FROM [Fact].[Sale] AS  fs
                    JOIN dimension.city AS c
                    ON c.[City Key] = fs.[City Key]'

DECLARE @RdplyrScript AS NVARCHAR(MAX)
SET @RdplyrScript = N'
                library(dplyr)
                OutputDataSet  <- InputDataSet %>% 
                                     mutate(Turnover2profit=TotalIncludingTax/Profit) %>%
                                    group_by(City, SalesTerritory)   %>%
                                    filter(SalesTerritory != "Southeast", CustomerKey != 0, Profit <= 1000) %>%
                                      summarise(    
                                         Total_turnover = sum(TotalIncludingTax, na.rm=TRUE)
                                         ,Max_turnover = max(TotalIncludingTax, na.rm=TRUE)
                                         ,Min_turnover = min(TotalIncludingTax, na.rm=TRUE)
                                         ,Median_turnover = median(TotalIncludingTax, na.rm=TRUE)
                                         ,Var_turnover = var(TotalIncludingTax, na.rm=TRUE)
                                         ,stdev_turnover = sd(TotalIncludingTax, na.rm=TRUE)
                                         ,KPI_avg_Turnover2profit = mean(Turnover2profit, na.rm=TRUE)    
                                      )'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RdplyrScript
          ,@input_data_1 = @TSQL1
          
WITH RESULT SETS ((
     [City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,Total_turnover DECIMAL(18,3)
    ,Max_turnover DECIMAL(18,3)
    ,Min_turnover DECIMAL(18,3)
    ,Median_turnover DECIMAL(18,3)
    ,Var_turnover DECIMAL(18,3)
    ,stdev_turnover DECIMAL(18,3)
    ,KPI_avg_Turnover2profit DECIMAL(18,3)
    )); 

SET STATISTICS TIME OFF


PRINT ' '
PRINT 'STATISTICS WITH R RevoScaleR'
PRINT ' '

SET STATISTICS TIME ON

DECLARE @TSQL2 AS NVARCHAR(MAX)
SET @TSQL2 = N'SELECT 
                     c.City
                    ,c.[Sales Territory] AS SalesTerritory
                    ,fs.[Customer Key] AS CustomerKey
                    ,fs.[Total Including Tax] AS TotalIncludingTax
                    ,fs.Profit
                FROM [Fact].[Sale] AS  fs
                    JOIN dimension.city AS c
                    ON c.[City Key] = fs.[City Key]'

DECLARE @RevoRScript AS NVARCHAR(MAX)
SET @RevoRScript = N'
                    library(RevoScaleR)
                    sales_rx1 <- rxDataStep(inData = InputDataSet, 
outFile="Cust_data_rx1.xdf", overwrite=TRUE, rowsPerRead = 100000, 
                                  rowSelection =SalesTerritory != "Southeast" &
 CustomerKey != 0 & Profit <= 1000)

                    sales_rx2 <- rxDataStep(sales_rx1, outFile="Cust_data_rx2.xdf",
                        transforms=list(
                            turnover2profit=TotalIncludingTax/Profit
                            ,City = City
                            ,SalesTerritory = SalesTerritory
                            ,TotalIncludingTax = TotalIncludingTax
                            ,Profit = Profit
                        ),
                        overwrite=TRUE, rowsPerRead = 100000)
                    
                        sales_rx4 <- rxSummary(TotalIncludingTax~SalesTerritory:City, 
data=sales_rx2,
                                               summaryStats=c("Mean", "StdDev",
 "Min", "Max", "Sum"))

                        sales_rx4_1 <- sales_rx4$categorical[[1]][c("SalesTerritory", 
"City", "Sum", "StdDev", "Min", "Max")]

                        sales_rx4P <- rxSummary(turnover2profit~SalesTerritory:City,
 data=sales_rx2,
                                               summaryStats=c("Mean"))

                        sales_rx4_2 <- sales_rx4P$categorical[[1]][c("SalesTerritory",
 "City", "Means")]

                        ## Merge data
                        sales_rx5 <- merge(sales_rx4_1, sales_rx4_2, 
by=c("SalesTerritory","City"), all=TRUE)
                        names(sales_rx5)[3] <- "Total_turnover"
                        names(sales_rx5)[4] <- "Max_turnover"
                        names(sales_rx5)[5] <- "Min_turnover"
                        names(sales_rx5)[6] <- "stdev_turnover"
                        names(sales_rx5)[7] <- "KPI_avg_Turnover2profit"
                                                                
                    OutputDataSet <- sales_rx5'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RevoRScript
          ,@input_data_1 = @TSQL2
          
WITH RESULT SETS ((
     [City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,Total_turnover DECIMAL(18,3)
    ,Max_turnover DECIMAL(18,3)
    ,Min_turnover DECIMAL(18,3)
    ,Median_turnover DECIMAL(18,3)
    ,Var_turnover DECIMAL(18,3)
    ,stdev_turnover DECIMAL(18,3)
    ,KPI_avg_Turnover2profit DECIMAL(18,3)
    )); 

SET STATISTICS TIME OFF

 

The results are clear. T-SQL is in this case faster, which is understandable. But comparing dplyr and RevoScaleR, RevoScaleR is faster at doing the data manipulation and returning the results to SSMS.

[Screenshot: elapsed-time statistics for the aggregation test]

The only problem I have is the amount of code needed to produce a simple summary statistic. Even though RevoScaleR is faster than dplyr (and plyr), the code is much harder to read and write. Chaining / piping is not possible, meaning that you have to store the intermediate results in xdf files.


The RevoScaleR package did the job, but consider doing data manipulation in T-SQL or in a dedicated data manipulation package, unless you are fine with all the merging, renaming and outputting.

I deliberately added some extra statistics to the output, and you can see that the median is supported only in the dplyr package, whereas in T-SQL or RevoScaleR you have to compute it manually. I also miss a variance metric in RevoScaleR, and the mode (modus) in all three: T-SQL, dplyr and RevoScaleR.
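Regarding the missing mode: base R has no built-in function for the statistical mode either, but a small helper like the one below can be used on the R side. This is just a sketch and not part of the original comparison:

# most frequent value in a vector
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
stat_mode(c(1, 2, 2, 3, 2, 5))   # returns 2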

In addition, I have also used dplyrXdf, a dplyr back-end for *.xdf data frames that enables the user to do chaining / piping with the RevoScaleR external data format. You can install the package in RTVS or RStudio using:

devtools::install_github("RevolutionAnalytics/dplyrXdf")
SET STATISTICS TIME ON

-- Difference with T-SQL, I Have to pass all the values needed to filter out 
-- and aggregate data
DECLARE @TSQL2 AS NVARCHAR(MAX)
SET @TSQL2 = N'SELECT 
                     c.City
                    ,c.[Sales Territory] AS SalesTerritory
                    ,fs.[Customer Key] AS CustomerKey
                    ,fs.[Total Including Tax] AS TotalIncludingTax
                    ,fs.Profit
                FROM [Fact].[Sale] AS  fs
                    JOIN dimension.city AS c
                    ON c.[City Key] = fs.[City Key]'

DECLARE @RevoRScript AS NVARCHAR(MAX)
SET @RevoRScript = N'
                    library(RevoScaleR)
                    library(dplyr)
                    library(dplyrXdf)
                    custXdf <- rxDataFrameToXdf(InputDataSet, 
                               outFile="Cust_data_aggregate.xdf", overwrite=TRUE)
                    custagg <- custXdf %>%
                      mutate(Turnover2profit=TotalIncludingTax/Profit) %>%
                      group_by(City, SalesTerritory)   %>%
                      filter(SalesTerritory != "Southeast", CustomerKey != 0, 
                                 Profit <= 1000) %>%
                      summarise(
                        Total_turnover = sum(TotalIncludingTax, na.rm=TRUE)
                        ,Max_turnover = max(TotalIncludingTax, na.rm=TRUE)
                        ,Min_turnover = min(TotalIncludingTax, na.rm=TRUE)
                        ,Median_turnover = median(TotalIncludingTax, na.rm=TRUE)
                        ,Var_turnover = var(TotalIncludingTax, na.rm=TRUE)
                        ,stdev_turnover = sd(TotalIncludingTax, na.rm=TRUE)
                        ,KPI_avg_Turnover2profit = mean(Turnover2profit, na.rm=TRUE)
                                )
                OutputDataSet <- data.frame(custagg)'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RevoRScript
          ,@input_data_1 = @TSQL2
          
WITH RESULT SETS ((
     [City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,Total_turnover DECIMAL(18,3)
    ,Max_turnover DECIMAL(18,3)
    ,Min_turnover DECIMAL(18,3)
    ,Median_turnover DECIMAL(18,3)
    ,Var_turnover DECIMAL(18,3)
    ,stdev_turnover DECIMAL(18,3)
    ,KPI_avg_Turnover2profit DECIMAL(18,3)
    )); 

SET STATISTICS TIME OFF

The problem with dplyrXdf is performance. The code is very easy to read, but the execution time is just absurd. I had to run it a couple of times and usually got around 1 minute of elapsed time, which is beyond my patience zone. I tried a different compute context, and it had no effect.
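For reference, this is roughly how a compute context is switched in RevoScaleR; it is only a sketch, and in my runs it made no difference for dplyrXdf:

library(RevoScaleR)
rxGetComputeContext()                  # show the current compute context
rxSetComputeContext(RxLocalSeq())      # sequential local execution
rxSetComputeContext(RxLocalParallel()) # parallel local execution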

My conclusion: if you can, do all the data manipulation prior to pushing the data from T-SQL into R code, because dplyr and data.table support neither parallel computation nor out-of-memory data, while on the other hand RevoScaleR functions were not meant to do data manipulation at this scale but are very good for summary statistics and overviews.

As always, code is available at GitHub.

Happy R-TSQLing.

 

FileTable and storing graphs from Microsoft R Server

FileTable has been around for quite some time now, and it is useful for storing files, documents, pictures and binary files in a designated SQL Server table – a FileTable. The best part of FileTable is the fact that one can access the files from Windows or any other application as if they were stored on the file system (because they are), without making any changes on the client.

This feature is absolutely handy for storing outputs from Microsoft R Server. In this blog post I will focus mainly on persistently storing charts from statistical analysis.

First we need to make sure that FILESTREAM is enabled. Open SQL Server Configuration Manager, navigate to your running SQL Server instance, right-click, select FILESTREAM and enable it for T-SQL access and I/O access. In addition, allow remote clients access to FILESTREAM data as well.

[Screenshot: enabling FILESTREAM in SQL Server Configuration Manager]

The next step is to enable the configuration in Management Studio.

EXEC sp_configure 'filestream_access_level' , 2;
GO
RECONFIGURE;
GO

For this purpose I have decided to use a dedicated database for storing charts created in R, with FileTable enabled.

USE master;
GO

CREATE DATABASE FileTableRChart 
ON PRIMARY  (NAME = N'FileTableRChart', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\FileTableRChart.mdf' , SIZE = 8192KB , FILEGROWTH = 65536KB ),
FILEGROUP FileStreamGroup1 CONTAINS FILESTREAM( NAME = ChartsFG, FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\RCharts')
LOG ON (NAME = N'FileTableRChart_log', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\FileTableRChart_log.ldf' , SIZE = 8192KB , FILEGROWTH = 65536KB )
GO

ALTER DATABASE FileTableRChart
    SET FILESTREAM ( NON_TRANSACTED_ACCESS = FULL, DIRECTORY_NAME = N'RCharts' )

So the folder RCharts will be available as BLOB storage for my FileTableRChart SQL Server database. Next, I add a FileTable to hold all the needed information on my charts.

USE FileTableRChart;
GO

CREATE TABLE ChartsR AS FILETABLE
WITH (
 FileTable_Directory = 'DocumentTable'
,FileTable_Collate_Filename = database_default  
);
GO

With the BLOB storage set up, we can now focus on the R code within T-SQL. The following R code will be used to generate histograms with a normal curve for a quick data overview (note, this is just a sample):

library(ggplot2)

x <- data.frame(val = c(1,2,3,6,3,2,3,4,5,6,7,7,6,6,6,5,5,4,8))
y <- data.frame(val = c(1,2,5,8,5,4,2,4,5,6,3,2,3,5,5,6,7,7,8))
x$class <- 'XX'
y$class <- 'YY'
d <- rbind(x,y)

#normal function with counts
gghist <- ggplot(d, aes(x=val)) + geom_histogram(binwidth=2, 
                  aes(y=..density.., fill=..count..))
gghist <- gghist + stat_function(fun=dnorm, args=list(mean=mean(d$val), 
                   sd=sd(d$val)), colour="red")
gghist <- gghist + ggtitle("Histogram of val with normal curve")  + 
                   xlab("Variable Val") + ylab("Density of Val")

This returns a diagram that will be further parametrized when inserted into the T-SQL code.

[Figure: histogram of val with normal curve]

Besides the parametrization, I will add a loop that goes through all the input variables and generates a diagram for each variable/column of the SQL Server query passed to the sp_execute_external_script stored procedure.

Final code:

DECLARE @SQLStat NVARCHAR(4000)
SET @SQLStat = 'SELECT
                     fs.[Sale Key] AS SalesID
                    ,c.[City] AS City
                    ,c.[State Province] AS StateProvince
                    ,c.[Sales Territory] AS SalesTerritory
                    ,fs.[Customer Key] AS CustomerKey
                    ,fs.[Stock Item Key] AS StockItem
                    ,fs.[Quantity] AS Quantity
                    ,fs.[Total Including Tax] AS Total
                    ,fs.[Profit] AS Profit

                    FROM [Fact].[Sale] AS  fs
                    JOIN dimension.city AS c
                    ON c.[City Key] = fs.[City Key]
                    WHERE
                        fs.[customer key] <> 0'

DECLARE @RStat NVARCHAR(4000)
SET @RStat = 'library(ggplot2)
              library(stringr)
              #library(jpeg)
              cust_data <- Sales
              n <- ncol(cust_data)
              for (i in 1:n) 
                        {
                          path <- 
''\\\\SICN-KASTRUN\\mssqlserver\\RCharts\\DocumentTable\\Plot_''
                          colid   <- data.frame(val=(cust_data)[i])
                          colname <- names(cust_data)[i]
                          #print(colname)
                          #print(colid)
                          gghist <- ggplot(colid, aes(x=val)) + 
geom_histogram(binwidth=2, aes(y=..density.., fill=..count..))
                          gghist <- gghist + stat_function(fun=dnorm, 
args=list(mean=mean(colid$val), sd=sd(colid$val)), colour="red")
                          gghist <- gghist + ggtitle("Histogram of val with 
normal curve")  + xlab("Variable Val") + ylab("Density of Val")
                          path <- paste(path,colname,''.jpg'')
                          path <- str_replace_all(path," ","")
                          #jpeg(file=path)
                          ggsave(path, width = 4, height = 4)
                          plot(gghist)
                          dev.off()
                        }';

EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RStat
    ,@input_data_1 = @SQLStat
    ,@input_data_1_name = N'Sales'

I am using the ggsave function, but the jpeg() graphics device is also an option; a matter of flavour. The variable path should point to your local FileTable directory.
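For completeness, a sketch of the device-based alternative, assuming the same path and gghist objects as in the script above:

# open a jpeg device, draw the ggplot, then close the device
jpeg(filename = path, width = 400, height = 400)
print(gghist)
dev.off()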

Now I have graphs and charts stored persistently in the FileTable, and I can retrieve information on the files with a simple query:

SELECT FT.Name
,IIF(FT.is_directory=1,'Directory','Files') [File Category]
,FT.file_type [File Type]
,(FT.cached_file_size)/1024.0 [File Size (KB)]
,FT.creation_time [Created Time]
,FT.file_stream.GetFileNamespacePath(1,0) [File Path]
,ISNULL(PT.file_stream.GetFileNamespacePath(1,0),'Root Directory') [Parent Path]
FROM [dbo].[ChartsR] FT
LEFT JOIN [dbo].[ChartsR] PT
ON FT.path_locator.GetAncestor(1) = PT.path_locator

[Screenshot: file information returned from the ChartsR FileTable]

Browsing through the charts is now much easier, for multiple purposes.

[Screenshot: generated charts in the FileTable directory]

There might be some security issues: I have used mklink to create a logical drive pointing to the FileTable directory.

[Screenshot: mklink command in the command prompt]

You might also want to use the Local Group Policy Editor to grant the MSSQLLaunchpad account write permissions to the FileTable directory.

[Screenshot: Local Group Policy Editor]

Code is available at GitHub.

Happy R-SQLing!

 

Size of XDF files using RevoScaleR package

It came to my attention that the size of an XDF (external data frame) file can change drastically based on the compute context and environment. While testing the output of a dataset in SQL Server Management Studio, I was simultaneously testing R code in RTVS and RStudio, and I noticed the file growing.

The following stored procedure will do a simple test:

CREATE PROCEDURE rxImport_Test ( @rowsPerRead INT )
AS
BEGIN
    DECLARE @RStat NVARCHAR(4000)
    SET @RStat = 'library(RevoScaleR)
                  #rxSetComputeContext("RxLocalSeq")
                  ptm <- proc.time()
                  inFile <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.csv")
                  filename <- "AirlineDemoSmall_'+CAST(@rowsPerRead AS VARCHAR(100))+'_TSQL_NC.xdf"
                  rxTextToXdf(inFile = inFile, outFile = filename,  stringsAsFactors = T, rowsPerRead = '+CAST(@rowsPerRead AS VARCHAR(100))+', overwrite=TRUE)
                  outFile <- file.path(rxGetOption("sampleDataDir"), filename)
                  rxImport(inData = inFile, outFile = outFile, overwrite=TRUE)
                  d <- proc.time() - ptm
                  filesize <- data.frame(file.size(filename))    
                  time     <- data.frame(d[3])
                  RowsPerRead <- data.frame('+CAST(@rowsPerRead AS VARCHAR(100))+')
                  filename_xdf <- data.frame(filename)
                  ran <- data.frame(Sys.time())
                  OutputDataSet <- cbind(as.character(filesize), time, RowsPerRead, filename_xdf, ran)';
    EXECUTE sp_execute_external_script
          @language = N'R'
         ,@script = @RStat
    WITH RESULT SETS (( 
                         Filesize NVARCHAR(100)
                        ,Time_df NVARCHAR(100)
                        ,RowsPerRead NVARCHAR(100)
                        ,filename_xdf NVARCHAR(100)
                        ,DateExecute NVARCHAR(100)
                        ))
END

But let's first check the compute context and the path to the data.

-- Make sure your path location is pointing to RevoscaleR library folder!
EXECUTE sp_execute_external_script
      @language = N'R'
     ,@script = N'library(RevoScaleR) 
                OutputDataSet <- data.frame(rxGetOption("sampleDataDir"))'
WITH RESULT SETS (( 
                    path_folder NVARCHAR(1000)
                    ))

-- check for ComputeContext
DECLARE @RStat NVARCHAR(4000)
SET @RStat = 'library(RevoScaleR)
             cc <- rxGetOption("computeContext")
             OutputDataSet <- data.frame(cc@description)';
EXECUTE sp_execute_external_script
      @language = N'R'
     ,@script = @RStat
WITH RESULT SETS ((compute_context NVARCHAR(100)))

On my computer, this looks like the following:

[Screenshot: sample data path and compute context]

Now we will run the procedure rxImport_Test with different chunk sizes (this is what I am testing) and observe the execution times.

INSERT INTO rxImport_results
EXEC rxImport_Test @rowsPerRead = 2;
GO

INSERT INTO rxImport_results
EXEC rxImport_Test @rowsPerRead = 20;
GO

INSERT INTO rxImport_results
EXEC rxImport_Test @rowsPerRead = 200;
GO

INSERT INTO rxImport_results
EXEC rxImport_Test @rowsPerRead = 2000;
GO

Running the procedure with different chunk sizes yields interesting results:

[Screenshot: execution results for different chunk sizes]

Now, let’s see the summary information on this file / dataset.

 

[Screenshot: summary table of file size, execution time and number of blocks per chunk size]

Considering that the original file holds 600,000 rows, the number of blocks changes for each of the files, and with it the size of the file grows.
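Since the number of blocks is roughly the total row count divided by rowsPerRead, a quick back-of-the-envelope sketch in R illustrates why tiny chunk sizes explode the block count (and with it the per-block overhead in the file):

rows_total  <- 600000
rowsPerRead <- c(2, 20, 200, 2000)
data.frame(rowsPerRead, n_blocks = ceiling(rows_total / rowsPerRead))
#   rowsPerRead n_blocks
# 1           2   300000
# 2          20    30000
# 3         200     3000
# 4        2000      300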

Retrieving information on block size

DECLARE @RStat NVARCHAR(4000)
SET @RStat = 'library(RevoScaleR)    
              info <- rxGetInfoXdf(data="AirlineDemoSmall_20000000_TSQL_NC.xdf", getVarInfo = TRUE)    
              OutputDataSet <- data.frame(info$numBlocks)';

EXECUTE sp_execute_external_script
      @language = N'R'
     ,@script = @RStat
WITH RESULT SETS (( 
                    nof_blocks NVARCHAR(100)))

one can see the change between the files and where the optimal block size lies. In my test, 3 to 30 blocks gave the best performance when creating the XDF file, which means roughly 2,000 up to 200,000 rows per block yields the best results. Beyond that I have not found a golden rule for the block size, so take caution, especially when dealing with larger files.

I ran the test a couple of times to check the consistency of the results, and they hold up. As for the file size: what is shown is the internal representation of the file; as an external structure, the *.xdf size should not differ much as the block size changes, but the performance certainly does!

Code is available at Github.

Happy R-sqling!

Resource governor and external resource pool for Microsoft R Services

Setting up an external resource pool for the execution of R commands with sp_execute_external_script has proven extremely useful, especially when other workloads are present and you do not want your data analysis to take too many resources away from them (above all in a production environment), or when you know that your analysis will need maximum CPU and memory for a specific period. In such cases, defining an external resource pool is not only useful but highly recommended.

Resource Governor is a feature that enables you to manage SQL Server workloads and put limits on their system resource consumption. Limits can be configured for any workload in terms of CPU, memory and I/O consumption. Where you have many different workloads on the same SQL Server, Resource Governor helps allocate the requested resources.

By default, two pools are defined – default and internal – for system resources, and only a default pool for external resources.

[Screenshot: default resource pools in SSMS]

Resource Governor works with resource pools and workload groups, which group similar loads under the same classification criteria. When defining an external resource pool, we define the CPU and/or memory limits for external resources – and R Services (via the external procedure sp_execute_external_script) falls under external resources. According to MSDN, the external resource pool for R Services governs "rterm.exe, BxlServer.exe, and other processes spawned by them" (quoted: link).

First of all, you need to have resource governor enabled:

-- Enable Resource Governor
ALTER RESOURCE GOVERNOR RECONFIGURE;  
GO

Once you have this, let us set up a test query so we can measure and see how Resource Governor behaves.

I will be using the RevoScaleR sample data that comes with every R Services installation of SQL Server 2016. This sample data should be available in the following location:

C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\R_SERVICES\library\RevoScaleR\SampleData

[Screenshot: SampleData folder]

I will be using AirlineDemoSmall.csv – a CSV dataset of roughly 14 MB. Please note that the path to the SampleData folder might differ, based on your SQL Server 2016 installation folder.

In your database create and import data to your SQL Server table:

CREATE TABLE AirlineDemoSmall(
     ArrDelay varchar(100) NOT NULL
    ,CRSDepTime float NOT NULL
    ,[DayOfWeek] varchar(12) NOT NULL  
)
GO

-- this file should be at your location! so no need to download it
BULK INSERT AirlineDemoSmall
FROM 'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\
R_SERVICES\library\RevoScaleR\SampleData\AirlineDemoSmall.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    FIRSTROW = 2 -- Skip header
)

*Please note: both statements (CREATE TABLE and BULK INSERT) are part of the RevoScaleR ImportAirlineDB.sql file; I am just reusing a section of that code. The whole file will be added to GitHub.

Once you have the data imported, we will run a simple linear regression using the RevoScaleR library.

EXECUTE  sp_execute_external_script
                 @language = N'R'
                ,@script = N'
            library(RevoScaleR)
            f <- formula(as.numeric(ArrDelay) ~ as.numeric(DayOfWeek) + CRSDepTime)
            s <- system.time(mod <- rxLinMod(formula = f, data = AirLine))
            OutputDataSet <-  data.frame(system_time = s[3]);'
                ,@input_data_1 = N'SELECT * FROM AirlineDemoSmall'
                ,@input_data_1_name = N'AirLine'
-- WITH RESULT SETS UNDEFINED
WITH RESULT SETS 
            ((
                 Elapsed_time FLOAT
            ));

 

So I am only returning the R system.time measurement, and from it only the total elapsed time (hence the third element of the vector returned by system.time). The usual representation of this function's output is: user time, system time and elapsed time.

[Screenshot: system.time output in RStudio]
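As a quick illustration of why the third element is used (plain R, independent of SQL Server):

s <- system.time(Sys.sleep(1))
print(s)      # user time, system time, elapsed time
print(s[3])   # elapsed time only - the value returned to SQL Server in this post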

I will take only the elapsed time as the measure for comparing resource pool enabled versus disabled. Results from our control environment (before setting up the resource pool):

[Screenshot: elapsed time before configuring the external resource pool]

The computation time is 0.058 seconds, whereas the elapsed time – the value returned from our sp_execute_external_script – is 0.63 seconds.

[Screenshot: elapsed time returned by sp_execute_external_script]

Now we will create a new pool and limit the CPU and memory available for such workloads:

-- Default value
ALTER EXTERNAL RESOURCE POOL [default] 
WITH (AFFINITY CPU = AUTO)
GO

CREATE EXTERNAL RESOURCE POOL RService_Resource_Pool  
WITH (  
     MAX_CPU_PERCENT = 10  
    ,MAX_MEMORY_PERCENT = 5
);  

ALTER RESOURCE POOL [default] WITH (max_memory_percent = 60, max_cpu_percent=90);  
ALTER EXTERNAL RESOURCE POOL [default] WITH (max_memory_percent = 40, max_cpu_percent=10);  
ALTER RESOURCE GOVERNOR reconfigure;

 

The new pool – called RService_Resource_Pool – is created with the new values set.

[Screenshot: RService_Resource_Pool in SSMS]

And obligatory reconfiguration:

ALTER RESOURCE GOVERNOR RECONFIGURE;  
GO

This configuration can also be checked through the system catalog view:

-- Check configuration
SELECT * FROM sys.resource_governor_external_resource_pools

[Screenshot: sys.resource_governor_external_resource_pools output]

Please note that only 20 workers are dedicated to be used for R external processes.


With each run of RevoScaleR library functions, the DMV statistics on external script execution are incremented. Note that only RevoScaleR functions are counted here; other functions (to my knowledge) are not counted (as of SQL Server 2016).

SELECT * FROM sys.dm_external_script_execution_stats

[Screenshot: sys.dm_external_script_execution_stats output]

Once you have this, we need to set the Classification function as well:

CREATE WORKLOAD GROUP R_workgroup WITH (importance = medium) USING "default", 
EXTERNAL "RService_Resource_Pool";  

ALTER RESOURCE GOVERNOR WITH (classifier_function = NULL);  
ALTER RESOURCE GOVERNOR reconfigure;  

USE master  
GO  
CREATE FUNCTION RG_Class_function()  
RETURNS sysname  
WITH schemabinding  
AS  
BEGIN  
    IF program_name() in ('Microsoft R Host', 'RStudio') RETURN 'R_workgroup';  
    RETURN 'default'  
    END;  
GO  

ALTER RESOURCE GOVERNOR WITH  (classifier_function = dbo.RG_Class_function);  
ALTER RESOURCE GOVERNOR reconfigure;  
go

 

On MSDN you will find a post on setting up Resource Governor for R Services, which notes that, for security reasons, "direct management of the Launchpad service by using Resource Governor is not supported." In practice this means you can also create a login with one of the twenty R user names (MSSQLSERVER01 … MSSQLSERVER20) in the SQLRUserGroup group in order for Resource Governor to work.

So, quite some setup is needed at this point. I will be using the user MSSQLSERVER01 and create all the necessary pieces: logins, permissions, etc. After setting up the MSSQLSERVER01 Windows user, I use the EXECUTE AS LOGIN command prior to running sp_execute_external_script.

Running the command:

SELECT * FROM sys.resource_governor_workload_groups;

you will now  see all the workload groups.

And now, finally we can rerun the same procedure:

-- We will run same query

EXECUTE  sp_execute_external_script
                 @language = N'R'
                ,@script = N'
   library(RevoScaleR)
   f <- formula(as.numeric(ArrDelay) ~ as.numeric(DayOfWeek) + CRSDepTime)
    s <- system.time(mod <- rxLinMod(formula = f, data = AirLine))
                            OutputDataSet <-  data.frame(system_time = s[3]);'
                ,@input_data_1 = N'SELECT * FROM AirlineDemoSmall'
                ,@input_data_1_name = N'AirLine'
-- WITH RESULT SETS UNDEFINED
WITH RESULT SETS 
            ((
                 Elapsed_time FLOAT
            ));

 

And CPU consumption will stay within the limits:

[Screenshot: Task Manager showing limited CPU consumption]

and the elapsed time will be higher due to lower CPU available.

[Screenshot: elapsed time with the external resource pool in place]

Happy R-Tsqling! 🙂

Code is available at GitHub.

 

NOTE!

CPU consumption will rise after the execution of the external script command, because the worker MSSQLSERVER01 cleans up the session and log files!

[Screenshot: MSSQLSERVER01 process activity]

Since the log and session files are handled by the Launchpad service, Resource Governor will not be able to limit the CPU, memory or I/O consumption of this cleanup.

 

Handling required and missing R packages in Microsoft R Services

I have seen it several times: execution of R code with the procedure sp_execute_external_script fails due to a missing library or missing library dependencies.

The problem is, in general, not solved out of the box. But it can be solved by maintaining a list of the installed libraries used by Microsoft R Services, or by simply creating a "pre-code" R check that does it for you.

In both cases the user ends up with additional code, but it can be a good check when you are installing a library in production and want to run such a verification prior to any relevant R code.
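For the first approach, the list of installed packages can be pulled straight from the instance's R library. A minimal sketch of the @script body (the surrounding sp_execute_external_script call and its WITH RESULT SETS definition are omitted here):

# return package name and version for everything in the instance library
ip <- installed.packages()[, c("Package", "Version")]
OutputDataSet <- data.frame(ip, row.names = NULL, stringsAsFactors = FALSE)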

Let’s start with simple R code:

USE WideWorldImporters;
GO

EXECUTE sp_execute_external_script    
           @language = N'R'
          ,@script=N'library(Hmisc) 
                    df <- data.frame(rcorr(as.matrix(sp_RStats_query), 
                            type="pearson")$P)
                    OutputDataSet<-df'
          ,@input_data_1 = N'SELECT 
                     SupplierID
                    ,UnitPackageID
                    ,OuterPackageID
                FROM [Warehouse].[StockItems]'
          ,@input_data_1_name = N'sp_RStats_query'
    WITH RESULT SETS ((SupplierID NVARCHAR(200)
                    ,UnitPackageID NVARCHAR(200)
                    ,OuterPackageID NVARCHAR(200)));

 

In my case this code returns an error message saying that R Services is missing a specific library needed to execute the R code – here, the library Hmisc.

[Screenshot: error message about the missing Hmisc package]

The first step to solve this is to declare a variable for the R script and pass it to the @script parameter of sp_execute_external_script.

DECLARE @OutScript NVARCHAR(MAX) 
SET @OutScript =N'library(Hmisc) 
                    df <- data.frame(rcorr(as.matrix(sp_RStats_query), 
                     type="pearson")$P)
                    OutputDataSet<-df'

EXECUTE sp_execute_external_script    
           @language = N'R'
          ,@script= @OutScript
          ,@input_data_1 = N'SELECT 
                     SupplierID
                    ,UnitPackageID
                    ,OuterPackageID
                FROM [Warehouse].[StockItems]'
          ,@input_data_1_name = N'sp_RStats_query'
    WITH RESULT SETS ((
                         SupplierID    NVARCHAR(200)
                        ,UnitPackageID NVARCHAR(200)
                        ,OuterPackageID NVARCHAR(200)
                    ));

Now we need to do a string search for the following patterns:

  • library (c(package1,package2))
  • library (package1),  library(package2)
  • library(package1)

All of these R calls indicate that a particular library (or libraries) needs to be installed.
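As a side note, the same pattern matching could also be sketched in R itself with a regular expression (purely illustrative; the sample r_code string is made up):

r_code <- 'library(Hmisc)
           library(test123)
           df <- data.frame(val = 1:10)'
# pull out every library(...) call and strip the wrapper
libs <- regmatches(r_code, gregexpr("library\\(([^)]+)\\)", r_code))[[1]]
libs <- gsub("library\\(|\\)", "", libs)
libs
# [1] "Hmisc"   "test123"

The T-SQL implementation of this search, which builds the install.packages() statements and runs them through xp_cmdshell, follows: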

DECLARE @OutScript NVARCHAR(MAX) 
SET @OutScript =N'library(Hmisc)
                  library(test123) 
                    df <- data.frame(rcorr(as.matrix(sp_RStats_query), 
                      type="pearson")$P)
                    OutputDataSet<-df'

/*
***************************************************************
START: CODE FOR CHECKING AND INSTALLING MISSING R LIBRARIES
***************************************************************
*/


DECLARE @Tally TABLE (num TINYINT,R_Code NVARCHAR(MAX))
INSERT INTO @Tally VALUES (1,@OutScript)
DECLARE @libstatement NVARCHAR(MAX)
DECLARE @cmdstatement NVARCHAR(MAX)

;WITH CTE_R(num,R_Code, libname)
AS
(
SELECT
      1 AS num,
      RIGHT(R_Code, LEN(R_Code) - CHARINDEX(')', R_Code, 0)) AS  R_Code, 
      substring(R_Code, CHARINDEX('library(', R_Code, 0) + 0, CHARINDEX(')',
 R_Code, 0) - CHARINDEX('library(', R_Code, 0) + 1) AS libname
FROM @Tally
WHERE  
        CHARINDEX('(', R_Code, 0) > 0 
    AND CHARINDEX('library(',R_Code,0) > 0

UNION ALL

SELECT
     1 AS num,
     RIGHT(R_Code, LEN(R_Code) - CHARINDEX(')', R_Code, 0)) AS  R_Code,
     substring(R_Code, CHARINDEX('library(', R_Code, 0) + 0, CHARINDEX(')', 
R_Code, 0) - CHARINDEX('library(', R_Code, 0) + 1) AS libname
FROM CTE_R
WHERE 
    CHARINDEX('(', R_Code, 0) > 0 
AND CHARINDEX('library(',R_Code,0) > 0

)
, fin AS
(
SELECT TOP 1 stuff((SELECT ' install.packages(''''' + 
REPLACE(REPLACE(REPLACE(c1.libname,'library',''),')',''),'(','') + '''''
           , dependencies = T)'
              FROM CTE_R AS c1 
              WHERE 
                    c1.num = c2.num
              FOR XML PATH (''),TYPE).value('.','NVARCHAR(MAX)'),1,1,'') AS lib_stat
              FROM CTE_R AS c2
)
SELECT 
        @libstatement = lib_stat 
FROM fin

SET @cmdstatement = 'EXEC xp_cmdshell ''"C:\Program Files\Microsoft SQL Server\
MSSQL13.MSSQLSERVER\R_SERVICES\bin\R.EXE" cmd -e ' + @libstatement + ''''
EXEC SP_EXECUTESQL @cmdstatement

/*
***************************************************************
END: CODE FOR CHECKING AND INSTALLING MISSING R LIBRARIES
***************************************************************
*/


EXECUTE sp_execute_external_script    
           @language = N'R'
          ,@script= @OutScript
          ,@input_data_1 = N'SELECT 
                     SupplierID
                    ,UnitPackageID
                    ,OuterPackageID
                FROM [Warehouse].[StockItems]'
          ,@input_data_1_name = N'sp_RStats_query'
    WITH RESULT SETS ((
                         SupplierID    NVARCHAR(200)
                        ,UnitPackageID NVARCHAR(200)
                        ,OuterPackageID NVARCHAR(200)
                    ));

 

The result in this case is successful, with correct R output, and sp_execute_external_script no longer returns an error about missing libraries.

I added a "fake" library called test123 to test whether all the libraries get installed.

In the end the script generates this xp_cmdshell command (shown here wrapped over several lines):

EXEC xp_cmdshell '"C:\Program Files\Microsoft SQL Server\
MSSQL13.MSSQLSERVER\R_SERVICES\bin\R.EXE" cmd 
-e install.packages(''Hmisc'') install.packages(''test123'')'

You might also experience the following error in the output of xp_cmdshell command:

[Screenshot: error in the xp_cmdshell output]

In this case, go to the following location and enable write permission for this folder and subfolders.

[Screenshot: folder location requiring write permissions]

Finally, there is always a more elegant and easier way: write R code that checks the installation itself (as well as the dependencies of the library), as in the following R code:

if(!is.element("Hmisc", installed.packages()))
{install.packages("Hmisc", dependencies = T)
}else{library("Hmisc")}

So the original code can simply be changed to:

USE WideWorldImporters;
GO

EXECUTE sp_execute_external_script    
           @language = N'R'
          ,@script=N'if(!is.element("Hmisc", installed.packages()))
                      {install.packages("Hmisc", dependencies = T)}
                     library("Hmisc")
                    df <- data.frame(rcorr(as.matrix(sp_RStats_query), 
 type="pearson")$P)
                    OutputDataSet<-df'
          ,@input_data_1 = N'SELECT 
                     SupplierID
                    ,UnitPackageID
                    ,OuterPackageID
                FROM [Warehouse].[StockItems]'
          ,@input_data_1_name = N'sp_RStats_query'
    WITH RESULT SETS ((SupplierID NVARCHAR(200)
                    ,UnitPackageID NVARCHAR(200)
                    ,OuterPackageID NVARCHAR(200)));

But unfortunately, one cannot always count on the developer, data scientist or author of the code being consistent enough to add such a simple check for library installation.

Code is available at GitHub.

Happy R-TSQLing!