Performance differences between RevoScaleR, ColumnStore Table and In-Memory OLTP Table

The focus of this blog post is a comparison between running RevoScaleR computational functions against *.XDF files and having the same dataset available in a columnstore table or an in-memory OLTP table.

For this test, I will use the AirLines dataset, available here. I have deliberately picked a 200 MB sample (out of the 13 GB dataset) in order to properly test the differences and see which approach works best.

After unzipping the file, I will use the following T-SQL query to import the file into SQL Server.

With this example, you can import an XDF file directly into a SQL Server table (note that I first transform the CSV file into XDF and then import the XDF file into the SQL table):

-- must have write permissions on folder: C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                rxOptions(sampleDataDir = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData")
                inFile <- file.path(rxGetOption("sampleDataDir"), "airsample.csv")
                of <-  rxDataStep(inData = inFile, outFile = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData/airline20170428_2.xdf", 
                             transformVars = c("ArrDelay", "CRSDepTime","DayOfWeek")
                            ,transforms = list(ArrDelay = as.integer(ArrDelay), CRSDepTime = as.numeric(CRSDepTime), DayOfWeek = as.character(DayOfWeek))
                            ,overwrite = TRUE
                            ,maxRowsByCols = 10000000
                            ,rowsPerRead = 200000)
                OutputDataSet <- rxXdfToDataFrame(of)'

DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT 1 AS N'

EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((ArrDelay INT
                    ,CRSDepTime DECIMAL(6,4)
                    ,DofWeek NVARCHAR(20)))
GO

 

So the whole process consists of creating a table, wrapping the sp_execute_external_script call above into a procedure, and inserting the results of the external procedure into the table.

--Complete process
CREATE TABLE AirFlights_small 
(id INT IDENTITY(1,1)
,ArrDelay INT
,CRSDepTime DECIMAL(6,4)
,DofWeek NVARCHAR(20) 
);
GO

CREATE Procedure ImportXDFtoSQLTable
AS
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                rxOptions(sampleDataDir = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData")
                inFile <- file.path(rxGetOption("sampleDataDir"), "airsample.csv")
                of <-  rxDataStep(inData = inFile, outFile = "airline20170428_2.xdf", 
                transformVars = c("ArrDelay", "CRSDepTime","DayOfWeek")
            ,transforms = list(ArrDelay = as.integer(ArrDelay), CRSDepTime = as.numeric(CRSDepTime), DayOfWeek = as.character(DayOfWeek))
            ,overwrite = TRUE
            ,maxRowsByCols = 10000000)
             OutputDataSet <- data.frame(rxReadXdf(file=of, varsToKeep=c("ArrDelay", "CRSDepTime","DayOfWeek")))'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT 1 AS N'
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((ArrDelay INT,CRSDepTime DECIMAL(6,4),DofWeek NVARCHAR(20)));
GO

INSERT INTO AirFlights_small
EXECUTE ImportXDFtoSQLTable;
GO

 

There you go. The data is now in a SQL Server table and we can start with the comparisons. I will be measuring the time needed to get the average arrival delay per day of the week.


RevoScaleR

Using the RevoScaleR package, I will call the rxCrossTabs function, with the transforms argument converting the day of the week into a factor:

#importing data
outFile2 <- rxDataStep(inData = inFile, outFile = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData/airline20170428_2.xdf", 
            transformVars = c("ArrDelay", "CRSDepTime","DayOfWeek")
           ,transforms = list(ArrDelay = as.integer(ArrDelay), CRSDepTime = as.numeric(CRSDepTime), DayOfWeek = as.character(DayOfWeek))
           ,overwrite = TRUE
           ,maxRowsByCols = 10000000)

of2 <- data.frame(rxReadXdf(file=outFile2, varsToKeep=c("ArrDelay", "CRSDepTime","DayOfWeek")))

# "transforms" refers to the list of variable transformations (e.g. converting
# DayOfWeek to a factor) defined earlier in the session
summary(rxCrossTabs(ArrDelay~DayOfWeek
                    ,data = of2  #outFile2
                    ,transforms = transforms
                    ,blocksPerRead=300000), output="means")

Now get those times:

# Getting times
system.time({ 
  summary(rxCrossTabs(ArrDelay~DayOfWeek
                      ,data = of2
                      ,transforms = transforms
                      ,blocksPerRead=300000), output="means")
  })

The results: roughly 7.9 seconds elapsed time and 3.8 seconds computation time.

Rows Read: 8400013, Total Rows Processed: 8400013, Total Chunk Time: 3.825 seconds 
Computation time: 3.839 seconds.
   user  system elapsed 
   2.89    0.37    7.89 

 

T-SQL query without any specifics

To have a baseline, let’s run the following query:

SET STATISTICS TIME ON;
SELECT 
[DofWeek]
,AVG(ArrDelay) AS [means]
FROM
    AirFlights_small
GROUP BY 
    [DofWeek]
SET STATISTICS TIME OFF;

And check these time statistics

 SQL Server Execution Times:
CPU time = 6124 ms,  elapsed time = 2019 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.

Obviously the CPU/computation time is higher, although the elapsed time is shorter; the query runs on a parallel plan, so several cores accumulate CPU time at once.
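
A quick way to verify the effect of parallelism, as a sketch against the same table, is to force a serial plan and compare the two measurements:

SET STATISTICS TIME ON;
SELECT 
[DofWeek]
,AVG(ArrDelay) AS [means]
FROM
    AirFlights_small
GROUP BY 
    [DofWeek]
OPTION (MAXDOP 1);  -- force a serial plan for comparison
SET STATISTICS TIME OFF;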

ColumnStore Table

Let's create a nonclustered columnstore index.

CREATE TABLE AirFlights_CS
(id INT IDENTITY(1,1)
,ArrDelay INT
,CRSDepTime DECIMAL(6,4)
,DofWeek NVARCHAR(20) 
);
GO
INSERT INTO AirFlights_CS(ArrDelay, CRSDepTime, DofWeek)
SELECT ArrDelay, CRSDepTime, DofWeek FROM AirFlights_small 

CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_AirFlight
ON AirFlights_CS
(id, ArrDelay, CRSDepTime, DofWeek);
GO

Executing the same query:

SET STATISTICS TIME ON;
SELECT 
[DofWeek]
,AVG(ArrDelay) AS [means]
FROM
  AirFlights_CS
GROUP BY 
    [DofWeek]
SET STATISTICS TIME OFF;

we get the following time statistics:

 SQL Server Execution Times:
CPU time = 202 ms,  elapsed time = 109 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.

 

In-Memory OLTP

To use a memory-optimized table, the database first needs a memory-optimized filegroup, and the table itself is then created with MEMORY_OPTIMIZED turned on. A minimal sketch of the filegroup step (the database name and file path are assumptions):
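
-- sketch: add a memory-optimized filegroup (database name and path are assumptions)
ALTER DATABASE RevoTestDB
    ADD FILEGROUP RevoTestDB_MemOpt CONTAINS MEMORY_OPTIMIZED_DATA;
GO
ALTER DATABASE RevoTestDB
    ADD FILE (NAME = 'RevoTestDB_MemOpt_file', FILENAME = 'C:\Data\RevoTestDB_MemOpt')
    TO FILEGROUP RevoTestDB_MemOpt;
GO

With the filegroup in place, the table can be created: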

CREATE TABLE dbo.AirFlight_M   
(  
  id INT NOT NULL PRIMARY KEY NONCLUSTERED
 ,ArrDelay INT
 ,CRSDepTime DECIMAL(6,4) 
 ,DofWeek NVARCHAR(20)
) WITH (MEMORY_OPTIMIZED=ON, DURABILITY = SCHEMA_AND_DATA);
GO

And insert the data

INSERT INTO AirFlight_M
SELECT * FROM AirFlights_small

Running the simple query

SET STATISTICS TIME ON;
SELECT 
[DofWeek]
,AVG(ArrDelay) AS [means]
FROM
    AirFlight_M
GROUP BY 
    [DofWeek]
SET STATISTICS TIME OFF;

the results are:

 SQL Server Execution Times:
CPU time = 6186 ms,  elapsed time = 1627 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.

These results were somewhat expected, mostly because the columnstore table is the only one with an index, so its reads (as the execution plans also show) are optimized compared to the others. The degree of parallelism could be raised and clustered or nonclustered indexes could be added as well, but the idea was to keep the tests similar to the one in the RevoScaleR/R environment, where no index can be put on an XDF file.
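
For completeness, a clustered columnstore variant, not part of the measured tests, would be a sketch like this:

-- sketch: clustered columnstore alternative (not included in the tests above)
CREATE TABLE AirFlights_CCI
(id INT IDENTITY(1,1)
,ArrDelay INT
,CRSDepTime DECIMAL(6,4)
,DofWeek NVARCHAR(20)
);
GO
CREATE CLUSTERED COLUMNSTORE INDEX CCI_AirFlights ON AirFlights_CCI;
GO
INSERT INTO AirFlights_CCI (ArrDelay, CRSDepTime, DofWeek)
SELECT ArrDelay, CRSDepTime, DofWeek FROM AirFlights_small;
GO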

In R we run:

system.time({ 
LMResults <- rxLinMod(ArrDelay ~ DayOfWeek, data = outFile2, transforms = transforms)
LMResults$coefficients
})

And in SSMS we run:

SET STATISTICS TIME ON;
-- 1. T-SQL
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                LMResults <- rxLinMod(ArrDelay ~ DofWeek, data = InputDataSet)
                OutputDataSet <- data.frame(LMResults$coefficients)'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT ArrDelay, DofWeek FROM [dbo].[AirFlights_small]'
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((
            --DofWeek NVARCHAR(20)
        --    ,
            Coefficient DECIMAL(10,5)
            ));
GO
SET STATISTICS TIME OFF;


SET STATISTICS TIME ON;
-- 2. ColumnStore
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                LMResults <- rxLinMod(ArrDelay ~ DofWeek, data = InputDataSet)
                OutputDataSet <- data.frame(LMResults$coefficients)'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT ArrDelay, DofWeek FROM [dbo].[AirFlights_CS]'
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((
            --DofWeek NVARCHAR(20)
        --    ,
            Coefficient DECIMAL(10,5)
            ));
GO
SET STATISTICS TIME OFF;


SET STATISTICS TIME ON;
-- 3. Memory optimized
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                LMResults <- rxLinMod(ArrDelay ~ DofWeek, data = InputDataSet)
                OutputDataSet <- data.frame(LMResults$coefficients)'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT ArrDelay, DofWeek FROM [dbo].[AirFlight_M]'
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((
            --DofWeek NVARCHAR(20)
        --    ,
            Coefficient DECIMAL(10,5)
            ));
GO
SET STATISTICS TIME OFF;

 

Conclusion

Gathering the CPU time and elapsed time statistics for a simple linear regression gives the following comparison:

df_LR_comparison <- data.frame (
  method = c("T-SQL", "ColumnStore", "Memory Optimized", "RevoScaleR")
  ,CPUtime = c(3000,1625,2156,7689)
  ,ElapsedTime = c(14323,10851,10600,7760)
  )
library(ggplot2)

ggplot(df_LR_comparison, aes(method, fill=method)) + 
  geom_bar(aes(y=ElapsedTime), stat="identity") +
  geom_line(aes(y=CPUtime, group=1), color="white", size=3) +
  scale_colour_manual(" ", values=c("d1" = "blue", "d2" = "red"))+
  #scale_fill_manual("",values="red")+
  theme(legend.position="none")

This shows that the elapsed time is shortest for the R environment with RevoScaleR (reading the data from XDF), whereas the plain T-SQL table queried through sp_execute_external_script with RevoScaleR gives the slowest response.


In terms of CPU time (the white line), the columnstore table with the RevoScaleR call through the external procedure outperforms all the others.

Final conclusion: when running statistical analysis (using RevoScaleR or any other R library), use columnstore and index-optimized tables/views to get the best CPU and elapsed times. It is also important to remember that any aggregations and calculations that can be done within SQL Server are better performed there.
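
To illustrate that last point, the day-of-week averages can be computed in the input query itself, so only seven summary rows ever reach the R engine (a sketch reusing the AirFlights_CS table):

-- sketch: aggregate in T-SQL first and hand only the summary rows to R
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT DofWeek, AVG(CAST(ArrDelay AS FLOAT)) AS mean_delay
                   FROM dbo.AirFlights_CS
                   GROUP BY DofWeek'
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = N'OutputDataSet <- InputDataSet'
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((DofWeek NVARCHAR(20), mean_delay FLOAT));
GO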

 

As always, code is available at GitHub.

 

Happy coding! 🙂

Comparing performance on dplyr package, RevoScaleR package and T-SQL on simple data manipulation tasks

For a long time I have wanted to test some simple data manipulation tasks and compare the execution time, ease of writing the code and simplicity between T-SQL and R packages for data manipulation.

A couple of packages to mention for data manipulation are plyr, dplyr and data.table; the idea is to compare execution time, simplicity and ease of writing against plain T-SQL and the RevoScaleR package. For this blog post I will use the R package dplyr and T-SQL, together with the possibilities of the RevoScaleR computation functions.

My initial query will be the following, run against the WideWorldImportersDW database. No other alterations have been done to the underlying tables (Fact.Sale or Dimension.City).

USE WideWorldImportersDW;
GO

SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory]
        ,fs.[Customer Key]
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN [Dimension].[City] AS c
        ON c.[City Key] = fs.[City Key];

For the comparison metric I will be using simple time statistics (elapsed time), and every piece of code will be run from SSMS.

-- Example
SET STATISTICS TIME ON
SELECT GETDATE()
SET STATISTICS TIME OFF


The first simple test is just returning the data.

PRINT ' '
PRINT 'STATISTICS WITH T-SQL'
PRINT ' '

SET STATISTICS TIME ON
SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory]
        ,fs.[Customer Key]
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]

SET STATISTICS TIME OFF

PRINT ' '
PRINT 'STATISTICS WITH R'
PRINT ' '

SET STATISTICS TIME ON
DECLARE @TSQL AS NVARCHAR(MAX)
SET @TSQL = N'SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory]
        ,fs.[Customer Key]
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]'

DECLARE @RScript AS NVARCHAR(MAX)
SET @RScript = N'OutputDataSet <- InputDataSet'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RScript
          ,@input_data_1 = @TSQL
          
WITH RESULT SETS ((
     [City Key]  INT
    ,[City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,[Customer Key]  INT
    ,[Invoice Date Key]  DATE
    ,[Quantity]  INT
    ,[Unit Price]  DECIMAL(18,3)
    ,[Tax Amount]  DECIMAL(18,3)
    ,[Total Including Tax]  DECIMAL(18,2)
    ,[Profit]  DECIMAL(18,2)
                 )); 
SET STATISTICS TIME OFF

This test is very simple: just returning the data. In the T-SQL case it is a plain SELECT statement; in the R/T-SQL case the data is passed to the R engine and returned as a data frame. The test itself is trivial, almost "absurd", but it is important for establishing a baseline for the timing results.


Running the query a couple of times on my client gives a basic comparison: T-SQL outperforms R/T-SQL and is almost two times faster. I would actually have expected a bigger difference, because the R path involves a lot of data movement and no computation.

Filtering / WHERE

Now let's add a simple WHERE clause, i.e. some basic filtering.

PRINT ' '
PRINT 'STATISTICS WITH T-SQL'
PRINT ' '

-- SIMPLE T-SQL
SET STATISTICS TIME ON
SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory]
        ,fs.[Customer Key]
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]
    WHERE
        [Sales Territory] <> 'Southeast'
    AND fs.[Customer Key] <> 0
    AND Profit <= 1000
SET STATISTICS TIME OFF


PRINT ' '
PRINT 'STATISTICS WITH R'
PRINT ' '

-- R Package dplyr and T-SQL
SET STATISTICS TIME ON

DECLARE @TSQL AS NVARCHAR(MAX)
SET @TSQL = N'SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory] AS SalesTerritory
        ,fs.[Customer Key] AS CustomerKey
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]'

DECLARE @RScript AS NVARCHAR(MAX)
SET @RScript = N'
                library(dplyr)
                OutputDataSet  <- InputDataSet %>% 
           filter(SalesTerritory != "Southeast", CustomerKey != 0, Profit <= 1000)'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RScript
          ,@input_data_1 = @TSQL
          
WITH RESULT SETS ((
     [City Key]  INT
    ,[City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,[Customer Key]  INT
    ,[Invoice Date Key]  DATETIME
    ,[Quantity]  INT
    ,[Unit Price]  DECIMAL(18,3)
    ,[Tax Amount]  DECIMAL(18,3)
    ,[Total Including Tax]  DECIMAL(18,2)
    ,[Profit]  DECIMAL(18,2)
                 )); 

SET STATISTICS TIME OFF


Again, the results of both executions are the same, but the difference is getting bigger. The main reason is that all the data must be transferred to the R engine before it can be manipulated and the results returned.
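
One way to reduce that transfer cost is to push the filter into the @input_data_1 query itself, so only the qualifying rows cross over into the R engine; a sketch of the idea:

-- sketch: filter in the input query so R only receives the rows it needs
DECLARE @TSQL AS NVARCHAR(MAX)
SET @TSQL = N'SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory] AS SalesTerritory
        ,fs.[Customer Key] AS CustomerKey
        ,fs.Profit
    FROM [Fact].[Sale] AS fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]
    WHERE c.[Sales Territory] <> ''Southeast''
      AND fs.[Customer Key] <> 0
      AND fs.Profit <= 1000'

DECLARE @RScript AS NVARCHAR(MAX)
SET @RScript = N'OutputDataSet <- InputDataSet'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RScript
          ,@input_data_1 = @TSQL
WITH RESULT SETS ((
     [City Key]  INT
    ,[City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,[Customer Key]  INT
    ,[Profit]  DECIMAL(18,2)
                 ));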

Next I can make a direct comparison between the dplyr package and the RevoScaleR package; this way both approaches share the same common ground.

PRINT ' '
PRINT 'STATISTICS WITH R dplyr'
PRINT ' '

SET STATISTICS TIME ON

DECLARE @TSQL AS NVARCHAR(MAX)
SET @TSQL = N'SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory] AS SalesTerritory
        ,fs.[Customer Key] AS CustomerKey
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]'

DECLARE @RScript AS NVARCHAR(MAX)
SET @RScript = N'
                library(dplyr)
                OutputDataSet  <- InputDataSet %>% 
filter(SalesTerritory != "Southeast", CustomerKey != 0, Profit <= 1000)'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RScript
          ,@input_data_1 = @TSQL
          
WITH RESULT SETS ((
     [City Key]  INT
    ,[City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,[Customer Key]  INT
    ,[Invoice Date Key]  DATETIME
    ,[Quantity]  INT
    ,[Unit Price]  DECIMAL(18,3)
    ,[Tax Amount]  DECIMAL(18,3)
    ,[Total Including Tax]  DECIMAL(18,2)
    ,[Profit]  DECIMAL(18,2)
                 )); 

SET STATISTICS TIME OFF

PRINT ' '
PRINT 'STATISTICS WITH R RevoScaleR'
PRINT ' '

SET STATISTICS TIME ON

DECLARE @TSQL1 AS NVARCHAR(MAX)
SET @TSQL1 = N'SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory] AS SalesTerritory
        ,fs.[Customer Key] AS CustomerKey
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]'

DECLARE @RevoRScript AS NVARCHAR(MAX)
SET @RevoRScript = N'
                    library(RevoScaleR)
                    OutputDataSet <- rxXdfToDataFrame(rxDataStep(
inData = InputDataSet, outFile="Cust_data_filter.xdf", 
overwrite=TRUE, rowsPerRead = 100000, 
rowSelection =SalesTerritory != "Southeast" & CustomerKey != 0 & Profit <= 1000))'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RevoRScript
          ,@input_data_1 = @TSQL1
          
WITH RESULT SETS ((
     [City Key]  INT
    ,[City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,[Customer Key]  INT
    ,[Invoice Date Key]  DATETIME
    ,[Quantity]  INT
    ,[Unit Price]  DECIMAL(18,3)
    ,[Tax Amount]  DECIMAL(18,3)
    ,[Total Including Tax]  DECIMAL(18,2)
    ,[Profit]  DECIMAL(18,2)
                 )); 

SET STATISTICS TIME OFF

With printed output results.


We can see from the output that the dplyr package is in this case slightly ahead of the RevoScaleR package, but the difference is only about 0.3 seconds in favour of dplyr. Note that in the RevoScaleR code I am creating an *.xdf file on the fly and still get almost the same timing.
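
If the intermediate *.xdf file is not actually needed, rxDataStep can, to my knowledge, return a data frame directly when no outFile is given, so the disk write can be skipped altogether; a sketch of that variant:

-- sketch: same filter, but rxDataStep returns a data frame directly (no outFile)
DECLARE @TSQL1 AS NVARCHAR(MAX)
SET @TSQL1 = N'SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory] AS SalesTerritory
        ,fs.[Customer Key] AS CustomerKey
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]'

DECLARE @RevoRScript AS NVARCHAR(MAX)
SET @RevoRScript = N'
                    library(RevoScaleR)
                    OutputDataSet <- rxDataStep(inData = InputDataSet,
                        rowSelection = SalesTerritory != "Southeast" & CustomerKey != 0 & Profit <= 1000)'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RevoRScript
          ,@input_data_1 = @TSQL1
WITH RESULT SETS ((
     [City Key]  INT
    ,[City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,[Customer Key]  INT
    ,[Invoice Date Key]  DATETIME
    ,[Quantity]  INT
    ,[Unit Price]  DECIMAL(18,3)
    ,[Tax Amount]  DECIMAL(18,3)
    ,[Total Including Tax]  DECIMAL(18,2)
    ,[Profit]  DECIMAL(18,2)
                 ));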

Aggregating Data

Now let's test aggregating the data in all three scenarios.

PRINT ' '
PRINT 'STATISTICS WITH T-SQL'
PRINT ' '

-- SIMPLE T-SQL
SET STATISTICS TIME ON
SELECT 
         c.City
        ,c.[Sales Territory]
        ,SUM(fs.[Total Including Tax]) AS Total_turnover
        ,MAX(fs.[Total Including Tax]) AS Max_turnover
        ,MIN(fs.[Total Including Tax]) AS Min_turnover
        --,(fs.[Total Including Tax]) AS Median_turnover
        ,VAR(fs.[Total Including Tax]) AS Var_turnover
        ,STDEV(fs.[Total Including Tax]) AS stdev_turnover
        ,AVG(fs.[Total Including Tax]/fs.Profit) AS KPI_avg_Turnover2profit
        
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]
    WHERE
        [Sales Territory] <> 'Southeast'
    AND fs.[Customer Key] <> 0
    AND Profit <= 1000
    GROUP BY
         c.[Sales Territory]
        ,c.City

SET STATISTICS TIME OFF


PRINT ' '
PRINT 'STATISTICS WITH R dplyr'
PRINT ' '

SET STATISTICS TIME ON

-- Difference with T-SQL, I Have to pass all the values needed to filter out and 
-- aggregate data
DECLARE @TSQL1 AS NVARCHAR(MAX)
SET @TSQL1 = N'SELECT 
                     c.City
                    ,c.[Sales Territory] AS SalesTerritory
                    ,fs.[Customer Key] AS CustomerKey
                    ,fs.[Total Including Tax] AS TotalIncludingTax
                    ,fs.Profit
                FROM [Fact].[Sale] AS  fs
                    JOIN dimension.city AS c
                    ON c.[City Key] = fs.[City Key]'

DECLARE @RdplyrScript AS NVARCHAR(MAX)
SET @RdplyrScript = N'
                library(dplyr)
                OutputDataSet  <- InputDataSet %>% 
                                     mutate(Turnover2profit=TotalIncludingTax/Profit) %>%
                                    group_by(City, SalesTerritory)   %>%
                                    filter(SalesTerritory != "Southeast", CustomerKey != 0, Profit <= 1000) %>%
                                      summarise(    
                                         Total_turnover = sum(TotalIncludingTax, na.rm=TRUE)
                                         ,Max_turnover = max(TotalIncludingTax, na.rm=TRUE)
                                         ,Min_turnover = min(TotalIncludingTax, na.rm=TRUE)
                                         ,Median_turnover = median(TotalIncludingTax, na.rm=TRUE)
                                         ,Var_turnover = var(TotalIncludingTax, na.rm=TRUE)
                                         ,stdev_turnover = sd(TotalIncludingTax, na.rm=TRUE)
                                         ,KPI_avg_Turnover2profit = mean(Turnover2profit, na.rm=TRUE)    
                                      )'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RdplyrScript
          ,@input_data_1 = @TSQL1
          
WITH RESULT SETS ((
     [City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,Total_turnover DECIMAL(18,3)
    ,Max_turnover DECIMAL(18,3)
    ,Min_turnover DECIMAL(18,3)
    ,Median_turnover DECIMAL(18,3)
    ,Var_turnover DECIMAL(18,3)
    ,stdev_turnover DECIMAL(18,3)
    ,KPI_avg_Turnover2profit DECIMAL(18,3)
    )); 

SET STATISTICS TIME OFF


PRINT ' '
PRINT 'STATISTICS WITH R RevoScaleR'
PRINT ' '

SET STATISTICS TIME ON

DECLARE @TSQL2 AS NVARCHAR(MAX)
SET @TSQL2 = N'SELECT 
                     c.City
                    ,c.[Sales Territory] AS SalesTerritory
                    ,fs.[Customer Key] AS CustomerKey
                    ,fs.[Total Including Tax] AS TotalIncludingTax
                    ,fs.Profit
                FROM [Fact].[Sale] AS  fs
                    JOIN dimension.city AS c
                    ON c.[City Key] = fs.[City Key]'

DECLARE @RevoRScript AS NVARCHAR(MAX)
SET @RevoRScript = N'
                    library(RevoScaleR)
                    sales_rx1 <- rxDataStep(inData = InputDataSet, 
outFile="Cust_data_rx1.xdf", overwrite=TRUE, rowsPerRead = 100000, 
                                  rowSelection =SalesTerritory != "Southeast" &
 CustomerKey != 0 & Profit <= 1000)

                    sales_rx2 <- rxDataStep(sales_rx1, outFile="Cust_data_rx2.xdf",
                        transforms=list(
                            turnover2profit=TotalIncludingTax/Profit
                            ,City = City
                            ,SalesTerritory = SalesTerritory
                            ,TotalIncludingTax = TotalIncludingTax
                            ,Profit = Profit
                        ),
                        overwrite=TRUE, rowsPerRead = 100000)
                    
                        sales_rx4 <- rxSummary(TotalIncludingTax~SalesTerritory:City, 
data=sales_rx2,
                                               summaryStats=c("Mean", "StdDev",
 "Min", "Max", "Sum"))

                        sales_rx4_1 <- sales_rx4$categorical[[1]][c("SalesTerritory", 
"City", "Sum", "StdDev", "Min", "Max")]

                        sales_rx4P <- rxSummary(turnover2profit~SalesTerritory:City,
 data=sales_rx2,
                                               summaryStats=c("Mean"))

                        sales_rx4_2 <- sales_rx4P$categorical[[1]][c("SalesTerritory",
 "City", "Means")]

                        ## Merge data
                        sales_rx5 <- merge(sales_rx4_1, sales_rx4_2, 
by=c("SalesTerritory","City"), all=TRUE)
                        ## after the merge the columns are:
                        ## SalesTerritory, City, Sum, StdDev, Min, Max, Means
                        names(sales_rx5)[3] <- "Total_turnover"
                        names(sales_rx5)[4] <- "stdev_turnover"
                        names(sales_rx5)[5] <- "Min_turnover"
                        names(sales_rx5)[6] <- "Max_turnover"
                        names(sales_rx5)[7] <- "KPI_avg_Turnover2profit"

                        ## median and variance are not returned by rxSummary, so they are
                        ## added as NA (they would have to be computed manually)
                        sales_rx5$Median_turnover <- NA_real_
                        sales_rx5$Var_turnover <- NA_real_

                        ## reorder the columns to match the RESULT SETS definition below
                    OutputDataSet <- sales_rx5[, c("City", "SalesTerritory", "Total_turnover",
                        "Max_turnover", "Min_turnover", "Median_turnover", "Var_turnover",
                        "stdev_turnover", "KPI_avg_Turnover2profit")]'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RevoRScript
          ,@input_data_1 = @TSQL2
          
WITH RESULT SETS ((
     [City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,Total_turnover DECIMAL(18,3)
    ,Max_turnover DECIMAL(18,3)
    ,Min_turnover DECIMAL(18,3)
    ,Median_turnover DECIMAL(18,3)
    ,Var_turnover DECIMAL(18,3)
    ,stdev_turnover DECIMAL(18,3)
    ,KPI_avg_Turnover2profit DECIMAL(18,3)
    )); 

SET STATISTICS TIME OFF

 

The results are obvious: T-SQL is in this case faster, which is understandable. But comparing dplyr and RevoScaleR, RevoScaleR is faster at this data manipulation and at returning the results into SSMS.

The only problem I have is the amount of code needed to create simple summary statistics. Even though RevoScaleR is faster than dplyr (and plyr), its code is considerably harder to read and write. Chaining/piping is not possible, meaning that you have to store the intermediate results and write out the *.xdf files.


The RevoScaleR package did the job, but consider doing data manipulation in T-SQL or a dedicated data manipulation package, unless you are fine with all the merging, renaming and outputting.

I deliberately added some of these statistics to the output, and you can see that the median is supported only in the dplyr package, whereas in T-SQL or RevoScaleR you have to write and compute it manually. I also miss a variance metric in RevoScaleR, and mode/modus in all three: T-SQL, dplyr and RevoScaleR.
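
For the record, a median can still be produced in T-SQL, though it has to be written by hand with the PERCENTILE_CONT window function; a sketch against the same join:

-- sketch: median turnover per city and territory with PERCENTILE_CONT
SELECT DISTINCT
     c.City
    ,c.[Sales Territory]
    ,PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY fs.[Total Including Tax])
        OVER (PARTITION BY c.City, c.[Sales Territory]) AS Median_turnover
FROM [Fact].[Sale] AS fs
    JOIN dimension.city AS c
    ON c.[City Key] = fs.[City Key]
WHERE c.[Sales Territory] <> 'Southeast'
  AND fs.[Customer Key] <> 0
  AND fs.Profit <= 1000;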

In addition, I have also used dplyrXdf, a dplyr back end for *.xdf data frames that enables chaining/piping with the RevoScaleR external data format. You can install the package in RTVS or RStudio using:

devtools::install_github("RevolutionAnalytics/dplyrXdf")

The aggregation rewritten with dplyrXdf:

SET STATISTICS TIME ON

-- Difference with T-SQL, I Have to pass all the values needed to filter out 
-- and aggregate data
DECLARE @TSQL2 AS NVARCHAR(MAX)
SET @TSQL2 = N'SELECT 
                     c.City
                    ,c.[Sales Territory] AS SalesTerritory
                    ,fs.[Customer Key] AS CustomerKey
                    ,fs.[Total Including Tax] AS TotalIncludingTax
                    ,fs.Profit
                FROM [Fact].[Sale] AS  fs
                    JOIN dimension.city AS c
                    ON c.[City Key] = fs.[City Key]'

DECLARE @RevoRScript AS NVARCHAR(MAX)
SET @RevoRScript = N'
                    library(RevoScaleR)
                    library(dplyr)
                    library(dplyrXdf)
                    custXdf <- rxDataFrameToXdf(InputDataSet, 
                               outFile="Cust_data_aggregate.xdf", overwrite=TRUE)
                    custagg <- custXdf %>%
                      mutate(Turnover2profit=TotalIncludingTax/Profit) %>%
                      group_by(City, SalesTerritory)   %>%
                      filter(SalesTerritory != "Southeast", CustomerKey != 0, 
                                 Profit <= 1000) %>%
                      summarise(
                        Total_turnover = sum(TotalIncludingTax, na.rm=TRUE)
                        ,Max_turnover = max(TotalIncludingTax, na.rm=TRUE)
                        ,Min_turnover = min(TotalIncludingTax, na.rm=TRUE)
                        ,Median_turnover = median(TotalIncludingTax, na.rm=TRUE)
                        ,Var_turnover = var(TotalIncludingTax, na.rm=TRUE)
                        ,stdev_turnover = sd(TotalIncludingTax, na.rm=TRUE)
                        ,KPI_avg_Turnover2profit = mean(Turnover2profit, na.rm=TRUE)
                                )
                OutputDataSet <- data.frame(custagg)'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RevoRScript
          ,@input_data_1 = @TSQL2
          
WITH RESULT SETS ((
     [City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,Total_turnover DECIMAL(18,3)
    ,Max_turnover DECIMAL(18,3)
    ,Min_turnover DECIMAL(18,3)
    ,Median_turnover DECIMAL(18,3)
    ,Var_turnover DECIMAL(18,3)
    ,stdev_turnover DECIMAL(18,3)
    ,KPI_avg_Turnover2profit DECIMAL(18,3)
    )); 

SET STATISTICS TIME OFF

The problem with dplyrXdf is performance. The code is very easy to read, but the execution time is just absurd: I had to run it a couple of times and usually got around one minute of elapsed time, which is well beyond my patience zone. I also tried a different compute context, and it had no effect.

My conclusion: if you can, do all the data manipulation prior to pushing the data from T-SQL into R code, because dplyr and data.table support neither parallel computation nor out-of-memory data, while on the other hand the RevoScaleR functions and capabilities were not meant to do data manipulation at this scale, but are very good for summary statistics and overviews.

As always, code is available at GitHub.

Happy R-TSQLing.

 

Resource governor and external resource pool for Microsoft R Services

Setting an external resource pool for executing R commands with sp_execute_external_script has proven extremely useful, especially when other workloads are present and you do not want your data analysis to take too many resources away from them (particularly when running analysis in a production environment), or when you know that your data analysis will require maximum CPU and memory for a specific period. In such cases, defining an external resource pool is not only useful but highly recommended.

Resource Governor is a feature that enables you to manage SQL Server workloads and set limits on their system resource consumption. Limits can be configured for any workload in terms of CPU, memory and I/O consumption. When many different workloads run on the same SQL Server, Resource Governor helps allocate the requested resources.

By default, two pools are defined for system resources (default and internal), and only a default pool is defined for external resources.


Resource Governor works with resource pools and workload groups, where workload groups are groups of similar load that fall under the same classification criteria. When defining an external resource pool, we define the CPU and/or memory limits for external resources, and R Services (through the external procedure sp_execute_external_script) falls under external resources. According to MSDN, an external resource pool for R Services will govern "rterm.exe, BxlServer.exe, and other processes spawned by them" (quoted: link).

First of all, you need to have resource governor enabled:

-- Enable Resource Governor
ALTER RESOURCE GOVERNOR RECONFIGURE;  
GO
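
Whether the governor is actually enabled can be verified, to my knowledge, in the sys.resource_governor_configuration catalog view:

-- check that Resource Governor is enabled (is_enabled = 1)
SELECT classifier_function_id, is_enabled
FROM sys.resource_governor_configuration;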

Once you have this, let's set up a test query so we can measure how the Resource Governor behaves.

I will be using the RevoScaleR sample data that comes with each R Services installation of SQL Server 2016. The sample data should be available at the following location:

C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\R_SERVICES\library\RevoScaleR\SampleData


I will be using AirlineDemoSmall.csv, a CSV dataset of roughly 14 MB. Please note that the path to SampleData might differ, depending on your SQL Server 2016 installation folder.

In your database, create the table and import the data:

CREATE TABLE AirlineDemoSmall(
     ArrDelay varchar(100) NOT NULL
    ,CRSDepTime float NOT NULL
    ,[DayOfWeek] varchar(12) NOT NULL  
)
GO

-- this file should be at your location! so no need to download it
BULK INSERT AirlineDemoSmall
FROM 'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\R_SERVICES\library\RevoScaleR\SampleData\AirlineDemoSmall.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    FIRSTROW = 2 -- Skip header
)
*Please note: both statements (CREATE TABLE and BULK INSERT) are part of the RevoScaleR ImportAirlineDB.sql file; I am just reusing a section of that code. The whole file will be added to GitHub.

Once the data is imported, we will run a simple linear regression using the RevoScaleR library.

EXECUTE  sp_execute_external_script
                 @language = N'R'
                ,@script = N'
            library(RevoScaleR)
            f <- formula(as.numeric(ArrDelay) ~ as.numeric(DayOfWeek) + CRSDepTime)
            s <- system.time(mod <- rxLinMod(formula = f, data = AirLine))
            OutputDataSet <-  data.frame(system_time = s[3]);'
                ,@input_data_1 = N'SELECT * FROM AirlineDemoSmall'
                ,@input_data_1_name = N'AirLine'
-- WITH RESULT SETS UNDEFINED
WITH RESULT SETS 
            ((
                 Elapsed_time FLOAT
            ));

 

So I am returning only the R system.time result, and only the total elapsed time (hence the third element of the vector returned by system.time). The usual output of this function has the format: user time, system time and elapsed time.
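
If needed, all three components can be returned instead of just the elapsed time, as in this toy sketch:

-- toy sketch: return all three components of system.time from R
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = N'
        s <- system.time(Sys.sleep(1))
        OutputDataSet <- data.frame(user = s[1], system = s[2], elapsed = s[3])'
WITH RESULT SETS ((user_time FLOAT, system_time FLOAT, elapsed_time FLOAT));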


I will take only the elapsed time as the measure for comparing the runs with the resource pool enabled and disabled. Results from our control environment (before setting up the resource pool):


The computation time is 0.058 seconds, whereas the elapsed time (the value returned from our sp_execute_external_script) is 0.63 seconds.


Now we will create a new pool and limit the CPU and memory available for such workloads:

-- Default value
ALTER EXTERNAL RESOURCE POOL [default] 
WITH (AFFINITY CPU = AUTO)
GO

CREATE EXTERNAL RESOURCE POOL RService_Resource_Pool  
WITH (  
     MAX_CPU_PERCENT = 10  
    ,MAX_MEMORY_PERCENT = 5
);  

ALTER RESOURCE POOL [default] WITH (max_memory_percent = 60, max_cpu_percent=90);  
ALTER EXTERNAL RESOURCE POOL [default] WITH (max_memory_percent = 40, max_cpu_percent=10);  
ALTER RESOURCE GOVERNOR reconfigure;

 

The new pool, called RService_Resource_Pool, is created with the new values set.


And obligatory reconfiguration:

ALTER RESOURCE GOVERNOR RECONFIGURE;  
GO

This configuration can also be checked by querying the system catalog view:

-- Check configuration
SELECT * FROM sys.resource_governor_external_resource_pools


Please note that only 20 workers are dedicated to be used for R external processes.
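
Runtime consumption of the external pools can also be monitored, to my knowledge, through the corresponding DMV:

-- runtime statistics of the external resource pools
SELECT * FROM sys.dm_resource_governor_external_resource_pools;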


With each run of a RevoScaleR function, the counters in the DMV with external script execution statistics increase. Only RevoScaleR functions are counted; any other function is (to my knowledge) not counted, as of SQL Server 2016.

SELECT * FROM sys.dm_external_script_execution_stats


Once you have this, we also need to set up the workload group and the classification function:

CREATE WORKLOAD GROUP R_workgroup WITH (importance = medium) USING "default", 
EXTERNAL "RService_Resource_Pool";  

ALTER RESOURCE GOVERNOR WITH (classifier_function = NULL);  
ALTER RESOURCE GOVERNOR reconfigure;  

USE master  
GO  
CREATE FUNCTION RG_Class_function()  
RETURNS sysname  
WITH schemabinding  
AS  
BEGIN  
    IF program_name() in ('Microsoft R Host', 'RStudio') RETURN 'R_workgroup';  
    RETURN 'default'  
    END;  
GO  

ALTER RESOURCE GOVERNOR WITH  (classifier_function = dbo.RG_Class_function);  
ALTER RESOURCE GOVERNOR reconfigure;  
go

 

On MSDN you will find a post on setting up Resource Governor for R Services, which notes that, for security reasons, "direct management of the Launchpad service by using Resource Governor is not supported". In practice this means that you also need to create a login for one of the twenty R user names (MSSQLSERVER01 … MSSQLSERVER20) belonging to the SQLRUserGroup group in order for Resource Governor classification to work.

So quite some setup is needed at this point. I will be using the user MSSQLSERVER01 and create everything necessary: logins, permissions, etc. After setting up the MSSQLSERVER01 Windows user, I use EXECUTE AS LOGIN prior to running sp_execute_external_script.
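
The impersonation itself is then a one-liner around the call; a sketch (the machine name in the login is an assumption and has to match your SQLRUserGroup account):

-- sketch: run the external script under one of the R worker logins
EXECUTE AS LOGIN = 'SICN-KASTRUN\MSSQLSERVER01';  -- machine name is an assumption
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = N'OutputDataSet <- data.frame(R_user = Sys.info()[["user"]])'
WITH RESULT SETS ((R_user NVARCHAR(128)));
REVERT;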

Running the command:

SELECT * FROM sys.resource_governor_workload_groups;

you will now  see all the workload groups.

And now, finally we can rerun the same procedure:

-- We will run same query

EXECUTE  sp_execute_external_script
                 @language = N'R'
                ,@script = N'
   library(RevoScaleR)
   f <- formula(as.numeric(ArrDelay) ~ as.numeric(DayOfWeek) + CRSDepTime)
    s <- system.time(mod <- rxLinMod(formula = f, data = AirLine))
                            OutputDataSet <-  data.frame(system_time = s[3]);'
                ,@input_data_1 = N'SELECT * FROM AirlineDemoSmall'
                ,@input_data_1_name = N'AirLine'
-- WITH RESULT SETS UNDEFINED
WITH RESULT SETS 
            ((
                 Elapsed_time FLOAT
            ));

 

And CPU consumption will stay within the limits:


and the elapsed time will be higher due to the lower CPU available.


Happy R-Tsqling! 🙂

Code Is available at Github.

 

NOTE!

CPU consumption will rise after the execution of the external script command, due to the worker MSSQLSERVER01 cleaning up the session and log files!


Since the log and session files are handled by the Launchpad service, Resource Governor will not be able to limit the CPU, memory or I/O consumption of this cleanup.