Comparing performance of the dplyr package, the RevoScaleR package and T-SQL on simple data manipulation tasks

For a long time I have wanted to test simple data manipulation tasks and compare the execution time, ease of writing the code and simplicity between T-SQL and R packages for data manipulation.

A couple of R packages come to mind for data manipulation: plyr, dplyr and data.table. I will compare their execution time, simplicity and ease of writing against general T-SQL code and the RevoScaleR package. For this blog post I will use the R package dplyr and T-SQL, together with the RevoScaleR computation functions.

My initial query, run against the WideWorldImportersDW database, is shown below. No other alterations have been made to the underlying tables (Fact.Sale or Dimension.City).

USE WideWorldImportersDW;
GO

SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory]
        ,fs.[Customer Key]
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN [Dimension].[City] AS c
        ON c.[City Key] = fs.[City Key];

For the comparison metric I will be using simple time statistics (elapsed time), and all code will be run from SSMS.

-- Example
SET STATISTICS TIME ON
SELECT GETDATE()
SET STATISTICS TIME OFF

The first simple test is just outputting the data.

PRINT ' '
PRINT 'STATISTICS WITH T-SQL'
PRINT ' '

SET STATISTICS TIME ON
SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory]
        ,fs.[Customer Key]
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]

SET STATISTICS TIME OFF

PRINT ' '
PRINT 'STATISTICS WITH R'
PRINT ' '

SET STATISTICS TIME ON
DECLARE @TSQL AS NVARCHAR(MAX)
SET @TSQL = N'SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory]
        ,fs.[Customer Key]
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]'

DECLARE @RScript AS NVARCHAR(MAX)
SET @RScript = N'OutputDataSet <- InputDataSet'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RScript
          ,@input_data_1 = @TSQL
          
WITH RESULT SETS ((
     [City Key]  INT
    ,[City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,[Customer Key]  INT
    ,[Invoice Date Key]  DATE
    ,[Quantity]  INT
    ,[Unit Price]  DECIMAL(18,3)
    ,[Tax Amount]  DECIMAL(18,3)
    ,[Total Including Tax]  DECIMAL(18,2)
    ,[Profit]  DECIMAL(18,2)
                 )); 
SET STATISTICS TIME OFF

This test is very simple: outputting the data. In the case of T-SQL it is just a simple SELECT statement, and in the case of R-TSQL it is just passing the data to the R engine and returning it as a data frame. The test itself is trivial and “absurd”, but important for establishing a baseline of time-related results.

[Screenshot: SSMS elapsed-time results for the baseline test]

Running the query a couple of times on my client machine gives me a base comparison. T-SQL outperforms R-TSQL and is almost two times faster. I would have expected an even bigger difference in time, because the R-TSQL path involves a lot of data movement and no computation.

Filtering / WHERE

Now let’s perform simple filtering with a WHERE clause.

PRINT ' '
PRINT 'STATISTICS WITH T-SQL'
PRINT ' '

-- SIMPLE T-SQL
SET STATISTICS TIME ON
SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory]
        ,fs.[Customer Key]
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]
    WHERE
        [Sales Territory] <> 'Southeast'
    AND fs.[Customer Key] <> 0
    AND Profit <= 1000
SET STATISTICS TIME OFF


PRINT ' '
PRINT 'STATISTICS WITH R'
PRINT ' '

-- R Package dplyr and T-SQL
SET STATISTICS TIME ON

DECLARE @TSQL AS NVARCHAR(MAX)
SET @TSQL = N'SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory] AS SalesTerritory
        ,fs.[Customer Key] AS CustomerKey
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]'

DECLARE @RScript AS NVARCHAR(MAX)
SET @RScript = N'
                library(dplyr)
                OutputDataSet  <- InputDataSet %>% 
           filter(SalesTerritory != "Southeast", CustomerKey != 0, Profit <= 1000)'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RScript
          ,@input_data_1 = @TSQL
          
WITH RESULT SETS ((
     [City Key]  INT
    ,[City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,[Customer Key]  INT
    ,[Invoice Date Key]  DATETIME
    ,[Quantity]  INT
    ,[Unit Price]  DECIMAL(18,3)
    ,[Tax Amount]  DECIMAL(18,3)
    ,[Total Including Tax]  DECIMAL(18,2)
    ,[Profit]  DECIMAL(18,2)
                 )); 

SET STATISTICS TIME OFF

[Screenshot: SSMS elapsed-time results for the WHERE-clause test]

Again, the results of both executions are the same, but the difference is getting bigger. The main reason is that all the data must be transferred to the R engine before it can be manipulated and the results returned.

Now I can perform a simple comparison between the dplyr package and the RevoScaleR package. This way both run on the same ground for comparison.

PRINT ' '
PRINT 'STATISTICS WITH R dplyr'
PRINT ' '

SET STATISTICS TIME ON

DECLARE @TSQL AS NVARCHAR(MAX)
SET @TSQL = N'SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory] AS SalesTerritory
        ,fs.[Customer Key] AS CustomerKey
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]'

DECLARE @RScript AS NVARCHAR(MAX)
SET @RScript = N'
                library(dplyr)
                OutputDataSet  <- InputDataSet %>% 
filter(SalesTerritory != "Southeast", CustomerKey != 0, Profit <= 1000)'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RScript
          ,@input_data_1 = @TSQL
          
WITH RESULT SETS ((
     [City Key]  INT
    ,[City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,[Customer Key]  INT
    ,[Invoice Date Key]  DATETIME
    ,[Quantity]  INT
    ,[Unit Price]  DECIMAL(18,3)
    ,[Tax Amount]  DECIMAL(18,3)
    ,[Total Including Tax]  DECIMAL(18,2)
    ,[Profit]  DECIMAL(18,2)
                 )); 

SET STATISTICS TIME OFF

PRINT ' '
PRINT 'STATISTICS WITH R RevoScaleR'
PRINT ' '

SET STATISTICS TIME ON

DECLARE @TSQL1 AS NVARCHAR(MAX)
SET @TSQL1 = N'SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory] AS SalesTerritory
        ,fs.[Customer Key] AS CustomerKey
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]'

DECLARE @RevoRScript AS NVARCHAR(MAX)
SET @RevoRScript = N'
                    library(RevoScaleR)
                    OutputDataSet <- rxXdfToDataFrame(rxDataStep(
inData = InputDataSet, outFile="Cust_data_filter.xdf", 
overwrite=TRUE, rowsPerRead = 100000, 
rowSelection =SalesTerritory != "Southeast" & CustomerKey != 0 & Profit <= 1000))'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RevoRScript
          ,@input_data_1 = @TSQL1
          
WITH RESULT SETS ((
     [City Key]  INT
    ,[City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,[Customer Key]  INT
    ,[Invoice Date Key]  DATETIME
    ,[Quantity]  INT
    ,[Unit Price]  DECIMAL(18,3)
    ,[Tax Amount]  DECIMAL(18,3)
    ,[Total Including Tax]  DECIMAL(18,2)
    ,[Profit]  DECIMAL(18,2)
                 )); 

SET STATISTICS TIME OFF

And the printed output results:

[Screenshot: SSMS elapsed-time results for dplyr vs. RevoScaleR filtering]

We can see from the printout that the dplyr package is in this case slightly ahead of the RevoScaleR package, but the difference is only 0.3 seconds in favour of dplyr. Going through the code, note that I am creating the *.xdf file on the fly and still get almost the same timing.
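
As a side note: when no output file is needed, rxDataStep can return a data frame directly, which skips the on-the-fly *.xdf write altogether. A minimal sketch of that variant (not the code timed above; any effect on timing would have to be measured):

                    library(RevoScaleR)
                    # With no outFile, rxDataStep returns the filtered rows as a data frame,
                    # so the intermediate *.xdf file is skipped entirely.
                    OutputDataSet <- rxDataStep(inData = InputDataSet,
                                        rowSelection = SalesTerritory != "Southeast" &
                                                       CustomerKey != 0 & Profit <= 1000)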

Aggregating Data

Now let's test aggregating the data in all three scenarios.

PRINT ' '
PRINT 'STATISTICS WITH T-SQL'
PRINT ' '

-- SIMPLE T-SQL
SET STATISTICS TIME ON
SELECT 
         c.City
        ,c.[Sales Territory]
        ,SUM(fs.[Total Including Tax]) AS Total_turnover
        ,MAX(fs.[Total Including Tax]) AS Max_turnover
        ,MIN(fs.[Total Including Tax]) AS Min_turnover
        --,(fs.[Total Including Tax]) AS Median_turnover
        ,VAR(fs.[Total Including Tax]) AS Var_turnover
        ,STDEV(fs.[Total Including Tax]) AS stdev_turnover
        ,AVG(fs.[Total Including Tax]/fs.Profit) AS KPI_avg_Turnover2profit
        
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]
    WHERE
        [Sales Territory] <> 'Southeast'
    AND fs.[Customer Key] <> 0
    AND Profit <= 1000
    GROUP BY
         c.[Sales Territory]
        ,c.City

SET STATISTICS TIME OFF


PRINT ' '
PRINT 'STATISTICS WITH R dplyr'
PRINT ' '

SET STATISTICS TIME ON

-- Difference from T-SQL: I have to pass all the columns needed to filter
-- and aggregate the data
DECLARE @TSQL1 AS NVARCHAR(MAX)
SET @TSQL1 = N'SELECT 
                     c.City
                    ,c.[Sales Territory] AS SalesTerritory
                    ,fs.[Customer Key] AS CustomerKey
                    ,fs.[Total Including Tax] AS TotalIncludingTax
                    ,fs.Profit
                FROM [Fact].[Sale] AS  fs
                    JOIN dimension.city AS c
                    ON c.[City Key] = fs.[City Key]'

DECLARE @RdplyrScript AS NVARCHAR(MAX)
SET @RdplyrScript = N'
                library(dplyr)
                OutputDataSet  <- InputDataSet %>% 
                                     mutate(Turnover2profit=TotalIncludingTax/Profit) %>%
                                    group_by(City, SalesTerritory)   %>%
                                    filter(SalesTerritory != "Southeast", CustomerKey != 0, Profit <= 1000) %>%
                                      summarise(    
                                         Total_turnover = sum(TotalIncludingTax, na.rm=TRUE)
                                         ,Max_turnover = max(TotalIncludingTax, na.rm=TRUE)
                                         ,Min_turnover = min(TotalIncludingTax, na.rm=TRUE)
                                         ,Median_turnover = median(TotalIncludingTax, na.rm=TRUE)
                                         ,Var_turnover = var(TotalIncludingTax, na.rm=TRUE)
                                         ,stdev_turnover = sd(TotalIncludingTax, na.rm=TRUE)
                                         ,KPI_avg_Turnover2profit = mean(Turnover2profit, na.rm=TRUE)    
                                      )'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RdplyrScript
          ,@input_data_1 = @TSQL1
          
WITH RESULT SETS ((
     [City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,Total_turnover DECIMAL(18,3)
    ,Max_turnover DECIMAL(18,3)
    ,Min_turnover DECIMAL(18,3)
    ,Median_turnover DECIMAL(18,3)
    ,Var_turnover DECIMAL(18,3)
    ,stdev_turnover DECIMAL(18,3)
    ,KPI_avg_Turnover2profit DECIMAL(18,3)
    )); 

SET STATISTICS TIME OFF


PRINT ' '
PRINT 'STATISTICS WITH R RevoScaleR'
PRINT ' '

SET STATISTICS TIME ON

DECLARE @TSQL2 AS NVARCHAR(MAX)
SET @TSQL2 = N'SELECT 
                     c.City
                    ,c.[Sales Territory] AS SalesTerritory
                    ,fs.[Customer Key] AS CustomerKey
                    ,fs.[Total Including Tax] AS TotalIncludingTax
                    ,fs.Profit
                FROM [Fact].[Sale] AS  fs
                    JOIN dimension.city AS c
                    ON c.[City Key] = fs.[City Key]'

DECLARE @RevoRScript AS NVARCHAR(MAX)
SET @RevoRScript = N'
                    library(RevoScaleR)
                    sales_rx1 <- rxDataStep(inData = InputDataSet, 
outFile="Cust_data_rx1.xdf", overwrite=TRUE, rowsPerRead = 100000, 
                                  rowSelection =SalesTerritory != "Southeast" &
 CustomerKey != 0 & Profit <= 1000)

                    sales_rx2 <- rxDataStep(sales_rx1, outFile="Cust_data_rx2.xdf",
                        transforms=list(
                            turnover2profit=TotalIncludingTax/Profit
                            ,City = City
                            ,SalesTerritory = SalesTerritory
                            ,TotalIncludingTax = TotalIncludingTax
                            ,Profit = Profit
                        ),
                        overwrite=TRUE, rowsPerRead = 100000)
                    
                        sales_rx4 <- rxSummary(TotalIncludingTax~SalesTerritory:City, 
data=sales_rx2,
                                               summaryStats=c("Mean", "StdDev",
 "Min", "Max", "Sum"))

                        sales_rx4_1 <- sales_rx4$categorical[[1]][c("SalesTerritory", 
"City", "Sum", "StdDev", "Min", "Max")]

                        sales_rx4P <- rxSummary(turnover2profit~SalesTerritory:City,
 data=sales_rx2,
                                               summaryStats=c("Mean"))

                        sales_rx4_2 <- sales_rx4P$categorical[[1]][c("SalesTerritory",
 "City", "Means")]

                        ## Merge data
                        sales_rx5 <- merge(sales_rx4_1, sales_rx4_2, 
by=c("SalesTerritory","City"), all=TRUE)
                        names(sales_rx5)[3] <- "Total_turnover"
                        names(sales_rx5)[4] <- "Max_turnover"
                        names(sales_rx5)[5] <- "Min_turnover"
                        names(sales_rx5)[6] <- "stdev_turnover"
                        names(sales_rx5)[7] <- "KPI_avg_Turnover2profit"
                                                                
                    OutputDataSet <- sales_rx5'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RevoRScript
          ,@input_data_1 = @TSQL2
          
WITH RESULT SETS ((
     [City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,Total_turnover DECIMAL(18,3)
    ,Max_turnover DECIMAL(18,3)
    ,Min_turnover DECIMAL(18,3)
    ,Median_turnover DECIMAL(18,3)
    ,Var_turnover DECIMAL(18,3)
    ,stdev_turnover DECIMAL(18,3)
    ,KPI_avg_Turnover2profit DECIMAL(18,3)
    )); 

SET STATISTICS TIME OFF

 

The results are obvious. T-SQL is in this case faster, which is understandable. But comparing dplyr and RevoScaleR, RevoScaleR is faster at doing the data manipulation and returning the results to SSMS.

[Screenshot: SSMS elapsed-time results for the aggregation test]

The only problem I have is the amount of code needed to create simple summary statistics. Even though RevoScaleR is faster than dplyr (and plyr), the code is quite hard to read and write. Chaining / piping is not possible, meaning that you have to store intermediate results and keep writing xdf files.


The RevoScaleR package did the job, but consider doing data manipulation in T-SQL or in a dedicated data manipulation package, unless you are fine with all the merging, renaming and outputting.

I deliberately added some extra statistics to the output, and you can see that the median is supported only in the dplyr package, whereas in T-SQL or RevoScaleR you have to compute it manually (see the T-SQL sketch below). I also miss a variance metric in RevoScaleR, and mode / modus in all three: T-SQL, dplyr and RevoScaleR.
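
For illustration, one way to compute the median manually in T-SQL (a sketch, assuming SQL Server 2012 or later) is PERCENTILE_CONT; since it is a window function rather than a grouped aggregate, DISTINCT collapses the per-row values back to one row per group:

-- Median of [Total Including Tax] per City / Sales Territory
SELECT DISTINCT
         c.City
        ,c.[Sales Territory]
        ,PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY fs.[Total Including Tax])
            OVER (PARTITION BY c.City, c.[Sales Territory]) AS Median_turnover
    FROM [Fact].[Sale] AS fs
        JOIN [Dimension].[City] AS c
        ON c.[City Key] = fs.[City Key]
    WHERE
        [Sales Territory] <> 'Southeast'
    AND fs.[Customer Key] <> 0
    AND Profit <= 1000;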

In addition, I have also used dplyrXdf, which brings dplyr verbs to *.xdf data and enables the user to do chaining / piping with the RevoScaleR external data format. You can install the package in RTVS or RStudio using:

devtools::install_github("RevolutionAnalytics/dplyrXdf")

SET STATISTICS TIME ON

-- Difference from T-SQL: I have to pass all the columns needed to filter
-- and aggregate the data
DECLARE @TSQL2 AS NVARCHAR(MAX)
SET @TSQL2 = N'SELECT 
                     c.City
                    ,c.[Sales Territory] AS SalesTerritory
                    ,fs.[Customer Key] AS CustomerKey
                    ,fs.[Total Including Tax] AS TotalIncludingTax
                    ,fs.Profit
                FROM [Fact].[Sale] AS  fs
                    JOIN dimension.city AS c
                    ON c.[City Key] = fs.[City Key]'

DECLARE @RevoRScript AS NVARCHAR(MAX)
SET @RevoRScript = N'
                    library(RevoScaleR)
                    library(dplyr)
                    library(dplyrXdf)
                    custXdf <- rxDataFrameToXdf(InputDataSet, 
                               outFile="Cust_data_aggregate.xdf", overwrite=TRUE)
                    custagg <- custXdf %>%
                      mutate(Turnover2profit=TotalIncludingTax/Profit) %>%
                      group_by(City, SalesTerritory)   %>%
                      filter(SalesTerritory != "Southeast", CustomerKey != 0, 
                                 Profit <= 1000) %>%
                      summarise(
                        Total_turnover = sum(TotalIncludingTax, na.rm=TRUE)
                        ,Max_turnover = max(TotalIncludingTax, na.rm=TRUE)
                        ,Min_turnover = min(TotalIncludingTax, na.rm=TRUE)
                        ,Median_turnover = median(TotalIncludingTax, na.rm=TRUE)
                        ,Var_turnover = var(TotalIncludingTax, na.rm=TRUE)
                        ,stdev_turnover = sd(TotalIncludingTax, na.rm=TRUE)
                        ,KPI_avg_Turnover2profit = mean(Turnover2profit, na.rm=TRUE)
                                )
                OutputDataSet <- data.frame(custagg)'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RevoRScript
          ,@input_data_1 = @TSQL2
          
WITH RESULT SETS ((
     [City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,Total_turnover DECIMAL(18,3)
    ,Max_turnover DECIMAL(18,3)
    ,Min_turnover DECIMAL(18,3)
    ,Median_turnover DECIMAL(18,3)
    ,Var_turnover DECIMAL(18,3)
    ,stdev_turnover DECIMAL(18,3)
    ,KPI_avg_Turnover2profit DECIMAL(18,3)
    )); 

SET STATISTICS TIME OFF

The problem with dplyrXdf is performance. The code is very easy to read, but the execution time is just absurd. I ran it a couple of times and usually got around 1 minute of elapsed time, which is beyond my patience zone. I tried different compute contexts and it had no effect (see the sketch below).
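
For reference, the compute context in RevoScaleR is switched with rxSetComputeContext; a minimal sketch (the exact contexts and settings I tried are not reproduced here):

library(RevoScaleR)
# run the dplyrXdf pipeline under the parallel local compute context ...
rxSetComputeContext(RxLocalParallel())
# ... and switch back to the default sequential local context afterwards
rxSetComputeContext(RxLocalSeq())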

My conclusion: if you can, do all the data manipulation before pushing data from T-SQL into R code, because dplyr and data.table support neither parallel computation nor data that does not fit in memory; on the other hand, the RevoScaleR functions and capabilities were not meant for data manipulation at this scale, but they are very good for summary statistics and overviews.

As always, code is available at GitHub.

Happy R-TSQLing.

 

Data science show at SQL Server Radio

SQL Server Radio, with Matan Yungman (info | twitter | company) and Guy Glantser (info | twitter | company), hosted me for the 52nd show. The main topics were data science, statistics, who a data scientist is, and what SQL Server offers with R integration, SSRS, Power BI, Excel and other tools.

[SQL Server Radio cover image]

45 minutes of talk and laughter with Matan and Guy. Both are very funny and highly professional, and our discussion was full of interesting topics, questions, anecdotes and use cases from experience.

Listen to the podcast of the 52nd show of SQL Server Radio and hear what we were laughing at and discussing.

To recap what we were nibbling and chatting about (well, what I was chatting about, and I am still laughing hearing the show, he he):

  • What data science is and who a data scientist is
  • Problems with outsourcing data scientists
  • What tools are commonly used in data science
  • What Microsoft is offering with SQL Server 2016
  • Purpose and goals of data science
  • Predictions, models and analytics
  • R language and R integration
  • Use cases from the field

 

Enjoy listening, and thank you, Matan and Guy, for this show; I had a great time!

Running multiple correlations with R and T-SQL

Getting to know the data is always an interesting part of data science. With R integration into SQL Server, the exploration part is still part of the game.

The usual way to get some statistics out of a dataset is to run frequencies, descriptive statistics and, of course, correlations.

Running correlations against a set of variables in T-SQL might be a bit of a drag, whereas using R code with SP_EXECUTE_EXTERNAL_SCRIPT is as easy as the following:

USE WideWorldImporters;
GO

 DECLARE @sql NVARCHAR(MAX)
 SET @sql = 'SELECT 
                      SupplierID
                    , UnitPackageID
                    , OuterPackageID
                    , LeadTimeDays
                    , QuantityPerOuter
                    , TaxRate
                    , UnitPrice
                    , RecommendedRetailPrice
                    , TypicalWeightPerUnit
                FROM [Warehouse].[StockItems]'

DECLARE @Rscript NVARCHAR(MAX)
SET @Rscript = N'df <- data.frame(cor(Stock, use="complete.obs", method="pearson"))
                OutputDataSet<-df'

EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script=@Rscript
      ,@input_data_1 = @sql
      ,@input_data_1_name = N'Stock'
WITH RESULT SETS (( 
                     SupplierID NVARCHAR(100)
                    ,UnitPackageID NVARCHAR(100)
                    ,OuterPackageID NVARCHAR(100)
                    ,LeadTimeDays NVARCHAR(100)
                    ,QuantityPerOuter NVARCHAR(100)
                    ,TaxRate NVARCHAR(100)
                    ,UnitPrice NVARCHAR(100)
                    ,RecommendedRetailPrice NVARCHAR(100)
                    ,TypicalWeightPerUnit NVARCHAR(100)
                    ));

I am using WideWorldImporters (available on GitHub or CodePlex), the new demo database from Microsoft that was released just this month, at the beginning of June 2016.

By running this query, R returns the correlations as a data frame that T-SQL is able to interpret and output in SSMS in the following format. Very cool.

[Screenshot: correlation matrix output in SSMS]

The output looks very similar to the one you would get, for example, in SPSS:

[Screenshot: correlation output in IBM SPSS Statistics Viewer]

The numbers match (!) and the layout is roughly the same; very clear and easily readable. One thing is missing: SPSS delivers statistical significance (the p-value), whereas R here only delivers the value of the Pearson correlation coefficient. For that we need to run an additional T-SQL / R procedure in order to get the p-values.

DECLARE @sql NVARCHAR(MAX)
SET @sql = 'SELECT 
                     SupplierID
                    ,UnitPackageID
                    ,OuterPackageID
                    ,LeadTimeDays
                    ,QuantityPerOuter
                    ,TaxRate
                    ,UnitPrice
                    ,RecommendedRetailPrice
                    ,TypicalWeightPerUnit
                FROM [Warehouse].[StockItems]'

DECLARE @Rscript NVARCHAR(MAX)
SET @Rscript = N'
                library(Hmisc) 
                df <- data.frame(rcorr(as.matrix(Stock), type="pearson")$P)
                OutputDataSet<-df
                '

EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script=@Rscript
      ,@input_data_1 = @sql
      ,@input_data_1_name = N'Stock'
WITH RESULT SETS (( 
                     SupplierID DECIMAL(10,5)
                    ,UnitPackageID DECIMAL(10,5)
                    ,OuterPackageID DECIMAL(10,5)
                    ,LeadTimeDays DECIMAL(10,5)
                    ,QuantityPerOuter DECIMAL(10,5)
                    ,TaxRate DECIMAL(10,5)
                    ,UnitPrice DECIMAL(10,5)
                    ,RecommendedRetailPrice DECIMAL(10,5)
                    ,TypicalWeightPerUnit DECIMAL(10,5)
                    ));

So now we have the statistical significance of our correlation matrix. I used the Hmisc library and the rcorr function.

[Screenshot: p-value matrix output in SSMS]

The rcorr function has very few options to set, so results may vary when compared to other functions (with their defaults). You can also use the cor.test function:

data.frame(p_value = cor.test(df$my_var1,df$my_var2,use="complete.obs", 
method="pearson")$p.value, var1= "my_var1", var2= "my_var2")

but since this function cannot deal with a matrix / data frame, a loop is needed to go through every combination of variables and store the results, together with the variable names, in a data frame (a sketch follows below). The rcorr function will do the trick for now.
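
A sketch of such a loop (a hypothetical helper, not part of the scripts above):

# Run cor.test over every pair of columns and collect the correlation
# coefficient and p-value for each pair into one data frame.
pairwise_cor <- function(df) {
    vars <- names(df)
    out  <- data.frame()
    for (i in 1:(length(vars) - 1)) {
        for (j in (i + 1):length(vars)) {
            ct  <- cor.test(df[[i]], df[[j]], method = "pearson")
            out <- rbind(out, data.frame(var1 = vars[i], var2 = vars[j],
                                         r = unname(ct$estimate), p_value = ct$p.value))
        }
    }
    out
}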

The final step would be (hint) to combine both sp_execute_external_script calls into one stored procedure, compute both results in R, combine the coefficients with the significance levels and export only one table with all the information needed (one possible shape is sketched below). This is already prepared as part of my R scripts.
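
One way the combined R script could look (a sketch, not my published version, assuming the same Stock input data set as above) is to take both the $r and $P matrices from a single rcorr call and return them as one long-format table:

library(Hmisc)
rc  <- rcorr(as.matrix(Stock), type = "pearson")   # $r = coefficients, $P = p-values
idx <- which(upper.tri(rc$r), arr.ind = TRUE)      # each variable pair once
OutputDataSet <- data.frame(
        var1    = rownames(rc$r)[idx[, 1]],
        var2    = colnames(rc$r)[idx[, 2]],
        r       = rc$r[idx],
        p_value = rc$P[idx])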

Happy R-SQLing!

Faux pas of data science

Data scientist has been the “sexiest” job of the past years, mostly thanks to the buzzword bubble created around it. With the emergence of so-called Big Data came the need for a new word formulation, very much like the creation of the phrase business intelligence (BI). Some of you might still remember when it used to be called a decision support system (DSS). Regardless of the name, it evolved slowly with computer science and the entrance of computers into daily life.

I am fine with the DSS or BI naming; it still captures the gist of how and when the acquisition and transformation of raw data into meaningful and useful information can help support the business.

I am also fine with the slow evolution from decision support to research to data mining to machine learning to data science. For me, it is still just crunching the numbers, knowing mathematics and statistics, and all the “non-fancy” stuff: cleaning, normalizing and de-duplicating data, exploring and exploring some more, peer reviews, and diving into the data again until coming to the “fancy” part of drawing conclusions and helping business people with their decisions.

What I am not fine with is the following:

  1. Data science combines all the standard practices and knowledge a statistician must know!
  2. Data science is sexy for the part of knowing and understanding the algorithms for multivariate statistics, for making predictions and for finding patterns in the data. This is sexy, but to get to this point, one must be a mathematician/statistician with many years of experience. The rest is just crap! Assuring data quality (no business wants to hear that, and nobody wants to do it; in reality, if your data is of poor quality, don’t expect good-quality results), sitting countless hours with one or two variables and figuring out their behavior, correlation or causality, diving into the literature to find a smoothing algorithm that will assure a better result, etc. Well, this is not really crap, but it is usually what “buzzword” people don’t really like to mention!
  3. With Big Data come big, big, big problems. Eventual consistency is probably the biggest lie ever (the abuse is similar to that of the statistical significance of the p-value); having inconsistent data represents a big challenge. Big Data made a big promise that a lot of data scientists couldn’t deliver (not for lack of knowledge, but usually for lack of time or money). Big Data never cared to look into the relational model, and it was never meant for businesses to adopt in order to extract relevant information. But again, this was not the fault of data scientists, but of slowly adapting businesses. Stories about the 4 Vs (volume, velocity, variety, value) can be misleading, mainly because the 4V technology is usually a separate story from the real research and mining of the data (unless you are dealing with stream analysis or pushing new models into your business daily; but even week-old data will often be sufficient to prove a point).
  4. Everyone wants to be a data scientist. Yes, and I want a pony. No, no. I want a rainbow unicorn. Being a data scientist takes dedication: reading piles of books full of formulas (usually hard to understand, but they actually make sense!), sitting with random data sets, switching between random mathematical/statistical/database/scripting programs and languages in order to – well – just prepare the data.
  5. All the new technologies are boosting the egos of non-data-scientists with the fake vision that a simple prediction of your company’s sales can be done with a couple of clicks. I can’t argue with that. My only question is: would the result of this 5-minute drag-and-drop prediction be of any relevance? Or even correct?
  6. Everyone likes data scientists. But nobody likes statisticians. Or mathematicians. The former are supposedly abusive toward data and lie about the results, and the latter are philosophers with countless formulas proving the existence of life to the fifteenth decimal place. But the reality is: data scientist = statistician + mathematician. So get over it! I still vividly remember, 20+ years ago, how “data science” back then was neglected and its reputation was… well, it wasn’t.
  7. R and Python are the next best things I have to learn. Well, don’t, if you don’t intend to use them. Go and learn something more useful: Spanish, for example. R has been in the community for the past 30+ years; it wasn’t invented just recently. So has Python. And we have been using both for the purpose of supporting business decisions. If you would like to learn R, ask yourself: 1) Do I know any statistics? and 2) Can I explain the difference between Naive Bayes and the Pearson correlation coefficient? If you answer both negatively, I suggest you start learning Spanish.
  8. Programming is in many aspects very close to the theory of statistics. Sampling, for example, is one of those areas where good programming knowledge will boost your abilities in data sampling and in different approaches to probability theory.
  9. Salaries are relative. Data scientists can get very good salaries, especially those who are able to combine a) knowledge of statistics/mathematics with b) computer literacy (programming, data manipulation) and c) a very good understanding of business processes. A lot of knowledge and understanding comes from experience and repetitive work; the rest comes with determination and intelligence.
  10. It is hard to be a data scientist in a mid-sized to big company, but much easier in a small one or as a freelancer!

So next time you use the term data science or data scientist, or you label yourself as one, keep in mind a couple of the points above. And unless you have done some kind of research for years and still get a kick out of it, please don’t call it a sexy job. You might offend someone.