Is it possible to use RevoScaleR package in Power BI?

I was invited to deliver a session for the Belgium User Group on SQL Server and R integration. After the session, which we did online over a web-based Citrix session, I got an interesting question: “Is it possible to use RevoScaleR performance computational functions within Power BI?“. My first answer was a sceptical yes, but I said that I had not used it in this manner yet and that there might be some limitations.

The idea of having a scalable environment and a parallel computational package with all the predictive analytical functions in Power BI is absolutely great. But something tells me it will not be that straightforward.

So let’s start by taking a large (500 MB) txt file and creating an XDF file:

library(RevoScaleR)
file.name <- "YearPredictionMSD.txt"
rxOptions(sampleDataDir = "C:\\Files")
sampleDataDir <- rxGetOption("sampleDataDir")
# import the text file into XDF format (the tail of this snippet was lost
# in the original post; the rxImport() call below is a reconstruction)
outputFile <- rxImport(inData = file.path(sampleDataDir, file.name),
                       outFile = "C:\\Files\\YearPredictionMSD.xdf", overwrite = TRUE)

The file is available online at this address as a zip file.

Getting data with R script

Open Power BI, choose Get Data -> R Script, and copy/paste the following slightly changed code:

library(RevoScaleR)
file.name <- "YearPredictionMSD.xdf";
rxOptions(sampleDataDir = "C:\\Files");
# read the XDF file back into a data frame so Power BI can pick it up
# (the original snippet was truncated; this completion is a sketch)
dataset <- rxXdfToDataFrame(file.path(rxGetOption("sampleDataDir"), file.name))

After pasting the code and clicking OK,

[Screenshot: Power BI Desktop]

you will have to wait for the data to be read into memory and the data models to be created. After monitoring the memory consumption and patiently waiting, you will notice that this particular dataset (500 MB txt, or 160 MB as XDF) consumes a minimum of 3 GB of RAM, and you will end up with a preview:

[Screenshot: data preview in Power BI]

By now you will also notice that, after saving this Power BI document, it takes up to 700 MB of disk space, and all the data visualizations will consume additional RAM and time. After you close the Power BI document, you will notice a lot of RAM being released.

Using R Script in the visuals

In a new Power BI document, I will create a new dataset by choosing Enter data, with three “dummy” variables.

[Screenshot: Enter data dialog in Power BI]

With these three variables I will try to inject the data returned from the XDF file and have it represented in Power BI.

After adding a new visual and choosing the R visual, I inserted the following code:

library(RevoScaleR)
file.name <- "YearPredictionMSD.xdf";
rxOptions(sampleDataDir = "C:\\Files");
# plot straight from the XDF file (the original snippet was truncated;
# the rxHistogram() call is a reconstruction consistent with the result)
rxHistogram(~V1, data = file.path(rxGetOption("sampleDataDir"), file.name))

And this time the result is fascinating: R plots the histogram in a split second, which simply means it takes advantage of the XDF file and injects it into Power BI.

[Screenshot: R visual histogram in Power BI Desktop]

This is still an outer file or dataset that Power BI does not know anything about, meaning no slicers are available for dynamically changing the user selection.

Let’s try to insert the data into those three dummy variables, where the third one will be a factor that I have to prepare in advance. Since in this case the factor is Year, it is relatively easy to do:

library(RevoScaleR)
library(gridExtra)
library(dplyr)
# (this snippet was garbled in the original post; the reconstruction below
# is a sketch - Year comes from the dummy slicer dataset fed to the R visual)
Year <- unique(dataset$Year)
df_f <- data.frame(rxDataStep("C:\\Files\\YearPredictionMSD.xdf")) %>%
        filter(year %in% c("2000", "2001", "2002"))
grid.table(df_f %>% filter(year == Year))

Once I have this inserted in a new R visual, I just need to add a dummy slicer.

[Screenshot: R visual with slicer in Power BI Desktop]

Now I can easily change the years for my cross-tabulation (using the rxCrossTabs function). Since the calculation is computed in the background on the whole dataset, with the dplyr package used just to omit or filter the results, it is also possible to use rxDataStep:

rxDataStep(inData=outputFile, outFile="C:\\Files\\YearPredictMSD_Year.xdf", 
             overwrite=TRUE, transforms=list(LateYears = V1 > 1999))
rxCrossTabs(V2~F(LateYears), data = "C:\\Files\\YearPredictMSD_Year.xdf")

In this way, you will be creating a new XDF file, with the transformation, through Power BI. Bear in mind that this step might take some extra seconds to create a new variable or to make a subset, if you need one. Again, this is up to you to decide, based on the file size.

Using SQL Server procedure with R Script

This approach is not that uncommon, because it has been proven that using stored procedures with T-SQL and R code is a useful and powerful way to use SQL Server and R integration within SSRS. Changing the computational context is surely another workaround.

Creating the stored procedure:

CREATE PROCEDURE [dbo].[SP_YearMSD_CrossTab]
AS
BEGIN
    DECLARE @RScript nvarchar(max)
    -- the R body was truncated in the original post; this completion is a sketch
    SET @RScript = N'
                library(RevoScaleR)
                crosstab <- rxCrossTabs(V2 ~ F(LateYears),
                              data = "C:\\Files\\YearPredictMSD_Year.xdf")
                OutputDataSet <- data.frame(crosstab$counts)'
    EXEC sys.sp_execute_external_script
          @language = N'R'
         ,@script = @RScript
END;

Or, by copying the T-SQL code into the SQL Server data source, the result is the same.

[Screenshot: SQL Server data source in Power BI Desktop]

In both cases, you should have a cross-tabulation representation of the XDF dataset within Power BI. And now you can really use all the advantages of Power BI visuals, slicers, as well as any additional R predictions.

[Screenshot: cross-tabulation visuals in Power BI Desktop]

There is a slight downside to this approach (if not all of them): you need to generate many stored procedures or queries like this. rxCube will also help you to some extent, but repetitive work cannot be avoided.

Using HDInsight or Hadoop?

Using XDF files stored in HDInsight or in Hadoop would generally mean using the same dataset and steps as for the SQL Server procedure, except that prior to executing the T-SQL script you would also need to change the computational context:

# HDInsight - Spark - Azure (the original snippet was garbled;
# this is a reconstruction with placeholder credentials)
mySparkCC <- RxSpark(sshUsername = USNM, sshHostname = HSTNM,
                     sshSwitches = SWTCH)
rxSetComputeContext(mySparkCC)
## Hadoop
myHadoopCC <- RxHadoopMR(sshUsername = USNM, sshHostname = HSTNM,
                         sshSwitches = SWTCH)
rxSetComputeContext(myHadoopCC)

Verdict

I have explored a couple of ways to use Power BI visuals and the Power BI environment with RevoScaleR XDF (eXternal Data Frame) data files. I have to admit I was surprised that there is a way to do it relatively easily, but from a data scientist’s perspective there is still some additional load and work before you can start with the actual data analysis. The last two approaches (R script in visuals and SQL Server procedures) are by far the fastest and also take advantage of the parallel and distributed computations that the RevoScaleR package brings.

I would very strongly advise Microsoft and the Power BI development team to add an XDF plug-in to Power BI. The plug-in would work with a metadata presentation of the data; each time computations are needed, the metadata would push the code against R Server and have the results returned. This would surely be a great way to bring the Big Data concept to Power BI Desktop.

As always, code and samples are available at GitHub.

Happy coding!

RevoScaleR package dependencies with graph visualization

MRAN currently holds 7,520 R packages. We can see this with the following command (assuming you are using the MRAN R version):

library(tools)
df_ap <- data.frame(available.packages())
head(df_ap)

[Screenshot: RStudio output of head(df_ap)]

By importing the tools package, we get many useful functions for finding additional information on packages.

The function package.dependencies() parses and checks the dependencies of a package in the current environment. The function package_dependencies() (with an underscore, not a dot) will find all dependent and reverse-dependent packages.
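As a quick sketch of the second function (using RevoScaleR only as an example name; it must be present in the repository your session points to for the call to return anything):

```r
library(tools)

# database of packages, as returned by available.packages()
db <- available.packages()

# direct dependencies of a single package
package_dependencies("RevoScaleR", db = db, which = "Depends")

# reverse dependencies: packages that depend on it
package_dependencies("RevoScaleR", db = db, reverse = TRUE)
```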

With the following code I can extract the packages and their dependencies (this will perform a data normalization):

net <- data.frame(df_ap[,c(1,4)])
library(dplyr)
library(tidyr)   # unnest() comes from tidyr
netN <- net %>% 
        mutate(Depends = strsplit(as.character(Depends), ",")) %>% 
        unnest(Depends)
netN

And the result is:

Source: local data frame [14,820 x 2]

   Package       Depends
    (fctr)         (chr)
1       A3 R (>= 2.15.0)
2       A3        xtable
3       A3       pbapply
4   abbyyR  R (>= 3.2.0)
5      abc   R (>= 2.10)
6      abc      abc.data
7      abc          nnet
8      abc      quantreg
9      abc          MASS
10     abc        locfit
..     ...           ...

The result still needs to be further cleaned and prepared.
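A minimal cleanup sketch (assuming the netN data frame from the previous chunk): strip version qualifiers such as "R (>= 2.15.0)", trim whitespace, and drop the base-R pseudo-dependency.

```r
# remove version constraints like "(>= 2.15.0)" from the Depends column
netN$Depends <- gsub("\\(.*\\)", "", netN$Depends)
# trim leading/trailing whitespace left over after splitting on commas
netN$Depends <- trimws(netN$Depends)
# drop rows whose only dependency is base R itself
netN <- netN[netN$Depends != "R", ]
```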

Once the data is normalized, we can use any of the network packages for visualizing it. Using the igraph package, I created a visual presentation of the RevoScaleR package’s dependencies and imported packages.

With the following code I filter out the RevoScaleR package and create the visual:

library(igraph)
# edges: the cleaned two-column data frame (src = package, dst = dependency);
# deptree: the vector of RevoScaleR's dependencies (both prepared from netN above)
netN_g <- graph.data.frame(edges[edges$src %in% c('RevoScaleR', deptree), ])
plot(netN_g)

[Plot: RevoScaleR dependency graph]

 

Happy Ring!

 

 

RevoScaleR package for Microsoft R

The RevoScaleR package for the R language is a package for scalable, distributed and parallel computation, available with Microsoft R Server (and in-database R Services). It solves many of the limitations that the R language faces when run from a client machine. The RevoScaleR package addresses several of these issues:

  • memory-based data access model -> datasets can be bigger than the size of RAM
  • lack of parallel computation -> offers distributed and parallel computation
  • data movement -> no more need for data movement, thanks to the ability to set the computational context
  • duplication costs -> with the computational context set and different R versions (Open, Client or Server), data resides in one place, making maintenance cheaper; no duplication across different locations is needed
  • governance and provenance -> RevoScaleR offers oversight of both, with settings and additional services in R Server
  • hybrid topologies and agile development -> on-premises + cloud + client combinations allow hybrid environment development for faster time to production
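The "set computational context" point above can be sketched as follows for SQL Server (the server name and connection string are placeholders, not from the original post):

```r
library(RevoScaleR)

# placeholder connection string - adjust to your own environment
sqlConnString <- "Driver=SQL Server;Server=MYSERVER;Database=MyDB;Trusted_Connection=True"

# create an in-SQL Server compute context and activate it
sqlCC <- RxInSqlServer(connectionString = sqlConnString, wait = TRUE)
rxSetComputeContext(sqlCC)   # rx* computations now run in-database

# ... run rxSummary(), rxLinMod(), etc. here without moving the data ...

rxSetComputeContext("local") # switch back to the client
```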

 

Before continuing, make sure you have the RevoScaleR package installed in your R environment. First, let us check the package version:

RevoInfo <- packageVersion("RevoScaleR")
RevoInfo

to see the version of the RevoScaleR package. In this case it is:

[1] ‘9.0.1’

Now we will run a command to get the list of all functions:

revoScaleR_objects <- ls("package:RevoScaleR")
revoScaleR_objects

Here is the list:

[Screenshot: list of RevoScaleR functions in RStudio]

All RevoScaleR functions have the prefix rx or Rx, so it is much easier to distinguish them from functions available in other, similar packages, for example rxKmeans versus kmeans.

find("rxKmeans")
find("kmeans")

The results show the name of the package where each function is found:

> find("rxKmeans")
[1] "package:RevoScaleR"
> find("kmeans")
[1] "package:stats"

The output of the RevoScaleR objects shows some 200 computational functions, but I will focus on only a couple of them.

The RevoScaleR package and its computational functions were designed for parallel computation with no memory limitation, mainly because the package introduced its own file format, called XDF. The eXternal Data Frame was designed for fast processing of smaller chunks of data, and gains its efficiency when reading and writing XDF data by loading chunks into RAM one at a time, and only what is needed. This means there is no limitation imposed by the size of RAM, and computations run much faster (the algorithms are written in C++, which is faster than the original implementations written in an interpreted language). The data scientist still makes a single R call, but R uses the distributed computing component to determine how many cores, sockets and threads are available, and then launches a smaller portion of the load into each thread, analyzing the data a bit at a time. With XDF, data is retrieved many times, but since the file is 5-10 times smaller (as I have already shown in previous blog posts when compared to *.txt or *.csv files), and it is written and stored the same way it was extracted from memory, computations are faster: no parsing of data chunks is required, and the storage layout maximizes data retrieval speed.
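The chunked access pattern described above can be sketched like this (file names and the chunk size are illustrative, not from the original post):

```r
library(RevoScaleR)

# import a large text file in 200,000-row chunks; only one chunk
# is held in RAM at a time while the XDF file is written
rxImport(inData = "big_file.txt", outFile = "big_file.xdf",
         rowsPerRead = 200000, overwrite = TRUE)

# rx* functions then stream over the XDF chunk by chunk,
# so the whole dataset never has to fit in memory at once
rxSummary(~., data = "big_file.xdf")
```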

Preparing and storing or importing your data into XDF is an important part of achieving faster computation times. Download some sample data from the Revolution Analytics blog; I will be taking some AirOnTime data, a CSV file, from here.

The following functions will help import the file from CSV into the XDF format:

rxTextToXdf() – for importing data into the .xdf format from a delimited text file or CSV.

rxDataStepXdf() – for transforming and subsetting variables and/or rows for data exploration and analysis.

With the following code:

setwd("C:/Users/Documents/33")
rxTextToXdf(inFile = "airOT201201.csv", outFile = "airOT201201.xdf",  
            stringsAsFactors = T, rowsPerRead = 200000)

I have converted the CSV file into an XDF file in circa 13 seconds.

[Screenshot: RStudio timing output]

and the files look like this:

[Screenshot: folder listing with file sizes]

The original 105 MB CSV shrank to a 15 MB XDF, a file roughly 7 times smaller.

For further information on data handling, a very nice blog post is available here.
Quick information on the dataset can be obtained using:

rxGetInfo("airOT201201.xdf", getVarInfo = TRUE, numRows = 20)

[Screenshot: RStudio output of rxGetInfo]

but we can also use the following functions to explore and wrangle the data:

rxSummary(), rxCube(), rxCrossTabs() – summary statistics for columns, and computing correlations or cross-tabulations between columns
rxHistogram() – plot a histogram of a column (variable)
rxLinePlot() – plot a line or scatter plot from an XDF file or from an rxCube result

Running summary statistics for the column DAY_OF_WEEK:

rxSummary(~DAY_OF_WEEK, data="airOT201201.xdf")
#or for the whole dataset
rxSummary(~., data="airOT201201.xdf")
we see the execution time and the results of this statistic:
Rows Read: 200000, Total Rows Processed: 200000, Total Chunk Time: 0.007 seconds
Rows Read: 200000, Total Rows Processed: 400000, Total Chunk Time: 0.002 seconds
Rows Read: 86133, Total Rows Processed: 486133, Total Chunk Time: 0.002 seconds 
Computation time: 0.018 seconds.
Call:
rxSummary(formula = ~DAY_OF_WEEK, data = "airOT201201.xdf")

Summary Statistics Results for: ~DAY_OF_WEEK
Data: "airOT201201.xdf" (RxXdfData Data Source)
File name: airOT201201.xdf
Number of valid observations: 486133 
 
 Name        Mean     StdDev   Min Max ValidObs MissingObs
 DAY_OF_WEEK 3.852806 2.064557 1   7   486133   0
And now run rxHistogram for the selected column:
#histogram
rxHistogram(~DAY_OF_WEEK, data="airOT201201.xdf")

Rows Read: 200000, Total Rows Processed: 200000, Total Chunk Time: 0.007 seconds
Rows Read: 200000, Total Rows Processed: 400000, Total Chunk Time: 0.004 seconds
Rows Read: 86133, Total Rows Processed: 486133, Total Chunk Time: Less than .001 seconds 
Computation time: 0.019 seconds.
to get the histogram:

[Plot: histogram of DAY_OF_WEEK]

Some of the available prediction algorithms include (among many others):

rxLinMod() – linear regression model for XDF files
rxLogit() – logistic regression model for XDF files
rxDTree() – classification tree for XDF files
rxNaiveBayes() – Naive Bayes classifier for XDF files
rxGlm() – family of generalized linear models for XDF files
rxPredict() – computing predictions and residuals
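A minimal sketch of how a model and rxPredict fit together, using the airline XDF created earlier (the output file name is my own choice, not from the original post):

```r
library(RevoScaleR)

# fit a linear model on the airline XDF; variable names follow this post's dataset
lmDelay <- rxLinMod(DEP_DELAY_NEW ~ DAY_OF_WEEK + DISTANCE_GROUP,
                    data = "airOT201201.xdf")
summary(lmDelay)

# score the same data, writing predictions into a new XDF file
rxPredict(lmDelay, data = "airOT201201.xdf",
          outData = "airOT201201_pred.xdf", overwrite = TRUE)
```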
Let’s create a somewhat larger regression decision tree on our sample data, predicting departure delay from the day of the week, elapsed time and distance group.
Air_DTree <- rxDTree(DEP_DELAY_NEW ~ DAY_OF_WEEK + ACTUAL_ELAPSED_TIME +
 DISTANCE_GROUP,  maxDepth = 3, minBucket = 30000, data = "airOT201201.xdf")

Visualizing the tree data:

plotcp(rxAddInheritance(Air_DTree))
plot(rxAddInheritance(Air_DTree))
text(rxAddInheritance(Air_DTree))

[Plot: regression tree and cp plot]

or you can use the RevoTreeView package, which is even smarter:

library(RevoTreeView)
plot(createTreeView(Air_DTree))

we can visualize the tree:

[Screenshot: RevoTreeView interactive tree]

Of course, pruning and checking for over-fitting must also be done.
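A rough pruning sketch (the cp value below is a placeholder; in practice you would read it off the plotcp() output above):

```r
# prune the RevoScaleR tree at a chosen complexity parameter
Air_DTree_pruned <- prune.rxDTree(Air_DTree, cp = 0.01)

# re-plot the pruned tree with rpart-compatible inheritance
plot(rxAddInheritance(Air_DTree_pruned))
text(rxAddInheritance(Air_DTree_pruned))
```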

When comparing, for example, rxDTree to the original function, the performance is much better in RevoScaleR's favor. If you have the ability to use the RevoScaleR package for computations on larger datasets, or if your client machine might be the bottleneck, use this package. It will surely make your life easier.

 

Happy R-SQLing.

Performance comparison between kmeans and RevoScaleR rxKmeans

In my previous blog post, I focused on data manipulation tasks with the RevoScaleR package in comparison to other data manipulation packages, and the conclusions were obvious: RevoScaleR cannot (without the help of dplyrXdf) do piping (or chaining), storing temporary results takes time, and on top of that, data manipulation can be done more easily (more cleanly and faster) with the dplyr or data.table packages. Another conclusion was that you should do as much of the data manipulation as possible on your client, to minimize the amount of data sent to the computational environment.

In this post, I will do a simple performance comparison between the kmeans clustering function available in the default stats package and the RevoScaleR rxKmeans function for clustering.

Data will be loaded from WideWorldImportersDW.

library(RODBC)
library(RevoScaleR)
library(ggplot2)
library(dplyr)

myconn <-odbcDriverConnect("driver={SQL Server};Server=T-KASTRUN;
database=WideWorldImportersDW;trusted_connection=true")

cust.data <- sqlQuery(myconn, "SELECT 
                      fs.[Sale Key] AS SalesID
                      ,fs.[City Key] AS CityKey
                      ,c.[City] AS City
                      ,c.[State Province] AS StateProvince
                      ,c.[Sales Territory] AS SalesTerritory
                      ,fs.[Customer Key] AS CustomerKey
                      ,fs.[Stock Item Key] AS StockItem
                      ,fs.[Quantity] AS Quantity
                      ,fs.[Total Including Tax] AS Total
                      ,fs.[Profit] AS Profit
                      
                      
                      FROM [Fact].[Sale] AS  fs
                      JOIN dimension.city AS c
                      ON c.[City Key] = fs.[City Key]
                      WHERE
                      fs.[customer key] <> 0 ")

close(myconn)

 

In essence, I will be using the same dataset and comparing the same algorithm (Lloyd) with the same variables (columns) and all parameters kept identical. So RevoScaleR’s rxKmeans with the following columns:

rxSalesCluster <- rxKmeans(formula= ~SalesID + CityKey + CustomerKey + 
StockItem + Quantity, data =cust.data, numCluster=9,algorithm = "lloyd", 
outFile = "SalesCluster.xdf", outColName = "Cluster", overwrite = TRUE)

versus the kmeans example:

SalesCluster <- kmeans(cust.data[,c(1,2,6,7,8)], 9, nstart = 20, algorithm="Lloyd")

After running 50 iterations of both algorithms, I measured two metrics. The first was elapsed computation time, and the second was the ratio between the between-cluster sum of squares and the total sum of squares:

fit <- rxSalesCluster$betweenss/rxSalesCluster$totss

totss – the total sum of squares
tot.withinss – total within-cluster sum of squares, i.e. sum(withinss)
betweenss – the between-cluster sum of squares, i.e. totss - tot.withinss
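The measurement loop itself can be sketched roughly as follows (the loop structure, variable names and data frame layout are my assumptions, reconstructed from the summaries used later in the post):

```r
# collect elapsed time and fit ratio for each of 50 kmeans runs
result <- data.frame()
for (i in 1:50) {
  et <- system.time(
    fitK <- kmeans(cust.data[, c(1, 2, 6, 7, 8)], 9,
                   nstart = 20, algorithm = "Lloyd")
  )["elapsed"]
  result <- rbind(result,
                  data.frame(cl_met = "kmeans", et = et,
                             fit = fitK$betweenss / fitK$totss))
}
# an analogous loop with rxKmeans() fills in the "rxKmeans" rows
```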

Graph of the performance over 50 iterations with both clustering functions.

[Plot: elapsed time over 50 iterations for both clustering functions]

The performance results are quite obvious: the rxKmeans function from the RevoScaleR package outperforms stats::kmeans by almost 3 times, which on the given dataset (143K rows and 5 variables) is a substantial improvement, due to parallel computation.

So in terms of elapsed computation time:

result %>%
  group_by(cl_met) %>%
  summarize(
    average_score = mean(et)
    ,variance = var(et)
    ,st_dev = sd(et)
  )

[Screenshot: RStudio summary of elapsed times]

The kmeans average computation time is a little below 4 seconds, and the rxKmeans computation time is a little over 1 second. Please keep in mind that additional computation for the cluster fit is performed within this time, but the time difference remains the same. From the graph and from the variances/standard deviations, one can see that rxKmeans has higher deviations, which show up as spikes on the graph and are a result of parallelizing the data on file. Every 6th iteration there is an additional workload / worker dedicated to extra computation.

But the biggest concern is the results themselves. Over the 50 iterations I also stored the within- and between-cluster sum-of-squares calculations. And the results are stunning.

result %>% 
  group_by(cl_met) %>%
  summarize(
             average_score = mean(fit)
            ,variance = var(fit)
            ,st_dev = sd(fit)
           )

[Screenshot: RStudio summary of fit ratios]

The difference between the average fit calculations is so minor that it is not worth any special attention:

(mean(result[result$cl_met == 'kmeans',]$fit)*100 - mean(result[result$cl_met 
== 'rxKmeans',]$fit)*100)

is

0.04407133.

 

With rxKmeans you gain very good performance, and the results are equal to those of the default kmeans clustering function.

The code is available at GitHub.

Happy R-TSQLing!

Comparing performance on dplyr package, RevoScaleR package and T-SQL on simple data manipulation tasks

For a long time I have wanted to test simple data manipulation tasks and compare the execution time, ease of writing the code and simplicity between T-SQL and R packages for data manipulation.

A couple of packages I could mention for data manipulation are plyr, dplyr and data.table, comparing execution time, simplicity and ease of writing against general T-SQL code and the RevoScaleR package. For this blog post I will use the dplyr R package and T-SQL, with the possibilities of RevoScaleR computational functions.

My initial query will be the following, available in the WideWorldImportersDW database. No other alterations have been done to the underlying tables (fact.sale or dimension.city).

USE WideWorldImportersDW;
GO

SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory]
        ,fs.[Customer Key]
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN [Dimension].[City] AS c
        ON c.[City Key] = fs.[City Key];

For the comparison metrics I will be using simple time statistics (metric: elapsed time), and every piece of code will be run from SSMS.

-- Example
SET STATISTICS TIME ON
SELECT GETDATE()
SET STATISTICS TIME OFF


The first simple test is just printing the data.

PRINT ' '
PRINT 'STATISTICS WITH T-SQL'
PRINT ' '

SET STATISTICS TIME ON
SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory]
        ,fs.[Customer Key]
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]

SET STATISTICS TIME OFF

PRINT ' '
PRINT 'STATISTICS WITH R'
PRINT ' '

SET STATISTICS TIME ON
DECLARE @TSQL AS NVARCHAR(MAX)
SET @TSQL = N'SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory]
        ,fs.[Customer Key]
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]'

DECLARE @RScript AS NVARCHAR(MAX)
SET @RScript = N'OutputDataSet <- InputDataSet'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RScript
          ,@input_data_1 = @TSQL
          
WITH RESULT SETS ((
     [City Key]  INT
    ,[City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,[Customer Key]  INT
    ,[Invoice Date Key]  DATE
    ,[Quantity]  INT
    ,[Unit Price]  DECIMAL(18,3)
    ,[Tax Amount]  DECIMAL(18,3)
    ,[Total Including Tax]  DECIMAL(18,2)
    ,[Profit]  DECIMAL(18,2)
                 )); 
SET STATISTICS TIME OFF

This test is very simple: outputting the data. In the case of T-SQL it is just a plain SELECT statement, and in the case of R-TSQL it is just passing data to the R engine and returning it in a data frame. The test itself is trivial and “absurd”, but very important for establishing a baseline and time-related results.

[Screenshot: SSMS time statistics output]

Running the query a couple of times on my client gives me a base comparison: T-SQL outperforms R-TSQL and is almost two times faster. I would have expected a greater difference, because it involves a lot of data movement and no computation.

Filtering / WHERE

Now let’s perform a simple WHERE clause or some simple filtering.

PRINT ' '
PRINT 'STATISTICS WITH T-SQL'
PRINT ' '

-- SIMPLE T-SQL
SET STATISTICS TIME ON
SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory]
        ,fs.[Customer Key]
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]
    WHERE
        [Sales Territory] <> 'Southeast'
    AND fs.[Customer Key] <> 0
    AND Profit <= 1000
SET STATISTICS TIME OFF


PRINT ' '
PRINT 'STATISTICS WITH R'
PRINT ' '

-- R Package dplyr and T-SQL
SET STATISTICS TIME ON

DECLARE @TSQL AS NVARCHAR(MAX)
SET @TSQL = N'SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory] AS SalesTerritory
        ,fs.[Customer Key] AS CustomerKey
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]'

DECLARE @RScript AS NVARCHAR(MAX)
SET @RScript = N'
                library(dplyr)
                OutputDataSet  <- InputDataSet %>% 
           filter(SalesTerritory != "Southeast", CustomerKey != 0, Profit <= 1000)'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RScript
          ,@input_data_1 = @TSQL
          
WITH RESULT SETS ((
     [City Key]  INT
    ,[City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,[Customer Key]  INT
    ,[Invoice Date Key]  DATETIME
    ,[Quantity]  INT
    ,[Unit Price]  DECIMAL(18,3)
    ,[Tax Amount]  DECIMAL(18,3)
    ,[Total Including Tax]  DECIMAL(18,2)
    ,[Profit]  DECIMAL(18,2)
                 )); 

SET STATISTICS TIME OFF

[Screenshot: SSMS time statistics output]

Again, the results of both executions are the same, but the difference is getting bigger and bigger. The main reason is that all the data must be transferred to the R engine before being manipulated and the results returned.

Now I can perform a simple comparison between the dplyr package and the RevoScaleR package. In this manner we will have the same baseline for comparison.

PRINT ' '
PRINT 'STATISTICS WITH R dpylr'
PRINT ' '

SET STATISTICS TIME ON

DECLARE @TSQL AS NVARCHAR(MAX)
SET @TSQL = N'SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory] AS SalesTerritory
        ,fs.[Customer Key] AS CustomerKey
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]'

DECLARE @RScript AS NVARCHAR(MAX)
SET @RScript = N'
                library(dplyr)
                OutputDataSet  <- InputDataSet %>% 
filter(SalesTerritory != "Southeast", CustomerKey != 0, Profit <= 1000)'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RScript
          ,@input_data_1 = @TSQL
          
WITH RESULT SETS ((
     [City Key]  INT
    ,[City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,[Customer Key]  INT
    ,[Invoice Date Key]  DATETIME
    ,[Quantity]  INT
    ,[Unit Price]  DECIMAL(18,3)
    ,[Tax Amount]  DECIMAL(18,3)
    ,[Total Including Tax]  DECIMAL(18,2)
    ,[Profit]  DECIMAL(18,2)
                 )); 

SET STATISTICS TIME OFF

PRINT ' '
PRINT 'STATISTICS WITH R RevoScaleR'
PRINT ' '

SET STATISTICS TIME ON

DECLARE @TSQL1 AS NVARCHAR(MAX)
SET @TSQL1 = N'SELECT 
        fs.[City Key]
        ,c.City
        ,c.[Sales Territory] AS SalesTerritory
        ,fs.[Customer Key] AS CustomerKey
        ,fs.[Invoice Date Key]
        ,fs.[Quantity]
        ,fs.[Unit Price]
        ,fs.[Tax Amount]
        ,fs.[Total Including Tax]
        ,fs.Profit
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]'

DECLARE @RevoRScript AS NVARCHAR(MAX)
SET @RevoRScript = N'
                    library(RevoScaleR)
                    OutputDataSet <- rxXdfToDataFrame(rxDataStep(
inData = InputDataSet, outFile="Cust_data_filter.xdf", 
overwrite=TRUE, rowsPerRead = 100000, 
rowSelection =SalesTerritory != "Southeast" & CustomerKey != 0 & Profit <= 1000))'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RevoRScript
          ,@input_data_1 = @TSQL1
          
WITH RESULT SETS ((
     [City Key]  INT
    ,[City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,[Customer Key]  INT
    ,[Invoice Date Key]  DATETIME
    ,[Quantity]  INT
    ,[Unit Price]  DECIMAL(18,3)
    ,[Tax Amount]  DECIMAL(18,3)
    ,[Total Including Tax]  DECIMAL(18,2)
    ,[Profit]  DECIMAL(18,2)
                 )); 

SET STATISTICS TIME OFF

With printed output results.

[Screenshot: SSMS time statistics output]

We can see from the print-out that the dplyr package is in this case slightly ahead of the RevoScaleR package, but the difference is only 0.3 seconds in favour of dplyr. Note that, going through the code, I am creating the *.xdf file on the fly and still get almost the same timings.

Aggregating Data

Now let's test aggregating data for all three scenarios.

PRINT ' '
PRINT 'STATISTICS WITH T-SQL'
PRINT ' '

-- SIMPLE T-SQL
SET STATISTICS TIME ON
SELECT 
         c.City
        ,c.[Sales Territory]
        ,SUM(fs.[Total Including Tax]) AS Total_turnover
        ,MAX(fs.[Total Including Tax]) AS Max_turnover
        ,MIN(fs.[Total Including Tax]) AS Min_turnover
        --,(fs.[Total Including Tax]) AS Median_turnover
        ,VAR(fs.[Total Including Tax]) AS Var_turnover
        ,STDEV(fs.[Total Including Tax]) AS stdev_turnover
        ,AVG(fs.[Total Including Tax]/fs.Profit) AS KPI_avg_Turnover2profit
        
    FROM [Fact].[Sale] AS  fs
        JOIN dimension.city AS c
        ON c.[City Key] = fs.[City Key]
    WHERE
        [Sales Territory] <> 'Southeast'
    AND fs.[Customer Key] <> 0
    AND Profit <= 1000
    GROUP BY
         c.[Sales Territory]
        ,c.City

SET STATISTICS TIME OFF


PRINT ' '
PRINT 'STATISTICS WITH R dpylr'
PRINT ' '

SET STATISTICS TIME ON

-- Difference with T-SQL, I Have to pass all the values needed to filter out and 
-- aggregate data
DECLARE @TSQL1 AS NVARCHAR(MAX)
SET @TSQL1 = N'SELECT 
                     c.City
                    ,c.[Sales Territory] AS SalesTerritory
                    ,fs.[Customer Key] AS CustomerKey
                    ,fs.[Total Including Tax] AS TotalIncludingTax
                    ,fs.Profit
                FROM [Fact].[Sale] AS  fs
                    JOIN dimension.city AS c
                    ON c.[City Key] = fs.[City Key]'

DECLARE @RdplyrScript AS NVARCHAR(MAX)
SET @RdplyrScript = N'
                library(dplyr)
                OutputDataSet <- InputDataSet %>%
                    mutate(Turnover2profit = TotalIncludingTax/Profit) %>%
                    group_by(City, SalesTerritory) %>%
                    filter(SalesTerritory != "Southeast", CustomerKey != 0, Profit <= 1000) %>%
                    summarise(
                         Total_turnover = sum(TotalIncludingTax, na.rm=TRUE)
                        ,Max_turnover = max(TotalIncludingTax, na.rm=TRUE)
                        ,Min_turnover = min(TotalIncludingTax, na.rm=TRUE)
                        ,Median_turnover = median(TotalIncludingTax, na.rm=TRUE)
                        ,Var_turnover = var(TotalIncludingTax, na.rm=TRUE)
                        ,stdev_turnover = sd(TotalIncludingTax, na.rm=TRUE)
                        ,KPI_avg_Turnover2profit = mean(Turnover2profit, na.rm=TRUE)
                    )'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RdplyrScript
          ,@input_data_1 = @TSQL1
          
WITH RESULT SETS ((
     [City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,Total_turnover DECIMAL(18,3)
    ,Max_turnover DECIMAL(18,3)
    ,Min_turnover DECIMAL(18,3)
    ,Median_turnover DECIMAL(18,3)
    ,Var_turnover DECIMAL(18,3)
    ,stdev_turnover DECIMAL(18,3)
    ,KPI_avg_Turnover2profit DECIMAL(18,3)
    )); 

SET STATISTICS TIME OFF


PRINT ' '
PRINT 'STATISTICS WITH R RevoScaleR'
PRINT ' '

SET STATISTICS TIME ON

DECLARE @TSQL2 AS NVARCHAR(MAX)
SET @TSQL2 = N'SELECT 
                     c.City
                    ,c.[Sales Territory] AS SalesTerritory
                    ,fs.[Customer Key] AS CustomerKey
                    ,fs.[Total Including Tax] AS TotalIncludingTax
                    ,fs.Profit
                FROM [Fact].[Sale] AS  fs
                    JOIN dimension.city AS c
                    ON c.[City Key] = fs.[City Key]'

DECLARE @RevoRScript AS NVARCHAR(MAX)
SET @RevoRScript = N'
                    library(RevoScaleR)
                    sales_rx1 <- rxDataStep(inData = InputDataSet,
                        outFile="Cust_data_rx1.xdf", overwrite=TRUE, rowsPerRead = 100000,
                        rowSelection = SalesTerritory != "Southeast" &
                                       CustomerKey != 0 & Profit <= 1000)

                    sales_rx2 <- rxDataStep(sales_rx1, outFile="Cust_data_rx2.xdf",
                        transforms=list(
                            turnover2profit=TotalIncludingTax/Profit
                            ,City = City
                            ,SalesTerritory = SalesTerritory
                            ,TotalIncludingTax = TotalIncludingTax
                            ,Profit = Profit
                        ),
                        overwrite=TRUE, rowsPerRead = 100000)

                    sales_rx4 <- rxSummary(TotalIncludingTax~SalesTerritory:City,
                        data=sales_rx2,
                        summaryStats=c("Mean", "StdDev", "Min", "Max", "Sum"))

                    sales_rx4_1 <- sales_rx4$categorical[[1]][c("SalesTerritory",
                        "City", "Sum", "StdDev", "Min", "Max")]

                    sales_rx4P <- rxSummary(turnover2profit~SalesTerritory:City,
                        data=sales_rx2,
                        summaryStats=c("Mean"))

                    sales_rx4_2 <- sales_rx4P$categorical[[1]][c("SalesTerritory",
                        "City", "Means")]

                    ## Merge the two summaries; after the merge the column order is
                    ## SalesTerritory, City, Sum, StdDev, Min, Max, Means
                    sales_rx5 <- merge(sales_rx4_1, sales_rx4_2,
                        by=c("SalesTerritory","City"), all=TRUE)
                    names(sales_rx5)[3] <- "Total_turnover"
                    names(sales_rx5)[4] <- "stdev_turnover"
                    names(sales_rx5)[5] <- "Min_turnover"
                    names(sales_rx5)[6] <- "Max_turnover"
                    names(sales_rx5)[7] <- "KPI_avg_Turnover2profit"

                    OutputDataSet <- sales_rx5'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RevoRScript
          ,@input_data_1 = @TSQL2
          
WITH RESULT SETS ((
     [Sales Territory]  NVARCHAR(200)
    ,[City]  NVARCHAR(200)
    ,Total_turnover DECIMAL(18,3)
    ,stdev_turnover DECIMAL(18,3)
    ,Min_turnover DECIMAL(18,3)
    ,Max_turnover DECIMAL(18,3)
    ,KPI_avg_Turnover2profit DECIMAL(18,3)
    )); 

SET STATISTICS TIME OFF

 

The results are obvious: T-SQL is the fastest in this case, which is understandable. But comparing dplyr and RevoScaleR, RevoScaleR is faster at data manipulation and at returning the results into SSMS.

2016-10-10-11_27_20-testing-t-sql-and-r_dplyr-sql-sicn-kastrun-wideworldimportersdw-spar_si010179

My only complaint is the amount of code needed to produce simple summary statistics. Even though RevoScaleR is faster than dplyr (and plyr), the code is hard to read and write. Chaining / piping is not possible, meaning that you have to store the intermediate results as XDF files along the way.

2016-10-10-11_31_22-datatk

The RevoScaleR package did the job, but consider doing data manipulation in T-SQL or in a dedicated data manipulation package, unless you are fine with all the merging, renaming and outputting.

I deliberately included some extra statistics in the output, and you can see that the median is supported out of the box only in the dplyr package, whereas in T-SQL or RevoScaleR you have to write and compute it manually. I also miss a variance metric in RevoScaleR, and mode / modus in all three: T-SQL, dplyr and RevoScaleR.
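For illustration, the manual median computation in T-SQL can be done with PERCENTILE_CONT (available since SQL Server 2012); this is a sketch against the same Fact.Sale and dimension.city tables used above, not the exact query from my test:

```sql
-- Median turnover per city / territory; DISTINCT collapses the
-- per-row window result to one row per group
SELECT DISTINCT
     c.City
    ,c.[Sales Territory]
    ,PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY fs.[Total Including Tax])
        OVER (PARTITION BY c.City, c.[Sales Territory]) AS Median_turnover
FROM [Fact].[Sale] AS fs
    JOIN dimension.city AS c
    ON c.[City Key] = fs.[City Key];
```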

In addition, I also used dplyrXdf, a dplyr-compatible package for *.xdf data frames that enables chaining / piping with the RevoScaleR external data format. You can install the package in RTVS or RStudio using:

devtools::install_github("RevolutionAnalytics/dplyrXdf")
SET STATISTICS TIME ON

-- Unlike in T-SQL, I have to pass all the values needed to filter
-- and aggregate the data
DECLARE @TSQL2 AS NVARCHAR(MAX)
SET @TSQL2 = N'SELECT 
                     c.City
                    ,c.[Sales Territory] AS SalesTerritory
                    ,fs.[Customer Key] AS CustomerKey
                    ,fs.[Total Including Tax] AS TotalIncludingTax
                    ,fs.Profit
                FROM [Fact].[Sale] AS  fs
                    JOIN dimension.city AS c
                    ON c.[City Key] = fs.[City Key]'

DECLARE @RevoRScript AS NVARCHAR(MAX)
SET @RevoRScript = N'
                    library(RevoScaleR)
                    library(dplyr)
                    library(dplyrXdf)
                    custXdf <- rxDataFrameToXdf(InputDataSet, 
                               outFile="Cust_data_aggregate.xdf", overwrite=TRUE)
                    custagg <- custXdf %>%
                      mutate(Turnover2profit=TotalIncludingTax/Profit) %>%
                      group_by(City, SalesTerritory)   %>%
                      filter(SalesTerritory != "Southeast", CustomerKey != 0, 
                                 Profit <= 1000) %>%
                      summarise(
                        Total_turnover = sum(TotalIncludingTax, na.rm=TRUE)
                        ,Max_turnover = max(TotalIncludingTax, na.rm=TRUE)
                        ,Min_turnover = min(TotalIncludingTax, na.rm=TRUE)
                        ,Median_turnover = median(TotalIncludingTax, na.rm=TRUE)
                        ,Var_turnover = var(TotalIncludingTax, na.rm=TRUE)
                        ,stdev_turnover = sd(TotalIncludingTax, na.rm=TRUE)
                        ,KPI_avg_Turnover2profit = mean(Turnover2profit, na.rm=TRUE)
                                )
                OutputDataSet <- data.frame(custagg)'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RevoRScript
          ,@input_data_1 = @TSQL2
          
WITH RESULT SETS ((
     [City]  NVARCHAR(200)
    ,[Sales Territory]  NVARCHAR(200)
    ,Total_turnover DECIMAL(18,3)
    ,Max_turnover DECIMAL(18,3)
    ,Min_turnover DECIMAL(18,3)
    ,Median_turnover DECIMAL(18,3)
    ,Var_turnover DECIMAL(18,3)
    ,stdev_turnover DECIMAL(18,3)
    ,KPI_avg_Turnover2profit DECIMAL(18,3)
    )); 

SET STATISTICS TIME OFF

The problem with dplyrXdf is performance. The code is very easy to read, but the execution time is absurd: I ran it a couple of times and usually got around 1 minute of elapsed time, which is beyond my patience zone. I tried different compute contexts, and it had no effect.
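For completeness, this is how switching the compute context looks in RevoScaleR (RxLocalSeq and RxLocalParallel are the two local contexts; in my case neither changed the dplyrXdf timings):

```r
library(RevoScaleR)

# default: sequential local compute context
rxSetComputeContext(RxLocalSeq())

# switch to local parallel; rx* functions may then spread work
# across cores where the operation supports it
rxSetComputeContext(RxLocalParallel())

# check which context is currently active
rxGetOption("computeContext")
```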

My conclusion: if you can, do all the data manipulation in T-SQL before pushing the data into R code, because dplyr and data.table support neither parallel computation nor out-of-memory data, while RevoScaleR functions, on the other hand, were not meant for data manipulation at this scale, but are very good for summary statistics and overviews.

As always, code is available at GitHub.

Happy R-TSQLing.

 

Size of XDF files using RevoScaleR package

It came to my attention that the size of an XDF (external data frame) file can change drastically based on the compute context and environment. While testing the output of a dataset I was working on in SQL Server Management Studio, I was simultaneously testing the R code in RTVS or RStudio, and I noticed the file growing.

The following stored procedure will run a simple test:

CREATE PROCEDURE rxImport_Test ( @rowsPerRead INT )
AS
BEGIN
    DECLARE @RStat NVARCHAR(4000)
    SET @RStat = 'library(RevoScaleR)
                  #rxSetComputeContext("RxLocalSeq")
                  ptm <- proc.time()
                  inFile <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.csv")
                  filename <- "AirlineDemoSmall_'+CAST(@rowsPerRead AS VARCHAR(100))+'_TSQL_NC.xdf"
                  rxTextToXdf(inFile = inFile, outFile = filename,  stringsAsFactors = T, rowsPerRead = '+CAST(@rowsPerRead AS VARCHAR(100))+', overwrite=TRUE)
                  outFile <- file.path(rxGetOption("sampleDataDir"), filename)
                  rxImport(inData = inFile, outFile = outFile, overwrite=TRUE)
                  d <- proc.time() - ptm
                  filesize <- data.frame(file.size(filename))    
                  time     <- data.frame(d[3])
                  RowsPerRead <- data.frame('+CAST(@rowsPerRead AS VARCHAR(100))+')
                  filename_xdf <- data.frame(filename)
                  ran <- data.frame(Sys.time())
                  OutputDataSet <- cbind(as.character(filesize), time, RowsPerRead, filename_xdf, ran)';
    EXECUTE sp_execute_external_script
          @language = N'R'
         ,@script = @RStat
    WITH RESULT SETS (( 
                         Filesize NVARCHAR(100)
                        ,Time_df NVARCHAR(100)
                        ,RowsPerRead NVARCHAR(100)
                        ,filename_xdf NVARCHAR(100)
                        ,DateExecute NVARCHAR(100)
                        ))
END

But first, let's understand and test the compute context and the path to the data.

-- Make sure your path location points to the RevoScaleR sample data folder!
EXECUTE sp_execute_external_script
      @language = N'R'
     ,@script = N'library(RevoScaleR) 
                OutputDataSet <- data.frame(rxGetOption("sampleDataDir"))'
WITH RESULT SETS (( 
                    path_folder NVARCHAR(1000)
                    ))

-- check for ComputeContext
DECLARE @RStat NVARCHAR(4000)
SET @RStat = 'library(RevoScaleR)
             cc <- rxGetOption("computeContext")
             OutputDataSet <- data.frame(cc@description)';
EXECUTE sp_execute_external_script
      @language = N'R'
     ,@script = @RStat
WITH RESULT SETS ((compute_context NVARCHAR(100)))

On my computer, this looks like the following:

2016-09-22-14_33_40-revoscale_r_file_conversion-sql-sicn-kastrun-sqlr-spar_si01017988-59-mic

Now we will run the procedure

rxImport_Test

with different chunk sizes (this is what I will test) and observe the execution times.
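The INSERT statements below assume a rxImport_results table exists; a minimal definition mirroring the procedure's WITH RESULT SETS clause (the column types here are my assumption) could be:

```sql
-- Results table matching the five NVARCHAR(100) columns
-- returned by rxImport_Test
CREATE TABLE rxImport_results (
     Filesize      NVARCHAR(100)
    ,Time_df       NVARCHAR(100)
    ,RowsPerRead   NVARCHAR(100)
    ,filename_xdf  NVARCHAR(100)
    ,DateExecute   NVARCHAR(100)
);
```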

INSERT INTO rxImport_results
EXEC rxImport_Test @rowsPerRead = 2;
GO

INSERT INTO rxImport_results
EXEC rxImport_Test @rowsPerRead = 20;
GO

INSERT INTO rxImport_results
EXEC rxImport_Test @rowsPerRead = 200;
GO

INSERT INTO rxImport_results
EXEC rxImport_Test @rowsPerRead = 2000;
GO

Running the procedure with different chunk sizes yields interesting results:

2016-09-22 15_22_37-Revoscale_R_file_conversion.sql - SICN-KASTRUN.SQLR (SPAR_si01017988 (60))_ - Mi.png

Now, let’s see the summary information on this file / dataset.

 

2016-09-22-15_35_58-book1-excel

Considering that the original file holds 600.000 rows, the number of blocks changes for each of the files, and the size of the files grows accordingly.
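The number of blocks follows directly from the chunk size: it is roughly the row count divided by rowsPerRead. A quick sketch for the chunk sizes tested above:

```r
# expected number of blocks for each tested chunk size
# (600.000 rows, as in AirlineDemoSmall)
rows        <- 600000
rowsPerRead <- c(2, 20, 200, 2000)
data.frame(rowsPerRead, blocks = ceiling(rows / rowsPerRead))
```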

Retrieving information on the number of blocks:

DECLARE @RStat NVARCHAR(4000)
SET @RStat = 'library(RevoScaleR)    
              info <- rxGetInfoXdf(data="AirlineDemoSmall_20000000_TSQL_NC.xdf", getVarInfo = TRUE)    
              OutputDataSet <- data.frame(info$numBlocks)';

EXECUTE sp_execute_external_script
      @language = N'R'
     ,@script = @RStat
WITH RESULT SETS (( 
                    nof_blocks NVARCHAR(100)))

one can see the change between the files and where the optimal block size lies. In my test, 3 to 30 blocks gave the maximum performance when creating the XDF file, which for 600.000 rows means from 20.000 up to 200.000 rows per block. Beyond that, I haven't found a golden rule for the block size, but take caution, especially when dealing with larger files.

I ran the test a couple of times to check the consistency of the results, and they hold water. As for the file size: this is a property of the internal structure of the *.xdf file; its external size should not differ as the block size changes, but performance does!

Code is available at GitHub.

Happy R-sqling!