Size of XDF files using the RevoScaleR package

It came to my attention that the size of an XDF (external data frame) file can change drastically based on the compute context and environment. While testing the output of a dataset in SQL Server Management Studio, I was simultaneously testing R code in RTVS or RStudio, and I noticed the file growing.

The following stored procedure will run a simple test:

CREATE PROCEDURE rxImport_Test ( @rowsPerRead INT )
AS
BEGIN
    DECLARE @RStat NVARCHAR(4000)
    SET @RStat = 'library(RevoScaleR)
                  #rxSetComputeContext("RxLocalSeq")
                  ptm <- proc.time()
                  inFile <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.csv")
                  filename <- "AirlineDemoSmall_'+CAST(@rowsPerRead AS VARCHAR(100))+'_TSQL_NC.xdf"
                  rxTextToXdf(inFile = inFile, outFile = filename,  stringsAsFactors = T, rowsPerRead = '+CAST(@rowsPerRead AS VARCHAR(100))+', overwrite=TRUE)
                  outFile <- file.path(rxGetOption("sampleDataDir"), filename)
                  rxImport(inData = inFile, outFile = outFile, overwrite=TRUE)
                  d <- proc.time() - ptm
                  filesize <- data.frame(file.size(filename))    
                  time     <- data.frame(d[3])
                  RowsPerRead <- data.frame('+CAST(@rowsPerRead AS VARCHAR(100))+')
                  filename_xdf <- data.frame(filename)
                  ran <- data.frame(Sys.time())
                  OutputDataSet <- cbind(as.character(filesize), time, RowsPerRead, filename_xdf, ran)';
    EXECUTE sp_execute_external_script
          @language = N'R'
         ,@script = @RStat
    WITH RESULT SETS (( 
                         Filesize NVARCHAR(100)
                        ,Time_df NVARCHAR(100)
                        ,RowsPerRead NVARCHAR(100)
                        ,filename_xdf NVARCHAR(100)
                        ,DateExecute NVARCHAR(100)
                        ))
END

But let's first understand and test the compute context and the path to the data.

-- Make sure your path location is pointing to the RevoScaleR library folder!
EXECUTE sp_execute_external_script
      @language = N'R'
     ,@script = N'library(RevoScaleR) 
                OutputDataSet <- data.frame(rxGetOption("sampleDataDir"))'
WITH RESULT SETS (( 
                    path_folder NVARCHAR(1000)
                    ))

-- check for ComputeContext
DECLARE @RStat NVARCHAR(4000)
SET @RStat = 'library(RevoScaleR)
             cc <- rxGetOption("computeContext")
             OutputDataSet <- data.frame(cc@description)';
EXECUTE sp_execute_external_script
      @language = N'R'
     ,@script = @RStat
WITH RESULT SETS ((compute_context NVARCHAR(100)))

On my computer, this looks like the following:

[Screenshot: sample data directory path and compute context]

Now we will run the procedure rxImport_Test with different chunk sizes (this is what I will test) and observe the execution times.
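
The test results will be collected in a log table; a minimal sketch of rxImport_results, with columns assumed to match the procedure's RESULT SETS definition:

CREATE TABLE rxImport_results (
     Filesize      NVARCHAR(100)
    ,Time_df       NVARCHAR(100)
    ,RowsPerRead   NVARCHAR(100)
    ,filename_xdf  NVARCHAR(100)
    ,DateExecute   NVARCHAR(100)
);
GO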

INSERT INTO rxImport_results
EXEC rxImport_Test @rowsPerRead = 2;
GO

INSERT INTO rxImport_results
EXEC rxImport_Test @rowsPerRead = 20;
GO

INSERT INTO rxImport_results
EXEC rxImport_Test @rowsPerRead = 200;
GO

INSERT INTO rxImport_results
EXEC rxImport_Test @rowsPerRead = 2000;
GO

Running the procedure with different chunk sizes yields interesting results:

[Screenshot: rxImport_Test execution results for different rowsPerRead values]

Now, let’s see the summary information on this file / dataset.

 

[Screenshot: summary information on the generated XDF files]

Considering that the original file holds 600,000 rows, the number of blocks changes for each of the files, and the file size changes with it.

Retrieving the information on block count:

DECLARE @RStat NVARCHAR(4000)
SET @RStat = 'library(RevoScaleR)    
              info <- rxGetInfoXdf(data="AirlineDemoSmall_20000000_TSQL_NC.xdf", getVarInfo = TRUE)    
              OutputDataSet <- data.frame(info$numBlocks)';

EXECUTE sp_execute_external_script
      @language = N'R'
     ,@script = @RStat
WITH RESULT SETS (( 
                    nof_blocks NVARCHAR(100)))

one can see the change between the files and where the optimal block size lies. In my test, the best performance when creating the XDF file came with between 3 and 300 blocks, i.e., from 2,000 up to 200,000 rows per block for this 600,000-row file. Beyond that, I haven't found a golden rule for the block size, but do take caution, especially when dealing with larger files.
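
The block count itself is simple arithmetic: the row count divided by rowsPerRead, rounded up. A quick T-SQL sketch for the 600,000-row sample file (the chunk sizes below are illustrative):

-- Expected block counts for a 600,000-row file
SELECT
     rowsPerRead
    ,CEILING(600000.0 / rowsPerRead) AS expected_blocks
FROM (VALUES (2000), (20000), (200000)) AS chunk(rowsPerRead);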

I ran the test a couple of times to check the consistency of the results, and they hold water. As for the file size: this reflects the internal structure of the file; the size of the *.xdf file (as an external structure) should not differ as the block size changes, but performance does!

Code is available at GitHub.

Happy R-sqling!

Resource governor and external resource pool for Microsoft R Services

Setting an external resource pool for the execution of R commands with sp_execute_external_script has proven extremely useful, especially in cases where other workers are present and you don't want to overdo the data analysis and take too many resources from others (especially when running data analysis in a production environment), or when you know that your data analysis will require maximum CPU and memory for a specific period. In such cases, using and defining an external resource pool is not only useful but highly recommended.

Resource Governor is a feature that enables you to manage SQL Server workloads and set limits on system resource consumption. Limits can be configured for any workload in terms of CPU, memory, and I/O consumption. Where you have many different workloads on the same SQL Server, Resource Governor helps allocate the requested resources.

By default, two pools are defined – default and internal – for system resources, and only the default pool for external resources.
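
Both sets of pools can be inspected through the catalog views (a quick check; both views ship with SQL Server 2016):

-- Internal/default resource pools
SELECT * FROM sys.resource_governor_resource_pools;
-- External resource pools (used by R Services)
SELECT * FROM sys.resource_governor_external_resource_pools;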

[Screenshot: default resource pools]

Resource Governor has resource pools and workload groups, where a workload group is a group of similar loads under the same classification criteria. When defining an external resource pool, we define the limits on CPU and/or memory consumption for external resources. And R Services (using the external procedure sp_execute_external_script) falls under external resources. According to MSDN, the external resource pool for R Services will govern "rterm.exe, BxlServer.exe, and other processes spawned by them" (quoted: link).

First of all, you need to have resource governor enabled:

-- Enable Resource Governor
ALTER RESOURCE GOVERNOR RECONFIGURE;  
GO

Once you have this, let us set up a test query so we can measure how Resource Governor behaves.

I will be using the RevoScaleR sample data that comes with each R Services installation of SQL Server 2016. This sample data should be available at the following location:

C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\R_SERVICES\library\RevoScaleR\SampleData

[Screenshot: SampleData folder contents]

And I will be using AirlineDemoSmall.csv – a CSV dataset, roughly 14 MB in size. Please note that the path to SampleData might differ, based on your SQL Server 2016 installation folder.

In your database, create the table and import the data into your SQL Server table:

CREATE TABLE AirlineDemoSmall(
     ArrDelay varchar(100) NOT NULL
    ,CRSDepTime float NOT NULL
    ,[DayOfWeek] varchar(12) NOT NULL  
)
GO

-- this file should be at your location! so no need to download it
BULK INSERT AirlineDemoSmall
FROM 'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\R_SERVICES\library\RevoScaleR\SampleData\AirlineDemoSmall.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    FIRSTROW = 2 -- Skip header
)
*Please note: both statements (CREATE TABLE and BULK INSERT) are part of the RevoScaleR ImportAirlineDB.sql file; I am just reusing a section of this code. The whole file will be added to GitHub.

Once you have your data imported, we will run a simple linear regression using the RevoScaleR library.

EXECUTE  sp_execute_external_script
                 @language = N'R'
                ,@script = N'
            library(RevoScaleR)
            f <- formula(as.numeric(ArrDelay) ~ as.numeric(DayOfWeek) + CRSDepTime)
            s <- system.time(mod <- rxLinMod(formula = f, data = AirLine))
            OutputDataSet <-  data.frame(system_time = s[3]);'
                ,@input_data_1 = N'SELECT * FROM AirlineDemoSmall'
                ,@input_data_1_name = N'AirLine'
-- WITH RESULT SETS UNDEFINED
WITH RESULT SETS 
            ((
                 Elapsed_time FLOAT
            ));

 

So I am exporting only the R system.time function, and of it only the total elapsed time (hence the 3rd element of the vector returned by the system.time function). The usual representation of this function follows the format: user time, system time, and elapsed time.
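
If you want all three timings back in SQL Server rather than just the elapsed time, a small variation of the same call (a sketch) returns the whole vector:

EXECUTE sp_execute_external_script
      @language = N'R'
     ,@script = N'
        library(RevoScaleR)
        f <- formula(as.numeric(ArrDelay) ~ as.numeric(DayOfWeek) + CRSDepTime)
        s <- system.time(mod <- rxLinMod(formula = f, data = AirLine))
        # return user, system and elapsed time as one row
        OutputDataSet <- data.frame(user = s[1], system = s[2], elapsed = s[3]);'
     ,@input_data_1 = N'SELECT * FROM AirlineDemoSmall'
     ,@input_data_1_name = N'AirLine'
WITH RESULT SETS ((User_time FLOAT, System_time FLOAT, Elapsed_time FLOAT));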

[Screenshot: system.time output in RStudio]

And I will take only the elapsed time as the measure for comparison between resource pool enabled and disabled. Results from our control environment (before setting up the resource pool):

[Screenshot: elapsed time in the control environment]

and the computation time is 0.058 of a second, whereas the elapsed time – the value returned from our sp_execute_external_script – is 0.63 of a second.

[Screenshot: elapsed time returned by sp_execute_external_script]

Now we will create a new pool and limit the CPU and memory available for such workloads:

-- Default value
ALTER EXTERNAL RESOURCE POOL [default] 
WITH (AFFINITY CPU = AUTO)
GO

CREATE EXTERNAL RESOURCE POOL RService_Resource_Pool  
WITH (  
     MAX_CPU_PERCENT = 10  
    ,MAX_MEMORY_PERCENT = 5
);  

ALTER RESOURCE POOL [default] WITH (max_memory_percent = 60, max_cpu_percent=90);  
ALTER EXTERNAL RESOURCE POOL [default] WITH (max_memory_percent = 40, max_cpu_percent=10);  
ALTER RESOURCE GOVERNOR reconfigure;

 

A new pool – called RService_Resource_Pool – is created with the new values set.

[Screenshot: RService_Resource_Pool created]

And the obligatory reconfiguration:

ALTER RESOURCE GOVERNOR RECONFIGURE;  
GO

This configuration can also be found using this system catalog view:

-- Check configuration
SELECT * FROM sys.resource_governor_external_resource_pools

[Screenshot: external resource pool configuration]

Please note that only 20 workers are dedicated to R external processes.


With each run of RevoScaleR library functions, an increment will appear in the DMV statistics on external script execution. Note that only RevoScaleR functions are counted here; any other function (to my knowledge) is not counted (as of the SQL Server 2016 version).

SELECT * FROM sys.dm_external_script_execution_stats

[Screenshot: sys.dm_external_script_execution_stats output]
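
To see which counters have actually moved, you can filter and order the same DMV (a sketch; column names as documented for SQL Server 2016):

-- RevoScaleR function usage counters, most-used first
SELECT language, counter_name, counter_value
FROM sys.dm_external_script_execution_stats
WHERE counter_value > 0
ORDER BY counter_value DESC;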

Once you have this, we need to set up the classification function as well:

CREATE WORKLOAD GROUP R_workgroup WITH (importance = medium) USING "default", 
EXTERNAL "RService_Resource_Pool";  

ALTER RESOURCE GOVERNOR WITH (classifier_function = NULL);  
ALTER RESOURCE GOVERNOR reconfigure;  

USE master  
GO  
CREATE FUNCTION RG_Class_function()  
RETURNS sysname  
WITH schemabinding  
AS  
BEGIN  
    IF program_name() in ('Microsoft R Host', 'RStudio') RETURN 'R_workgroup';  
    RETURN 'default'  
    END;  
GO  

ALTER RESOURCE GOVERNOR WITH  (classifier_function = dbo.RG_Class_function);  
ALTER RESOURCE GOVERNOR reconfigure;  
GO

 

On MSDN you will find a post on setting up Resource Governor for R Services. Due to security reasons, "However, direct management of the Launchpad service by using Resource Governor is not supported." In practice, this means you should also create a login with one of the twenty R user names (MSSQLServer01 … MSSQLServer20) from the SQLRUserGroup group in order for Resource Governor to work.

So, quite some coding is needed at this point. I will be using the user MSSQLSERVER01 and create all the necessary things: logins, permissions, etc., setting up the MSSQLSERVER01 Windows user and using the EXECUTE AS LOGIN command prior to running sp_execute_external_script.
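
A minimal sketch of those steps (YOURSERVER stands in for your machine name, and RevoTestDB for your database; adjust names and permissions to your environment):

-- Create a login for the local R worker account and give it access to the data
CREATE LOGIN [YOURSERVER\MSSQLSERVER01] FROM WINDOWS;
GO
USE RevoTestDB;
GO
CREATE USER [YOURSERVER\MSSQLSERVER01] FOR LOGIN [YOURSERVER\MSSQLSERVER01];
GRANT SELECT ON dbo.AirlineDemoSmall TO [YOURSERVER\MSSQLSERVER01];
GRANT EXECUTE ANY EXTERNAL SCRIPT TO [YOURSERVER\MSSQLSERVER01];
GO

-- Impersonate the worker before running sp_execute_external_script
EXECUTE AS LOGIN = 'YOURSERVER\MSSQLSERVER01';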

Running the command:

SELECT * FROM sys.resource_governor_workload_groups;

you will now see all the workload groups.

And now, finally, we can rerun the same query:

-- We will run the same query

EXECUTE  sp_execute_external_script
                 @language = N'R'
                ,@script = N'
            library(RevoScaleR)
            f <- formula(as.numeric(ArrDelay) ~ as.numeric(DayOfWeek) + CRSDepTime)
            s <- system.time(mod <- rxLinMod(formula = f, data = AirLine))
            OutputDataSet <-  data.frame(system_time = s[3]);'
                ,@input_data_1 = N'SELECT * FROM AirlineDemoSmall'
                ,@input_data_1_name = N'AirLine'
-- WITH RESULT SETS UNDEFINED
WITH RESULT SETS 
            ((
                 Elapsed_time FLOAT
            ));

 

And CPU consumption will stay within the limits:

[Screenshot: Task Manager showing limited CPU consumption]

and the elapsed time will be higher due to less CPU being available.

[Screenshot: elapsed time with the external resource pool limits in place]

Happy R-Tsqling! 🙂

Code is available at GitHub.

 

NOTE!

CPU consumption will rise after the execution of the external script due to the worker MSSQLSERVER01 cleaning up the session files and log files!

[Screenshot: MSSQLSERVER01 worker activity after execution]

Since the log and session files are part of the Launchpad service, Resource Governor will not be able to limit their CPU, memory, or I/O consumption.

 

Handling required and missing R packages in Microsoft R Services

I have seen it several times: the execution of R code with the procedure sp_execute_external_script fails due to a missing library or missing library dependencies.

The problem is – in general – not solved out of the box, but it can be solved by using and maintaining a list of the libraries installed for Microsoft R Services, or by simply creating "pre-code" R checks to do it for you.

In both cases, the user ends up with additional code, but it can be a good safeguard if you are installing a library in production and run such a check prior to running any relevant R code.
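
For the first approach, the list of libraries installed for R Services can be pulled straight into SQL Server and maintained in a reference table; a minimal sketch:

-- Return the names of all R packages visible to R Services
EXECUTE sp_execute_external_script
      @language = N'R'
     ,@script = N'OutputDataSet <- data.frame(Package = rownames(installed.packages()))'
WITH RESULT SETS ((PackageName NVARCHAR(200)));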

Let’s start with simple R code:

USE WideWorldImporters;
GO

EXECUTE sp_execute_external_script    
           @language = N'R'
          ,@script=N'library(Hmisc) 
                    df <- data.frame(rcorr(as.matrix(sp_RStats_query), 
                            type="pearson")$P)
                    OutputDataSet<-df'
          ,@input_data_1 = N'SELECT 
                     SupplierID
                    ,UnitPackageID
                    ,OuterPackageID
                FROM [Warehouse].[StockItems]'
          ,@input_data_1_name = N'sp_RStats_query'
    WITH RESULT SETS ((SupplierID NVARCHAR(200)
                    ,UnitPackageID NVARCHAR(200)
                    ,OuterPackageID NVARCHAR(200)));

 

In my case, this code will return an error message that R Services is missing a specific library required to execute the R code – in my case, the library Hmisc.

[Screenshot: error message for the missing library Hmisc]

So the first step to solve this issue is to declare a variable for the R script and parametrize the @script parameter of the procedure sp_execute_external_script.

DECLARE @OutScript NVARCHAR(MAX) 
SET @OutScript =N'library(Hmisc) 
                    df <- data.frame(rcorr(as.matrix(sp_RStats_query), 
                     type="pearson")$P)
                    OutputDataSet<-df'

EXECUTE sp_execute_external_script    
           @language = N'R'
          ,@script= @OutScript
          ,@input_data_1 = N'SELECT 
                     SupplierID
                    ,UnitPackageID
                    ,OuterPackageID
                FROM [Warehouse].[StockItems]'
          ,@input_data_1_name = N'sp_RStats_query'
    WITH RESULT SETS ((
                         SupplierID    NVARCHAR(200)
                        ,UnitPackageID NVARCHAR(200)
                        ,OuterPackageID NVARCHAR(200)
                    ));

Now we need to do a string search for the following patterns:

  • library(c(package1, package2))
  • library(package1), library(package2)
  • library(package1)

All these reserved R words denote the need to install a particular library or libraries.

DECLARE @OutScript NVARCHAR(MAX) 
SET @OutScript =N'library(Hmisc)
                  library(test123) 
                    df <- data.frame(rcorr(as.matrix(sp_RStats_query), 
                      type="pearson")$P)
                    OutputDataSet<-df'

/*
***************************************************************
START: CODE FOR CHECKING AND INSTALLING MISSING R LIBRARIES
***************************************************************
*/


DECLARE @Tally TABLE (num TINYINT,R_Code NVARCHAR(MAX))
INSERT INTO @Tally VALUES (1,@OutScript)
DECLARE @libstatement NVARCHAR(MAX)
DECLARE @cmdstatement NVARCHAR(MAX)

;WITH CTE_R(num,R_Code, libname)
AS
(
SELECT
      1 AS num,
      RIGHT(R_Code, LEN(R_Code) - CHARINDEX(')', R_Code, 0)) AS  R_Code, 
      substring(R_Code, CHARINDEX('library(', R_Code, 0) + 0, CHARINDEX(')',
 R_Code, 0) - CHARINDEX('library(', R_Code, 0) + 1) AS libname
FROM @Tally
WHERE  
        CHARINDEX('(', R_Code, 0) > 0 
    AND CHARINDEX('library(',R_Code,0) > 0

UNION ALL

SELECT
     1 AS num,
     RIGHT(R_Code, LEN(R_Code) - CHARINDEX(')', R_Code, 0)) AS  R_Code,
     substring(R_Code, CHARINDEX('library(', R_Code, 0) + 0, CHARINDEX(')', 
R_Code, 0) - CHARINDEX('library(', R_Code, 0) + 1) AS libname
FROM CTE_R
WHERE 
    CHARINDEX('(', R_Code, 0) > 0 
AND CHARINDEX('library(',R_Code,0) > 0

)
, fin AS
(
SELECT TOP 1 stuff((SELECT ' install.packages(''''' + 
REPLACE(REPLACE(REPLACE(c1.libname,'library',''),')',''),'(','') + '''''
           , dependencies = T)'
              FROM CTE_R AS c1 
              WHERE 
                    c1.num = c2.num
              FOR XML PATH (''),TYPE).value('.','NVARCHAR(MAX)'),1,1,'') AS lib_stat
              FROM CTE_R AS c2
)
SELECT 
        @libstatement = lib_stat 
FROM fin

SET @cmdstatement = 'EXEC xp_cmdshell ''"C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\R_SERVICES\bin\R.EXE" cmd -e ' + @libstatement + ''''
EXEC SP_EXECUTESQL @cmdstatement

/*
***************************************************************
END: CODE FOR CHECKING AND INSTALLING MISSING R LIBRARIES
***************************************************************
*/


EXECUTE sp_execute_external_script    
           @language = N'R'
          ,@script= @OutScript
          ,@input_data_1 = N'SELECT 
                     SupplierID
                    ,UnitPackageID
                    ,OuterPackageID
                FROM [Warehouse].[StockItems]'
          ,@input_data_1_name = N'sp_RStats_query'
    WITH RESULT SETS ((
                         SupplierID    NVARCHAR(200)
                        ,UnitPackageID NVARCHAR(200)
                        ,OuterPackageID NVARCHAR(200)
                    ));

 

The result in this case will be successful, with correct R results, and sp_execute_external_script will not return an error for missing libraries.

I added a "fake" library called test123 for testing purposes, to see whether all the libraries get installed successfully.

At the end, the script generated this xp_cmdshell command (in one line):

EXEC xp_cmdshell '"C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\R_SERVICES\bin\R.EXE" cmd -e install.packages(''Hmisc'') install.packages(''test123'')'

You might also experience the following error in the output of the xp_cmdshell command:

[Screenshot: error in the xp_cmdshell output]

In this case, go to the following location and enable write permissions for this folder and its subfolders.

[Screenshot: library folder requiring write permissions]

In the end, you can always find a more elegant and easier way: write R code that itself checks the installation (as well as the version and dependencies of the library), with the following R code:

if (!is.element("Hmisc", installed.packages())) {
  install.packages("Hmisc", dependencies = TRUE)
}
library("Hmisc")

So the original code can simply be changed to:

USE WideWorldImporters;
GO

EXECUTE sp_execute_external_script    
           @language = N'R'
          ,@script=N'if (!is.element("Hmisc", installed.packages())) {
                        install.packages("Hmisc", dependencies = TRUE)
                     }
                     library("Hmisc")
                     df <- data.frame(rcorr(as.matrix(sp_RStats_query), 
                            type="pearson")$P)
                     OutputDataSet<-df'
          ,@input_data_1 = N'SELECT 
                     SupplierID
                    ,UnitPackageID
                    ,OuterPackageID
                FROM [Warehouse].[StockItems]'
          ,@input_data_1_name = N'sp_RStats_query'
    WITH RESULT SETS ((SupplierID NVARCHAR(200)
                    ,UnitPackageID NVARCHAR(200)
                    ,OuterPackageID NVARCHAR(200)));

But unfortunately, one cannot always count on developers, data scientists, or the authors of the code to consistently add such a simple check for library installation.

Code is available at GitHub.

Happy R-TSQLing!

Installing R packages in SQL Server using only T-SQL

Installing R packages in SSMS using T-SQL can take some time, especially when you need to switch between R Tools for Visual Studio, RStudio, or your favorite R GUI tool (or R CMD). So far, installing any additional library using sp_execute_external_script is just not working:

--InstallPackage using sp_execute_external_script
EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script=N'install.packages("ggplot2")'
WITH RESULT SETS (( ResultSet VARCHAR(50)));

-- using Download.file command
EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script=N'download.file("https://cran.r-project.org/bin/windows/contrib/3.4/ggplot2_2.1.0.zip","ggplot2")
                 install.packages("ggplot2", repos = NULL, type = "source")'
WITH RESULT SETS (( ResultSet VARCHAR(50)));

Both executions yield error messages. The first one especially would be a great way of installing packages; at this point, I could not find a way of simply passing this command through to install the package.

Julie Koesmarno made a great post on installing R packages; please follow this post. Microsoft also suggests a way to install R packages on MSDN.

Since I wanted to be able to install packages directly from SQL Server Management Studio (SSMS), here is yet another way to do it. I have used xp_cmdshell to install any additional package for my R Services installation (optionally, you can set EXECUTE AS USER).

USE WideWorldImporters;
GO
-- enable xp_cmdshell (an advanced option, so 'show advanced options' must be on first)
EXECUTE SP_CONFIGURE 'show advanced options', 1;
RECONFIGURE;
GO

EXECUTE SP_CONFIGURE 'xp_cmdshell', 1;
GO

RECONFIGURE;
GO

EXEC xp_cmdshell '"C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\R_SERVICES\bin\R.EXE" cmd -e install.packages(''tree'')';
GO

In this case I have installed the package tree by calling Microsoft R Services with the -e switch and passing straight R code. Please double-check the path to your R engine.

Results are:

[Screenshot: output of the tree package installation]
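
To double-check from T-SQL that the package is now visible to R Services, a quick verification (a sketch):

-- Returns 1 when the tree package is installed
EXECUTE sp_execute_external_script
      @language = N'R'
     ,@script = N'OutputDataSet <- data.frame(installed = is.element("tree", installed.packages()))'
WITH RESULT SETS ((tree_installed BIT));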

It might not be the proposed way, but when combining T-SQL and R code it is just a nice approach with few annoyances. I would still suggest the way Microsoft proposed, or the one Julie described on her blog, but if you feel confident, you can use this method.

Code is available at GitHub.

Happy R-SQLing!

Data science show at SQL Server Radio

SQL Server Radio with Matan Yungman (info | twitter | company) and Guy Glantser (info | twitter | company) hosted me for show 52 of the SQL Server Radio series. The main topics of discussion were data science, statistics, who a data scientist is, and what SQL Server offers with R integration, SSRS, Power BI, Excel, and other tools.


45 minutes of talk and laughter with Matan and Guy. Both are very funny and highly professional, and our discussion was full of interesting topics, questions, anecdotes, and use cases from experience.

Listen to the podcast of show 52 of SQL Server Radio and hear what we were laughing about and discussing.

To recap what we were nibbling and chatting about (well, what I was chatting about – and I am still laughing when hearing the show, hehe):

  • What data science is and who a data scientist is
  • Problems with outsourcing data scientists
  • What tools are commonly used in data science
  • What Microsoft is offering with SQL Server 2016
  • Purpose and goals of data science
  • Predictions, models and analytics
  • R language and R integration
  • Use cases from the field

 

Enjoy listening, and thank you, Matan and Guy, for this show. I had a great time!

First thoughts on R integration in SQL Server 2016

This blog post is a very personal confession. It might be delicate, but it should be understood as a proposal or a plan for improvement.


Before I start, let me give some background and context. I have been working with statistics and databases for quite some time now and have used different tools for statistical purposes: starting with SPSS, Lisrel, and Pajek, continuing with SPSS Modeler, then moving to Weka and R together, making very heavy use of SQL Server SSAS, working with Orange, and in the past 2 years working with SAS and SAP HANA Analytics. With all the products I have used – and I am not here to rank them or compare which one is better or which one I prefer – some things remain the same. Getting to know your data in detail is the most important one, and applying statistical knowledge to it is the result.

So I have had the opportunity to work with different datasets, for different companies, on different projects, and with different tools. And looking back, I am very happy to have done it. So these thoughts are not me complaining about any of the products or solutions, but merely looking at improvements based on my years of experience.

Working with R in SQL Server has been my personal thing for a couple of years. The story goes back 4 years, when I started working on building a web-based engine for creating marketing campaigns based on input parameters and desired goals. Once the business/marketing department decided on their new campaign, they used a web-based (in-house built) program to select product groups, a customer segment, and some minor additional information (time frame, validity of campaign, legal stuff, ...), clicked OK, and that was it. But in the background a stored procedure generated long T-SQL that (1) pushed data into the SQL Server engine and (2) selected the proper model that fed the data to the R engine. So going back to the year 2012, I built a framework for communication between SQL Server and the R engine. Most of the framework is available at the SQLServerCentral web page. The framework had some security flaws and issues (calling an external procedure, storing data in external files) and mostly performance issues (scaling out wasn't possible, and R was solely dependent on the RAM of the server), but overall the framework worked and the project was put into production.

Back then, I wrote my framework so that it would be as easy as possible to use, storing and handling the R code and T-SQL code in a SQL Server database. Most of the problems I had were that SSMS couldn't debug or validate R code, and to some extent, converting R results back to a table also proved not to be as easy as one might think. But overall, I was still in the middle – between the platform and putting the solution into production – mainly to control the input data and the results (data output in the form of recommendations or campaign lists).

After working with R integration in SQL Server 2016 for the past couple of months, I have to admit I was happy to see the security issues solved, and the performance issues solved with the RevoScaleR library. Both are very welcome features. Come to think of it, Microsoft solved the two main issues. Looking at it from the perspective of building my own framework, the security issues really and badly needed solving. The same logic applies to the workload and working with large datasets; performance has proven really great. In addition, keep in mind that with R integration Microsoft is – rather than keeping them separated – bringing different roles of people closer together, so DBAs, analysts, and developers can now all be part of one data process. Integration with Azure is also a very positive thing; think of it as an extension to your flavor of the landscape. Support for different databases is also something that will be very helpful in the future – though admittedly, I haven't tested it yet.

What I expect Microsoft to do for R integration in the future:

  • support R in Analysis Services
  • better support for R in Integration Services
  • stronger R integration with DQS and MDM
  • native integration of R into Transact-SQL (calling an SP every time for every kind of statistics might be tiring)
  • some DMVs for an R profiler, R debugging, and R performance monitoring
  • support for HTML files and JS/jQuery scripts in SSRS or Power BI
  • improved usage of R exports with FileTable
  • open server endpoint objects for R in order to use the full extent of R
  • improved Resource Governor support for R in Standard edition
  • questions on auditing and profiling of R code from T-SQL, and
  • security (a couple of questions as well)

I have most probably forgotten a couple of points for improvement. But that was the more technical side; now let's look at it from the user experience point of view:

  • think about support for an R debugger / R code validation in SSMS
  • in fact, bring R into SSMS (!) instead of VS, or create RTMS (R Tools for SQL Server Management Studio)
  • do not limit R exports to data frames or images only, but also support vectors, lists, etc.

Of course, not to get ahead of myself too much, I forgot to ask myself what the actual purpose of the integration was.

[Image: SQL Server 2016 R Services slide]

Is it really only about doing super predictive analytics, or is it about making a statistical tool as well? If the latter, then I would strongly suggest revisiting my open suggestions.

So, for the end, I will leave you with a hint 🙂 I am preparing and building a bunch of scripts (from my previous experience and from the framework that I have built) to help data scientists and developers use R much more easily. But (!) – a mathematical and statistical background is still very much something I would mark as highly recommended when digging into R.

Follow the blog, where I will keep you updated on the scripts and where the code will be available.

 

Happy R-SQLing 🙂