Microsoft R articles on sqlservercentral.com

I am preparing a series of articles relating  Microsoft R Server and R related topics.The first article has been just published, couple are already scheduled to be published and more are coming.

Title line up is:

  1. Using Microsoft R in enterprise environment
  2. General introduction to Microsoft R Service in SQL Server 2016
  3. Installing R Packages for SQL Server R Services
  4. Using SQL Server and R services for analyzing sales data
  5. Using SQL Server and R services on customer data and predictions
  6. Analyze system and event logs with SQL Server using R
  7. Custom logs monitoring with SQL Server R Services
  8. Creating data visualization using SQL Server tools with SQL Server R Services

Some of the titles can still be a subject of changes.

Happy reading!

 

Advertisements

Resource governor and external resource pool for Microsoft R Services

Setting external resource pool for execution of R commands using sp_execute_external_script has proven extremely useful, especially in cases where you have other workers present, when you don’t want to overdo on data analysis and get to much resources from others (especially when running data analysis in production environment) or when you know that your data analysis will require maximum CPU and memory for a specific period. In such cases using and defining external resource pool is not only useful but highly recommended.

Resource governor is a feature that enables you to manage SQL Server workload and system resource consumption their limits. Limits can be configures for any workload in terms of CPU, Memory and I/O consumption. Where you have many different workloads on the same SQL Server, resource Governor helps allocate requested resources.

By default, two pools are defined – default and internal for system resources, and only default for system external resources.

2016-08-18 22_19_51-SQLQuery4.sql - SICN-KASTRUN.RevoTestDB (SPAR_si01017988 (54))_ - Microsoft SQL

Resource governor has resource pools, workload groups that are groups of similar load within same classification criteria. When defining external resource pool we will be defining the limitations for CPU and/or memory consumption for external resources. And R Services (using external procedure sp_execute_external_script) falls under external resources. According to MSDN, with external resource pool for R services will govern “rterm.exe, BxlServer.exe, and other processes spawned by them (quoted: link)”

First of all, you need to have resource governor enabled:

-- Enable Resource Governor
ALTER RESOURCE GOVERNOR RECONFIGURE;  
GO

Once you have this, let us set the query for test so we can measure and see how the resource governor will behave.

I will be using RevoScaleR sample data, that come with each R Service installation of SQL Server 2016. This sample data should be available on following location:

C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\R_SERVICES\library\RevoScaleR\SampleData

2016-08-18 22_41_08-SampleData

And I will be using AirlineDemoSmall.csv  – CSV dataset, roughly 14MB size. Please note, that path to the SampleData might differ, based on your SQLServer 2016 installation folder.

In your database create and import data to your SQL Server table:

CREATE TABLE AirlineDemoSmall(
     ArrDelay varchar(100) NOT NULL
    ,CRSDepTime float NOT NULL
    ,[DayOfWeek] varchar(12) NOT NULL  
)
GO

-- this file should be at your location! so no need to download it
BULK INSERT AirlineDemoSmall
FROM 'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\
R_SERVICES\library\RevoScaleR\SampleData\AirlineDemoSmall.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    FIRSTROW = 2 -- Skip header
)
*Please note; both statements (CREATE TABLE and BULK INSERT) are part of RevoScaleR ImportAirlineDB.sql file. I am just reusing a section of this code. Whole file will be added to Github.

Once you have your data imported, we will run a simple linear regression using RevoScale library.

EXECUTE  sp_execute_external_script
                 @language = N'R'
                ,@script = N'
            library(RevoScaleR)
            f <- formula(as.numeric(ArrDelay) ~ as.numeric(DayOfWeek) + CRSDepTime)
            s <- system.time(mod <- rxLinMod(formula = f, data = AirLine))
            OutputDataSet <-  data.frame(system_time = s[3]);'
                ,@input_data_1 = N'SELECT * FROM AirlineDemoSmall'
                ,@input_data_1_name = N'AirLine'
-- WITH RESULT SETS UNDEFINED
WITH RESULT SETS 
            ((
                 Elapsed_time FLOAT
            ));

 

So I am only exporting R system time function and only Total elapsed time (hence 3 attribute from vector result of system.time function). Usual representation of this function is following format: user time, system time and elapsed time.

2016-08-18 22_51_34-RStudio

And will only take elapsed time as a measure and comparison between resource pool enabled and disabled. Results from our control environment (before setting the resource pool):

2016-08-18 22_53_45-SQLQuery4.sql - SICN-KASTRUN.RevoTestDB (SPAR_si01017988 (54))_ - Microsoft SQL

and computation time is 0.058 of a second where as elapsed time  – value returned from our sp_execute_external_script – is 0,63 of a second.

2016-08-18 22_58_47-SQLQuery4.sql - SICN-KASTRUN.RevoTestDB (SPAR_si01017988 (54))_ - Microsoft SQL

Now we will create a new pool and limit the CPU and memory available for such workloads:

-- Default value
ALTER EXTERNAL RESOURCE POOL [default] 
WITH (AFFINITY CPU = AUTO)
GO

CREATE EXTERNAL RESOURCE POOL RService_Resource_Pool  
WITH (  
     MAX_CPU_PERCENT = 10  
    ,MAX_MEMORY_PERCENT = 5
);  

ALTER RESOURCE POOL [default] WITH (max_memory_percent = 60, max_cpu_percent=90);  
ALTER EXTERNAL RESOURCE POOL [default] WITH (max_memory_percent = 40, max_cpu_percent=10);  
ALTER RESOURCE GOVERNOR reconfigure;

 

New pool – called: RService_Resource_Pool is created  with new values set.

2016-08-18 23_09_23-Resource_governor_Microsoft_R_services.sql - SICN-KASTRUN.RevoTestDB (SPAR_si010

And obligatory reconfiguration:

ALTER RESOURCE GOVERNOR RECONFIGURE;  
GO

This configuration can also be found using this system configuration table:

-- Check configuration
SELECT * FROM sys.resource_governor_external_resource_pools

2016-08-18 23_36_54-Resource_governor_Microsoft_R_services.sql - SICN-KASTRUN.RevoTestDB (SPAR_si010

Please note that only 20 workers are dedicated to be used for R external processes.

Capture

With each run of RevoScaleR library functions an increment will appear for DMV statistics on external script execution. Also, only RevoScaleR functions are counted here. Any other function (to my knowledge) is not counted here (as of SQL Server 2016 version).

SELECT * FROM sys.dm_external_script_execution_stats

2016-08-18 23_34_01-Resource_governor_Microsoft_R_services.sql - SICN-KASTRUN.RevoTestDB (SPAR_si010

Once you have this, we need to set the Classification function as well:

CREATE WORKLOAD GROUP R_workgroup WITH (importance = medium) USING "default", 
EXTERNAL "RService_Resource_Pool";  

ALTER RESOURCE GOVERNOR WITH (classifier_function = NULL);  
ALTER RESOURCE GOVERNOR reconfigure;  

USE master  
GO  
CREATE FUNCTION RG_Class_function()  
RETURNS sysname  
WITH schemabinding  
AS  
BEGIN  
    IF program_name() in ('Microsoft R Host', 'RStudio') RETURN 'R_workgroup';  
    RETURN 'default'  
    END;  
GO  

ALTER RESOURCE GOVERNOR WITH  (classifier_function = dbo.RG_Class_function);  
ALTER RESOURCE GOVERNOR reconfigure;  
go

 

On MSDN you will find a post on setting up resource governor for R Services and due to security reasons, “However, direct management of the Launchpad service by using Resource Governor is not supported.” So, this practically means that you can also create a login with one of twenty R User Names (MSSQLServer01 … MSSQLServer20) in group of SQLRUserGroup in order for Resource Governor to work.

So, much coding needed at this point. I will be using User MSSQLSERVER01 and create all the necessary things: logins, permissions, etc.After setting up MSSQLSERVER01 windows user and using EXECUTE AS LOGIN command prior to running sp_execute_external_script.

Running the command:

SELECT * FROM sys.resource_governor_workload_groups;

you will now  see all the workload groups.

And now, finally we can rerun the same procedure:

-- We will run same query

EXECUTE  sp_execute_external_script
                 @language = N'R'
                ,@script = N'
   library(RevoScaleR)
   f <- formula(as.numeric(ArrDelay) ~ as.numeric(DayOfWeek) + CRSDepTime)
    s <- system.time(mod <- rxLinMod(formula = f, data = AirLine))
                            OutputDataSet <-  data.frame(system_time = s[3]);'
                ,@input_data_1 = N'SELECT * FROM AirlineDemoSmall'
                ,@input_data_1_name = N'AirLine'
-- WITH RESULT SETS UNDEFINED
WITH RESULT SETS 
            ((
                 Elapsed_time FLOAT
            ));

 

And CPU consumption will stay within the limits:

2016-08-19 00_40_57-Task Manager

and the elapsed time will be higher due to lower CPU available.

2016-08-19 00_41_54-Resource_governor_Microsoft_R_services.sql - SICN-KASTRUN.RevoTestDB (SPAR_si010

Happy R-Tsqling! 🙂

Code Is available at Github.

 

NOTE!

CPU consumption will rise after the execution of external execution command due to the worker MSSQLSERVER01 cleaning the session files and log files!

2016-08-19 00_44_26-MSSQLSERVER01

Since log and session files are part of Launchpad service, Resource Governor will not be able to limit the CPU, Memory of I/O consumption.

 

Handling required and missing R packages in Microsoft R Services

I have seen several times, that execution of the R code with procedure sp_execute_external_script was not valid due to missing library or library dependencies.

Problem is – in general – not solved out of the box. But can be solved using and maintaining a list of installed libraries used by Microsoft R services or by simply create a “pre-code” R code to do a check for you.

In both cases, user will end up with additional code, but it might be a good check if you are installing library in production and you run such check prior to running any relevant R code.

Let’s start with simple R code:

USE WideWorldImporters;
GO

EXECUTE sp_execute_external_script    
           @language = N'R'
          ,@script=N'library(Hmisc) 
                    df <- data.frame(rcorr(as.matrix(sp_RStats_query), 
                            type="pearson")$P)
                    OutputDataSet<-df'
          ,@input_data_1 = N'SELECT 
                     SupplierID
                    ,UnitPackageID
                    ,OuterPackageID
                FROM [Warehouse].[StockItems]'
          ,@input_data_1_name = N'sp_RStats_query'
    WITH RESULT SETS ((SupplierID NVARCHAR(200)
                    ,UnitPackageID NVARCHAR(200)
                    ,OuterPackageID NVARCHAR(200)));

 

This code will in my case return error message, that R Service is missing a specific library in order to execute R code – in my case library Hmisc.

Error_installed_library_01.Wide

So the first step to solve this issue is to declare variable for R script and parametrize  @script parameter for procedure sp_execute_external_script.

DECLARE @OutScript NVARCHAR(MAX) 
SET @OutScript =N'library(Hmisc) 
                    df <- data.frame(rcorr(as.matrix(sp_RStats_query), 
                     type="pearson")$P)
                    OutputDataSet<-df'

EXECUTE sp_execute_external_script    
           @language = N'R'
          ,@script= @OutScript
          ,@input_data_1 = N'SELECT 
                     SupplierID
                    ,UnitPackageID
                    ,OuterPackageID
                FROM [Warehouse].[StockItems]'
          ,@input_data_1_name = N'sp_RStats_query'
    WITH RESULT SETS ((
                         SupplierID    NVARCHAR(200)
                        ,UnitPackageID NVARCHAR(200)
                        ,OuterPackageID NVARCHAR(200)
                    ));

Now we need to do a string search for following patterns:

  • library (c(package1,package2))
  • library (package1),  library(package2)
  • library(package1)

All these reserved R words denote the need for particular library/libraries installation.

DECLARE @OutScript NVARCHAR(MAX) 
SET @OutScript =N'library(Hmisc)
                  library(test123) 
                    df <- data.frame(rcorr(as.matrix(sp_RStats_query), 
                      type="pearson")$P)
                    OutputDataSet<-df'

/*
***************************************************************
START: CODE FOR CHECKING AND INSTALLING MISSING R LIBRARIES
***************************************************************
*/


DECLARE @Tally TABLE (num TINYINT,R_Code NVARCHAR(MAX))
INSERT INTO @Tally VALUES (1,@OutScript)
DECLARE @libstatement NVARCHAR(MAX)
DECLARE @cmdstatement NVARCHAR(MAX)

;WITH CTE_R(num,R_Code, libname)
AS
(
SELECT
      1 AS num,
      RIGHT(R_Code, LEN(R_Code) - CHARINDEX(')', R_Code, 0)) AS  R_Code, 
      substring(R_Code, CHARINDEX('library(', R_Code, 0) + 0, CHARINDEX(')',
 R_Code, 0) - CHARINDEX('library(', R_Code, 0) + 1) AS libname
FROM @Tally
WHERE  
        CHARINDEX('(', R_Code, 0) > 0 
    AND CHARINDEX('library(',R_Code,0) > 0

UNION ALL

SELECT
     1 AS num,
     RIGHT(R_Code, LEN(R_Code) - CHARINDEX(')', R_Code, 0)) AS  R_Code,
     substring(R_Code, CHARINDEX('library(', R_Code, 0) + 0, CHARINDEX(')', 
R_Code, 0) - CHARINDEX('library(', R_Code, 0) + 1) AS libname
FROM CTE_R
WHERE 
    CHARINDEX('(', R_Code, 0) > 0 
AND CHARINDEX('library(',R_Code,0) > 0

)
, fin AS
(
SELECT TOP 1 stuff((SELECT ' install.packages(''''' + 
REPLACE(REPLACE(REPLACE(c1.libname,'library',''),')',''),'(','') + '''''
           , dependencies = T)'
              FROM CTE_R AS c1 
              WHERE 
                    c1.num = c2.num
              FOR XML PATH (''),TYPE).value('.','NVARCHAR(MAX)'),1,1,'') AS lib_stat
              FROM CTE_R AS c2
)
SELECT 
        @libstatement = lib_stat 
FROM fin

SET @cmdstatement = 'EXEC xp_cmdshell ''"C:\Program Files\Microsoft SQL Server\
MSSQL13.MSSQLSERVER\R_SERVICES\bin\R.EXE" cmd -e ' + @libstatement + ''''
EXEC SP_EXECUTESQL @cmdstatement

/*
***************************************************************
END: CODE FOR CHECKING AND INSTALLING MISSING R LIBRARIES
***************************************************************
*/


EXECUTE sp_execute_external_script    
           @language = N'R'
          ,@script= @OutScript
          ,@input_data_1 = N'SELECT 
                     SupplierID
                    ,UnitPackageID
                    ,OuterPackageID
                FROM [Warehouse].[StockItems]'
          ,@input_data_1_name = N'sp_RStats_query'
    WITH RESULT SETS ((
                         SupplierID    NVARCHAR(200)
                        ,UnitPackageID NVARCHAR(200)
                        ,OuterPackageID NVARCHAR(200)
                    ));

 

Result in this case will be successful with correct R results and sp_execute_external_script will not return error for missing libraries.

I added a “fake” library called test123 for testing purposes if all the libraries will be installed successfully.

At the end the script generated xp_cmdshell command (in one line):

EXEC xp_cmdshell '"C:\Program Files\Microsoft SQL Server\
MSSQL13.MSSQLSERVER\R_SERVICES\bin\R.EXE" cmd 
-e install.packages(''Hmisc'') install.packages(''test123'')'

You might also experience the following error in the output of xp_cmdshell command:

Error_installed_library_02.Wide

In this case, go to the following location and enable write permission for this folder and subfolders.

Error_installed_library_03.Wide

For the end, you will always find more elegant and easy way to write R code that will inadvertently check the installation (as well as version and dependencies for library) with following R code:

if(!is.element("Hmisc", installed.packages()))
{install.packages("Hmisc", dependencies = T)
}else{library("Hmisc")}

So the original code can simply be changed to:

USE WideWorldImporters;
GO

EXECUTE sp_execute_external_script    
           @language = N'R'
          ,@script=N'if(!is.element("Hmisc", installed.packages()))
                      {install.packages("Hmisc", dependencies = T)
                        }else{library("Hmisc")}
                    df <- data.frame(rcorr(as.matrix(sp_RStats_query), 
 type="pearson")$P)
                    OutputDataSet<-df'
          ,@input_data_1 = N'SELECT 
                     SupplierID
                    ,UnitPackageID
                    ,OuterPackageID
                FROM [Warehouse].[StockItems]'
          ,@input_data_1_name = N'sp_RStats_query'
    WITH RESULT SETS ((SupplierID NVARCHAR(200)
                    ,UnitPackageID NVARCHAR(200)
                    ,OuterPackageID NVARCHAR(200)));

But unfortunately, one can not always count on the consistence of developer or data scientists or author of the code, that they will always add a simple check for library installation.

Code is available at GitHub.

Happy R-TSQLing!

Installing R packages in SQL Server using only T-SQL

Installing R packages in SSMS using T-SQL can take some time, especially when you need to switch between R Tools for Visual Studio or R Studio or your favorite R GUI tool (or R CMD). But so far, installing any additional library using sp_execute_external_script is just not working.

--InstallPackage using sp_execute_external_script
EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script=N'install.packages("ggplot2")'
WITH RESULT SETS (( ResultSet VARCHAR(50)));

-- using Download.file command
EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script=N'download.file("https://cran.r-project.org/bin/windows/contrib/
3.4/ggplot2_2.1.0.zip","ggplot2")
                 install.packages("ggplot2", repos = NULL, type = "source")'
WITH RESULT SETS (( ResultSet VARCHAR(50)));

Both executions yield in error messages. Especially the first one would be great way of installation. At this point, I would not find a way of just passing this command to install the package.

Julie Koesmarno made a great post on installing R packages. Please follow this post. Also Microsoft suggests the following way to install R packages on MSDN.

Since I wanted to be able to have packages installed directly from SQL Server Management Studio (SSMS) here is yet another way to do it. I have used  xp_cmdshell to install any additional package for my R (optionally you can set EXECUTE AS USER).

USE WideWorldImporters;
GO
-- enable xp_cmdshell
EXECUTE SP_CONFIGURE 'xp_cmdshell','1';
GO

RECONFIGURE;
GO

EXEC xp_cmdshell '"C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\
R_SERVICES\bin\R.EXE" cmd -e install.packages(''tree'')';   GO

in this case I have installed package tree by calling Microsoft R Services with -e switch and adding straight R code. Please double check path to your R Engine.

Results are:

Capture.PNG

It might not be the proposed way, but when completing T-SQL and R code it is just a nice way with little annoyances.  I would still suggest the way Microsoft proposed or Julie described on her blog, but if you feel confident, you can use this method.

Code is available at Github.

Happy R-SQLing!

Using parameter and multiparameters with sp_execute_external_script

With RTM version of SQL Server 2016, sp_execute_external_script stored procedure has undergone couple of changes prior to it’s final outlook. The parametrization of this external procedure somehow resembles a typical extended stored procedure.

Indeed, sp_execute_external_script is an extended stored procedure written using CLR (whereas stored procedures are natively written in T-SQL) and their main purpose it that they run external commands that a normal T-SQL stored procedure could not handle.

Those who are (have been) working with any kind of external stored procedure or stored procedure using

AS { EXTERNAL NAME assembly_name.class_name.method_name }

you will be familiar with the sp_execute_external_script notation.

  EXECUTE sys.sp_execute_external_script
          @language = 
         ,@script = 
         ,@input_data_1 = 
         ,@input_data_1_name =
         ,@output_data_1_name =
         ,@parallel =
         ,@params = 
         ,@parameter1 =

 

Parameters @params and @parameter1 are interesting, but what might be a bit puzzling are numbers at the end of the names of @input_data_1, @input_data_1_name,… They have  no technical meaning (as far as I have found out) since they don’t enumerate anything and if you by common sense create @input_data_2 parameter, you will get an error in return. In a way this error would have been expected, since joining two SQL Statements into one R dataset would just be nonsense. It is more likely that numbers just denote data columns or data parameters can be enumerated within string value of a particular input parameter and that you need at least one of the items if you are using this parameter.

So parameters with enumerator number in the names, these parameters can hold more values and both parameters @params and @parameter1 are paired:

@params is  list of input parameter declarations  and

@parameter1 is list of values for the input parameters

just like for @input_data_1 and @input_data_1_name parameters.

Simple example would be getting Chi-Square value and statistical significance in one run out of R:

USE WideWorldImporters;
GO

DECLARE @F_Value VARCHAR(1000)
DECLARE @Signif VARCHAR(1000)


  EXECUTE sys.sp_execute_external_script
          @language = N'R'
         ,@script = N'mytable <- table(WWI_OrdersPerCustomer$CustomerID, WWI_OrdersPerCustomer$Nof_Orders) 
                     data.frame(margin.table(mytable, 2))
                     Ch <- unlist(chisq.test(mytable))
                     F_Val <- as.character(Ch[1])
                     Sig <- as.character(Ch[3])'
         ,@input_data_1 = N'select TOP 10 CustomerID, count(*) as Nof_Orders 
from [Sales].[Orders] GROUP BY CustomerID'
         ,@input_data_1_name = N'WWI_OrdersPerCustomer'
         ,@params = N' @F_Val VARCHAR(1000) OUTPUT, @Sig VARCHAR(1000) OUTPUT'
         ,@F_Val = @F_Value OUTPUT
         ,@Sig = @Signif OUTPUT


SELECT 
       @F_Value AS CHI_Value
      ,@Signif AS CHI_Square_SIGNIFICANCE;
GO

 

With @param and @parameter1 I was able to get two separate values from a list of a statistical test (against some sample data) in one run. Of course, the result of unlist function can be added to data.frame and easier parsed but what if I wanted to have data displayed as a frequencies and also test of statistical significance, I can simply do:

 

USE WideWorldImporters;
GO

DECLARE @F_Value VARCHAR(1000)
DECLARE @Signif VARCHAR(1000)


  EXECUTE sys.sp_execute_external_script
          @language = N'R'
         ,@script = N'mytable <- table(WWI_OrdersPerCustomer$CustomerID, 
WWI_OrdersPerCustomer$Nof_Orders) 
                     data.frame(margin.table(mytable, 2))
                     Ch <- unlist(chisq.test(mytable))
                     F_Val <- as.character(Ch[1])
                     Sig <- as.character(Ch[3])
                     OutputDataSet<-data.frame(margin.table(mytable, 2))'
         ,@input_data_1 = N'select TOP 10 CustomerID, count(*) as Nof_Orders 
from [Sales].[Orders] GROUP BY CustomerID'
         ,@input_data_1_name = N'WWI_OrdersPerCustomer'
         ,@params = N' @F_Val VARCHAR(1000) OUTPUT, @Sig VARCHAR(1000) OUTPUT'
         ,@F_Val = @F_Value OUTPUT 
         ,@Sig = @Signif OUTPUT
 WITH RESULT SETS(
                  (Cust_data INT
                  ,Freq INT)
                  )

SELECT @F_Value AS CHI_Value
    ,@Signif AS CHI_Square_SIGNIFICANCE

 

As you can see, there is result set clauses added and R script has 3 outputs defined; 1 for the data.frame output and 2 variables through parameters for statistical significance; as shown on print-screen:

Capture

Such export of the results is always very useful. In Reporting Services, in Power BI or simply in SSMS when running the resulsts.

Code available at Github.

Happy R-SQLing!