Running multiple correlations with R and T-SQL

Getting to know the data is always an interesting part of data science. With R integrated into SQL Server, the exploration part is still part of the game.

The usual way to get some statistics out of a dataset is to run frequencies, descriptive statistics and, not least, correlations.

Running correlations against a set of variables in pure T-SQL might be a bit of a drag, whereas using R code with sp_execute_external_script is as easy as the following:

USE WideWorldImporters;
GO

-- Query that feeds the numeric columns to R as the data frame "Stock"
DECLARE @sql NVARCHAR(MAX);
SET @sql = N'SELECT 
                      SupplierID
                    , UnitPackageID
                    , OuterPackageID
                    , LeadTimeDays
                    , QuantityPerOuter
                    , TaxRate
                    , UnitPrice
                    , RecommendedRetailPrice
                    , TypicalWeightPerUnit
                FROM [Warehouse].[StockItems]';

-- Pearson correlation matrix, handed back to SQL Server through OutputDataSet
DECLARE @Rscript NVARCHAR(MAX);
SET @Rscript = N'df <- data.frame(cor(Stock, use="complete.obs", method="pearson"))
                OutputDataSet <- df';

EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script = @Rscript
      ,@input_data_1 = @sql
      ,@input_data_1_name = N'Stock'
WITH RESULT SETS ((
                     SupplierID NVARCHAR(100)
                    ,UnitPackageID NVARCHAR(100)
                    ,OuterPackageID NVARCHAR(100)
                    ,LeadTimeDays NVARCHAR(100)
                    ,QuantityPerOuter NVARCHAR(100)
                    ,TaxRate NVARCHAR(100)
                    ,UnitPrice NVARCHAR(100)
                    ,RecommendedRetailPrice NVARCHAR(100)
                    ,TypicalWeightPerUnit NVARCHAR(100)
                 ));

I am using WideWorldImporters (available on GitHub or CodePlex), the new demo database from Microsoft that was released just this month, at the beginning of June 2016.

When this query runs, R returns a data frame of correlations that T-SQL is able to interpret, and SSMS outputs the results in the following format. Very cool.

[Screenshot: SSMS result grid for SQLQuery1.sql on WideWorldImporters]

The output looks very similar to the one you would get in, for example, SPSS:

[Screenshot: IBM SPSS Statistics Viewer output]

The numbers match (!) and the layout is much the same; very clear and easily readable. One thing is missing: SPSS delivers statistical significance (p-value), whereas R's cor function only delivers the value of the Pearson correlation coefficient. To get the p-values, we need to run an additional T-SQL / R procedure.

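-- Run as its own batch; the variables are declared again here
DECLARE @sql NVARCHAR(MAX);
SET @sql = N'SELECT 
                      SupplierID
                    , UnitPackageID
                    , OuterPackageID
                    , LeadTimeDays
                    , QuantityPerOuter
                    , TaxRate
                    , UnitPrice
                    , RecommendedRetailPrice
                    , TypicalWeightPerUnit
                FROM [Warehouse].[StockItems]';

DECLARE @Rscript NVARCHAR(MAX);
SET @Rscript = N'
                library(Hmisc)
                df <- data.frame(rcorr(as.matrix(Stock), type="pearson")$P)
                OutputDataSet <- df';

EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script = @Rscript
      ,@input_data_1 = @sql
      ,@input_data_1_name = N'Stock'
WITH RESULT SETS ((
                     SupplierID DECIMAL(10,5)
                    ,UnitPackageID DECIMAL(10,5)
                    ,OuterPackageID DECIMAL(10,5)
                    ,LeadTimeDays DECIMAL(10,5)
                    ,QuantityPerOuter DECIMAL(10,5)
                    ,TaxRate DECIMAL(10,5)
                    ,UnitPrice DECIMAL(10,5)
                    ,RecommendedRetailPrice DECIMAL(10,5)
                    ,TypicalWeightPerUnit DECIMAL(10,5)
                 ));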

So now we have the statistical significance of our correlation matrix. I used the rcorr function from the Hmisc library, which has to be installed in the R library that SQL Server uses.

[Screenshot: SSMS result grid with the p-value matrix]

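The rcorr function has very few options to set, so results may differ from those of other functions run with their defaults. For example, rcorr deletes missing values pairwise, while cor with use="complete.obs" drops every row that has any missing value. You can also use the cor.test function: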

# cor.test has no "use" argument; it drops incomplete pairs on its own
data.frame(p_value = cor.test(df$my_var1, df$my_var2, 
method="pearson")$p.value, var1 = "my_var1", var2 = "my_var2")

but since this function cannot deal with a matrix / data frame, you would need a loop that goes through every combination of variables and stores the results, together with the variable names, in a data frame. The rcorr function will do the trick, for now.
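The loop over variable pairs described above could look roughly like this in plain R. It is a minimal sketch, not the author's script: the helper name pairwise_cor_test is hypothetical, and the built-in mtcars data set stands in for the Stock data frame.

```r
# Hypothetical helper (not part of base R or Hmisc): runs cor.test on
# every unordered pair of columns and collects the results in one data frame.
pairwise_cor_test <- function(df, method = "pearson") {
  # All 2-column combinations, each as a character vector of length 2
  pairs <- combn(names(df), 2, simplify = FALSE)
  rows <- lapply(pairs, function(p) {
    ct <- cor.test(df[[p[1]]], df[[p[2]]], method = method)
    data.frame(
      var1    = p[1],
      var2    = p[2],
      r       = unname(ct$estimate),  # correlation coefficient
      p_value = ct$p.value,           # significance of the coefficient
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Assumption: mtcars is only a stand-in for the Stock input data
result <- pairwise_cor_test(mtcars[, c("mpg", "hp", "wt")])
print(result)
```

Inside sp_execute_external_script, the same function could be applied to Stock and the result assigned to OutputDataSet, with a matching WITH RESULT SETS definition of four columns.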

The final step would be (hint) to combine both sp_execute_external_script calls into one stored procedure, store both results from R, combine the coefficients with their significance levels, and export a single table with all the information needed. This is already prepared as part of my R scripts.
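One possible way to combine coefficients and significance levels into a single table, sketched in plain R. This is not the script the author refers to: the helper flatten_corr is a hypothetical name, mtcars stands in for the Stock data, and the p-value matrix is built with cor.test only so that the example runs without Hmisc. With Hmisc loaded, the r and P elements of the list returned by rcorr could be passed in instead.

```r
# Hypothetical helper: turns a coefficient matrix and a p-value matrix of the
# same shape into one long table with a row per variable pair.
flatten_corr <- function(r, p) {
  ut <- upper.tri(r)  # keep each pair once, skip the diagonal
  data.frame(
    var1    = rownames(r)[row(r)[ut]],
    var2    = colnames(r)[col(r)[ut]],
    r       = r[ut],
    p_value = p[ut],
    stringsAsFactors = FALSE
  )
}

# Assumption: mtcars is only a stand-in for the Stock input data
x <- mtcars[, c("mpg", "hp", "wt")]
r <- cor(x, use = "complete.obs", method = "pearson")

# p-value matrix with NA on the diagonal, the same layout as rcorr(...)$P
p <- matrix(NA_real_, ncol(x), ncol(x), dimnames = dimnames(r))
for (i in seq_len(ncol(x))) {
  for (j in seq_len(ncol(x))) {
    if (i != j) p[i, j] <- cor.test(x[[i]], x[[j]])$p.value
  }
}

combined <- flatten_corr(r, p)
print(combined)
```

The long format keeps coefficient and p-value side by side, so a single WITH RESULT SETS definition (two text columns and two numeric columns) would be enough to return everything from one call.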

Happy R-SQLing!
