SQL Saturday statistics – Web Scraping with R and SQL Server

I wanted to answer a simple question: how many times has a particular topic been presented at SQL Saturdays, and by how many different presenters?

Sounds interesting, and tackling it should not be a problem; just keep in mind that the final numbers may vary, since some text analysis is involved.

First of all, some web scraping to get the information from the SQLSaturday web page. With R/Python integration in SQL Server, reading the information from the website is a fairly straightforward task:

EXEC sp_execute_external_script
 @language = N'R'
 ,@script = N'
 library(rvest)
 library(XML)
 library(dplyr)

#URL to schedule
 url_schedule <- ''http://www.sqlsaturday.com/687/Sessions/Schedule.aspx''

#Read HTML
 webpage <- read_html(url_schedule)

# Event schedule
 schedule_info <- html_nodes(webpage, ''.session-schedule-cell-info'') # OK

# Extracting HTML content
 ht <- html_text(schedule_info)

df <- data.frame(data=ht)

#create empty DF
 df_res <- data.frame(title=c(), speaker=c())

for (i in 1:nrow(df)){
 #print(df[i])
 if (i %% 2 != 0) #odd flow
 print(paste0("title is: ", df$data[i]))
 if (i %% 2 == 0) #even flow
 print(paste0("speaker is: ", df$data[i]))
 df_res <- rbind(df_res, data.frame(title=df$data[i], speaker=df$data[i+1]))
 }
df_res_new = df_res[seq(1, nrow(df_res), 2), ]
OutputDataSet <- df_res_new'

Python offers the BeautifulSoup library, which will do pretty much the same (or even better) job as the rvest and XML packages combined. Nevertheless, once we have the data from a test page (in this case I am reading the Slovenian SQLSaturday 2017 schedule, simply because it is awesome), we can “walk through” the whole web page and extract all the needed information.

The SQLSaturday website has every event enumerated, making it very easy to parametrize the web scraping process:

[Screenshot: SQLSaturday #687 Slovenia 2017, Sessions Schedule page]

So we will scrape through the last 100 or so events by simply incrementing the event number; the input parameter will be expanded into URLs such as:

http://www.sqlsaturday.com/600/Sessions/Schedule.aspx

http://www.sqlsaturday.com/601/Sessions/Schedule.aspx

http://www.sqlsaturday.com/602/Sessions/Schedule.aspx

and so on, regardless of whether the page for a given event works or not. Results will be returned to the SQL Server database.
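As a side note, a minimal R sketch of how such a list of URLs can be generated (the event range here is just an example):

# build candidate schedule URLs for a range of event numbers (range chosen for illustration)
event_ids <- 600:690
urls <- sprintf("http://www.sqlsaturday.com/%d/Sessions/Schedule.aspx", event_ids)
head(urls, 3)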

Creating a stored procedure will do the job:

USE SqlSaturday;
GO

CREATE OR ALTER PROCEDURE GetSessions
 @eventID SMALLINT
AS

DECLARE @URL VARCHAR(500)
SET @URL = 'http://www.sqlsaturday.com/' +CAST(@eventID AS NVARCHAR(5)) + '/Sessions/Schedule.aspx'

PRINT @URL

DECLARE @TEMP TABLE
(
 SqlSatTitle NVARCHAR(500)
 ,SQLSatSpeaker NVARCHAR(200)
)

DECLARE @RCODE NVARCHAR(MAX)
SET @RCODE = N' 
 library(rvest)
 library(XML)
 library(dplyr)
 library(httr)
 library(curl)
 library(selectr)
 
 #URL to schedule
 url_schedule <- "'
 
DECLARE @RCODE2 NVARCHAR(MAX) 
SET @RCODE2 = N'"
 #Read HTML
 webpage <- html_session(url_schedule) %>%
 read_html()

# Event schedule
 schedule_info <- html_nodes(webpage, ''.session-schedule-cell-info'') # OK

# Extracting HTML content
 ht <- html_text(schedule_info)

df <- data.frame(data=ht)

#create empty DF
 df_res <- data.frame(title=c(), speaker=c())

for (i in 1:nrow(df)){
 #print(df[i])
 if (i %% 2 != 0) #odd flow
 print(paste0("title is: ", df$data[i]))
 if (i %% 2 == 0) #even flow
 print(paste0("speaker is: ", df$data[i]))
 df_res <- rbind(df_res, data.frame(title=df$data[i], speaker=df$data[i+1]))
 }

df_res_new = df_res[seq(1, nrow(df_res), 2), ]
 OutputDataSet <- df_res_new ';

DECLARE @FINAL_RCODE NVARCHAR(MAX)
SET @FINAL_RCODE = CONCAT(@RCODE, @URL, @RCODE2)

INSERT INTO @Temp
EXEC sp_execute_external_script
 @language = N'R'
 ,@script = @FINAL_RCODE


INSERT INTO SQLSatSessions (sqlSat,SqlSatTitle,SQLSatSpeaker)
SELECT 
 @EventID AS sqlsat
 ,SqlSatTitle
 ,SqlSatSpeaker
FROM @Temp

 

Before you run this, just a little environment setup:

USE [master];
GO

CREATE DATABASE SQLSaturday;
GO

USE SQLSaturday;
GO

CREATE TABLE SQLSatSessions
(
 id SMALLINT IDENTITY(1,1) NOT NULL
,SqlSat SMALLINT NOT NULL
,SqlSatTitle NVARCHAR(500) NOT NULL
,SQLSatSpeaker NVARCHAR(200) NOT NULL
)

 

There you go! Now you can run the stored procedure for a particular event (in this case SQL Saturday Slovenia 2017):

EXECUTE GetSessions @eventID = 687

or you can run this procedure against multiple SQLSaturday events and scrape the data from the SQLSaturday.com website in one go, as sketched below.
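A minimal sketch of such a batch run, driven from R over ODBC (the server name and event range below are placeholders, not from the original setup):

# hedged sketch: call the stored procedure for a range of events via RODBC
# server name and event range are placeholders
library(RODBC)
myconn <- odbcDriverConnect("driver={SQL Server};Server=MYSERVER;database=SQLSaturday;trusted_connection=true")
for (event_id in 600:690) {
  # each call scrapes one event's schedule and appends the sessions to SQLSatSessions
  sqlQuery(myconn, sprintf("EXECUTE GetSessions @eventID = %d", event_id))
}
close(myconn)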

For the Slovenian SQLSaturday, I get the following list of sessions and speakers:

[Screenshot: sessions and speakers returned for SQLSaturday #687 Slovenia]

Please note that if you are running this code behind a firewall and proxy, some additional proxy or firewall configuration might be needed!
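For example, if the requests go through httr (as html_session does), a proxy can be configured along these lines; the host and port below are placeholders:

# hedged sketch: route httr-based requests (e.g. html_session) through a corporate proxy
# proxy host and port are placeholders
library(httr)
set_config(use_proxy(url = "http://proxy.mycompany.local", port = 8080))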

So, going back to the original question: how many times has Query Store been presented at SQL Saturdays (from SQLSat600 until SQLSat690)? Here is the frequency table:

[Screenshot: frequency table of topics presented across SQLSaturday events]
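One way to produce such a frequency table from the collected sessions is a simple aggregation; a hedged sketch (the server name is a placeholder and the LIKE pattern is just an example):

# hedged sketch: count Query Store sessions per event from the SQLSatSessions table
library(RODBC)
myconn <- odbcDriverConnect("driver={SQL Server};Server=MYSERVER;database=SQLSaturday;trusted_connection=true")
freq <- sqlQuery(myconn, "SELECT SqlSat, COUNT(*) AS QueryStoreSessions
                          FROM SQLSatSessions
                          WHERE SqlSatTitle LIKE '%Query Store%'
                          GROUP BY SqlSat
                          ORDER BY SqlSat")
close(myconn)
freq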

Or, presented as a pandas graph:

[Chart: session statistics per topic]

Query Store is popular, beyond all the R, Python or Azure ML topics, but PowerShell is gaining popularity like crazy. Good work, PowerShell people! 🙂

UPDATE #1: More statistics. In general, a PowerShell session is presented at every second SQLSaturday and Query Store at every third, whereas there are at least two Azure-related topics at every SQLSat event (relevant for events ranging from SQLSat600 to SQLSat690).

As always, the code is available on GitHub.

 


Robust IRIS Dataset?

This blog post was born out of pure curiosity about the robustness of the IRIS Dataset. Biological datasets do not need to be that big in comparison to datasets of customers, consumption, stock and anything that might be volatile.

Back at university, I remember one occasion when we were measuring the length of frog legs and other froggy features. After just a couple of measurements, further predictions were steady. Also, any kind of sampling (RS and SRS, cluster/stratified sampling, sampling with replacement and many other creative ways of sampling) proved to be rigid and robust, and converged quickly to a good result.

Therefore, I have decided to put the IRIS dataset to the test using a simple classification method: first calculating the simple Euclidean distance, then finding the nearest neighbours, and based on that checking the predicted membership (flower type) against the labels.

Accuracy of the prediction was tested by mapping the original species to the predicted ones. The test was how small the training dataset can be in order to still get a good result.
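This is not the original Python implementation, but a minimal R sketch of the same split-and-score idea, assuming a 1-nearest-neighbour rule from the class package:

# hedged sketch: split iris into train/test, predict species, measure accuracy
library(class)   # for knn()
set.seed(42)

train_frac <- 0.10                                   # e.g. 10% training, 90% test
idx   <- sample(nrow(iris), size = train_frac * nrow(iris))
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]
pred  <- knn(train, test, cl = iris$Species[idx], k = 1)

# accuracy = share of correctly mapped species
mean(pred == iris$Species[-idx])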

After some Python and R code, the results were in.

I have tested the following pairs (train : test sample size):

  • 80% – 20%
  • 60% – 40%
  • 50% – 50%
  • 30% – 70%
  • 10% – 90%

Note that the IRIS dataset has 150 observations, evenly distributed among the three species. The following Python code loops through the calculation of the Euclidean distances by repeatedly running the classification script:

# run the classification script 3,000 times to collect prediction results
for x in range(3000):
    exec(open("./classification.py").read(), globals())

At the end, I have generated the results file:

[Screenshot: generated predictions file]

With these results, a simple piece of R code was used to generate the scatter plot:

library(ggplot2)
setwd("C:\\Predictions\\")
df_pred <- data.frame(read.table("results_split.txt", sep=";"))

# V1 = size of the training subset, V3 = accuracy of the predictions
p <- ggplot(df_pred, aes(df_pred$V3, df_pred$V1))
p <- p + geom_point()
p <- p + labs(x="Accuracy (%) of predictions", y="Size of training subset")
p

Which resulted in:

[Plot: accuracy of predictions vs. size of training subset]

The graph clearly shows that a 10% training set (15 out of the 150 observations) can generate very accurate predictions roughly every 1.35 runs.

The other pairs, taking 30% or 50% of the data for training, will give close to 100% accuracy almost every time.
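A hedged sketch of how the per-split accuracies could be summarised from the same results file, assuming (as in the plot above) that V1 holds the training-set size and V3 the accuracy:

# summarise accuracy by training-set size (column meaning assumed from the plot labels)
library(dplyr)
df_pred %>%
  group_by(V1) %>%
  summarise(mean_accuracy = mean(V3),
            min_accuracy  = min(V3),
            runs          = n())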

Snippet of the Python code to calculate the Euclidean distance:

import math

def eucl_dist(set1, set2, length):
    # sum the squared differences over the first `length` attributes
    distance = 0
    for x in range(length):
        distance += pow(set1[x] - set2[x], 2)
    return math.sqrt(distance)

and neighbours:

import operator

def find_neighbors(train, test, nof_class):
    # distance from the test instance to every training instance
    distances = []
    length_dist = len(test) - 1
    for x in range(len(train)):
        dist = eucl_dist(test, train[x], length_dist)
        distances.append((train[x], dist))
    # keep the nof_class closest training instances
    distances.sort(key=operator.itemgetter(1))
    neighbour = []
    for x in range(nof_class):
        neighbour.append(distances[x][0])
    return neighbour

 

Conclusion: the IRIS dataset is, due to the nature of the measurements and observations, robust and rigid; one can get very good accuracy results on a small training set. Everything beyond 30% of the data for training the model is, in this particular case, just additional overhead.

Happy R & Python coding! 🙂

Using R sp_execute_external_script with JSON

JSON became part of SQL Server in the same version as R. Both were highly anticipated and awaited by the community.

SQL Server has very powerful statements for converting to and from JSON when storing data into / reading data from the engine (FOR JSON, JSON_VALUE, etc.). And since JSON is gaining popularity for data exchange, I was curious to try it in combination with R.

I will simply convert a system table result into a JSON array using the FOR JSON clause.

SELECT top 10 object_id  FROM sys.objects FOR JSON AUTO;

and it gives back the result:

[{"object_id":3},{"object_id":5},{"object_id":6},{"object_id":7},{"object_id":8},
{"object_id":9},{"object_id":17},{"object_id":18},{"object_id":19},{"object_id":20}]

An sp_execute_external_script query without JSON would look like this:

EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script=N'OutputDataSet <- InputDataSet'
      ,@input_data_1 = N'SELECT top 10 object_id  FROM sys.objects'
WITH RESULT SETS ((nr INT));

Now, let’s suppose we want to feed the JSON result directly into R using sp_execute_external_script. Yes, imagine getting results from an API and wanting to push them immediately into R for analysis. A very straightforward R package for this is jsonlite (rjson is also available). The query would be as follows:

EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script=N'library(jsonlite)
                OutputDataSet <- data.frame(fromJSON(InputDataSet))'
      ,@input_data_1 = N'SELECT top 10 object_id  FROM sys.objects FOR JSON AUTO'
WITH RESULT SETS ((nr INT));

Nope!

Msg 39004, Level 16, State 20, Line 15
A 'R' script error occurred during execution of 'sp_execute_external_script' 
with HRESULT 0x80004004.
Msg 39019, Level 16, State 1, Line 15
An external script error occurred: 
Error: Argument 'txt' must be a JSON string, URL or file.

So the argument ‘txt’ must be a JSON string, URL or file. Khm… a very “useful” error message. The problem is that the data from T-SQL is passed to and presented in the R environment (Launchpad) as a data.frame, because the data type passed to R is an array of objects. It would look something like this:

[Screenshot: the JSON input presented as a data.frame in RStudio]
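To see the same behaviour outside SQL Server, here is a minimal R-only repro, assuming only the jsonlite package (the sample JSON mimics the FOR JSON output above):

# hedged, R-only repro of the error and the fix
library(jsonlite)

js <- data.frame(V1 = '[{"object_id":3},{"object_id":5}]', stringsAsFactors = FALSE)  # roughly how InputDataSet arrives
class(js)                        # "data.frame" - not what fromJSON expects
# fromJSON(js)                   # would raise: Argument 'txt' must be a JSON string, URL or file
fromJSON(as.character(js$V1))    # parses fine once a character string is passed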

Running this query in a native R environment, we at least get an idea of where and how to tackle the problem. We need to convert the data.frame to a character string using toJSON and as.character, so that the final T-SQL query looks like this:

EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script=N'
                library(jsonlite)
                js <- InputDataSet
                js2 <- as.character(toJSON(js))
                OutputDataSet <- data.frame(fromJSON(js2))'
      ,@input_data_1 = N'SELECT top 10 object_id  FROM sys.objects FOR JSON AUTO'
WITH RESULT SETS ((nr INT));

Now we get the correct results (the same as if we had not used JSON):

[Screenshot: query results returned to SSMS]

So R is ready for JSON and JSON is also ready for R.

Happy R+JSON+SQLing!

Clustering executed SQL Server queries using R as a tool for performance analysis

When query execution performance analysis is to be done, there are many ways to find which queries might cause unwanted load or stall the server.

To encourage the DBA community to start taking advantage of the R language and the world of data science, I have created a demo to show how statistics on numerous queries can be stored for later analysis. The demo uses an unsupervised (or undirected) method for grouping similar query statistics (or queries), making it easier to track down where and which queries might be potential system stoppers.

Before we run some queries to generate runtime statistics, we clean the cache and prepare the table for storing statistics from sys.dm_exec_query_stats.

-- drop statistics table
DROP TABLE IF EXISTS  query_stats_LOG_2;
DROP PROCEDURE IF EXISTS AbtQry;
DROP PROCEDURE IF EXISTS SalQry;
DROP PROCEDURE IF EXISTS PrsQry;
DROP PROCEDURE IF EXISTS OrdQry;
DROP PROCEDURE IF EXISTS PurQry;
GO

-- Clean all the stuff only for the selected db
DECLARE @dbid INTEGER
SELECT @dbid = [dbid] 
FROM master..sysdatabases 
WHERE name = 'WideWorldImportersDW'
DBCC FLUSHPROCINDB (@dbid);
GO

-- for better sample, check that you have Query store turned off
ALTER DATABASE WideWorldImportersDW SET  QUERY_STORE = OFF;
GO

We create a few procedures that will generate some fake workload:

USE WideWorldImportersDW;
GO

-- CREATE Procedures 

CREATE PROCEDURE AbtQry
(@AMNT AS INTEGER) 
AS

-- an arbitrary query
SELECT 
  cu.[Customer Key] AS CustomerKey
  ,cu.Customer
  ,ci.[City Key] AS CityKey
  ,ci.City
  ,ci.[State Province] AS StateProvince
  ,ci.[Sales Territory] AS SalesTeritory
  ,d.Date
  ,d.[Calendar Month Label] AS CalendarMonth
  ,s.[Stock Item Key] AS StockItemKey
  ,s.[Stock Item] AS Product
  ,s.Color
  ,e.[Employee Key] AS EmployeeKey
  ,e.Employee
  ,f.Quantity
  ,f.[Total Excluding Tax] AS TotalAmount
  ,f.Profit
 
FROM Fact.Sale AS f
  INNER JOIN Dimension.Customer AS cu
    ON f.[Customer Key] = cu.[Customer Key]
  INNER JOIN Dimension.City AS ci
    ON f.[City Key] = ci.[City Key]
  INNER JOIN Dimension.[Stock Item] AS s
    ON f.[Stock Item Key] = s.[Stock Item Key]
  INNER JOIN Dimension.Employee AS e
    ON f.[Salesperson Key] = e.[Employee Key]
  INNER JOIN Dimension.Date AS d
    ON f.[Delivery Date Key] = d.Date
WHERE
    f.[Total Excluding Tax] BETWEEN 10 AND @AMNT;
GO

CREATE PROCEDURE SalQry
(@Q1 AS INTEGER
,@Q2 AS INTEGER)

AS
-- FactSales Query
SELECT * FROM Fact.Sale
WHERE
    Quantity BETWEEN @Q1 AND @Q2;
GO

CREATE PROCEDURE PrsQry
(@CID AS INTEGER )
AS

-- Person Query
SELECT * 
    FROM [Dimension].[Customer]
    WHERE [Buying Group] <> 'Tailspin Toys' 
    /* OR [WWI Customer ID] > 500 */
    AND [WWI Customer ID] BETWEEN 400 AND  @CID
ORDER BY [Customer],[Bill To Customer];
GO


CREATE PROCEDURE OrdQry
(@CK AS INTEGER)
AS

-- FactSales Query
SELECT 
    * 
    FROM [Fact].[Order] AS o
    INNER JOIN [Fact].[Purchase] AS p 
    ON o.[Order Key] = p.[WWI Purchase Order ID]
WHERE
    o.[Customer Key] = @CK;
GO

CREATE PROCEDURE PurQry
(@Date AS SMALLDATETIME)
AS

-- FactPurchase Query
SELECT *
    FROM [Fact].[Purchase]
        WHERE
        [Date Key] = @Date;
    --[Date KEy] = '2015/01/01'
GO

Now we run the procedures a couple of times (the GO 10 at the end repeats the whole batch ten times):

DECLARE @ra DECIMAL(10,2)
SET @ra = RAND()
SELECT CAST(@ra*10 AS INT)

IF @ra  < 0.3333
    BEGIN
       -- SELECT 'RAND < 0.333', @ra
       DECLARE @AMNT_i1 INT = 100*CAST(@ra*10 AS INT)
       EXECUTE AbtQry @AMNT = @AMNT_i1
       EXECUTE PurQry @DAte = '2015/10/01'
       EXECUTE PrsQry @CID = 480
       EXECUTE OrdQry @CK = 0
       DECLARE @Q1_i1 INT = 1*CAST(@ra*10 AS INT)
       DECLARE @Q2_i1 INT = 20*CAST(@ra*10 AS INT)
       EXECUTE SalQry @Q1 = @Q1_i1, @Q2 = @Q2_i1

    END
ELSE 
    IF @ra  > 0.3333 AND @ra < 0.6667
    BEGIN
        -- SELECT 'RAND > 0.333 | < 0.6667', @ra
        DECLARE @AMNT_i2 INT = 500*CAST(@ra*10 AS INT)
        EXECUTE AbtQry @AMNT = @AMNT_i2
        EXECUTE PurQry @DAte = '2016/04/29'
        EXECUTE PrsQry @CID = 500
        EXECUTE OrdQry @CK = 207
        DECLARE @Q1_i2 INT = 2*CAST(@ra*10 AS INT)
        DECLARE @Q2_i2 INT = 10*CAST(@ra*10 AS INT)
        EXECUTE SalQry @Q1 = @Q1_i2, @Q2 = @Q2_i2

    END
ELSE
    BEGIN
        -- SELECT 'RAND > 0.6667', @ra
        DECLARE @AMNT_i3 INT = 800*CAST(@ra*10 AS INT)
        EXECUTE AbtQry @AMNT = @AMNT_i3
        EXECUTE PurQry @DAte = '2015/08/13'
        EXECUTE PrsQry @CID = 520
        EXECUTE OrdQry @CK = 5
        DECLARE @Q2_i3 INT = 60*CAST(@ra*10 AS INT)
        EXECUTE SalQry @Q1 = 25, @Q2 = @Q2_i3
    END
GO 10

And with this data in place, we run the logging T-SQL (query source) to capture the statistics:

SELECT
    (total_logical_reads + total_logical_writes) AS total_logical_io
    ,(total_logical_reads / execution_count) AS avg_logical_reads
    ,(total_logical_writes / execution_count) AS avg_logical_writes
    ,(total_physical_reads / execution_count) AS avg_phys_reads
    ,SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
      ((CASE qs.statement_end_offset
          WHEN -1 THEN DATALENGTH(st.text)
          ELSE qs.statement_end_offset
        END - qs.statement_start_offset) / 2) + 1) AS statement_text
    ,*
INTO query_stats_LOG_2
FROM
        sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY total_logical_io DESC

Now that we have gathered some test data, we can proceed to do clustering analysis.

Since I don’t know how many clusters there might be, and I can imagine a DBA would also be pretty much clueless, I will explore the number of clusters with the following R code:

library(RODBC)
myconn <- odbcDriverConnect("driver={SQL Server};Server=SICN-KASTRUN;database=WideWorldImportersDW;trusted_connection=true")

query.data <- sqlQuery(myconn, "
                     SELECT 
                                 [total_logical_io]
                      ,[avg_logical_reads]
                      ,[avg_phys_reads]
                      ,execution_count
                      ,[total_physical_reads]
                      ,[total_elapsed_time]
                      ,total_dop
                      ,[text]
                      ,CASE WHEN LEFT([text],70) LIKE '%AbtQry%' THEN 'AbtQry'
                       WHEN LEFT([text],70) LIKE '%OrdQry%' THEN 'OrdQry'
                       WHEN LEFT([text],70) LIKE '%PrsQry%' THEN 'PrsQry'
                       WHEN LEFT([text],70) LIKE '%SalQry%' THEN 'SalQry'
                       WHEN LEFT([text],70) LIKE '%PurQry%' THEN 'PurQry'
                       WHEN LEFT([text],70) LIKE '%@BatchID%' THEN 'System'
                                  ELSE 'Others' END AS label_graph 
                      FROM query_stats_LOG_2")

close(myconn) 
library(cluster)

#qd <- query.data[,c(1,2,3,5,6)]
qd <- query.data[,c(1,2,6)]

## hierarchical clustering
qd.use <- query.data[,c(1,2,6)]
medians <- apply(qd.use,2,median)
#mads <- apply(qd.use,2,mad)
qd.use <- scale(qd.use,center=medians) #,scale=mads)

#calculate distances
query.dist <- dist(qd.use)

# hierarchical clustering
query.hclust <- hclust(query.dist)

# plotting solution
op <- par(bg = "lightblue")
plot(query.hclust,labels=query.data$label_graph,
main='Query Hierarchical Clustering', ylab = 'Distance', 
xlab = ' ', hang = -1, sub = "" )
# in addition to circle queries within cluster
rect.hclust(query.hclust, k=3, border="DarkRed")        

This produces the following plot:

[Plot: hierarchical clustering dendrogram of the executed queries]

The graph itself is self-explanatory: based on the gathered statistics and the queries executed against the system, you get groups of queries in which your DBA can easily and quickly track down what might be causing issues. I added labels to the queries so the graph looks neater, but that is up to you.

I have also changed the type to “triangle” to get the following plot:

[Plot: the same dendrogram drawn with the triangle type]

Both show the same information.

So the R code suggests there are three clusters, and I used medians to center the data. In addition, I have also tested the result with partitioning around medoids (PAM, a partitioning method as opposed to hierarchical clustering), and both techniques yield clean clusters.
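Not the exact code from the post, but a sketch of how such a PAM cross-check could look on the same scaled matrix qd.use:

# hedged sketch: partitioning around medoids on the same scaled query statistics
library(cluster)
query.pam <- pam(qd.use, k = 3)                          # 3 clusters, as suggested above
table(query.pam$clustering, query.data$label_graph)      # cross-tabulate clusters vs. query labels
clusplot(query.pam, main = "PAM clustering of query statistics")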

[Screenshot: clustering results in RStudio]

Also, the data sample is relatively small, but you are very welcome to test this idea in your own environment. Just go easy with the FREEPROCCACHE and FLUSHPROCINDB commands!

This blog post was meant as a teaser, to gather opinions from readers. A couple of additional approaches will be part of two articles that I am currently working on.

As always, the code is available on GitHub.

Happy SQLinRg!

SQL Saturday Vienna 2017 #sqlsatVienna

SQL Saturday Vienna 2017 is just around the corner. On Friday, January 20, 2017, many local and international speakers will gather to deliver sessions relating to SQL Server and all related services. With a great agenda (available here), attendees will surely enjoy the different topics, and they will have the opportunity to talk to the Austrian PASS and SQL community, to the speakers and to SQL Server MVPs.

[Screenshot: SQLSaturday #579 Vienna 2017 event home page]

 

My session at SQLSat Vienna 2017 will focus on what database administrators can gain from R integration with SQL Server 2016. That said, we will look at how statistics from the main DBA tasks can be gathered, stored and later analyzed for better predictions and for uncovering patterns in the baseline that might otherwise be overlooked, and of course at how to play with the information gathered from Query Store and DMV query plans. The session will also include field examples that any enterprise may encounter.

This year, I will have the pleasure of doing the pre-con on Thursday, January 19, 2017 at the JUFA Hotel in Vienna: a full-day workshop on SQL Server and R integration with all the major topics covered, including where and how to start using R, how the R integration works, a deep dive into the R packages for high-performance work, and a dive into statistics, from univariate to multivariate, as well as methods for data mining and machine learning. Everybody is welcome to join; it will be a great day for a workshop! 🙂

 

[Screenshot: Eventbrite page for the pre-con “BI and Analytics with SQL Server and R”, Thu, 19 Jan 2017]

Tickets are available here via Eventbrite.

 

Falco is already playing Vienna Calling 🙂


Association Rules on WideWorldImporters and SQL Server R Services

Association rules are very handy for analyzing retail data, and the WWI database has a really neat set of invoices that can be used as a primer.

Starting with the following T-SQL query:

USE WideWorldIMportersDW;
GO

;WITH PRODUCT
AS
(
SELECT 
  [Stock Item Key]
 ,[WWI Stock Item ID]
 ,[Stock Item] 
 ,LEFT([Stock Item], 8) AS L8DESC 
 ,ROW_NUMBER() OVER (PARTITION BY LEFT([Stock Item], 8) ORDER BY ([Stock Item])) AS RN_ID_PR
 ,DENSE_RANK() OVER (ORDER BY (LEFT([Stock Item], 8))) AS PRODUCT_GROUP
 FROM [Dimension].[Stock Item]
)

SELECT
 O.[WWI Order ID]
,O.[Order Key]
,O.[Stock Item Key]
,P.PRODUCT_GROUP
,O.[Description]

FROM [Fact].[Order] AS O
JOIN PRODUCT AS P
    ON P.[Stock Item Key] = O.[Stock Item Key]

ORDER BY 
    O.[WWI Order ID]
    ,O.[Order Key]

 

I have created a very simple product group that neglects the distinction between product variants and treats them as one. For example:

Stock Item Key    WWI Stock Item ID    Stock Item
54                166                  10 mm Anti static bubble wrap (Blue) 20m
53                167                  10 mm Anti static bubble wrap (Blue) 50m

Both products are essentially the same; only the product variant changes: color, size, cap, volume, etc. The product group denotes the main product, “without” the product variants. I am doing this simplification for practical reasons, because of the smaller dataset.

So the new version of the product groups (variable ProductGroup) would look like:

Stock Item Key    WWI Stock Item ID    Stock Item      ProductGroup
54                166                  10 mm Anti      2
53                167                  10 mm Anti      2
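Just for illustration, the same grouping idea can be sketched in R (the sample rows are the two products from above; this is not part of the original post):

# hedged sketch: derive an 8-character product group in R, mirroring the T-SQL above
stock <- data.frame(StockItem = c("10 mm Anti static bubble wrap (Blue) 20m",
                                  "10 mm Anti static bubble wrap (Blue) 50m"),
                    stringsAsFactors = FALSE)
stock$L8DESC       <- substr(stock$StockItem, 1, 8)          # first 8 characters
stock$ProductGroup <- as.integer(factor(stock$L8DESC))       # dense rank over the prefix
stock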

Incorporating the R code for analyzing association rules into sp_execute_external_script is what the following code does:

-- Getting Association Rules into T-SQL
DECLARE @TSQL AS NVARCHAR(MAX)
SET @TSQL = N'WITH PRODUCT
                        AS
                      (
                      SELECT
                      [Stock Item Key]
                      ,[WWI Stock Item ID]
                      ,[Stock Item] 
                      ,LEFT([Stock Item], 8) AS L8DESC 
                      ,ROW_NUMBER() OVER (PARTITION BY LEFT([Stock Item], 8) ORDER BY ([Stock Item])) AS RN_ID_PR
                      ,DENSE_RANK() OVER (ORDER BY (LEFT([Stock Item], 8))) AS PRODUCT_GROUP
                      FROM [Dimension].[Stock Item]
                      )
                      
                      SELECT
                      O.[WWI Order ID] AS OrderID
                      -- ,O.[Order Key]   AS OrderLineID
                      -- ,O.[Stock Item Key] AS ProductID
                      ,P.PRODUCT_GROUP AS ProductGroup
                      -- ,O.[Description] AS ProductDescription
                      ,LEFT([Stock Item],8) AS ProductDescription
                      
                      FROM [Fact].[Order] AS O
                      JOIN PRODUCT AS P
                      ON P.[Stock Item Key] = O.[Stock Item Key]
                      GROUP BY
                       O.[WWI Order ID]
                      ,P.PRODUCT_GROUP 
                      ,LEFT([Stock Item],8) 
                      ORDER BY 
                      O.[WWI Order ID]'

DECLARE @RScript AS NVARCHAR(MAX)
SET @RScript = N'
                library(arules)
                cust.data <- InputDataSet
                cd_f <- data.frame(OrderID=as.factor(cust.data$OrderID),
ProductGroup=as.factor(cust.data$ProductGroup))
                cd_f2_tran  <- as(split(cd_f[,"ProductGroup"], cd_f[,"OrderID"]), 
"transactions")
                rules <- apriori(cd_f2_tran, parameter=list(support=0.01, 
confidence=0.1))
                OutputDataSet <- data.frame(inspect(rules))'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RScript
          ,@input_data_1 = @TSQL
          
WITH RESULT SETS ((
     lhs NVARCHAR(500)
    ,[Var.2] NVARCHAR(10)
    ,rhs NVARCHAR(500)
    ,support DECIMAL(18,3)
    ,confidence DECIMAL(18,3)
    ,lift DECIMAL(18,3)
                 ));

 

The result retrieves the association rules between products across the transactions, each with its support, confidence and lift, which can eventually be used for predictions.
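As a quick reminder of what the three measures mean, and a hedged sketch of how the most interesting rules could be pulled out in R (using the rules object built in the script above):

# support(A => B)    = share of orders containing both A and B
# confidence(A => B) = support(A => B) / support(A)
# lift(A => B)       = confidence(A => B) / support(B)
inspect(sort(rules, by = "lift"))   # rules sorted by lift, most interesting first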

By executing this R code:

# chart if needed (the plot method for rules comes from the arulesViz package)
library(arulesViz)
plot(rules, method="grouped", control=list(k=20));

one can also generate a graphical view of the rules and associations between products.

[Plot: grouped matrix of the association rules]

And finally, to retrieve the support for each ProductGroup (which is my case), I would execute this R code embedded in T-SQL:

DECLARE @TSQL AS NVARCHAR(MAX)
SET @TSQL = N'WITH PRODUCT
                        AS
                      (
                      SELECT
                      [Stock Item Key]
                      ,[WWI Stock Item ID]
                      ,[Stock Item] 
                      ,LEFT([Stock Item], 8) AS L8DESC 
                      ,ROW_NUMBER() OVER (PARTITION BY LEFT([Stock Item], 8) ORDER BY ([Stock Item])) AS RN_ID_PR
                      ,DENSE_RANK() OVER (ORDER BY (LEFT([Stock Item], 8))) AS PRODUCT_GROUP
                      FROM [Dimension].[Stock Item]
                      )
                      
                      SELECT
                      O.[WWI Order ID] AS OrderID
                      -- ,O.[Order Key]   AS OrderLineID
                      -- ,O.[Stock Item Key] AS ProductID
                      ,P.PRODUCT_GROUP AS ProductGroup
                      -- ,O.[Description] AS ProductDescription
                      ,LEFT([Stock Item],8) AS ProductDescription
                      
                      FROM [Fact].[Order] AS O
                      JOIN PRODUCT AS P
                      ON P.[Stock Item Key] = O.[Stock Item Key]
                      GROUP BY
                       O.[WWI Order ID]
                      ,P.PRODUCT_GROUP 
                      ,LEFT([Stock Item],8) 
                      ORDER BY 
                      O.[WWI Order ID]'

DECLARE @RScript AS NVARCHAR(MAX)
SET @RScript = N'
                library(arules)
                cust.data <- InputDataSet
                cd_f <- data.frame(OrderID=as.factor(cust.data$OrderID),
ProductGroup=as.factor(cust.data$ProductGroup))
                cd_f2_tran  <- as(split(cd_f[,"ProductGroup"], cd_f[,"OrderID"]),
 "transactions")
                PgroupSets <- eclat(cd_f2_tran, parameter = list(support = 0.05), 
control = list(verbose=FALSE))
                normalizedGroups <- PgroupSets[size(items(PgroupSets)) == 1]
                eachSupport <- quality(normalizedGroups)$support
                GroupName <- unlist(LIST(items(normalizedGroups), decode = FALSE))
                OutputDataSet <- data.frame(GroupName, eachSupport);'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RScript
          ,@input_data_1 = @TSQL
          
WITH RESULT SETS ((
     ProductGroup NVARCHAR(500)
    ,support DECIMAL(18,3)
                 ));

This ProductGroup ID can then be joined in T-SQL

[Screenshot: support per product group returned to SSMS]

in order to receive labels:

 SELECT 
 LEFT([Stock Item], 8) AS L8DESC 
 ,DENSE_RANK() OVER (ORDER BY (LEFT([Stock Item], 8))) AS PRODUCT_GROUP
 FROM [Dimension].[Stock Item]

GROUP BY  LEFT([Stock Item], 8)

Pros and cons

The biggest pro is the ability to integrate association rules with T-SQL and have all the R code working as it should. This allows data wranglers, data scientists and data managers to work out the rules that are hidden in transactional/basket data. Working with the different types of outputs (support, confidence, lift), the user gets to see immediately what goes with what. In my case, you can tell that the amount of original data (a little over 73K transactions and a little over 200K rows) is sometimes not enough to generate meaningful rules with relevant content. If the dataset had been 100 times bigger, I am sure this would not be the case.

Data size falls under the cons. With a larger dataset to analyse, there would be a performance drawback in terms of memory consumption (this calculation in sp_execute_external_script cannot take advantage of the RevoScaleR package and *.xdf data files) and speed. If the RevoScaleR package had a function to support this calculation, I am very confident that there would only be pros to the association rules learning algorithm.

To sum up, association rules are a great and powerful algorithm for finding correlations between items, and the fact that you can use them straight from SSMS just gives me goosebumps. Currently only the performance is a bit of a drawback. Also, comparing this algorithm to the Analysis Services (SSAS) association rules, there are many advantages on the R side because of its maneuverability and the ease of getting the data back into T-SQL; but keep in mind that SSAS is still a very awesome and powerful tool for statistical analysis and data predictions.

The code is available on GitHub.

Happy R-TSQLing!