Association Rules on WideWorldImporters and SQL Server R Services

Association rules are very handy for analyzing retail data, and the WideWorldImporters (WWI) sample database has a really neat set of invoices that can be used as a primer. (The WideWorldImportersDW database can be downloaded from GitHub: https://github.com/Microsoft/sql-server-samples/releases/tag/wide-world-importers-v1.0 — restore WideWorldImportersDW-Full.bak and you will be able to reproduce the results.)

Let us start with the following T-SQL query:

USE WideWorldImportersDW;
GO

;WITH PRODUCT
AS
(
SELECT 
  [Stock Item Key]
 ,[WWI Stock Item ID]
 ,[Stock Item] 
 ,LEFT([Stock Item], 8) AS L8DESC 
 ,ROW_NUMBER() OVER (PARTITION BY LEFT([Stock Item], 8) ORDER BY ([Stock Item])) AS RN_ID_PR
 ,DENSE_RANK() OVER (ORDER BY (LEFT([Stock Item], 8))) AS PRODUCT_GROUP
 FROM [Dimension].[Stock Item]
)

SELECT
 O.[WWI Order ID]
,O.[Order Key]
,O.[Stock Item Key]
,P.PRODUCT_GROUP
,O.[Description]

FROM [Fact].[Order] AS O
JOIN PRODUCT AS P
    ON P.[Stock Item Key] = O.[Stock Item Key]

ORDER BY 
    O.[WWI Order ID]
    ,O.[Order Key]

 

With it I have created a very simple product grouping that neglects the distinction between product variants and treats them as one. For example:

Stock Item Key    WWI Stock Item ID    Stock Item
54                166                  10 mm Anti static bubble wrap (Blue) 20m
53                167                  10 mm Anti static bubble wrap (Blue) 50m

Both products are essentially the same; only the product variant changes: color, size, cap, volume, etc. The product group denotes the main product, without the product variants. I am making this simplification for practical reasons, because the dataset is rather small.

The new version with product groups (column ProductGroup) looks like this:

Stock Item Key    WWI Stock Item ID    Stock Item      ProductGroup
54                166                  10 mm Anti      2
53                167                  10 mm Anti      2

The following code incorporates the R code for analyzing association rules into sp_execute_external_script:

-- Getting Association Rules into T-SQL
DECLARE @TSQL AS NVARCHAR(MAX)
SET @TSQL = N'WITH PRODUCT
              AS
              (
              SELECT
                 [Stock Item Key]
                ,[WWI Stock Item ID]
                ,[Stock Item]
                ,LEFT([Stock Item], 8) AS L8DESC
                ,ROW_NUMBER() OVER (PARTITION BY LEFT([Stock Item], 8) ORDER BY ([Stock Item])) AS RN_ID_PR
                ,DENSE_RANK() OVER (ORDER BY (LEFT([Stock Item], 8))) AS PRODUCT_GROUP
              FROM [Dimension].[Stock Item]
              )
              SELECT
                 O.[WWI Order ID] AS OrderID
                -- ,O.[Order Key]   AS OrderLineID
                -- ,O.[Stock Item Key] AS ProductID
                ,P.PRODUCT_GROUP AS ProductGroup
                -- ,O.[Description] AS ProductDescription
                ,LEFT([Stock Item], 8) AS ProductDescription
              FROM [Fact].[Order] AS O
              JOIN PRODUCT AS P
                   ON P.[Stock Item Key] = O.[Stock Item Key]
              GROUP BY
                 O.[WWI Order ID]
                ,P.PRODUCT_GROUP
                ,LEFT([Stock Item], 8)
              ORDER BY
                 O.[WWI Order ID]'

DECLARE @RScript AS NVARCHAR(MAX)
SET @RScript = N'
    library(arules)
    cust.data <- InputDataSet
    # coerce IDs to factors and build one transaction per order
    cd_f <- data.frame(OrderID = as.factor(cust.data$OrderID),
                       ProductGroup = as.factor(cust.data$ProductGroup))
    cd_f2_tran <- as(split(cd_f[, "ProductGroup"], cd_f[, "OrderID"]), "transactions")
    # mine the rules with minimum support of 1% and minimum confidence of 10%
    rules <- apriori(cd_f2_tran, parameter = list(support = 0.01, confidence = 0.1))
    # inspect() returns the rules as a data frame: lhs, =>, rhs, support, confidence, lift
    OutputDataSet <- data.frame(inspect(rules))'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RScript
          ,@input_data_1 = @TSQL
          
WITH RESULT SETS ((
     lhs NVARCHAR(500)
    ,[Var.2] NVARCHAR(10)  -- the "=>" arrow column that inspect() prints between lhs and rhs
    ,rhs NVARCHAR(500)
    ,support DECIMAL(18,3)
    ,confidence DECIMAL(18,3)
    ,lift DECIMAL(18,3)
                 ));

 

The result is a set of association rules between products, mined from the transactions, each with its support and confidence, and a lift that can eventually be used for predictions.
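
With a larger rule set it helps to look at the strongest rules first. Below is a minimal sketch of post-processing the rules object with standard arules functions; these lines could be appended to the @RScript above or run in a standalone R session on the same input data (the top-10 cut and the 1.2 lift threshold are just example values):

library(arules)

# show the rules with the highest lift (at most ten)
inspect(sort(rules, by = "lift")[1:min(10, length(rules))])

# keep only the rules where the products co-occur more often than
# expected under independence (lift above 1.2)
strong.rules <- subset(rules, subset = lift > 1.2)
summary(strong.rules)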

By executing this R code:

# chart, if needed (the plot method for rules requires the arulesViz package)
library(arulesViz)
plot(rules, method = "grouped", control = list(k = 20));

one can also generate a graphical view of the rules and the associations between products.

(Figure: grouped matrix plot of the association rules.)
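
The arulesViz package offers other views as well; for instance, a graph-based plot of the same rules object (a sketch; with many rules it is best applied to a subset):

library(arulesViz)
# items and rules become vertices, edges show the associations
plot(rules, method = "graph")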

And finally, to retrieve the support for each ProductGroup (which is what I am after), I would execute this R code embedded into T-SQL:

DECLARE @TSQL AS NVARCHAR(MAX)
SET @TSQL = N'WITH PRODUCT
              AS
              (
              SELECT
                 [Stock Item Key]
                ,[WWI Stock Item ID]
                ,[Stock Item]
                ,LEFT([Stock Item], 8) AS L8DESC
                ,ROW_NUMBER() OVER (PARTITION BY LEFT([Stock Item], 8) ORDER BY ([Stock Item])) AS RN_ID_PR
                ,DENSE_RANK() OVER (ORDER BY (LEFT([Stock Item], 8))) AS PRODUCT_GROUP
              FROM [Dimension].[Stock Item]
              )
              SELECT
                 O.[WWI Order ID] AS OrderID
                -- ,O.[Order Key]   AS OrderLineID
                -- ,O.[Stock Item Key] AS ProductID
                ,P.PRODUCT_GROUP AS ProductGroup
                -- ,O.[Description] AS ProductDescription
                ,LEFT([Stock Item], 8) AS ProductDescription
              FROM [Fact].[Order] AS O
              JOIN PRODUCT AS P
                   ON P.[Stock Item Key] = O.[Stock Item Key]
              GROUP BY
                 O.[WWI Order ID]
                ,P.PRODUCT_GROUP
                ,LEFT([Stock Item], 8)
              ORDER BY
                 O.[WWI Order ID]'

DECLARE @RScript AS NVARCHAR(MAX)
SET @RScript = N'
    library(arules)
    cust.data <- InputDataSet
    # same transformation as before: one transaction per order
    cd_f <- data.frame(OrderID = as.factor(cust.data$OrderID),
                       ProductGroup = as.factor(cust.data$ProductGroup))
    cd_f2_tran <- as(split(cd_f[, "ProductGroup"], cd_f[, "OrderID"]), "transactions")
    # mine frequent itemsets with the ECLAT algorithm (minimum support 5%)
    PgroupSets <- eclat(cd_f2_tran, parameter = list(support = 0.05),
                        control = list(verbose = FALSE))
    # keep only the itemsets of size one and read off their support
    normalizedGroups <- PgroupSets[size(items(PgroupSets)) == 1]
    eachSupport <- quality(normalizedGroups)$support
    # decode = FALSE returns the item (column) index instead of the label
    GroupName <- unlist(LIST(items(normalizedGroups), decode = FALSE))
    OutputDataSet <- data.frame(GroupName, eachSupport);'

EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RScript
          ,@input_data_1 = @TSQL
          
WITH RESULT SETS ((
     ProductGroup NVARCHAR(500)
    ,support DECIMAL(18,3)
                 ));
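
A side note: with decode = FALSE the GroupName column holds the item (column) index of the transactions object, not the ProductGroup value itself. If you prefer the decoded labels straight from R, the last two lines of the script could read (a sketch):

# decode = TRUE returns the item labels, i.e. the ProductGroup values
GroupName <- unlist(LIST(items(normalizedGroups), decode = TRUE))
OutputDataSet <- data.frame(GroupName, eachSupport)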

This ProductGroup ID can then be joined in T-SQL with the following query in order to retrieve the labels:

SELECT
  LEFT([Stock Item], 8) AS L8DESC
 ,DENSE_RANK() OVER (ORDER BY (LEFT([Stock Item], 8))) AS PRODUCT_GROUP
FROM [Dimension].[Stock Item]
GROUP BY LEFT([Stock Item], 8)

Pros and cons

The biggest pro is the ability to integrate association rules mining with T-SQL and have all the R code work as it should. It gives data wranglers, data scientists and data managers a way to work out the rules hidden in transactional/basket data. Working with the different output measures (support, confidence, lift), the user immediately sees what sells with what. In my case, you can see that the amount of original data (a little over 73K transactions and a little over 200K rows) is sometimes not enough to generate meaningful rules with relevant content. If the dataset were 100x bigger, I am sure this would not be the case.

Data size falls under the cons. With a larger dataset to analyze, there would be a performance drawback in terms of memory consumption and speed, since the computation inside sp_execute_external_script cannot take advantage of the RevoScaleR package and *.xdf data files. If the RevoScaleR package had a function supporting this calculation, I am confident there would be only pros to the association rules learning algorithm.
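
If memory does become the bottleneck, one pragmatic workaround is to mine only a random sample of the transactions inside the same R script; a minimal sketch (the sample size of 10,000 is just an example, and measures estimated on a sample are only approximations of the full-data values):

# sample a subset of transactions to keep memory consumption in check
set.seed(2906)
n <- min(10000, length(cd_f2_tran))
trans.sample <- cd_f2_tran[sample(length(cd_f2_tran), n)]
rules <- apriori(trans.sample, parameter = list(support = 0.01, confidence = 0.1))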

To sum up, association rules is a great and powerful algorithm for finding correlations between items, and the fact that you can use it straight from SSMS gives me goosebumps. Currently, only the performance is a bit of a drawback. Comparing this approach with the Analysis Services (SSAS) association rules algorithm, the R side has many advantages in maneuverability and in extracting the data to T-SQL; but keep in mind that SSAS is still an awesome and powerful tool for statistical analysis and data predictions.

Code is available on GitHub.

Happy R-TSQLing!
