rxNeuralNet vs. xgBoost vs. H2O

Recently, I did a session at a local user group in Ljubljana, Slovenia, where I introduced the new algorithms that are available with the MicrosoftML package for Microsoft R Server 9.0.3.

For the datasets, I used two from competitions currently running on Kaggle. In the last part, I did image detection and prediction on the MNIST dataset and compared the performance and accuracy between the packages.

The MNIST handwritten digit database is available here.


Starting off with rxNeuralNet, we first have to build a NET# model, or neural network definition, for the algorithm to work with.

Model for Neural network:

const { T = true; F = false; }

input Picture [28, 28];

hidden C1 [5 * 13^2]
  from Picture convolve {
    InputShape  = [28, 28];
    UpperPad    = [ 1,  1];
    KernelShape = [ 5,  5];
    Stride      = [ 2,  2];
    MapCount    = 5;
  }

hidden C2 [50, 5, 5]
  from C1 convolve {
    InputShape  = [ 5, 13, 13];
    KernelShape = [ 1,  5,  5];
    Stride      = [ 1,  2,  2];
    Sharing     = [ F,  T,  T];
    MapCount    = 10;
  }

hidden H3 [100]
from C2 all;

// Output layer definition.
output Result [10]
from H3 all;

Once we have this, we can train the model with the rxNeuralNet algorithm:

model_DNN_GPU <- rxNeuralNet(label ~.
      ,data = dataTrain
      ,type = "multi"
      ,numIterations = 10
      ,normalize = "no"
      #,acceleration = "gpu" #enable this if you have CUDA driver
      ,miniBatchSize = 64 #set to 1 else set to 64 if you have CUDA driver problem 
      ,netDefinition = netDefinition
      ,optimizer = sgd(learningRate = 0.1, lRateRedRatio = 0.9, lRateRedFreq = 10)
)

Then we run the prediction and calculate the accuracy:

DNN_GPU_score <- rxPredict(model_DNN_GPU, dataTest, extraVarsToWrite = "label")
sum(DNN_GPU_score$label == DNN_GPU_score$PredictedLabel)/dim(DNN_GPU_score)[1]

Accuracy for this model is:

[1] 0.9789
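If you also want a per-digit breakdown, a simple cross-tabulation of the actual and predicted labels gives a confusion matrix. This is a minimal sketch using the DNN_GPU_score data frame from above; it was not part of the original comparison.

# confusion matrix of actual vs. predicted digits
conf_mat <- table(actual = DNN_GPU_score$label,
                  predicted = DNN_GPU_score$PredictedLabel)
conf_mat

# overall accuracy, equivalent to the calculation above
sum(diag(conf_mat)) / sum(conf_mat)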

 

When working with the H2O package, the following code was executed to get comparable parameters for the neural network:

model_h20 <- h2o.deeplearning(x = 2:785
                     ,y = 1   # label for label
                     ,training_frame = train_h2o
                     ,activation = "RectifierWithDropout"
                     ,input_dropout_ratio = 0.2 # % of inputs dropout
                     ,hidden_dropout_ratios = c(0.5,0.5,0.5) # % for nodes dropout (one ratio per hidden layer)
                     ,balance_classes = TRUE 
                     ,hidden = c(50,100,100) 
                     ,momentum_stable = 0.99
                     ,nesterov_accelerated_gradient = T # use it for speed
                     ,epochs = 15)

When the results of the learned model are evaluated:

h2o.confusionMatrix(model_h20)
100-(416/9978)*100

the result is a confusion matrix, from which the accuracy of the predicted values comes out as:

# [1] 95.83083
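The held-out test frame can also be scored explicitly. A minimal sketch, assuming the test data has been loaded into an H2O frame called test_h2o with the same layout as train_h2o (label in column 1):

# score the test frame and compute accuracy by hand
h2o_pred <- h2o.predict(model_h20, newdata = test_h2o)
pred_labels   <- as.vector(h2o_pred$predict)
actual_labels <- as.vector(test_h2o[, 1])
sum(pred_labels == actual_labels) / length(actual_labels)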

 

For comparison, I have added xgBoost (eXtreme Gradient Boosting), but this time I will not focus on it.
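For completeness, here is a minimal sketch of how an xgBoost model could be trained on the same data; the hyperparameters below are illustrative placeholders, not the values used in my tests.

library(xgboost)

# assumes dataTrain/dataTest are the data frames used above, with the digit (0-9)
# in the "label" column and the pixel values in the remaining columns
feat_cols <- setdiff(names(dataTrain), "label")

dtrain <- xgb.DMatrix(data  = as.matrix(dataTrain[, feat_cols]),
                      label = as.numeric(as.character(dataTrain$label)))
dtest  <- xgb.DMatrix(data  = as.matrix(dataTest[, feat_cols]),
                      label = as.numeric(as.character(dataTest$label)))

model_xgb <- xgb.train(params = list(objective = "multi:softmax",
                                     num_class = 10,
                                     eta       = 0.3,
                                     max_depth = 6),
                       data    = dtrain,
                       nrounds = 50)

# predicted digits on the test set and overall accuracy
xgb_pred <- predict(model_xgb, dtest)
sum(xgb_pred == as.numeric(as.character(dataTest$label))) / nrow(dataTest)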

Time comparison across the packages (in seconds); from left to right: H2O, MicrosoftML with GPU acceleration, MicrosoftML without GPU acceleration, and xgBoost.

[Chart: training time comparison in seconds]

As for the accuracy of the trained models, here are the results (based on my tests):

MicrosoftML – Neural Network – 97.8%

H2O – Deep Learning – 95.3%

xgBoost – 94.9%

 

As always, the code and dataset are available on GitHub.

Happy R-ing 🙂

 

 

First @SLODUG Meeting in 2017

We had our first SQL Server User Group SLODUG meeting of the year. The event took place at Microsoft Slovenija on 9 February 2017, with about 15 people showing up. Along with those 15 people, we had 8 pizzas and some 20 beers 🙂

Two topics were scheduled:

17:15 – 18:00 Let’s use Microsoft R Server 9 for entering Kaggle competition (Tomaž Kaštrun)
18:10 – 19:30 Forecasting with MS BI Suite (Dejan Sarka)

with two beautiful presenters:

[Photo: the presenters]

Not to mention outstanding statistics about presenters:

Average gender: Male
Maximum eye color: Yes
Beer moving average: coffee

And a screenshot from the SLODUG blog:

[Screenshot: SLODUG blog announcement of the meeting]

Keep the community spirit up!

R and SQL Server articles

In the past couple of months, I have prepared several articles on R and SQL Server that have been published on SQL Server Central.

The idea was to have a series of articles covering everything from an introduction to R and the basics of R Server to practical cases of using R with SQL Server.

1) Using Microsoft R in Enterprise Environments

The article covers the concepts of Microsoft R Server, where and how to start with Microsoft R in an enterprise environment, and gives answers to the most common concerns people might have when introducing the R language into a corporation.


Link to article: http://www.sqlservercentral.com/articles/R+Language/140422/

 

2) Introduction to Microsoft R Services in SQL Server 2016

The integration and architecture of Microsoft R Services is the main focus of this article. It outlines the different flavors of R (Open, Client, Server, Services, Hadoop, etc.), how to deal with installation, and gives a basic overview and explanation of the extended stored procedure SP_EXECUTE_EXTERNAL_SCRIPT.


Link to article: http://www.sqlservercentral.com/articles/Microsoft/145393/

 

3) Installing R packages in SQL Server R Services

Expand the functionality of R by adding new packages. The article covers the many ways to install and add packages to your R environment.


Link to article: http://www.sqlservercentral.com/articles/R+Package/145571/

 

4) Using SQL Server and R Services for analyzing Sales data

Providing use cases for analyzing sales data was the focus of this article, with the goal of showing readers how to operationalize and bring R code into use in any enterprise environment, small or big.


Link to article: http://www.sqlservercentral.com/articles/R+Services/145649/

 

5) Using Power BI and SSRS for visualizing SQL Server and R data

Visualizing data for any use case is also an important aspect of understanding data insights. The article covers Power BI and SSRS visualizations and how to embed R code in both tools.


Link to article: http://www.sqlservercentral.com/articles/R+Language/151358/

6) Using SQL Server and R Services for analyzing DBA Tasks

Broadening the use of Microsoft R for DBA tasks was the main goal of this article. Using a simulation of disk usage, it shows an R example of how to switch from monitoring disk usage to predicting it. Clustering executed queries to narrow down performance issues and visualizing Query Store information with a heatmap were also introduced in the article.


Link to article: http://www.sqlservercentral.com/articles/R+Language/151405/

 

More articles will follow, so stick around.

Happy R-SQLing!

 

 

Clustering executed SQL Server queries using R as a tool for performance analysis

When query execution performance analysis needs to be done, there are many ways to find which queries might cause unwanted load or stall the server.

To encourage the DBA community to start taking advantage of the R language and the world of data science, I have created a demo to show how statistics on numerous queries can be stored for later analysis. The demo uses an unsupervised (or undirected) method for grouping similar query statistics (or queries), making it easier to track down which queries might be potential system stoppers.

Before we run some queries to generate running statistics, we clean the cache and prepare the table for storing statistics from sys.dm_exec_query_stats.

-- drop statistics table
DROP TABLE IF EXISTS  query_stats_LOG_2;
DROP PROCEDURE IF EXISTS AbtQry;
DROP PROCEDURE IF EXISTS SalQry;
DROP PROCEDURE IF EXISTS PrsQry;
DROP PROCEDURE IF EXISTS OrdQry;
DROP PROCEDURE IF EXISTS PurQry;
GO

-- Clean all the stuff only for the selected db
DECLARE @dbid INTEGER
SELECT @dbid = [dbid] 
FROM master..sysdatabases 
WHERE name = 'WideWorldImportersDW'
DBCC FLUSHPROCINDB (@dbid);
GO

-- for better sample, check that you have Query store turned off
ALTER DATABASE WideWorldImportersDW SET  QUERY_STORE = OFF;
GO

Next, we create the stored procedures that will generate the workload:

USE WideWorldImportersDW;
GO

-- CREATE Procedures 

CREATE PROCEDURE AbtQry
(@AMNT AS INTEGER) 
AS

-- an arbitrary query
SELECT 
  cu.[Customer Key] AS CustomerKey
  ,cu.Customer
  ,ci.[City Key] AS CityKey
  ,ci.City
  ,ci.[State Province] AS StateProvince
  ,ci.[Sales Territory] AS SalesTeritory
  ,d.Date
  ,d.[Calendar Month Label] AS CalendarMonth
  ,s.[Stock Item Key] AS StockItemKey
  ,s.[Stock Item] AS Product
  ,s.Color
  ,e.[Employee Key] AS EmployeeKey
  ,e.Employee
  ,f.Quantity
  ,f.[Total Excluding Tax] AS TotalAmount
  ,f.Profit
 
FROM Fact.Sale AS f
  INNER JOIN Dimension.Customer AS cu
    ON f.[Customer Key] = cu.[Customer Key]
  INNER JOIN Dimension.City AS ci
    ON f.[City Key] = ci.[City Key]
  INNER JOIN Dimension.[Stock Item] AS s
    ON f.[Stock Item Key] = s.[Stock Item Key]
  INNER JOIN Dimension.Employee AS e
    ON f.[Salesperson Key] = e.[Employee Key]
  INNER JOIN Dimension.Date AS d
    ON f.[Delivery Date Key] = d.Date
WHERE
    f.[Total Excluding Tax] BETWEEN 10 AND @AMNT;
GO

CREATE PROCEDURE SalQry
(@Q1 AS INTEGER
,@Q2 AS INTEGER)

AS
-- FactSales Query
SELECT * FROM Fact.Sale
WHERE
    Quantity BETWEEN @Q1 AND @Q2;
GO

CREATE PROCEDURE PrsQry
(@CID AS INTEGER )
AS

-- Person Query
SELECT * 
    FROM [Dimension].[Customer]
    WHERE [Buying Group] <> 'Tailspin Toys' 
    /* OR [WWI Customer ID] > 500 */
    AND [WWI Customer ID] BETWEEN 400 AND  @CID
ORDER BY [Customer],[Bill To Customer];
GO


CREATE PROCEDURE OrdQry
(@CK AS INTEGER)
AS

-- FactSales Query
SELECT 
    * 
    FROM [Fact].[Order] AS o
    INNER JOIN [Fact].[Purchase] AS p 
    ON o.[Order Key] = p.[WWI Purchase Order ID]
WHERE
    o.[Customer Key] = @CK;
GO

CREATE PROCEDURE PurQry
(@Date AS SMALLDATETIME)
AS

-- FactPurchase Query
SELECT *
    FROM [Fact].[Purchase]
        WHERE
        [Date Key] = @Date;
    --[Date KEy] = '2015/01/01'
GO

Now we run the procedures a couple of times:

DECLARE @ra DECIMAL(10,2)
SET @ra = RAND()
SELECT CAST(@ra*10 AS INT)

IF @ra  < 0.3333
    BEGIN
       -- SELECT 'RAND < 0.333', @ra
       DECLARE @AMNT_i1 INT = 100*CAST(@ra*10 AS INT)
       EXECUTE AbtQry @AMNT = @AMNT_i1
       EXECUTE PurQry @DAte = '2015/10/01'
       EXECUTE PrsQry @CID = 480
       EXECUTE OrdQry @CK = 0
       DECLARE @Q1_i1 INT = 1*CAST(@ra*10 AS INT)
       DECLARE @Q2_i1 INT = 20*CAST(@ra*10 AS INT)
       EXECUTE SalQry @Q1 = @Q1_i1, @Q2 = @Q2_i1

    END
ELSE 
    IF @ra  > 0.3333 AND @ra < 0.6667
    BEGIN
        -- SELECT 'RAND > 0.333 | < 0.6667', @ra
        DECLARE @AMNT_i2 INT = 500*CAST(@ra*10 AS INT)
        EXECUTE AbtQry @AMNT = @AMNT_i2
        EXECUTE PurQry @DAte = '2016/04/29'
        EXECUTE PrsQry @CID = 500
        EXECUTE OrdQry @CK = 207
        DECLARE @Q1_i2 INT = 2*CAST(@ra*10 AS INT)
        DECLARE @Q2_i2 INT = 10*CAST(@ra*10 AS INT)
        EXECUTE SalQry @Q1 = @Q1_i2, @Q2 = @Q2_i2

    END
ELSE
    BEGIN
        -- SELECT 'RAND > 0.6667', @ra
        DECLARE @AMNT_i3 INT = 800*CAST(@ra*10 AS INT)
        EXECUTE AbtQry @AMNT = @AMNT_i3
        EXECUTE PurQry @DAte = '2015/08/13'
        EXECUTE PrsQry @CID = 520
        EXECUTE OrdQry @CK = 5
        DECLARE @Q2_i3 INT = 60*CAST(@ra*10 AS INT)
        EXECUTE SalQry @Q1 = 25, @Q2 = @Q2_i3
    END
GO 10

And with this data, we proceed to run the logging T-SQL (query source).

SELECT
    (total_logical_reads + total_logical_writes) AS total_logical_io
    ,(total_logical_reads / execution_count) AS avg_logical_reads
    ,(total_logical_writes / execution_count) AS avg_logical_writes
    ,(total_physical_reads / execution_count) AS avg_phys_reads
    ,SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
      ((CASE qs.statement_end_offset WHEN -1 THEN DATALENGTH(st.text)
        ELSE qs.statement_end_offset END - qs.statement_start_offset) / 2) + 1)
      AS statement_text
    ,*
INTO query_stats_LOG_2
FROM
        sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY total_logical_io DESC

Now that we have gathered some test data, we can proceed to do clustering analysis.

Since I don’t know how many clusters there might be, and I can imagine a DBA would also be pretty much clueless, I will explore the number of clusters with the following R code:

library(RODBC)
myconn <-odbcDriverConnect("driver={SQL Server};Server=SICN-KASTRUN;
database=WideWorldImportersDW;trusted_connection=true")

query.data <- sqlQuery(myconn, "
                     SELECT 
                                 [total_logical_io]
                      ,[avg_logical_reads]
                      ,[avg_phys_reads]
                      ,execution_count
                      ,[total_physical_reads]
                      ,[total_elapsed_time]
                      ,total_dop
                      ,[text]
                      ,CASE WHEN LEFT([text],70) LIKE '%AbtQry%' THEN 'AbtQry'
                       WHEN LEFT([text],70) LIKE '%OrdQry%' THEN 'OrdQry'
                       WHEN LEFT([text],70) LIKE '%PrsQry%' THEN 'PrsQry'
                       WHEN LEFT([text],70) LIKE '%SalQry%' THEN 'SalQry'
                       WHEN LEFT([text],70) LIKE '%PurQry%' THEN 'PurQry'
                       WHEN LEFT([text],70) LIKE '%@BatchID%' THEN 'System'
                                  ELSE 'Others' END AS label_graph 
                      FROM query_stats_LOG_2")

close(myconn) 
library(cluster)

#qd <- query.data[,c(1,2,3,5,6)]
qd <- query.data[,c(1,2,6)]

## hierarchical clustering
qd.use <- query.data[,c(1,2,6)]
medians <- apply(qd.use,2,median)
#mads <- apply(qd.use,2,mad)
qd.use <- scale(qd.use,center=medians) #,scale=mads)

#calculate distances
query.dist <- dist(qd.use)

# hierarchical clustering
query.hclust <- hclust(query.dist)

# plotting solution
op <- par(bg = "lightblue")
plot(query.hclust,labels=query.data$label_graph,
main='Query Hierarchical Clustering', ylab = 'Distance', 
xlab = ' ', hang = -1, sub = "" )
# in addition to circle queries within cluster
rect.hclust(query.hclust, k=3, border="DarkRed")        

This produces the following plot:

[Plot: hierarchical clustering dendrogram of the queries]

The graph itself is self-explanatory: based on the gathered statistics and the queries executed against the system, you get groups of queries in which your DBA can quickly and easily track down what might be causing issues. I added some labels to the queries so the graph looks neater, but that is up to you.

I have also changed the plot type to “triangle” to get the following plot:

[Plot: triangle-type dendrogram of the queries]

Both show the same information.
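The choice of k = 3 in rect.hclust above was made visually. One way to sanity-check the number of clusters (a sketch of my own, not part of the original demo) is to cut the dendrogram at several values of k and compare the average silhouette widths:

# explore candidate numbers of clusters by cutting the dendrogram
# and comparing average silhouette widths
library(cluster)  # already loaded above

avg_sil <- sapply(2:6, function(k) {
  clusters <- cutree(query.hclust, k = k)
  mean(silhouette(clusters, query.dist)[, "sil_width"])
})
names(avg_sil) <- 2:6
avg_sil  # a higher average silhouette width suggests a better k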

So the R code suggests there are three clusters, and I used medians to center the data. In addition, I have also tested the result with partitioning around medoids (PAM, an alternative to hierarchical clustering), and both techniques yield clean clusters.
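A minimal sketch of how that PAM cross-check could look (pam comes from the cluster package loaded above; k = 3 is taken from the dendrogram):

# partitioning around medoids on the same scaled data, with k = 3
query.pam <- pam(qd.use, k = 3)

# medoids and cluster sizes
query.pam$medoids
table(query.pam$clustering)

# silhouette plot to judge cluster quality
plot(silhouette(query.pam), main = "Silhouette plot for PAM clustering")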


Also, the data sample is relatively small, but you are very welcome to test this idea in your own environment. Just go easy on the FREEPROCCACHE and FLUSHPROCINDB commands!

This blog post was meant as a teaser, to gather opinions from readers. A couple of additional approaches will be part of two articles that I am currently working on.

As always, the code is available on GitHub.

Happy SQLinRg!

SQL Saturday Vienna 2017 #sqlsatVienna

SQL Saturday Vienna 2017 is just around the corner. On Friday, January 20, 2017, many local and international speakers will gather to deliver sessions relating to SQL Server and all related services. With a great agenda – available here – attendees will surely enjoy the different topics and have the opportunity to talk to the Austrian PASS and SQL community, the speakers, and SQL Server MVPs.


 

My session at SQL Sat Vienna 2017 will focus on what database administrators can gain from R integration with SQL Server 2016. We will look at how statistics from the main DBA tasks can be gathered, stored and later analyzed for better prediction, at uncovering patterns in the baseline that might otherwise be overlooked, and of course at how to play with information gathered from the Query Store and DMV query plans. The session will also include field examples that any enterprise might encounter.

This year, I will have the pleasure of doing the pre-con on Thursday, January 19, 2017 at the JUFA Hotel in Vienna: a full-day workshop on SQL Server and R integration with all the major topics covered – where and how to start using R, how the R integration works, a deep dive into the R packages for high-performance work, and a dive into statistics, from univariate to multivariate, as well as methods for data mining and machine learning. Everybody is welcome to join; it will be a great day for a workshop! 🙂

 

[Screenshot: Eventbrite page for the pre-con “BI and Analytics with SQL Server and R”, Thu, 19 Jan 2017]

Tickets are available here via Eventbrite.

 

Falco is already playing Vienna Calling   🙂

 

 

Detecting outliers and fraud with R and SQL Server on my bank account data – Part 1

Detecting outliers and fraudulent behaviour (transactions, purchases, events, actions, triggers, etc.) takes a great deal of experience and a solid statistical/mathematical background.

One of the samples Microsoft provided with the release of SQL Server 2016 used the simple logic of Benford’s law. This law works great with naturally occurring numbers and can be applied across many kinds of problems. By naturally occurring, I mean numbers that are not generated mechanically, such as page numbers in a book, incremented numbers in your SQL table, or sequence numbers of any kind, but numbers that occur independently of each other in nature (lengths or widths of trees, mountains, rivers), lengths of roads in cities, addresses in your home town, city/country populations, etc. The law gives the log distribution of leading digits from 1 to 9 and stipulates that the digit one will occur about 30% of the time, the digit two about 17% of the time, the digit three about 12% of the time, and so on. Randomly generated numbers, in contrast, will produce each leading digit from 1 to 9 with a probability of roughly 1/9. The law might also not work where there are restrictions; for example, height expressed in inches will surely not follow the Benford distribution. My height is 188 cm, which is 74 inches or 6ft2, and none of these numbers will follow the expected distribution, even though height is a natural phenomenon.

So the probability of a number starting with n equals log(n+1) – log(n), with base 10. Keeping this formula in mind, the probability that a number starts with 29 is log(30) – log(29) = 1.4771 – 1.4624 = 0.0147. But keep in mind that this law applies only to datasets where the leading digit 1 appears approximately 30% of the time.
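A quick check of that arithmetic in R:

# probability that a number starts with the digits 29, per Benford's law
log10(30) - log10(29)
# ~ 0.0147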

As a quick example to support this theory, we will take the following:

benfordL <- function(d){return(log(1 + 1/d)/log(10))}
#distribution of numbers according to formula
benfordL(1:9)
#plot with numbers
plot(benfordL(1:9))

Scatterplot with  log distribution of numbers from 1 to 9:

[Scatterplot: Benford log distribution of digits 1 to 9]

Code for Benford’s Law is available at RosettaCode for your favorite programming language.

Now I want to check this rule against the data in WideWorldImporters. I will use the following query to test Benford’s Law:

SELECT TransactionAmount
FROM [WideWorldImporters].[Sales].[CustomerTransactions]
WHERE
    TransactionAmount > 0

For this, I will execute the following R script via sp_execute_external_script:

DECLARE @RScript NVARCHAR(MAX)
SET @RScript = N'
            WWWTrans <- InputDataSet
            get_first_num <- function(number){return(as.numeric(substr(number,1,1)))}
            lead_dig <- mapply(get_first_num,WWWTrans$TransactionAmount)
            obs_freq_WWWTrans <- table(lead_dig)/length(lead_dig) # relative frequencies
            OutputDataSet <- data.frame(obs_freq_WWWTrans)'

DECLARE @SQLScript NVARCHAR(MAX)
SET @SQLScript = N'
        SELECT 
            TransactionAmount
        FROM [WideWorldImporters].[Sales].[CustomerTransactions]
        WHERE 
            TransactionAmount > 0'

 EXECUTE sp_execute_external_script
          @language = N'R'
        , @script = @RScript
        , @input_data_1 = @SQLScript
        
WITH RESULT SETS ((
                     lead_dig INT
                    ,Freq DECIMAL(10,2)
));

 

Comparing the general Benford’s Law distribution with the distribution of TransactionAmount, there are some discrepancies, but in general they follow the same log distribution.

[Plot: observed leading-digit distribution of TransactionAmount vs. Benford’s Law]
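A minimal sketch of how that comparison could be made explicit in a local R session, assuming the TransactionAmount values have been pulled into a data frame called WWWTrans (this check is my addition, not part of the original post):

# expected Benford proportions for leading digits 1..9
benford_expected <- log10(1 + 1/(1:9))

# observed proportions of leading digits in TransactionAmount
lead_dig <- as.numeric(substr(WWWTrans$TransactionAmount, 1, 1))
obs_prop <- as.numeric(table(factor(lead_dig, levels = 1:9))) / length(lead_dig)

# side-by-side comparison
round(cbind(digit = 1:9, observed = obs_prop, benford = benford_expected), 4)

# simple visual check
barplot(rbind(observed = obs_prop, benford = benford_expected), beside = TRUE,
        names.arg = 1:9, legend.text = TRUE,
        main = "Leading digit distribution vs. Benford's Law")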

In Part 2, I will cover outlier detection with GLM and random forests using my bank account dataset.

For Benford’s Law, there are a couple of R packages available: benford.analysis and BenfordTests.

Microsoft MVP Award

On October 1st, 2016, I was humbled to receive the Microsoft MVP Award. And today, October 17th, a special package awaited me at the post office: the award itself.

The actual certificate for the MVP award in the Microsoft Data Platform category (SQL Server).


Besides the award, many goodies were also included in the package. One of them is stickers. I am sure there will be plenty of computers to put these stickers on. And bikes, fixed-gear bikes, coffee mugs, etc. 🙂


 

What does the award mean to me:

  1. it means to work with passion
  2. it means combining  the love of the work with  expertise / knowledge
  3. it is gratitude to what I have been doing on the Microsoft SQL Server platform
  4. it is a confirmation and – yes – an award to what I have been doing so far.

 

In terms of changes, nothing in particular has changed with this award. Keeping the passion for what I have been doing so far is also my plan for the future. I will continue to strive to do better and keep improving in everything that I do: helping others, taking care of the community (remember, sharing is caring), and of course continuing to spread the passion for T-SQL and statistics.

Thanks to everyone: the worldwide SQL Family and SQL, T-SQL, NoSQL, U-SQL, PowerShell, data mining, Python, machine learning, Azure, Spark, .NET, C#… well, you get the picture… friends, people inspiring me, people pushing me, and a special thanks to Microsoft for this award. I have to stop before it gets really corny.

Unfortunately, no T-SQL or R code is available for this blog post. 🙂