Animated barplot and google map with R

It might happen that you will need a animated graph of any kind. For purposes of plotting demographic data and changes through the years, Google Maps and plotting maps, merging and converting jpg files into a animated gif, sure will give a nice visualization effect.

Here is a sample of changes over the time period of three years on some dataset of my home town and graph can tell little bit more as numbers ūüôā

ggmap_anim

So besides plotting visualizations based on geographical data, additional plots available in any of R packages can be visualized in a similar way.

Playing and goofing around, I created a sample dataset:

# dataset
d <- data.frame(val=c(2,3,4,3,4,3,4,3,2,1), 
                year=c(2015,2016,2017,2015,2016,2015,2015,2016,2017,2015)
                )

Once that, is created, a function would do the trick:

setwd("C:/DataTK")
library(ggplot2)
library(dplyr)

plot.bar.2 <- function(df_plot, xvar, yvar, fill)
{
  require(ggplot2)
  require(dplyr)
  attach(df_plot)
  #number of steps - based on values in X-axis
  x_unique <- unique(df_plot$xvar)   #xvar = year
  nof_steps <- as.integer(length(df_plot$x_unique))

  for (i in 1:nof_steps) 
  {
    x <- as.integer(x_unique[i])
    
    d1 <- df_plot %>%
          arrange(xvar) %>%
          filter(xvar<=x) %>%
          mutate(new_val = val) %>%
          select(xvar, new_val)
    
    d2 <- df_plot %>%
          arrange(xvar) %>%
          filter(xvar>x) %>%
          mutate(new_val = 0) %>%
          select(xvar,new_val)
    
    dfinal <- union_all(d1, d2)
    dfinal <- data.frame(dfinal)
    colnames(dfinal)[1] <- "x"
    colnames(dfinal)[2] <- "y"
    
    name <- paste('barplot00',i,'.png',sep="")
    png(name)
    ggplot(data=dfinal, aes(x=x, y=y, fill=x)) + geom_bar(stat="identity") + 
        guides(fill=FALSE)
    dev.off()
    rm(d1,d2,x,dfinal) 
  }

  system("magick -delay 150 -loop 0 *.png GeomBar_plot.gif")
  file.remove(list.files(pattern=".png"))
  rm(x_unique, nof_steps)   
  detach(df_plot)
}

plot.bar.2(d, d$year, d$val, d$year)

 

A bit more “interactive” or animated graph is created.

barplot

In comparison to static graph:

barplot.png

In addition to R code, the ImageMagic program needs to be installed on your machine, as well. Also the speed, quality and many other parameters can be set, when creating animated gif.

Animated gif can be also included into your SSRS report, your Sharepoint site or any other site – like my blog ūüôā and it will stay interactive. In Power BI, importing animated gif as a picture, unfortunately will not work.

As always, code is available at GitHub.

Happy R-coding!

Advertisements

Using R sp_execute_external_script with JSON

JSON has become part of the SQL Server in the same version as R. Both were very highly anticipated and awaited from the community.

JSON has very powerful statements for converting to and from JSON for storing into / from SQL Server engine (FOR JSON and JSON VALUE, etc).  And since it is gaining popularity for data exchange, I was curious to give it a try with R combination.

I will simply convert a system table into array using for json clause.

SELECT top 10 object_id  FROM sys.objects FOR JSON AUTO;

and it gives back the result:

[{"object_id":3},{"object_id":5},{"object_id":6},{"object_id":7},{"object_id":8},
{"object_id":9},{"object_id":17},{"object_id":18},{"object_id":19},{"object_id":20}]

And sp_execute_external_script query without JSON would look like:

EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script=N'OutputDataSet <- InputDataSet'
      ,@input_data_1 = N'SELECT top 10 object_id  FROM sys.objects'
WITH RESULT SETS ((nr INT));

Now, let’s suppose we want to use JSON result directly into T-SQL using sp_execute_external_script. Yes, imagine getting results from an API and you want to push the results immediately into R for analysis. Very straight-forward package in R is called jsonlite (also available is rjson). Query would be as following:

EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script=N'library(jsonlite)
                OutputDataSet <- data.frame(fromJSON(InputDataSet))'
      ,@input_data_1 = N'SELECT top 10 object_id  FROM sys.objects FOR JSON AUTO'
WITH RESULT SETS ((nr INT));

Nope!

Msg 39004, Level 16, State 20, Line 15
A 'R' script error occurred during execution of 'sp_execute_external_script' 
with HRESULT 0x80004004.
Msg 39019, Level 16, State 1, Line 15
An external script error occurred: 
Error: Argument 'txt' must be a JSON string, URL or file.

So the argument ‘txt’ must be a JSON string, URL or file. Khm…very “useful” error message, but problem is, that data from T-SQL is stored and presented as data.frame to R environment (Launchpad), because the data type passed to R is array of objects. And would look something like:

2017-01-08-22_04_46-rstudio

Running this query in native (R) environment, we at least get the idea where and how to tackle the problem. So we need to convert the data.frame to a charaters using toJSON and as.character, so that the end T-SQL query would look like:

EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script=N'
                library(jsonlite)
                js <- InputDataSet
                js2 <- as.character(toJSON(js))
                OutputDataSet <- data.frame(fromJSON(js2))'
      ,@input_data_1 = N'SELECT top 10 object_id  FROM sys.objects FOR JSON AUTO'
WITH RESULT SETS ((nr INT));

Now we get the correct results (as if we would not used JSON):

2017-01-08-22_22_40-sqlquery7-sql-sicn-kastrun-wideworldimportersdw-spar_si01017988-60_-micr

So R is ready for JSON and JSON is also ready for R.

Happy R+JSON+SQLing!

Clustering executed SQL Server queries using R as tool for

When query execution performance analysis is to be done, there are many ways to find which queries might cause any unwanted load or cause stall on the server.

By encouraging DBA community to start practicing the advantage or R Language and world of data science, I have created a demo to show, how statistics on numerous queries can be stored for later analysis. And this demo has unsupervised (or undirected) method for grouping similar query statistics (or queries) to easier track and find where and which queries might be potential system stoppers.

Before we run some queries to generate the running statistics for these queries, we clean the cache and prepare the table for storing statistics from sys.dm_exec_query_stats.

-- drop statistics table
DROP TABLE IF EXISTS  query_stats_LOG_2;
DROP PROCEDURE IF EXISTS AbtQry;
DROP PROCEDURE IF EXISTS SalQry;
DROP PROCEDURE IF EXISTS PrsQry;
DROP PROCEDURE IF EXISTS OrdQry;
DROP PROCEDURE IF EXISTS PurQry;
GO

-- Clean all the stuff only for the selected db
DECLARE @dbid INTEGER
SELECT @dbid = [dbid] 
FROM master..sysdatabases 
WHERE name = 'WideWorldImportersDW'
DBCC FLUSHPROCINDB (@dbid);
GO

-- for better sample, check that you have Query store turned off
ALTER DATABASE WideWorldImportersDW SET  QUERY_STORE = OFF;
GO

We generate some fake data:

USE WideWorldImportersDW;
GO

-- CREATE Procedures 

CREATE PROCEDURE AbtQry
(@AMNT AS INTEGER) 
AS

-- an arbitrary query
SELECT 
  cu.[Customer Key] AS CustomerKey
  ,cu.Customer
  ,ci.[City Key] AS CityKey
  ,ci.City
  ,ci.[State Province] AS StateProvince
  ,ci.[Sales Territory] AS SalesTeritory
  ,d.Date
  ,d.[Calendar Month Label] AS CalendarMonth
  ,s.[Stock Item Key] AS StockItemKey
  ,s.[Stock Item] AS Product
  ,s.Color
  ,e.[Employee Key] AS EmployeeKey
  ,e.Employee
  ,f.Quantity
  ,f.[Total Excluding Tax] AS TotalAmount
  ,f.Profit
 
FROM Fact.Sale AS f
  INNER JOIN Dimension.Customer AS cu
    ON f.[Customer Key] = cu.[Customer Key]
  INNER JOIN Dimension.City AS ci
    ON f.[City Key] = ci.[City Key]
  INNER JOIN Dimension.[Stock Item] AS s
    ON f.[Stock Item Key] = s.[Stock Item Key]
  INNER JOIN Dimension.Employee AS e
    ON f.[Salesperson Key] = e.[Employee Key]
  INNER JOIN Dimension.Date AS d
    ON f.[Delivery Date Key] = d.Date
WHERE
    f.[Total Excluding Tax] BETWEEN 10 AND @AMNT;
GO

CREATE PROCEDURE SalQry
(@Q1 AS INTEGER
,@Q2 AS INTEGER)

AS
-- FactSales Query
SELECT * FROM Fact.Sale
WHERE
    Quantity BETWEEN @Q1 AND @Q2;
GO

CREATE PROCEDURE PrsQry
(@CID AS INTEGER )
AS

-- Person Query
SELECT * 
    FROM [Dimension].[Customer]
    WHERE [Buying Group] <> 'Tailspin Toys' 
    /* OR [WWI Customer ID] > 500 */
    AND [WWI Customer ID] BETWEEN 400 AND  @CID
ORDER BY [Customer],[Bill To Customer];
GO


CREATE PROCEDURE OrdQry
(@CK AS INTEGER)
AS

-- FactSales Query
SELECT 
    * 
    FROM [Fact].[Order] AS o
    INNER JOIN [Fact].[Purchase] AS p 
    ON o.[Order Key] = p.[WWI Purchase Order ID]
WHERE
    o.[Customer Key] = @CK;
GO

CREATE PROCEDURE PurQry
(@Date AS SMALLDATETIME)
AS

-- FactPurchase Query
SELECT *
    FROM [Fact].[Purchase]
        WHERE
        [Date Key] = @Date;
    --[Date KEy] = '2015/01/01'
GO

Now we run procedures couple of times:

DECLARE @ra DECIMAL(10,2)
SET @ra = RAND()
SELECT CAST(@ra*10 AS INT)

IF @ra  < 0.3333
    BEGIN
       -- SELECT 'RAND < 0.333', @ra
       DECLARE @AMNT_i1 INT = 100*CAST(@ra*10 AS INT)
       EXECUTE AbtQry @AMNT = @AMNT_i1
       EXECUTE PurQry @DAte = '2015/10/01'
       EXECUTE PrsQry @CID = 480
       EXECUTE OrdQry @CK = 0
       DECLARE @Q1_i1 INT = 1*CAST(@ra*10 AS INT)
       DECLARE @Q2_i1 INT = 20*CAST(@ra*10 AS INT)
       EXECUTE SalQry @Q1 = @Q1_i1, @Q2 = @Q2_i1

    END
ELSE 
    IF @ra  > 0.3333 AND @ra < 0.6667
    BEGIN
        -- SELECT 'RAND > 0.333 | < 0.6667', @ra
        DECLARE @AMNT_i2 INT = 500*CAST(@ra*10 AS INT)
        EXECUTE AbtQry @AMNT = @AMNT_i2
        EXECUTE PurQry @DAte = '2016/04/29'
        EXECUTE PrsQry @CID = 500
        EXECUTE OrdQry @CK = 207
        DECLARE @Q1_i2 INT = 2*CAST(@ra*10 AS INT)
        DECLARE @Q2_i2 INT = 10*CAST(@ra*10 AS INT)
        EXECUTE SalQry @Q1 = @Q1_i2, @Q2 = @Q2_i2

    END
ELSE
    BEGIN
        -- SELECT 'RAND > 0.6667', @ra
        DECLARE @AMNT_i3 INT = 800*CAST(@ra*10 AS INT)
        EXECUTE AbtQry @AMNT = @AMNT_i3
        EXECUTE PurQry @DAte = '2015/08/13'
        EXECUTE PrsQry @CID = 520
        EXECUTE OrdQry @CK = 5
        DECLARE @Q2_i3 INT = 60*CAST(@ra*10 AS INT)
        EXECUTE SalQry @Q1 = 25, @Q2 = @Q2_i3
    END
GO 10

And with this data, we proceed to run the logging T-SQL (query source).

SELECT
    (total_logical_reads + total_logical_writes) AS total_logical_io
    ,(total_logical_reads / execution_count) AS avg_logical_reads
    ,(total_logical_writes / execution_count) AS avg_logical_writes
    ,(total_physical_reads / execution_count) AS avg_phys_reads
    ,substring(st.text,(qs.statement_start_offset / 2) + 1,  
      ((CASE qs.statement_end_offset WHEN - 1 THEN datalength(st.text) 
   qs.statement_end_offset END  - qs.statement_start_offset) / 2) + 1) 
AS statement_text
    ,*
INTO query_stats_LOG_2
FROM
        sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY total_logical_io DESC

Now that we have gathered some test data, we can proceed to do clustering analysis.

Since I don’t know how many clusters can there be, and I can imagine, a¬† DBA would also be pretty much clueless, I will explore number of clusters. Following R code:

library(RODBC)
myconn <-odbcDriverConnect("driver={SQL Server};Server=SICN-KASTRUN;
database=WideWorldImportersDW;trusted_connection=true")

query.data <- sqlQuery(myconn, "
                     SELECT 
                                 [total_logical_io]
                      ,[avg_logical_reads]
                      ,[avg_phys_reads]
                      ,execution_count
                      ,[total_physical_reads]
                      ,[total_elapsed_time]
                      ,total_dop
                      ,[text]
                      ,CASE WHEN LEFT([text],70) LIKE '%AbtQry%' THEN 'AbtQry'
                       WHEN LEFT([text],70) LIKE '%OrdQry%' THEN 'OrdQry'
                       WHEN LEFT([text],70) LIKE '%PrsQry%' THEN 'PrsQry'
                       WHEN LEFT([text],70) LIKE '%SalQry%' THEN 'SalQry'
                       WHEN LEFT([text],70) LIKE '%PurQry%' THEN 'PurQry'
                       HEN LEFT([text],70) LIKE '%@BatchID%' THEN 'System'
                                  ELSE 'Others' END AS label_graph 
                      FROM query_stats_LOG_2")

close(myconn) 
library(cluster)

#qd <- query.data[,c(1,2,3,5,6)]
qd <- query.data[,c(1,2,6)]

## hierarchical clustering
qd.use <- query.data[,c(1,2,6)]
medians <- apply(qd.use,2,median)
#mads <- apply(qd.use,2,mad)
qd.use <- scale(qd.use,center=medians) #,scale=mads)

#calculate distances
query.dist <- dist(qd.use)

# hierarchical clustering
query.hclust <- hclust(query.dist)

# plotting solution
op <- par(bg = "lightblue")
plot(query.hclust,labels=query.data$label_graph,
main='Query Hierarchical Clustering', ylab = 'Distance', 
xlab = ' ', hang = -1, sub = "" )
# in addition to circle queries within cluster
rect.hclust(query.hclust, k=3, border="DarkRed")        

And produces the following plot:

2017-01-08-10_46_05-plot-zoom

Graph itself is self explanatory and based on the gathered statistics and queries executed against the system, you receive the groups of queries where your DBA can easily and fast track down what might be causing some issues. I added some labels to the query for the graph to look neater, but it is up to you.

I have also changed the type to “triangle” to get the following plot:

2017-01-08-10_54_55-plot-zoom

And both show same information.

So the R code said that, there are three clusters generating And I used medians to generate data around it. In addition I have also tested the result with Partitioning around medoids (which is opposite to hierarchical clustering) and the results from both techniques yield clean clusters.

2017-01-08-10_46_47-rstudio

Also, the data sample is relatively small, but you are very welcome to test this idea into your environment. Just easy with freeproccache and flushprocindb commands!

This blog post was meant as a teaser,  to gather opinion from the readers. Couple of more additional approaches will be part of two articles, that I am currently working on.

As always, code is available at the Github.

Happy SQLinRg!