Using R in SQL Server Reporting Services (SSRS)

SQL Server Reporting Services (SSRS) is an outstanding tool for creating, deploying and managing paginated, mobile and KPI reports, as well as Power BI reports. The tool provides a simple way to share data insights across your corporate environment.

[Figure: SSRS report types working all together (photo source: Microsoft Docs)]

Using the power of the R language to enrich your data, your statistical analysis or your visualizations is a simple way to get more out of your reports.

The best practice for embedding R code into an SSRS report is to create a stored procedure and output its results to the report. To demonstrate this, we will create two reports: one that takes two input parameters, and a second one that demonstrates the use of R visualization.

First, we will create a designated database and generate some sample data:

CREATE DATABASE R_SSRS;
GO
USE R_SSRS;
GO

CREATE TABLE R_data (
 v1 INT ,v2 INT,v_low INT,v_high INT,letter CHAR(1));
GO

CREATE OR ALTER PROCEDURE generate_data
AS
BEGIN
  INSERT INTO R_data(v1,v2,v_low,v_high,letter)
  SELECT TOP 10
    CAST(v1.number*RAND() AS INT) AS v1
   ,CAST(v2.number*RAND() AS INT) AS v2
   ,v1.low AS v_low
   ,v1.high AS v_high
   ,SUBSTRING(CONVERT(varchar(40), NEWID()),0,2) AS letter
  FROM master..spt_values AS v1
  CROSS JOIN master..spt_values AS v2
  WHERE  v1.[type] = 'P' AND v2.[type] = 'P'
 ORDER BY NEWID() ASC;
END
GO
EXEC generate_data;
GO 10
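If you want to eyeball the generated rows from R before wiring up the report, the same RODBC pattern used later in this post works here as well. A quick sketch; the server name in the connection string is a placeholder:

library(RODBC)
# connection string values are hypothetical; adjust to your environment
myconn <- odbcDriverConnect("driver={SQL Server};Server=localhost;
database=R_SSRS;trusted_connection=true")
r_data <- sqlQuery(myconn, "SELECT TOP 10 * FROM R_data")
close(myconn)
head(r_data)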

For the first report we will create a dynamic report, where the user will be able to select two parameters. Based on the selected parameters, the resulting dataset will be passed to R, where additional R code will generate summary statistics.

The idea is to have the values “v1” and “letter” parameterized:

SELECT * FROM R_data
WHERE 
 v1 > 400 -- replace this with parameter; e.g.: @v1
AND letter IN ('1','2','3') -- replace this with parameter; e.g.: @letters

So in this case we will create a stored procedure that gets the data with T-SQL and calls the external procedure to execute the R code. The procedure looks like this:

CREATE OR ALTER PROCEDURE sp_R1(
   @v1 INT
  ,@lett VARCHAR(20)
) AS
BEGIN
 DECLARE @myQuery NVARCHAR(1000)

  CREATE TABLE #t (let CHAR(1))
  INSERT INTO #t
  SELECT value
  FROM STRING_SPLIT(@lett,',')
 
 SET @myQuery = N'
    SELECT * FROM R_data
    WHERE 
    v1 > '+CAST(@v1 AS VARCHAR(10))+'
    AND letter IN (SELECT * FROM #t)'

EXEC sp_execute_external_script
 @language = N'R'
 ,@script = N'
   df <- InputDataSet 
   OutputDataSet <- data.frame(summary(df))'
  ,@input_data_1 = @myQuery
WITH RESULT SETS
((v1 NVARCHAR(100)
 ,v2 NVARCHAR(100)
 ,freq NVARCHAR(100)))
END;
GO

By testing, we can see that passing both parameters (one single-valued and one multi-valued) returns the result of the R code:

EXEC sp_R1 @v1 = 200, @lett = '1,2,3'
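To see locally what shape the R script hands back (and why the result set is declared as three NVARCHAR columns), here is a small sketch with made-up data:

# hypothetical data; data.frame(summary(df)) yields columns Var1, Var2, Freq
df <- data.frame(v1 = c(12, 450, 301), v2 = c(3, 88, 19), letter = c("1","2","3"))
out <- data.frame(summary(df))
head(out)   # these three columns map to v1, v2 and freq in WITH RESULT SETS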

Both of these parameters will be used in the report for selecting (or filtering) the data. In SQL Server 2016 Report Builder, we create a blank report, add a data source and create a new dataset using the stored procedure:

[Figure: creating the dataset from the stored procedure in Report Builder]

Once the dataset is added and the stored procedure tested, Report Builder will automatically create the report parameters:

[Figure: automatically created report parameters]

In addition, we need to set up the parameters: we add an extra dataset for the parameters, so they have a controlled set of available values, and attach this pool of values to each parameter by selecting the dataset the values will be taken from.

[Figure: binding the parameter's available values to a dataset]

For the multi-valued parameter, we also need to specify that the user can select more than one value at a time. To let Report Builder know that the “lett” parameter will be passed back as a string of multiple values, we must select “Allow multiple values”.

[Figure: enabling “Allow multiple values” for the lett parameter]

Once this is completed, both parameters are shown as drop-down selection boxes, and the user gets the results back based on the selection made.

[Figure: parameter drop-down selection at run time]

In the end, the report should look very similar to this, where the table is the result generated by the R code; the rest is T-SQL results and the beauty of Reporting Services.

[Figure: finished report combining T-SQL results and the R-generated table]

For the second report, we will create a new stored procedure that generates a boxplot for the values retrieved from the user selection.

The procedure remains roughly the same; only the R code changes in order to return the graph (the complete procedure is in the accompanying code):

EXEC sp_execute_external_script
 @language = N'R'
 ,@script = N'
  df <- InputDataSet 
  image_file <- tempfile()
  jpeg(filename = image_file, width = 400, height = 400)
  boxplot(df$v1~df$letter)
  dev.off()
  OutputDataSet <- data.frame(data=readBin(file(image_file, "rb"), what=raw(), n=1e6))'
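If you want to test this plot-to-bytes pattern outside SQL Server first, here is a minimal local sketch with made-up data; the raw byte vector is what lands in the VARBINARY output column:

# draw to a temp file, then read the file back as raw bytes
df <- data.frame(v1 = rpois(60, 100), letter = sample(c("A","B","C"), 60, replace = TRUE))
image_file <- tempfile(fileext = ".jpg")
jpeg(filename = image_file, width = 400, height = 400)
boxplot(df$v1 ~ df$letter)
dev.off()
plot_bytes <- readBin(image_file, what = raw(), n = file.info(image_file)$size)
length(plot_bytes)   # the payload SSRS will render as an image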

In addition, on our new report we need to show the returned value of this procedure (the result) as a graph.

On the report canvas, drag in an image and change its settings as follows (the image source is the database field returned by the procedure, with the MIME type set to image/jpeg):

[Figure: image properties for the R-generated plot]

Users will now be able to select the values, and based on their selection, the R graph generated inside the SSRS report changes accordingly.

The next figure shows just this: how simple it is for the user to change the selection (add or remove values), and the boxplot is redrawn with the updated values.

[Figure: boxplot redrawn after a changed parameter selection]

Both demos should help you leverage R scripts and R visualizations in your corporate reporting and bring some additional user-interface dynamics into it.

As always, the complete code and reports are available on GitHub. The code also holds my randomly generated dataset, in case you want to get the same results as in this blog post.

Happy RCoding 🙂


Animated barplot and Google map with R

It might happen that you need an animated graph of some kind. For plotting demographic data and changes through the years on a Google map, plotting the maps and then merging and converting the image files into an animated GIF will surely give a nice visualization effect.

Here is a sample of changes over a period of three years on a dataset of my home town; the graph can tell a little bit more than the numbers 🙂

[Figure: animated Google map (ggmap) of demographic changes]

Besides visualizations based on geographical data, plots from any R package can be animated in a similar way.

Playing and goofing around, I created a sample dataset:

# dataset
d <- data.frame(val=c(2,3,4,3,4,3,4,3,2,1), 
                year=c(2015,2016,2017,2015,2016,2015,2015,2016,2017,2015)
                )

Once that is created, a function does the trick:

setwd("C:/DataTK")
library(ggplot2)
library(dplyr)

plot.bar.2 <- function(df_plot, xvar, yvar, fill)
{
  require(ggplot2)
  require(dplyr)
  #number of steps - based on values in X-axis
  x_unique <- unique(xvar)    # xvar is the year vector passed in (e.g. d$year)
  nof_steps <- length(x_unique)

  for (i in 1:nof_steps) 
  {
    x <- as.integer(x_unique[i])
    
    # note: column names year and val of df_plot are assumed (see the call below)
    d1 <- df_plot %>%
          arrange(year) %>%
          filter(year<=x) %>%
          mutate(new_val = val) %>%
          select(year, new_val)
    
    d2 <- df_plot %>%
          arrange(year) %>%
          filter(year>x) %>%
          mutate(new_val = 0) %>%
          select(year, new_val)
    
    dfinal <- union_all(d1, d2)
    dfinal <- data.frame(dfinal)
    colnames(dfinal)[1] <- "x"
    colnames(dfinal)[2] <- "y"
    
    name <- sprintf("barplot%03d.png", i)   # zero-padded so frames sort correctly
    png(name)
    p <- ggplot(data=dfinal, aes(x=x, y=y, fill=x)) + geom_bar(stat="identity") + 
        guides(fill=FALSE)
    print(p)    # inside a function, ggplot objects must be printed explicitly
    dev.off()
    rm(d1,d2,x,dfinal) 
  }

  system("magick -delay 150 -loop 0 *.png GeomBar_plot.gif")
  file.remove(list.files(pattern=".png"))
  rm(x_unique, nof_steps)   
}

plot.bar.2(d, d$year, d$val, d$year)

The result is a bit more “interactive”, animated graph:

[Figure: animated barplot GIF]

Compare it to the static graph:

[Figure: static barplot]

In addition to the R code, the ImageMagick program needs to be installed on your machine. The speed, quality and many other parameters can be set when creating the animated GIF.
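If you prefer to stay inside R, the magick package can assemble the frames without shelling out. A sketch, assuming the PNG frames from the loop above are still in the working directory (the loop deletes them after calling ImageMagick, so skip that step first):

library(magick)
frames <- image_read(sort(list.files(pattern = "\\.png$")))
anim   <- image_animate(frames, fps = 1)   # roughly comparable to -delay 150 above
image_write(anim, "GeomBar_plot.gif")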

An animated GIF can also be included in your SSRS report, your SharePoint site or any other site (like my blog 🙂) and it will stay animated. In Power BI, unfortunately, importing an animated GIF as a picture will not work.

As always, code is available at GitHub.

Happy R-coding!

Clustering executed SQL Server queries using R as a tool for grouping similar query patterns

When analyzing query execution performance, there are many ways to find which queries might cause unwanted load or stall the server.

To encourage the DBA community to start exploring the advantages of the R language and the world of data science, I have created a demo showing how statistics on numerous queries can be stored for later analysis. The demo uses an unsupervised (or undirected) method for grouping similar query statistics (or queries), making it easier to track down which queries might be potential system stoppers.

Before we run some queries to generate their running statistics, we clean the cache and prepare a table for storing the statistics from sys.dm_exec_query_stats.

-- drop statistics table
DROP TABLE IF EXISTS  query_stats_LOG_2;
DROP PROCEDURE IF EXISTS AbtQry;
DROP PROCEDURE IF EXISTS SalQry;
DROP PROCEDURE IF EXISTS PrsQry;
DROP PROCEDURE IF EXISTS OrdQry;
DROP PROCEDURE IF EXISTS PurQry;
GO

-- Clean all the stuff only for the selected db
DECLARE @dbid INTEGER
SELECT @dbid = [dbid] 
FROM master..sysdatabases 
WHERE name = 'WideWorldImportersDW'
DBCC FLUSHPROCINDB (@dbid);
GO

-- for better sample, check that you have Query store turned off
ALTER DATABASE WideWorldImportersDW SET  QUERY_STORE = OFF;
GO

Next, we create the stored procedures that will serve as a sample workload:

USE WideWorldImportersDW;
GO

-- CREATE Procedures 

CREATE PROCEDURE AbtQry
(@AMNT AS INTEGER) 
AS

-- an arbitrary query
SELECT 
  cu.[Customer Key] AS CustomerKey
  ,cu.Customer
  ,ci.[City Key] AS CityKey
  ,ci.City
  ,ci.[State Province] AS StateProvince
  ,ci.[Sales Territory] AS SalesTeritory
  ,d.Date
  ,d.[Calendar Month Label] AS CalendarMonth
  ,s.[Stock Item Key] AS StockItemKey
  ,s.[Stock Item] AS Product
  ,s.Color
  ,e.[Employee Key] AS EmployeeKey
  ,e.Employee
  ,f.Quantity
  ,f.[Total Excluding Tax] AS TotalAmount
  ,f.Profit
 
FROM Fact.Sale AS f
  INNER JOIN Dimension.Customer AS cu
    ON f.[Customer Key] = cu.[Customer Key]
  INNER JOIN Dimension.City AS ci
    ON f.[City Key] = ci.[City Key]
  INNER JOIN Dimension.[Stock Item] AS s
    ON f.[Stock Item Key] = s.[Stock Item Key]
  INNER JOIN Dimension.Employee AS e
    ON f.[Salesperson Key] = e.[Employee Key]
  INNER JOIN Dimension.Date AS d
    ON f.[Delivery Date Key] = d.Date
WHERE
    f.[Total Excluding Tax] BETWEEN 10 AND @AMNT;
GO

CREATE PROCEDURE SalQry
(@Q1 AS INTEGER
,@Q2 AS INTEGER)

AS
-- FactSales Query
SELECT * FROM Fact.Sale
WHERE
    Quantity BETWEEN @Q1 AND @Q2;
GO

CREATE PROCEDURE PrsQry
(@CID AS INTEGER )
AS

-- Person Query
SELECT * 
    FROM [Dimension].[Customer]
    WHERE [Buying Group] <> 'Tailspin Toys' 
    /* OR [WWI Customer ID] > 500 */
    AND [WWI Customer ID] BETWEEN 400 AND  @CID
ORDER BY [Customer],[Bill To Customer];
GO


CREATE PROCEDURE OrdQry
(@CK AS INTEGER)
AS

-- FactSales Query
SELECT 
    * 
    FROM [Fact].[Order] AS o
    INNER JOIN [Fact].[Purchase] AS p 
    ON o.[Order Key] = p.[WWI Purchase Order ID]
WHERE
    o.[Customer Key] = @CK;
GO

CREATE PROCEDURE PurQry
(@Date AS SMALLDATETIME)
AS

-- FactPurchase Query
SELECT *
    FROM [Fact].[Purchase]
        WHERE
        [Date Key] = @Date;
    --[Date KEy] = '2015/01/01'
GO

Now we run the procedures a couple of times:

DECLARE @ra DECIMAL(10,2)
SET @ra = RAND()
SELECT CAST(@ra*10 AS INT)

IF @ra  < 0.3333
    BEGIN
       -- SELECT 'RAND < 0.333', @ra
       DECLARE @AMNT_i1 INT = 100*CAST(@ra*10 AS INT)
       EXECUTE AbtQry @AMNT = @AMNT_i1
       EXECUTE PurQry @DAte = '2015/10/01'
       EXECUTE PrsQry @CID = 480
       EXECUTE OrdQry @CK = 0
       DECLARE @Q1_i1 INT = 1*CAST(@ra*10 AS INT)
       DECLARE @Q2_i1 INT = 20*CAST(@ra*10 AS INT)
       EXECUTE SalQry @Q1 = @Q1_i1, @Q2 = @Q2_i1

    END
ELSE 
    IF @ra  > 0.3333 AND @ra < 0.6667
    BEGIN
        -- SELECT 'RAND > 0.333 | < 0.6667', @ra
        DECLARE @AMNT_i2 INT = 500*CAST(@ra*10 AS INT)
        EXECUTE AbtQry @AMNT = @AMNT_i2
        EXECUTE PurQry @DAte = '2016/04/29'
        EXECUTE PrsQry @CID = 500
        EXECUTE OrdQry @CK = 207
        DECLARE @Q1_i2 INT = 2*CAST(@ra*10 AS INT)
        DECLARE @Q2_i2 INT = 10*CAST(@ra*10 AS INT)
        EXECUTE SalQry @Q1 = @Q1_i2, @Q2 = @Q2_i2

    END
ELSE
    BEGIN
        -- SELECT 'RAND > 0.6667', @ra
        DECLARE @AMNT_i3 INT = 800*CAST(@ra*10 AS INT)
        EXECUTE AbtQry @AMNT = @AMNT_i3
        EXECUTE PurQry @DAte = '2015/08/13'
        EXECUTE PrsQry @CID = 520
        EXECUTE OrdQry @CK = 5
        DECLARE @Q2_i3 INT = 60*CAST(@ra*10 AS INT)
        EXECUTE SalQry @Q1 = 25, @Q2 = @Q2_i3
    END
GO 10

With this workload executed, we run the logging T-SQL (query source):

SELECT
    (total_logical_reads + total_logical_writes) AS total_logical_io
    ,(total_logical_reads / execution_count) AS avg_logical_reads
    ,(total_logical_writes / execution_count) AS avg_logical_writes
    ,(total_physical_reads / execution_count) AS avg_phys_reads
    ,substring(st.text,(qs.statement_start_offset / 2) + 1,  
      ((CASE qs.statement_end_offset WHEN -1 THEN DATALENGTH(st.text)
        ELSE qs.statement_end_offset END - qs.statement_start_offset) / 2) + 1)
AS statement_text
    ,*
INTO query_stats_LOG_2
FROM
        sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY total_logical_io DESC

Now that we have gathered some test data, we can proceed to the clustering analysis.

Since I don’t know how many clusters there might be, and I can imagine a DBA would be pretty much clueless too, I will explore the number of clusters with the following R code:

library(RODBC)
myconn <-odbcDriverConnect("driver={SQL Server};Server=SICN-KASTRUN;
database=WideWorldImportersDW;trusted_connection=true")

query.data <- sqlQuery(myconn, "
                     SELECT 
                                 [total_logical_io]
                      ,[avg_logical_reads]
                      ,[avg_phys_reads]
                      ,execution_count
                      ,[total_physical_reads]
                      ,[total_elapsed_time]
                      ,total_dop
                      ,[text]
                      ,CASE WHEN LEFT([text],70) LIKE '%AbtQry%' THEN 'AbtQry'
                       WHEN LEFT([text],70) LIKE '%OrdQry%' THEN 'OrdQry'
                       WHEN LEFT([text],70) LIKE '%PrsQry%' THEN 'PrsQry'
                       WHEN LEFT([text],70) LIKE '%SalQry%' THEN 'SalQry'
                       WHEN LEFT([text],70) LIKE '%PurQry%' THEN 'PurQry'
                       WHEN LEFT([text],70) LIKE '%@BatchID%' THEN 'System'
                                  ELSE 'Others' END AS label_graph 
                      FROM query_stats_LOG_2")

close(myconn) 
library(cluster)

#qd <- query.data[,c(1,2,3,5,6)]
qd <- query.data[,c(1,2,6)]

## hierarchical clustering
qd.use <- query.data[,c(1,2,6)]
medians <- apply(qd.use,2,median)
#mads <- apply(qd.use,2,mad)
qd.use <- scale(qd.use,center=medians) #,scale=mads)

#calculate distances
query.dist <- dist(qd.use)

# hierarchical clustering
query.hclust <- hclust(query.dist)

# plotting solution
op <- par(bg = "lightblue")
plot(query.hclust,labels=query.data$label_graph,
main='Query Hierarchical Clustering', ylab = 'Distance', 
xlab = ' ', hang = -1, sub = "" )
# in addition to circle queries within cluster
rect.hclust(query.hclust, k=3, border="DarkRed")        

This produces the following plot:

[Figure: hierarchical clustering dendrogram of the queries, three clusters outlined]

The graph is self-explanatory: based on the gathered statistics of queries executed against the system, you receive groups of queries within which your DBA can quickly track down what might be causing issues. I added labels to the queries to make the graph neater, but that is up to you.

I have also changed the type to “triangle” to get the following plot:

[Figure: the same dendrogram drawn with triangle branches]

Both plots show the same information.

So the R code suggests there are three clusters, and I used medians to center the data. In addition, I have also tested the result with partitioning around medoids (PAM, a partitioning counterpart to hierarchical clustering), and both techniques yield clean clusters.
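For reference, here is a sketch of those checks, reusing qd.use, query.hclust and query.data from the code above; the k-means elbow check at the end is my addition for exploring the number of clusters, not part of the original demo:

# cluster membership for k = 3, matching the rectangles in the dendrogram
groups <- cutree(query.hclust, k = 3)
table(groups, query.data$label_graph)

# partitioning around medoids on the same scaled data
library(cluster)
query.pam <- pam(qd.use, k = 3)
table(query.pam$clustering, query.data$label_graph)
plot(query.pam)   # clusplot and silhouette plot

# a quick elbow check: where does within-cluster variance stop dropping?
wss <- sapply(1:8, function(k) kmeans(qd.use, centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "number of clusters k",
     ylab = "within-cluster sum of squares")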

[Figure: PAM clustering results in RStudio]

Also, the data sample is relatively small, but you are very welcome to test this idea in your environment. Just go easy with the FREEPROCCACHE and FLUSHPROCINDB commands!

This blog post was meant as a teaser, to gather opinions from readers. A couple of additional approaches will be part of two articles I am currently working on.

As always, the code is available on GitHub.

Happy SQLinRg!

R graphs and tables in Power BI Desktop

Power BI Desktop enables users to use the R script visual for adding custom visualizations generated with the R language, regardless of the R package used. Before using the R script visual, you will need to enable it by setting the path to the R engine on your client in the global options. Once this is done, you will be able to enhance your Power BI reports with R visualizations.

[Figure: enabling R scripting and the R home directory in Power BI Desktop options]

My personal favorite R packages to use in Power BI are shiny, plotly, ggplot2, googleVis, ggiraph, rCharts and d3js. Apart from finding the visualization package where you feel at home, one of the great things I find with Power BI + R is the ability to have visualizations react dynamically (more or less as if you were using plotly or shiny) to slicers and selectors. This makes data exploration and data wrangling so much easier.

With October 2016, Power BI custom visuals got an additional 7 R visuals in the family (check the Power BI Visuals Gallery and fall in love with your favorite one 🙂).

[Figure: R-powered visuals in the Power BI Visuals Gallery]

Upon using any of the additional R visuals, Power BI will, prior to importing the visualization, ask to install any missing packages, and will also inform you of any incompatibility between the R engine and a package.

So how easy is it to create an R visualization in Power BI? Let us first import data into Power BI from SQL Server. I have generated a sample dataset for this demo; the data and all the samples are available on GitHub.

[Figure: importing the sample data from SQL Server]

With the sample dataset, I have created a simple data visualization with slicers using Power BI.

[Figure: native Power BI visualization with slicers]

Now, the same visualization using R.

[Figure: the same visualization rendered with R]

Note that I have used the same slicers (Year = 2016, Month = 2), and the R graph (using ggplot2) dynamically visualized the filtered numbers. This means that any R-visualized graph reacts to navigation and is dynamically updated, which is absolutely great, especially for data exploration. The following R code was used:

library(ggplot2)
# color sits outside aes(), so it is a fixed line color rather than a mapped legend entry
ggplot(data=dataset, aes(x=Day, y=Value)) + geom_line(color="red") + 
ggtitle("Value by day") + theme(legend.position = "none") + 
theme(panel.background = element_rect(fill = "white", color = "lightgray", size = 1))

The conversion of the input dataset to an R data.frame is done automatically.

[Figure: the R script editor in Power BI Desktop]

Where it gets tricky is when you want to return an R data.frame to your Power BI page in the form of a table (imagine you want some results next to your graph, or you are calculating some statistics). Power BI does not allow returning tables from the R script visual, unless the table is an image (that is, rendered in the “form” of a “graph”).

So I managed to create a table as a graph using R and the qplot function from the ggplot2 package.

[Figure: data.frame rendered as a table image in Power BI]

The best part of this table is that it is dynamically updated based on selections in graphs or slicers. The downside is that it cannot be scrolled within the object, so if you end up with a bigger table, you have to tweak the font size or show the table aggregated or grouped. In my opinion, I would not replace a Power BI table with an R-created one, because the Power BI table is simply better and easier to create and maintain. But I would definitely recommend using R tables to support your R visuals. For example: if you are generating clusters using R in Power BI, use an R-generated table to expose additional statistics on the clusters (see the sketch after the code below).

The code for generating this R table is:

library(gridExtra)
library(ggplot2)

qplot(x=1:5, y=1:5, geom = "blank") + 
theme(line = element_blank(), text = element_blank()) + 
annotation_custom(grob=tableGrob(dataset))

The script generates a blank plot, to which a custom annotation is added via the grob argument; this is our data.frame from the dataset, displayed on top of the empty graph. To “remove” the background of the blank graph, blank line and text elements are added:

theme(line = element_blank(), text = element_blank())
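As an illustration of the cluster-statistics idea mentioned above, here is a hedged sketch reusing the same table-as-image pattern; the kmeans step and the column names Day and Value (taken from this demo's dataset) are my assumptions, not part of the original report:

library(gridExtra)
library(ggplot2)

# cluster the visual's input and expose per-cluster means as an image table
km <- kmeans(dataset[, c("Day", "Value")], centers = 3)
cluster_stats <- aggregate(dataset[, c("Day", "Value")],
                           by = list(cluster = km$cluster), FUN = mean)
qplot(x = 1:5, y = 1:5, geom = "blank") +
  theme(line = element_blank(), text = element_blank()) +
  annotation_custom(grob = tableGrob(cluster_stats))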

Update #1 (19.12.2016)

After reading a comment on this blog post from reader Baptiste Auguie (with a link to his post on displaying tables as grid graphs), it turns out that using grid returns the same result with a much simpler way of drawing the table. Thank you, Baptiste! I must have missed this one.

The R code is far simpler without the qplot function:

library(gridExtra)
grid.table(dataset)

And the screenshot from Power BI:

[Figure: grid.table output rendered in Power BI]

As always, the sample code and dataset are available on GitHub.

Happy PowerBI-Rrrring!