Is it possible to use RevoScaleR package in Power BI?

I was invited to deliver a session for the Belgium User Group on SQL Server and R integration. After the session, which we ran online using web-based Citrix, I got an interesting question: “Is it possible to use RevoScaleR performance computational functions within Power BI?” My first answer was a sceptical yes. But I added that I hadn’t used it in this manner yet and that there might be some limitations.

The idea of having the scalable environment and the parallel computational package with all the predictive analytical functions in Power BI is absolutely great. But something tells me that it will not be that straightforward.

So let’s start by taking a large (500 MB) txt file and creating an XDF file from it:

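library(RevoScaleR)
file.name <- "YearPredictionMSD.txt"
rxOptions(sampleDataDir = "C:\\Files")
sampleDataDir <- rxGetOption("sampleDataDir")
# The rest of this script is missing; the import below is an assumed reconstruction
inputFile  <- file.path(sampleDataDir, file.name)
outputFile <- file.path(sampleDataDir, "YearPredictMSD.xdf")  # assumed file name
rxImport(inData = inputFile, outFile = outputFile, overwrite = TRUE)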

The file is available online as a zip file.

Getting data with R script

Open Power BI, choose Get Data -> R Script, and copy and paste the following slightly changed code:

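library(RevoScaleR)
file.name <- "YearPredictionMSD.txt";
rxOptions(sampleDataDir = "C:\\Files");
sampleDataDir <- rxGetOption("sampleDataDir")
# The rest of this script is missing; an assumed reconstruction follows:
# import without outFile, so that rxImport returns a data frame Power BI can load
YearMSD <- rxImport(inData = file.path(sampleDataDir, file.name))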

Paste the code and click OK:

2017-03-20 18_56_17-Untitled - Power BI Desktop

You will have to wait for the data to be read into memory and for the data model to be created. If you monitor memory consumption while patiently waiting, you will notice that this particular dataset (500 MB as text, or 160 MB as XDF) consumes at least 3 GB of RAM before you end up with the preview:

4 - 2017-03-20 19_01_53-

By now, you will also notice that the saved Power BI document takes up to 700 MB of disk space, and that every data visualization consumes additional RAM and time. After you close the Power BI document, you will notice a lot of RAM being released.

Using R Script in the visuals

In a new Power BI document, I create a new dataset using Enter Data, with three “dummy” variables.

7 - 2017-03-20 19_18_23-

With these three variables I will try to inject the data returned from the XDF file and have it represented in Power BI.

After adding a new visual and choosing the R visual, I inserted the following code:

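library(RevoScaleR)
file.name <- "YearPredictionMSD.txt";
rxOptions(sampleDataDir = "C:\\Files");
sampleDataDir <- rxGetOption("sampleDataDir")
# The rest of this script is missing; an assumed reconstruction follows:
# plot a histogram of the year column (V1) straight from the XDF file
outputFile <- file.path(sampleDataDir, "YearPredictMSD.xdf")  # assumed file name
rxHistogram(~V1, data = outputFile)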

And this time, the result is fascinating. R plots the histogram in a split second, which simply means that it takes advantage of the XDF file and injects the result into Power BI.

8 - 2017-03-20 19_26_47-Untitled - Power BI Desktop

This is still an external file or dataset that Power BI does not have a clue about, meaning that no slicers are available for changing the user selection dynamically.

Let’s try to insert the data into those three dummy variables, where the third one will be a factor that I have to prepare in advance. Since the factor in this case is Year, it is relatively easy to do:

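library(RevoScaleR)
library(gridExtra)
library(dplyr)
# Part of this script is missing; the lines below are an assumed reconstruction.
# dataset is the data frame Power BI passes to the R visual; its Year column
# carries the slicer selection.
Year <- unique(as.character(dataset$Year))
# df_f is assumed to hold the cross-tabulation results as a data frame with a
# year column, limited to the three years offered in the slicer
df_f <- df_f %>% filter(year %in% c("2000","2001","2002"))
grid.table(df_f %>% filter(year %in% Year))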

Once I have this inserted in a new R visual, I just need to add a dummy slicer.

9 - 2017-03-20 20_52_52-RevoScale_and_PowerBI - Power BI Desktop
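When a slicer allows several years to be selected, it matters whether the filter uses == or %in%. The following is a minimal base R sketch of the difference, not the script used in the report: df_f stands in for a cross-tabulation result, dataset stands in for the data frame that Power BI passes to an R visual, and all values are made up for illustration.

```r
# Stand-in for the data frame Power BI passes to an R visual as `dataset`;
# here it carries the years currently selected in the slicer (made-up values).
dataset <- data.frame(Year = c("2000", "2002"), stringsAsFactors = FALSE)

# Hypothetical cross-tabulation result, one row per year (made-up counts).
df_f <- data.frame(year  = c("1999", "2000", "2001", "2002"),
                   count = c(10, 20, 30, 40),
                   stringsAsFactors = FALSE)

selected_years <- unique(dataset$Year)

# %in% tests membership, so every selected year is kept.
df_selected <- df_f[df_f$year %in% selected_years, ]

# == recycles the shorter vector element by element and silently drops rows.
df_recycled <- df_f[df_f$year == selected_years, ]

print(df_selected)
print(df_recycled)
```

With the values above, the %in% filter keeps both selected years, while the == filter keeps only one of them, which is why membership tests are the safer choice for slicer-driven filters.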

Now I can easily change the years for my cross-tabulation (using the rxCrossTabs function). The calculation runs in the background on the whole dataset, and the dplyr package is used only to filter the results, so it is also possible to use rxDataStep instead:

rxDataStep(inData=outputFile, outFile="C:\\Files\\YearPredictMSD_Year.xdf", 
             overwrite=TRUE, transforms=list(LateYears = V1 > 1999))
rxCrossTabs(V2~F(LateYears), data = "C:\\Files\\YearPredictMSD_Year.xdf")

In this way, you will be creating a new XDF file through Power BI, with the transformation applied. Bear in mind that this step might take some extra seconds to create the new variable or to make a subset, should you need one. Again, this is up to you to decide, based on the file size.
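To see what the transformation and the cross-tabulation compute, here is a small in-memory counterpart in base R. It is only a sketch under assumptions: the five rows are made up, V1 plays the role of the year and V2 of a numeric column, and xtabs() is used as a stand-in for the sums that a cross-tabulation with V2 on the left-hand side returns.

```r
# Made-up miniature of the data: V1 is the year, V2 a numeric column.
mini <- data.frame(V1 = c(1995, 1998, 2000, 2003, 2007),
                   V2 = c(1.5, 2.5, 3.0, 4.0, 5.0))

# Same transformation as in the rxDataStep call: flag rows after 1999.
mini$LateYears <- mini$V1 > 1999

# xtabs() sums V2 within each level of LateYears.
late_sums <- xtabs(V2 ~ LateYears, data = mini)
print(late_sums)
```

The difference is that this version needs the whole data frame in memory, whereas the XDF-based calls work through the file chunk by chunk.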

Using SQL Server procedure with R Script

This approach is not that uncommon: stored procedures combining T-SQL and R code have already proven to be a useful and powerful way to use SQL Server and R integration within SSRS. Changing the compute context is another way to build a workaround.

Creating the stored procedure:

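CREATE PROCEDURE [dbo].[SP_YearMSD_CrossTab]
AS
BEGIN
    DECLARE @RScript nvarchar(max)
    -- The R script is cut off after library(); the rest is an assumed reconstruction
    SET @RScript = N'
            library(RevoScaleR)
            crossTab <- rxCrossTabs(V2 ~ F(LateYears), data = "C:\\Files\\YearPredictMSD_Year.xdf")
            OutputDataSet <- rxResultsDF(crossTab)'
    EXECUTE sp_execute_external_script
         @language = N'R'
        ,@script = @RScript
END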

Alternatively, paste the T-SQL code directly into the SQL Server data source; the result is the same.

10 -- 2017-03-20 21_51_04-RevoScale_and_PowerBI - Power BI Desktop

In both cases, you should have a cross-tabulated representation of the XDF dataset within Power BI. And now you can really use all the advantages of Power BI visuals and slicers, as well as any additional R predictions.

12 --- 2017-03-20 21_54_42-RevoScale_and_PowerBI - Power BI Desktop

There is a slight downside to this approach (if not to all approaches like it): you need to create many stored procedures or queries of this kind. rxCube will help you to some extent, but repetitive work cannot be avoided.

Using HDInsight or Hadoop?

Using XDF data files stored in HDInsight or in Hadoop would generally mean using the same dataset and the same steps as for the SQL Server procedure. The only difference is that, prior to executing the T-SQL script, you would also need to change the compute context:

# HD Insight - Spark - Azure
# (the constructor calls are an assumed reconstruction; USNM, HSTNM and SWTCH are placeholders)
HDInsight <- RxSpark(sshUsername = USNM, sshHostname = HSTNM,
                     sshSwitches = SWTCH)
rxSetComputeContext(HDInsight)
## Hadoop
Hadoop <- RxHadoopMR(sshUsername = USNM, sshHostname = HSTNM,
                     sshSwitches = SWTCH)
rxSetComputeContext(Hadoop)

Verdict

I have explored a couple of ways to use the Power BI visuals and environment with RevoScaleR XDF (eXternal Data Frame) data files. I have to admit, I was surprised that it can be done relatively easily, but from a data scientist's perspective it is still some additional load and work before you can start with the actual data analysis. The last two approaches (R script in visuals and SQL Server procedures) are by far the fastest, and they also take advantage of the parallel and distributed computations that the RevoScaleR package brings.

I would very strongly advise Microsoft and the Power BI development team to add an XDF plug-in to Power BI. The plug-in would work with a metadata presentation of the data: each time a computation is needed, the metadata would push the code to R Server and have the results returned. This would surely be a great way to bring the Big Data concept to Power BI Desktop.

As always, code and samples are available at GitHub.

Happy coding!


RevoScaleR package dependencies with graph visualization

MRAN currently holds 7520 R packages. You can check this with the following commands (assuming you are using the MRAN version of R):

library(tools)
df_ap <- data.frame(available.packages())
head(df_ap)

2017-03-13 19_38_02-RStudio

Importing the tools package gives us many useful functions for finding additional information on packages.

The function package.dependencies() parses and checks the dependencies of a package in the current environment. The function package_dependencies() (with an underscore, not a dot) finds all dependent and reverse dependent packages.
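As a rough illustration of package_dependencies(), here is a sketch that runs against a tiny hand-made package database instead of the full MRAN listing. The packages pkgA, pkgB and pkgC are hypothetical; the matrix only mimics the columns of available.packages() that the call needs.

```r
library(tools)

# Toy package database (hypothetical packages), shaped like the matrix
# returned by available.packages().
toy_db <- rbind(
  c(Package = "pkgA", Depends = "R (>= 3.0.0), pkgB", Imports = "pkgC"),
  c(Package = "pkgB", Depends = NA,                   Imports = "pkgC"),
  c(Package = "pkgC", Depends = NA,                   Imports = NA)
)

# Forward dependencies of pkgA (version constraints and "R" are dropped).
deps_A <- package_dependencies("pkgA", db = toy_db,
                               which = c("Depends", "Imports"))

# Reverse dependencies: which packages need pkgC?
rev_C <- package_dependencies("pkgC", db = toy_db,
                              which = c("Depends", "Imports"),
                              reverse = TRUE)
print(deps_A)
print(rev_C)
```

Leaving out the db argument makes the function query the configured repository, which is what you would do against MRAN.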

With the following code I can extract the packages and their dependencies (this performs a data normalization, one row per dependency):

net <- data.frame(df_ap[,c(1,4)])
library(dplyr)
library(tidyr)  # unnest() lives in tidyr
netN <- net %>% 
        mutate(Depends = strsplit(as.character(Depends), ",")) %>% 
        unnest(Depends)
netN

And the result is:

Source: local data frame [14,820 x 2]

   Package       Depends
    (fctr)         (chr)
1       A3 R (>= 2.15.0)
2       A3        xtable
3       A3       pbapply
4   abbyyR  R (>= 3.2.0)
5      abc   R (>= 2.10)
6      abc      abc.data
7      abc          nnet
8      abc      quantreg
9      abc          MASS
10     abc        locfit
..     ...           ...

The result still needs further cleaning and preparation: the version constraints, the entries for R itself and any stray whitespace have to be removed.
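One possible way to do this cleaning, sketched in base R. The helper clean_depends() and the column names src and dep are my own choices for illustration, and the two sample rows are copied from the output above rather than read from MRAN.

```r
# Hypothetical helper: turn one raw Depends string into clean package names.
clean_depends <- function(depends) {
  if (is.na(depends)) return(character(0))
  parts <- strsplit(depends, ",", fixed = TRUE)[[1]]
  parts <- sub("\\(.*\\)", "", parts)   # drop version constraints such as (>= 2.15.0)
  parts <- trimws(parts)                # drop spaces and line breaks around names
  parts[nzchar(parts) & parts != "R"]   # drop empty entries and the R version itself
}

# Two rows in the shape of the Package/Depends columns shown above.
raw <- data.frame(Package = c("A3", "abbyyR"),
                  Depends = c("R (>= 2.15.0), xtable, pbapply", "R (>= 3.2.0)"),
                  stringsAsFactors = FALSE)

# One row per (package, dependency) edge, ready for a network package.
edges_list <- lapply(seq_len(nrow(raw)), function(i) {
  dep <- clean_depends(raw$Depends[i])
  data.frame(src = rep(raw$Package[i], length(dep)), dep = dep,
             stringsAsFactors = FALSE)
})
edge_df <- do.call(rbind, edges_list)
print(edge_df)
```

Packages that depend only on R itself, such as abbyyR here, simply contribute no edges.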

Once the data is normalized, we can use any of the network packages to visualize it. Using the igraph package, I created a visual presentation of the RevoScaleR package with its dependencies and imported packages.

The following code filters out the RevoScaleR package and creates the visual:

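library(igraph)
# edges (with a src column) and deptree (the dependencies of RevoScaleR) come from
# the cleaned netN data; their construction is not shown here
netN_g <- graph.data.frame(edges[edges$src %in% c('RevoScaleR', deptree), ])
plot(netN_g)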

2017-03-15 17_01_14-Plot Zoom

 

Happy Ring!

 

 

rxNeuralNet vs. xgBoost vs. H2O

Recently, I did a session at the local user group in Ljubljana, Slovenia, where I introduced the new algorithms that are available with the MicrosoftML package for Microsoft R Server 9.0.3.

For datasets, I used two from competitions (currently still running) on Kaggle. In the last part, I did image detection and prediction on the MNIST dataset and compared performance and accuracy between the packages.

MNIST Handwritten digit database is available here.

picture2

Starting off with rxNeuralNet, we first have to define the neural network in NET#.

Model for the neural network:

const { T = true; F = false; }

input Picture [28, 28];

hidden C1 [5 * 13^2]
from Picture convolve {
InputShape  = [28, 28];
UpperPad    = [ 1,  1];
KernelShape = [ 5,  5];
Stride      = [ 2,  2];
MapCount = 5;
}

hidden C2 [50, 5, 5]
from C1 convolve {
InputShape  = [ 5, 13, 13];
KernelShape = [ 1,  5,  5];
Stride      = [ 1,  2,  2];
Sharing     = [ F,  T,  T];
MapCount = 10;
}

hidden H3 [100]
from C2 all;

// Output layer definition.
output Result [10]
from H3 all;

Once we have this, we can run the rxNeuralNet algorithm:

model_DNN_GPU <- rxNeuralNet(label ~.
      ,data = dataTrain
      ,type = "multi"
      ,numIterations = 10
      ,normalize = "no"
      #,acceleration = "gpu" #enable this if you have CUDA driver
      ,miniBatchSize = 64 #set to 1 else set to 64 if you have CUDA driver problem 
      ,netDefinition = netDefinition
      ,optimizer = sgd(learningRate = 0.1, lRateRedRatio = 0.9, lRateRedFreq = 10)
)

Then run the prediction and calculate the accuracy:

DNN_GPU_score <- rxPredict(model_DNN_GPU, dataTest, extraVarsToWrite = "label")
# share of test rows where the predicted label matches the true label
sum(DNN_GPU_score$label == DNN_GPU_score$PredictedLabel)/nrow(DNN_GPU_score)

Accuracy for this model is:

[1] 0.9789

 

With the H2O package, the following code was executed to get comparable parameters for the neural network:

model_h20 <- h2o.deeplearning(x = 2:785
                     ,y = 1   # label for label
                     ,training_frame = train_h2o
                     ,activation = "RectifierWithDropout"
                     ,input_dropout_ratio = 0.2 # % of inputs dropout
                     ,hidden_dropout_ratios = c(0.5,0.5,0.5) # % for nodes dropout, one ratio per hidden layer
                     ,balance_classes = TRUE 
                     ,hidden = c(50,100,100) 
                     ,momentum_stable = 0.99
                     ,nesterov_accelerated_gradient = T # use it for speed
                     ,epochs = 15)

When the test dataset is scored against the trained model:

h2o.confusionMatrix(model_h20)
100-(416/9978)*100

the result is a confusion matrix of the predicted values, with an accuracy of:

# [1] 95.83083
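Both accuracy figures in this post are the share of correctly classified test rows. A minimal base R sketch of that calculation follows; the ten labels are made up and have nothing to do with the MNIST results reported here.

```r
# Made-up true and predicted digit labels for ten test images.
actual    <- factor(c(0, 1, 2, 2, 3, 3, 3, 4, 4, 4), levels = 0:4)
predicted <- factor(c(0, 1, 2, 3, 3, 3, 3, 4, 4, 0), levels = 0:4)

# Confusion matrix: rows are true labels, columns are predictions.
conf_mat <- table(actual = actual, predicted = predicted)

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Equivalent form starting from the error count, as a percentage.
errors <- sum(conf_mat) - sum(diag(conf_mat))
accuracy_pct <- 100 - (errors / sum(conf_mat)) * 100

print(conf_mat)
print(accuracy)
print(accuracy_pct)
```

The second form mirrors the 100-(errors/total)*100 calculation used for the H2O model.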

 

For comparison, I have added xgBoost (eXtreme Gradient Boosting), but I will not focus on it this time.

Time comparison between the packages (in seconds); from left to right: H2O, MicrosoftML with GPU acceleration, MicrosoftML without GPU acceleration and xgBoost.

picture1

As for the accuracy of the trained models, here are the results (based on my tests):

MicrosoftML – Neural Network – 97.8%

H2O – Deep Learning – 95.3%

xgBoost – 94.9%

 

As always, code and dataset are available at GitHub.

Happy R-ing 🙂

 

 

First @SLODUG Meeting in 2017

We had our first SQL Server User Group SLODUG meeting of the year. The event took place at Microsoft Slovenija on 9 February 2017, with about 15 people showing up. For those 15 people we had 8 pizzas and some 20 beers 🙂

Scheduled were two topics:

17:15 – 18:00 Let’s use Microsoft R Server 9 for entering Kaggle competition (Tomaž Kaštrun)
18:10 – 19:30 Forecasting with MS BI Suite (Dejan Sarka)

with two beautiful presenters:

slodug20170209

Not to mention outstanding statistics about presenters:

Average gender: Male
Maximum eye color: Yes
Beer moving average: coffee

And a screenshot from the SLODUG Blog:

2017-02-10-14_29_25-slodug-srecanje-v-cetrtek-9-2-novice-slodug-slodug

Keep the community spirit up!

R and SQL Server articles

In the past couple of months, I have prepared several articles on R and SQL Server that have been published on SQL Server Central.

The idea was to have a couple of articles covering everything from an introduction to R and the basics of R Server to some practical cases of R with SQL Server.

1) Using Microsoft R in Enterprise Environments

The article covers the concepts of Microsoft R Server and where and how to start with Microsoft R in an enterprise environment, and it answers the most common concerns people might have when introducing the R language into a corporation.


Link to article: http://www.sqlservercentral.com/articles/R+Language/140422/

 

2) Introduction to Microsoft R Services in SQL Server 2016

Integration and architecture of Microsoft R Services is the main focus of this article. It outlines the different flavors of R (Open, Client, Server, Services, Hadoop, etc.), explains how to deal with installation, and gives a basic overview and explanation of the extended stored procedure SP_EXECUTE_EXTERNAL_SCRIPT.


Link to article: http://www.sqlservercentral.com/articles/Microsoft/145393/

 

3) Installing R packages in SQL Server R Services

Expand the functionality of R by adding new packages. The article covers the many ways to install and add packages to your R environment.


Link to article: http://www.sqlservercentral.com/articles/R+Package/145571/

 

4) Using SQL Server and R Services for analyzing Sales data

Providing use cases for analyzing sales data was the focus of this article, with the goal of showing readers and users how to operationalize R code and bring it into use in any enterprise environment (small or big).


Link to article: http://www.sqlservercentral.com/articles/R+Services/145649/

 

5) Using Power BI and SSRS for visualizing SQL Server and R data

Visualizing the data for any use case is also an important aspect of understanding data insights. The article covers Power BI and SSRS visualization and how to embed R code in both tools.


Link to article: http://www.sqlservercentral.com/articles/R+Language/151358/

6) Using SQL Server and R Services for analyzing DBA Tasks

Broadening the use of Microsoft R to DBA tasks was the main goal of this article. Using a simulation of disk usage, it shows an R example of how to switch from monitoring disk space usage to predicting it. Clustering executed queries to narrow down performance issues and visualizing Query Store information with a heatmap are also introduced in the article.


Link to article: http://www.sqlservercentral.com/articles/R+Language/151405/

 

More articles will follow, so stick around.

Happy R-SQLing!

 

 

RevoScaleR package for Microsoft R

The RevoScaleR package for the R language is a package for scalable, distributed and parallel computation, available along with Microsoft R Server (and in-database R Services). It solves many of the limitations that the R language faces when run from a client machine. The RevoScaleR package addresses several of these issues:

  • memory-based data access model -> a dataset can be bigger than the available RAM
  • lack of parallel computation -> offers distributed and parallel computation
  • data movement -> no more need for data movement, thanks to the ability to set the compute context
  • duplication costs -> with the compute context set, and regardless of the R version (Open, Client or Server), data resides in one place, which makes maintenance cheaper and removes the need for copies in different locations
  • governance and provenance -> RevoScaleR offers oversight of both through settings and additional services in R Server
  • hybrid topologies and agile development -> the combination of on-premises, cloud and client allows hybrid environment development for a faster time to production

 

Before continuing, make sure you have the RevoScaleR package installed in your R environment. To check which version is installed, run the following:

RevoInfo <-packageVersion("RevoScaleR")
RevoInfo

In this case the version of the RevoScaleR package is:

[1] ‘9.0.1’

Now we will run command to get the list of all functions:

revoScaleR_objects <- ls("package:RevoScaleR")
revoScaleR_objects

Here is the list:

2017-01-30-21_19_51-rstudio

All RevoScaleR functions have the prefix rx or Rx, so it is much easier to distinguish them from functions available in other, similar packages, for example rxKmeans and kmeans.

find("rxKmeans")
find("kmeans")

The results show the name of the package where each function lives:

> find("rxKmeans")
[1] "package:RevoScaleR"
> find("kmeans")
[1] "package:stats"

The revoScaleR_objects output shows 200 computational functions, but I will focus on only a couple of them.

The RevoScaleR package and its computational functions were designed for parallel computation without memory limitations, mainly because the package introduced its own file format, called XDF. The eXternal Data Frame was designed for fast processing of data in smaller chunks; it gains its efficiency by loading chunks of data into RAM one at a time, and only the data that is needed. Done this way, the size of RAM is no longer a limitation and computations run much faster (the algorithms are written in C++, which is faster than the originals, written in an interpreted language). Data scientists still make a single R call, but R uses the DistributedR component to determine how many cores, sockets and threads are available, and then launches a smaller portion of the load on each thread, analyzing the data a bit at a time. With XDF, data is retrieved many times, but the file is 5-10 times smaller than the corresponding *.txt or *.csv file (as I have already shown in previous blog posts), and data is written to the XDF file in the same way as it was extracted from memory. This enables faster computations, because no parsing of data chunks is required, and the way the data is stored minimizes retrieval time.
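The chunk-at-a-time idea can be illustrated in plain base R. This is only a sketch of the principle, not how RevoScaleR or the XDF format is implemented: the helper chunked_mean() is hypothetical, it reads a CSV file rather than an XDF file, and the demonstration data is made up.

```r
# Hypothetical helper: mean of a numeric column computed chunk by chunk,
# so that only `rows_per_read` rows are held in memory at once.
chunked_mean <- function(path, column, rows_per_read = 1000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # The first line holds the column names; scan() also strips the quotes.
  header <- scan(text = readLines(con, n = 1), what = "", sep = ",", quiet = TRUE)
  total <- 0
  n <- 0
  repeat {
    lines <- readLines(con, n = rows_per_read)
    if (length(lines) == 0) break
    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    total <- total + sum(chunk[[column]])   # keep only running totals
    n <- n + nrow(chunk)
  }
  total / n
}

# Small demonstration on a temporary CSV file with made-up delays.
tmp <- tempfile(fileext = ".csv")
demo <- data.frame(id = 1:2500, delay = rep(c(0, 5, 10, 25, 60), 500))
write.csv(demo, tmp, row.names = FALSE)

chunked_result <- chunked_mean(tmp, "delay", rows_per_read = 1000)
print(chunked_result)
print(mean(demo$delay))
```

The file is read in three chunks here, and only the running sum and row count survive from one chunk to the next, which is why the dataset never has to fit into RAM as a whole.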

Preparing and storing or importing your data into XDF is an important part of achieving faster computation times. Download some sample data from the Revolution Analytics blog. I will be taking some AirOnTime data, a CSV file from here.

With the help of the following functions, I will import the file from CSV into XDF format.

rxTextToXdf() – for importing data to .xdf format from a delimited text file or csv.

rxDataStepXdf() – for transforming and subsetting data by variables and/or rows for data exploration and analysis.

With the following code:

setwd("C:/Users/Documents/33")
rxTextToXdf(inFile = "airOT201201.csv", outFile = "airOT201201.xdf",  
stringsAsFactors = TRUE, rowsPerRead = 200000)

I converted the CSV file into an XDF file in about 13 seconds.

2017-02-03-23_42_32-rstudio

The files look like this:

2017-02-03-23_44_08-33

The file shrank from the original 105 MB to 15 MB, which makes the XDF file 7 times smaller.

For further information on data handling, a very nice blog post is available here.

Quick information on the dataset can be retrieved using:

rxGetInfo("airOT201201.xdf", getVarInfo = TRUE, numRows = 20)

2017-02-03-23_52_02-rstudio

We can also use the following functions to explore and wrangle the data:

rxSummary(), rxCube(), rxCrossTabs() – summary statistics for columns and cross-tabulations between columns
rxHistogram() – plot a histogram of a column (variable)
rxLinePlot() – plot a line or scatterplot from XDF file or from rxCube

Running summary statistics for column DAY_OF_WEEK:

rxSummary(~DAY_OF_WEEK, data="airOT201201.xdf")
#or for the whole dataset
rxSummary(~., data="airOT201201.xdf")

we see the execution time and the results of this statistic:

Rows Read: 200000, Total Rows Processed: 200000, Total Chunk Time: 0.007 seconds
Rows Read: 200000, Total Rows Processed: 400000, Total Chunk Time: 0.002 seconds
Rows Read: 86133, Total Rows Processed: 486133, Total Chunk Time: 0.002 seconds 
Computation time: 0.018 seconds.
Call:
rxSummary(formula = ~DAY_OF_WEEK, data = "airOT201201.xdf")

Summary Statistics Results for: ~DAY_OF_WEEK
Data: "airOT201201.xdf" (RxXdfData Data Source)
File name: airOT201201.xdf
Number of valid observations: 486133 
 
 Name        Mean     StdDev   Min Max ValidObs MissingObs
 DAY_OF_WEEK 3.852806 2.064557 1   7   486133   0

Now run rxHistogram for the selected column:

#histogram
rxHistogram(~DAY_OF_WEEK, data="airOT201201.xdf")

Rows Read: 200000, Total Rows Processed: 200000, Total Chunk Time: 0.007 seconds
Rows Read: 200000, Total Rows Processed: 400000, Total Chunk Time: 0.004 seconds
Rows Read: 86133, Total Rows Processed: 486133, Total Chunk Time: Less than .001 seconds 
Computation time: 0.019 seconds.

to get the histogram:

2017-02-04 00_01_14-RStudio.png

Some of the following algorithms for predictions are available (and many more in addition):

rxLinMod() – linear regression model for XDF file
rxLogit() – logistic regression model for XDF file
rxDTree() – classification and regression trees for XDF file
rxNaiveBayes() – naive Bayes classifier for XDF file
rxGlm() – generalized linear models for XDF file
rxPredict() – predictions and residuals computations

Let’s create a slightly larger regression tree on our sample data, modelling departure delay by day of the week, elapsed time and distance group:

Air_DTree <- rxDTree(DEP_DELAY_NEW ~ DAY_OF_WEEK + ACTUAL_ELAPSED_TIME +
 DISTANCE_GROUP,  maxDepth = 3, minBucket = 30000, data = "airOT201201.xdf")

Visualizing the tree data:

plotcp(rxAddInheritance(Air_DTree))
plot(rxAddInheritance(Air_DTree))
text(rxAddInheritance(Air_DTree))

2017-02-04-00_22_57-plot-zoom

Or you can use the RevoTreeView package, which is even smarter:

library(RevoTreeView)
plot(createTreeView(Air_DTree))

With it, the tree looks like this:

2017-02-04-00_20_49-microsoft-corporation

Of course, pruning and checking for over-fitting must also be done.
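As a rough sketch of what such a pruning step can look like, the example below uses the rpart package (a recommended package that ships with R) on the built-in mtcars data. It is a stand-in under assumptions, not the airline tree built above: the variables, the control settings and the seed are chosen only for illustration.

```r
library(rpart)

set.seed(2017)  # cross-validation folds are random

# Deliberately over-grown regression tree on a built-in dataset.
fit <- rpart(mpg ~ wt + hp + disp + cyl, data = mtcars, method = "anova",
             control = rpart.control(cp = 0.0001, minsplit = 4, xval = 10))

# The complexity table holds the cross-validated error (xerror) per tree size.
cp_table <- fit$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Cut the tree back to the size with the lowest cross-validated error.
pruned <- prune(fit, cp = best_cp)

print(cp_table)
print(c(leaves_before = sum(fit$frame$var == "<leaf>"),
        leaves_after  = sum(pruned$frame$var == "<leaf>")))
```

A cross-validated error that starts rising again as the tree grows is the usual sign of over-fitting, and pruning back to the minimum is one common remedy.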

When comparing, for example, rxDTree to the original function, the performance is much better in favor of rxDTree. And if you are able to use the RevoScaleR package for computations on larger datasets, or if your client machine might be an issue, use this package. It sure will make your life easier.

 

Happy R-SQLing.

Animated barplot and google map with R

It might happen that you need an animated graph of some kind. For plotting demographic data and its changes through the years, plotting maps with Google Maps and then merging and converting the jpg files into an animated gif sure gives a nice visualization effect.

Here is a sample of changes over a period of three years on a dataset of my home town; a graph can tell a little bit more than numbers 🙂

ggmap_anim

So besides visualizations based on geographical data, plots available in any R package can be animated in a similar way.

Playing and goofing around, I created a sample dataset:

# dataset
d <- data.frame(val=c(2,3,4,3,4,3,4,3,2,1), 
                year=c(2015,2016,2017,2015,2016,2015,2015,2016,2017,2015)
                )

Once that is created, a function will do the trick:

setwd("C:/DataTK")
library(ggplot2)
library(dplyr)

# Writes one PNG per distinct x value (bars after that value are set to 0)
# and stitches the frames into an animated GIF with ImageMagick.
# xvar and yvar are column names given as strings, e.g. "year" and "val".
plot.bar.2 <- function(df_plot, xvar, yvar)
{
  # bring the chosen columns to fixed names and order them by x
  dfp <- data.frame(x = df_plot[[xvar]], y = df_plot[[yvar]]) %>% arrange(x)
  # number of steps - based on the distinct values on the X-axis
  x_unique <- sort(unique(dfp$x))
  nof_steps <- length(x_unique)

  for (i in seq_len(nof_steps))
  {
    x_i <- x_unique[i]
    # keep the values up to the current step, set the later ones to 0
    dfinal <- dfp %>% mutate(y = ifelse(x <= x_i, y, 0))

    name <- sprintf("barplot%03d.png", i)
    png(name)
    # ggplot objects must be printed explicitly inside a function or loop
    print(ggplot(data=dfinal, aes(x=x, y=y, fill=x)) + geom_bar(stat="identity") +
        guides(fill="none"))
    dev.off()
  }

  # requires ImageMagick ("magick") on the PATH
  system("magick -delay 150 -loop 0 barplot*.png GeomBar_plot.gif")
  # remove only the frames created above
  file.remove(list.files(pattern="^barplot[0-9]+\\.png$"))
}

plot.bar.2(d, "year", "val")

 

The result is a somewhat more “interactive”, animated graph.

barplot

In comparison to static graph:

barplot.png

In addition to the R code, the ImageMagick program needs to be installed on your machine. The speed, quality and many other parameters can also be set when creating the animated gif.
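The frame logic behind the animation can be checked without plotting or ImageMagick. The sketch below rebuilds the sample dataset and uses a hypothetical helper, make_frames(), that returns one data frame per animation step; it only illustrates the idea of zeroing out the later years and is not part of the function above.

```r
# The sample dataset from above.
d <- data.frame(val  = c(2, 3, 4, 3, 4, 3, 4, 3, 2, 1),
                year = c(2015, 2016, 2017, 2015, 2016, 2015, 2015, 2016, 2017, 2015))

# Hypothetical helper: one data frame per animation frame, where values
# belonging to later steps are set to 0.
make_frames <- function(df, xcol, ycol) {
  steps <- sort(unique(df[[xcol]]))
  lapply(steps, function(step) {
    data.frame(x = df[[xcol]],
               y = ifelse(df[[xcol]] <= step, df[[ycol]], 0))
  })
}

frames <- make_frames(d, "year", "val")

# Total bar height per year (rows) in each frame (columns).
frame_totals <- sapply(frames, function(f) tapply(f$y, f$x, sum))
print(frame_totals)
```

Each column of the printed matrix is one frame: the bar for 2015 is present from the first frame, 2016 joins in the second and 2017 in the third.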

An animated gif can also be included in your SSRS report, your SharePoint site or any other site (like my blog 🙂) and it will stay animated. In Power BI, unfortunately, importing an animated gif as a picture will not work.

As always, code is available at GitHub.

Happy R-coding!