Linear regression in “The Man Who Counted”

Recently, I got a book by the Brazilian writer Júlio César de Mello e Souza (published under the pen name Malba Tahan), titled The Man Who Counted. The book is a collection of mathematical stories, very similar to Scheherazade’s One Thousand and One Nights, with mathematical storytelling at its center.


In the fifth story, “In So Many Words”, Malba Tahan describes a simple algebraic problem of proportion between the price of lodging offered and the price of jewels sold.

“This man,” old Salim said, pointing to the jeweler, “came from Syria to sell precious stones in Baghdad. He promised he would pay 20 dinars for his lodgings if he sold all of his jewels for 100 dinars, and 35 dinars if he sold them for 200. After several days of wandering about, he ended up selling all of them for 140 dinars. How much does he owe me according to our agreement?”

Both the jeweler and the lodge owner calculate the result using a simple percent proportion, each ending up with a different result that favors his own side:

1. Jeweler:

200 : 35 :: 140 : x

x = (35 × 140) / 200 = 24.5

2. Lodge owner:

100 : 20 :: 140 : x

x = (20 × 140) / 100 = 28

With two different results, the two would end up in an argument, so a third calculation was needed:

Sale Price      Price of Lodging
   200               35
  -100              -20
  -----             ----
   100               15

So the differences between the two known cases form a proportion for the new one: a difference of 100 in sale price corresponds to a difference of 15 in lodging price. The sale price of 140 is 40 above 100, so the lodging price is 20 + 6 = 26 dinars:

100 : 15 :: 40 : x

x = (15 × 40) / 100 = 6
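
For completeness, all three competing answers can be checked with a few lines of R (a direct transcription of the calculations above):

# jeweler's proportion: 200 : 35 :: 140 : x
35 * 140 / 200
# lodge owner's proportion: 100 : 20 :: 140 : x
20 * 140 / 100
# Beremiz: base price of 20 plus the proportional part of the difference
20 + (15 * 40) / 100

The three results are 24.5, 28, and 26.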

 

Mathematically speaking, the problem is also interesting to solve using linear regression: the two points [100, 20] and [200, 35] determine a linear prediction function, and we need to predict the price of lodging when the sale price of the jewels is 140.

# the two known points: sale price of the jewels and price of lodging
diamond <- c(100, 200)
sleep   <- c(20, 35)

# regression of lodging price on sale price
sleep_model <- lm(sleep ~ diamond)

plot(x = diamond, y = sleep)
abline(sleep_model)


Now we can make a prediction, which is what Beremiz actually does in his head.

# predict the price of lodging for a sale price of 140
predict_data <- data.frame(diamond = 140)
fit <- predict(sleep_model, predict_data, interval = "prediction")

# new value for diamond = 140
fit[1]

The result is 26, which is the correct result both algebraically and in terms of the R prediction.
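
Since the line is determined by exactly two points, the fit is exact: the slope is (35 − 20) / (200 − 100) = 0.15 and the intercept is 20 − 0.15 × 100 = 5, so the prediction is 5 + 0.15 × 140 = 26. This can be read directly off the model:

coef(sleep_model)   # intercept 5, slope 0.15
5 + 0.15 * 140      # 26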

In this case, linear regression does the same as the proportion calculation. But what strikes me is which calculation – not mathematically speaking – makes more sense: 26, 24.5, or 28? And which method for calculating the next lodging price should satisfy both the jeweler and the lodge owner?

Happy reading!

 

 

Is it possible to use the RevoScaleR package in Power BI?

I was invited to deliver a session for the Belgian user group on SQL Server and R integration. After the session – which we did online using web-based Citrix – I got an interesting question: “Is it possible to use RevoScaleR performance computational functions within Power BI?” My first answer was a sceptical yes. But I said that I hadn’t used it in this manner yet and that there might be some limitations.

The idea of having a scalable environment and a parallel computational package with all the predictive analytical functions in Power BI is absolutely great. But something tells me it will not be that straightforward.

So let’s start by taking a large (500 MB) txt file and creating an XDF file:

library(RevoScaleR)
file.name <- "YearPredictionMSD.txt"
rxOptions(sampleDataDir = "C:\\Files")
sampleDataDir <- rxGetOption("sampleDataDir")

# import the 500 MB txt file into the binary XDF format
outputFile <- rxImport(inData = file.path(sampleDataDir, file.name),
                       outFile = "C:\\Files\\YearPredictionMSD.xdf", overwrite = TRUE)

The file is available online at this address as a zip file.
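
To verify the import succeeded, rxGetInfo can summarize the new XDF file (a quick, optional sanity check):

# variable metadata and the first rows of the XDF file
rxGetInfo(outputFile, getVarInfo = TRUE, numRows = 3)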

Getting data with R script

Open Power BI, choose Get Data -> R Script, and copy/paste the following slightly changed code:

library(RevoScaleR)
file.name <- "YearPredictionMSD.txt";
rxOptions(sampleDataDir = "C:\\Files");
sampleDataDir <- rxGetOption("sampleDataDir");

# read the data into a data frame (df_MSD is an illustrative name);
# Power BI picks up any data frame created in the script
df_MSD <- rxImport(file.path(sampleDataDir, file.name));

After copying, pasting, and clicking OK, you will have to wait for the data to be read into memory and the data model to be created.


After monitoring the memory consumption and patiently waiting, you will notice that this particular dataset (500 MB as txt, or 160 MB as XDF) consumes a minimum of 3 GB of RAM, and you will finally end up with the data preview.


By now you will also notice that, after saving, this Power BI document takes up to 700 MB of your disk space, and every data visualization consumes additional RAM and time. After you close the Power BI document, you will notice a lot of RAM being released.

Using R Script in the visuals

In a new Power BI document, I create a new dataset using Enter Data, consisting of three “dummy” variables.


With these three variables I will try to inject the data returned from the XDF file and have the data represented in Power BI.

After adding a new visual and choosing the R visual, I inserted the following code:

library(RevoScaleR)
rxOptions(sampleDataDir = "C:\\Files");
sampleDataDir <- rxGetOption("sampleDataDir");

# plot a histogram of the year column (V1) straight from the XDF file
rxHistogram(~V1, data = file.path(sampleDataDir, "YearPredictionMSD.xdf"))

And this time the result is fascinating: R plots the histogram in a split second, which simply means it takes advantage of the XDF file and injects the result into Power BI.


This is still an external file or dataset that Power BI does not have a clue about, meaning no slicers are available for dynamically changing the user selection.

Let’s try to insert the data into those three dummy variables, where the third one will be a factor that I have to prepare in advance. Since in this case the factor is Year, it is relatively easy to do:

library(RevoScaleR)
library(gridExtra)
library(dplyr)
# Year holds the years selected in the dummy dataset (hard-coded here)
Year <- c("2000","2001","2002")
# read the year column (V1) from the XDF file and turn it into a factor
df_f <- rxDataStep(inData = "C:\\Files\\YearPredictionMSD.xdf", varsToKeep = "V1")
df_f$year <- as.factor(df_f$V1)
# show the counts for the selected years as a table in the R visual
grid.table(df_f %>% filter(year %in% Year) %>% count(year))

Once I have this inserted in the new R visual, I just need to add a dummy slicer.


Now I can easily change the years for my cross-tabulation (using the rxCrossTabs function). Since the calculation is performed in the background on the whole dataset, and the dplyr package is used only to omit or filter the results, it is also possible to use rxDataStep:

# create a new XDF file with a derived logical variable LateYears
rxDataStep(inData = outputFile, outFile = "C:\\Files\\YearPredictMSD_Year.xdf",
           overwrite = TRUE, transforms = list(LateYears = V1 > 1999))
# cross-tabulate V2 against the new variable
rxCrossTabs(V2 ~ F(LateYears), data = "C:\\Files\\YearPredictMSD_Year.xdf")

In this way, you create a new XDF file with the transformation through Power BI. Bear in mind that this step might take a few extra seconds to create the new variable or to make a subset, should you need one. Again, this is for you to decide, based on the file size.

Using a SQL Server procedure with an R script

This approach is not that uncommon, because it has been proven that using stored procedures with T-SQL and R code is a useful and powerful way to combine SQL Server and R integration, for example within SSRS. Changing the compute context is surely another workaround, as sketched below.
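
For illustration, switching the compute context to SQL Server R Services could look like this (the connection string is a placeholder, not taken from the original session):

# sketch: push RevoScaleR computations into SQL Server
sqlConnString <- "Driver=SQL Server;Server=MYSERVER;Database=TestDB;Trusted_Connection=True"
sqlCompute <- RxInSqlServer(connectionString = sqlConnString)
rxSetComputeContext(sqlCompute)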

Creating the stored procedure (the R script must hand a data frame back to SQL Server; the body below is a sketch along those lines):

CREATE PROCEDURE [dbo].[SP_YearMSD_CrossTab]
AS
BEGIN
    DECLARE @RScript nvarchar(max)
        SET @RScript = N'
                library(RevoScaleR)
                sampleDataDir <- "C:\\Files"
                crosstab <- rxCrossTabs(V2 ~ F(LateYears),
                          data = file.path(sampleDataDir, "YearPredictMSD_Year.xdf"))
                OutputDataSet <- data.frame(crosstab$counts[[1]])'
    EXECUTE sp_execute_external_script
          @language = N'R'
         ,@script = @RScript
END;

Executing the procedure, or copying the T-SQL code into the SQL Server data source, gives the same result.


In both cases, you should have a cross-tabulated representation of the XDF dataset within Power BI. And now you can really use all the advantages of Power BI visuals and slicers, as well as any additional R predictions.


There is a slight minus to this (if not to all of these) approaches: you need to generate many stored procedures or queries of this kind. rxCube will also help you to some extent (see the sketch below), but repetitive work will not be avoided.
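
A minimal rxCube sketch, reusing the formula and the XDF file created above:

# rxCube returns the same aggregation as a long, slicer-friendly table
cube <- rxCube(V2 ~ F(LateYears), data = "C:\\Files\\YearPredictMSD_Year.xdf")
rxResultsDF(cube)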

Using HDInsight or Hadoop?

Using XDF data files stored in HDInsight or Hadoop would generally mean using the same dataset and steps as for the SQL Server procedure, except that prior to executing the T-SQL script you would also need to change the compute context:

# HDInsight - Spark - Azure (USNM, HSTNM and SWTCH hold the ssh credentials)
HDInsight <- RxSpark(sshUsername = USNM, sshHostname = HSTNM,
                     sshSwitches = SWTCH)
rxSetComputeContext(HDInsight)

## Hadoop
Hadoop <- RxHadoopMR(sshUsername = USNM, sshHostname = HSTNM,
                     sshSwitches = SWTCH)
rxSetComputeContext(Hadoop)

Verdict

I have explored a couple of ways to use the Power BI visuals and environment with RevoScaleR XDF (eXternal Data Frame) data files. I have to admit I was surprised that there was a relatively easy way to do it, but from a data scientist’s perspective there is still some additional load and work before you can start with the actual data analysis. The last two approaches (R script in visuals and SQL Server procedures) are by far the fastest and also take advantage of the parallel and distributed computations that the RevoScaleR package brings.

I would very strongly advise Microsoft and the Power BI development team to add an XDF plug-in to Power BI. The plug-in would work with a metadata presentation of the data; each time computations were needed, it would push the code to R Server and have the results returned. This would surely be a great way to bring the Big Data concept to Power BI Desktop.

As always, code and samples are available at GitHub.

Happy coding!

RevoScaleR package dependencies with graph visualization

MRAN currently holds 7,520 R packages. We can see this with the following command (assuming you are using the MRAN R version):

library(tools)
# snapshot of all packages available in the configured repository
df_ap <- data.frame(available.packages())
nrow(df_ap)    # the number of available packages
head(df_ap)


By importing the tools package, we get many useful functions for finding additional information on packages.

The function package.dependencies() parses and checks the dependencies of a package in the current environment. The function package_dependencies() (with an underscore, not a dot) finds all dependent and reverse-dependent packages, as shown below.
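
For example (querying dplyr here is purely illustrative):

library(tools)
db <- available.packages()
# packages dplyr depends on, and packages that depend on dplyr
package_dependencies("dplyr", db = db)
package_dependencies("dplyr", db = db, reverse = TRUE)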

With the following code I can extract the packages and their dependencies (this also normalizes the data to one dependency per row):

# keep only the Package and Depends columns
net <- data.frame(df_ap[, c(1, 4)])
library(dplyr)
library(tidyr)    # unnest() comes from tidyr, not dplyr
netN <- net %>%
        mutate(Depends = strsplit(as.character(Depends), ",")) %>%
        unnest(Depends)
netN

And the result is:

Source: local data frame [14,820 x 2]

   Package       Depends
    (fctr)         (chr)
1       A3 R (>= 2.15.0)
2       A3        xtable
3       A3       pbapply
4   abbyyR  R (>= 3.2.0)
5      abc   R (>= 2.10)
6      abc      abc.data
7      abc          nnet
8      abc      quantreg
9      abc          MASS
10     abc        locfit
..     ...           ...

The data presented this way still needs further cleaning and preparation; a typical step is sketched below.
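
For instance (a sketch, not from the original post), we can trim the whitespace around package names and drop the R version requirements:

library(dplyr)
# strip spaces, then remove entries such as "R (>= 2.15.0)"
netN_clean <- netN %>%
        mutate(Depends = trimws(Depends)) %>%
        filter(!grepl("^R \\(", Depends))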

Once the data is normalized, we can use any of the network packages to visualize it. Using the igraph package, I created a visual presentation of the RevoScaleR package, its dependencies, and its imported packages.

With the following code I filter out the RevoScaleR package and its dependency tree, and create the visual:

library(igraph)
# deptree: recursive dependencies of RevoScaleR; edges: netN renamed to src/dst
deptree <- unlist(package_dependencies("RevoScaleR", db = available.packages(), recursive = TRUE))
edges <- setNames(netN, c("src", "dst"))
netN_g <- graph.data.frame(edges[edges$src %in% c('RevoScaleR', deptree), ])
plot(netN_g)


 

Happy R-ing!