RevoScaleR package for Microsoft R

Posted on February 3, 2017 by tomaztsql — 7 Comments

The RevoScaleR package for the R language is a package for scalable, distributed and parallel computation, available with Microsoft R Server (and in-database R Services). It addresses many of the limitations that R faces when run from a client machine, including:

memory-based data access model -> datasets can be bigger than the available RAM
lack of parallel computation -> offers distributed and parallel computation
data movement -> no more need to move data, thanks to the ability to set the computational context
duplication costs -> with the computational context set, data resides in one place across different R versions (Open, Client or Server), making maintenance cheaper and avoiding duplication across locations
governance and provenance -> RevoScaleR offers oversight of both through settings and additional services in R Server
hybrid topologies and agile development -> the on-premises + cloud + client combination allows hybrid environment development for faster time to production

Before continuing, make sure you have the RevoScaleR package installed in your R environment. To check which version you have, run the following:

RevoInfo <- packageVersion("RevoScaleR")
RevoInfo

This shows the version of the RevoScaleR package. In this case it is:

[1] ‘9.0.1’

Now we run a command to list all the functions in the package:

library(RevoScaleR)  # attach the package so ls("package:...") can see it
revoScaleR_objects <- ls("package:RevoScaleR")
revoScaleR_objects

Here is the list:

[Screenshot: RStudio output listing the RevoScaleR functions]

All RevoScaleR functions have the prefix rx or Rx, so it is much easier to tell them apart from functions in other similar packages, for example rxKmeans and kmeans.

find("rxKmeans")
find("kmeans")

The results show the name of the package in which each function lives:

> find("rxKmeans")
[1] "package:RevoScaleR"
> find("kmeans")
[1] "package:stats"

The output of the revoScaleR_objects object shows 200 computational functions, but I will focus on only a couple of them.

The RevoScaleR package and its computational functions were designed for parallel computation without being limited by available memory, mainly because the package introduced its own file format, called XDF. The eXternal Data Frame was designed for fast processing of smaller chunks of data, and it gains its efficiency when reading and writing XDF data by loading chunks of data into RAM one at a time, and only what is needed. Done this way, the size of RAM is no longer a limit, and computations run much faster (because the algorithms are written in C++, which is faster than the originals written in an interpreted language). The data scientist still makes a single R call, but R uses the DistributedR component to determine how many cores, sockets and threads are available, then launches a smaller portion of the load in each thread and analyzes the data a bit at a time.

With XDF, data is retrieved many times, but since the file is 5-10 times smaller (as I have already shown in previous blog posts when comparing it to *.txt or *.csv files), and since it is written and stored in the XDF file the same way it was extracted from memory, computations are faster: no parsing of data chunks is required, and the way the data is stored maximizes retrieval speed.
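To make the chunking idea more concrete, here is a minimal sketch in base R. It is not how RevoScaleR is implemented internally and it does not use XDF; it only illustrates, under my own assumptions, how a mean and standard deviation can be computed while holding one chunk in memory at a time. The sample data, the chunk size and all variable names are made up for illustration.

```r
# Sketch: stream a CSV in fixed-size chunks and accumulate running totals,
# so the full dataset never has to sit in memory at once.
# Base R only; illustrative, not RevoScaleR code.

set.seed(42)
csv_path <- tempfile(fileext = ".csv")   # hypothetical sample file
sample_data <- data.frame(
  DAY_OF_WEEK   = sample(1:7, 1000, replace = TRUE),
  DEP_DELAY_NEW = round(runif(1000, 0, 120))   # made-up delay values
)
write.csv(sample_data, csv_path, row.names = FALSE, quote = FALSE)

chunk_size <- 300                        # rows read per chunk
con <- file(csv_path, open = "r")
col_names <- strsplit(readLines(con, n = 1), ",")[[1]]   # header line

n <- 0          # rows seen so far
total <- 0      # running sum of DAY_OF_WEEK
total_sq <- 0   # running sum of squares of DAY_OF_WEEK

repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break          # end of file reached
  chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)
  x <- chunk$DAY_OF_WEEK
  n <- n + length(x)
  total <- total + sum(x)
  total_sq <- total_sq + sum(x^2)
}
close(con)
unlink(csv_path)

# Combine the per-chunk totals into the final statistics
chunk_mean <- total / n
chunk_sd <- sqrt((total_sq - n * chunk_mean^2) / (n - 1))

print(c(rows = n, mean = chunk_mean, sd = chunk_sd))
```

The point of the design is that only the running totals survive between chunks, which is why the size of the file does not have to fit into RAM.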

Preparing and storing or importing your data into XDF is an important part of achieving faster computation times. Download some sample data from the Revolution Analytics blog. I will be using some AirOnTime data, a CSV file from here.

With the help of the following functions, I will import the file from CSV into XDF format.

rxTextToXdf() – for importing data to .xdf format from a delimited text file or csv.

rxDataStepXdf() – for transforming and subsetting variables and/or rows for data exploration and analysis.

With following code:

setwd("C:/Users/Documents/33")
rxTextToXdf(inFile = "airOT201201.csv", outFile = "airOT201201.xdf",
            stringsAsFactors = TRUE, rowsPerRead = 200000)

This converted the CSV file into an XDF file in about 13 seconds. The file shrank from the original 105 MB to 15 MB, which makes it 7 times smaller.

For further information on data handling, a very nice blog post is available here.
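The subsetting step that rxDataStepXdf() is meant for can be sketched roughly as follows. This is only a sketch under a few assumptions: it uses rxDataStep(), the more general form of the same data step in RevoScaleR, the output file name is made up for illustration, and the step is skipped when RevoScaleR or the XDF file is not available on your machine.

```r
# Sketch: keep three columns and only the rows where DAY_OF_WEEK is 1.
in_file  <- "airOT201201.xdf"        # the XDF file created by rxTextToXdf()
out_file <- "airOT201201_day1.xdf"   # hypothetical output file name
step_done <- FALSE

if (requireNamespace("RevoScaleR", quietly = TRUE) && file.exists(in_file)) {
  RevoScaleR::rxDataStep(
    inData       = in_file,
    outFile      = out_file,
    varsToKeep   = c("DAY_OF_WEEK", "DEP_DELAY_NEW", "DISTANCE_GROUP"),
    rowSelection = DAY_OF_WEEK == 1,   # evaluated chunk by chunk on the data
    overwrite    = TRUE
  )
  step_done <- TRUE
} else {
  message("RevoScaleR or ", in_file, " not available; skipping the data step.")
}
```

The resulting file can then be passed to the rx functions in place of the full dataset.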

Quick information on the data set can be obtained using:

rxGetInfo("airOT201201.xdf", getVarInfo = TRUE, numRows = 20)

We can also use the following functions to explore and wrangle the data:

rxSummary(), rxCube(), rxCrossTabs() – summary statistics for a column, and cubes or cross-tabulations across columns

rxHistogram() – plot a histogram of a column (variable)

rxLinePlot() – plot a line or scatterplot from XDF file or from rxCube

Running summary statistics for column DAY_OF_WEEK:

rxSummary(~DAY_OF_WEEK, data="airOT201201.xdf")
#or for the whole dataset
rxSummary(~., data="airOT201201.xdf")

We see the execution time and the results of the summary for DAY_OF_WEEK:

Rows Read: 200000, Total Rows Processed: 200000, Total Chunk Time: 0.007 seconds
Rows Read: 200000, Total Rows Processed: 400000, Total Chunk Time: 0.002 seconds
Rows Read: 86133, Total Rows Processed: 486133, Total Chunk Time: 0.002 seconds 
Computation time: 0.018 seconds.
Call:
rxSummary(formula = ~DAY_OF_WEEK, data = "airOT201201.xdf")

Summary Statistics Results for: ~DAY_OF_WEEK
Data: "airOT201201.xdf" (RxXdfData Data Source)
File name: airOT201201.xdf
Number of valid observations: 486133 
 
 Name        Mean     StdDev   Min Max ValidObs MissingObs
 DAY_OF_WEEK 3.852806 2.064557 1   7   486133   0

Now run rxHistogram() for the selected column:

#histogram
rxHistogram(~DAY_OF_WEEK, data="airOT201201.xdf")

Rows Read: 200000, Total Rows Processed: 200000, Total Chunk Time: 0.007 seconds
Rows Read: 200000, Total Rows Processed: 400000, Total Chunk Time: 0.004 seconds
Rows Read: 86133, Total Rows Processed: 486133, Total Chunk Time: Less than .001 seconds 
Computation time: 0.019 seconds.

This produces the histogram of DAY_OF_WEEK.

The following prediction algorithms are available, among many others:

rxLinMod() – linear regression model for XDF file

rxLogit() – logistic regression model for XDF file

rxDTree() – classification tree for XDF file

rxNaiveBayes() – naive Bayes classifier for XDF file

rxGlm() – generalized linear models for XDF file

rxPredict() – predictions and residuals computations

Let’s create a slightly larger regression tree on our sample data, modelling departure delay from day of the week, elapsed time and distance group.

Air_DTree <- rxDTree(DEP_DELAY_NEW ~ DAY_OF_WEEK + ACTUAL_ELAPSED_TIME + DISTANCE_GROUP,
                     maxDepth = 3, minBucket = 30000, data = "airOT201201.xdf")

Visualizing the tree data:

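library(rpart)  # provides plotcp() and the plot/text methods for rpart objects
Air_DTree_rp <- rxAddInheritance(Air_DTree)  # let rpart methods work on the rxDTree object
plotcp(Air_DTree_rp)
plot(Air_DTree_rp)
text(Air_DTree_rp)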

[Plot: the decision tree drawn with plot() and text()]

Or you can use the RevoTreeView package, which is even smarter:

library(RevoTreeView)
plot(createTreeView(Air_DTree))

This visualizes the tree:

[Screenshot: the tree rendered by RevoTreeView]

Of course, pruning and checking for over-fitting must also be done.

When comparing, for example, rxDTree() to the original function, the performance is much better in favor of RevoScaleR. If you work with larger datasets, or your client machine might be a bottleneck, use this package. It sure will make your life easier.

Happy R-SQLing.

Tagged with: Microsoft R Server, R, RevoScaleR
Posted in Uncategorized

7 comments on “RevoScaleR package for Microsoft R”

RevoScaleR package for Microsoft R | A bunch of data says:

February 4, 2017 at 12:26 pm

[…] article was first published on R – TomazTsql, and kindly contributed to […]

Juan says:

February 4, 2017 at 1:05 pm

Hello.
Does RevoScaleR have any function to fit regression models with random effects (as lme4 or nlme do)?
My dataset is bigger than memory and lme4 is not able to work with data of this size.
And what about Bayesian regression with RevoScaleR? Stan is not able to work with large datasets either.

Regards

tomaztsql says:

February 4, 2017 at 10:18 pm

Hello Juan,

Can you please describe the model you want to fit with random effects? Just the formula and the types of the variables in your regression model.

Bayesian regression is not available per se. You can do MLE to avoid overfitting on the sample and feed it to linear regression, but without the percentage of uncertainty. You would still lack the distribution of the predictor, which is what would really be needed. There is a Gaussian distribution available, if this would be of any help. A loss function is also not available.

RevoScaleR – Curated SQL says:

February 7, 2017 at 1:10 pm

[…] Tomaz Kastrun explains how the RevoScaleR package is useful: […]

Randy Minder says:

March 2, 2017 at 5:24 pm

Is RevoScaleR necessary if we intend to do most of our analysis using SQL Server 2016 and R Services? It seems to me it wouldn’t be because all processing is being done within the SQL 2016 engine.

tomaztsql says:

March 2, 2017 at 6:11 pm

Hi Randy,

Basic R integration with SQL Server 2016 is available in all SQL Server 2016 editions, which means that you can use R integration for any kind of data analysis work with any edition. When you want to use the ScaleR algorithms (in the RevoScaleR package), which support full parallelism and R with no memory limitations, you need the Enterprise or Developer edition of SQL Server 2016.

In your case, if you do all the pre-processing in the SQL Server engine and push a smaller amount of data to R for statistical analysis or data visualization, you can also do this with basic R integration.

Best, Tomaž

tanishtalks says:

June 25, 2017 at 9:00 am

Is there any function for developing Recommendation Systems in RevoScaleR package?
