R or Python? Python or R? The ongoing debate.

At every SQL community event where there is a cluster of sessions dedicated to BI or analytics, people ask me, “which one would you recommend?” or “which one do you prefer?”


Questions about recommendations and preferences are, in my opinion, the hardest ones. Not because I would not know my own preferences, but because you inadvertently shape someone else’s taste or preferences by imposing yours. And expressing taste through someone else’s taste is even harder.

My initial reaction is a counter-question: why are you asking this? Simply because my curiosity goes beyond the question of preferring A over B. And most of the time, the answer I get is “because everyone is asking this question” or “because someone said this and the other said that”. In none of the cases, and I mean literally none, did I get the response I would love to get, such as “we are running this algorithm and there are issues…” or “this library suits us better…”. So the community is mainly focused on asking which one is better, instead of asking: can R / Python do the job? And I can assure you that both can do the job! Period.

Imagine I ask you: would you prefer an Apple iPhone over a Samsung Galaxy? Or would you prefer a BMW over an Audi? In all these cases, both phones or both cars will get the job done. So will Python or R, R or Python. So instead of asking which one I prefer, ask yourself: which one suits my environment better? If your background is more statistics and less programming, take R; if you are more into programming and less into statistics, take Python. In both cases you will get to results faster with your preferred language. If you ask me whether you can do gradient boosting or ANOVA or MDS in Python or in R, the answer is yes, you can do both in either language.

The important questions are therefore the ones that will give you fast results, easier adaptation and adoption, a better fit into your environment, and less impact on your daily tasks.

Some might say R is a child’s-play language while Python is a real programming language. Or some might say Python is so complex that you have to program everything, whereas in R everything is ready. And so on. All these allegations hold some truth, but to fully understand them, one needs to understand the background of the people saying them. Obviously, Python, in comparison to R, is a more general-purpose scripting and programming language, and therefore the number of packages is roughly 10x higher than for R. Both come with a variety of packages giving users specific functions, classes and procedures to get their results. R, on the other hand, has had its moment in the past couple of years and its community has grown rapidly, whereas the Python community is in a steady phase.

When you are deciding which one to select, here are some questions to be answered:

  • how big is my corporate environment and how many end users will I have
  • who is the end user and how will they handle the results
  • what is the current general knowledge of the language
  • which statistical and predictive algorithms will the company be using
  • will there be a need for parallel and distributed on-prem computations
  • if needed, do we need to connect (or copy/paste) the code to the cloud
  • how fast can the company adopt the language and how much effort is needed
  • which language fits more easily into the existing BI stack and visualization tools
  • how centralized or siloed is your data and which data sources are you using
  • governance and provenance issues
  • installation and distribution of the core engine and packages
  • selection and costs of IDEs and GUIs
  • corporate support and SLAs
  • possibility to connect to different data sources
  • release dates of the most useful packages
  • community support
  • third-party tools and additional programs for easier usage of the language
  • total cost of using the language once completely in place
  • assessing the risk of using GNU/open-source software

After answering these questions, I implore you to run stress and load tests against your datasets and databases to see what performs better.

All in all, both languages, when used for statistical and predictive analysis, have a couple of annoyances that should also be addressed:

  • memory limitations (unless spilling to disk)
  • language specifics (e.g.: R is case-sensitive, Python is indent-sensitive and both will annoy you)
  • parallel and distributed computations (CPU utilization, multi-threading)
  • multi-OS running environment
  • cost of GUI/IDE
  • engine and package dependencies and versioning
  • and others

So next time you ask yourself, or overhear the conversation in the community, which one is better (bigger, faster, more stable, …), start asking questions about your needs and the effort to adopt it. Otherwise, I always add: learn both. It does not hurt to learn and use both (at least for statistical and predictive purposes).

All best!


Performance differences between RevoScaleR, ColumnStore Table and In-Memory OLTP Table

The focus of this blog post is a comparison between running RevoScaleR computational functions against *.XDF files and having the dataset available in a columnstore table or an in-memory OLTP table.

For this test, I will use the AirLines dataset, available here. I have deliberately picked a 200 MB sample (of the 13 GB dataset) in order to properly test the differences and find out what the best approach should be.

After unzipping the file, I will use the following T-SQL to import the file into SQL Server.

With this example, you can import an XDF file directly into a SQL Server table (note that I first transform the CSV file into XDF and then import the XDF file into the SQL table):

-- must have a write permissions on folder: C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                rxOptions(sampleDataDir = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData")
                inFile <- file.path(rxGetOption("sampleDataDir"), "airsample.csv")
                of <-  rxDataStep(inData = inFile, outFile = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData/airline20170428_2.xdf", 
                             transformVars = c("ArrDelay", "CRSDepTime","DayOfWeek")
                            ,transforms = list(ArrDelay = as.integer(ArrDelay), CRSDepTime = as.numeric(CRSDepTime), DayOfWeek = as.character(DayOfWeek))
                            ,overwrite = TRUE
                            ,maxRowsByCols = 10000000
                            ,rowsPerRead = 200000)
                OutputDataSet <- rxXdfToDataFrame(of)'

DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT 1 AS N'

EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((ArrDelay INT
                    ,CRSDepTime DECIMAL(6,4)
                    ,DofWeek NVARCHAR(20)))
GO
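If you want to verify the XDF file created by the step above from a plain R session, rxGetInfo is a quick way to do it. This is only a minimal sketch, assuming the same file path as in the script above:

# sanity check of the XDF file created above
library(RevoScaleR)
xdfFile <- "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData/airline20170428_2.xdf"
rxGetInfo(data = xdfFile, getVarInfo = TRUE, numRows = 5)  # row count, variable types and the first 5 rows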

 

So the whole process consists of creating a table, converting the above sp_execute_external_script call into a procedure, and importing the results from the external procedure into the table.

--Complete process
CREATE TABLE AirFlights_small 
(id INT IDENTITY(1,1)
,ArrDelay INT
,CRSDepTime DECIMAL(6,4)
,DofWeek NVARCHAR(20) 
);
GO

CREATE Procedure ImportXDFtoSQLTable
AS
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                rxOptions(sampleDataDir = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData")
                inFile <- file.path(rxGetOption("sampleDataDir"), "airsample.csv")
                of <-  rxDataStep(inData = inFile, outFile = "airline20170428_2.xdf", 
                transformVars = c("ArrDelay", "CRSDepTime","DayOfWeek")
            ,transforms = list(ArrDelay = as.integer(ArrDelay), CRSDepTime = as.numeric(CRSDepTime), DayOfWeek = as.character(DayOfWeek))
            ,overwrite = TRUE
            ,maxRowsByCols = 10000000)
             OutputDataSet <- data.frame(rxReadXdf(file=of, varsToKeep=c("ArrDelay", "CRSDepTime","DayOfWeek")))'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT 1 AS N'
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((ArrDelay INT,CRSDepTime DECIMAL(6,4),DofWeek NVARCHAR(20)));
GO

INSERT INTO AirFlights_small
EXECUTE ImportXDFtoSQLTable;
GO

 

There you go. The data is in a T-SQL table. Now we can start with the comparisons. I will be measuring the time needed to get the average arrival delay per day of the week.


RevoScaleR

Using the RevoScaleR package, I will call the rxCrossTabs function with the help of the transforms argument to convert the day of the week into a factor:

#importing data
# note: inFile is assumed to point at the CSV, as in the earlier step, e.g.:
# inFile <- file.path(rxGetOption("sampleDataDir"), "airsample.csv")
outFile2 <- rxDataStep(inData = inFile, outFile = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData/airline20170428_2.xdf", 
            transformVars = c("ArrDelay", "CRSDepTime","DayOfWeek")
           ,transforms = list(ArrDelay = as.integer(ArrDelay), CRSDepTime = as.numeric(CRSDepTime), DayOfWeek = as.character(DayOfWeek))
           ,overwrite = TRUE
           ,maxRowsByCols = 10000000)

of2 <- data.frame(rxReadXdf(file=outFile2, varsToKeep=c("ArrDelay", "CRSDepTime","DayOfWeek")))

# 'transforms' reused by rxCrossTabs/rxLinMod below (assumed definition,
# matching the prose above: convert DayOfWeek to a factor)
transforms <- expression(list(DayOfWeek = as.factor(DayOfWeek)))

summary(rxCrossTabs(ArrDelay~DayOfWeek
                    ,data = of2  #outFile2
                    ,transforms = transforms
                    ,blocksPerRead=300000), output="means")

Now get those times:

# Getting times
system.time({ 
  summary(rxCrossTabs(ArrDelay~DayOfWeek
                      ,data = of2
                      ,transforms = transforms
                      ,blocksPerRead=300000), output="means")
  })

The elapsed time is 7.89 seconds, with a computation time of 3.84 seconds.

Rows Read: 8400013, Total Rows Processed: 8400013, Total Chunk Time: 3.825 seconds 
Computation time: 3.839 seconds.
   user  system elapsed 
   2.89    0.37    7.89 
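As a quick cross-check of these means (not part of the original measurement), the same per-day averages can be computed with base R on the in-memory data frame. A minimal sketch, assuming of2 from the step above:

# cross-check: average arrival delay per day of the week with base R
system.time({
  base_means <- aggregate(ArrDelay ~ DayOfWeek, data = of2, FUN = mean)  # formula interface drops NA rows
})
base_means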

 

T-SQL query without any specifics

To have a baseline, let’s run the following query:

SET STATISTICS TIME ON;
SELECT 
[DofWeek]
,AVG(ArrDelay) AS [means]
FROM
    AirFlights_small
GROUP BY 
    [DofWeek]
SET STATISTICS TIME OFF;

And check these time statistics

 SQL Server Execution Times:
CPU time = 6124 ms,  elapsed time = 2019 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.

Obviously, the CPU / computation time is higher, although the elapsed time is shorter.

ColumnStore Table

Let’s create a copy of the table and add a nonclustered columnstore index.

CREATE TABLE AirFlights_CS
(id INT IDENTITY(1,1)
,ArrDelay INT
,CRSDepTime DECIMAL(6,4)
,DofWeek NVARCHAR(20) 
);
GO
INSERT INTO AirFlights_CS(ArrDelay, CRSDepTime, DofWeek)
SELECT ArrDelay, CRSDepTime, DofWeek FROM AirFlights_small 

CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_AirFlight
ON AirFlights_CS
(id, ArrDelay, CRSDepTime, DofWeek);
GO

With the execution of the same query

SET STATISTICS TIME ON;
SELECT 
[DofWeek]
,AVG(ArrDelay) AS [means]
FROM
  AirFlights_CS
GROUP BY 
    [DofWeek]
SET STATISTICS TIME OFF;

The following time statistics are in

 SQL Server Execution Times:
CPU time = 202 ms,  elapsed time = 109 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.

 

In-Memory OLTP

To get a memory-optimized table, we need to add a memory-optimized filegroup to the database and create a table with MEMORY_OPTIMIZED turned on:

CREATE TABLE dbo.AirFlight_M   
(  
  id INT NOT NULL PRIMARY KEY NONCLUSTERED
 ,ArrDelay INT
 ,CRSDepTime DECIMAL(6,4) 
 ,DofWeek NVARCHAR(20)
) WITH (MEMORY_OPTIMIZED=ON, DURABILITY = SCHEMA_AND_DATA);
GO

And insert the data

INSERT INTO AirFlight_M
SELECT * FROM AirFlights_small

Running the simple query

SET STATISTICS TIME ON;
SELECT 
[DofWeek]
,AVG(ArrDelay) AS [means]
FROM
    AirFlight_M
GROUP BY 
    [DofWeek]
SET STATISTICS TIME OFF;

results are:

 SQL Server Execution Times:
CPU time = 6186 ms,  elapsed time = 1627 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.

These results were somewhat expected, mostly because the columnstore table is the only one with an index, and its reads (also visible in the execution plans) are optimized in comparison to the others. The degree of parallelism and clustered or nonclustered indexes could also be pushed further, but the idea was to keep the tests similar to the one in the RevoScaleR and R environment. With R, we cannot put any index on the XDF file.

In R we run:

system.time({ 
LMResults <- rxLinMod(ArrDelay ~ DayOfWeek, data = outFile2, transforms = transforms)
LMResults$coefficients
})
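For reference, the same model can also be fitted with base R’s lm() on the in-memory data frame; the coefficients should match the rxLinMod output. A minimal sketch, assuming of2 from the earlier step (lm() keeps everything in RAM, so this is only an option for a sample of this size):

# base R equivalent of the rxLinMod call above, for comparison
system.time({
  LMResults_lm <- lm(ArrDelay ~ DayOfWeek, data = of2)  # character DayOfWeek is treated as a factor
})
coef(LMResults_lm)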

And in SSMS we run:

SET STATISTICS TIME ON;
-- 1. T-SQL
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                LMResults <- rxLinMod(ArrDelay ~ DofWeek, data = InputDataSet)
                OutputDataSet <- data.frame(LMResults$coefficients)'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT ArrDelay, DofWeek FROM [dbo].[AirFlights_small]'
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((
            --DofWeek NVARCHAR(20)
        --    ,
            Coefficient DECIMAL(10,5)
            ));
GO
SET STATISTICS TIME OFF;


SET STATISTICS TIME ON;
-- 2. ColumnStore
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                LMResults <- rxLinMod(ArrDelay ~ DofWeek, data = InputDataSet)
                OutputDataSet <- data.frame(LMResults$coefficients)'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT ArrDelay, DofWeek FROM [dbo].[AirFlights_CS]'
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((
            --DofWeek NVARCHAR(20)
        --    ,
            Coefficient DECIMAL(10,5)
            ));
GO
SET STATISTICS TIME OFF;


SET STATISTICS TIME ON;
-- 3. Memory optimized
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                LMResults <- rxLinMod(ArrDelay ~ DofWeek, data = InputDataSet)
                OutputDataSet <- data.frame(LMResults$coefficients)'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT ArrDelay, DofWeek FROM [dbo].[AirFlight_M]'
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((
            --DofWeek NVARCHAR(20)
        --    ,
            Coefficient DECIMAL(10,5)
            ));
GO
SET STATISTICS TIME OFF;

 

Conclusion

Gathering statistics on CPU time and elapsed time when running a simple linear regression, this is the comparison:

df_LR_comparison <- data.frame (
  method = c("T-SQL", "ColumnStore", "Memory Optimized", "RevoScaleR")
  ,CPUtime = c(3000,1625,2156,7689)
  ,ElapsedTime = c(14323,10851,10600,7760)
  )
library(ggplot2)

ggplot(df_LR_comparison, aes(method, fill=method)) + 
  geom_bar(aes(y=ElapsedTime), stat="identity") +
  geom_line(aes(y=CPUtime, group=1), color="white", size=3) +
  scale_colour_manual(" ", values=c("d1" = "blue", "d2" = "red"))+
  #scale_fill_manual("",values="red")+
  theme(legend.position="none")

This shows that the elapsed time for the R environment with RevoScaleR (reading data from XDF) is the fastest, whereas the plain T-SQL table queried through sp_execute_external_script with RevoScaleR gives the slowest response.

(Plot: elapsed time per method shown as bars, with CPU time overlaid as a white line)

In terms of CPU time (white line), the columnstore table with the RevoScaleR call through the external procedure outperforms all the others.

Final conclusion: when running statistical analysis (using RevoScaleR or any other R library), use columnstore and index-optimized tables/views to get the best CPU and elapsed times. It is also important to remember that any aggregations and calculations that can be done within SQL Server are better performed there.

 

As always, code is available at GitHub.

 

Happy coding! 🙂

Linear regression in “The Man who counted”

Recently, I got a book by the Brazilian writer Júlio César de Mello e Souza (published under the pen name Malba Tahan), titled The Man Who Counted. The book is a collection of mathematical stories, very similar to Scheherazade’s 1001 Nights, where mathematical story-telling is the center of the book.


In story 5, “In so many words”, Malba describes a simple algebraic problem of proportion between the price asked for lodging and the price of the jewels sold.

“This man,” old Salim said, pointing to the jeweler, “came from Syria to sell 
precious stones in Baghdad. He promised he would pay 20 dinars for his 
lodgings if he sold all of his jewels for 100 dinars and 35 dinars if he 
sold them for 200. After several days of wandering about, he ended up selling 
all of them for 140 dinars. How much does he owe me according to our agreement?”

Both the jeweler and the lodge owner calculate the result, each using a percent proportion, and both end up with a wrong result, each in his own favor:

1. Jeweler:

200 : 35 :: 140 : x

x = (35 x 140) / 200 = 24.5

2. Lodge Owner:

100 : 20 :: 140 : x

x = (20 x 140) / 100 = 28

 

With two different results, both would end up in an argument, so a third calculation was needed:

Sale Price        Price of Lodgings
200               35
-100             -20
-----            -------
100               15

 

So the difference between the two cases forms a new proportion: for the extra 40 dinars of sale price above 100 (140 - 100), the extra lodging price is x, and the total price of lodging would be 20 + x = 26.

100 : 15 :: 40 : x

x = (15 x 40) / 100 = 6
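All three calculations are easy to reproduce in R as plain arithmetic (a small illustrative snippet, not from the book):

# jeweler's, lodge owner's and the difference-based calculation
jeweler    <- (35 * 140) / 200               # 24.5
lodge      <- (20 * 140) / 100               # 28
difference <- 20 + (15 * (140 - 100)) / 100  # 20 + 6 = 26
c(jeweler = jeweler, lodge = lodge, difference = difference)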

 

Mathematically speaking, the problem is also interesting to solve using linear regression, since the two pairs of points [200, 35] and [100, 20] define a linear prediction function, and we need to predict what the price of lodging would be when the sale price of the jewels is 140.

diamond <- c(100, 200)   # sale price of the jewels
sleep   <- c(20, 35)     # price of lodging

# regression
sleep_model <- lm(sleep ~ diamond)

plot(x=diamond, y=sleep)
abline(sleep_model)

(Plot: price of lodging vs. sale price of the jewels, with the fitted regression line)

Now we can call this a prediction, which is what Beremiz actually does in his head.

predict_data <- data.frame(diamond=140)
fit <- predict(sleep_model, predict_data, interval = "predict")

#new value for diamond=140
fit[1]

The result is 26, which is the correct result both algebraically and according to the R prediction.
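The fitted coefficients can also be verified by hand: with only two points, the slope is (35 - 20) / (200 - 100) = 0.15 and the intercept is 20 - 0.15 * 100 = 5, so the prediction at 140 is 5 + 0.15 * 140 = 26. A tiny illustrative check in R:

# coefficients of the two-point line, computed by hand
slope     <- (35 - 20) / (200 - 100)   # 0.15
intercept <- 20 - slope * 100          # 5
intercept + slope * 140                # 26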

In this case, linear regression does the same as the proportion calculation, but what strikes me is: which calculation, not mathematically speaking, makes more sense? 26, 24.5 or 28? And which method for calculating the next lodging price should satisfy both the jeweler and the lodge owner?

Happy reading!

 

 

First thoughts on R integration in SQL Server 2016

This blog post is a very personal confession; it might be delicate, but it should be understood as a proposal or a plan for improvement.

Update #1: Typing error in the title; the correct title is First thoughts on R integration in SQL Server 2016 (update: June 24, 2016)

Before I start, let me give some background and context. I have been working with statistics and databases for quite some time now and have used different tools for statistical purposes: starting with SPSS, Lisrel and Pajek, continuing with SPSS Modeler, then moving to Weka and R together, heavily using SQL Server SSAS, working with Orange, and in the past 2 years working with SAS and SAP HANA Analytics. With all the products I have used (and I am not here to rank them or compare which one is better or which one I prefer), some things remain the same: getting to know your data in detail is the most important one, whereas applying statistical knowledge to it produces the result.

So I have had the opportunity to work with different datasets, for different companies, on different projects and with different tools. And, looking back, I am very happy to have done it. So these thoughts are not me complaining about any of the products or solutions, but merely looking at improvements based on my years of experience.

Working with R in SQL Server has been my personal thing for a couple of years. The story goes back 4 years, when I started building a web-based engine for creating marketing campaigns based on input parameters and desired goals. Once the business/marketing department decided on a new campaign, they used a web-based (in-house built) program to select product groups, a customer segment and some minor additional information (time frame, validity of the campaign, legal matters, …), clicked OK, and that was it. In the background, a stored procedure generated a long T-SQL script that (1) pushed data into the SQL Server engine and (2) selected the proper model that fed the data to the R engine. So going back to 2012, I had built a framework for communication between SQL Server and the R engine. Most of the framework is available on the SQLServerCentral web page. The framework had some security flaws and issues (calling an external procedure, storing data in external files) and mostly performance issues (scaling out wasn't possible, and R was solely dependent on the RAM of the server), but overall it worked and the project was put into production.

Back then I wrote my framework so that it would be as easy to use as possible, storing and handling R code and T-SQL code in a SQL Server database. Most of the problems I had were that SSMS couldn't debug or validate R code, and, to some extent, converting R results back to a table also proved not to be as easy as one might think. But overall, I was still in the middle, between the platform and putting the solution into production, mainly to control the input data and the results (data outputted in the form of recommendations or campaign lists).

After working with R integration in SQL Server 2016 for the past couple of months, I have to admit I was happy that the security issues were solved, and the performance issues were solved with the RevoScaleR library. Both are very welcome features. Come to think of it, Microsoft solved the two main issues. Looking at it from the perspective of having built my own framework, the security issues really, badly needed to be solved. The same logic applies to the workload and working with large datasets; performance has proven really great. In addition, keep in mind that with R integration Microsoft is bringing different roles of people closer together rather than keeping them separated, so DBAs, analysts and developers can now all be part of one data process. Integration with Azure is also a very positive thing; think of it as an extension to your flavor of landscape. Support for different databases is also something that will be very helpful in the future, though I admit I haven't tested it yet.

What I expect Microsoft to do for R integration in the future:

  • support R in Analysis Services
  • better support for R in Integration Services
  • stronger R integration with DQS and MDM
  • native integration of R into Transact-SQL (calling a stored procedure every time for every kind of statistic might be tiring)
  • DMVs for an R profiler, R debugging and R performance monitoring
  • support for HTML files and JS / jQuery scripts in SSRS or Power BI
  • improved usage of R exports with FileTable
  • open server Endpoint objects for R in order to use the full extent of R
  • improved Resource Governor support for R in Standard edition
  • answers to questions on auditing and profiling of R code from T-SQL, and
  • security (a couple of questions there as well)

I have most probably forgotten a couple of points for improvement. But this was the more technical part; now let's talk about the user experience point of view:

  • think about supporting an R debugger / R code validation in SSMS
  • in fact, bring R into SSMS (!) instead of VS, or create RTMS (R Tools for SQL Server Management Studio)
  • do not limit R exports to data frames and images only, but also support vectors, lists, etc.

Of course, not to get ahead of myself too much, I forgot to ask myself what the purpose of the integration actually was.


Is it really only about enabling super predictive analytics, or is it about making it a statistical tool as well? If the latter, then I would strongly suggest recapping my open suggestions.

So, for the end, I will leave you with a hint 🙂 I am preparing and building a bunch of scripts (from my previous experience and from the framework that I have built) to help data scientists and developers use R much more easily. But (!) a mathematical and statistical background is still something I would put down as highly recommended when digging into R.

Follow the blog where I will keep you updated on the scripts and where the code will be available.

 

Happy R-SQLing 🙂