SQL Saturday statistics – Web Scraping with R and SQL Server

I wanted to answer a simple question: how many times has a particular topic been presented, and by how many different speakers?

Sounds interesting, and tackling it should not be hard; the final numbers may vary slightly, though, since some text analysis is involved.

First of all, some web scraping to get the information from the SQLSaturday web page. With R (or Python) integration in SQL Server, reading the information from the website is a fairly straightforward task:

EXEC sp_execute_external_script
 @language = N'R'
 ,@script = N'
 library(rvest)
 library(XML)
 library(dplyr)

 # URL to the schedule
 url_schedule <- ''http://www.sqlsaturday.com/687/Sessions/Schedule.aspx''

 # Read the HTML page
 webpage <- read_html(url_schedule)

 # Event schedule cells
 schedule_info <- html_nodes(webpage, ''.session-schedule-cell-info'')

 # Extract the text content
 ht <- html_text(schedule_info)
 df <- data.frame(data = ht)

 # Create an empty data frame for the results
 df_res <- data.frame(title = c(), speaker = c())

 # Titles and speakers alternate: odd rows hold titles, even rows hold speakers
 for (i in 1:nrow(df)){
    if (i %% 2 != 0) print(paste0("title is: ", df$data[i]))    # odd row - title
    if (i %% 2 == 0) print(paste0("speaker is: ", df$data[i]))  # even row - speaker
    df_res <- rbind(df_res, data.frame(title = df$data[i], speaker = df$data[i+1]))
 }

 # Keep every second row, i.e. the complete title/speaker pairs
 df_res_new <- df_res[seq(1, nrow(df_res), 2), ]
 OutputDataSet <- df_res_new'

Python offers the BeautifulSoup library, which does pretty much the same (or even better) job as the rvest and XML packages combined. Once we have the data from a test page (in this case the Slovenian SQLSaturday 2017 schedule, simply because it is awesome), we can “walk through” the whole web page and generate all the needed information.

The SQLSaturday website has every event enumerated, which makes it easy to parametrize the web-scraping process:

[Screenshot: SQLSaturday #687 – Slovenia 2017 session schedule]

So we will scrape through the last 100 events by simply incrementing the event number; the input parameter will be parsed as:

http://www.sqlsaturday.com/600/Sessions/Schedule.aspx

http://www.sqlsaturday.com/601/Sessions/Schedule.aspx

http://www.sqlsaturday.com/602/Sessions/Schedule.aspx

and so on, regardless of whether the page for a given event actually exists. Results will be returned to the SQL Server database.
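Before wrapping everything into a stored procedure, here is a minimal R-only sketch of that loop – my own illustration rather than the code used below – where tryCatch simply skips events whose schedule pages do not respond:

library(rvest)

events <- 600:690   # SQLSaturday event numbers to scrape

all_sessions <- lapply(events, function(ev) {
    url <- paste0("http://www.sqlsaturday.com/", ev, "/Sessions/Schedule.aspx")
    tryCatch({
        webpage <- read_html(url)
        cells   <- html_text(html_nodes(webpage, ".session-schedule-cell-info"))
        if (length(cells) < 2) return(NULL)
        n <- floor(length(cells) / 2)
        # odd entries are titles, even entries are speakers
        data.frame(event   = ev,
                   title   = cells[seq(1, 2 * n, 2)],
                   speaker = cells[seq(2, 2 * n, 2)],
                   stringsAsFactors = FALSE)
    }, error = function(e) NULL)   # skip events whose page is not reachable
})

df_all <- do.call(rbind, all_sessions)

The same looping logic can, of course, live on the SQL Server side instead, which is what the stored procedure below is for.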

Creating a stored procedure will do the job:

USE SqlSaturday;
GO

CREATE OR ALTER PROCEDURE GetSessions
 @eventID SMALLINT
AS

DECLARE @URL VARCHAR(500)
SET @URL = 'http://www.sqlsaturday.com/' +CAST(@eventID AS NVARCHAR(5)) + '/Sessions/Schedule.aspx'

PRINT @URL

DECLARE @TEMP TABLE
(
 SqlSatTitle NVARCHAR(500)
 ,SQLSatSpeaker NVARCHAR(200)
)

DECLARE @RCODE NVARCHAR(MAX)
SET @RCODE = N' 
 library(rvest)
 library(XML)
 library(dplyr)
 library(httr)
 library(curl)
 library(selectr)
 
 #URL to schedule
 url_schedule <- "'
 
DECLARE @RCODE2 NVARCHAR(MAX) 
SET @RCODE2 = N'"
 # Read the HTML page
 webpage <- html_session(url_schedule) %>%
            read_html()

 # Event schedule cells
 schedule_info <- html_nodes(webpage, ''.session-schedule-cell-info'')

 # Extract the text content
 ht <- html_text(schedule_info)
 df <- data.frame(data = ht)

 # Create an empty data frame for the results
 df_res <- data.frame(title = c(), speaker = c())

 # Titles and speakers alternate: odd rows hold titles, even rows hold speakers
 for (i in 1:nrow(df)){
    if (i %% 2 != 0) print(paste0("title is: ", df$data[i]))    # odd row - title
    if (i %% 2 == 0) print(paste0("speaker is: ", df$data[i]))  # even row - speaker
    df_res <- rbind(df_res, data.frame(title = df$data[i], speaker = df$data[i+1]))
 }

 # Keep every second row, i.e. the complete title/speaker pairs
 df_res_new <- df_res[seq(1, nrow(df_res), 2), ]
 OutputDataSet <- df_res_new ';

DECLARE @FINAL_RCODE NVARCHAR(MAX)
SET @FINAL_RCODE = CONCAT(@RCODE, @URL, @RCODE2)

INSERT INTO @Temp
EXEC sp_execute_external_script
 @language = N'R'
 ,@script = @FINAL_RCODE


INSERT INTO SQLSatSessions (sqlSat,SqlSatTitle,SQLSatSpeaker)
SELECT 
 @EventID AS sqlsat
 ,SqlSatTitle
 ,SqlSatSpeaker
FROM @Temp

 

Before you run this, just a little environment setup:

USE [master];
GO

CREATE DATABASE SQLSaturday;
GO

USE SQLSaturday;
GO

CREATE TABLE SQLSatSessions
(
 id SMALLINT IDENTITY(1,1) NOT NULL
,SqlSat SMALLINT NOT NULL
,SqlSatTitle NVARCHAR(500) NOT NULL
,SQLSatSpeaker NVARCHAR(200) NOT NULL
)

 

There you go! Now you can run the stored procedure for a particular event (in this case SQLSaturday Slovenia 2017):

EXECUTE GetSessions @eventID = 687

or you can run the procedure against multiple SQLSaturday events and scrape the data from the SQLSaturday.com website in one go.

For the Slovenian SQLSaturday, I get the following list of sessions and speakers:

[Screenshot: sessions and speakers returned for SQLSaturday #687]

Please note that if you are running this code behind a firewall or proxy, some additional proxy or firewall configuration might be needed!
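If a proxy sits in between, one option (an assumption on my part, not something from the original setup) is to configure it globally for httr, which rvest uses under the hood; the host, port and credentials below are placeholders:

library(httr)

# route all httr/rvest requests through a corporate proxy (placeholder values)
set_config(use_proxy(url      = "http://proxy.mycompany.local",
                     port     = 8080,
                     username = "proxy_user",
                     password = "proxy_password"))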

Coming back to the original question: how many times has Query Store been presented at SQLSaturdays (from SQLSat600 to SqlSat690)? Here is the frequency table:

[Screenshot: frequency table of topics across SQLSaturday events]

Or, presented with a pandas graph:

[Chart: session topic statistics]

Query Store is popular, beyond all the R, Python or Azure ML topics, but PowerShell is gaining popularity like crazy. Good work, PowerShell people! 🙂

UPDATE #1: More statistics: in general, a PowerShell session is presented at every second SQLSaturday and Query Store at every third, whereas there are at least two Azure-related topics at every SQLSat event (based on events SqlSat600 to SqlSat690).
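For reference, here is a small R sketch of how such frequencies can be computed once the scraped sessions are available as a data frame; the toy data frame below only mimics the SQLSatSessions columns, and in practice you would pull the real table from SQL Server:

# toy stand-in for the SQLSatSessions table
df_sessions <- data.frame(
    sqlSat        = c(687, 687, 660, 645),
    SqlSatTitle   = c("Query Store in action", "Introduction to PowerShell",
                      "Query Store deep dive", "Azure SQL Database basics"),
    SQLSatSpeaker = c("Speaker A", "Speaker B", "Speaker C", "Speaker D"),
    stringsAsFactors = FALSE)

topic <- "Query Store"
hits  <- df_sessions[grepl(topic, df_sessions$SqlSatTitle, ignore.case = TRUE), ]

# how many times the topic was presented, and by how many different speakers
nrow(hits)
length(unique(hits$SQLSatSpeaker))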

As always, code is available at Github.

 


Robust IRIS Dataset?

This blog post was born out of pure curiosity about the robustness of the IRIS Dataset. Biological datasets do not need to be that big in comparison to datasets of customers, consumption, stock and anything that might be volatile.

Back at university, I remember an occasion when we were measuring the length of frog legs and other froggy features. After just a couple of measurements, further predictions were steady. Also, any kind of sampling (RS and SRS, cluster/stratified sampling, sampling with replacement and many other creative ways of sampling) proved to be rigid and robust, and converged quickly to a good result.

Therefore, I decided to put the IRIS dataset to the test using a simple classification method: first calculating the Euclidean distance, then finding the nearest neighbours, and based on that checking the predicted flower species against the labels.

The accuracy of the prediction was tested by mapping the original species to the predicted ones. And the test was: how small can the training dataset be and still give a good result?

After some Python and R code, the results were in.

I have tested following pairs (train:test sample size):

  • 80% – 20%
  • 60% – 40%
  • 50% – 50%
  • 30% – 70%
  • 10% – 90%

Note that the IRIS dataset has 150 observations, evenly distributed among three species. The following Python code loops through the calculation of the Euclidean distance:

# run the classification script 3,000 times and collect the results
for x in range(3000):
    exec(open("./classification.py").read(), globals())

At the end I have generated the file:

[Screenshot: generated predictions file]

With these results, a simple piece of R code was used to generate the scatter plot:

library(ggplot2)
setwd("C:\\Predictions\\")
df_pred <- data.frame(read.table("results_split.txt", sep=";"))
p <- ggplot(df_pred, aes(x = V3, y = V1)) 
p <- p + geom_point()
p <- p + labs(x="Accuracy (%) of predictions", y="Size of training subset")
p

Which resulted in:

[Plot: prediction accuracy vs. size of training subset]

The graph clearly shows that a 10% training set (10% of the 150 observations) can generate very accurate predictions roughly once every 1.35 runs.

Other pairs, taking 30% or 50% of the data for training, give close to 100% accuracy almost every time.

A snippet of the Python code to calculate the Euclidean distance:

import math

def eucl_dist(set1, set2, length):
    # sum of squared differences over the first `length` features
    distance = 0
    for x in range(length):
        distance += pow(set1[x] - set2[x], 2)
    return math.sqrt(distance)

and neighbours:

import operator

def find_neighbors(train, test, nof_class):
    # distance from the test observation to every training observation
    distances = []
    length_dist = len(test) - 1
    for x in range(len(train)):
        dist = eucl_dist(test, train[x], length_dist)
        distances.append((train[x], dist))
    distances.sort(key=operator.itemgetter(1))
    # keep the nof_class closest observations
    neighbour = []
    for x in range(nof_class):
        neighbour.append(distances[x][0])
    return neighbour
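For completeness, here is an end-to-end sketch of the same experiment in R, using knn from the class package as a stand-in for the hand-rolled distance and neighbour functions; the 10% training split shown is just one of the tested pairs:

library(class)   # knn()

set.seed(2017)
data(iris)

# 10% training / 90% test split
train_idx <- sample(seq_len(nrow(iris)), size = round(0.10 * nrow(iris)))
train <- iris[train_idx, 1:4]
test  <- iris[-train_idx, 1:4]

# 3-nearest-neighbour classification, then accuracy against the true species
pred     <- knn(train, test, cl = iris$Species[train_idx], k = 3)
accuracy <- mean(pred == iris$Species[-train_idx])
accuracy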

 

Conclusion: the IRIS dataset is – due to the nature of the measurements and observations – robust and rigid; one can get very good accuracy results on a small training set. Everything beyond 30% of the data for training the model is, in this particular case, just additional overhead.

Happy R & Python coding! 🙂

Passing two SQL queries to sp_execute_external_script

Recently, I got a question on one of my previous blog posts: is there a possibility to pass two queries in the same run-time as arguments to the external procedure sp_execute_external_script?

Some of the arguments of the procedure sp_execute_external_script have enumerated names. This is the case for the input dataset: as the argument name @input_data_1 suggests, one could easily (and understandably) assume that there is also an @input_data_2 argument, and so on. Unfortunately, this is not the case. The external procedure can accept only one T-SQL dataset, passed through this parameter.

There are many reasons for that; one is the cost of sending several datasets to the external process and back. This effectively forces the user to rethink and pre-prepare the dataset (that is, do all the data munging beforehand) prior to sending it to the external procedure.

But there are workarounds for passing additional queries to sp_execute_external_script. I am not advocating this, and I would advise against such usage, but here it is.

First, I will create two small datasets using T-SQL:

USE SQLR;
GO

DROP TABLE IF EXISTS dataset;
GO

CREATE TABLE dataset
(ID INT IDENTITY(1,1) NOT NULL
,v1 INT
,v2 INT
CONSTRAINT pk_dataset PRIMARY KEY (id)
)

SET NOCOUNT ON;
GO

INSERT INTO dataset(v1,v2)
SELECT TOP 1
 (SELECT TOP 1 number FROM master..spt_values WHERE type IN ('EOB') ORDER BY NEWID()) AS V1
,(SELECT TOP 1 number FROM master..spt_values WHERE type IN ('EOD') ORDER BY NEWID()) AS v2
FROM master..spt_values
GO 50

This dataset will be fed directly into the @input_data_1 argument. The next one will be accessed from within the R code:

CREATE TABLE external_dataset
(ID INT IDENTITY(1,1) NOT NULL
,v1 INT
CONSTRAINT pk_external_dataset PRIMARY KEY (id)
)

SET NOCOUNT ON;
GO

INSERT INTO external_dataset(v1)
SELECT TOP 1
 (SELECT TOP 1 number FROM master..spt_values WHERE type IN ('EOB') ORDER BY NEWID()) AS V1
FROM master..spt_values
GO 50

Normally,  one would use a single dataset like:

EXEC sp_execute_external_script
     @language = N'R'
    ,@script = N'OutputDataSet <- data.frame(MySet);'
    ,@input_data_1 = N'SELECT TOP 5 v1, v2 FROM dataset;'
    ,@input_data_1_name = N'MySet'
WITH RESULT SETS
((
 Val1 INT
 ,Val2 INT
))

But by “injecting” an ODBC connection into the R code, we can allow the external procedure to reach back to SQL Server and fetch an additional dataset.

This can be done as follows:

EXECUTE AS USER = 'RR'; 
GO

DECLARE @Rscript NVARCHAR(MAX)
SET @Rscript = '
   library(RODBC)
   myconn <-odbcDriverConnect("driver={SQL Server};
         Server=SICN-KASTRUN;database=SQLR;uid=RR;pwd=Read!2$16")
  External_source <- sqlQuery(myconn, "SELECT v1 AS v3 
                    FROM external_dataset")
  close(myconn) 
  Myset <- data.frame(MySet)
   #Merge both datasets
   mergeDataSet <- data.frame(cbind(Myset, External_source));'

EXEC sp_execute_external_script
    @language = N'R'
   ,@script = @Rscript
   ,@input_data_1 = N'SELECT v1, v2 FROM dataset;'
   ,@input_data_1_name = N'MySet'
   ,@output_data_1_name = N'mergeDataSet'
WITH RESULT SETS
((
    Val1 INT
   ,Val2 INT
   ,Val3 INT
))

REVERT;
GO

And the result is the two datasets merged, with three columns in total:

[Screenshot: merged result set with three columns]

which correspond to two datasets:

-- Check the results!
SELECT * FROM dataset
SELECT * FROM external_dataset

There are, as already mentioned, several arguments against this approach, and I would not recommend it. Some of them are:

  • validating and keeping R code in one place
  • performance issues
  • additional costs of data transferring
  • using ODBC connectors
  • installing additional R packages (in my case RODBC package)
  • keeping different datasets in one place
  • security issues
  • additional login/user settings
  • firewall inbound/outbound rules setting

This, of course, can also be achieved with the *.XDF file format, if the data is stored locally or on the server as files.
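As a rough illustration of that idea: the second dataset is exported to an XDF file once, and the R script inside sp_execute_external_script then reads it from disk instead of opening an ODBC connection. The file path and the stand-in data frames below are hypothetical:

library(RevoScaleR)

# one-off export of the second dataset to an XDF file
external_df <- data.frame(v3 = c(5, 10, 15))          # stand-in for the external_dataset table
rxDataStep(inData = external_df,
           outFile = "C:\\Temp\\external_dataset.xdf",
           overwrite = TRUE)

# inside the @script body, MySet would be the data frame named by @input_data_1_name;
# a stand-in keeps this sketch self-contained
MySet <- data.frame(v1 = c(1, 2, 3), v2 = c(4, 5, 6))
External_source <- rxReadXdf(file = "C:\\Temp\\external_dataset.xdf")
mergeDataSet    <- data.frame(cbind(MySet, External_source))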

As always, code is available at Github.

Happy R-SQLing! 🙂

Graph Database with T-SQL

SQL Server 2017 brings users a graph database. A graph database is a type of database that uses graph structures – edges (or relationships) and nodes (or vertices) with features – to store data. Relationships between nodes are expressed through edges, and features (also called properties) hold values that represent the strength or direction of the edges. Many such links can be retrieved easily with a single operation (in contrast to plain T-SQL, which can quickly become complex).

Links are also stored in the database, and when querying the graph database they play the role of a join clause for collecting the related data. Hierarchical structures and complex social structures in particular can be stored and queried much more easily than in a “relational” database.

Network databases and graph databases are similar, but a network database might have problems when dealing with several chains of edges.

When to use a graph/network database should be obvious: when you have hierarchical data, or when your data is so complex, with many many-to-many relationships, that it would make a relational database hard to query.

[Figure: small undirected network with labeled nodes]

 

To put the graph database to the test, I took a bunch of emails from a particular SQL Server MVP distribution list (content will not be shown and all names are anonymized). From my Gmail account, I downloaded some 90 MiB of emails in mbox file format. With some Python scripting, only the FROM and SUBJECT fields were extracted:

import csv
import mailbox
from email.header import decode_header

infile = 'mails.mbox'                                      # path to the exported mbox file (hypothetical)
writer = csv.writer(open('mails.csv', 'w', newline=''))    # output file name is an assumption

writer.writerow(['from','subject'])
for index, message in enumerate(mailbox.mbox(infile)):
    # content = get_content(message)   # helper from the full script (not shown here)
    row = [
        message['from'].strip('>').split('<')[-1],         # keep only the e-mail address
        decode_header(message['subject'])[0][0],"|"
          ]
    writer.writerow(row)

At this point, I have extracted a CSV file with only the From and Subject fields, looking like this:

[Screenshot: extracted From and Subject fields in Excel]

After importing the CSV file to SQL Server database, I have started to prepare my graph database structures:

CREATE TABLE MVP_DataPlatform
(
 MVP_ID INT IDENTITY(1,1) NOT NULL
 ,MVP_Name VARCHAR(100) NULL
 ,MVP_Email VARCHAR(100) NOT NULL
) AS NODE
GO

CREATE TABLE MVP_Topics
(
 Topic_ID INT IDENTITY(1,1) NOT NULL
 ,Title NVARCHAR(250) NOT NULL
) AS NODE
GO

CREATE TABLE Initiated AS EDGE;
CREATE TABLE Replied AS EDGE;

And prepare the data with:

with mails 
as
(
SELECT 
 froms 
 ,REPLACE(SUBSTRING(froms, 1, CASE CHARINDEX(';', froms) 
  WHEN 0 THEN LEN(froms) ELSE CHARINDEX(';',froms)-1 END),',','') AS from_MVP
 ,REPLACE(SUBSTRING(froms, CHARINDEX(';',froms) + 1, 1000),',|','') AS title

FROM [dbo].[clean_mail5_2] 
)
SELECT 
 from_MVP
 ,SUBSTRING(title,2, LEN(title)) as title
 ,CASE WHEN (
 SUBSTRING(title,1, 10) LIKE '%RE%' 
 OR SUBSTRING(title,1, 10) LIKE '%AW%'
 OR SUBSTRING(title,1, 10) LIKE '%FWD%'
 OR SUBSTRING(title,1, 10) LIKE '%OT%'
 ) THEN 0 ELSE 1 END MVP_start
INTO MVP_Mails
FROM mails
 WHERE
 from_MVP like '%@%'

Populating the nodes (vertices) should be an easy task:

INSERT INTO MVP_DataPlatform (MVP_Name, MVP_Email)
SELECT 
 SUBSTRING(from_MVP, 1, CHARINDEX('@',from_MVP)-1) AS MVP_NAME
 ,from_MVP AS MVP_Email
FROM MVP_Mails
GROUP BY 
 SUBSTRING(from_MVP, 1, CHARINDEX('@',from_MVP)-1)
 ,from_MVP;
GO

INSERT INTO MVP_Topics (Title)
SELECT 
 Title
FROM MVP_Mails
GROUP BY 
 title;
GO

But populating the edges (relationships) is part of how you want to define your graph.

I have created two edge tables: Replied and Initiated. When using Replied, the logic is the following:

  1. AB -> send a mail to distribution list =  AB -> AB
  2. CD -> replies to AB  = CD -> AB
  3. EF -> replies to CD (and AB) = EF -> CD
  4. and so on.

For Initiated, the logic is slightly different:

  1. AB -> send a mail to distribution list =  AB -> AB
  2. CD -> replies to AB  = CD -> AB
  3. EF -> replies to CD (and AB) = EF -> AB
  4. and so on…

So the difference is how I want to store the information at step 3, when EF replies. This rule is totally arbitrary and depends on your business question, hierarchy or many-to-many rules. I could even decide to store the information at step 3 normalized, as ((EF -> CD) AND (EF -> AB)).

To store this information in the edge table, the following transformation is needed:

CREATE OR ALTER PROCEDURE INSERT_EDGE
AS
DECLARE @id1 INT;
DECLARE @id2 INT;
DECLARE @i INT;
DECLARE @hid INT;
DECLARE @START INT;

select 
 mail.from_MVP as mail_ID
,pl.MVP_id AS pl_ID
,mail.title
,mail.MVP_Start
,row_number() OVER (partition by title ORDER BY title, MVP_Start DESC) as rn_title
,row_number() Over (order by (select 1)) as rn_global
into #temp
from MVP_Mails as mail
JOIN MVP_DataPlatform AS Pl
ON mail.from_mvp = pl.mvp_email

SELECT @i = MAX(rn_global) FROM #TEMP

SET @hid=1;
WHILE @hid <= @i
BEGIN 
 SELECT @start = MVP_start FROM #temp WHERE rn_global = @hid
 PRINT @start
 SELECT @id1 = pl_id FROM #temp WHERE rn_global = @hid
 PRINT @id1
 IF @start = 1
 SELECT @id2 = pl_id FROM #temp WHERE rn_global = @hid+1
 IF @start = 0
 SET @id2 = @id1
 PRINT @id2

INSERT INTO replied VALUES ((SELECT $NODE_ID FROM MVP_DataPlatform WHERE MVP_ID = @id1), (SELECT $NODE_ID FROM MVP_DataPlatform WHERE MVP_ID = @id2));
 SET @hid = @hid + 1;
END

Executing this procedure:

EXECUTE INSERT_EDGE;

You should be getting a table with relations:

[Screenshot: Replied edge table contents]

And now you can perform simple queries against the graph – including queries over complex relationships – and get the results relatively simply:

SELECT 
 MVP_1.MVP_Name 
 ,MVP_2.MVP_Name
 ,T.Title
 ,T.Topic_ID
FROM 
 MVP_DataPlatform AS MVP_1
 ,MVP_DataPlatform AS MVP_2
 ,Replied
 ,MVP_Topics AS T
WHERE 
 MATCH(MVP_1-(Replied)->MVP_2)

 

Sure, there are many other ways to do the same steps, and also many other ways to represent your data in a graph.

Happy T-SQL Querying! 🙂

 

Installing R packages with rxInstallPackages in Microsoft R Server

The MicrosoftML package brings – in my opinion – a long-anticipated function for installing R packages for SQL Server and Microsoft R Server. And I am super happy.

Last year, in one of my previous blog posts, I showed how to install an R package from SSMS using sp_execute_external_script. Now, with the new MicrosoftML package (part of Microsoft R Server 9.x and above), a new function is available that lets you install packages easily, and does a little bit more.

Code is relatively simple and straightforward:

USE SQLR;
GO

EXECUTE sp_execute_external_script
 @language = N'R'
 ,@script = N'

 packagesToInstall <- c("caret","tree","party")
 library(MicrosoftML)
 SqlServerCC <- RxInSqlServer(connectionString = "Driver=SQL Server;
                Server=SICN-KASTRUN\\SQLSERVER2017C2;Database=SQLR;
                Trusted_Connection=True;")
 rxInstallPackages(pkgs = packagesToInstall, owner = '''',
                   scope = "shared", computeContext = "SqlServerCC");';
GO

This seems way too easy to be true, but it is. Make sure to do a couple of things prior to running this code:

  1. set the compute environment to where your packages are installed
  2. set up the correct permissions and access
  3. check also the TCP/IP protocols

In the rxInstallPackages function, use the computeContext parameter to point either to “Local” or to your “SqlServer” environment. You can also set the scope to shared or private (the difference being that a package installed as shared can be used by different users across different databases, whereas a private one cannot), and you can specify an owner if you are running this command outside the db_owner role.
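For example, a minimal sketch of installing a single package only for the current user on the local compute context might look like this (the package name and scope are illustrative choices, and the compute context is passed by name as in the example above):

library(MicrosoftML)

# install ggplot2 for the current user only, on the local compute context
rxInstallPackages(pkgs = c("ggplot2"),
                  scope = "private",
                  computeContext = "local")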

Happy SQLR-ing!

Performance differences between RevoScaleR, ColumnStore Table and In-Memory OLTP Table

Running *.XDF files with RevoScaleR computational functions, versus having the dataset available in a columnstore table or an in-memory OLTP table, is the focus of comparison for this blog post.

For this test, I will use the airlines dataset, available here. I deliberately picked a 200 MB sample (of the 13 GB dataset) in order to properly test the differences and find the best approach.

After unzipping the file, I will use the following T-SQL query to import the file into SQL Server.

With this example, you can import an XDF file directly into a SQL Server table (note that I first transform the CSV file into XDF and then import the XDF file into the SQL table):

-- must have a write permissions on folder: C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                rxOptions(sampleDataDir = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData")
                inFile <- file.path(rxGetOption("sampleDataDir"), "airsample.csv")
                of <-  rxDataStep(inData = inFile, outFile = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData/airline20170428_2.xdf", 
                             transformVars = c("ArrDelay", "CRSDepTime","DayOfWeek")
                            ,transforms = list(ArrDelay = as.integer(ArrDelay), CRSDepTime = as.numeric(CRSDepTime), DayOfWeek = as.character(DayOfWeek))
                            ,overwrite = TRUE
                            ,maxRowsByCols = 10000000
                            ,rowsPerRead = 200000)
                OutputDataSet <- rxXdfToDataFrame(of)'

DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT 1 AS N'

EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((ArrDelay INT
                    ,CRSDepTime DECIMAL(6,4)
                    ,DofWeek NVARCHAR(20)))
GO

 

So the whole process consists of creating a table, wrapping the above sp_execute_external_script call into a procedure, and inserting the results from the external procedure into the table.

--Complete process
CREATE TABLE AirFlights_small 
(id INT IDENTITY(1,1)
,ArrDelay INT
,CRSDepTime DECIMAL(6,4)
,DofWeek NVARCHAR(20) 
);
GO

CREATE Procedure ImportXDFtoSQLTable
AS
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                rxOptions(sampleDataDir = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData")
                inFile <- file.path(rxGetOption("sampleDataDir"), "airsample.csv")
                of <-  rxDataStep(inData = inFile, outFile = "airline20170428_2.xdf", 
                transformVars = c("ArrDelay", "CRSDepTime","DayOfWeek")
            ,transforms = list(ArrDelay = as.integer(ArrDelay), CRSDepTime = as.numeric(CRSDepTime), DayOfWeek = as.character(DayOfWeek))
            ,overwrite = TRUE
            ,maxRowsByCols = 10000000)
             OutputDataSet <- data.frame(rxReadXdf(file=of, varsToKeep=c("ArrDelay", "CRSDepTime","DayOfWeek")))'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT 1 AS N'
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((ArrDelay INT,CRSDepTime DECIMAL(6,4),DofWeek NVARCHAR(20)));
GO

INSERT INTO AirFlights_small
EXECUTE ImportXDFtoSQLTable;
GO

 

There you go. The data is in a T-SQL table. Now we can start with the comparisons. I will be measuring the time needed to get the average arrival delay per day of the week.

[Screenshot: RStudio session]

RevoScaleR

Using the RevoScaleR package, I will call the rxCrossTabs function with the help of the transforms argument to convert the day of the week into a factor:

#importing data
outFile2 <- rxDataStep(inData = inFile, outFile = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData/airline20170428_2.xdf", 
            transformVars = c("ArrDelay", "CRSDepTime","DayOfWeek")
           ,transforms = list(ArrDelay = as.integer(ArrDelay), CRSDepTime = as.numeric(CRSDepTime), DayOfWeek = as.character(DayOfWeek))
           ,overwrite = TRUE
           ,maxRowsByCols = 10000000)

of2 <- data.frame(rxReadXdf(file=outFile2, varsToKeep=c("ArrDelay", "CRSDepTime","DayOfWeek")))

summary(rxCrossTabs(ArrDelay~DayOfWeek
                    ,data = of2  #outFile2
                    ,transforms = transforms
                    ,blocksPerRead=300000), output="means")
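Note that the rxCrossTabs call references a transforms object that is not defined in the snippet above. A minimal definition – my assumption, based on the stated goal of turning the day of the week into a factor – would be built with expression(), so that RevoScaleR can evaluate it per chunk:

# assumed definition of the transforms object used in the rxCrossTabs / rxLinMod calls
transforms <- expression(list(DayOfWeek = factor(DayOfWeek)))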

Now get those times:

# Getting times
system.time({ 
  summary(rxCrossTabs(ArrDelay~DayOfWeek
                      ,data = of2
                      ,transforms = transforms
                      ,blocksPerRead=300000), output="means")
  })

The results: an elapsed time of 7.8 seconds and a computation time of 3.8 seconds.

Rows Read: 8400013, Total Rows Processed: 8400013, Total Chunk Time: 3.825 seconds 
Computation time: 3.839 seconds.
   user  system elapsed 
   2.89    0.37    7.89 

 

T-SQL query without any specifics

To have a baseline, let’s run the following query:

SET STATISTICS TIME ON;
SELECT 
[DofWeek]
,AVG(ArrDelay) AS [means]
FROM
    AirFlights_small
GROUP BY 
    [DofWeek]
SET STATISTICS TIME OFF;

And check these time statistics

 SQL Server Execution Times:
CPU time = 6124 ms,  elapsed time = 2019 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.

Obviously, the CPU / computation time is higher, although the elapsed time is faster.

ColumnStore Table

Let’s create a nonclustered column store index.

CREATE TABLE AirFlights_CS
(id INT IDENTITY(1,1)
,ArrDelay INT
,CRSDepTime DECIMAL(6,4)
,DofWeek NVARCHAR(20) 
);
GO
INSERT INTO AirFlights_CS(ArrDelay, CRSDepTime, DofWeek)
SELECT ArrDelay, CRSDepTime, DofWeek FROM AirFlights_small 

CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_AirFlight
ON AirFlights_CS
(id, ArrDelay, CRSDepTime, DofWeek);
GO

With the execution of the same query

SET STATISTICS TIME ON;
SELECT 
[DofWeek]
,AVG(ArrDelay) AS [means]
FROM
  AirFlights_CS
GROUP BY 
    [DofWeek]
SET STATISTICS TIME OFF;

The following time statistics come in:

 SQL Server Execution Times:
CPU time = 202 ms,  elapsed time = 109 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.

 

In-Memory OLTP

To get a memory-optimized table, we need to add a memory-optimized filegroup and create a table with MEMORY_OPTIMIZED turned on:

CREATE TABLE dbo.AirFlight_M   
(  
  id INT NOT NULL PRIMARY KEY NONCLUSTERED
 ,ArrDelay INT
 ,CRSDepTime DECIMAL(6,4) 
 ,DofWeek NVARCHAR(20)
) WITH (MEMORY_OPTIMIZED=ON, DURABILITY = SCHEMA_AND_DATA);
GO

And insert the data

INSERT INTO AirFlight_M
SELECT * FROM AirFlights_small

Running the simple query

SET STATISTICS TIME ON;
SELECT 
[DofWeek]
,AVG(ArrDelay) AS [means]
FROM
    AirFlight_M
GROUP BY 
    [DofWeek]
SET STATISTICS TIME OFF;

results are:

 SQL Server Execution Times:
CPU time = 6186 ms,  elapsed time = 1627 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.

These results were somewhat expected, mostly because the columnstore table is the only one with an index, and its reads (as the execution plans also show) are optimized in comparison to the others. Degree of parallelism and clustered/non-clustered indexes could also be pushed further, but the idea was to keep the tests similar to the one in the RevoScaleR / R environment; with R, we cannot put any index on the XDF file.

In R we run:

system.time({ 
LMResults <- rxLinMod(ArrDelay ~ DayOfWeek, data = outFile2, transforms = transforms)
LMResults$coefficients
})

And in SSMS we run:

SET STATISTICS TIME ON;
-- 1. T-SQL
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                LMResults <- rxLinMod(ArrDelay ~ DofWeek, data = InputDataSet)
                OutputDataSet <- data.frame(LMResults$coefficients)'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT ArrDelay, DofWeek FROM [dbo].[AirFlights_small]'
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((
            --DofWeek NVARCHAR(20)
        --    ,
            Coefficient DECIMAL(10,5)
            ));
GO
SET STATISTICS TIME OFF;


SET STATISTICS TIME ON;
-- 2. ColumnStore
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                LMResults <- rxLinMod(ArrDelay ~ DofWeek, data = InputDataSet)
                OutputDataSet <- data.frame(LMResults$coefficients)'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT ArrDelay, DofWeek FROM [dbo].[AirFlights_CS]'
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((
            --DofWeek NVARCHAR(20)
        --    ,
            Coefficient DECIMAL(10,5)
            ));
GO
SET STATISTICS TIME OFF;


SET STATISTICS TIME ON;
-- 3. Memory optimized
DECLARE @RScript nvarchar(max)
SET @RScript = N'library(RevoScaleR)
                LMResults <- rxLinMod(ArrDelay ~ DofWeek, data = InputDataSet)
                OutputDataSet <- data.frame(LMResults$coefficients)'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT ArrDelay, DofWeek FROM [dbo].[AirFlight_M]'
EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @SQLScript
WITH RESULT SETS ((
            --DofWeek NVARCHAR(20)
        --    ,
            Coefficient DECIMAL(10,5)
            ));
GO
SET STATISTICS TIME OFF;

 

Conclusion

Gathering statistics on CPU time and elapsed time when running a simple linear regression, this is the comparison:

df_LR_comparison <- data.frame (
  method = c("T-SQL", "ColumnStore", "Memory Optimized", "RevoScaleR")
  ,CPUtime = c(3000,1625,2156,7689)
  ,ElapsedTime = c(14323,10851,10600,7760)
  )
library(ggplot2)

ggplot(df_LR_comparison, aes(method, fill=method)) + 
  geom_bar(aes(y=ElapsedTime), stat="identity") +
  geom_line(aes(y=CPUtime, group=1), color="white", size=3) +
  scale_colour_manual(" ", values=c("d1" = "blue", "d2" = "red"))+
  #scale_fill_manual("",values="red")+
  theme(legend.position="none")

It shows that the elapsed time for the R environment with RevoScaleR (reading the data from XDF) is the fastest, whereas a plain T-SQL table queried through sp_execute_external_script with RevoScaleR gives the slowest response.

[Chart: elapsed and CPU time comparison across methods]

In terms of CPU time (the white line), the columnstore table with the RevoScaleR call through the external procedure outperforms all the others.

Final conclusion: when running statistical analysis (using RevoScaleR or any other R library), use columnstore and index-optimized tables/views to get the best CPU and elapsed times. It is also important to remember that any aggregations and calculations that can be done within SQL Server are better performed there.

 

As always, code is available at GitHub.

 

Happy coding! 🙂

Saving input and output with sp_execute_external_script using temporal table and file table (part #2)

In my previous blog post, Saving input and output with sp_execute_external_script, I explored the possibilities of capturing the R code that the external procedure sends internally to the Launchpad.exe program. A blog comment by reader Bob gave me the additional push to write a second part on the same topic. Thank you, Bob, for sharing this with us.

Bob explained how they do it, and you can read all about it here. To recap: the R code (as well as any additional packages) is stored on the local file system, then copied, zipped, transferred and inserted into a SQL table. When sp_execute_external_script is called, everything gets unzipped and executed.

I will not comment on the solution Bob provided, since I don’t know how their infrastructure, roles and security are set up; at this point, I am simply grateful for his comment. What I will say is that there is no straightforward or out-of-the-box solution. Furthermore, if your R code requires any additional packages, storing those packages together with your R code is not a bad idea, regardless of the traffic or disk overhead. And versioning the R code is definitely needed.

To continue from the previous post: getting or capturing the R code once it reaches the Launchpad is tricky, so storing the R code in a database table or on the file system seems a better idea.

Starting with original sample R code:

EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = N'
       d <- InputDataSet 
       c <- data.frame(Num_V1 = c(1,2,3))
       c
       OutputDataSet <- c'
    ,@input_data_1 = N'SELECT 1 AS Nmbrs_From_R'

WITH RESULT SETS ((Numbers_From_R INT));

We can create a SQL table so that the R code is persistent and always available:

CREATE TABLE R_code 
(id INT
,R NVARCHAR(MAX))
INSERT INTO R_code
SELECT 1, '
       d <- InputDataSet 
       c <- data.frame(Num_V1 = c(1,2,3))
       c
       OutputDataSet <- c'
-- (1 row(s) affected)

DECLARE @r_code NVARCHAR(MAX)
SELECT @r_code = R FROM R_code WHERE id = 1

EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = @r_code
    ,@input_data_1 = N'SELECT 1 AS Nmbrs_From_R'
WITH RESULT SETS ((Numbers_From_R INT));

This yields the same results. To have the R code and the T-SQL code side by side, I would suggest storing the T-SQL in the table as well.

DROP TABLE IF EXISTS R_code
CREATE TABLE R_code 
(id INT
,R NVARCHAR(MAX)
,SQLC NVARCHAR(MAX))

INSERT INTO R_code
SELECT 1, '
       d <- InputDataSet 
       c <- data.frame(Num_V1 = c(1,2,3))
       c
       OutputDataSet <- c','SELECT 1 AS Nmbrs_From_R'
-- (1 row(s) affected)

DECLARE @r_code NVARCHAR(MAX)
DECLARE @sql_code NVARCHAR(MAX)
SELECT @r_code = R FROM R_code WHERE id = 1
SELECT @sql_code = SQLC FROM R_code WHERE id = 1

EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = @r_code
    ,@input_data_1 = @sql_code
WITH RESULT SETS ((Numbers_From_R INT));

Now, the best thing to do is to add some logging and versioning to the R_code table. The easiest way to achieve this is by using a temporal table.

TEMPORAL TABLE

Rewrite the original table:

CREATE TABLE R_code 
(
 id INT IDENTITY(1,1)
,CombinationID INT NOT NULL CONSTRAINT PK_ComboID PRIMARY KEY
,R NVARCHAR(MAX)
,SQLC NVARCHAR(MAX)
,Valid_From DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL
,Valid_To DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL
,PERIOD FOR SYSTEM_TIME (Valid_From, Valid_To)
)
WITH (SYSTEM_VERSIONING = ON);

Please note that the table will be shown slightly differently (see the clock in the icon).

[Screenshot: temporal table with its history table]

Besides the actual (temporal) table, the system automatically creates a history table where all the changes are kept.

Once this is done, I can store T-SQL and R-code.

INSERT INTO R_code (CombinationID, R, SQLC)
SELECT 1,'
       d <- InputDataSet 
       c <- data.frame(Num_V1 = c(1,2,3))
       c
       OutputDataSet <- c','SELECT 1 AS Nmbrs_From_R'
-- (1 row(s) affected)

Now I can run the query same way as before:

DECLARE @r_code NVARCHAR(MAX)
DECLARE @sql_code NVARCHAR(MAX)
SELECT @r_code = R FROM R_code WHERE CombinationID = 1
SELECT @sql_code = SQLC FROM R_code WHERE CombinationID = 1

EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = @r_code
    ,@input_data_1 = @sql_code
WITH RESULT SETS ((Numbers_From_R INT));

Suppose changes are committed to either the R code or the SQL code; I will simulate such a change with an UPDATE query:

-- INSERT ANOTHER CombinationID = 1, with changed R Code
UPDATE R_code
SET R = '
       d <- InputDataSet 
       c <- data.frame(Num_V1 = c(1,2,3,4))
       c
       d
       OutputDataSet <- c'
,SQLC = 'SELECT 1 AS Nmbrs_From_R'
WHERE
    CombinationID = 1
-- (1 row(s) affected)

The best part is that the system maintains the versioning automatically, and I – as an end user – don’t need to worry about changing the T-SQL code that executes the R script. So once again I can execute the same query:

DECLARE @r_code NVARCHAR(MAX)
DECLARE @sql_code NVARCHAR(MAX)
SELECT @r_code = R FROM R_code WHERE CombinationID = 1
SELECT @sql_code = SQLC FROM R_code WHERE CombinationID = 1

EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = @r_code
    ,@input_data_1 = @sql_code
WITH RESULT SETS ((Numbers_From_R INT));

And now I get the results from the updated R script and T-SQL code.

[Screenshot: results from the updated R script]

FILE TABLE

Another way (among many) is to use a FileTable. In one of my previous blog posts, I covered how to create and configure a FileTable. In this scenario, we will do the following: create a file containing the R code, save it with the *.R extension, and upload it to the directory where the FileTable expects the filestream data to land.

Quickly check the configuration:

--- Check configurations
SELECT 
  DB_NAME(database_id) AS DbName
 ,non_transacted_access
 ,non_transacted_access_desc
 ,directory_name  
 ,*
FROM  sys.database_filestream_options
WHERE 
    DB_NAME(database_id) = db_name() --'FileTableRChart'

By checking where my R file is residing:

SELECT 
     FT.Name AS [File Name]
    ,IIF(FT.is_directory=1,'Directory','Files') AS [File Category]
    ,FT.file_type AS [File Type]
    ,(FT.cached_file_size)/1024.0 AS [File Size (KB)]
    ,FT.creation_time AS [File Created Time]
    ,FT.file_stream.GetFileNamespacePath(1,0) AS [File Path]
    ,ISNULL(PT.file_stream.GetFileNamespacePath(1,0),'Root Directory') AS [Parent Path]
FROM 
    [dbo].[ChartsR] AS FT
LEFT JOIN [dbo].[ChartsR] AS PT
ON FT.path_locator.GetAncestor(1) = PT.path_locator
WHERE
    FT.File_type = 'R'

And you can see that I have created an R file with the name R_combination1.R:

[Screenshot: R_Combination1.R file in the FileTable directory]

So we can access this using OPENROWSET.

SELECT * FROM 
OPENROWSET(BULK N'\\****\RCharts\DocumentTable\R_Combination1.R',
   SINGLE_CLOB) AS R_Code

You can also do this using the master.dbo.xp_cmdshell, whatever suits you better.

Once the code is read from the R file, you can simply continue and execute the external procedure:

-- Physical Location of FileTable
DECLARE @r_code NVARCHAR(MAX)
SELECT @r_code = BulkColumn FROM 
OPENROWSET(BULK N'C:\DataTK\00\R_Combination1.R', SINGLE_CLOB) AS R_Code

EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = @r_code
    ,@input_data_1 = N'SELECT 1 AS Nmbrs_From_R'
WITH RESULT SETS ((Numbers_From_R INT));

And the result is the same. In this case you need to consider the use of BCP, bulk copy or xp_cmdshell; again, whatever suits your ecosystem best.

With a FileTable, versioning requires a bit of a workaround. In addition, you should implement a naming convention to store the files correctly, or add additional logic for storing and keeping track of the changes.

[Screenshot: versioned R files in the FileTable directory]

So in my case, I keep versioning at the file-name level, which can be done using T-SQL or by renaming the file on the file system and later creating hierarchies of files.

 

As always, Code is available at GitHub.

 

Happy coding!