Using Python Pandas dataframe to read and insert data to Microsoft SQL Server

In SQL Server Management Studio (SSMS), the ease of using the external procedure sp_execute_external_script has been (and will continue to be) discussed many times. The reason for this short blog post is the fact that changing Python environments using Conda package/module management within Microsoft SQL Server (Machine Learning Services) is practically impossible. In scenarios where you want to build a larger set of modules (packages) that are not compatible with your SQL Server or Conda setup, you need to set up a new virtual environment and use Python from there.

Communicating with the database to load the data into a different Python environment should not be a problem. The Python Pandas module is an easy way to store a dataset in a table-like format, called a dataframe. Pandas is a very powerful Python package for handling data structures and doing data analysis.
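To illustrate the table-like structure (a minimal sketch with made-up values, unrelated to the SQL Server data used below), a dataframe can be built directly from a Python dictionary:

import pandas as pd

# a small, hand-made dataframe; column names and values are illustrative only
people = pd.DataFrame({
    "BusinessEntityID": [1, 2, 3],
    "FirstName": ["Ann", "Bob", "Cid"],
    "LastName": ["Smith", "Jones", "Lee"]
})

print(people.head())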

pandas_logo

 

Loading data from SQL Server to Python pandas dataframe

This underlying task is something that every data analyst, data engineer, statistician and data scientist will be using in everyday work. Extracting data from Microsoft SQL Server database using SQL query and storing it in pandas (or numpy) objects.

With the following code:

## From SQL to DataFrame Pandas
import pandas as pd
import pyodbc

sql_conn = pyodbc.connect('DRIVER={ODBC Driver 13 for SQL Server};'
                          'SERVER=SQLSERVER2017;'
                          'DATABASE=Adventureworks;'
                          'Trusted_Connection=yes')
query = ("SELECT [BusinessEntityID],[FirstName],[LastName],"
         "[PostalCode],[City] FROM [Sales].[vSalesPerson]")
df = pd.read_sql(query, sql_conn)

df.head(3)

 

you will get the first three rows of the result:

2018-07-15 09_23_00-Spyder (Python 3.6)

Make sure that you configure the SERVER and DATABASE, as well as the credentials, to your needs. If you are running an older version of SQL Server, you will need to change the driver configuration as well.
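For example (a hedged sketch; the exact driver name depends on what is installed on your machine, and you can list the available ODBC drivers with pyodbc.drivers()), a connection over the older SQL Server Native Client driver could look like this:

import pyodbc

# example only: an older driver name; adjust SERVER and DATABASE to your environment
sql_conn = pyodbc.connect('DRIVER={SQL Server Native Client 11.0};'
                          'SERVER=SQLSERVER2017;'
                          'DATABASE=Adventureworks;'
                          'Trusted_Connection=yes')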

Inserting data from Python pandas dataframe to SQL Server

Once you have calculated the results in Python, there will be cases where they need to be inserted back into the SQL Server database. In this case, I will use the data already stored in the Pandas dataframe and simply insert it back into SQL Server.

First, create a table in SQL Server for data to be stored:

USE AdventureWorks;
GO
DROP TABLE IF EXISTS vSalesPerson_test;
GO
CREATE TABLE vSalesPerson_test(
[BusinessEntityID] INT
,[FirstName] VARCHAR(50)
,[LastName] VARCHAR(100))

After that, just simply run the following Python code:

connStr = pyodbc.connect('DRIVER={ODBC Driver 13 for SQL Server};'
                         'SERVER=SQLSERVER2017;'
                         'DATABASE=Adventureworks;'
                         'Trusted_Connection=yes')
cursor = connStr.cursor()

for index, row in df.iterrows():
    cursor.execute("INSERT INTO dbo.vSalesPerson_test([BusinessEntityID],[FirstName],[LastName]) VALUES (?,?,?)",
                   row['BusinessEntityID'],
                   row['FirstName'],
                   row['LastName'])
    connStr.commit()
cursor.close()
connStr.close()
*If the indentation breaks when copying, use the GitHub file.

And the data will be inserted into the SQL Server table:

2018-07-15 09_34_57-Window
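As a side note (a hedged sketch, not part of the original post), for larger dataframes the row-by-row loop can be slow; pandas can also write a dataframe in one call through SQLAlchemy with to_sql. The connection string below reuses the same server and database names and is only illustrative:

import urllib
from sqlalchemy import create_engine

# build an SQLAlchemy engine on top of the same ODBC connection string
params = urllib.parse.quote_plus('DRIVER={ODBC Driver 13 for SQL Server};'
                                 'SERVER=SQLSERVER2017;'
                                 'DATABASE=Adventureworks;'
                                 'Trusted_Connection=yes')
engine = create_engine('mssql+pyodbc:///?odbc_connect=' + params)

# append the dataframe to the existing table in one call
df[['BusinessEntityID', 'FirstName', 'LastName']].to_sql(
    'vSalesPerson_test', engine, schema='dbo', if_exists='append', index=False)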

As always, sample code is available at Github.

Happy coding! 🙂


R null values: NULL, NA, NaN, Inf

The R language supports several null-able values, and it is important to understand how these values behave when doing data pre-processing and data munging.

In general, R supports:

  • NULL
  • NA
  • NaN
  • Inf / -Inf

NULL is an object and is returned when an expression or function results in an undefined value. In the R language, NULL (capital letters) is a reserved word and can also be the product of importing data with an unknown data type.

NA is a logical constant of length 1 and is an indicator for a missing value. NA (capital letters) is a reserved word, can be coerced to any other vector data type (except raw), and can also be a product of importing data. NA and "NA" (represented as a string) are not interchangeable. NA stands for Not Available.

NaN stands for Not A Number and is a numeric constant of length 1; it applies to numerical values, as well as to the real and imaginary parts of complex values, but not to values of integer vectors. NaN is a reserved word.

Inf and -Inf stand for infinity (or negative infinity) and result from storing either a very large number or the product of a division by zero. Inf is a reserved word and is, in most cases, a product of computations in the R language and therefore very rarely a product of data import. An infinite value also tells you that the value is not missing and is a number!

All four null/missing data types have accompanying logical functions available in base R, returning TRUE/FALSE for each particular case: is.null(), is.na(), is.nan(), is.infinite().

You can get a general understanding of all these values by simply running the following code:

#reading documentation on all data types:
?NULL
?NA
?NaN
?Inf

#populating variables
a <- "NA"
b <- "NULL"
c <- NULL
d <- NA
e <- NaN
f <- Inf

### Check if variables are same?
identical(a,d)
# [1] FALSE

# NA and NaN are not identical
identical(d,e)
# [1] FALSE

###checking length of data types
length(c)
# [1] 0
length(d)
# [1] 1
length(e)
# [1] 1
length(f)
# [1] 1

###checking data types
str(c); class(c); 
#NULL
#[1] "NULL"

str(d); class(d); 
#logi NA
#[1] "logical"
str(e); class(e); 
#num NaN
#[1] "numeric"
str(f); class(f); 
#num Inf
#[1] "numeric"

Getting data from R

Nullable data types can behave differently when propagated to, e.g., lists, vectors or data.frame types.

We can test this by creating NULL, NA and NaN vectors and dataframes and observing the behaviour:

#empty vectors for NULL, NA and NaN
v1 <- c(NULL, NULL, NULL)
v2 <- NULL

str(v1); class(v1); mode(v1)
str(v2); class(v2); mode(v2)

v3 <- c(NA, NA, NA)
v4 <- NA

str(v3); class(v3); mode(v3)
str(v4); class(v4); mode(v4)

v5 <- c(NaN, NaN, NaN)
v6 <- NaN

str(v5); class(v5); mode(v5)
str(v6); class(v6); mode(v6)

It is evident that a NULL vector will always be empty, regardless of the elements it is given. With NA and NaN, the vector length equals the number of elements it holds, with the slight difference that the NA vector will be of class logical, whereas the NaN vector will be of class numeric.

A NULL vector will not change its size, but it will change its class when combined with a mathematical operation:

#operation on NULL Vector
v1 <- c(NULL, NULL, NULL)
str(v1)
# NULL

v1 <- v1+1
str(v1)
# num(0)

This only changes the class, not the length, and still no data will persist in the vector.

With data.frames the behavior is largely the same.

#empty data.frame
df1 <- data.frame(v1=NA,v2=NA, v3=NA)
df2 <- data.frame(v1=NULL, v2=NULL, v3=NULL)
df3 <- data.frame(v1=NaN, v2=NaN, V3=NaN)

str(df1); str(df2);str(df3)

A dataframe consisting of NULL values in each column will be presented as a dataframe with 0 observations and 0 variables (0 rows and 0 columns). Dataframes with NA and NaN will have 1 observation and 3 variables, of logical data type and of numerical data type, respectively.

When adding new observations to data frames, the behavior again differs between NULL, NA and NaN.

Adding to “NA” data.frame:

# adding new rows to existing dataframe
df1 <- rbind(df1, data.frame(v1=1, v2=2,v3=3))

#explore data.frame
df1

it is clear that the new row is added, and adding a new row (vector) of a different size will generate an error, since the dataframe definition holds the dimensions. The same behavior is expected with NaN values. On the other hand, NULL values give different results:

#df2 will get the dimension definition
df2 <- rbind(df2, data.frame(v1=1, v2=2))

#this will generate error since the dimension definition is set
df2 <- rbind(df2, data.frame(v1=1, v2=NULL))

#and with NA should be fine
df2 <- rbind(df2, data.frame(v1=1, v2=NA))

With the first assignment, df2 gets its dimension definition, even though df2 was originally constructed from three NULL columns.

NULL is also the result when we reference an element that does not exist because it is out of boundaries:

l <- list(a=1:10, b=c("a","b","c"), c=seq(0,10,0.5))
l$a
# [1]  1  2  3  4  5  6  7  8  9 10
l$c
# [1]  0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0  8.5  9.0  9.5 10.0
l$r
# NULL

Here we are calling the sublist r of list l, which returns NULL: it is not reported as missing or non-existent, it is simply NULL, which is rather contradictory, since its definition was never set. A different result (Not Available) is returned when calling a non-existent vector element:

v <- c(1:3)
v[4]
#[1] NA

Out-of-boundary access is thus handled differently for lists and for vectors, returning NULL and NA respectively.

 

Getting data from SQL Server

I will use several different data types, derived from the following SQL table.

USE AzureLanding;
GO

CREATE TABLE R_Nullables
 (
 ID INT IDENTITY(1,1) NOT NULL
 ,num1 FLOAT
 ,num2 DECIMAL(20,10)
 ,num3 INT
 ,tex1 NVARCHAR(MAX)
 ,tex2 VARCHAR(100)
 ,bin1 VARBINARY(MAX)
 )

INSERT INTO R_Nullables
          SELECT 1.22, 21.535, 245, 'This is Nvarchar text','Varchar text',0x0342342
UNION ALL SELECT 3.4534, 25.5, 45, 'This another Nvarchar text','Varchar text 2',0x03423e3434
UNION ALL SELECT NULL, NULL, NULL, NULL,NULL,NULL
UNION ALL SELECT 0, 0, 0, '','',0x

Using the RODBC R library, the data will be imported into the R environment:

library(RODBC)
SQLcon <- odbcDriverConnect('driver={SQL Server};server=TOMAZK\\MSSQLSERVER2017;database=AzureLanding;trusted_connection=true')
# df <- sqlQuery(SQLcon, "SELECT * FROM R_Nullables")
df <- sqlQuery(SQLcon, "SELECT ID ,num1 ,num2 ,num3 ,tex1 ,tex2 FROM R_Nullables")
close(SQLcon)

When the SELECT * query is executed, the varbinary data type from SQL Server is represented as a 2 GiB binary object in R, and you will most likely receive an error because R will not be able to allocate the memory:

2018-07-04 23_36_12-RStudio

After restricting the column list, the df object is created. The presentation is straightforward, yet somewhat puzzling:

  ID   num1   num2 num3                       tex1           tex2
1  1 1.2200 21.535  245      This is Nvarchar text   Varchar text
2  2 3.4534 25.500   45 This another Nvarchar text Varchar text 2
3  3     NA     NA   NA                       <NA>           <NA>
4  4 0.0000  0.000    0    

When the output from SQL Server and the output in R are put side by side, there are some differences:

2018-07-04 23_40_00-stage_SQL_to_R_sample_data.sql - TOMAZK_MSSQLSERVER2017.AzureLanding (TOMAZK_Tom

What is presented in SQL Server as a NULL value is represented in R as NA, which looks like a logical type but is not the real NA; only the <NA> is the logical object carrying the Not Available information. This means that handling NA is not only about the "Not Available" value itself but also about the type of "Not Available" information, and each of these needs special attention; otherwise, coercion errors will constantly emerge when doing calculations or applying functions.

Data imported from SQL Server can be used like any dataset imported into R in any other way:

#making some elementary calculations
df$num1 * 2
# [1] 2.4400 6.9068 NA 0.0000
is.na(df$num1)
# [1] FALSE FALSE TRUE FALSE

The same logic applies to the text1 and text2 fields. Both are factors, but NULL or NA values can be treated accordingly.

# Text
df$text2
# NULL
df$text1
# NULL

This is rather unexpected, since the SQL Server data types again do not map cleanly to the R environment. So we change the original SQL query to cast the values:

df <- sqlQuery(SQLcon, "SELECT ID ,num1 ,num2 ,num3 ,CAST(tex1 AS VARCHAR(100)) as text1 ,CAST(tex2 AS VARCHAR(100)) as text2 FROM R_Nullables")

and rerunning the df population, the result of df$text1 will be:

[1] This is Nvarchar text      This another Nvarchar text <NA>                                                 
Levels:  This another Nvarchar text This is Nvarchar text

Getting data from TXT / CSV files

I have created a sample txt/csv file that will be imported into R by executing:

setwd("C:\\Users\\Tomaz\\")
dft <- read.csv("import_txt_R.txt")
dft

Put side by side, R and the CSV file show that the data types seem to be handled well:

2018-07-05 00_07_02-RStudio

but only at first glance. Let's check the last observation by examining the type:

is.na(dft[5,])

# text1 text2 val1 val2
#5 TRUE FALSE FALSE FALSE

This is the problem: the factor levels will be treated differently, although both columns are of the same type, because one value is a real NA and the other is not.

identical(class(dft[5,2]),class(dft[5,1]))
# [1] TRUE

Before going on your next voyage, make sure you check all the data types and values.

As always, code is available at Github. Happy coding! 🙂


Connecting Azure Machine Learning studio with on-premises SQL Database

Azure Machine Learning Studio has been around for the past 3 years and a lot of new features have been added. I am positive many more will follow. One of the more thrilling features is the ability to connect to more data sources.

2018-06-29 16_57_51-Window

Even more welcome is the ability to connect your on-premises SQL Server database as a data source and write T-SQL queries against it.

The On-Premises SQL Database data source is available in the Import Data object, as shown in the screenshot below:

2018-06-29 17_01_48-Window

But first things first: we need to set up a secure connection between Azure ML Studio and the on-premises SQL Server database.

Once you are logged into your Azure ML Studio, select Settings:

2018-06-29 18_02_48-Window

and choose DATA GATEWAYS. You will need to define the gateway, where you will also receive the Registration Key. Select create new gateway, give it a name, and proceed to the next step. Here you will get an overview of the gateway information and the registration key itself.

2018-06-29 18_05_53-Window

For the second part, you will need to download the Data Management Gateway and install the software. Some 200+ MiB later, you will start the Microsoft Integration Runtime Configuration Manager, and the welcome screen will ask you to enter the Gateway Registration Key (generated in Azure ML Studio Settings).

Once entered, you should be able to register the gateway, by clicking on Register.

2018-06-29 18_10_33-Window

After the registration key is authenticated and the node components are initialized, you will be able to launch the Configuration Manager.

2018-06-29 18_12_32-Window

In the Configuration Manager you can run diagnostics (check the connections), tweak the proxy settings, and run updates and backups. Most important, the gateway is now connected to your Azure account and your local machine.

Check your credentials under the Diagnostics tab; once your credentials for the on-premises SQL Server database have been successfully entered, you should get the green "go":

2018-06-29 18_20_04-Window

The credentials should be the same ones you enter in SSMS whenever you want to connect to the database or open a new query window, literally as:

2018-06-29 18_23_21-tk_azure_ml.sql - TOMAZK_MSSQLSERVER2017.AzureLanding (TOMAZK_Tomaz (69))_ - Mic

Now we will set up a landing database (for the sake of hygiene) on our on-premises SQL Server and create a table, to immediately check and test the data types.

Connect to your on-premises SQL Server and run:

USE [Master];
GO

CREATE DATABASE [AzureLanding];
GO

USE [AzureLanding];
GO

DROP TABLE IF EXISTS test_azure_read;
GO

CREATE TABLE test_azure_read
 ( ID INT IDENTITY(1,1) NOT NULL
 ,num1 FLOAT
 ,num2 DECIMAL(20,10)
 ,num3 BIGINT
 ,tex1 NVARCHAR(MAX)
 ,tex2 CHAR(10)
 ,bin1 VARBINARY(MAX)
 )

-- Populating the table with couple of records
INSERT INTO test_azure_read (num1,num2,num3,tex1,tex2,bin1)
 SELECT 2.5, 33.5774, 75643221345, N'Some text. Nothing to see here'
,'CharText',CAST('sometext' AS VARBINARY(MAX))
UNION ALL SELECT 67654.87654,87654.876543,987654335467
,N'Pearl Jam-đšžćčž@ääü','Pearl Jam',CAST('đšžćčž@ääü' AS VARBINARY(MAX))
UNION ALL SELECT -0.87654,87.6543,-967,N'Pearl Jam-đšžćčž@ääü'
,'Pearl Jam',0x05543f5323

Note that I have deliberately created the test table "test_azure_read" to check the data types. I am testing BIGINT, FLOAT, DECIMAL, NVARCHAR and VARBINARY, and on the SQL Server side the data looks fine:

2018-06-29 18_28_02-tk_azure_ml.sql - TOMAZK_MSSQLSERVER2017.AzureLanding (TOMAZK_Tomaz (69))_ - Mic

Once the on-premises data has been prepared, let's go back to Azure ML Studio. Create a new experiment and add the "Import Data" object to the canvas:

2018-06-29 18_31_14-Window

Set up the Import Data reader by pointing it to the same server we used in the diagnostics test in the Configuration Manager:

2018-06-29 18_33_02-Window.png

The only thing left is to enter the credentials (by clicking on Enter values under "User name and password"). Note that 1) you might need to install an additional plug-in if you are not using IE or Edge, 2) if you are blocking pop-up windows, you should enable them for this page, and 3) you might get a dialog window for downloading the CredentialManager.application from the proxy cloud hub; no worries, you can confirm it:

2018-06-29 18_37_58-Window

Feel free to start it; even if you get prompted to continue, you should not worry, because it is a trusted application. Once this is through, you will see the connection being set up, and you will also be prompted to enter the credentials: user name and password. The dialog window will be:

2018-06-29 18_56_07-Experiments - Microsoft Azure Machine Learning Studio and 1 more page ‎- Microso

 

After successfully entering the user name and password, you should get the green "go" in your Import Data object:

2018-06-29 18_44_04-Microsoft Edge

 

Now we will test the previously created table, by entering the query:

SELECT * FROM test_azure_read

With this you are testing 1) the connection with Azure ML Studio and 2) importing data from your on-premises SQL Server database into Azure ML Studio.

Save and run the experiment.

After the initial run, the experiment will fail (deliberately)! This is due to data type conversion: the DECIMAL data type will not be converted through the data import reader, and you will need to cast/convert this particular field:

SELECT 
 ID
,num1
,CAST(num2 as FLOAT) as num2_float
,num3
,tex1
,tex2
,bin1
 FROM test_azure_read

Once this is converted, you can ingest the data and start doing the machine learning in the Azure ML studio.

Once the import is finished, by clicking on the connection node

1

you will get the summary and descriptive information on the data-set.

2

All data types were imported. Prior to import we converted DECIMAL to FLOAT, and you can also note that the VARBINARY presentation needs an additional cast or convert:

-- final tweak
SELECT 
 ID
,num1
--,num2
,CAST(num2 as FLOAT) AS num2_float
,num3
,tex1
,tex2
--,bin1
,CONCAT('0x',CONVERT(VARCHAR(MAX),bin1,2)) AS bin1_VARCHAR
 FROM test_azure_read

 

Happy coding!

 


Real-time data visualization with SQL Server and Python Dash

The need for visualizing real-time (or near real-time) data has been and still is a very important daily driver for many businesses. Microsoft SQL Server has many capabilities to visualize streaming data, and this time I will tackle the issue using Python and the Python Dash package for building web applications and visualizations. Dash is built on top of Flask, React and Plotly and gives a wide range of capabilities for creating interactive web applications, interfaces and visualizations.

First, we will create a sample SQL table where data will be inserted to mimic the data stream:

DROP TABLE IF EXISTS dbo.LiveStatsFromSQLServer;
GO

CREATE TABLE dbo.LiveStatsFromSQLServer
(
 ID int identity(1,1)
,Num TINYINT NOT NULL
)

And using this query, we will generate some random data that will be inserted into the table and simultaneously presented on the graph:

-- Do some inserts to mimic the data stream
INSERT INTO dbo.LiveStatsFromSQLServer(num)
SELECT ABS(CHECKSUM(NewId())) % 14
WAITFOR DELAY '00:00:01.500'
GO 1000

Python code will be executed on the server (localhost) from the command line.

In your favorite Python editor, you will import the following packages:

import dash
from dash.dependencies import Output, Event
import dash_core_components as dcc
import dash_html_components as html
import plotly
import plotly.graph_objs as go
from collections import deque
import pandas as pd
import pyodbc

To create a connection to Microsoft SQL Server and the table you created in the previous step, we will use the following function built on pyodbc:

def connectSQLServer(driver, server, db):
    connSQLServer = pyodbc.connect(
        r'DRIVER={' + driver + '};'
        r'SERVER=' + server + ';'
        r'DATABASE=' + db + ';'
        r'Trusted_Connection=yes;',
       autocommit=True
    )
    return connSQLServer
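As a quick sanity check (a hedged example, not part of the original script; it reuses the server and database names from this post), the function can be tested on its own before wiring it into Dash:

conn = connectSQLServer('ODBC Driver 13 for SQL Server', 'TOMAZK\MSSQLSERVER2017', 'test')
cursor = conn.cursor()
# count the rows currently in the streaming table
cursor.execute("SELECT COUNT(*) FROM dbo.LiveStatsFromSQLServer")
print(cursor.fetchone()[0])
conn.close()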

Another function is needed to ingest the data from Microsoft SQL Server and create the data and layout objects for Dash:

def update_graph_scatter():
    dataSQL = [] #set an empty list
    X = deque(maxlen=10) 
    Y = deque(maxlen=10)

    sql_conn = connectSQLServer('ODBC Driver 13 for SQL Server', 
                            'TOMAZK\MSSQLSERVER2017', 'test') 
    cursor = sql_conn.cursor()
    cursor.execute("SELECT num,ID FROM dbo.LiveStatsFromSQLServer")
    rows = cursor.fetchall()
    for row in rows:
        dataSQL.append(list(row))
        labels = ['num','id']
        df = pd.DataFrame.from_records(dataSQL, columns=labels)
        X = df['id']
        Y = df['num']

    data = plotly.graph_objs.Scatter(
        x=list(X),
        y=list(Y),
        name='Scatter',
        mode='lines+markers'
    )

    return {'data': [data],'layout' : go.Layout(
                                  xaxis=dict(range=[min(X),max(X)]),
                                  yaxis=dict(range=[min(Y),max(Y)]),)}

This function fetches the rows from the SQL Server table using the previously declared connectSQLServer function, extracts the two columns ID and num into a Pandas DataFrame and passes them to the plotly.graph_objs.Scatter object. The function returns the data list and the definition of the graph layout; that is, the axis ranges of the graph.

The declaration of the Dash application is specified as:

name_title = 'Stats from SQL Server'
app = dash.Dash(__name__)

app.layout = html.Div(children=[
    html.H1(children='Read near real-time data from SQL Server on Scatterplot'),
    dcc.Graph(
        id='example-graph',
        animate=True),
    dcc.Interval(
        id='graph-update',
        interval=1*500),
])

@app.callback(Output('example-graph', 'figure'), 
 events=[Event('graph-update', 'interval')])

This part finally declares the behaviour of our graph, the refresh interval (0.5 seconds) and the @app.callback decorator, through which the interval events trigger the graph refresh.
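For completeness (the full script is on GitHub; this is only a sketch of how the pieces fit together in the Dash version used here, reusing the imports and the update_graph_scatter function shown above), the callback decorator sits directly above that function, and the script ends by starting the development server:

# the decorator binds the interval event to the update function defined earlier
@app.callback(Output('example-graph', 'figure'),
              events=[Event('graph-update', 'interval')])
def update_graph_scatter():
    ...  # body as shown above

# start the Dash development server (listens on port 8050 by default)
if __name__ == '__main__':
    app.run_server(debug=True)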

You can download the code from GitHub for easier manipulation. Next, open CMD, navigate to your Anaconda or Python environment where all the packages are pre-installed, and run:

Python C:\Real_TimeStatsVisualizationWithDash.py

You will be prompted with:

2018-06-18 20_01_09-Administrator_ Command Prompt - Python C__DataTK_dash_py_live_stats3.py

Once this is active, open your browser and point to: http://127.0.0.1:8050.

To simulate the real-time visualization, in your SSMS, run the query:

-- Do some inserts to mimic the data stream
INSERT INTO dbo.LiveStatsFromSQLServer(num)
SELECT ABS(CHECKSUM(NewId())) % 14
WAITFOR DELAY '00:00:01.500'
GO 1000

And the graph in the browser will visualize the data stream.

2018-06-18 18_28_08-Dash

For building slightly more informative dashboards for your data-streaming needs (whether for DBA health checks, inbound/outbound OLTP transactions, data science purposes, or simply network monitoring, ...), more graphs can be added, along with more controls, and even CSS can be included for aesthetics.
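As an illustration (a hedged sketch only; the second graph id 'avg-graph' and its callback are hypothetical and not part of the original script), the layout can simply be extended with additional components, each wired to its own callback in the same way as update_graph_scatter:

# extend the layout with a second graph, e.g. for a rolling average of the stream
app.layout = html.Div(children=[
    html.H1(children='Read near real-time data from SQL Server on Scatterplot'),
    dcc.Graph(id='example-graph', animate=True),
    dcc.Graph(id='avg-graph', animate=True),   # hypothetical additional graph
    dcc.Interval(id='graph-update', interval=1*500),
])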

As always, the code is available at Github for your purpose and needs.

I have also added a short clip to show how this visualization works, also available at GitHub.

Happy coding!


Native scoring in SQL Server 2017 using R

Native scoring is a much overlooked feature in SQL Server 2017 (available only on Windows and only on-premises) that provides scoring and predicting from pre-built and stored machine learning models in near real-time.

stopwatch

                                                                                             Icons made by Smashicons from www.flaticon.com is licensed by CC 3.0 BY

 

I will not go into the definition of real-time and what it means for your line of business, but we can certainly call it real-time when scoring 10,000 rows in a second from a mediocre client computer (similar to mine).

Native scoring in SQL Server 2017 comes with a couple of limitations, but also with a lot of benefits. The limitations are:

  • currently supports only SQL Server 2017 and the Windows platform
  • the trained model should not exceed 100 MiB in size
  • native scoring with the PREDICT function supports only the following algorithms from the RevoScaleR library:
    • rxLinMod (linear model, as in linear regression)
    • rxLogit (logistic regression)
    • rxBTrees (parallel external memory algorithm for stochastic gradient boosted decision trees)
    • rxDTree (external memory algorithm for classification and regression trees)
    • rxDForest (external memory algorithm for classification and regression decision forests)

The benefits of using the PREDICT function for native scoring are:

  • no configuration of an R or ML environment is needed (assuming that the trained models are already stored in the database),
  • the code is cleaner and more readable, and no additional R code is needed when performing scoring,
  • no R engine is called at run time, so there is a tremendous reduction in CPU and I/O costs, as well as no external calls,
  • a client or server running native scoring with the PREDICT function does not need the R engine installed, because it uses C++ libraries from Microsoft that can read the serialized model stored in a table, deserialize it and generate predictions, all without R

Overall, if you are looking for faster predictions in your enterprise and would like faster code and solution deployment, especially integration with other applications or building an API in your ecosystem, native scoring with the PREDICT function will surely be an advantage. Although not all predictions/scores are supported, the majority can be done using regression or decision tree models (it is estimated that these two families, together with derivatives of regression models and ensemble methods, cover around 85% of predictive analytics use cases).

To put the PREDICT function to the test, I have deliberately taken a somewhat larger dataset available in the RevoScaleR package in R – AirlineDemoSmall.csv. Using a simple BULK INSERT, we get the data into the database:

BULK INSERT ArrDelay
 FROM 'C:\Program Files\Microsoft SQL Server\140\R_SERVER\library\RevoScaleR\SampleData\AirlineDemoSmall.csv'
 WITH 
 ( FIELDTERMINATOR =',', ROWTERMINATOR = '0x0a', FIRSTROW = 2, CODEPAGE = 'RAW');

Once the data is in the database, I will split it into training and test subsets.

SELECT TOP 20000 *
INTO ArrDelay_Train
FROM ArrDelay ORDER BY NEWID()
-- (20000 rows affected)

SELECT *
INTO ArrDelay_Test
FROM ArrDelay AS AR
WHERE NOT EXISTS (SELECT * FROM ArrDelay_Train as ATR
                     WHERE
                       ATR.arrDelay = AR.arrDelay
                   AND ATR.[DayOfWeek] = AR.[DayOfWeek]
                   AND ATR.CRSDepTime = AR.CRSDepTime
                 )
-- (473567 rows affected)

And the structure of the dataset is relatively simple:

ArrDelay CRSDepTime DayOfWeek
1        9,383332   3
4        18,983334  4
0        13,883333  4
65       21,499998  7
-3       6,416667   1

Creating models

So we will create essentially the same model twice using the rxLinMod function with the same formula, but once with the additional parameter for real-time scoring set to TRUE.

-- regular model creation
DECLARE @model VARBINARY(MAX);
EXECUTE sp_execute_external_script
 @language = N'R'
 ,@script = N'
 arrDelay.LM <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime, 
                         data = InputDataSet)
 model <- rxSerializeModel(arrDelay.LM)'
 ,@input_data_1 = N'SELECT * FROM ArrDelay_Train'
 ,@params = N'@model varbinary(max) OUTPUT'
 ,@model = @model OUTPUT
 INSERT [dbo].arrModels([model_name], [native_model])
 VALUES('arrDelay.LM.V1', @model) ;

-- Model for Native scoring
DECLARE @model VARBINARY(MAX);

EXECUTE sp_execute_external_script
 @language = N'R'
 ,@script = N'
 arrDelay.LM <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime, 
                             data = InputDataSet)
 model <- rxSerializeModel(arrDelay.LM, realtimeScoringOnly = TRUE)'
 ,@input_data_1 = N'SELECT * FROM ArrDelay_Train'
 ,@params = N'@model varbinary(max) OUTPUT'
 ,@model = @model OUTPUT
 INSERT [dbo].arrModels([model_name], [native_model])
 VALUES('arrDelay.LM.NativeScoring.V1', @model) ;

Both models have the same training set and will be stored in a table for future scoring. Upon first inspection, we can see there is a difference in the model size:

model_size

Scoring Models

Both models took relatively the same amount of time to train and to store in the table. Both can also be created on R Machine Learning Server and stored in the same way (with or without the realtimeScoringOnly argument). The model size gives you an idea of why and how real-time scoring can be achieved: keep your model as small as possible. Both models will give you exactly the same prediction scores, just that native scoring will be much faster. Note also that if you are planning to do any text analysis with real-time scoring, you should keep in mind the 100 MiB limitation, as text prediction models often exceed it.

To compare the execution of the scoring, I will run it the "traditional way" using the external procedure sp_execute_external_script and then using the PREDICT function.

------------------------------------
-- Using sp_execute_external_script
------------------------------------
DECLARE @model VARBINARY(MAX) = (SELECT native_model FROM arrModels 
WHERE model_name = 'arrDelay.LM.V1')

EXEC sp_execute_external_script
 @language = N'R'
 ,@script = N'
 modelLM <- rxUnserializeModel(model)
 OutputDataSet <- rxPredict( model=modelLM,
                 data = ArrDelay_Test,
                 type = "link",
                 predVarNames = "ArrDelay_Pred",
                 extraVarsToWrite = c("ArrDelay","CRSDepTime","DayOfWeek")
 )'
 ,@input_data_1 = N'SELECT * FROM dbo.ArrDelay_Test'
 ,@input_data_1_name = N'ArrDelay_Test'
 ,@params = N'@model VARBINARY(MAX)'
 ,@model = @model
WITH RESULT SETS
((
 ArrDelay_Pred FLOAT
,ArrDelay INT 
,CRSDepTime NUMERIC(16,5)
,[DayOfWeek] INT
))
-- (473567 rows affected)
-- Duration 00:00:08

---------------------------
-- Using Real Time Scoring
---------------------------
DECLARE @model varbinary(max) = ( SELECT native_model FROM arrModels 
WHERE model_name = 'arrDelay.LM.NativeScoring.V1');

SELECT 
 NewData.*
 ,p.*
 FROM PREDICT(MODEL = @model, DATA = dbo.ArrDelay_Test as newData)
 WITH(ArrDelay_Pred FLOAT) as p;
GO
-- (473567 rows affected)
-- Duration 00:00:04

The two examples differ, but the PREDICT function looks much more readable and neater. Time performance is also on the PREDICT function's side, as it returns the predictions much faster.

In addition, I have mentioned that the PREDICT function does not need the R engine or the Launchpad service to be running in the environment where the code is executed. To put this to the test, I will simply stop the SQL Server Launchpad service:

Service_stop

After executing the first set of predictions using sp_execute_external_script, SQL Server or Machine Learning Server will notify you that the service is not running:

fail_to_communicate

whereas, the PREDICT function will work flawlessly.

Verdict

For sure, faster predictions are something that will be very welcome in the gaming, transport, utility, metal and financial industries, as well as in any other where real-time predictions against OLTP systems are appreciated. With light-weight models and good algorithm support, I would certainly give it additional thought, especially if you see good potential in faster and near real-time predictions.

As always, complete code and data sample are available at Github. Happy coding! 🙂


Using R in SQL Server Reporting Services (SSRS)

SQL Server Reporting Services (SSRS) is an outstanding tool for creating, deploying and managing paginated, mobile and KPI reports, as well as Power BI reports. The tool provides a simple way to share and get data insights in your corporate environment.

ss-reporting-services-all-together

(Photo source: Microsoft docs)

Using the power of the R language to enrich your data, your statistical analysis or your visualizations is a simple way to get more out of your reports.

The best practice for embedding R code into an SSRS report is to create a stored procedure and output the results to the report. To demonstrate this, we will create two reports: one that takes two input parameters and a second one that demonstrates the usage of R visualization.

First, we will create a designated database and generate some sample data:

CREATE DATABASE R_SSRS;
GO
USE R_SSRS;
GO

CREATE TABLE R_data (
 v1 INT ,v2 INT,v_low INT,v_high INT,letter CHAR(1));
GO

CREATE OR ALTER PROCEDURE generate_data
AS
BEGIN
  INSERT INTO R_data(v1,v2,v_low,v_high,letter)
  SELECT TOP 10
    CAST(v1.number*RAND() AS INT) AS v1
   ,CAST(v2.number*RAND() AS INT) AS v2
   ,v1.low AS v_low
   ,v1.high AS v_high
   ,SUBSTRING(CONVERT(varchar(40), NEWID()),0,2) AS letter
  FROM master..spt_values AS v1
  CROSS JOIN master..spt_values AS v2
  WHERE  v1.[type] = 'P' AND v2.[type] = 'P'
 ORDER BY NEWID() ASC;
END
GO
EXEC generate_data;
GO 10

For the first report we will create a dynamic report where the user is able to select two parameters. Based on the selection of the parameters, the resulting dataset is passed to R, where additional R code generates summary statistics.

The idea is to have the values "v1" and "letter" parametrized:

SELECT * FROM R_data
WHERE 
 v1 > 400 -- replace this with parameter; e.g.: @v1
AND letter IN ('1','2','3') -- replace this with parameter; e.g.: @letters

So in this case we will create a stored procedure that gets the data with T-SQL and calls the external procedure for execution of the R code. The procedure is as follows:

CREATE OR ALTER PROCEDURE sp_R1(
   @v1 INT
  ,@lett VARCHAR(20)
) AS
BEGIN
 DECLARE @myQuery NVARCHAR(1000)

  CREATE TABLE #t (let CHAR(1))
  INSERT INTO #t
  SELECT value
  FROM STRING_SPLIT(@lett,',')
 
 SET @myQuery = N'
    SELECT * FROM R_data
    WHERE 
    v1 > '+CAST(@v1 AS VARCHAR(10))+'
    AND letter IN (SELECT * FROM #t)'

EXEC sp_execute_external_script
 @language = N'R'
 ,@script = N'
   df <- InputDataSet 
   OutputDataSet <- data.frame(summary(df))'
  ,@input_data_1 = @myQuery
WITH RESULT SETS
((v1 NVARCHAR(100)
 ,v2 NVARCHAR(100)
 ,freq NVARCHAR(100)))
END;
GO

By testing, we can see that passing both parameters (one single-valued and one multi-valued) returns the result of the R code:

EXEC sp_R1 @v1 = 200, @lett = '1,2,3'

Both of these parameters will be used in the report for selecting (or filtering) the data. In SQL Server 2016 Report Builder, we create a blank report, add a data source and create a new dataset using the stored procedure:

report1

Once the dataset is added and the stored procedure tested, Report Builder will automatically create the report parameters:

report2

In addition, we need to set up the parameters: we add another dataset for the parameters to have a controlled set of available values, and assign this pool of values to each of the parameters by selecting the dataset the parameter values will be taken from.

report3

For the multi-valued parameter, we also need to specify that the user can select more than one value at a time. To let Report Builder know that the "lett" parameter will be parsed back as a string of multiple values, we must select "Allow multiple values".

report4

Once this is completed, both parameters will be shown as drop-down selection boxes, where the user gets the results back based upon the selection made.

report5

At the end, the report should look very similar to this, where the table is the result generated by the R code and the rest is T-SQL results and the beautiful Reporting Services.

report6

For the second report, we will create a new stored procedure that generates a boxplot for the values retrieved from the user selection.

The procedure remains roughly the same; just the R code changes in order to return the graph (the complete procedure is in the code):

EXEC sp_execute_external_script
 @language = N'R'
 ,@script = N'
  df <- InputDataSet 
  image_file <- tempfile()
  jpeg(filename = image_file, width = 400, height = 400)
  boxplot(df$v1~df$letter)
  dev.off()
  OutputDataSet <- data.frame(data=readBin(file(image_file, "rb"), what=raw(), n=1e6))'

 

In addition, on our new report we need to show the returned value of this procedure (that is, the result) as a graph.

On the report canvas, drag an image onto it and change the settings of this image as follows:

report7

Users will now be able to select the values, and based on their selection, the R graph generated inside the SSRS report will change accordingly.

The next figure shows just this: how simple it is for the user to change the selection (add or remove values) and have the boxplot redrawn with the updated values.

report8

Both demos should help you leverage R scripts and R visualizations in your corporate reporting service and bring some additional user-interface dynamics into it.

As always, the complete code and reports are available at GitHub. The code also holds my randomly generated dataset, if you want to get the same results as in this blog post.

Happy RCoding 🙂


R or Python? Python or R? The ongoing debate.

At every SQL community event where there is a cluster of sessions dedicated to BI or analytics, I have people asking me "which one would you recommend?" or "which one do you prefer?"

2018-01-28 15_15_13-Edit Post ‹ TomazTsql — WordPress.com

So, questions about recommendations and preferences are, in my opinion, the hardest ones. Not because I don't know my preferences, but because you inadvertently shape someone else's taste or preferences by imposing yours. And expressing taste through someone else's taste is even harder.

My initial reaction is a counter-question: why are you asking this? Simply because my curiosity goes beyond the question of preferring A over B. And most of the time the answer I get is "because everyone is asking this question" or "because someone said this and someone else said that". In none of the cases, and I mean literally none, did I get the response I would love to get, like "we are running this algorithm and there are issues…" or "this library suits us better…". So the community is mainly focused on asking which one is better, instead of asking whether R / Python can do the job. And I can assure you that both can do the job! Period.

Imagine I ask you whether you would prefer an Apple iPhone over a Samsung Galaxy, or a BMW over an Audi. In all cases, both phones or both cars will get the job done. So will Python or R, R or Python. So instead of asking which one I prefer, ask yourself which one suits your environment better. If your background is more statistics and less programming, take R; if you are more into programming and less into statistics, take Python; in both cases you will reach results faster with your preferred language. If you ask me whether you can do gradient boosting or ANOVA or MDS in Python or in R, the answer is yes, you can do both in either language.

The important questions are therefore the ones about which language will give you fast results, easier adaptation and adoption, a better fit into your environment and less impact on your daily tasks.

Some might say R is a child's-play language while Python is a real programming language. Or some might say Python is so complex that you have to program everything yourself, whereas in R everything is ready. And so on. All these allegations hold some truth, but to fully understand them, one needs to understand the background of the people saying them. Obviously, Python in comparison to R is a more general-purpose scripting and programming language, and therefore the number of packages is roughly 10x higher compared to R. Both come with a variety of packages giving users specific functions, classes and procedures to deliver their results. R, on the other hand, has had its moment in the past couple of years and its community has grown rapidly, whereas the Python community is in a steadier phase.

When you are deciding which one to select, here are some questions to be answered:

  • how big is my corporate environment and how many end users will I have
  • who is the end user and how will the end user handle the results
  • what is the current general knowledge of the language
  • which statistical and predictive algorithms will the company be using
  • will there be a need for parallel and distributed on-prem computations
  • if needed, can we connect (or copy/paste) the code to the cloud
  • how fast can the company adopt the language and how much effort is needed
  • which language fits more easily with the existing BI stack and visualization tools
  • how is your data centralized and siloed, and which data sources are you using
  • governance and provenance issues
  • installation and distribution of the core engine and packages
  • selection and the costs of IDE and GUI
  • corporate support and SLA
  • possibility to connect to different data sources
  • release dates of the most useful packages
  • community support
  • third-party tools and additional programs for easier usage of the language
  • total cost of using the language once completely in place
  • assessing the risk of using GNU/open-source software

After answering these questions, I implore you to run stress and load tests against your datasets and databases to see what performs better.

All in all, both languages, when used for statistical and predictive analysis, have a couple of annoyances that should also be addressed:

  • memory limitations (unless spilling to disk)
  • language specifics (e.g.: R is case-sensitive, Python is indent-sensitive and both will annoy you)
  • parallel and distributed computations (CPU utilization, multi-threading)
  • multi-OS running environment
  • cost of GUI/IDE
  • engine and package dependencies and versioning
  • and others

So next time, when you ask yourself or overhear a conversation in the community about which one is better (bigger, faster, more stable, …), start by asking questions about your needs and the effort to adopt it. Otherwise, I always add: learn both. It does not hurt to learn and use both (at least for statistical and predictive purposes).

All best!

Tagged with: , , ,
Posted in Uncategorized