Number 6174 or Kaprekar constant in R

The answer is not always 42, as explained in The Hitchhiker's Guide to the Galaxy. Sometimes it is also 6174.


The Kaprekar constant is one of those gems that make mathematics fun. The Indian recreational mathematician D. R. Kaprekar discovered that the number 6174 – also known as Kaprekar's constant – is always reached when following these rules:

  1. Take any four-digit number with at least two different digits (1122, 5151, 1001, 4375 and so on).
  2. Sort the digits of the number in descending order and in ascending order.
  3. Subtract the ascending number from the descending number.
  4. Repeat steps 2 and 3 until you reach 6174.

In practice, for example with the number 5462, the steps are:

6542 - 2456 = 4086
8640 -  468 = 8172
8721 - 1278 = 7443
7443 - 3447 = 3996
9963 - 3699 = 6264
6642 - 2466 = 4176
7641 - 1467 = 6174

or for number 6235:

6532 - 2356 = 4176
7641 - 1467 = 6174

Depending on the starting number, the number of steps varies.

An R function for the Kaprekar procedure is:

library(stringr)  # str_extract_all and str_sort

kap <- function(num){
    # check the length of the number
    if (nchar(num) == 4) {
        kaprekarConstant <- 6174
        while (num != kaprekarConstant) {
          nums <- as.integer(str_extract_all(num, "[0-9]")[[1]])
          sortD <- as.integer(str_sort(nums, decreasing = TRUE))
          sortD <- as.integer(paste(sortD, collapse = ""))
          sortA <- as.integer(str_sort(nums, decreasing = FALSE))
          sortA <- as.integer(paste(sortA, collapse = ""))
          num <- sortD - sortA
          r <- paste0('Pair is: ', sortD, ' and ', sortA, ' and result of subtraction is: ', num)
          print(r)
         }
    } else {
      print("Number must be 4-digits")
    }
}

 

The function can be called as:

kap(5462)

and it will return all the intermediate steps until the function converges.

[1] "Pair is: 6542 and 2456 and result of subtraction is: 4086"
[1] "Pair is: 8640 and 468  and result of subtraction is: 8172"
[1] "Pair is: 8721 and 1278 and result of subtraction is: 7443"
[1] "Pair is: 7443 and 3447 and result of subtraction is: 3996"
[1] "Pair is: 9963 and 3699 and result of subtraction is: 6264"
[1] "Pair is: 6642 and 2466 and result of subtraction is: 4176"
[1] "Pair is: 7641 and 1467 and result of subtraction is: 6174"

And to make the matter more interesting, let us find the distribution over all valid four-digit numbers, recording the number of steps needed to reach the constant.

First, we will find the solutions for all four-digit numbers and store them in a data frame.

Create the empty dataframe:

#create empty dataframe for results
df_result <- data.frame(number =as.numeric(0), steps=as.numeric(0))
i = 1000
korak = 0

And then run the following loop:

# Generate the list of all 4-digit numbers
while (i <= 9999) {
  korak = 0
  num = i
  while ((korak <= 10) & (num != 6174)) {
    nums <- as.integer(str_extract_all(num, "[0-9]")[[1]])
    sortD <- as.integer(str_sort(nums, decreasing = TRUE))
    sortD <- as.integer(paste(sortD, collapse = ""))
    sortA <- as.integer(str_sort(nums, decreasing = FALSE))
    sortA <- as.integer(paste(sortA, collapse = ""))
    num = as.integer(sortD) - as.integer(sortA)

    korak = korak + 1
    if (num == 6174) {
      r <- paste0('Number is: ', as.integer(i), ' with steps: ', as.integer(korak))
      print(r)
      df_result <- rbind(df_result, data.frame(number = i, steps = korak))
    }
  }
  i = i + 1
}

 

Fifteen seconds later, I had the data frame with solutions for all valid four-digit numbers (valid meaning those that comply with step 1 and have converged within 10 steps).


Now we can look at the distribution of the solutions. A summary shows that, on average, 4.6 iterations (subtractions) were needed to reach 6174.
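These figures can be reproduced directly from the result data frame:

summary(df_result$steps)
mean(df_result$steps)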


By adding counts per number of steps, we get the most frequent solutions:

table(df_result$steps)
hist(df_result$steps)


With some additional visuals, we can look at the results as well:

library(ggplot2)
library(gridExtra)

#par(mfrow=c(1,2))
p1 <- ggplot(df_result, aes(x = number, y = steps)) + 
  geom_bar(stat = 'identity') + 
  scale_y_continuous(expand = c(0, 0), limits = c(0, 8))

p2 <- ggplot(df_result, aes(x = log10(number), y = steps)) + 
  geom_point(alpha = 1/50)

grid.arrange(p1, p2, ncol = 2, nrow = 1)

And the graph:

(plot: steps per number as bars, and steps vs. log10(number) as points)

A lot of numbers converge on the third step – roughly every 4th or 5th number. We would need to look into the steps of these solutions to see what the numbers have in common. This will follow, so stay tuned.

Fun fact: at the time of writing this blog post, the number 6174 was not a constant in base R. 🙂

As always, the code is available on GitHub.

 

Happy Rrrring 🙂


Installing R using Powershell

Installing R from scratch and creating your favorite IDE setup is especially useful when making a fresh installation, or when you are developing and testing different versions.

This blog post will guide you through the essential steps (hopefully there will not be many) of downloading the desired R engine and R GUI – in this case RStudio – and preparing additional packages with custom helper functions for the client set-up / environment, mostly using PowerShell scripts.


The test folder for this new R environment will be C:\DataTK\99_REnv\01_Source\. The rest of the folder structure (sub-folders for sources, package lists and helper files) is completely arbitrary and can be changed accordingly.
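For reference, a minimal PowerShell sketch to create the two sub-folders used later in this post (the remaining structure is up to you):

$root = "C:\DataTK\99_REnv"
# -Force creates the whole path and does not fail if the folder already exists
New-Item -Path "$root\01_Source" -ItemType Directory -Force
New-Item -Path "$root\01_SourceList" -ItemType Directory -Force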

1. Downloading RStudio and R

Both programs will be installed to predefined paths (please note, these paths might vary on your machine):

  • RStudio ->  c:\Program Files\RStudio
  • R Engine -> c:\Program Files\R\R-3.5.1

In the folder structure, I will point my download location to the 01_Source sub-folder, as shown in the PowerShell script.

$dir = "C:\DataTK\99_REnv\01_Source\"
Set-Location $dir

## Download RStudio for Windows machine

# Version of RStudio is deliberately set to a specific version
# so that the code is repeatable and always returns the same results
$urlRStudio = "https://download1.rstudio.org/RStudio-1.1.463.exe"
$outputRStudio = "$dir\RStudio.exe"

$wcRStudio = New-Object System.Net.WebClient
$wcRStudio.DownloadFile($urlRStudio, $outputRStudio) # $PSScriptRoot 
Write-Output "Download Completed"

## Download R engine for Windows machine
$urlR = "https://cran.r-project.org/bin/windows/base/R-3.5.2-win.exe"
$outputR = "$dir\R-win.exe"
$wcR = New-Object System.Net.WebClient
$wcR.DownloadFile($urlR, $outputR)
Write-Output "Download completed"

## Installing R / RStudio on desired Path
## Silent install
$dirRStudio = $dir + "RStudio.exe"
$dirR = $dir + "R-win.exe"

Start-Process -FilePath $dirRStudio -ArgumentList "/S /v/qn"
Start-Process -FilePath $dirR -ArgumentList "/S /v/qn"

Now that we have the R engine and RStudio installed, we repeat the process for the R packages. In the same manner, I will download the specific R packages.

2. Downloading the R packages

For the brevity of this post, I will only download a couple of R packages from the CRAN repository, but the list can be extended indefinitely.

There are many ways to retrieve the CRAN packages for a particular R version using PowerShell. I will demonstrate this using the Invoke-WebRequest cmdlet.

Point the cmdlet to the URL https://cran.r-project.org/bin/windows/contrib/3.5, where the list of all packages for this version is available. First we need to extract the HTML tag where the information is stored. Since the page presents the data in a table, we have to navigate to the tag html>body>table>tbody>tr>td>a, where the file name is presented.


The package names are retrieved by:


$ListRPackages= Invoke-WebRequest -Uri "https://cran.r-project.org/bin/windows/contrib/3.5"
$pack = ($ListRPackages.ParsedHtml.getElementsByTagName('a')).outerText

If you have the needed packages listed in a txt file, you can read the package names from the file, iterate through them and download the files:

$ListPackageLocation = "C:\DataTK\99_REnv\01_SourceList\packages.txt"
$PackList = Get-Content -Path $ListPackageLocation
$dir = "C:\DataTK\99_REnv\01_Source\"

ForEach ($Name in $PackList)
{
   $UrlRoot = "https://cran.r-project.org/bin/windows/contrib/3.5/"
   $url = $UrlRoot + $Name
   $FileName = $dir +'\' + $Name
   $PackagesOut = New-Object System.Net.WebClient
   $PackagesOut.DownloadFile($url, $FileName) 
   Write-Output "Download Completed"
}
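For this to work, each line of packages.txt must hold the exact file name as published in the CRAN contrib directory, for example (hypothetical entries – the actual versions will differ):

abind_1.4-5.zip
data.table_1.12.0.zip
dplyr_0.7.8.zip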

 

Now that we have all the packages downloaded and programs installed, we can move to R.

3. Setting up the R Environment

In the folder structure, there is a folder holding the helper files: Paths.R, Functions.R and Initial.R.

Paths.R

This file holds all the paths that are later used in the other files; it simply describes the folder structure:

sourcePath = "c:\\DataTK\\R_packages\\01_Source"
sourcePackagePath = "c:\\DataTK\\R_packages\\01_Sourcelist"
libPath = "C:\\DataTK\\R_Packages\\02_R"
wdPath = "C:\\DataTK\\R_Packages"

 

Functions.R

This file lists all the functions in one place, mainly for sharing or for shared projects. In this case there are just two functions; one of them checks for and installs missing packages from the folder structure (the files previously downloaded using PowerShell).

# Function for sum of squares for two input integers
sum_squares <- function(x,y) {
  x^2 + y^2
}

# Function for package installation with check for existing packages
function_install_4 <- function(df_name) {
  for (i in 1:nrow(df_name)) {
    if (df_name[i, 2] %in% rownames(installed.packages(lib.loc = libPath))) {
      # print(df_name[i, 2])
      print(paste0("Package ", df_name[i, 2], " already installed."))
    } else {
      install.packages(df_name[i, 1], type = "source", repos = NULL, lib = libPath)
    }
  }
}

 

Initial.R

This file ties all the helper files together and invokes the functions, paths and packages:

# Loading files with function lists and Paths
source(file="C:\\DataTK\\R_Packages\\paths.R")
source(file="C:\\DataTK\\R_Packages\\functions.R")

#updating the list of packages
setwd(sourcePackagePath)
listPackages <- data.frame(read.csv("packages.txt", header=FALSE))
names(listPackages)[1] <- "name"

#just names of the packages
temp <- strsplit(as.character(listPackages$name),"_")
temp <- data.frame(library=matrix(unlist(temp), ncol=2, byrow=TRUE)[,1])
listPackages<- cbind(name=listPackages, library=temp$library)


#installing the missing packages
setwd(sourcePath)
function_install_4(listPackages)

library(dplyr, lib.loc=libPath)
library(ggplot2, lib.loc=libPath)
library(knitr, lib.loc=libPath)

4. Start using R

Finally, every new R file or project needs to include a single line:

#initialize
source(file="C:\\DataTK\\R_Packages\\initial.R")

And this will load all the settings and all the packages, and make sure the downloaded environment is correctly installed.

 

As always, the complete code is available on GitHub.

Happy coding and happy Rrrrring 🙂


Window Aggregate operator in batch mode in SQL Server 2019

This came as a surprise while calculating simple statistics on my dataset – in particular min, max and median. The first two are trivial; the last one is the one that caught my attention.

While looking for the fastest way to calculate the median for a given dataset, I stumbled upon an interesting thing. The window function was performing very slowly, and even calling R or Python using sp_execute_external_script outperformed it, which raised a couple of questions.

But first, I created a sample table and populated it with sample rows:

DROP TABLE IF EXISTS  t1;
GO

CREATE TABLE t1
(id INT IDENTITY(1,1) NOT NULL
,c1 INT
,c2 SMALLINT
,t VARCHAR(10) 
)

SET NOCOUNT ON;
INSERT INTO t1 (c1,c2,t)
SELECT 
	x.* FROM
(
	SELECT 
	ABS(CAST(NEWID() AS BINARY(6)) %1000) AS c1
	,ABS(CAST(NEWID() AS BINARY(6)) %1000) AS c2
	,'text' AS t
) AS x
	CROSS JOIN (SELECT number FROM master..spt_values) AS n
	CROSS JOIN (SELECT number FROM master..spt_values) AS n2
GO 2

 

The query generated – in my case – a little over 13 million records, just enough to test the performance.

Starting with the median calculation that sorts the first half and the second half of the rows respectively, the calculation time was surprisingly long:

-- Itzik Solution
SELECT (
(SELECT MAX(c1) FROM
  (SELECT TOP 50 PERCENT c1 FROM t1 ORDER BY c1) AS BottomHalf)
+
(SELECT MIN(c1) FROM
  (SELECT TOP 50 PERCENT c1 FROM t1 ORDER BY c1 DESC) AS TopHalf)
) / 2 AS Median

Before and after each run, I cleared the cached execution plans. The execution on 13 million rows took – on my laptop – around 45 seconds.
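Clearing the cached plans can be done along these lines (on a dedicated test instance only, since this flushes the entire plan cache):

-- clear the plan cache (test instance only)
DBCC FREEPROCCACHE;
GO
-- optionally also drop clean buffers for cold-cache timings
DBCC DROPCLEANBUFFERS;
GO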

The next query for the median calculation was a window function query.

SELECT DISTINCT
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY c1) 
       OVER (PARTITION BY (SELECT 1)) AS MedianCont
FROM t1

To my surprise, the performance was even worse – at this point I should say I was running this on SQL Server 2017 with CU7. Luckily, I also had SQL Server 2019 CTP 2.0 installed, and there, with no further optimization, the query ran in a little over 1 second.

The difference between the versions was enormous. I could replicate the same results by switching the database compatibility level between 140 and 150.

ALTER DATABASE SQLRPY 
SET COMPATIBILITY_LEVEL = 140; 
GO
SELECT DISTINCT
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY c1) 
    OVER (PARTITION BY (SELECT 1)) AS MedianCont140
FROM t1

ALTER DATABASE SQLRPY 
SET COMPATIBILITY_LEVEL = 150; 
GO

SELECT DISTINCT
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY c1) 
    OVER (PARTITION BY (SELECT 1)) AS MedianCont150
FROM t1

The answer was found in the execution plan. When running the window function under compatibility level 140, the plan builds a nested loop twice, once for each group – the upper and the lower 50% of the dataset:

(execution plan: comp140_1)

This plan is somewhat similar to the upper/lower 50% approach, but with only one nested loop:

(execution plan: itzik1)

The difference is that, when running the window function median calculation on SQL Server 2017, the query optimizer chooses row execution mode for the built-in window function with WITHIN GROUP.

(execution plan: comp140_2)

As far as I knew, this should not have been an issue since SQL Server 2016, where the batch mode Window Aggregate operator was already available.

When switching to compatibility level 150 and running the same window function, the execution plan is, as expected:

(execution plan: comp150_1)

And the Window Aggregate operator uses batch mode:

(execution plan: comp150_2)

When calculating Median using R:

EXEC sp_execute_external_script
   @language = N'R'
  ,@script = N'd <- InputDataSet
               OutputDataSet <- data.frame(median(d$c1))'
  ,@input_data_1 = N'select c1 from t1'
WITH RESULT SETS (( Median_R VARCHAR(100) ));
GO

or Python:

EXEC sp_execute_external_script
  @language = N'Python'
 ,@script = N'
import pandas as pd
dd = pd.DataFrame(data=InputDataSet)
os2 = dd.median()[0]
OutputDataSet = pd.DataFrame({''a'':os2}, index=[0])'
 ,@input_data_1 = N'select c1 from t1'
WITH RESULT SETS (( MEdian_Python VARCHAR(100) ));
GO

Both execute and return the results in about 5 seconds, so there is no big difference between R and Python when handling 13 million rows for calculating simple statistics.

To wrap up: if you find yourself in a situation where you need to calculate – as in my case – the median or a similar statistic using a window function with WITHIN GROUP, R or Python will be faster than T-SQL. Unless you have the ability to use SQL Server 2019 – in that case T-SQL is your best choice.

The code and the plans used in this blog post are available, as always, on GitHub.


Friday five MVP award blog post

A short blog post on my contribution to the Friday Five MVP Award blog post from September 2018.

Link to the blog:  https://blogs.msdn.microsoft.com/mvpawardprogram/2018/09/28/friday-five-september-28/

Happy New Year 2019!

See you in 2019.


Using Python with Microsoft Reporting Services (SSRS)

Using Python with SQL Server 2017 SSRS should not be an obstacle, since the framework and the technology are the same as with the R language. With SQL Server 2017, Python got full and functional support in native SSRS.

The usability and functionality of Python is simply immense. This blog post is dedicated to using Python for data science in SSRS.

Most of the samples are derived from AdventureWorks; the SQL Server sample datasets are available on Microsoft's GitHub.

Selecting Python results with SSRS

The external procedure sp_execute_external_script, run within the SQL Server database using the Python language, does not need ODBC drivers or Python modules like pyodbc or sqlalchemy for moving data between the SQL Server engine and the Python engine. The only module needed is pandas, since the communication with SQL Server uses data frames for exposing data to Python.

-- Check Python runtime
EXECUTE sp_execute_external_script
       @language =N'Python'
      ,@script=N'OutputDataSet = InputDataSet'
      ,@input_data_1 = N'SELECT 1 AS result'
WITH RESULT SETS
((
result INT
))
GO

Please note: the Python code specified with the @script parameter must be valid and is sensitive to indentation, so a single missing (or excessive) indent might cause the complete query to fail.

We will introduce an input parameter for better data selection in Python. The same parameter will later be used as an input parameter in SSRS. The T-SQL query with the Python script:

-- query parametrization
DECLARE @MaritalStatusIN CHAR(1) = 'S' -- S/M Single/Married

EXECUTE sp_execute_external_script
@language =N'Python',
@script = N'
import pandas as pd
df = InputDataSet
df_gender = df[''MaritalStatus''] == MaritalStatus
df_gen = df[df_gender]
correlation = df_gen.corr(method=''pearson'')
OutputDataSet = pd.DataFrame(correlation, columns=["nof","age"])',
@input_data_1 = N'SELECT 
               COUNT(*) AS nof,MaritalStatus,age 
               FROM AdventureWorksDW2014.dbo.vTargetMail
               WHERE age < 100
               GROUP BY maritalstatus,age'
,@params = N'@MaritalStatus CHAR(1)'
,@MaritalStatus = @MaritalStatusIN

WITH RESULT SETS 
(( 
   CountObs FLOAT
  ,Age FLOAT 
));
GO

For the purposes of SSRS, I will encapsulate the code above in a stored procedure with one input parameter – the marital status. This parameter is propagated into the Python script.
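A minimal sketch of such a wrapper procedure, simply wrapping the query from above (the actual procedure may differ in details):

CREATE OR ALTER PROCEDURE [dbo].[SelectingPythonResults]
(
   @MaritalStatusIN CHAR(1) -- S/M Single/Married
)
AS
BEGIN
EXECUTE sp_execute_external_script
 @language = N'Python',
 @script = N'
import pandas as pd
df = InputDataSet
df_gender = df[''MaritalStatus''] == MaritalStatus
df_gen = df[df_gender]
correlation = df_gen.corr(method=''pearson'')
OutputDataSet = pd.DataFrame(correlation, columns=["nof","age"])',
 @input_data_1 = N'SELECT 
                COUNT(*) AS nof, MaritalStatus, age 
                FROM AdventureWorksDW2014.dbo.vTargetMail
                WHERE age < 100
                GROUP BY MaritalStatus, age'
 ,@params = N'@MaritalStatus CHAR(1)'
 ,@MaritalStatus = @MaritalStatusIN
WITH RESULT SETS 
(( 
   CountObs FLOAT
  ,Age FLOAT 
));
END;
GO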

EXEC SelectingPythonResults
      @MaritalStatusIN = 'S'

The results are then presented through SSRS, with the option of selecting the parameter value.


Predicting and scoring with Python in SSRS

In addition to selecting filtered parameters in SSRS, we can also check how our prediction model is working.

For this, we will create a slightly different subset and try to predict the number of cars owned by the customers in the dataset, based on their age and daily commute distance. The query for the dataset is the following:

--- Query for scoring the model
SELECT 
   age 
  ,NumberCarsOwned
  ,CAST(REPLACE(LEFT(CommuteDistance,2),'-','') AS TINYINT) as CommuteDistance
FROM AdventureWorksDW2014.dbo.vTargetMail
WHERE
    age < 100


We will run a simple linear regression with NumberCarsOwned as the dependent (y) variable and age and CommuteDistance as the X variables. We will store several versions of the prediction model in the database, trained with different train/test splits of the original dataset: 20%, 30%, 40% and 50% (the @size parameter is passed to train_test_split as the test size). All four models will be serialized and stored in a table.

Let’s create a table to store serialized models:

-- creating table for storing models
CREATE TABLE PredictingWithPy_models
(model VARBINARY(MAX)
,modelName VARCHAR(100)
,trainSize FLOAT
)

Now we will use the scikit-learn Python module to train the model on different subsets of the original dataset.

CREATE OR ALTER PROCEDURE [dbo].[RunningPredictionsPy] 
(
  @size FLOAT --- format: 0.3 or 0.4 or 0.5 
 ,@name VARCHAR(100) 
 ,@trained_model varbinary(max) OUTPUT
)
AS
BEGIN
EXEC sp_execute_external_script
  @language = N'Python'
 ,@script = N'
import numpy as np
import pandas as pd
import pickle
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

##Create SciKit-Learn linear regression model
X = df2[["age", "CommuteDistance"]]
y = np.ravel(df2[["nofCars"]])
name = name
##Create training (and testing) variables based on test_size
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=size)

## CreateLinear Model
SKL_lr = LinearRegression()
linRegObj = SKL_lr.fit(X_train, y_train)

##Serialize model
trained_model = pickle.dumps(linRegObj)'

 ,@input_data_1 = N'SELECT * FROM vTargetMail'
 ,@input_data_1_name = N'df2'
 ,@params = N'@trained_model varbinary(max) OUTPUT, @size FLOAT, @name VARCHAR(100)'
 ,@trained_model = @trained_model OUTPUT
 ,@size = @size
 ,@name = @name
END;
GO

Then execute this procedure to store a trained model in the previously created table:

DECLARE @model VARBINARY(MAX);
EXEC [RunningPredictionsPy] 
    @size = 0.2  
   ,@name = 'Ln_20'
   ,@trained_model = @model OUTPUT
INSERT INTO PredictingWithPy_models (model, modelName, trainSize) 
VALUES(@model, 'Ln_20', 0.2);
GO

We repeat this for 20, 30, 40 and 50 percent (the complete list is available on GitHub), so that at the end the table holds 4 rows with serialized models.
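For illustration, the 30% split follows exactly the same pattern (the remaining calls differ only in size and name):

DECLARE @model VARBINARY(MAX);
EXEC [RunningPredictionsPy] 
    @size = 0.3  
   ,@name = 'Ln_30'
   ,@trained_model = @model OUTPUT
INSERT INTO PredictingWithPy_models (model, modelName, trainSize) 
VALUES(@model, 'Ln_30', 0.3);
GO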


Now that we have trained the models, we want to run predictions using SSRS.

For this, we need another stored procedure that takes the X variables (the independent variables age and CommuteDistance) and the selected model as input parameters, and returns the prediction (the Y variable – the number of cars owned).

The following procedure will be used in SSRS:

-- CREATE Procedure to predict the number of cars owned
CREATE OR ALTER PROCEDURE [dbo].[RunningPredictionWithValesPy] 
(
   @model varchar(100)
  ,@age INT
  ,@commuteDistance INT
)
AS
BEGIN

DECLARE @modelIN VARBINARY(MAX) = 
    (SELECT model FROM PredictingWithPy_models WHERE modelName = @model)
-- Create a T-SQL Query 
DECLARE @q NVARCHAR(MAX) = N'SELECT '+CAST(@age AS VARCHAR(5))+' AS age, 
         '+ CAST(@commuteDistance AS VARCHAR(5))+' AS CommuteDistance'

-- Store the new values for prediction in temp table
DROP TABLE IF EXISTS #t
CREATE TABLE #t (age INT, CommuteDistance INT)
INSERT INTO #t (age, CommuteDistance)
EXEC sp_executesql @q

EXEC sp_execute_external_script
@language = N'Python'
,@script = N'
import pickle
import numpy as np
import pandas as pd
from sklearn import metrics

##Deserialize model
mod = pickle.loads(modelIN)
X = InputDataSet[["age", "CommuteDistance"]]

##Create numpy Array when you introducte more values at the same time (bulk prediction)
predArray = mod.predict(X)
#predList = []
#for i in range(len(predArray)):
#     predList.append((predArray[i])[1])

OutputDataSet = pd.DataFrame(data = predArray, columns = ["predictions"])' 
  ,@input_data_1 = N'SELECT * FROM #t'
  ,@input_data_1_name = N'InputDataSet'
  ,@params = N'@modelIN varbinary(max)'
  ,@modelIN = @modelIN
WITH RESULT SETS 
((
prediction_Score FLOAT
));
END
GO

To check that the procedure works fine, the following T-SQL query runs a prediction against new values:

EXEC [RunningPredictionWithValesPy]
   @model = 'Ln_30'
  ,@age = 44
  ,@commuteDistance = 1

We can integrate the procedure into Reporting Services (SSRS) and have the combination of input values displayed together with the selected model (test set size):


The prediction results can be viewed in SSRS or in SSMS; both return the same results.


 

All reports featured here will also be available at GitHub.

Visualizing results with Python in SSRS

As we have created four different models, we would also like to have the accuracy of the models visually represented in SSRS.

Showing plots created with Python is not as straightforward as with the R language.

The following procedure extracts the data from the database and generates a plot that can be used and visualized in SSRS.

CREATE OR ALTER PROCEDURE [dbo].[VisualizeWithPyR2] 
(
     @inputVariable VARCHAR(100)
)
AS
BEGIN

DECLARE @q NVARCHAR(MAX) = N'SELECT '+CAST(@inputVariable AS VARCHAR(50))+' AS val1 FROM vTargetMail'
-- Store the values in temp table
DROP TABLE IF EXISTS #t
CREATE TABLE #t (val1 FLOAT)
INSERT INTO #t (val1)
EXEC sp_executesql @q

EXEC sp_execute_external_script
  @language = N'Python'
  ,@script = N'
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(12, 5))
plt.plot(plotdataset)
fig.savefig(''C:\\\PyGraphs\\firstGraph.png'') 
OutputDataSet = pd.DataFrame(data =[1], columns = ["plot"])' 
,@input_data_1 = N'SELECT * FROM #t'
,@input_data_1_name = N'plotdataset'
WITH RESULT SETS 
((
    plot INT
));
END
GO

 

So far, I have found this to be the simplest way – saving the plot on the same server where SSRS is running and exposing the disk path to SSRS. Python stores the graph to disk with the following command:

fig = plt.figure(figsize=(12, 5))
plt.plot(plotdataset)
fig.savefig(''C:\\\PyGraphs\\firstGraph.png'')

The plot is then rendered in the SSRS report.

In Report Builder, one additional setting worth mentioning is the image properties, where the path to the saved plot must match the one used in the Python code. The savefig call in Python points to the same location as the external source of the image, specified as: file://C://PyGraphs//firstGraph.png.


 

All four linear regression models can be evaluated, based on the test sample size, using R2 and the accompanying Python code:

import pyodbc
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# X and y are built from vTargetMail in the same way as in the training procedure above

def R2EvalMod(test_sizeIN):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_sizeIN)
    SKL_lr = LinearRegression()
    linRegObj = SKL_lr.fit(X_train, y_train)
    predArray = linRegObj.predict(X_test)
    R2 = r2_score(y_test, predArray) 
    return R2

sql_conn = pyodbc.connect('DRIVER={ODBC Driver 13 for SQL Server};SERVER=TOMAZK\\MSSQLSERVER2017;DATABASE=SQLPY;Trusted_Connection=yes') 
query3 = '''
SELECT trainSize FROM [dbo].[PredictingWithPy_models]
'''
df3 = pd.read_sql(query3, sql_conn)

R2List = []
for i in range(len(df3)):
    z = df3.at[i, 'trainSize']
    R2 = R2EvalMod(z)
    R2List.append(([z], [R2]))
 
labels = ['trainsize', 'R2']
df33 = pd.DataFrame.from_records(R2List, columns=labels)

Given that the better the model, the higher the R2 – between 0 and 1 (it can also be negative) – the results show that the model performs best when using 30% of the dataset for testing and 70% for training.


 

As always, code is available at GitHub.

Happy Pythoning.


SQL Saturday Slovenia 2018

SQL Saturday Slovenia 2018 is coming back to town, and it will be, as always, awesome. I prefer not to use superlatives, but the schedule speaks for itself.


The event will take place on Saturday, 8th December 2018, at the Faculty of Computer and Information Science of the University of Ljubljana, Večna pot 113, Ljubljana.

Recap of statistics so far:

  • 133 submitted sessions
  • 55 speakers from Europe and the USA
  • Over 30 sessions submitted by speakers from Germany, followed by UK speakers with a little over 20
  • More than 50% of the selected speakers are Microsoft MVPs

Two pre-conference workshops will also be held on Friday, 7th of December.

You are kindly invited to join the pre-conference day with one of the workshops – there are still some places available – and also to take part in the event on Saturday.

Ljubljana in December will be, as always, festive and beautiful. Take your time, register for the event, and afterwards enjoy the fairs, events, New Year's feeling and lights, mulled wine and much more.

In addition, special thanks to our sponsors for supporting the event.

 

See you on Friday, 7th and on Saturday, 8th of December, 2018!


Running R scripts within in-database SQL Server Machine Learning

Having all the R functions, libraries and any kind of definitions (URLs, links, working directories, environments, memory, etc.) in one file is nothing new, but it is sometimes a lifesaver.

The R function source is the way to achieve this. Storing all definitions, functions and classes in one place can help enterprises achieve faster installation and safer environment usage.

The idea is simple: stack all the functions, classes, libraries and configurations you want to use in a specific environment and save them in a single R file. Create a file with all the needed library and function setups; we will call this file util.R.

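A minimal sketch of what such a file might look like (the function bodies below are assumptions; only the names mysummary, sss and addOne match the calls used later in this post):

# util.R - shared libraries and helper functions (illustrative sketch)

# shared libraries would be loaded here, e.g.:
# library(dplyr)

# summary statistics for a numeric vector
mysummary <- function(x) {
  c(mean = mean(x), sd = sd(x), n = length(x))
}

# sum of squares of two numeric vectors
sss <- function(x, y) {
  sum(x^2 + y^2)
}

# add one to the input value
addOne <- function(x) {
  x + 1
}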

The rest is just referring to this file in any R environment or R script with a simple call to source.

setwd("C:\\MyFiles\\R_setup")
source("util.R")

Any new R script will now have all the functions and libraries included by default. Testing it with simple function calls is straightforward:

x <- c(1,2,3,4,5)
y <- c(2,3,4,5,6)

mysummary(x)
sss(x,y)

and the result is as expected.


In my previous blog posts, I have also mentioned calling R scripts from SQL Server tables, external files and other sources. Using the R source function is another way to pull predetermined and preinstalled functions into SQL Server in-database Machine Learning Services (or Machine Learning Server), which makes setting up the same environment on client machines in an enterprise much easier and faster. With just one update of the file, all client machines read the same definitions.

Using the same util.R file, I placed it in a location where the SQL Server R workers have been granted access. With sp_execute_external_script and the external R file, this can be simplified:

USE AdventureWorks;
/* SELECT n FROM (VALUES(1),(2),(3),(4)) AS tbl(n) */
exec sp_execute_external_script
  @language = N'R'
 ,@script = N'
#have access to this folder for SQL SERVER /R workers  
   source("C:\\MyFiles\\R_setup\\util.R") 
    x <- InputDataSet
    x <- as.numeric(x)
   out <- addOne(x)
  OutputDataSet <- data.frame(as.numeric(out))'
,@input_data_1 = N'SELECT n FROM (VALUES(1)) AS tbl(n)'
WITH RESULT SETS
((
FunctionResult VARCHAR(100)
))

The result is the same as if it were run from any other R environment, but this time I am running it from SQL Server.


There are a couple of things to point out, though:

1) have the util.R file or any external R file validated (!)

2) all the libraries that the util.R file refers to must be installed; if not, the external file will not load. This can simply be achieved by using:

if (!is.element('AGD', installed.packages()[,1])) {
  install.packages('AGD')
}
library(AGD)

instead of just referencing the package by using library(AGD).

3) there should be more IF statements checking the R core engine version and the availability and compatibility of the packages.
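For example, a minimal guard at the top of util.R could look like this (the minimum version is an arbitrary assumption):

# refuse to load if the R engine is older than an assumed minimum version
if (getRversion() < "3.4.0") {
  stop("util.R requires R >= 3.4.0, found ", getRversion())
}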

4) not all packages can be used with Microsoft R engine and the current R version.

 

Enjoy R and SQL Server and happy coding.
