Running multiple correlations with R and T-SQL

Getting to know the data is always an interesting part of data science. With R integration in SQL Server, this exploration phase is still part of the game.

The usual way to get some statistics out of a dataset is to run frequencies, descriptive statistics and, of course, correlations.

Running correlations against a set of variables in T-SQL can be a bit of a drag, whereas doing it with R code through SP_EXECUTE_EXTERNAL_SCRIPT is as easy as the following:

USE WideWorldImporters;
GO

 DECLARE @sql NVARCHAR(MAX)
 SET @sql = 'SELECT 
                      SupplierID
                    , UnitPackageID
                    , OuterPackageID
                    , LeadTimeDays
                    , QuantityPerOuter
                    , TaxRate
                    , UnitPrice
                    , RecommendedRetailPrice
                    , TypicalWeightPerUnit
                FROM [Warehouse].[StockItems]'

DECLARE @Rscript NVARCHAR(MAX)
SET @Rscript = N'df <- data.frame(cor(Stock, use="complete.obs", method="pearson"))
                OutputDataSet<-df'

EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script=@Rscript
      ,@input_data_1 = @sql
      ,@input_data_1_name = N'Stock'
WITH RESULT SETS (( 
                     SupplierID NVARCHAR(100)
                    ,UnitPackageID NVARCHAR(100)
                    ,OuterPackageID NVARCHAR(100)
                    ,LeadTimeDays NVARCHAR(100)
                    ,QuantityPerOuter NVARCHAR(100)
                    ,TaxRate NVARCHAR(100)
                    ,UnitPrice NVARCHAR(100)
                    ,RecommendedRetailPrice NVARCHAR(100)
                    ,TypicalWeightPerUnit NVARCHAR(100)
                    ));

I am using WideWorldImporters (available on GitHub or Codeplex), the new demo database from Microsoft that was released just this month, at the beginning of June 2016.

By running this query, R returns the correlation matrix as a data frame that T-SQL can interpret and output in SSMS in the following format. Very cool.

[Screenshot: correlation matrix results in SSMS]

The output is very similar to what you would get, for example, in SPSS:

[Screenshot: correlation output in IBM SPSS Statistics Viewer]

The numbers match (!) and the layout is roughly the same; very clear and easily readable. One thing is missing: SPSS also delivers statistical significance (the p-value), whereas R's cor function delivers only the Pearson correlation coefficients. For that we need to run an additional T-SQL / R procedure to get the p-values.

DECLARE @sql NVARCHAR(MAX)
SET @sql = 'SELECT 
                     SupplierID
                    ,UnitPackageID
                    ,OuterPackageID
                    ,LeadTimeDays
                    ,QuantityPerOuter
                    ,TaxRate
                    ,UnitPrice
                    ,RecommendedRetailPrice
                    ,TypicalWeightPerUnit
                FROM [Warehouse].[StockItems]'

DECLARE @Rscript NVARCHAR(MAX)
SET @Rscript = N'
                library(Hmisc) 
                df <- data.frame(rcorr(as.matrix(Stock), type="pearson")$P)
                OutputDataSet<-df
                '

EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script=@Rscript
      ,@input_data_1 = @sql
      ,@input_data_1_name = N'Stock'
WITH RESULT SETS (( 
                     SupplierID DECIMAL(10,5)
                    ,UnitPackageID DECIMAL(10,5)
                    ,OuterPackageID DECIMAL(10,5)
                    ,LeadTimeDays DECIMAL(10,5)
                    ,QuantityPerOuter DECIMAL(10,5)
                    ,TaxRate DECIMAL(10,5)
                    ,UnitPrice DECIMAL(10,5)
                    ,RecommendedRetailPrice DECIMAL(10,5)
                    ,TypicalWeightPerUnit DECIMAL(10,5)
                    ));

So we now have the statistical significance of our correlation matrix. I used the Hmisc library and its rcorr function.

[Screenshot: p-value matrix results in SSMS]

The rcorr function has very few options to set, so results may vary when compared to other functions' defaults. You can also use the cor.test function:

data.frame(p_value = cor.test(df$my_var1, df$my_var2, method="pearson")$p.value,
           var1 = "my_var1", var2 = "my_var2")

but since the function cannot deal with a matrix / data frame, a loop is needed to go through every combination of variables and store the results, together with the variable names, into a data frame. The rcorr function will do the trick, for now.
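Such a loop could be sketched as follows (a minimal sketch, assuming a numeric data frame df; the helper name pairwise_p is hypothetical):

```r
# Hypothetical sketch: collect cor.test p-values for every pair of
# numeric variables in data frame df into one result data frame.
pairwise_p <- function(df) {
  vars <- names(df)
  result <- data.frame()
  for (i in seq_along(vars)) {
    for (j in seq_along(vars)) {
      if (i < j) {
        p <- cor.test(df[[i]], df[[j]], method = "pearson")$p.value
        result <- rbind(result,
                        data.frame(var1 = vars[i], var2 = vars[j], p_value = p))
      }
    }
  }
  result
}
```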

The final step would be (hint) to combine both sp_execute_external_script calls into one stored procedure, store both results from R, combine the coefficients with the significance levels, and export a single table with all the information needed. This is already prepared as part of my R scripts.
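As a hint of how that combining step might look on the R side (a sketch only, assuming the same Stock input data set and the Hmisc rcorr output):

```r
library(Hmisc)
# Sketch: rcorr returns both matrices in one call, so the coefficients
# and p-values can be merged before returning a single result set.
res <- rcorr(as.matrix(Stock), type = "pearson")
flat <- data.frame(
  var1    = rownames(res$r)[row(res$r)],
  var2    = colnames(res$r)[col(res$r)],
  coef    = as.vector(res$r),
  p_value = as.vector(res$P)
)
# drop the diagonal (correlation of a variable with itself, p-value NA)
OutputDataSet <- flat[flat$var1 != flat$var2, ]
```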

Happy R-SQLing!


First thoughts on R integration in SQL Server 2016

This blog post is a very personal confession; it might be delicate, but it should be understood as a proposal or plan for improvement.

Update #1: Typing error in title: correct title is First thoughts on R integration in SQL Server 2016 (update: June 24, 2016)

Before I start, let me give some background and context. I have been working with statistics and databases for quite some time now and have used different tools for statistical purposes: starting with SPSS, Lisrel and Pajek, continuing with SPSS Modeler, then moving to Weka and R together, making very heavy use of SQL Server SSAS, working with Orange, and in the past 2 years working with SAS and SAP HANA Analytics. With all the products I have used – and I am not here to rank them or compare which one is better or which one I prefer – some things remain the same. Getting to know your data in detail is the most important one; applying statistical knowledge to it is the result.

So I have had the opportunity to work with different data sets, for different companies, on different projects and with different tools. And, looking back, I am very happy to have done it. So these thoughts are not me complaining about any of the products or solutions, but merely looking at improvements based on my years of experience.

Working with R in SQL Server has been my personal thing for a couple of years. The story goes back 4 years, when I started working on building a web-based engine for creating marketing campaigns based on input parameters and desired goals. Once the business/marketing department decided on their new campaign, they used a web-based (in-house built) program to select product groups, a customer segment and some minor additional information (time frame, validity of the campaign, legal stuff, ...), clicked OK, and that was it. In the background a stored procedure generated long T-SQL that (1) pushed data into the SQL Server engine and (2) selected the proper model that fed the data to the R engine. So going back to the year 2012, I had built a framework for communication between SQL Server and the R engine. Most of the framework is available at the SQLServerCentral web page. The framework had some security flaws and issues (calling external procedures, storing data in external files) and mostly performance issues (scaling out wasn't possible, and R was solely dependent on the RAM of the server), but overall the framework worked and the project was put into production.

Back then I wrote my framework so that it would be as easy as possible to use, storing and handling both R code and T-SQL code in a SQL Server database. Most of the problems I had were that SSMS couldn't debug or validate R code and, to some extent, converting R results back to a table also proved not to be as easy as one might think. But overall, I was still in the middle – between the platform and putting the solution into production – mainly to control the input data and the results (data outputted in the form of recommendations or campaign lists).

After working with R integration in SQL Server 2016 for the past couple of months, I have to admit I was happy to see that the security issues were solved, and the performance issues were solved with the RevoScaleR library. Both are very, very welcome features. Come to think of it, Microsoft solved the two main issues. Looking at it from the perspective of building my own framework, the security issues really and badly needed to be solved; the same logic applies to workload and working with large datasets, where performance has proven really great. In addition, keep in mind that with R integration Microsoft is – rather than keeping them separated – bringing different roles of people closer together. So DBAs, analysts and developers can now all be part of one data process. Integration with Azure is also a very positive thing; think of it as an extension to your flavor of landscape. Support for different databases is also something that will be very helpful in the future – but, admittedly, I haven't tested it yet.

What I expect Microsoft to do for R integration in the future:

  • support R in Analysis Services
  • better support R in Integration Services
  • stronger R integration with DQS and MDM
  • native integration of R into Transact-SQL (calling an SP every time for every kind of statistic might be tiring)
  • DMVs created for an R profiler, R debugging and R performance monitoring
  • support for HTML files and JS / jQuery scripts in SSRS or Power BI
  • improved usage of R exports with FileTable
  • open server endpoint objects for R in order to use the full extension of R
  • improved Resource Governor for R in Standard edition
  • answers to open questions on auditing and profiling of R code from T-SQL
  • security (a couple of open questions as well)

I have most probably forgotten a couple of points for improvement. But this was the technical side; now let's talk about the user experience point of view:

  • think about the support of R debugger / validation of code in SSMS
  • in fact, bring R into SSMS (!) instead of VS or create RTMS (R Tools for SQL Server Management Studio)
  • not limiting R exports to data frames or images only, but also supporting vectors, lists, etc.

Of course, not to get ahead of myself too much, I forgot to ask what the purpose of the integration actually was.

[Screenshot: SQL Server 2016 R Services slide]

Is it really only about super predictive analytics, or is it about making a statistical tool as well? If the latter, then I would strongly suggest recapping my open suggestions.

So, for the end, I will serve you a hint 🙂 I am preparing and building a bunch of scripts (from my previous experience and from the framework that I have built) to help all data scientists and developers use R much more easily. But (!) – a mathematical and statistical background is still something I would put down as highly recommended to have when digging into R.

Follow the blog where I will keep you updated on the scripts and where the code will be available.

 

Happy R-SQLing 🙂

Case sensitive R in sp_execute_external_script

The R language is case sensitive! Any R coder can tell you this and share the experience. Case sensitivity in R can be a challenge, as with any case-sensitive language.
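A one-line illustration in plain R:

```r
value <- 1234
print(value)   # works
print(Value)   # Error: object 'Value' not found
```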

This also applies to the SP_EXECUTE_EXTERNAL_SCRIPT stored procedure in T-SQL. Here is a simple demo to support this. The following script returns a normal result.

EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script=N'OutputDataSet<-c1'
      ,@input_data_1 = N'SELECT 1234 as number'
      ,@input_data_1_name = N'c1'

 

[Screenshot: query result in SSMS]

With a slight modification of the parameter @input_data_1_name, I will change the name from c1 (lowercase) to C1 (uppercase).

-- modification of @input_data_1_name from "c1" (lowercase) to "C1" (uppercase)
EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script=N'OutputDataSet<-c1'
      ,@input_data_1 = N'SELECT 1234 as cifra'
      ,@input_data_1_name = N'C1'

Note the difference between the name of the input data set (C1) in the parameter @input_data_1_name and the data set referenced in the R code in the @script parameter (c1).

[Screenshot: error message in SSMS]

In this case the R output shown in the Messages window of SSMS is at least informative enough to tell you: object 'c1' not found. Some other time the error message might not be that straightforward.

Despite the relative "oh, yes, I know this error", I am posting this, as I have found myself numerous times forgetting the fact that R is case sensitive. As opposed to the @script and @input_data_1_name parameters, all other parameters (@input_data_1, @params, @output_data, ...) are not case sensitive, as they do not go through the R engine.

So spare yourself minutes of searching for errors in your R script by keeping in mind that everything passed through @script or @input_data_1_name is case sensitive!

Unfortunately SSMS does not have a debugger for R code (yet!), so you can always copy/paste the R code into RTVS or RStudio (or any other environment) to validate and debug it.

Happy R-SQLing!

Passing T-SQL select statement to sp_execute_external_script

Passing a T-SQL statement to sp_execute_external_script has a couple of limitations which are not mentioned on MSDN or on the known issues list – or at least I haven't seen them there yet.

So in this blog post I will discuss the limitations of passing a T-SQL statement to sp_execute_external_script as the input data set.

As the name of the blog post suggests, only T-SQL SELECT statements are allowed as the input data set. Any other form of statement pushed as the input data will yield this error:

[Screenshot: error message in SSMS]

I generated this error by passing the following T-SQL:

EXECUTE sp_execute_external_script    
        @language = N'R'    
      ,@script=N'OutputDataSet <- InputDataSet'
      ,@input_data_1 = N'EXECUTE sample_data'

So passing stored procedure into @input_data_1 parameter is not allowed!

Trying a CTE (WITH statement) instead:

EXECUTE sp_execute_external_script    
        @language = N'R'    
      ,@script=N'OutputDataSet<-InputDataSet;'
      ,@input_data_1 = N'WITH CTE
                                AS ( SELECT name,number,[type]
                                FROM master..spt_values
                                WHERE [type] = ''EOD'')
                                SELECT * FROM CTE'
WITH RESULT SETS (
                     (name VARCHAR(50)
                    ,number INT
                    ,[type] CHAR(3))
                    );

The script returns the values resolved from the query.

It also handles the UNION operator in a SELECT statement:

EXECUTE sp_execute_external_script    
       @language = N'R'    
      ,@script=N'OutputDataSet<-InputDataSet'
      ,@input_data_1 = N'SELECT 1 as cifra UNION SELECT 5 as cifra'
WITH RESULT SETS (( cifra VARCHAR(10)));

As you correctly predicted, since it supports UNION, it also supports UNION ALL, EXCEPT and INTERSECT.

Batch terminators (GO) and multiple batches are also not allowed:

EXECUTE sp_execute_external_script    
        @language = N'R'    
      ,@script=N'OutputDataSet<-InputDataSet;'
      ,@input_data_1 = N' SELECT 1; 
                            GO 10'
WITH RESULT SETS UNDEFINED;

 

What about working with variables? Declaring a variable within the @input_data_1 parameter will return an error:

EXECUTE sp_execute_external_script    
        @language = N'R'    
      ,@script=N'OutputDataSet<-InputDataSet;'
      ,@input_data_1 = N'DECLARE @i INT = 1
                         SELECT @i;' 
WITH RESULT SETS UNDEFINED;

 

However, both the @script and @input_data_1 parameters can take their value from an external variable when wrapped in a procedure, as in the example below:

CREATE PROCEDURE pp_tt 
    (@i INT)
AS

DECLARE @SQL NVARCHAR(MAX)
SET @SQL = 'SELECT '+CAST(@i AS CHAR(10))+';'

EXECUTE sp_execute_external_script    
        @language = N'R'    
      ,@script=N'OutputDataSet<-InputDataSet;'
      ,@input_data_1 = @sql
WITH RESULT SETS UNDEFINED

EXECUTE pp_tt @i=23

 

Please note: the parameters of sp_execute_external_script cannot be parameterized further inline. They can only be assigned from variables, which are themselves built from the procedure's input variables beforehand (as shown in the example above). To illustrate, the following example will return an error:

CREATE PROCEDURE pp_tt 
    (@i INT)
AS
EXECUTE sp_execute_external_script    
        @language = N'R'    
      ,@script=N'OutputDataSet<-InputDataSet;'
      ,@input_data_1 = 'SELECT '+CAST(@i AS CHAR(10))+';'
WITH RESULT SETS UNDEFINED;

 

What about logical conditions (IF statements) or loop statements (WHILE, FOR)? To put it briefly: no! 🙂 Same error.

 EXECUTE sp_execute_external_script    
        @language = N'R'    
      ,@script=N'OutputDataSet<-InputDataSet;'
      ,@input_data_1 = N'IF RAND() <= 0.5
                            SELECT 1 AS ResultIF
                            ELSE SELECT 2 AS ResultIF;' 
WITH RESULT SETS UNDEFINED;

With logical statements, you cannot use an IF statement even if you declare a dedicated variable and pass it into the @input_data_1 parameter within a stored procedure:

CREATE PROCEDURE IFY_IF
AS

DECLARE @IF NVARCHAR(MAX)
SET @IF = 'IF RAND() <= 0.5
            SELECT 1 AS ResultIF
            ELSE SELECT 2 AS ResultIF;'

EXECUTE sp_execute_external_script    
        @language = N'R'    
      ,@script=N'OutputDataSet<-InputDataSet;'
      ,@input_data_1 = @IF
WITH RESULT SETS UNDEFINED;

EXECUTE IFY_IF

The same logic applies to FOR and WHILE statements.

These tests could go on and on, and many other things could be taken into consideration. To wrap up, the following T-SQL statements cannot be used within the @input_data_1 parameter:

  • stored procedures
  • logical IF and WHILE, FOR loops
  • temporary variables or tables
  • updates, inserts or deletes (only SELECT)
  • GO batch terminator
  • OUTPUT clause from any DML statement
  • cursors
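One possible workaround for some of these is to run the unsupported statement beforehand, materialize its result in a table, and pass a plain SELECT to @input_data_1. A sketch, assuming a hypothetical stored procedure sample_data that returns a single INT column cifra:

```sql
-- Hypothetical workaround: execute the unsupported statement first,
-- cache its result in a table, and feed a plain SELECT to R.
CREATE TABLE dbo.sample_data_cache (cifra INT);

INSERT INTO dbo.sample_data_cache
EXECUTE sample_data;   -- the stored procedure that cannot go into @input_data_1

EXECUTE sp_execute_external_script
        @language = N'R'
      ,@script = N'OutputDataSet <- InputDataSet'
      ,@input_data_1 = N'SELECT cifra FROM dbo.sample_data_cache'
WITH RESULT SETS ((cifra INT));
```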

The following statements (in addition to a plain SELECT) can be used:

  • SELECT with multiple JOINS
  • WITH Common table expressions
  • UNION, UNION ALL, EXCEPT, INTERSECT
  • any SET, STRING, LOGICAL, COMPARISON, BITWISE, ARITHMETIC, COMPOUND operators
  • COLLATE

 

This is what I have tested so far. Feel free to add any test at GitHub or in the comments section below. I will test more possibilities for this parameter and update my blog.

All of the code (and more) used in this blog post is available at GitHub.

Happy R-SQLing!

Importing CSV data using T-SQL and R

Several ways exist to import CSV (Excel) data into your SQL Server database.


One is, of course, using SSIS. Another – similar to SSIS – is using the import/export wizard in SSMS, which walks you through importing the data.

Further on, the BULK INSERT statement (and BCP) is a way of importing data very familiar to all DBAs and developers. Another T-SQL option is selecting from OPENDATASOURCE – a simple and fast way of doing this (you might also want to check linked servers). In C# there is the straightforward SqlBulkCopy class in the System.Data.SqlClient namespace. Moving to scripting languages, PowerShell is also a fast and neat way to import a CSV into your SQL Server; the Scripting Guy blog serves four ways to do it with PS. I usually use the C# class mentioned above.

Another way, going back to T-SQL, is using the OLE Automation stored procedures, i.e. the standard set of sp_OAMethod calls. Natively compiled stored procedures will also bring you the same functionality.
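For comparison, a minimal BULK INSERT sketch for the same Book1.csv file used below (assuming a semicolon field terminator and no header row):

```sql
-- Sketch: classic T-SQL import of the same CSV file.
CREATE TABLE dbo.Book1 (numb INT, charc VARCHAR(10));

BULK INSERT dbo.Book1
FROM 'C:\MyFolder\Book1.csv'
WITH (FIELDTERMINATOR = ';', ROWTERMINATOR = '\n');
```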

With R integration in SQL Server with Microsoft R Server, importing CSV data into SQL Server can be done with sp_execute_external_script as well. With the following snippet you can import data from a CSV into T-SQL:

EXECUTE sp_execute_external_script    
        @language = N'R'    
    , @script = N'
                mydata <- data.frame(read.csv("C:\\MyFolder\\Book1.csv", 
                                              sep=";", header=FALSE))
                colnames(mydata)[1] <- "numb"
                colnames(mydata)[2] <- "charc"
                OutputDataSet <- mydata;'
WITH RESULT SETS (( 
                     numb INT
                    ,charc VARCHAR(10)
                    ));

Or, even shorter, with the input file assigned directly to OutputDataSet:

EXECUTE sp_execute_external_script    
        @language = N'R'    
       ,@script=N'OutputDataSet <- read.csv("C:\\MyFolder\\Book1.csv")
            colnames(OutputDataSet)[1] <- "numb"
            colnames(OutputDataSet)[2] <- "charc"'
WITH RESULT SETS ((numb INT, charc VARCHAR(10)));

 

By wrapping the above EXECUTE statement in a stored procedure (let's call it ImportCSV), you can store the data directly into a table:

DECLARE @result TABLE (numb INT, charc VARCHAR(10))
INSERT INTO @result
EXECUTE ImportCSV
SELECT * FROM @result

And this is my Book1.csv sample file:

[Screenshot: Book1.csv opened in Excel]

So, another way to store data from CSV into SQL Server.

Happy R-SQLing 🙂

Find installed R packages on your Microsoft R Server

Want to find out what R libraries you have available in your SQL Server 2016 environment?


Run the following script in SSMS to find out how many R libraries are available on your R Server:

EXECUTE sp_execute_external_script
          @language = N'R'
         ,@script = N'installpackages <- nrow(installed.packages());
                    OutputDataSet <- as.data.frame(installpackages);'
         ,@input_data_1 = N'SELECT 1 AS NOF_LIBRARIES'
 WITH RESULT SETS(
                  (NOF_Libraries int)
                  );
GO

And to list all their names along with a couple of pieces of additional information (version, dependencies, ...):

EXECUTE sp_execute_external_script
  @language=N'R'
 ,@script = N'installpackages <-installed.packages();
             PackageName <- installpackages[,1];
             PackageVersion <- installpackages[,3];
             PackagePath <- installpackages[,2];
             PackageDepend <- installpackages[,6];
             OutputDataSet <- as.data.frame(cbind(PackageName,PackageVersion,PackageDepend,PackagePath));'
,@input_data_1 = N'SELECT 1 as col'
WITH RESULT SETS (
                   (PackageName NVARCHAR(50)
                   ,PackageVersion NVARCHAR(50)
                   ,PackageDepend NVARCHAR(250)
                   ,PackagePath NVARCHAR(500))
                  );
GO

Happy R-SQLing 😉

Data types: R and SQL Server – get NVARCHAR working

I like data types. A data type is an attribute that specifies what kind of data an object can hold, and an object can be anything from a variable to a column, a table, an expression, etc. By defining the allowed data for objects, this implicitly creates a set of rules that set the boundaries for each data type.


Example: the data type TINYINT has several rules that define which data can be stored. In this case the rules would be: non-negative numbers, no alpha characters, numbers ranging from 0 to 255, etc. These rules drive particular behavior: preferred operations, precedence, conversion rules and many other query, storage and execution-plan related rules.
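For example, the TINYINT boundary rules can be seen directly in T-SQL:

```sql
-- TINYINT holds whole numbers from 0 to 255.
DECLARE @t TINYINT;
SET @t = 255;   -- fine
SET @t = 256;   -- raises: Arithmetic overflow error for data type tinyint
```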

In this blog post I will mainly play with unsupported data types in the R and SQL Server integration – that is, all the data types that cannot be used as part of the input query data set for the SP_EXECUTE_EXTERNAL_SCRIPT stored procedure.

By definition these are the data types that are not supported:

  • cursor
  • timestamp
  • datetime2, datetimeoffset, time
  • All Unicode text data types: nchar, nvarchar, ntext
  • sql_variant
  • text, image
  • xml
  • hierarchyid, geometry, geography
  • CLR user-defined types

The most common ones – of course – being nvarchar, nchar and ntext.

Let's put it to the test. First of all, let's focus on the R code. I wrote the following R code in the RTVS environment:

library(RODBC)
myconn <-odbcDriverConnect("driver={SQL Server};Server=SICN-00031\\SQLSERVER2016RC3;database=Adventureworks;trusted_connection=true")

cust.data <- sqlQuery(myconn, "SELECT 1 AS Nof, 'D' AS GROUPs
                     UNION ALL SELECT 2 AS Nof, 'A' AS GROUPs
                     UNION ALL SELECT 3 AS Nof, 'A' AS GROUPs
                     UNION ALL SELECT 1 AS Nof, 'D' AS GROUPs
                     UNION ALL SELECT 3 AS Nof, 'Č' AS GROUPs
                     UNION ALL SELECT 4 AS Nof, 'Č' AS GROUPs")

cust.data2 <- sqlQuery(myconn, "SELECT 1 AS Nof, 'D' AS GROUPs
                      UNION ALL SELECT 2 AS Nof, 'A' AS GROUPs
                      UNION ALL SELECT 3 AS Nof, 'A' AS GROUPs
                      UNION ALL SELECT 1 AS Nof, 'D' AS GROUPs
                      UNION ALL SELECT 3 AS Nof, 'Č' AS GROUPs
                      UNION ALL SELECT 4 AS Nof, 'Č' AS GROUPs
                      UNION ALL SELECT 6 AS Nof, 'C' AS GROUPs")
close(myconn) 

mytable <- with(cust.data, table(Nof,GROUPs))
df <- data.frame(margin.table(mytable,2))
df

mytable2 <- with(cust.data2, table(Nof,GROUPs))
df2 <- data.frame(margin.table(mytable2,2))
df2

Upon inspecting the results of both data frames df and df2, I find them very disturbing. Both datasets should contain the special character Č (say the English word "chapter": the first pronounced letter is our special character, also represented as C^ or C/ or &ccaron;. This letter is a diacritic, part of extended ASCII, available in the UTF-8 and Latin-2 code tables), but R transforms it into a plain C. In the case of the first data frame (df) this happens to be okay; with the second one (df2) it is not, because there are both Č and C, and both became C.

Results of both dataframes:

[Screenshot: data frame results in RStudio]

We get wrong results! So please take this into consideration when dealing with special characters in R using the RODBC library. Defining the encoding (in my case) did not help at all. If I define a data frame with Slovenian characters directly in the R environment, I get correct results:

cust.data3 <- data.frame(Nof=c(1,2,3,1,3,4,6), GROUPs=c('D','A','A','D','Č','Č','C'))

mytable3 <- with(cust.data3, table(Nof,GROUPs))
df3 <- data.frame(margin.table(mytable3,2))
df3

with results:

[Screenshot: data frame with correct diacritics in RStudio]

So this is an ODBC connection problem. The code is available at GitHub.

Now, switching back to T-SQL, I will run a couple of additional tests:

DECLARE @RScript NVARCHAR(MAX)
SET @RScript = N'OutputDataSet <- InputDataSet;'


DECLARE @SQLScript NVARCHAR(MAX)
SET @SQLScript = N'
 SELECT top 10
         sod.[OrderQty] AS OrderQty
        ,so.[DiscountPct] AS Discount
        ,CAST(pc.name aS NVARCHAR(4000)) AS name
        ,pc.name as name2
        ,''ČŽŠ'' AS More_spec_chars
        ,CAST(''ČŽŠ'' AS VARCHAR(400)) AS More_spec_chars2


    FROM  Adventureworks.[Sales].[SalesOrderDetail] sod
    INNER JOIN Adventureworks.[Sales].[SpecialOffer] so
    ON so.[SpecialOfferID] = sod.[SpecialOfferID]
    INNER JOIN Adventureworks.[Production].[Product] p
    ON p.[ProductID] = sod.[ProductID]
    INNER JOIN Adventureworks.[Production].[ProductSubcategory] ps
    ON ps.[ProductSubcategoryID] = p.ProductSubcategoryID
    INNER JOIN Adventureworks.[Production].[ProductCategory] pc
    ON pc.ProductCategoryID = ps.ProductCategoryID'
                        

EXEC sp_execute_external_script 
                @language = N'R'
                , @script = @RScript
                , @input_data_1 = @SQLScript
                , @input_data_1_name = N'InputDataSet'
WITH result SETS ( (
                     OrderQty INT
                    ,Discount DECIMAL(10,2)
                    ,name NVARCHAR(MAX)
                    ,name2 VARCHAR(200)
                    ,more_spec_chars NVARCHAR(200)
                    ,more_spec_chars2 VARCHAR(200)
                    ) );

I get results returned, in a slightly weird format which does not look good at all:

[Screenshot: query results with broken special characters in SSMS]

But neither a CAST in the select list nor specifying NVARCHAR in the result set does the trick of returning the special characters.

So, running T-SQL code with R that does the same as the previous R code block, I get the same result.

DECLARE @RScript NVARCHAR(MAX)
SET @RScript = N'
InputDataSet <- cust.data
mytable <- with(cust.data, table(Nof,GROUPs))
OutputDataSet <- data.frame(margin.table(mytable,2));'


DECLARE @SQLScript NVARCHAR(MAX)
SET @SQLScript = N'
          SELECT 1 AS Nof, ''D'' AS GROUPs
UNION ALL SELECT 2 AS Nof, ''D'' AS GROUPs
UNION ALL SELECT 3 AS Nof, ''A'' AS GROUPs
UNION ALL SELECT 1 AS Nof, ''B'' AS GROUPs
UNION ALL SELECT 3 AS Nof, ''B'' AS GROUPs
UNION ALL SELECT 4 AS Nof, ''C'' AS GROUPs'
                        
EXEC sp_execute_external_script 
                @language = N'R'
                , @script = @RScript
                , @input_data_1 = @SQLScript
                , @input_data_1_name = N'cust.data'
WITH result SETS ( (
                      GROUPs CHAR(2)
                     ,Freq INT
                    ) );

[Screenshot: query results in SSMS]

Until I introduce special characters.

DECLARE @RScript NVARCHAR(MAX)
SET @RScript = N'
InputDataSet <- cust.data
mytable <- with(cust.data, table(Nof,GROUPs))
OutputDataSet <- data.frame(margin.table(mytable,2));'


DECLARE @SQLScript NVARCHAR(MAX)
SET @SQLScript = N'
          SELECT 1 AS Nof, ''D'' AS GROUPs
UNION ALL SELECT 2 AS Nof, ''A'' AS GROUPs
UNION ALL SELECT 3 AS Nof, ''A'' AS GROUPs
UNION ALL SELECT 1 AS Nof, ''D'' AS GROUPs
UNION ALL SELECT 3 AS Nof, ''Č'' AS GROUPs
UNION ALL SELECT 4 AS Nof, ''Č'' AS GROUPs'
                        
EXEC sp_execute_external_script 
                @language = N'R'
                , @script = @RScript
                , @input_data_1 = @SQLScript
                , @input_data_1_name = N'cust.data'
WITH result SETS ( (
                      GROUPs CHAR(2)
                     ,Freq INT
                    ) );

And I get a lovely error:

[Screenshot: error message in SSMS]

Yes, yes, I know: I could recode these categorical characters into ASCII-bound characters. But what if I have to do text analysis or a master data cleaning process of, let's say, customers' first names? Then this is a no-go!
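For completeness, that recoding could be sketched in R with iconv, transliterating diacritics to their closest ASCII characters (the exact result depends on the platform's iconv implementation):

```r
# Sketch: transliterate diacritics to ASCII with iconv.
x <- c("Č", "Ž", "Š", "C")
iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")
```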

R data frames return a standard set of R data types, and among them is varchar (the standard set is: bit, int, float, varchar, datetime). MSDN has posted Known Issues for SQL Server R Services stating:

[Screenshot: MSDN known-issues excerpt]

Which, as I have already mentioned, is not good if you want to do text analysis. So is this really an issue, and how can it be solved?

Using the RODBC library, I can tackle this issue by storing the data in a table first (the code is run from RTVS):

library(RODBC)
#myconn <-odbcDriverConnect("driver={SQL Server};Server=SICN-00031\\SQLSERVER2016RC3;database=Adventureworks;trusted_connection=true;DBMSencoding=utf8")
myconn <-odbcDriverConnect("driver={SQL Server};Server=SICN-00031\\SQLSERVER2016RC3;database=Adventureworks;trusted_connection=true")

###|||||||||||||||||
### I have created a table in SQL Server and stored data in table
### CREATE TABLE chartest (Nof INT,GROUPs NVARCHAR(10))
### INSERT INTO chartest
### SELECT 1 AS Nof, 'D' AS GROUPs
### UNION ALL SELECT 2 AS Nof, 'A' AS GROUPs
### UNION ALL SELECT 3 AS Nof, 'A' AS GROUPs
### UNION ALL SELECT 1 AS Nof, 'D' AS GROUPs
### UNION ALL SELECT 3 AS Nof, 'Č' AS GROUPs
### UNION ALL SELECT 4 AS Nof, 'Č' AS GROUPs
###||||||||||||||||

cust.data <- sqlQuery(myconn, "SELECT * FROM NQS.dbo.chartest")

close(myconn) 

#with no special definition of DBMSencoding, I see special characters
cust.data

This still leaves the question open of how to work around the same problem from SQL Server Management Studio. The solution is the same: store the data in a table:

IF EXISTS (SELECT * FROM sys.objects WHERE [name] = 'chartest')
DROP TABLE dbo.chartest;

CREATE TABLE dbo.chartest 
(
    Nof INT
    ,GROUPs NVARCHAR(10)
)

INSERT INTO dbo.chartest
          SELECT 1 AS Nof, 'D' AS GROUPs
UNION ALL SELECT 2 AS Nof, 'A' AS GROUPs
UNION ALL SELECT 3 AS Nof, 'A' AS GROUPs
UNION ALL SELECT 1 AS Nof, 'D' AS GROUPs
UNION ALL SELECT 3 AS Nof, 'Č' AS GROUPs
UNION ALL SELECT 4 AS Nof, 'Č' AS GROUPs
UNION ALL SELECT 6 AS Nof, 'C' AS GROUPs


-- Same set of R Code
-- With this presentation NVARCHAR is working with no problems
DECLARE @RScript NVARCHAR(MAX)
SET @RScript = N'InputDataSet <- cust.data
mytable <- with(cust.data, table(Nof,GROUPs))
OutputDataSet <- data.frame(margin.table(mytable,2));'


DECLARE @SQLScript NVARCHAR(MAX)
SET @SQLScript = N'SELECT * FROM chartest'
                    
EXEC sp_execute_external_script 
                 @language = N'R'
                ,@script = @RScript
                ,@input_data_1 = @SQLScript
                ,@input_data_1_name = N'cust.data'

WITH result SETS ( (
                      GROUPs NVARCHAR(2)
                     ,Freq INT
                    ) );

So both the table definition and the result set use NVARCHAR, meaning that special characters were sent from SSMS to R and received back intact! It was worth testing this limitation.

And this is it! 🙂 All code available here.

To finish, this is my R sessionInfo() locale output:

[Screenshot: R sessionInfo locale output in RStudio]

Cheers!