Data types: R and SQL Server – get NVARCHAR working

I like data types. Data type is an attribute that specifies what kind of data the object can hold. And object can be from variable, to column, to table, expression, etc. By defining the allowed data for objects, this inadvertently creates set of rules that set the boundaries for each data type.

2016-06-12 21_59_27-R microsoft - Google Search

Example: data type TINY has several rules that defined which data can be stored. So in this case rules would be: non-negative numbers, no alpha-characters, numbers ranging from 0 to 255, etc. This rules apply particular behavior, preferred operations, selecting precedence, conversion rules and many other query, storage, execution plans related rules.

In this blog post I will mainly play with unsupported data types in R and SQL Server integration. So all the data types that can not be used as part of query input data-set for SP_EXECUTE_EXTERNAL_SCRIPT stored procedure.

By definition these are the data types that are not supported:

  • cursor
  • timestamp
  • datetime2, datetimeoffset, time
  • All Unicode text data types: nchar, nvarchar, ntext
  • sql_variant
  • text, image
  • xml
  • hierarchyid, geometry, geography
  • CLR user-defined types

Most common – of course – being nvarchar, nchar or ntext.

Let’s put it to the test. First of all let’s focus on R Code. I write following R code in RTVS environment:

library(RODBC)
myconn <-odbcDriverConnect("driver={SQL Server};Server=SICN-00031\\SQLSERVER2016RC3;database=Adventureworks;trusted_connection=true")

cust.data <- sqlQuery(myconn, "SELECT 1 AS Nof, 'D' AS GROUPs
                     UNION ALL SELECT 2 AS Nof, 'A' AS GROUPs
                     UNION ALL SELECT 3 AS Nof, 'A' AS GROUPs
                     UNION ALL SELECT 1 AS Nof, 'D' AS GROUPs
                     UNION ALL SELECT 3 AS Nof, 'Č' AS GROUPs
                     UNION ALL SELECT 4 AS Nof, 'Č' AS GROUPs")

cust.data2 <- sqlQuery(myconn, "SELECT 1 AS Nof, 'D' AS GROUPs
                      UNION ALL SELECT 2 AS Nof, 'A' AS GROUPs
                      UNION ALL SELECT 3 AS Nof, 'A' AS GROUPs
                      UNION ALL SELECT 1 AS Nof, 'D' AS GROUPs
                      UNION ALL SELECT 3 AS Nof, 'Č' AS GROUPs
                      UNION ALL SELECT 4 AS Nof, 'Č' AS GROUPs
                      UNION ALL SELECT 6 AS Nof, 'C' AS GROUPs")
close(myconn) 

mytable <- with(cust.data, table(Nof,GROUPs))
df <- data.frame(margin.table(mytable,2))
df

mytable2 <- with(cust.data2, table(Nof,GROUPs))
df2 <- data.frame(margin.table(mytable2,2))
df2

Upon running the resutls of both data.frames df and df2 I find the results very disturbing. In both datasets there should be special character Č (say an english word chapter and the first pronounced letter is our special character also represented as C^ or C/ or &ccaron; This letter (character) is a diacritic one as an extended part of ASCII table, available in UTF-8 or Latin-2 code table) but R transforms it into simple C. Which in case of first data frame (df) is okey, with second one is not (df2), because there is Č and C and both became C.

Results of both dataframes:

2016-06-12 21_54_05-RStudio

We get wrong results! So please take into consideration when dealing with special characters in R when using RODBC library! Defining encoding (in my case) did not help at all. If I define a data frame with Slovenian characters in R environment, I get correct results:

cust.data3 <- data.frame(Nof=c(1,2,3,1,3,4,6), GROUPs=c('D','A','A','D','Č','Č','C'))

mytable3 <- with(cust.data3, table(Nof,GROUPs))
df3 <- data.frame(margin.table(mytable3,2))
df3

with results:

2016-06-12 22_16_00-RStudio

So this is a ODBC connection problem. Code is available at github.

Now, switching  back to T-SQL I will run couple of additional tests:

DECLARE @RScript NVARCHAR(MAX)
SET @RScript = N'OutputDataSet <- InputDataSet;'


DECLARE @SQLScript NVARCHAR(MAX)
SET @SQLScript = N'
 SELECT top 10
         sod.[OrderQty] AS OrderQty
        ,so.[DiscountPct] AS Discount
        ,CAST(pc.name aS NVARCHAR(4000)) AS name
        ,pc.name as name2
        ,''ČŽŠ'' AS More_spec_chars
        ,CAST(''ČŽŠ'' AS VARCHAR(400)) AS More_spec_chars2


    FROM  Adventureworks.[Sales].[SalesOrderDetail] sod
    INNER JOIN Adventureworks.[Sales].[SpecialOffer] so
    ON so.[SpecialOfferID] = sod.[SpecialOfferID]
    INNER JOIN Adventureworks.[Production].[Product] p
    ON p.[ProductID] = sod.[ProductID]
    INNER JOIN Adventureworks.[Production].[ProductSubcategory] ps
    ON ps.[ProductSubcategoryID] = p.ProductSubcategoryID
    INNER JOIN Adventureworks.[Production].[ProductCategory] pc
    ON pc.ProductCategoryID = ps.ProductCategoryID'
                        

EXEC sp_execute_external_script 
                @language = N'R'
                , @script = @RScript
                , @input_data_1 = @SQLScript
                , @input_data_1_name = N''
WITH result SETS ( (
                     OrderQty INT
                    ,Discount DECIMAL(10,2)
                    ,name NVARCHAR(MAX)
                    ,name2 VARCHAR(200)
                    ,more_spec_chars NVARCHAR(200)
                    ,more_spec_chars2 VARCHAR(200)
                    ) );

I get a results returned; in slightly weird format which does not look good at all:

2016-06-12 22_36_43-01-DataType_R_SQLServer.sql - SICN-00031_SQLSERVER2016RC3.R_Logging (SPAR_si0101

But neither CAST in select list or specifying NVARCHAR in Result SET will not do the trick of returning and special characters.

So running T-SQL code with R doing the same as the previous R block code; I get the same result.

DECLARE @RScript NVARCHAR(MAX)
SET @RScript = N'
InputDataSet <- cust.data
mytable <- with(cust.data, table(Nof,GROUPs))
OutputDataSet <- data.frame(margin.table(mytable,2));'


DECLARE @SQLScript NVARCHAR(MAX)
SET @SQLScript = N'
          SELECT 1 AS Nof, ''D'' AS GROUPs
UNION ALL SELECT 2 AS Nof, ''D'' AS GROUPs
UNION ALL SELECT 3 AS Nof, ''A'' AS GROUPs
UNION ALL SELECT 1 AS Nof, ''B'' AS GROUPs
UNION ALL SELECT 3 AS Nof, ''B'' AS GROUPs
UNION ALL SELECT 4 AS Nof, ''C'' AS GROUPs'
                        
EXEC sp_execute_external_script 
                @language = N'R'
                , @script = @RScript
                , @input_data_1 = @SQLScript
                , @input_data_1_name = N'cust.data'
WITH result SETS ( (
                      GROUPs CHAR(2)
                     ,Freq INT
                    ) );

2016-06-12 22_44_07-01-DataType_R_SQLServer.sql - SICN-00031_SQLSERVER2016RC3.R_Logging (SPAR_si0101

Until I introduce special characters.

DECLARE @RScript NVARCHAR(MAX)
SET @RScript = N'
InputDataSet <- cust.data
mytable <- with(cust.data, table(Nof,GROUPs))
OutputDataSet <- data.frame(margin.table(mytable,2));'


DECLARE @SQLScript NVARCHAR(MAX)
SET @SQLScript = N'
          SELECT 1 AS Nof, ''D'' AS GROUPs
UNION ALL SELECT 2 AS Nof, ''A'' AS GROUPs
UNION ALL SELECT 3 AS Nof, ''A'' AS GROUPs
UNION ALL SELECT 1 AS Nof, ''D'' AS GROUPs
UNION ALL SELECT 3 AS Nof, ''Č'' AS GROUPs
UNION ALL SELECT 4 AS Nof, ''Č'' AS GROUPs'
                        
EXEC sp_execute_external_script 
                @language = N'R'
                , @script = @RScript
                , @input_data_1 = @SQLScript
                , @input_data_1_name = N'cust.data'
WITH result SETS ( (
                      GROUPs CHAR(2)
                     ,Freq INT
                    ) );

And I get lovely error:
2016-06-12 22_46_20-01-DataType_R_SQLServer.sql - SICN-00031_SQLSERVER2016RC3.R_Logging (SPAR_si0101

Yes Yes, I know. I can recode this categorical chars into ASCII bound characters. But what if I have to do a text analysis or Master data cleaning process of let’s say customer’s First Name? Then this is a no-go!

Since R data.frame returning a standard R set of data types and among is varchar (standard set is: bit, int, float, varchar, datetime).  MSDN has posted Known Issues for SQL Server R Service stating:

Capture.PNG

Which in a way is not good – as I have already mentioned; if you want to do text analysis. So is this really an issue and how it can be solved?

So using RODBC library, using; I can tackle this issue by storing data into table (code is run from RTVS):

library(RODBC)
#myconn <-odbcDriverConnect("driver={SQL Server};Server=SICN-00031\\SQLSERVER2016RC3;database=Adventureworks;trusted_connection=true;DBMSencoding=utf8")
myconn <-odbcDriverConnect("driver={SQL Server};Server=SICN-00031\\SQLSERVER2016RC3;database=Adventureworks;trusted_connection=true")

###|||||||||||||||||
### I have created a table in SQL Server and stored data in table
### CREATE TABLE chartest (Nof INT,GROUPs NVARCHAR(10))
### INSERT INTO chartest
### SELECT 1 AS Nof, 'D' AS GROUPs
### UNION ALL SELECT 2 AS Nof, 'A' AS GROUPs
### UNION ALL SELECT 3 AS Nof, 'A' AS GROUPs
### UNION ALL SELECT 1 AS Nof, 'D' AS GROUPs
### UNION ALL SELECT 3 AS Nof, 'Č' AS GROUPs
### UNION ALL SELECT 4 AS Nof, 'Č' AS GROUPs
###||||||||||||||||

cust.data <- sqlQuery(myconn, "SELECT * FROM NQS.dbo.chartest")

close(myconn) 

#with no special definition of DBMSencoding, I see special characters
cust.data

Which still leaves the question opened, how to workaround by running same problem from SQL Server Management Studio. Same solution is to store data into table:

IF EXISTS (SELECT * FROM sys.objects WHERE [name] = 'chartest')
DROP TABLE dbo.chartest;

CREATE TABLE dbo.chartest 
(
    Nof INT
    ,GROUPs NVARCHAR(10)
)

INSERT INTO dbo.chartest
          SELECT 1 AS Nof, 'D' AS GROUPs
UNION ALL SELECT 2 AS Nof, 'A' AS GROUPs
UNION ALL SELECT 3 AS Nof, 'A' AS GROUPs
UNION ALL SELECT 1 AS Nof, 'D' AS GROUPs
UNION ALL SELECT 3 AS Nof, 'Č' AS GROUPs
UNION ALL SELECT 4 AS Nof, 'Č' AS GROUPs
UNION ALL SELECT 6 AS Nof, 'C' AS GROUPs


-- Same set of R Code
-- With this presentation NVARCHAR is working with no problems
DECLARE @RScript NVARCHAR(MAX)
SET @RScript = N'InputDataSet <- cust.data
mytable <- with(cust.data, table(Nof,GROUPs))
OutputDataSet <- data.frame(margin.table(mytable,2));'


DECLARE @SQLScript NVARCHAR(MAX)
SET @SQLScript = N'SELECT * FROM chartest'
                    
EXEC sp_execute_external_script 
                 @language = N'R'
                ,@script = @RScript
                ,@input_data_1 = @SQLScript
                ,@input_data_1_name = N'cust.data'

WITH result SETS ( (
                      GROUPs NVARCHAR(2)
                     ,Freq INT
                    ) );

So both: table definition and result set have NVARCHAR dataset, meaning, that it send and received special characters from SSMS to R and back! And it was worth testing this limitation.

And this is it! 🙂 All code available here.

For the end; this is my R sesssionInfo on locale:

2016-06-13 10_19_19-RStudio

Cheers!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s