Graph Database with T-SQL

SQL Server 2017 will bring users graph database. Graph database is type of database that uses graph attributes – edges (or relationships) and nodes (or vertices) and features, to store data. Relationship between nodes are through edges and features (also called properties) hold the values to represent the strength or direction of the edges. Many of this links can be retrieved easy with a single operation (in comparison to T-SQL language, which can quickly become complex).

Links are also stored in the database, and when querying the graph database, use join clause, to collect the related data. Especially very useful for hierarchical structures and complex social structures, can be easily stored and queried in comparison to “relational” database.

Network database vs. graph database are similar, whereas, network might have problems when dealing with several chains of edges.

When to use graph/network database should be obvious; when you have hierarchical data or when your data is too complex with many many-to-many relations, that would make a relational database hard to query.

small_undirected_network_labeled

 

To put the graph database to the test, I took bunch of emails from a particular MVP SQL Server distribution list (content will not be shown and all the names will be anonymized). On my gmail account, I have downloaded some 90MiB of emails in mbox file format. With some python scripting,  only FROM and SUBJECTS were extracted:

writer.writerow(['from','subject'])
for index, message in enumerate(mailbox.mbox(infile)):
    content = get_content(message)
    row = [
        message['from'].strip('>').split('<')[-1],
        decode_header(message['subject'])[0][0],"|"
          ]
    writer.writerow(row)

Up to this point, I have extracted a CSV file with only From field and subject field, looking as:

Excel_emails

After importing the CSV file to SQL Server database, I have started to prepare my graph database structures:

CREATE TABLE MVP_DataPlatform
(
 MVP_ID INT IDENTITY(1,1) NOT NULL
 ,MVP_Name VARCHAR(100) NULL
 ,MVP_Email VARCHAR(100) NOT NULL
) AS NODE
GO

CREATE TABLE MVP_Topics
(
 Topic_ID INT IDENTITY(1,1) NOT NULL
 ,Title NVARCHAR(250) NOT NULL
) AS NODE
GO

CREATE TABLE Initiated AS EDGE;
CREATE TABLE Replied AS EDGE;
And prepare the data with:

with mails 
as
(
SELECT 
 froms 
 ,REPLACE(SUBSTRING(froms, 1, CASE CHARINDEX(';', froms) 
  WHEN 0 THEN LEN(froms) ELSE CHARINDEX(';',froms)-1 END),',','') AS from_MVP
 ,REPLACE(SUBSTRING(froms, CHARINDEX(';',froms) + 1, 1000),',|','') AS title

FROM [dbo].[clean_mail5_2] 
)
SELECT 
 from_MVP
 ,SUBSTRING(title,2, LEN(title)) as title
 ,CASE WHEN (
 SUBSTRING(title,1, 10) LIKE '%RE%' 
 OR SUBSTRING(title,1, 10) LIKE '%AW%'
 OR SUBSTRING(title,1, 10) LIKE '%FWD%'
 OR SUBSTRING(title,1, 10) LIKE '%OT%'
 ) THEN 0 ELSE 1 END MVP_start
INTO MVP_Mails
FROM mails
 WHERE
 from_MVP like '%@%'

Populating the nodes (vertices) should be a easy task:

INSERT INTO MVP_DataPlatform (MVP_Name, MVP_Email)
SELECT 
 SUBSTRING(from_MVP, 1, CHARINDEX('@',from_MVP)-1) AS MVP_NAME
 ,from_MVP AS MVP_Email
FROM MVP_Mails
GROUP BY 
 SUBSTRING(from_MVP, 1, CHARINDEX('@',from_MVP)-1)
 ,from_MVP;
GO

INSERT INTO MVP_Topics (Title)
SELECT 
 Title
FROM MVP_Mails
GROUP BY 
 title;
GO

But populating the edges (relationships) is part of how you want to define your graph.

I have created two edges; Replied and initiated. When using replied, logic will be following:

  1. AB -> send a mail to distribution list =  AB -> AB
  2. CD -> replies to AB  = CD -> AB
  3. EF -> replies to CD (and AB) = EF -> CD
  4. and so on.

When Initiated, logic is slightly different:

  1. AB -> send a mail to distribution list =  AB -> AB
  2. CD -> replies to AB  = CD -> AB
  3. EF -> replies to CD (and AB) = EF -> AB
  4. and so on…

So difference is, how I want to store information at step 3.; when EF replies. But this rule is – totally arbitrary and up to your type of business question, hierarchy or many-to-many rules. I can even decide to store information at step 3. normalized, as: ((EF -> CD) AND (EF -> AB)).

To store this set of information into the table of edges, following transformation is needed:

CREATE OR ALTER PROCEDURE INSERT_EDGE
AS
DECLARE @id1 INT;
DECLARE @id2 INT;
DECLARE @i INT;
DECLARE @hid INT;
DECLARE @START INT;

select 
 mail.from_MVP as mail_ID
,pl.MVP_id AS pl_ID
,mail.title
,mail.MVP_Start
,row_number() OVER (partition by title ORDER BY title, MVP_Start DESC) as rn_title
,row_number() Over (order by (select 1)) as rn_global
into #temp
from MVP_Mails as mail
JOIN MVP_DataPlatform AS Pl
ON mail.from_mvp = pl.mvp_email

SELECT @i = MAX(rn_global) FROM #TEMP

SET @hid=1;
WHILE @hid <= @i
BEGIN 
 SELECT @start = MVP_start FROM #temp WHERE rn_global = @hid
 PRINT @start
 SELECT @id1 = pl_id FROM #temp WHERE rn_global = @hid
 PRINT @id1
 IF @start = 1
 SELECT @id2 = pl_id FROM #temp WHERE rn_global = @hid+1
 IF @start = 0
 SET @id2 = @id1
 PRINT @id2

INSERT INTO replied VALUES ((SELECT $NODE_ID FROM MVP_DataPlatform WHERE MVP_ID = @id1), (SELECT $NODE_ID FROM MVP_DataPlatform WHERE MVP_ID = @id2));
 SET @hid = @hid + 1;
END

Executing this procedure:

EXECUTE PROCEDURE INSERT_EDGE

You should be getting a table with relations:

relations

And now you can perform simple queries against the graph – including complex relations, in order to get results relatively simple:

SELECT 
 MVP_1.MVP_Name 
 ,MVP_2.MVP_Name
 ,T.Title
 ,T.Topic_ID
FROM 
 MVP_DataPlatform AS MVP_1
 ,MVP_DataPlatform AS MVP_2
 ,Replied
 ,MVP_Topics AS T
WHERE 
 MATCH(MVP_1-(Replied)->MVP_2)

 

Sure, there are many other ways to do the same steps and also many other ways how you want your data to be represented in a graph.

Happy T-SQL Querying! 🙂

 

Advertisements

R graphs and tables in Power BI Desktop

Power BI Desktop enable users to use R script visual for adding custom visualization generated with R language – regardless of R package used. Before using R script visual, you will need to enable it by setting path to R Engine on your client in the global options. Once this is done, you will be able to enhance your Power BI reports using R visualizations.

1450757873470

My personal favorite R packages I like to use in Power BI are shiny, plotly, ggplot2, googleVis, ggiraph, rCharts and d3js.  Apart from finding the package for data visualization where you feel home, one of the very great thing I find with Power BI + R is the ability to have visualizations created dynamically (more or less like if you would be using plotly or shiny)  with slicers and selectors. And this makes data exploration, data wrangling so much easier.

With October 2016 Power BI custom visuals got additional 7 R visuals to the family (check Power BI Visuals Gallery and fall in love with your favorite one 🙂 ).

2016-12-18-09_26_55-visuals-gallery-_-microsoft-power-bi

Upon using any of the additional R visuals, Power BI will – prior to importing the additional visualizations – ask for installing all the missing packages, as well as inform you of any incompatibility between R engine and package.

So how easy is to create R visualization in Power BI?  Let us first import data to Power BI from SQL Server. I have generated a sample data-set for this demo. Data and all the samples are available at Github.

2016-12-18-11_32_26-preparing-data-sql-sicn-kastrun-plan-spar_si01017988-69_-microsoft-sql-s

With sample data-set I have created a simple data visualization with slicers using Power BI.

2016-12-18-11_34_01-sample_powerbi_r-power-bi-desktop

Now, the same visualization using R.

2016-12-18-11_34_01-sample_powerbi_r-power-bi-desktop_2

Note that I have used same slicers (Year = 2016, Month = 2) and R graph (using ggplot2) was able to dynamically  visualize the numbers. So this mean, that any R visualized graph can be navigated and dynamically updated, which is absolutely great, especially in cases when you do data exploration. Following R code was used:

library(ggplot2)
ggplot(data=dataset, aes(x=Day, y=Value)) +  geom_line(aes(color="red")) + 
ggtitle("Value by day") + theme(legend.position = "none") + 
theme(panel.background = element_rect(fill = "white", color = "lightgray", size = 1))

with additional dataset conversion to data.frame done automatically.

2016-12-18-11_57_51-sample_powerbi_r-power-bi-desktop_code

Where it get’s tricky is if you want to return the R data.frame into your Power BI page in form of a table (imagine you want some results to your graph or that you are calculating some statistics). Well, Power BI does not allow returning tables from R script visual. Unless, table is an image (that is, in a “form” of a visualized “graph”).

So I manage to create a table as a graph using R and gplot function from ggplot2 package.

2016-12-18-12_40_32-sample_powerbi_r-power-bi-desktop

The best part of this table is, that is dynamically changed based on selection on graphs or slicers. The downside is, that it can not be scrolled within the object. So if you end up with bigger table, you have to tweak the font size or show table aggregated or grouped.  In my opinion, I would not change R created table with Power BI table, because Power BI table is simply better and easier to create and maintain. But I would definitely recommend using R tables to support your R visuals. For example: If you are generating clusters using R in power BI, use R generated table to expose additional statistics on clusters.

The code for generating this R table is:

library(gridExtra)
library(ggplot2)

qplot(x=1:5, y=1:5, geom = "blank") + 
theme(line = element_blank(), text = element_blank()) + 
annotation_custom(grob=tableGrob(dataset))

Script generated blank plot, where custom annotation is added with function grob. And this is our data.frame from dataset and is displayed on top of empty graph.In addition to “remove” the background blank graph, blank line and text is added:

theme(line = element_blank(), text = element_blank())

 

Update #1 (19.12.2016)

After reading the comment to this blog post from reader Baptiste Auguie  (with link to his post on displaying tables as grid graphs), that using grid will return the same result, but very much a simpler solution to plotting / drawing table. Thank you Baptiste! I must have missed this one.

R code is far less simpler without  qplot function.

library(gridExtra)
grid.table(dataset)

And the printscreen from Power BI:

2016-12-19-06_13_19-sample_powerbi_r-power-bi-desktop

As always, sample, code and data-set is available at Github.

Happy PowerBI-Rrrring!