Graph Database with T-SQL

SQL Server 2017 will bring users graph database. Graph database is type of database that uses graph attributes – edges (or relationships) and nodes (or vertices) and features, to store data. Relationship between nodes are through edges and features (also called properties) hold the values to represent the strength or direction of the edges. Many of this links can be retrieved easy with a single operation (in comparison to T-SQL language, which can quickly become complex).

Links are also stored in the database, and when querying the graph database, use join clause, to collect the related data. Especially very useful for hierarchical structures and complex social structures, can be easily stored and queried in comparison to “relational” database.

Network database vs. graph database are similar, whereas, network might have problems when dealing with several chains of edges.

When to use graph/network database should be obvious; when you have hierarchical data or when your data is too complex with many many-to-many relations, that would make a relational database hard to query.

small_undirected_network_labeled

 

To put the graph database to the test, I took bunch of emails from a particular MVP SQL Server distribution list (content will not be shown and all the names will be anonymized). On my gmail account, I have downloaded some 90MiB of emails in mbox file format. With some python scripting,  only FROM and SUBJECTS were extracted:

writer.writerow(['from','subject'])
for index, message in enumerate(mailbox.mbox(infile)):
    content = get_content(message)
    row = [
        message['from'].strip('>').split('<')[-1],
        decode_header(message['subject'])[0][0],"|"
          ]
    writer.writerow(row)

Up to this point, I have extracted a CSV file with only From field and subject field, looking as:

Excel_emails

After importing the CSV file to SQL Server database, I have started to prepare my graph database structures:

CREATE TABLE MVP_DataPlatform
(
 MVP_ID INT IDENTITY(1,1) NOT NULL
 ,MVP_Name VARCHAR(100) NULL
 ,MVP_Email VARCHAR(100) NOT NULL
) AS NODE
GO

CREATE TABLE MVP_Topics
(
 Topic_ID INT IDENTITY(1,1) NOT NULL
 ,Title NVARCHAR(250) NOT NULL
) AS NODE
GO

CREATE TABLE Initiated AS EDGE;
CREATE TABLE Replied AS EDGE;
And prepare the data with:

with mails 
as
(
SELECT 
 froms 
 ,REPLACE(SUBSTRING(froms, 1, CASE CHARINDEX(';', froms) 
  WHEN 0 THEN LEN(froms) ELSE CHARINDEX(';',froms)-1 END),',','') AS from_MVP
 ,REPLACE(SUBSTRING(froms, CHARINDEX(';',froms) + 1, 1000),',|','') AS title

FROM [dbo].[clean_mail5_2] 
)
SELECT 
 from_MVP
 ,SUBSTRING(title,2, LEN(title)) as title
 ,CASE WHEN (
 SUBSTRING(title,1, 10) LIKE '%RE%' 
 OR SUBSTRING(title,1, 10) LIKE '%AW%'
 OR SUBSTRING(title,1, 10) LIKE '%FWD%'
 OR SUBSTRING(title,1, 10) LIKE '%OT%'
 ) THEN 0 ELSE 1 END MVP_start
INTO MVP_Mails
FROM mails
 WHERE
 from_MVP like '%@%'

Populating the nodes (vertices) should be a easy task:

INSERT INTO MVP_DataPlatform (MVP_Name, MVP_Email)
SELECT 
 SUBSTRING(from_MVP, 1, CHARINDEX('@',from_MVP)-1) AS MVP_NAME
 ,from_MVP AS MVP_Email
FROM MVP_Mails
GROUP BY 
 SUBSTRING(from_MVP, 1, CHARINDEX('@',from_MVP)-1)
 ,from_MVP;
GO

INSERT INTO MVP_Topics (Title)
SELECT 
 Title
FROM MVP_Mails
GROUP BY 
 title;
GO

But populating the edges (relationships) is part of how you want to define your graph.

I have created two edges; Replied and initiated. When using replied, logic will be following:

  1. AB -> send a mail to distribution list =  AB -> AB
  2. CD -> replies to AB  = CD -> AB
  3. EF -> replies to CD (and AB) = EF -> CD
  4. and so on.

When Initiated, logic is slightly different:

  1. AB -> send a mail to distribution list =  AB -> AB
  2. CD -> replies to AB  = CD -> AB
  3. EF -> replies to CD (and AB) = EF -> AB
  4. and so on…

So difference is, how I want to store information at step 3.; when EF replies. But this rule is – totally arbitrary and up to your type of business question, hierarchy or many-to-many rules. I can even decide to store information at step 3. normalized, as: ((EF -> CD) AND (EF -> AB)).

To store this set of information into the table of edges, following transformation is needed:

CREATE OR ALTER PROCEDURE INSERT_EDGE
AS
DECLARE @id1 INT;
DECLARE @id2 INT;
DECLARE @i INT;
DECLARE @hid INT;
DECLARE @START INT;

select 
 mail.from_MVP as mail_ID
,pl.MVP_id AS pl_ID
,mail.title
,mail.MVP_Start
,row_number() OVER (partition by title ORDER BY title, MVP_Start DESC) as rn_title
,row_number() Over (order by (select 1)) as rn_global
into #temp
from MVP_Mails as mail
JOIN MVP_DataPlatform AS Pl
ON mail.from_mvp = pl.mvp_email

SELECT @i = MAX(rn_global) FROM #TEMP

SET @hid=1;
WHILE @hid <= @i
BEGIN 
 SELECT @start = MVP_start FROM #temp WHERE rn_global = @hid
 PRINT @start
 SELECT @id1 = pl_id FROM #temp WHERE rn_global = @hid
 PRINT @id1
 IF @start = 1
 SELECT @id2 = pl_id FROM #temp WHERE rn_global = @hid+1
 IF @start = 0
 SET @id2 = @id1
 PRINT @id2

INSERT INTO replied VALUES ((SELECT $NODE_ID FROM MVP_DataPlatform WHERE MVP_ID = @id1), (SELECT $NODE_ID FROM MVP_DataPlatform WHERE MVP_ID = @id2));
 SET @hid = @hid + 1;
END

Executing this procedure:

EXECUTE PROCEDURE INSERT_EDGE

You should be getting a table with relations:

relations

And now you can perform simple queries against the graph – including complex relations, in order to get results relatively simple:

SELECT 
 MVP_1.MVP_Name 
 ,MVP_2.MVP_Name
 ,T.Title
 ,T.Topic_ID
FROM 
 MVP_DataPlatform AS MVP_1
 ,MVP_DataPlatform AS MVP_2
 ,Replied
 ,MVP_Topics AS T
WHERE 
 MATCH(MVP_1-(Replied)->MVP_2)

 

Sure, there are many other ways to do the same steps and also many other ways how you want your data to be represented in a graph.

Happy T-SQL Querying! 🙂

 

Tagged with: , , ,
Posted in Uncategorized
4 comments on “Graph Database with T-SQL
  1. […] Tomaz Kastrun uses a set of e-mails as his SQL Server 2017 graph table data source: […]

    Like

  2. […] Graph Database with T-SQL | TomazTsql tomaztsql.wordpress.com/2017/06/28/graph-database-with-t-sql […]

    Like

Leave a comment

Follow TomazTsql on WordPress.com
Programs I Use: SQL Search
Programs I Use: R Studio
Programs I Use: Plan Explorer
Rdeči Noski – Charity

Rdeči noski

100% of donations made here go to charity, no deductions, no fees. For CLOWNDOCTORS - encouraging more joy and happiness to children staying in hospitals (http://www.rednoses.eu/red-noses-organisations/slovenia/)

€2.00

Top SQL Server Bloggers 2018
TomazTsql

Tomaz doing BI and DEV with SQL Server and R, Python, Power BI, Azure and beyond

Discover WordPress

A daily selection of the best content published on WordPress, collected for you by humans who love to read.

Revolutions

Tomaz doing BI and DEV with SQL Server and R, Python, Power BI, Azure and beyond

tenbulls.co.uk

tenbulls.co.uk - attaining enlightenment with the Microsoft Data and Cloud Platforms with a sprinkling of Open Source and supporting technologies!

SQL DBA with A Beard

He's a SQL DBA and he has a beard

Reeves Smith's SQL & BI Blog

A blog about SQL Server and the Microsoft Business Intelligence stack with some random Non-Microsoft tools thrown in for good measure.

SQL Server

for Application Developers

Business Analytics 3.0

Data Driven Business Models

SQL Database Engine Blog

Tomaz doing BI and DEV with SQL Server and R, Python, Power BI, Azure and beyond

Search Msdn

Tomaz doing BI and DEV with SQL Server and R, Python, Power BI, Azure and beyond

R-bloggers

Tomaz doing BI and DEV with SQL Server and R, Python, Power BI, Azure and beyond

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Data Until I Die!

Data for Life :)

Paul Turley's SQL Server BI Blog

sharing my experiences with the Microsoft data platform, SQL Server BI, Data Modeling, SSAS Design, Power Pivot, Power BI, SSRS Advanced Design, Power BI, Dashboards & Visualization since 2009

Grant Fritchey

Intimidating Databases and Code

Madhivanan's SQL blog

A modern business theme

Alessandro Alpi's Blog

DevOps could be the disease you die with, but don’t die of.

Paul te Braak

Business Intelligence Blog

Sql Insane Asylum (A Blog by Pat Wright)

Information about SQL (PostgreSQL & SQL Server) from the Asylum.

Gareth's Blog

A blog about Life, SQL & Everything ...