SQL Server 2017 will bring users graph database. Graph database is type of database that uses graph attributes – edges (or relationships) and nodes (or vertices) and features, to store data. Relationship between nodes are through edges and features (also called properties) hold the values to represent the strength or direction of the edges. Many of this links can be retrieved easy with a single operation (in comparison to T-SQL language, which can quickly become complex).
Links are also stored in the database, and when querying the graph database, use join clause, to collect the related data. Especially very useful for hierarchical structures and complex social structures, can be easily stored and queried in comparison to “relational” database.
Network database vs. graph database are similar, whereas, network might have problems when dealing with several chains of edges.
When to use graph/network database should be obvious; when you have hierarchical data or when your data is too complex with many many-to-many relations, that would make a relational database hard to query.
To put the graph database to the test, I took bunch of emails from a particular MVP SQL Server distribution list (content will not be shown and all the names will be anonymized). On my gmail account, I have downloaded some 90MiB of emails in mbox file format. With some python scripting, only FROM and SUBJECTS were extracted:
writer.writerow(['from','subject']) for index, message in enumerate(mailbox.mbox(infile)): content = get_content(message) row = [ message['from'].strip('>').split('<')[-1], decode_header(message['subject'])[0][0],"|" ] writer.writerow(row)
Up to this point, I have extracted a CSV file with only From field and subject field, looking as:
After importing the CSV file to SQL Server database, I have started to prepare my graph database structures:
CREATE TABLE MVP_DataPlatform ( MVP_ID INT IDENTITY(1,1) NOT NULL ,MVP_Name VARCHAR(100) NULL ,MVP_Email VARCHAR(100) NOT NULL ) AS NODE GO CREATE TABLE MVP_Topics ( Topic_ID INT IDENTITY(1,1) NOT NULL ,Title NVARCHAR(250) NOT NULL ) AS NODE GO CREATE TABLE Initiated AS EDGE; CREATE TABLE Replied AS EDGE;
And prepare the data with: with mails as ( SELECT froms ,REPLACE(SUBSTRING(froms, 1, CASE CHARINDEX(';', froms) WHEN 0 THEN LEN(froms) ELSE CHARINDEX(';',froms)-1 END),',','') AS from_MVP ,REPLACE(SUBSTRING(froms, CHARINDEX(';',froms) + 1, 1000),',|','') AS title FROM [dbo].[clean_mail5_2] ) SELECT from_MVP ,SUBSTRING(title,2, LEN(title)) as title ,CASE WHEN ( SUBSTRING(title,1, 10) LIKE '%RE%' OR SUBSTRING(title,1, 10) LIKE '%AW%' OR SUBSTRING(title,1, 10) LIKE '%FWD%' OR SUBSTRING(title,1, 10) LIKE '%OT%' ) THEN 0 ELSE 1 END MVP_start INTO MVP_Mails FROM mails WHERE from_MVP like '%@%'
Populating the nodes (vertices) should be a easy task:
INSERT INTO MVP_DataPlatform (MVP_Name, MVP_Email) SELECT SUBSTRING(from_MVP, 1, CHARINDEX('@',from_MVP)-1) AS MVP_NAME ,from_MVP AS MVP_Email FROM MVP_Mails GROUP BY SUBSTRING(from_MVP, 1, CHARINDEX('@',from_MVP)-1) ,from_MVP; GO INSERT INTO MVP_Topics (Title) SELECT Title FROM MVP_Mails GROUP BY title; GO
But populating the edges (relationships) is part of how you want to define your graph.
I have created two edges; Replied and initiated. When using replied, logic will be following:
- AB -> send a mail to distribution list = AB -> AB
- CD -> replies to AB = CD -> AB
- EF -> replies to CD (and AB) = EF -> CD
- and so on.
When Initiated, logic is slightly different:
- AB -> send a mail to distribution list = AB -> AB
- CD -> replies to AB = CD -> AB
- EF -> replies to CD (and AB) = EF -> AB
- and so on…
So difference is, how I want to store information at step 3.; when EF replies. But this rule is – totally arbitrary and up to your type of business question, hierarchy or many-to-many rules. I can even decide to store information at step 3. normalized, as: ((EF -> CD) AND (EF -> AB)).
To store this set of information into the table of edges, following transformation is needed:
CREATE OR ALTER PROCEDURE INSERT_EDGE AS DECLARE @id1 INT; DECLARE @id2 INT; DECLARE @i INT; DECLARE @hid INT; DECLARE @START INT; select mail.from_MVP as mail_ID ,pl.MVP_id AS pl_ID ,mail.title ,mail.MVP_Start ,row_number() OVER (partition by title ORDER BY title, MVP_Start DESC) as rn_title ,row_number() Over (order by (select 1)) as rn_global into #temp from MVP_Mails as mail JOIN MVP_DataPlatform AS Pl ON mail.from_mvp = pl.mvp_email SELECT @i = MAX(rn_global) FROM #TEMP SET @hid=1; WHILE @hid <= @i BEGIN SELECT @start = MVP_start FROM #temp WHERE rn_global = @hid PRINT @start SELECT @id1 = pl_id FROM #temp WHERE rn_global = @hid PRINT @id1 IF @start = 1 SELECT @id2 = pl_id FROM #temp WHERE rn_global = @hid+1 IF @start = 0 SET @id2 = @id1 PRINT @id2 INSERT INTO replied VALUES ((SELECT $NODE_ID FROM MVP_DataPlatform WHERE MVP_ID = @id1), (SELECT $NODE_ID FROM MVP_DataPlatform WHERE MVP_ID = @id2)); SET @hid = @hid + 1; END
Executing this procedure:
EXECUTE PROCEDURE INSERT_EDGE
You should be getting a table with relations:
And now you can perform simple queries against the graph – including complex relations, in order to get results relatively simple:
SELECT MVP_1.MVP_Name ,MVP_2.MVP_Name ,T.Title ,T.Topic_ID FROM MVP_DataPlatform AS MVP_1 ,MVP_DataPlatform AS MVP_2 ,Replied ,MVP_Topics AS T WHERE MATCH(MVP_1-(Replied)->MVP_2)
Sure, there are many other ways to do the same steps and also many other ways how you want your data to be represented in a graph.
Happy T-SQL Querying! 🙂
[…] Tomaz Kastrun uses a set of e-mails as his SQL Server 2017 graph table data source: […]
LikeLike
[…] Graph Database with T-SQL | TomazTsql https://tomaztsql.wordpress.com/2017/06/28/graph-database-with-t-sql […]
LikeLike
[…] Graph Database with T-SQL | TomazTsql https://tomaztsql.wordpress.com/2017/06/28/graph-database-with-t-sql […]
LikeLike
[…] Graph Database with T-SQL | TomazTsql tomaztsql.wordpress.com/2017/06/28/graph-database-with-t-sql […]
LikeLike