SQL Saturday statistics – Web Scraping with R and SQL Server

I wanted to answer a simple question: how many times has a particular topic been presented, and by how many different presenters?

Sounds interesting, and tackling it should not be a problem; just note that the final numbers may vary, since some text analysis is involved.

First of all, some web scraping to get the information from the SQLSaturday web page. With the R/Python integration in SQL Server, reading the information from the website is a fairly straightforward task:

EXEC sp_execute_external_script
 @language = N'R'
 ,@script = N'
 library(rvest)
 library(XML)
 library(dplyr)

#URL to schedule
 url_schedule <- ''http://www.sqlsaturday.com/687/Sessions/Schedule.aspx''

#Read HTML
 webpage <- read_html(url_schedule)

# Event schedule
 schedule_info <- html_nodes(webpage, ''.session-schedule-cell-info'') # OK

# Extracting HTML content
 ht <- html_text(schedule_info)

df <- data.frame(data = ht, stringsAsFactors = FALSE)

#create empty DF
 df_res <- data.frame(title=c(), speaker=c())

# titles sit on odd rows, speakers on even rows: walk the rows in steps of two
 for (i in seq(1, nrow(df), by = 2)){
  df_res <- rbind(df_res, data.frame(title = df$data[i], speaker = df$data[i + 1]))
 }
OutputDataSet <- df_res'

Python offers the BeautifulSoup library, which does pretty much the same (or even better) job as the rvest and XML packages combined. Nevertheless, once we have the data out of a test page (in this case I am reading the Slovenian SQLSaturday 2017 schedule, simply because it is awesome), we can “walk through” the whole web page and generate all the needed information.

The SQLSaturday website has every event enumerated, which makes it very easy to parametrize the web scraping process:

[Screenshot: SQLSaturday #687 Slovenia 2017 – Sessions – Schedule page]

So we will scrape through the last 100 events by simply incrementing the event number; the input parameter will be parsed into URLs such as:

http://www.sqlsaturday.com/600/Sessions/Schedule.aspx

http://www.sqlsaturday.com/601/Sessions/Schedule.aspx

http://www.sqlsaturday.com/602/Sessions/Schedule.aspx

and so on, regardless of whether the page for a given event actually works. Results will be returned to the SQL Server database.

Creating a stored procedure will do the job:

USE SqlSaturday;
GO

CREATE OR ALTER PROCEDURE GetSessions
 @eventID SMALLINT
AS

DECLARE @URL VARCHAR(500)
SET @URL = 'http://www.sqlsaturday.com/' +CAST(@eventID AS NVARCHAR(5)) + '/Sessions/Schedule.aspx'

PRINT @URL

DECLARE @TEMP TABLE
(
 SqlSatTitle NVARCHAR(500)
 ,SQLSatSpeaker NVARCHAR(200)
)

DECLARE @RCODE NVARCHAR(MAX)
SET @RCODE = N' 
 library(rvest)
 library(XML)
 library(dplyr)
 library(httr)
 library(curl)
 library(selectr)
 
 #URL to schedule
 url_schedule <- "'
 
DECLARE @RCODE2 NVARCHAR(MAX) 
SET @RCODE2 = N'"
 #Read HTML
 webpage <- html_session(url_schedule) %>%
 read_html()

# Event schedule
 schedule_info <- html_nodes(webpage, ''.session-schedule-cell-info'') # OK

# Extracting HTML content
 ht <- html_text(schedule_info)

df <- data.frame(data = ht, stringsAsFactors = FALSE)

#create empty DF
 df_res <- data.frame(title=c(), speaker=c())

# titles sit on odd rows, speakers on even rows: walk the rows in steps of two
 for (i in seq(1, nrow(df), by = 2)){
  df_res <- rbind(df_res, data.frame(title = df$data[i], speaker = df$data[i + 1]))
 }
 OutputDataSet <- df_res ';

DECLARE @FINAL_RCODE NVARCHAR(MAX)
SET @FINAL_RCODE = CONCAT(@RCODE, @URL, @RCODE2)

INSERT INTO @Temp
EXEC sp_execute_external_script
 @language = N'R'
 ,@script = @FINAL_RCODE


INSERT INTO SQLSatSessions (sqlSat,SqlSatTitle,SQLSatSpeaker)
SELECT 
 @EventID AS sqlsat
 ,SqlSatTitle
 ,SqlSatSpeaker
FROM @Temp

 

Before you run this, just a little environment setup:

USE [master];
GO

CREATE DATABASE SQLSaturday;
GO

USE SQLSaturday;
GO

CREATE TABLE SQLSatSessions
(
 id SMALLINT IDENTITY(1,1) NOT NULL
,SqlSat SMALLINT NOT NULL
,SqlSatTitle NVARCHAR(500) NOT NULL
,SQLSatSpeaker NVARCHAR(200) NOT NULL
)

 

There you go! Now you can run the stored procedure for a particular event (in this case, SQLSaturday Slovenia 2017):

EXECUTE GetSessions @eventID = 687

or you can run this procedure against multiple SQLSaturday events and web scrape data from the SQLSaturday.com website in one go, as sketched below.
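For instance, a minimal sketch of such a loop (my addition, not from the original post): it walks event numbers 600 to 690 and wraps each call in TRY/CATCH, so events whose schedule page fails to load are simply skipped:

DECLARE @event SMALLINT = 600;
WHILE @event <= 690
BEGIN
 BEGIN TRY
  -- harvest one event's schedule into SQLSatSessions
  EXECUTE GetSessions @eventID = @event;
 END TRY
 BEGIN CATCH
  -- page missing or malformed: log and move on
  PRINT CONCAT('Skipping event ', @event);
 END CATCH
 SET @event = @event + 1;
END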

For the Slovenian SQLSaturday, I get the following list of sessions and speakers:

[Screenshot: sessions and speakers returned for SQLSaturday #687]

Please note: if you are running this code behind a firewall or proxy, some additional proxy or firewall configuration might be needed!

Returning to the original question (how many times has Query Store been presented at SQL Saturdays, from SqlSat600 to SqlSat690), here is the frequency table:

[Screenshot: frequency table of web scraping results]
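The frequency table itself can be reproduced with a simple aggregate over the harvested sessions. A sketch (my own, not from the original post; the LIKE filter is only a rough stand-in for the text analysis mentioned at the beginning):

SELECT
 COUNT(*) AS TimesPresented
 ,COUNT(DISTINCT SQLSatSpeaker) AS DistinctSpeakers
FROM SQLSatSessions
WHERE SqlSat BETWEEN 600 AND 690
 AND SqlSatTitle LIKE '%Query Store%';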

Or, presented as a pandas graph:

[Graph: session statistics]

Query Store is popular, ahead of all the R, Python, and Azure ML topics, but PowerShell is gaining popularity like crazy. Good work, PowerShell people! 🙂

UPDATE #1: More statistics. In general, a PowerShell session is presented at every second SQLSaturday and Query Store at every third, whereas there are at least 2 Azure-related topics at every SQLSat event (for events ranging from SqlSat600 to SqlSat690).
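These rates can be derived along the following lines (again a sketch of mine, with LIKE filters standing in for proper text analysis):

SELECT
 COUNT(DISTINCT CASE WHEN SqlSatTitle LIKE '%PowerShell%' THEN SqlSat END) * 1.0
  / COUNT(DISTINCT SqlSat) AS PowerShellEventShare -- ~0.5 means every second event
 ,COUNT(DISTINCT CASE WHEN SqlSatTitle LIKE '%Query Store%' THEN SqlSat END) * 1.0
  / COUNT(DISTINCT SqlSat) AS QueryStoreEventShare -- ~0.33 means every third event
 ,COUNT(CASE WHEN SqlSatTitle LIKE '%Azure%' THEN 1 END) * 1.0
  / COUNT(DISTINCT SqlSat) AS AzureSessionsPerEvent -- average Azure sessions per event
FROM SQLSatSessions
WHERE SqlSat BETWEEN 600 AND 690;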

As always, code is available at Github.

 


Graph Database with T-SQL

SQL Server 2017 brings users the graph database. A graph database is a type of database that uses graph structures to store data: edges (or relationships) and nodes (or vertices). Relationships between nodes run through edges, and features (also called properties) hold the values that represent the strength or direction of an edge. Many of these links can be retrieved easily with a single operation (in comparison to T-SQL, where such queries can quickly become complex).

The links themselves are also stored in the database, and when querying the graph database, a join clause is used to collect the related data. Hierarchical structures and complex social structures in particular can be stored and queried far more easily than in a “relational” database.

Network databases and graph databases are similar, although a network database might have problems when dealing with longer chains of edges.

When to use a graph/network database should be obvious: when you have hierarchical data, or when your data is so complex, with many many-to-many relations, that it would make a relational database hard to query.

[Figure: a small labeled undirected network]

 

To put the graph database to the test, I took a bunch of emails from a particular MVP SQL Server distribution list (the content will not be shown and all names are anonymized). From my Gmail account, I downloaded some 90 MiB of emails in mbox file format. With some Python scripting, only the From and Subject fields were extracted:

import csv
import mailbox
from email.header import decode_header

infile = 'mails.mbox'  # file names are assumed for this sketch
with open('mails.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['from', 'subject'])
    for message in mailbox.mbox(infile):
        row = [
            message['from'].strip('>').split('<')[-1],   # keep only the e-mail address
            decode_header(message['subject'])[0][0],
            '|'                                          # trailing field delimiter
        ]
        writer.writerow(row)

Up to this point, I have extracted a CSV file with only the From and Subject fields, looking like this:

[Screenshot: extracted CSV with From and Subject fields]
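The import step itself can be done in several ways; one minimal sketch using BULK INSERT (my addition; the staging table name and column come from the query below, the file path is hypothetical, and the whole line is loaded into a single column to be split later):

CREATE TABLE dbo.clean_mail5_2 (froms NVARCHAR(1000));

BULK INSERT dbo.clean_mail5_2
FROM 'C:\temp\mails.csv'          -- hypothetical path to the extracted CSV
WITH (FIRSTROW = 2, ROWTERMINATOR = '\n');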

After importing the CSV file into the SQL Server database, I started to prepare my graph database structures:

CREATE TABLE MVP_DataPlatform
(
 MVP_ID INT IDENTITY(1,1) NOT NULL
 ,MVP_Name VARCHAR(100) NULL
 ,MVP_Email VARCHAR(100) NOT NULL
) AS NODE
GO

CREATE TABLE MVP_Topics
(
 Topic_ID INT IDENTITY(1,1) NOT NULL
 ,Title NVARCHAR(250) NOT NULL
) AS NODE
GO

CREATE TABLE Initiated AS EDGE;
CREATE TABLE Replied AS EDGE;

And prepare the data with:

with mails 
as
(
SELECT 
 froms 
 ,REPLACE(SUBSTRING(froms, 1, CASE CHARINDEX(';', froms) 
  WHEN 0 THEN LEN(froms) ELSE CHARINDEX(';',froms)-1 END),',','') AS from_MVP
 ,REPLACE(SUBSTRING(froms, CHARINDEX(';',froms) + 1, 1000),',|','') AS title

FROM [dbo].[clean_mail5_2] 
)
SELECT 
 from_MVP
 ,SUBSTRING(title,2, LEN(title)) as title
 ,CASE WHEN (
 SUBSTRING(title,1, 10) LIKE '%RE%' 
 OR SUBSTRING(title,1, 10) LIKE '%AW%'
 OR SUBSTRING(title,1, 10) LIKE '%FWD%'
 OR SUBSTRING(title,1, 10) LIKE '%OT%'
 ) THEN 0 ELSE 1 END MVP_start
INTO MVP_Mails
FROM mails
 WHERE
 from_MVP like '%@%'

Populating the nodes (vertices) should be an easy task:

INSERT INTO MVP_DataPlatform (MVP_Name, MVP_Email)
SELECT 
 SUBSTRING(from_MVP, 1, CHARINDEX('@',from_MVP)-1) AS MVP_NAME
 ,from_MVP AS MVP_Email
FROM MVP_Mails
GROUP BY 
 SUBSTRING(from_MVP, 1, CHARINDEX('@',from_MVP)-1)
 ,from_MVP;
GO

INSERT INTO MVP_Topics (Title)
SELECT 
 Title
FROM MVP_Mails
GROUP BY 
 title;
GO

But populating the edges (relationships) depends on how you want to define your graph.

I have created two edges: Replied and Initiated. For Replied, the logic is the following:

  1. AB sends a mail to the distribution list = AB -> AB
  2. CD replies to AB = CD -> AB
  3. EF replies to CD (and AB) = EF -> CD
  4. and so on.

For Initiated, the logic is slightly different:

  1. AB sends a mail to the distribution list = AB -> AB
  2. CD replies to AB = CD -> AB
  3. EF replies to CD (and AB) = EF -> AB
  4. and so on…

So the difference is in how the information is stored at step 3, when EF replies. This rule is totally arbitrary and depends on your business question, hierarchy, or many-to-many rules. I could even decide to store the information at step 3 normalized, as ((EF -> CD) AND (EF -> AB)).

To store this set of information in the table of edges, the following transformation is needed:

CREATE OR ALTER PROCEDURE INSERT_EDGE
AS
DECLARE @id1 INT;
DECLARE @id2 INT;
DECLARE @i INT;
DECLARE @hid INT;
DECLARE @START INT;

select 
 mail.from_MVP as mail_ID
,pl.MVP_id AS pl_ID
,mail.title
,mail.MVP_Start
,row_number() OVER (partition by title ORDER BY title, MVP_Start DESC) as rn_title
,row_number() Over (order by (select 1)) as rn_global
into #temp
from MVP_Mails as mail
JOIN MVP_DataPlatform AS Pl
ON mail.from_mvp = pl.mvp_email

SELECT @i = MAX(rn_global) FROM #TEMP

SET @hid = 1;
WHILE @hid <= @i
BEGIN 
 -- current row: did this mail start a thread, and who sent it?
 SELECT @start = MVP_start FROM #temp WHERE rn_global = @hid
 PRINT @start
 SELECT @id1 = pl_id FROM #temp WHERE rn_global = @hid
 PRINT @id1
 -- when the mail starts a thread (@start = 1), pair it with the sender of the
 -- next row; otherwise the edge points back to itself
 IF @start = 1
  SELECT @id2 = pl_id FROM #temp WHERE rn_global = @hid + 1
 IF @start = 0
  SET @id2 = @id1
 PRINT @id2

INSERT INTO replied VALUES ((SELECT $NODE_ID FROM MVP_DataPlatform WHERE MVP_ID = @id1), (SELECT $NODE_ID FROM MVP_DataPlatform WHERE MVP_ID = @id2));
 SET @hid = @hid + 1;
END
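A similar pass could populate the Initiated edge. The only change to the pairing rule is that, instead of pointing each reply at the previous mail, the target stays pinned to the row that started the thread (MVP_start = 1). A hypothetical sketch of mine (not from the original post), meant to run inside the same procedure so it can reuse #temp, and assuming the rows arrive in thread order:

DECLARE @row INT = 1, @last INT, @is_start INT, @cur_id INT, @starter_id INT;
SELECT @last = MAX(rn_global) FROM #temp;

WHILE @row <= @last
BEGIN
 SELECT @is_start = MVP_start, @cur_id = pl_id FROM #temp WHERE rn_global = @row;
 IF @is_start = 1
  SET @starter_id = @cur_id; -- remember who opened the thread

 INSERT INTO Initiated
 VALUES ((SELECT $NODE_ID FROM MVP_DataPlatform WHERE MVP_ID = @cur_id),
         (SELECT $NODE_ID FROM MVP_DataPlatform WHERE MVP_ID = @starter_id));
 SET @row = @row + 1;
END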

Executing the INSERT_EDGE procedure:

EXECUTE INSERT_EDGE;

you should get a table of relations:

relations

And now you can perform queries against the graph, retrieving even complex relations with relatively simple syntax:

-- the Replied edge carries no topic reference, so MVP_Topics is left out here
SELECT 
 MVP_1.MVP_Name AS Replier
 ,MVP_2.MVP_Name AS RepliedTo
FROM 
 MVP_DataPlatform AS MVP_1
 ,Replied
 ,MVP_DataPlatform AS MVP_2
WHERE 
 MATCH(MVP_1-(Replied)->MVP_2)
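MATCH also combines with ordinary aggregation. For example, a small sketch of mine counting who replies most often, assuming the Replied edge populated above:

SELECT
 MVP_1.MVP_Name AS Replier
 ,COUNT(*) AS RepliesSent
FROM
 MVP_DataPlatform AS MVP_1
 ,Replied
 ,MVP_DataPlatform AS MVP_2
WHERE
 MATCH(MVP_1-(Replied)->MVP_2)
GROUP BY MVP_1.MVP_Name
ORDER BY RepliesSent DESC;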

 

Sure, there are many other ways to do the same steps, and many other ways you might want your data represented in a graph.

Happy T-SQL Querying! 🙂

 

R and SQL Server articles

In the past couple of months, I have prepared several articles on R and SQL Server that have been published on SQL Server Central.

The idea was to have a series of articles covering everything from an introduction to R, to the basics of R Server, to practical cases of using R with SQL Server.

1) Using Microsoft R in Enterprise Environments

The article covers the concepts behind Microsoft R Server, where and how to start with Microsoft R in an enterprise environment, and gives answers to the most common concerns people have when introducing the R language into a corporation.


Link to article: http://www.sqlservercentral.com/articles/R+Language/140422/

 

2) Introduction to Microsoft R Services in SQL Server 2016

The integration and architecture of Microsoft R Services is the main focus of this article. It outlines the different flavors of R (Open, Client, Server, Services, Hadoop, etc.), how to deal with installation, and gives a basic overview and explanation of the extended stored procedure SP_EXECUTE_EXTERNAL_SCRIPT.


Link to article: http://www.sqlservercentral.com/articles/Microsoft/145393/

 

3) Installing R packages in SQL Server R Services

Expand the functionality of R by adding new packages. This article covers the many ways to install and add packages to your R environment.


Link to article: http://www.sqlservercentral.com/articles/R+Package/145571/

 

4) Using SQL Server and R Services for analyzing Sales data

Use cases for analyzing sales data are the focus of this article, with the goal of showing readers how to operationalize R code and bring it into use in any enterprise environment, small or big.


Link to article: http://www.sqlservercentral.com/articles/R+Services/145649/

 

5) Using Power BI and SSRS for visualizing SQL Server and R data

Visualizing the data is an important aspect of understanding data insights in any use case. The article covers Power BI and SSRS visualization and how to embed R code in both tools.


Link to article: http://www.sqlservercentral.com/articles/R+Language/151358/

6) Using SQL Server and R Services for analyzing DBA Tasks

Broadening the use of Microsoft R for DBA tasks was the main goal of this article. Using a simulation of disk usage, it shows an R example of how to switch from monitoring disk usage to predicting it. Clustering executed queries to narrow down performance issues and visualizing Query Store information with a heatmap are also introduced in the article.


Link to article: http://www.sqlservercentral.com/articles/R+Language/151405/

 

More articles will follow, so stick around.

Happy R-SQLing!

 

 

R graphs and tables in Power BI Desktop

Power BI Desktop enables users to use the R script visual for adding custom visualizations generated with the R language, regardless of the R package used. Before using the R script visual, you will need to enable it by setting the path to the R engine on your client in the global options. Once this is done, you will be able to enhance your Power BI reports with R visualizations.


My personal favorite R packages to use in Power BI are shiny, plotly, ggplot2, googleVis, ggiraph, rCharts and d3js. Apart from finding the visualization package where you feel at home, one of the great things I find in Power BI + R is the ability to have visualizations react dynamically to slicers and selectors (more or less as if you were using plotly or shiny). This makes data exploration and data wrangling so much easier.

With the October 2016 release, Power BI custom visuals got 7 additional R visuals added to the family (check the Power BI Visuals Gallery and fall in love with your favorite one 🙂 ).

[Screenshot: Power BI Visuals Gallery]

Upon using any of the additional R visuals, Power BI will, prior to importing the visualization, ask to install any missing packages, and will also inform you of any incompatibility between the R engine and a package.

So how easy is it to create an R visualization in Power BI? Let us first import data into Power BI from SQL Server. I have generated a sample dataset for this demo. The data and all the samples are available at Github.

[Screenshot: preparing the data in SQL Server]

With the sample dataset, I have created a simple data visualization with slicers using Power BI.

[Screenshot: Power BI visualization with slicers]

Now, the same visualization using R.

[Screenshot: the same visualization, rendered with R]

Note that I have used the same slicers (Year = 2016, Month = 2), and the R graph (using ggplot2) was able to visualize the numbers dynamically. This means that any R-visualized graph can be navigated and dynamically updated, which is absolutely great, especially when doing data exploration. The following R code was used:

library(ggplot2)
ggplot(data = dataset, aes(x = Day, y = Value)) +
  geom_line(color = "red") +   # a literal colour, not an aesthetic mapping
  ggtitle("Value by day") +
  theme(panel.background = element_rect(fill = "white", color = "lightgray", size = 1))

with the conversion of dataset to a data.frame done automatically.

[Screenshot: the R script editor in Power BI Desktop]

Where it gets tricky is when you want to return an R data.frame to your Power BI page in the form of a table (imagine you want some results next to your graph, or you are calculating some statistics). Power BI does not allow returning tables from the R script visual. Unless, that is, the table is an image (in the “form” of a visualized “graph”).

So I managed to create a table as a graph using R and the qplot function from the ggplot2 package.

[Screenshot: R table rendered as a visual in Power BI]

The best part of this table is that it changes dynamically based on selections in the graphs or slicers. The downside is that it cannot be scrolled within the object, so if you end up with a bigger table, you have to tweak the font size or show the table aggregated or grouped. In my opinion, I would not swap a native Power BI table for an R-created one, because the Power BI table is simply better and easier to create and maintain. But I would definitely recommend using R tables to support your R visuals. For example: if you are generating clusters with R in Power BI, use an R-generated table to expose additional statistics on the clusters.

The code for generating this R table is:

library(gridExtra)
library(ggplot2)

qplot(x=1:5, y=1:5, geom = "blank") + 
theme(line = element_blank(), text = element_blank()) + 
annotation_custom(grob=tableGrob(dataset))

The script generates a blank plot and then adds a custom annotation on top of it: the grob argument of annotation_custom takes the table grob built by tableGrob from our data.frame (dataset), which is then displayed over the empty graph. In addition, to “remove” the background of the blank graph, blank line and text elements are added:

theme(line = element_blank(), text = element_blank())

 

Update #1 (19.12.2016)

After reading the comment on this blog post from reader Baptiste Auguie (with a link to his post on displaying tables as grid graphics), it turns out that using grid returns the same result with a much simpler way of plotting/drawing a table. Thank you, Baptiste! I must have missed this one.

The R code is far simpler without the qplot function.

library(gridExtra)
grid.table(dataset)

And the screenshot from Power BI:

[Screenshot: grid.table output in Power BI]

As always, the sample code and dataset are available at Github.

Happy PowerBI-Rrrring!

#SQLSatDenmark 2016 wrap up

SQLSatDenmark 2016 took place in Lyngby at the Microsoft Denmark headquarters. Apart from the fact that Lyngby is an absolutely cute town, the Microsoft HQ is nice as well.

At Microsoft Denmark: [Photo]

Lyngby: [Photo]

On the evening before the precon day, dinner at the MASH restaurant was arranged for all the precon speakers (Tim Chapman, Andre Kamman, Kevin Kline and myself) with hosts Regis Baccaro and Kenneth M. Nielsen.

[Photo: speakers' dinner at MASH]

After delicious steaks, pints of beer and interesting conversations, the precon day started.

My precon room was full, with 30 attendees, lots of material, and demos from the field on R and SQL Server integration to finish with. The feedback was great, obviously. The problem I had was that I prepared too much material and focused too much on statistics; all the code was handed over anyway, so people could learn more once back at home. We finished at 16:30, and I was available at the Microsoft HQ until 17:30 for any other questions.

The next day, SQLSaturday Denmark started early, and in total 280 attendees showed up. It was an easy-going and well-organized event: great sponsors, nice swag, a raffle, and all the good stuff one can find at such an event, including a juice bar and good food, ending with a hot dog stand and SQL Beer. Yes, the traditional Danish SQLSaturday beer 🙂

[Photo: SQLSaturday Denmark]

I delivered a session on machine learning algorithms in Microsoft Azure, explaining which algorithm to use with which dataset and what kind of statistical problem each can solve. Great feedback from the crowd and very interesting questions, mostly deeply statistical and data mining questions. And I truly loved it.

Thanks to all the sponsors, organizers, attendees and the SQL family. It was great to see you.