SQL Saturday statistics – Web Scraping with R and SQL Server

I wanted to answer a simple question: how many times has a particular topic been presented, and by how many different presenters?

Sounds interesting, and tackling it should not be a problem; just note that the final numbers may vary slightly, since some text analysis is involved.

First of all, some web scraping to get the information from the SQLSaturday web page. With the R/Python integration in SQL Server, reading the information from the website is a fairly straightforward task:

EXEC sp_execute_external_script
     @language = N'R'
    ,@script = N'
    library(rvest)
    library(XML)
    library(dplyr)

    # URL to the schedule
    url_schedule <- ''http://www.sqlsaturday.com/687/Sessions/Schedule.aspx''

    # Read the HTML
    webpage <- read_html(url_schedule)

    # Event schedule cells
    schedule_info <- html_nodes(webpage, ''.session-schedule-cell-info'')

    # Extract the text content
    ht <- html_text(schedule_info)
    df <- data.frame(data = ht, stringsAsFactors = FALSE)

    # Cells alternate title / speaker, so pair each odd row with the even row after it
    df_res <- data.frame(title   = df$data[seq(1, nrow(df), 2)],
                         speaker = df$data[seq(2, nrow(df), 2)])

    OutputDataSet <- df_res'

Python offers the BeautifulSoup library, which does pretty much the same (or an even better) job as the rvest and XML packages combined. Nevertheless, once we have the data from a test page (in this case I am reading the Slovenian SQLSaturday 2017 schedule, simply because it is awesome), we can “walk through” the whole web page and generate all the needed information.
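As a rough illustration of the same idea in Python – using only the standard library's html.parser so the sketch stays dependency-free (BeautifulSoup would be shorter) – the cell texts can be pulled out and paired up like this; the sample HTML and names are invented for the example:

```python
from html.parser import HTMLParser

class SessionCellParser(HTMLParser):
    """Collect the text of every element carrying the session-schedule-cell-info class."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if "session-schedule-cell-info" in dict(attrs).get("class", "").split():
            self.in_cell = True
            self.texts.append("")

    def handle_data(self, data):
        if self.in_cell:
            self.texts[-1] += data

    def handle_endtag(self, tag):
        self.in_cell = False

# Invented sample; in reality this HTML comes from the schedule page.
sample = (
    '<div class="session-schedule-cell-info">Query Store in action</div>'
    '<div class="session-schedule-cell-info">Jane Doe</div>'
    '<div class="session-schedule-cell-info">R for DBAs</div>'
    '<div class="session-schedule-cell-info">John Smith</div>'
)
parser = SessionCellParser()
parser.feed(sample)

# Cells alternate title / speaker, so zip odd and even positions into pairs.
sessions = list(zip(parser.texts[0::2], parser.texts[1::2]))
# sessions → [('Query Store in action', 'Jane Doe'), ('R for DBAs', 'John Smith')]
```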

The SQLSaturday website enumerates every event, making it very easy to parametrize the web scraping process:

[Screenshot: SQLSaturday #687 – Slovenia 2017, Sessions → Schedule page]

So we will scrape through the last 100 events by simply incrementing the event number; the input parameter will be turned into URLs such as:

http://www.sqlsaturday.com/600/Sessions/Schedule.aspx

http://www.sqlsaturday.com/601/Sessions/Schedule.aspx

http://www.sqlsaturday.com/602/Sessions/Schedule.aspx

and so on, regardless of whether the event page is still up or not. The results will be returned to the SQL Server database.
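The URL enumeration itself is trivial; a quick Python sketch for illustration (the 600–690 range matches the statistics later in the post):

```python
# Build the schedule URL for each SQLSaturday event number in a range.
BASE = "http://www.sqlsaturday.com/{}/Sessions/Schedule.aspx"
urls = [BASE.format(event_id) for event_id in range(600, 691)]

# urls[0]  → 'http://www.sqlsaturday.com/600/Sessions/Schedule.aspx'
# urls[-1] → 'http://www.sqlsaturday.com/690/Sessions/Schedule.aspx'
```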

Creating a stored procedure will do the job:

USE SqlSaturday;
GO

CREATE OR ALTER PROCEDURE GetSessions
 @eventID SMALLINT
AS

DECLARE @URL VARCHAR(500)
SET @URL = 'http://www.sqlsaturday.com/' +CAST(@eventID AS NVARCHAR(5)) + '/Sessions/Schedule.aspx'

PRINT @URL

DECLARE @TEMP TABLE
(
 SqlSatTitle NVARCHAR(500)
 ,SQLSatSpeaker NVARCHAR(200)
)

DECLARE @RCODE NVARCHAR(MAX)
SET @RCODE = N' 
 library(rvest)
 library(XML)
 library(dplyr)
 library(httr)
 library(curl)
 library(selectr)
 
 #URL to schedule
 url_schedule <- "'
 
DECLARE @RCODE2 NVARCHAR(MAX) 
SET @RCODE2 = N'"
 #Read HTML
 webpage <- html_session(url_schedule) %>%
 read_html()

# Event schedule
 schedule_info <- html_nodes(webpage, ''.session-schedule-cell-info'') # OK

# Extracting HTML content
 ht <- html_text(schedule_info)

df <- data.frame(data = ht, stringsAsFactors = FALSE)

# Cells alternate title / speaker, so pair each odd row with the even row after it
df_res <- data.frame(title   = df$data[seq(1, nrow(df), 2)],
                     speaker = df$data[seq(2, nrow(df), 2)])

OutputDataSet <- df_res ';

DECLARE @FINAL_RCODE NVARCHAR(MAX)
SET @FINAL_RCODE = CONCAT(@RCODE, @URL, @RCODE2)

INSERT INTO @Temp
EXEC sp_execute_external_script
 @language = N'R'
 ,@script = @FINAL_RCODE


INSERT INTO SQLSatSessions (sqlSat,SqlSatTitle,SQLSatSpeaker)
SELECT 
 @EventID AS sqlsat
 ,SqlSatTitle
 ,SqlSatSpeaker
FROM @Temp

 

Before you run this, just a little environment setup:

USE [master];
GO

CREATE DATABASE SQLSaturday;
GO

USE SQLSaturday;
GO

CREATE TABLE SQLSatSessions
(
 id SMALLINT IDENTITY(1,1) NOT NULL
,SqlSat SMALLINT NOT NULL
,SqlSatTitle NVARCHAR(500) NOT NULL
,SQLSatSpeaker NVARCHAR(200) NOT NULL
)

 

There you go! Now you can run a stored procedure for a particular event (in this case SQL Saturday Slovenia 2017):

EXECUTE GetSessions @eventID = 687

or you can run the procedure against multiple SQLSaturday events and scrape the data from the SQLSaturday.com website in one go.

For Slovenian SQLSaturday, I get the following sessions and speakers list:

[Screenshot: sessions and speakers list returned for SQLSaturday #687 – Slovenia]

Please note that if you are running this code behind a firewall or proxy, some additional proxy or firewall configuration might be needed!
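If you scrape from the Python side behind a proxy, one standard-library option is urllib's ProxyHandler (the proxy address below is a placeholder, not a real host); from R, the httr/curl stack picks up the usual http_proxy / https_proxy environment variables:

```python
import urllib.request

# Placeholder proxy address - substitute your environment's proxy.
proxy = urllib.request.ProxyHandler({"http": "http://proxy.example.local:8080"})
opener = urllib.request.build_opener(proxy)
urllib.request.install_opener(opener)  # later urlopen() calls now go via the proxy
```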

So, going back to the original question – how many times has Query Store been presented at SQLSaturdays (from SqlSat600 to SqlSat690)? Here is the frequency table:

[Screenshot: frequency table of topic presentations across SqlSat600–SqlSat690]

Or presented with pandas graph:

[Graph: session topic statistics]
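The frequency counting itself needs nothing fancy; a minimal Python sketch over invented sample rows (the real input would be the SQLSatSessions table populated above):

```python
from collections import Counter

# Invented sample of (event, session title) rows, for illustration only.
sessions = [
    (600, "Query Store: the flight recorder for your database"),
    (601, "Introduction to PowerShell for DBAs"),
    (601, "Query Store deep dive"),
    (602, "Getting started with Azure SQL Database"),
]

# Count how many session titles mention each topic (case-insensitive substring match).
topics = ["query store", "powershell", "azure"]
freq = Counter()
for _, title in sessions:
    lowered = title.lower()
    for topic in topics:
        if topic in lowered:
            freq[topic] += 1

# freq → Counter({'query store': 2, 'powershell': 1, 'azure': 1})
```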

Query Store is popular beyond all the R, Python or Azure ML topics, but PowerShell is gaining popularity like crazy. Good work, PowerShell people! 🙂

UPDATE #1: More statistics: in general, a PowerShell session is presented at every second SQLSaturday and Query Store at every third, whereas there are at least two Azure-related topics at every SQLSat event (relevant for events SqlSat600 through SqlSat690).

As always, the code is available on GitHub.

 


SQL Saturday Vienna 2017 #sqlsatVienna

SQL Saturday Vienna 2017 is just around the corner. On Friday, January 20, 2017, local and international speakers will gather to deliver sessions on SQL Server and all related services. With a great agenda – available here – attendees will surely enjoy the variety of topics and have the opportunity to talk to the Austrian PASS and SQL community, the speakers, and SQL Server MVPs.

[Screenshot: SQLSaturday #579 – Vienna 2017 event home page]

 

My session at SQL Sat Vienna 2017 will focus on what database administrators can gain from R integration with SQL Server 2016. We will look at how statistics from the main DBA tasks can be gathered, stored and later analyzed for better prediction, for uncovering patterns in baselines that might otherwise be overlooked, and of course at how to play with information gathered from Query Store and DMV query plans. The session will also include field examples applicable to any enterprise.

This year, I will have the pleasure of delivering a pre-con on Thursday, January 19, 2017 at the JUFA Hotel in Vienna: a full-day workshop on SQL Server and R integration with all the major topics covered – where and how to start using R, how the R integration works, a deep dive into R packages for high-performance work, and a dive into statistics, from uni-variate to multivariate, as well as methods for data mining and machine learning. Everybody is welcome to join; it will be a great day for a workshop! 🙂

 

[Screenshot: Eventbrite page – BI and Analytics with SQL Server and R, Tomaz Kastrun, Thu, 19 Jan 2017]

Tickets are available here via Eventbrite.

 

Falco is already playing Vienna Calling   🙂

 

 

#SQLSatDenmark 2016 wrap up

SQLSatDenmark 2016 took place in Lyngby at the Microsoft Denmark headquarters. Apart from the fact that Lyngby is an absolutely cute town, the Microsoft HQ is nice as well.

At Microsoft Denmark: [photo]

Lyngby: [photo]

On the evening before the precon day, dinner at the MASH restaurant was arranged for all the precon speakers – Tim Chapman, Andre Kamman, Kevin Kline and myself – with hosts Regis Baccaro and Kenneth M. Nielsen.


After delicious steaks, pints of beer and interesting conversations, the precon day started.

My precon room was full – I had 30 attendees, lots of material, and we finished with demos of R and SQL Server integration from the field. The feedback was great, obviously. The problem I had was that I had prepared too much material (all the code was handed over anyway, so people could learn more back at home) and that I focused too much on statistics. But I finished at 16:30 and was available in the Microsoft HQ until 17:30 for any further questions.

The next day, SQLSaturday Denmark started early, and in total 280 attendees showed up. It was an easygoing and well-organized event: great sponsors, nice swag, a raffle, and all the good stuff one can find at such an event – a juice bar, good food, and ending with a hot dog stand and SQL Beer. Yes, the traditional Danish SQLSaturday beer 🙂


I delivered a session on machine learning algorithms in Microsoft Azure, explaining which algorithm to use with which dataset and what kind of statistical problem each can solve. Great feedback from the crowd and very interesting questions – mostly statistical and data mining questions. And I truly loved it.

Thanks to all the sponsors, organizers, attendees and the SQL family. It was great to see you.

 

Speaking at SQL Saturday Lisbon 2015

So excited to be speaking at SQL Saturday Lisbon this year. I am sure it will be a great event in a beautiful city, with lovely people and great food (not to mention my favorite – coffee).

My session will be about customer segmentation using SSAS and SQL Server – one of those things people usually like to talk about, but nobody actually does. This hands-on session will explore all the steps, and the statistics and mathematics behind them.

[Image: SQLSaturday Lisbon]

Everybody is welcome to participate, especially listeners who will come with a great Portuguese coffee 🙂

See you!

SQLSaturday 367 – Pordenone, Italy

I will be speaking at SQLSaturday #367 in Pordenone, Italy on February 28th, 2015.

My session will be focused on using R (http://www.r-project.org/) in SQL Server environment for purposes of statistical analysis and purposes of data cleaning, data importing and data exploring.

Using R with SQL Server data will help data scientists and data analysts prepare, explore and validate data much more easily, as well as use a wide range of statistics, from uni-variate to multivariate. The session will focus mainly on:

1) connecting the R language with SQL Server using standard ODBC connectors and T-SQL procedures;
2) validating data using classical statistical methods on SQL transactional data;
3) using R output in SSRS to bring extra information to reports.

Session Level: Intermediate