SQL Saturday statistics – Web Scraping with R and SQL Server

I wanted to answer a simple question: how many times has a particular topic been presented, and by how many different presenters?

Sounds interesting, and tackling the problem should not be a problem; it is just that the final numbers may vary, since some text analysis will be involved.

First of all, some web scraping to get the information from the SQLSaturday web page. With the R/Python integration in SQL Server, reading the information from the website is a fairly straightforward task:

EXEC sp_execute_external_script
 @language = N'R'
 ,@script = N'
 library(rvest)
 library(XML)
 library(dplyr)

#URL to schedule
 url_schedule <- ''http://www.sqlsaturday.com/687/Sessions/Schedule.aspx''

#Read HTML
 webpage <- read_html(url_schedule)

# Event schedule
 schedule_info <- html_nodes(webpage, ''.session-schedule-cell-info'') # OK

# Extracting HTML content
 ht <- html_text(schedule_info)

df <- data.frame(data=ht)

#create empty DF
 df_res <- data.frame(title=c(), speaker=c())

for (i in 1:nrow(df)){
   if (i %% 2 != 0) { # odd rows hold the session title
       print(paste0("title is: ", df$data[i]))
   }
   if (i %% 2 == 0) { # even rows hold the speaker
       print(paste0("speaker is: ", df$data[i]))
   }
   df_res <- rbind(df_res, data.frame(title=df$data[i], speaker=df$data[i+1]))
 }

# keep every second row; these are the correctly paired (title, speaker) rows
df_res_new <- df_res[seq(1, nrow(df_res), 2), ]
OutputDataSet <- df_res_new'

Python offers the BeautifulSoup library, which will do pretty much the same (or even better) job as the rvest and XML packages combined. Nevertheless, once we have the data from a test page (in this case I am reading the Slovenian SQLSaturday 2017 schedule, simply because it is awesome), we can "walk through" the whole web page and generate all the needed information.
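For comparison, here is a minimal sketch of the same extraction with requests and BeautifulSoup (the CSS class is the one used in the R script above; this is an illustration, not the code used in the post):

import requests
from bs4 import BeautifulSoup

# same schedule page as in the R example above
url_schedule = "http://www.sqlsaturday.com/687/Sessions/Schedule.aspx"

page = requests.get(url_schedule)
soup = BeautifulSoup(page.text, "html.parser")

# same CSS class as used with rvest's html_nodes()
cells = soup.select(".session-schedule-cell-info")
texts = [cell.get_text(strip=True) for cell in cells]

# odd entries are titles, even entries are speakers (same logic as the R loop)
sessions = list(zip(texts[0::2], texts[1::2]))
print(sessions[:5])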

The SQLSaturday website has every event enumerated, making it very easy to parametrize the web scraping process:

[Screenshot: SQLSaturday #687 - Slovenia 2017, Sessions / Schedule page]

So we will scrape through the last 100 events by simply incrementing the event number; the input parameter will be expanded into URLs such as:

http://www.sqlsaturday.com/600/Sessions/Schedule.aspx

http://www.sqlsaturday.com/601/Sessions/Schedule.aspx

http://www.sqlsaturday.com/602/Sessions/Schedule.aspx

and so on, regardless of whether the event page actually exists or responds. The results will be returned to the SQL Server database.
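Since not every event number is guaranteed to have a working schedule page, one simple way to walk over the whole range is to build the URLs and skip the pages that do not respond. A minimal Python sketch of that idea (requests is my assumption here; the scraping in the post stays in R):

import requests

base_url = "http://www.sqlsaturday.com/{0}/Sessions/Schedule.aspx"

# events 600 to 690, skipping pages that do not respond
for event_id in range(600, 691):
    url = base_url.format(event_id)
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        continue                      # unreachable page, move on
    if response.status_code == 200:
        print("scraping:", url)       # hand the URL over to the scraper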

Creating a stored procedure will do the job:

USE SqlSaturday;
GO

CREATE OR ALTER PROCEDURE GetSessions
 @eventID SMALLINT
AS

DECLARE @URL VARCHAR(500)
SET @URL = 'http://www.sqlsaturday.com/' +CAST(@eventID AS NVARCHAR(5)) + '/Sessions/Schedule.aspx'

PRINT @URL

DECLARE @TEMP TABLE
(
 SqlSatTitle NVARCHAR(500)
 ,SQLSatSpeaker NVARCHAR(200)
)

DECLARE @RCODE NVARCHAR(MAX)
SET @RCODE = N' 
 library(rvest)
 library(XML)
 library(dplyr)
 library(httr)
 library(curl)
 library(selectr)
 
 #URL to schedule
 url_schedule <- "'
 
DECLARE @RCODE2 NVARCHAR(MAX) 
SET @RCODE2 = N'"
 #Read HTML
 webpage <- html_session(url_schedule) %>%
 read_html()

# Event schedule
 schedule_info <- html_nodes(webpage, ''.session-schedule-cell-info'') # OK

# Extracting HTML content
 ht <- html_text(schedule_info)

df <- data.frame(data=ht)

#create empty DF
 df_res <- data.frame(title=c(), speaker=c())

for (i in 1:nrow(df)){
   if (i %% 2 != 0) { # odd rows hold the session title
       print(paste0("title is: ", df$data[i]))
   }
   if (i %% 2 == 0) { # even rows hold the speaker
       print(paste0("speaker is: ", df$data[i]))
   }
   df_res <- rbind(df_res, data.frame(title=df$data[i], speaker=df$data[i+1]))
 }

# keep every second row; these are the correctly paired (title, speaker) rows
df_res_new <- df_res[seq(1, nrow(df_res), 2), ]
 OutputDataSet <- df_res_new ';

DECLARE @FINAL_RCODE NVARCHAR(MAX)
SET @FINAL_RCODE = CONCAT(@RCODE, @URL, @RCODE2)

INSERT INTO @TEMP
EXEC sp_execute_external_script
 @language = N'R'
 ,@script = @FINAL_RCODE


INSERT INTO SQLSatSessions (sqlSat,SqlSatTitle,SQLSatSpeaker)
SELECT 
 @EventID AS sqlsat
 ,SqlSatTitle
 ,SqlSatSpeaker
FROM @Temp

 

Before you run this, just a little environment setup:

USE [master];
GO

CREATE DATABASE SQLSaturday;
GO

USE SQLSaturday;
GO

CREATE TABLE SQLSatSessions
(
 id SMALLINT IDENTITY(1,1) NOT NULL
,SqlSat SMALLINT NOT NULL
,SqlSatTitle NVARCHAR(500) NOT NULL
,SQLSatSpeaker NVARCHAR(200) NOT NULL
)

 

There you go! Now you can run the stored procedure for a particular event (in this case SQLSaturday Slovenia 2017):

EXECUTE GetSessions @eventID = 687

or you can run this procedure against a whole range of SQLSaturday events and web scrape the data from the SQLSaturday.com website in one go.
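For such a multi-event run, the procedure can simply be called in a loop. A minimal sketch from Python with pyodbc (the connection string is a placeholder and the event range mirrors the SQLSat600 to SQLSat690 range used below; a T-SQL WHILE loop would do the job just as well):

import pyodbc

# placeholder connection string - adjust server and authentication to your environment
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=SQLSaturday;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# scrape events 600 to 690 by running the stored procedure for each event ID
for event_id in range(600, 691):
    try:
        cursor.execute("EXECUTE GetSessions @eventID = ?", event_id)
        conn.commit()
    except pyodbc.Error as e:
        print("event", event_id, "failed:", e)

conn.close()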

For the Slovenian SQLSaturday, I get the following list of sessions and speakers:

[Screenshot: query results with the sessions and speakers for SQLSaturday #687]

Please note: if you are running this code behind a firewall or proxy, some additional changes for the proxy or firewall might be needed!

So, returning to the original question (how many times has Query Store been presented at SQLSaturdays from SQLSat600 until SQLSat690), here is the frequency table:

[Screenshot: frequency table from the web scraping results]

Or presented as a pandas graph:

[Chart: session topic frequencies]
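As a rough sketch of how such counts and the chart could be reproduced with pandas, assuming the scraped sessions were exported from the SQLSatSessions table to a CSV file (the file name, column name and keyword list are my assumptions):

import pandas as pd
import matplotlib.pyplot as plt

# scraped sessions exported from the SQLSatSessions table (assumed file and column names)
sessions = pd.read_csv("sqlsat_sessions.csv")

topics = ["Query Store", "PowerShell", "Azure", "R", "Python"]

# count how many session titles mention each topic (whole-word, case-insensitive match)
counts = {t: sessions["SqlSatTitle"]
               .str.contains(rf"\b{t}\b", case=False, regex=True, na=False)
               .sum()
          for t in topics}

freq = pd.Series(counts).sort_values(ascending=False)
print(freq)

freq.plot(kind="bar", title="Topic frequency, SQLSat600 - SQLSat690")
plt.tight_layout()
plt.show()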

Query Store is popular, more popular than the R, Python or Azure ML topics, but PowerShell is gaining popularity like crazy. Good work, PowerShell people! 🙂

UPDATE #1: More statistics. In general, a PowerShell session is presented at every second SQLSaturday and Query Store at every third, whereas there are at least 2 Azure-related topics at every SQLSat event (relevant for events ranging from SQLSat600 to SQLSat690).

As always, the code is available on GitHub.

 


Robust IRIS Dataset?

This blog post was born out of pure curiosity about the robustness of the IRIS dataset. Biological datasets do not need to be that big, in comparison to datasets of customers, consumption, stock and anything else that might be volatile.

Back at university, on one occasion I can remember, we were measuring the length of frog legs and other froggy features. After just a couple of measurements, further predictions were steady. Also, any kind of sampling (RS and SRS, cluster/stratified sampling, sampling with replacement and many other creative ways of sampling) proved to be rigid and robust and would converge quickly to a good result.

Therefore, I decided to put the IRIS dataset to the test using a simple classification method: first calculating the simple Euclidean distance, then finding the nearest neighbours and, based on that, checking the predicted flower type against the labels.

Accuracy of the prediction was tested by mapping the original species to the predicted ones. The test was: how small can the training dataset be in order to still get a good result?
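The accuracy measure itself boils down to the share of matching labels; a minimal sketch (the function name is mine):

def accuracy(actual, predicted):
    # share of predictions that match the original species labels
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)

# e.g. accuracy(["setosa", "setosa", "virginica"], ["setosa", "versicolor", "virginica"]) == 2/3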

After some Python and R code, the results were in.

I have tested the following pairs of train : test sample sizes (a sketch of how such a split can be drawn follows the list):

  • 80% – 20%
  • 60% – 40%
  • 50% – 50%
  • 30% – 70%
  • 10% – 90%
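The classification.py script itself is not listed in the post; as an illustration only, here is one way such a split could be drawn with scikit-learn (my assumption, not necessarily what the original script does):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# e.g. the 10% - 90% pair: 10% of the 150 observations used for training
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, train_size=0.10, stratify=iris.target
)
print(len(X_train), "training and", len(X_test), "test observations")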

Note that the IRIS dataset has 150 observations, evenly distributed among the three species. The following Python code repeatedly runs the classification script, which performs the Euclidean distance calculation:

# run the classification script 3,000 times and collect the results
for x in range(3000):
    exec(open("./classification.py").read(), globals())

At the end, I generated this file:

[Screenshot: contents of the generated predictions file]

With these results, a simple piece of R code was used to generate the scatter plot:

library(ggplot2)

setwd("C:\\Predictions\\")

# results_split.txt: V1 = size of training subset, V3 = accuracy of predictions
df_pred <- data.frame(read.table("results_split.txt", sep = ";"))

p <- ggplot(df_pred, aes(x = V3, y = V1)) +
     geom_point() +
     labs(x = "Accuracy (%) of predictions", y = "Size of training subset")
p

Which resulted in:

[Scatter plot: accuracy of predictions vs. size of training subset]

The graph clearly shows that a 10% training set (10% of the 150 observations) can generate very accurate predictions roughly every 1.35 runs.

Other pairs, taking 30% or 50% of the data for training, give close to 100% accuracy almost every time.

A snippet of the Python code to compute the Euclidean distance:

import math

# Euclidean distance over the first `length` features of the two instances
def eucl_dist(set1, set2, length):
    distance = 0
    for x in range(length):
        distance += pow(set1[x] - set2[x], 2)
    return math.sqrt(distance)

and to find the neighbours:

import operator

def find_neighbors(train, test, nof_class):
    distances = []
    length_dist = len(test) - 1   # distance computed over all but the last position
    # distance from the test instance to every training instance
    for x in range(len(train)):
        dist = eucl_dist(test, train[x], length_dist)
        distances.append((train[x], dist))
    # sort by distance and keep the closest instances
    distances.sort(key=operator.itemgetter(1))
    neighbour = []
    for x in range(nof_class):
        neighbour.append(distances[x][0])
    return neighbour
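The final membership check (matching the neighbours' flower types against the labels) is not listed in the post either; a minimal sketch of a majority vote step, assuming the class label is stored in the last position of each training row (the helper name is mine):

from collections import Counter

def predict_class(neighbours):
    # majority vote over the neighbours' class labels (assumed to sit in the last position)
    labels = [row[-1] for row in neighbours]
    return Counter(labels).most_common(1)[0][0]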

 

Conclusion: the IRIS dataset is, due to the nature of the measurements and observations, robust and rigid; one can get very good accuracy results on a small training set. Everything beyond 30% of the data for training the model is, for this particular case, just additional overhead.

Happy R & Python coding! 🙂