First release and update dates of R Packages statistics

R has been around for a long time, and its packages have evolved over the years too: initial releases, updates, and a steady stream of new packages, as with many open-source, community-driven languages. Getting the first release dates of R packages requires a little bit of web scraping and is a lot of fun.

CRAN, the Comprehensive R Archive Network, has put a lot of people, rules and hours of work into making the packages available to the general public in a tidy, ready-to-use fashion.

 

Last R Package updates

First, let’s check the dates of the last package updates. By loading rvest and reading the data from the CRAN web site, https://cran.r-project.org/web/packages/available_packages_by_date.html, we can turn the HTML table into a usable data.frame in R.

library(rvest)
library(ggplot2)

url = 'https://cran.r-project.org/web/packages/available_packages_by_date.html'

CRANpage <- read_html(url)
tbls <- html_nodes(CRANpage, "table") # since HTML is in table; no need to scrape td/tr elements
table1 <- html_table(tbls[1], fill = TRUE)
dd <- data.frame(table1[1])

#house cleaning
dd$Date <- as.Date(dd$Date)
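
To see what the scraping step produces without hitting CRAN, the same calls can be tried on a small inline HTML table. This is only a sketch: the two rows and the package names pkgA and pkgB are made up for illustration, and the column layout (Date, Package, Title) is assumed to mirror the CRAN page.

```r
library(rvest)

# hypothetical two-row table, assumed to mimic the layout of the CRAN page
html <- '<table>
  <tr><th>Date</th><th>Package</th><th>Title</th></tr>
  <tr><td>2018-10-07</td><td>pkgA</td><td>First made-up package</td></tr>
  <tr><td>2017-03-15</td><td>pkgB</td><td>Second made-up package</td></tr>
</table>'

page <- read_html(html)                       # parse the literal HTML string
toy_tbls <- html_nodes(page, "table")         # same selector as for the real page
toy <- data.frame(html_table(toy_tbls[[1]]))  # first table as a data.frame
toy$Date <- as.Date(toy$Date)                 # character -> Date

str(toy)
```

The Date column arrives as text, which is why the as.Date() conversion is needed before any plotting or grouping by year.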

### simple graph
ggplot(dd, aes(x=Date)) +
geom_dotplot(binwidth =12) + 
labs(x = "Dates", y = "Number of packages updates by Year of last update") +
scale_x_date(date_breaks= "2 years", date_labels = "%Y/%m", limits = as.Date(c("2005-01-01", "2018-10-10")))

Based on this graph, we can see that many of the R packages have been updated in the past year or two.

[Figure: dot plot of R packages by date of last update]

So, how many? To find out, we run the following statement:

library(dplyr)
library(lubridate)

# updates by year
dd_y <- dd %>%
mutate( PYear= year(Date)) %>%
select (PYear) %>%
group_by(PYear) %>%
summarise(
  nof = n()
)

with the results:

# A tibble: 14 x 2
   PYear   nof
   <dbl> <int>
 1  2005     1
 2  2006     4
 3  2007     1
 4  2008    10
 5  2009    25
 6  2010    32
 7  2011    65
 8  2012   464
 9  2013   575
10  2014   764
11  2015  1158
12  2016  1772
13  2017  2683
14  2018  5583

So out of 13137 packages (as of October 7th, 2018), 5583 were last updated in 2018 and a further 2683 in 2017.

By running some simple statistics:
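
The same yearly tally can also be sketched in base R, which can be handy as a cross-check on the dplyr result. The dates below are made up for illustration; with the real data, the vector would be dd$Date.

```r
# made-up update dates, for illustration only
dates <- as.Date(c("2016-05-01", "2017-02-11", "2017-09-30",
                   "2018-01-20", "2018-06-03", "2018-10-07"))

# number of packages per year of last update
per_year <- table(format(dates, "%Y"))
per_year
```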

dd_y %>% 
mutate(cumsum = cumsum(nof)
,percY = nof/cumsum(nof)
,percC = cumsum(nof)/sum(nof))

 

we can see how recently most of the packages have been updated.

# A tibble: 14 x 5
   PYear   nof cumsum percY     percC
   <dbl> <int>  <int> <dbl>     <dbl>
 1  2005     1      1 1.00  0.0000761
 2  2006     4      5 0.800 0.000381 
 3  2007     1      6 0.167 0.000457 
 4  2008    10     16 0.625 0.00122  
 5  2009    25     41 0.610 0.00312  
 6  2010    32     73 0.438 0.00556  
 7  2011    65    138 0.471 0.0105   
 8  2012   464    602 0.771 0.0458   
 9  2013   575   1177 0.489 0.0896   
10  2014   764   1941 0.394 0.148    
11  2015  1158   3099 0.374 0.236    
12  2016  1772   4871 0.364 0.371    
13  2017  2683   7554 0.355 0.575    
14  2018  5583  13137 0.425 1.00 
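
The shares in this table can be checked directly from the yearly counts. A minimal sketch, using only the nof values printed above; the variable names are chosen for illustration.

```r
# yearly counts copied from the table above (2005 to 2018)
nof <- c(1, 4, 1, 10, 25, 32, 65, 464, 575, 764, 1158, 1772, 2683, 5583)
names(nof) <- 2005:2018

total <- sum(nof)                                      # all packages in the snapshot
share_last_two <- sum(nof[c("2017", "2018")]) / total  # last updated in 2017 or 2018
round(share_last_two, 3)                               # 0.629
```

This is the complement of the cumulative share for 2016 (1 - 0.371).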

So the majority of the packages (about 63%, or nearly 2/3) have been updated in the last 2 years, presumably in order to keep up with the latest R engine updates. A simple correlation between the year and the number of packages will also support this:

#simple correlation
cor(dd_y)[1,2]

with the value of 0.77.
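
For readers who want to reproduce this without scraping, the coefficient can be recomputed from the yearly counts printed earlier. This is a sketch under the assumption that the printed table is the full input. The Spearman variant is my addition, not part of the original analysis; it only shows that the rank-based trend is stronger than the linear one, since the growth is far from linear.

```r
years <- 2005:2018
# yearly counts copied from the results table
nof <- c(1, 4, 1, 10, 25, 32, 65, 464, 575, 764, 1158, 1772, 2683, 5583)

r   <- cor(years, nof)                       # Pearson, as in cor(dd_y)[1,2]
rho <- cor(years, nof, method = "spearman")  # rank-based alternative

c(pearson = r, spearman = rho)
```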

Funny question: is there any correlation between the number of package updates and the month?

And the answer is: No 🙂

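dd_ym <- dd %>%
mutate( PYear= year(Date)
,month_name = month(Date, label = FALSE)) %>%
select (PYear,month_name) %>%
group_by(PYear,month_name) %>%
summarise(
nof = n()
)
# columns are PYear, month_name, nof: entry [1,2] is year vs. month,
# while month vs. number of updates is entry [2,3]
cor(dd_ym)[1,2]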

with the correlation coefficient of -0.06. So the month does not seem to play any particular role. But since the year 2018 is not over yet, this might be slightly unfair. So, to check the distribution of R package updates over months further, I have kept only the years 2011 through 2017, excluding the year 2018 and anything up to and including 2010:

#check distribution over months
dd_ym2010 <- dd_ym %>%
  filter(PYear > 2010 & PYear < 2018)

boxplot(dd_ym2010$nof~dd_ym2010$month_name, 
main="R Packages update over months", xlab = "Month", 
ylab="Number of Packages")

and we can see the boxplot:

[Figure: boxplot of the number of package updates per month, 2011 to 2017]

So more updates seem to arrive in autumn. But the correlation:

cor(dd_ym2010)[2,3]

is still just 0.155, making it hard to draw any concrete conclusions. Adding the year 2018 would skew the picture and add several outliers, since 2018 is still a running year (as of writing this blog post).
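
When picking a single value out of a correlation matrix, indexing by column name is less error-prone than indexing by position. A small sketch with made-up monthly counts (the toy data frame is hypothetical and only reuses the column names from above):

```r
# made-up monthly counts for two years, for illustration only
toy <- data.frame(
  PYear      = rep(c(2016, 2017), each = 12),
  month_name = rep(1:12, times = 2),
  nof        = c(10, 12,  9, 14, 11, 13, 15, 12, 18, 20, 22, 19,
                 21, 19, 23, 25, 22, 27, 26, 28, 33, 35, 38, 34)
)

cm <- cor(toy)  # 3 x 3 matrix with row and column names taken from the data frame

# pick the pair by name instead of by position
cm["month_name", "nof"]  # month vs. number of updates, same as cm[2, 3]
cm["PYear", "month_name"]  # year vs. month, same as cm[1, 2]
```

With a grouped tibble such as dd_ym, the same named lookup works on the result of cor().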

 

Initial dates of R Package Release

To get the complete picture, meaning not just the last updates but also the first (initial) release dates of all the packages, some further digging was needed. Again, the dates of updates and the number of updates were scraped from the CRAN archive web pages in order to prepare these statistics.

A loop over all the package archives produces the final data frame.

###########################
### Get initial Dates #####
###########################

rm(list = Filter(exists, c("packageNames")))
packageNames <- dd$Package

# rm(df_first)
# create a data frame with a dummy row to keep the data types in order
df_first <- data.frame(name=c("TK_NA")
              ,firstRelease=c(as.Date("1900-12-31"))
              ,nofUpdates=c(0))

for (i in seq_along(packageNames)){
     url1 <- 'https://cran.r-project.org/src/contrib/Archive/'
     name1 <- packageNames[i]
     url2 <- paste0(url1,name1,'/')

   # read the archive page once; skip the package if it has no archive folder
   cp <- tryCatch(read_html(url2), error=function(e) e)
   if(inherits(cp, "error")) next

   t2 <- html_nodes(cp, "table")
   t2 <- html_table(t2[1], fill = TRUE)
   dd2 <- data.frame(t2[1])
   dat <- as.Date(dd2$Last.modified, format = '%Y-%m-%d')
   dat <- dat[!is.na(dat)]          # drop header/separator rows that carry no date
   if (length(dat) == 0) next
   firstRelease <- min(dat)         # earliest archived version
   numberOfUpdates <- length(dat)   # number of archived versions
   df_first <- rbind(df_first,data.frame(name=name1,firstRelease=firstRelease,nofUpdates=numberOfUpdates))
}

# clean my initial row when creating data.frame
myData = df_first[df_first$firstRelease > '1900-12-31',]
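
As an aside, the date handling inside the loop can be tried offline on a single made-up column. This is a sketch only: the strings below are hypothetical and merely assume the "YYYY-MM-DD HH:MM" shape of the Last modified column, with empty strings standing in for header and separator rows of the listing.

```r
# made-up "Last modified" values for one package, for illustration only
last_modified <- c("", "", "2015-03-02 10:15", "2016-07-19 08:40",
                   "2018-01-05 17:22", "")

dat <- as.Date(last_modified, format = "%Y-%m-%d")  # rows without a date become NA
dat <- dat[!is.na(dat)]                             # keep real versions only

first_release <- min(dat)     # earliest archived version
n_versions    <- length(dat)  # number of archived versions

first_release
n_versions
```

Counting only the non-NA dates matters: counting every row of the scraped table would also count the rows that are not package versions.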

After leaving the scraping loop running for roughly 10 minutes, the code had scraped all the archives of the CRAN web repository. But not all packages have an archive folder yet, which should mean that there have not been any updates for these packages yet (correct me if I am wrong, thanks). So some additional data wrangling was needed:

# add missing packages that did not fall into archive folder on CRAN

myDataNonArchive <- dd$Package[!dd$Package %in% myData$name]
myDataNonArchive2 <- cbind(dd[dd$Package %in% myDataNonArchive,c(2,1)],1)

names(myData) <- c("Name","firstRelease","nofUpdates")
names(myDataNonArchive2) <- c("Name","firstRelease","nofUpdates")

finalArchive <- data.frame(rbind(myData, myDataNonArchive2))
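
The logic of this step is easier to follow on a toy example. The sketch below uses made-up packages (pkgA, pkgB, pkgC) and assumes, as in the text, that a package without an archive folder has exactly one version, so its current date counts as its first release.

```r
# made-up inputs, for illustration only
latest <- data.frame(Date = as.Date(c("2018-09-01", "2018-10-01", "2017-05-05")),
                     Package = c("pkgA", "pkgB", "pkgC"),
                     stringsAsFactors = FALSE)
archived <- data.frame(Name = "pkgA",
                       firstRelease = as.Date("2014-02-02"),
                       nofUpdates = 5,
                       stringsAsFactors = FALSE)

# packages without an archive folder: current version assumed to be the first release
not_archived <- latest[!latest$Package %in% archived$Name, ]
only_current <- data.frame(Name = not_archived$Package,
                           firstRelease = not_archived$Date,
                           nofUpdates = 1,
                           stringsAsFactors = FALSE)

toy_final <- rbind(archived, only_current)
toy_final
```

Building the second data frame with named columns avoids relying on column positions such as c(2,1).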

And the final graph of the initial release year of packages can be plotted:

hist(year(finalArchive$firstRelease),
main = paste("Histogram of First year of R Package Release")
,xlab="Year",ylab="Number of Packages"
,col="lightblue", border="Black"
,xlim = c(1995, 2020), las=1, ylim=c(0,3000))

And the graph:

[Figure: histogram of the first release year of R packages]

With the following numbers (focusing only on past years):

[Image: table of counts for the recent years]

We can conclude that in 2018 we might not see the same positive trend in new package development as in the past years (this is my personal view and conclusion). Another indicator of this is that the number of updates in 2018, the year of a major R upgrade, for the packages released in 2018 is lower in comparison with previous years. I guess the years 2016 and 2017 were the “data science years” and golden years for R.

 

As always, the complete code is available on GitHub.

 

 

Happy R-coding!
