Essential list of useful R packages for data scientists

Posted on April 27, 2020 by tomaztsql — 5 Comments

I have written a couple of blog posts on R packages (here | here), and this post is a sort of preset of the most needed packages for data science, statistical work and everyday use of R.

Thousands of R packages are available on CRAN (with all its mirror sites), on GitHub and in individual developers' repositories.

Many useful functions are available in several different R packages, and the same functionality often appears in more than one of them, so the choice of a particular package boils down to user preference and the work at hand. From the perspective of a statistician and data scientist, I will cover the essential and major packages section by section. This is by no means a definitive list, only a personal preference.


1. Loading and importing data

Loading and reading data into the R environment is most likely one of the first steps, if not the most important one. Data is the fuel.

This section is broken down further into reading data from binary files, from ODBC drivers and from SQL databases.

1.1. Importing from binary files

# Reading from SAS and SPSS
install.packages("Hmisc", dependencies = TRUE)
# Reading from Stata, Systat and Weka
install.packages("foreign", dependencies = TRUE)
# Reading from KNIME
install.packages(c("protr","foreign"), dependencies = TRUE)
# Reading from EXCEL
install.packages(c("readxl","xlsx"), dependencies = TRUE)
# Reading from TXT, CSV
install.packages(c("csv","readr","tidyverse"), dependencies = TRUE)
# Reading from JSON
install.packages(c("jsonlite","rjson","RJSONIO","jsonvalidate"), dependencies = TRUE)
# Reading from AVRO
install.packages("sparkavro", dependencies = TRUE)
# Reading from Parquet file
install.packages("arrow", dependencies = TRUE)
# Optional: development version of arrow from GitHub (requires devtools)
devtools::install_github("apache/arrow/r")
# Reading from XML
install.packages("XML", dependencies = TRUE)
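
As a small illustration of the import step, here is a minimal sketch that writes a tiny CSV file to a temporary location and reads it back with base R's read.csv(). If readr happens to be installed, the same file is also read with readr::read_csv(). The data and the variable names are made up for this example.

```r
# Write a small, made-up data set to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
demo <- data.frame(id = 1:3,
                   city = c("Ljubljana", "Vienna", "Zagreb"),
                   stringsAsFactors = FALSE)
write.csv(demo, csv_path, row.names = FALSE)

# Base R reader: always available
from_base <- read.csv(csv_path, stringsAsFactors = FALSE)

# readr reader: only used when the package is installed
from_readr <- NULL
if (requireNamespace("readr", quietly = TRUE)) {
  # col_types = cols() lets readr guess the types without printing a message
  from_readr <- readr::read_csv(csv_path, col_types = readr::cols())
}

str(from_base)
```

Both readers return a data frame with the same three rows; readr returns it as a tibble.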

1.2. Importing from ODBC

These two packages will cover most of the work with ODBC drivers:

install.packages(c("odbc", "RODBC"), dependencies = TRUE)

1.3. Importing from SQL Databases

Accessing a SQL database with a dedicated package can have great benefits when pulling data from the database into an R data frame. In addition, I have added some useful R packages that make it much easier to query data in R (RSQL) or even to write SQL statements directly (sqldf), along with other great features.

#Microsoft MSSQL Server
install.packages(c("mssqlR", "RODBC"), dependencies = TRUE)
#MySQL 
install.packages(c("RMySQL","dbConnect"), dependencies = TRUE)
#PostgreSQL
install.packages(c("postGIStools","RPostgreSQL"), dependencies = TRUE)
#Oracle
install.packages(c("odbc"), dependencies = TRUE)
#Amazon Redshift
install.packages(c("RRedshiftSQL"), dependencies = TRUE)
#SQLite
install.packages(c("RSQLite","sqliter","dbflobr"), dependencies = TRUE)
#General SQL packages
install.packages(c("RSQL","sqldf","poplite","queryparser"), dependencies = TRUE)
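
To make the database step concrete, here is a minimal sketch, assuming the DBI and RSQLite packages are installed. DBI is not in the list above; it is the common interface that RSQLite builds on and is used here only for illustration. The sketch copies the built-in mtcars data into an in-memory SQLite database and pulls an aggregate back into a data frame. The table name and the query are illustrative.

```r
cyl_counts <- NULL
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  # In-memory database: nothing is written to disk
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  DBI::dbWriteTable(con, "mtcars", mtcars)

  # Plain SQL in, data frame out
  cyl_counts <- DBI::dbGetQuery(
    con, "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl ORDER BY cyl")

  DBI::dbDisconnect(con)
}
cyl_counts
```

The same DBI calls should work against other back ends once the connection object is created with the matching driver.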

2. Manipulating Data

Data engineering, data copying, data wrangling and data manipulation are the very next tasks in the journey.

2.1. Cleaning data

Data cleaning is essential for handling outliers, NULL and N/A values and wrong values, for imputing or replacing them, for checking frequencies and descriptive statistics, and for applying different uni-, bi- and multivariate statistical analyses to tackle these issues. The list below is by no means complete, but it can be a good starting point:

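# Note: Bioconductor is a package repository, not a CRAN package;
# "BiocManager" is the CRAN package used to install from it.
install.packages(c("janitor","outliers","missForest","frequency","Amelia",
                   "diffobj","mice","VIM","BiocManager","mi",
                   "wrangle"), dependencies = TRUE)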

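The cleaning ideas above (missing values, imputation, outliers) can be sketched in base R before reaching for any of the listed packages. This is a deliberately simple example on made-up numbers: it uses median imputation and the common 1.5 * IQR rule, which are my choices for the illustration and far less sophisticated than what packages such as mice or missForest offer.

```r
# Made-up measurements with one missing value and one suspicious value
x <- c(2, 4, NA, 5, 100, 3)

# Count the missing values
n_missing <- sum(is.na(x))

# Simple imputation: replace NA with the median of the observed values
x_imputed <- ifelse(is.na(x), median(x, na.rm = TRUE), x)

# Flag outliers with the 1.5 * IQR rule
q <- unname(quantile(x_imputed, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- x_imputed < q[1] - fence | x_imputed > q[2] + fence

x_imputed[is_outlier]
```

Only the value 100 is flagged here; whether to drop, cap or keep such a value depends on the data and the analysis.
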
2.2. Dealing with R data types and formats

Working with the correct data types and knowing your way around formatting your data set is easily overlooked, yet important. A list of must-have packages:

install.packages(c("stringr","lubridate","glue",
                   "scales","hablar","readr"), dependencies = TRUE)

2.3. Wrangling, subsetting and aggregating data

There are many packages available for wrangling, engineering and aggregating data. The {base} R package in particular should not be overlooked, since it offers a lot of great and powerful features. The following is a list of the packages most widely used in the R community that make it easy to maneuver data:

install.packages(c("dplyr","tidyverse","purrr","magrittr",
                   "data.table","plyr","tidyr","tibble",
                   "reshape2"), dependencies = TRUE)

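To show that base R and the packages above can express the same aggregation, here is a small sketch on the built-in mtcars data. The base version uses aggregate(); the dplyr version runs only if dplyr is installed. The variable names are made up for the example.

```r
# Base R: mean miles per gallon for each cylinder count
by_cyl_base <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# dplyr equivalent, run only when the package is installed
by_cyl_dplyr <- NULL
if (requireNamespace("dplyr", quietly = TRUE)) {
  by_cyl_dplyr <- dplyr::summarise(dplyr::group_by(mtcars, cyl),
                                   mpg = mean(mpg))
}

by_cyl_base
```

Both results hold one row per cylinder group; which style to use is mostly a matter of preference and of what the rest of the code base looks like.
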
3. Statistical tests and Sampling Data

3.1. Statistical tests

Many of the statistical tests (Shapiro-Wilk, t-test, Wilcoxon, equality tests, …) are available in the base and stats packages that ship with the R engine. This is great, because R is primarily a statistical language and many of the tests are already included. Here are some additional packages that I have used:

# "stats" ships with R itself and does not need to be installed
install.packages(c("ggpubr","lme4","MASS","car"), 
                   dependencies = TRUE)
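
Since the tests mentioned above ship with R, they can be tried without installing anything. A minimal sketch on simulated data follows; the sample sizes, means and seed are arbitrary choices for the example.

```r
set.seed(2020)                      # reproducible made-up samples
a <- rnorm(30, mean = 10, sd = 2)
b <- rnorm(30, mean = 11, sd = 2)

normality <- shapiro.test(a)        # Shapiro-Wilk normality test
t_res     <- t.test(a, b)           # Welch two-sample t-test (the default)
w_res     <- wilcox.test(a, b)      # Wilcoxon rank-sum test

# Collect the p-values side by side
c(shapiro = normality$p.value, t = t_res$p.value, wilcoxon = w_res$p.value)
```

Each call returns an object of class "htest", so the statistic, p-value and method name can be extracted in the same way for all three.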

3.2. Data Sampling

Data sampling, working with samples and populations, inference, weights and the different types of statistical sampling can be found in these brilliant packages, including some that are great for survey data.

install.packages(c("sampling","icarus","sampler","SamplingStrata",
                    "survey","laeken","stratification","simPop"), 
                     dependencies = TRUE)
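
As a rough illustration of one sampling design, the sketch below draws a stratified sample in base R: a simple random sample of equal size within each stratum. It uses mtcars with the number of cylinders as the stratum; the per-stratum size of 3 is an arbitrary choice, and the packages above provide proper designs, weights and estimators on top of this idea.

```r
set.seed(42)                                   # reproducible draw
per_stratum <- 3                               # illustrative sample size

strata  <- split(mtcars, mtcars$cyl)           # one data frame per stratum
sampled <- do.call(rbind, lapply(strata, function(d) {
  d[sample(nrow(d), per_stratum), ]            # random rows within the stratum
}))

table(sampled$cyl)
```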

4. Statistical Analysis

Regardless of the type of variable, the type of analysis and the results a statistician wants to get, there is a list of packages that should be part of the daily R environment when it comes to statistical analysis.

4.1. Regression Analysis

Frankly, one of the most important types of analysis:

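# "stats" and "splines" ship with R itself and do not need to be installed.
# WLS, OLS, 2SLS and 3SLS name estimation methods, not installable packages
# (a package name cannot start with a digit), so they are left out here.
install.packages(c("lars","caret","survival","gam","glmnet",
                  "quantreg","sgd","BLR","MASS","car","mlogit","earth",
                  "faraway","nortest","lmtest","nlme",
                  "sem","pls","tree","rpart"), 
dependencies = TRUE)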
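
Before turning to the specialised packages, an ordinary least squares fit with lm() from the stats package is a useful baseline. The sketch below is only an illustration on the built-in mtcars data; the choice of predictors and the values of the hypothetical new car are arbitrary.

```r
# Ordinary least squares: fuel consumption explained by weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)

coef(fit)                          # intercept and the two slopes
r2 <- summary(fit)$r.squared       # share of variance explained

# Prediction for a hypothetical car (values chosen for illustration)
new_car <- data.frame(wt = 3, hp = 120)
pred <- predict(fit, newdata = new_car)
pred
```

Weighted least squares uses the same function through its weights argument.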

4.2. Analysis of variance

Distribution and data dispersion are core to understanding the data. Many of the tests for variance are already built into the R engine (package stats), but here are some others that might be useful for analyzing variance.

# "stats" ships with R itself and does not need to be installed
install.packages(c("caret","rio","car","MASS","FuzzyNumbers",
                   "ez"), dependencies = TRUE)

4.3. Multivariate analysis

Analysis using more than two variables is considered multivariate analysis. Regression analysis and analysis of variance (between 2+ variables) are excluded here, since they are introduced in the sections above. This section covers statistical analysis of many variables, such as factor analysis, principal component analysis, canonical analysis, discrete analysis and others:

install.packages(c("psych","CCA","CCP","MASS","icapca","gvlma","smacof",
                 "MVN","rpca","gpca","EFA.MRFA","MFAg","MVar","fabMix",
                 "fad","spBFA","cate","mnlfa","CSFA","GFA","lmds","SPCALDA",
                 "semds", "superMDS", "vcd", "vcdExtra"), 
 dependencies = TRUE)
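
One of the techniques named above, principal component analysis, is available in base R through prcomp(). Here is a minimal sketch on the four numeric columns of the built-in iris data; scaling the variables first is my choice for the example, not a requirement.

```r
# Keep the numeric measurements only
num <- iris[, 1:4]

# PCA on centred and scaled variables
pca <- prcomp(num, center = TRUE, scale. = TRUE)

# Share of total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Scores of the first observations on the first two components
head(pca$x[, 1:2])
```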

4.4. Classification and Clustering

Depending on the type of clustering and classification, there are many packages to cover both. Some of the essential packages for clustering:

install.packages(c("fpc","cluster","treeClust","e1071","NbClust","skmeans",
                "kml","compHclust","protoclust","pvclust","genie", "tclust",
                "ClusterR","dbscan","CEC","GMCM","EMCluster","randomLCA",
                "MOCCA","factoextra","poLCA"), dependencies = TRUE)

and for classification:

install.packages(c("tree", "e1071"), dependencies = TRUE)
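
For the clustering side of this section, base R already offers k-means and hierarchical clustering in the stats package. The sketch below runs both on the scaled iris measurements; the number of clusters (3), the seed and the linkage method are arbitrary choices for the illustration.

```r
set.seed(1)                                  # k-means starts are random
num <- scale(iris[, 1:4])                    # standardise the measurements

# k-means with several random starts
km <- kmeans(num, centers = 3, nstart = 20)

# Hierarchical clustering, cut into the same number of groups
hc <- hclust(dist(num), method = "ward.D2")
hc_groups <- cutree(hc, k = 3)

# Compare the k-means clusters with the known species
table(kmeans = km$cluster, species = iris$Species)
```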

4.5. Analysis of Time-series

Analysing time series and time-series type data is easier with the following packages:

install.packages(c("ts","zoo","xts","timeSeries","tsModel", "TSMining",
              "TSA","fma","fpp2","fpp3","tsfa","TSdist","TSclust","feasts",
              "MTS", "dse","sazedR","kza","fable","forecast","tseries",
              "nnfor","quantmod"), dependencies = TRUE)

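As a starting point before installing any of these, the stats package can already decompose and forecast a seasonal series. The sketch below uses the built-in AirPassengers data; the multiplicative form and the 12-month horizon are choices made for this example.

```r
y <- AirPassengers                       # monthly series, frequency 12

# Split the series into trend, seasonal and random parts
parts <- decompose(y, type = "multiplicative")

# Holt-Winters exponential smoothing and a 12-month-ahead forecast
hw <- HoltWinters(y, seasonal = "multiplicative")
fc <- predict(hw, n.ahead = 12)

fc
```

Packages such as forecast and fable build on the same ideas and add automatic model selection and prediction intervals.
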
4.6. Network analysis

Analyzing networks is also part of statistical analysis. Some of the relevant packages:

install.packages(c("fastnet","tsna","sna","networkR","InteractiveIGraph",
                 "SemNeT","igraph","NetworkToolbox","dyads", 
                  "staTools","CINNA"), dependencies = TRUE)

4.7. Analysis of text

Besides analyzing open text, one can analyse any kind of text, including the word corpus, the semantics and much more. A couple of starting packages:

install.packages(c("tm","tau","koRpus","lexicon","sylly","textir",
         "textmineR","MediaNews", "lsa","SemNeT","ngram","ngramrr",
         "corpustools","udpipe","textstem", "tidytext","text2vec"), 
          dependencies = TRUE)
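
The most basic text analysis step, tokenising and counting words, can be sketched in base R. The two sentences below are made up for the example, and the tokeniser is deliberately naive (lower-casing and splitting on anything that is not a letter); the packages above handle stemming, stop words and document-term matrices properly.

```r
docs <- c("R is a language for statistics.",
          "R is also a language for data science.")

# Lower-case, then split on every run of non-letters
tokens <- unlist(strsplit(tolower(docs), "[^a-z]+"))
tokens <- tokens[tokens != ""]           # drop empty strings, if any

# Word frequencies, most frequent first
freq <- sort(table(tokens), decreasing = TRUE)
freq
```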

5. Machine Learning

R has a variety of good machine learning packages that are powerful and give you the full machine learning cycle. The sections below follow its natural order.

5.1. Building and validating the models

Once you build one or more models and compare their results, it is also important to validate the models against the test data set or any other data sets. Here are powerful packages for model validation.

install.packages(c("tree", "e1071","crossval","caret","rpart","bcv",
                  "klaR","EnsembleCV","gencve","cvAUC","CVThresh",
                  "cvTools","dcv","cvms","blockCV"), dependencies = TRUE)

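To illustrate what these validation packages automate, here is a hand-rolled k-fold cross-validation in base R. It is a sketch under simple assumptions: a linear model on mtcars, 5 folds and RMSE as the error measure, all chosen only for the example.

```r
set.seed(123)                                   # reproducible fold assignment
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# Fit on k - 1 folds, measure the error on the held-out fold
rmse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)
  sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
})

mean(rmse)                                      # cross-validated RMSE
```

Comparing this number across candidate models gives a fairer picture than the error on the training data alone.
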
5.2. Random forests packages

Packages for random forests and related tree-based ensembles:

install.packages(c("randomForest","grf","ipred","party","randomForestSRC",
                  "BART","Boruta","LTRCtrees","REEMtree","refr",
                  "binomialRF","superml"), dependencies = TRUE)

5.3. Regression type (regression, boosting, gradient descent) algorithm packages

There are many regression-type machine learning algorithms, some with additional boosting or gradient descent. Some very usable packages:

install.packages(c("earth", "gbm","GAMBoost", "GMMBoost", "bst","superml",
                   "sboost"), dependencies = TRUE)

5.4. Classification algorithms

Classification problems are covered by many packages, and many of them are also great for machine learning cases. A handful:

install.packages(c("rpart", "tree", "C50", "RWeka","klaR", "e1071",
                   "kernlab","svmpath","superml","sboost"), 
dependencies = TRUE)

5.5. Neural networks

There are many types of neural networks, and many different packages that together cover all types of NN. Here are only a couple of very useful R packages for tackling neural networks.

install.packages(c("nnet","gnn","rnn","spnn","brnn","RSNNS","AMORE",
                   "simpleNeural","ANN2","yap","yager","deep","neuralnet",
                   "nnfor","TeachNet"), dependencies = TRUE)

5.6. Deep Learning

R has embraced deep learning, and many of the powerful SDKs and packages have been ported to R, making them very usable for R developers and the R machine learning community.

install.packages(c("deepnet","RcppDL","tensorflow","h2o","kerasR",
                   "deepNN", "Buddle","automl"), dependencies = TRUE)

5.7. Reinforcement Learning

Reinforcement learning is gaining popularity, and more and more packages are being developed in R as well. Some very useful packages:

devtools::install_github("nproellochs/ReinforcementLearning")
install.packages(c("RLT","ReinforcementLearning","MDPtoolbox"), 
dependencies = TRUE)

5.8. Model interpretability and explainability

The results of machine learning models can be a black box. Many packages aim to turn the black box into more of a “glass box”, making the models more understandable, interpretable and explainable. These are very powerful packages that do just that for many different machine learning algorithms.

install.packages(c("lime","localModel","iml","EIX","flashlight",
                    "interpret","outliertree","breakDown"), 
dependencies = TRUE)

6. Visualisation

Visualisation of the data is not only the final step in understanding the data, but can also bring clarity to interpretation and to building a mental model around the data. A couple of packages that will help boost your visualisations:

install.packages(c("ggvis","htmlwidgets","maps","sunburstR", "lattice",
  "predict3d","rgl","rglwidget","plot3Drgl","ggmap","ggplot2","plotly",
  "RColorBrewer","dygraphs","canvasXpress","qgraph","moveVis","ggcharts",
  "igraph","visNetwork","visreg", "VIM", "sjPlot", "plotKML", "squash",
  "statVisual", "mlr3viz", "klaR","DiagrammeR","pavo","rasterVis",
  "timelineR","DataViz","d3r","d3heatmap","dashboard", "highcharter",
  "rbokeh"), dependencies = TRUE)

7. Web Scraping

Many R packages are specifically designed to scrape (harvest) data from a particular website, API or archive. Here are only a couple of very generic ones:

install.packages(c("rvest","Rcrawler","ralger","scrapeR"), 
             dependencies = TRUE)
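
A minimal scraping sketch follows, assuming rvest and xml2 (a dependency of rvest) are installed. To keep it self-contained, a made-up HTML snippet stands in for a downloaded page; with a real site you would pass the URL to read_html() instead and respect the site's terms of use.

```r
page_titles <- NULL
if (requireNamespace("rvest", quietly = TRUE) &&
    requireNamespace("xml2", quietly = TRUE)) {
  # A made-up HTML snippet stands in for a downloaded page
  html <- xml2::read_html(
    "<html><body><h2>First post</h2><h2>Second post</h2></body></html>")

  # Select all second-level headings with a CSS selector and extract their text
  page_titles <- rvest::html_text(rvest::html_nodes(html, "h2"))
}
page_titles
```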

8. Documents and books organisation

To organize your documents (files, code, packages, diagrams, pictures) into a readable document and present it as a dashboard or book view, there are a couple of packages for this purpose:

install.packages(c("devtools","usethis","roxygen2","knitr",
                    "rmarkdown","flexdashboard","shiny",
                    "xtable","httr","profvis"), dependencies = TRUE)

Wrap up

The R script for loading and installing the packages is available on GitHub. Make sure to check the GitHub repository for the latest list updates. And as always, feel free to fork the code or commit updates, add essential packages to the list, comment, improve and agree or disagree.
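
Because re-installing hundreds of packages takes a long time, it can help to install only what is missing. The helper below is a small sketch of that idea and is not the script from the repository; the function names missing_packages() and install_if_missing() are hypothetical and made up for this example.

```r
# Hypothetical helper: return the subset of `pkgs` that is not installed yet
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Hypothetical helper: install only what is missing;
# invisibly returns the names it tried to install
install_if_missing <- function(pkgs, ...) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install, dependencies = TRUE, ...)
  }
  invisible(to_install)
}

# Base packages are always present, so nothing is reported as missing
missing_packages(c("stats", "utils"))

# install_if_missing(c("dplyr", "ggplot2"))   # uncomment to actually install
```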

You can also run the following command to install all of the packages in a single run:

install.packages(c("Hmisc","foreign","protr","readxl","xlsx",
                 "csv","readr","tidyverse","jsonlite","rjson",
                 "RJSONIO","jsonvalidate","sparkavro","arrow","feather",
                 "XML","odbc","RODBC","mssqlR","RMySQL",
                 "dbConnect","postGIStools","RPostgreSQL",
                 "RSQLite","sqliter","dbflobr","RSQL","sqldf",
                 "poplite","queryparser","influxdbr","janitor","outliers",
                 "missForest","frequency","Amelia","diffobj","mice",
                 "VIM","BiocManager","mi","wrangle","mitools",
                 "stringr","lubridate","glue","scales","hablar",
                 "dplyr","purrr","magrittr","data.table","plyr",
                 "tidyr","tibble","reshape2","lars",
                 "caret","survival","gam","glmnet","quantreg",
                 "sgd","BLR","MASS","car","mlogit","RRedshiftSQL",
                 "earth","faraway","nortest","lmtest","nlme",
                 "sem","pls",
                 "tree","rpart","rio",
                 "FuzzyNumbers","ez","psych","CCA","CCP",
                 "icapca","gvlma","smacof","MVN","rpca",
                 "gpca","EFA.MRFA","MFAg","MVar","fabMix",
                 "fad","spBFA","cate","mnlfa","CSFA",
                 "GFA","lmds","SPCALDA","semds","superMDS",
                 "vcd","vcdExtra","ks","rrcov","eRm",
                 "MNP","bayesm","ltm","fpc","cluster",
                 "treeClust","e1071","NbClust","skmeans","kml",
                 "compHclust","protoclust","pvclust","genie","tclust",
                 "ClusterR","dbscan","CEC","GMCM","EMCluster",
                 "randomLCA","MOCCA","factoextra","poLCA","ts",
                 "zoo","xts","timeSeries","tsModel","TSMining",
                 "TSA","fma","fpp2","fpp3","tsfa",
                 "TSdist","TSclust","feasts","MTS","dse",
                 "sazedR","kza","fable","forecast","tseries",
                 "nnfor","quantmod","fastnet","tsna","sna",
                 "networkR","InteractiveIGraph","SemNeT","igraph",
                 "dyads","staTools","CINNA","tm","tau","NetworkToolbox",
                 "koRpus","lexicon","sylly","textir","textmineR",
                 "MediaNews","lsa","ngram","ngramrr","corpustools",
                 "udpipe","textstem","tidytext","text2vec","crossval",
                 "bcv","klaR","EnsembleCV","gencve","cvAUC",
                 "CVThresh","cvTools","dcv","cvms","blockCV",
                 "randomForest","grf","ipred","party","randomForestSRC",
                 "BART","Boruta","LTRCtrees","REEMtree","refr",
                 "binomialRF","superml","gbm","GAMBoost","GMMBoost",
                 "bst","sboost","C50","RWeka",
                 "kernlab","svmpath","nnet","gnn","rnn",
                 "spnn","brnn","RSNNS","AMORE","simpleNeural",
                 "ANN2","yap","yager","deep","neuralnet",
                 "TeachNet","deepnet","RcppDL","tensorflow","h2o",
                 "kerasR","deepNN","Buddle","automl","RLT",
                 "ReinforcementLearning","MDPtoolbox","lime","localModel",
                 "iml","EIX","flashlight","interpret","outliertree",
                 "dockerfiler","azuremlsdk","sparklyr","cloudml","ggvis",
                 "htmlwidgets","maps","sunburstR","lattice","predict3d",
                 "rgl","rglwidget","plot3Drgl","ggmap","ggplot2",
                 "plotly","RColorBrewer","dygraphs","canvasXpress","qgraph",
                 "moveVis","ggcharts","visNetwork","visreg","sjPlot",
                 "plotKML","squash","statVisual","mlr3viz","DiagrammeR",
                 "pavo","rasterVis","timelineR","DataViz","d3r","breakDown",
                 "d3heatmap","dashboard","highcharter","rbokeh","rvest",
                 "Rcrawler","ralger","scrapeR","devtools","usethis",
                 "roxygen2","knitr","rmarkdown","flexdashboard","shiny",
                 "xtable","httr","profvis"), dependencies = TRUE)

Happy R-ing. 🙂


5 comments on “Essential list of useful R packages for data scientists”

Useful R Packages for Data Scientists – Curated SQL says:

April 27, 2020 at 2:11 pm

[…] Tomaz Kastrun has a nice collection of useful R packages for data scientists: […]

Essential list of useful R packages for data scientists – Data Science Austria says:

April 27, 2020 at 3:48 pm

[…] by data_admin [This article was first published on R – TomazTsql, and kindly contributed to R-bloggers]. (You can report issue about the content on this page […]

Holger K. von Jouanne-Diedrich says:

April 27, 2020 at 6:25 pm

Perhaps my OneR machine learning package would be a worthwhile addition: https://blog.ephorie.de/category/oner

Important checklist of helpful R packages for knowledge scientists – JobsandVisa says:

April 27, 2020 at 10:50 pm

[…] on Posted onApril 27, 2020April 27, 2020By Bill [This article was first published on R – TomazTsql, and kindly contributed to R-bloggers]. (You possibly can report challenge concerning the content […]

Using SQL for R data.frames with sqldf | TomazTsql says:

April 27, 2021 at 4:54 pm

[…] There are also many great packages for manipulating, wrangling and engineering data frames. Tidyverse, dplyr, data.table, purr, tibble, magrittr are many more. A curated list of relevant packages for data scientists can be found here. […]
