Tips for organising your R code

Keeping your R code organised is not as straightforward as one might think. Consider the libraries, variables, functions and other objects a script accumulates: all of them can be defined and later overwritten, and some become obsolete along the way.

Organisation becomes even more important when you collaborate within a larger group of engineers and scientists.

Motivation

The most important step toward code reproducibility is to keep code organised and atomic, to store it in layers or components, and to keep its documentation up to date. Because R is a scripting language, organising files into directories and subdirectories matters for later reuse and for collaboration across departments in an organisation.

Ideally, the names of files and subdirectories are self-explanatory, so that one can tell at a glance what data files contain, what scripts do, and what came from what.

The following R tips are based on problems that many organisations face frequently. All R samples and themes are fictional and can be adapted to your organisational environment. The examples use the iris dataset to show the custom theme and the use of functions. All images, unless otherwise noted, are by the author.

1. Organising R files

You can always call R files, libraries, functions and settings from a different file. This is a natural starting point for a folder structure that each developer can clone or access to get all the files, themes and functions the organisation publishes.

Organising R folders and R scripts

Structuring R files, functions, data and other assets is an essential step toward reproducibility.

Projects, as available in RStudio by Posit, are a great place to start, but you can always create your own structure to help with code and file organisation.
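A custom structure and the idea of calling code from other files can be sketched in base R as follows. The folder names, the project location and the add_one() helper are assumptions made for illustration; they are not a layout prescribed by this post.

```r
# Assumed project location; replace tempdir() with your own project root.
project_root <- file.path(tempdir(), "my_r_project")

# Hypothetical folder layout for data, shared functions, scripts, themes and output.
project_folders <- c("data", "functions", "scripts", "themes", "output")
for (folder in project_folders) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# A shared function lives in its own file inside functions/.
writeLines("add_one <- function(x) x + 1",
           file.path(project_root, "functions", "add_one.R"))

# Any script can then load every shared function with source().
function_files <- list.files(file.path(project_root, "functions"),
                             pattern = "\\.R$", full.names = TRUE)
for (function_file in function_files) {
  source(function_file)
}

add_one(1)
```

Keeping shared functions in one folder means every script loads them the same way, regardless of who wrote it.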

2. Installing and attaching R libraries

Installing and attaching R libraries is part of almost every R script. Whenever you write R code, there comes a point at which you reference an external library.

You do not want to install and attach libraries one by one, as in the pseudo-code below.

install.packages("a")
install.packages("b")
library(a)
library(b)

Instead, you can create a character vector of library names, install those that are missing, and attach them all with shorter code:

required_Packages_Install <- c("ggplot2", "caret", "leaflet", "plotly", "magick")

# Install each package only if it is missing, then attach it
for (Package in required_Packages_Install) {
  if (!requireNamespace(Package, quietly = TRUE)) {
    install.packages(Package, dependencies = TRUE)
  }
  library(Package, character.only = TRUE)
}

In many enterprise environments, you might have trouble installing some packages on your local disk. These packages may contain *.zip or *.exe files that the security policy blocks. In this case, the best solution is to install packages into dedicated folder(s), as introduced in “Organising R files”. You will have to supply the installation path and the loading path for the packages.
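A dedicated package folder can be sketched with base R alone. The helper name use_project_library() and the folder location are assumptions made for this example; the install.packages() and library() calls are shown as comments because they need network access and a real package.

```r
# Hypothetical helper: create a project-level library folder and search it first.
use_project_library <- function(library_path) {
  dir.create(library_path, recursive = TRUE, showWarnings = FALSE)
  .libPaths(c(library_path, .libPaths()))
  invisible(normalizePath(library_path))
}

# Assumed location; in practice this would be a folder your policy allows.
project_library <- file.path(tempdir(), "r_library")
use_project_library(project_library)

# Installation path: install.packages("ggplot2", lib = project_library)
# Loading path:      library(ggplot2, lib.loc = project_library)
```

Because the folder is placed first in .libPaths(), subsequent installs and library() calls use it by default, so the lib and lib.loc arguments are only needed when you want to be explicit.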

The next step is to use a TXT file to list all the packages needed for an R project or script. Let’s create a requirements.txt file (just like with YAML, Python,…) and put the package names inside it.
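Reading such a file needs nothing beyond base R. The sketch below assumes one package per line, an optional "==version" suffix and "#" comment lines; that file format, the helper names parse_requirements() and missing_packages(), and the version string in the demo file are assumptions for illustration only.

```r
# Hypothetical helper: parse a requirements file into package/version pairs.
parse_requirements <- function(path) {
  lines <- trimws(readLines(path, warn = FALSE))
  lines <- lines[nzchar(lines) & !startsWith(lines, "#")]
  parts <- strsplit(lines, "==", fixed = TRUE)
  data.frame(
    package = vapply(parts, function(p) trimws(p[1]), character(1)),
    version = vapply(parts, function(p) {
      if (length(p) > 1) trimws(p[2]) else NA_character_
    }, character(1)),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: which listed packages are not installed yet?
missing_packages <- function(requirements) {
  setdiff(requirements$package, rownames(installed.packages()))
}

# Demo with a temporary file standing in for requirements.txt.
requirements_file <- tempfile(fileext = ".txt")
writeLines(c("# plotting", "ggplot2", "caret==6.0-94", ""), requirements_file)

requirements <- parse_requirements(requirements_file)
to_install <- missing_packages(requirements)
# install.packages(to_install)  # run only after the user has agreed to it
```

The install step is left commented out on purpose, so that the script reports what is missing without changing the environment on its own.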

Two R packages can help you install from the requirements.txt file: requiRements and versions. Both work similarly when your package list is stored in requirements.txt, but versions also takes the package version as input, which makes it possible to pin versions. The base function install.packages(), on the other hand, lets you specify arguments such as repos, lib (path) and destdir in detail, along with many system variables. These are essential for storing packages at the desired location.

If you want to simplify the process, you can always create a ZIP file of all working packages and restore and install it at any given time. In this case, it is advisable to record the R version in the ZIP file as well.

Protip: use library() instead of require(). library() stops with an error when a package is missing, whereas require() only returns FALSE with a warning and lets the script continue, causing failures later in the code.
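The difference is easy to check with a package name that does not exist. The name below is made up for this purpose, and the warning from require() is suppressed only to keep the output short.

```r
# A made-up package name that is assumed not to be installed.
missing_package <- "thisPackageDoesNotExist"

# library() raises an error, which tryCatch() turns into a label here.
library_result <- tryCatch({
  library(missing_package, character.only = TRUE)
  "attached"
}, error = function(e) "error")

# require() returns FALSE and execution simply continues.
require_result <- suppressWarnings(
  require(missing_package, character.only = TRUE)
)

library_result
require_result
```

A script that uses library() therefore stops at the line that attaches the package, which is where the problem actually is.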

3. Use the corporate themes

Every corporate environment should follow a theme, with predefined colours, table designs, pixel-perfect diagrams and positions. With the ggplot2 package, you can create a theme that follows your design guidelines.

The theme can also ensure that the same colours always represent the same KPI.

library(ggplot2)

# Baseline plot with the default ggplot2 theme
ggplot(data = iris, aes(Sepal.Length)) +
  geom_bar(color = "grey", fill = "red") +
  labs(x = "Length of Sepal",
       y = "Count of flowers",
       title = "Number of flowers \nby sepal length",
       caption = "Source: IRIS Dataset \nBase R Package")

# Corporate theme: no grid lines, dotted border, brand background and fonts
theme_organisation <- function() {
  font <- "Times New Roman"
  theme(
    panel.grid.major.x = element_blank(),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    # linewidth replaces size for borders from ggplot2 3.4.0 onward
    panel.border = element_rect(colour = "black", fill = NA, linetype = 3, linewidth = 0.5),
    panel.background = element_rect(fill = "#05a6f0"),
    legend.position = "bottom",
    plot.title = element_text(family = font, size = 20, face = "bold", hjust = 0, vjust = 2),
    axis.text = element_text(family = font, size = 9),
    axis.text.x = element_text(margin = margin(5, b = 10))
  )
}

# The same plot with the corporate theme applied
ggplot(data = iris, aes(Sepal.Length)) +
  geom_bar(color = "grey", fill = "red") +
  labs(x = "Length of Sepal",
       y = "Count of flowers",
       title = "Number of flowers \nby sepal length",
       caption = "Source: IRIS Dataset \nBase R Package") +
  theme_organisation()

# Add the company logo to the bottom-left corner of the plot
library(magick)
logo <- image_read("../Useless_R_functions/image/myiriscompany.png")
grid::grid.raster(logo, x = 0.1, y = 0.02, just = c("left", "bottom"), width = unit(1.9, "inches"))
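
Tying each KPI to one fixed colour can be sketched with a named vector. The KPI names, the hex codes (apart from the panel blue used above) and the helpers kpi_colour() and scale_fill_kpi() are assumptions for illustration; the ggplot2 wrapper is only defined when ggplot2 is installed.

```r
# Hypothetical KPI-to-colour mapping; names and colours are made up.
kpi_colours <- c(revenue = "#05a6f0", cost = "#d62728", margin = "#2ca02c")

# Look up colours by KPI name and fail loudly on a KPI without a colour.
kpi_colour <- function(kpi) {
  unknown <- setdiff(kpi, names(kpi_colours))
  if (length(unknown) > 0) {
    stop("No corporate colour defined for: ", paste(unknown, collapse = ", "))
  }
  unname(kpi_colours[kpi])
}

# Optional ggplot2 scale that applies the same mapping to a fill aesthetic.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  scale_fill_kpi <- function(...) {
    ggplot2::scale_fill_manual(values = kpi_colours, ...)
  }
}

kpi_colour(c("cost", "revenue"))
```

Because the vector is named, the colour follows the KPI rather than the order in which the KPIs happen to appear in the data.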

Besides graphs, tables can follow a similar theme. The R package flextable offers a great framework for creating tables with rich formats, layouts, cell formatting and plotting capabilities.

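library(flextable)

# Build the table from the first rows of iris, with a chosen column order
myiris <- flextable(head(iris),
                    col_keys = c("Species", "Petal.Width", "Sepal.Length", "Sepal.Width"))

# Highlight sepal lengths above 4.5 in red
myiris <- color(myiris, ~ Sepal.Length > 4.5, ~ Sepal.Length, color = "red")

# Add a header row that groups the columns in pairs
myiris <- add_header_row(
  x = myiris, values = c("Name and Petals", "Measures on Sepal"), colwidths = c(2, 2))
myiris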

With both packages, you can create corporate reports and export them to different tools or formats (Word, PDF, PowerPoint, HTML and others).

4. Use coding practices and never forget to document

Many practices will improve the readability and reusability of your code. I have grouped them into areas, each of which leads to better code.

Documenting code

  • Start your code with an annotated description of what the code does when it is run; this will help when you have to read or change it in the future. Give the author name, date and change log.
  • Load all dependencies, packages and files in accordance with your file structure. Also record the global environment settings and the R engine version, and indicate which packages are necessary to run your code.
  • Use setwd() to determine the location of files (script, project or packages), unless standards in your organisation make this obvious.
  • Use comments to mark off sections of code.
  • Comment your code with care. Comments should explain the why, not the what. Document each function with a description of all input arguments and the result.
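
The documentation points above can be sketched as a script skeleton. The file name, the header fields, the assumed R version and the function calculate_group_mean() are placeholders for illustration; adapt them to your organisation's standards.

```r
# ------------------------------------------------------------
# Script:      iris_summary.R  (hypothetical file name)
# Purpose:     Summarise sepal length per species for a report
# Author:      <author name>
# Created:     <date>
# Change log:  <date> - <what changed and why>
# Requires:    R >= 4.0 (assumed), base packages only
# ------------------------------------------------------------

# ---- Functions ----

#' Mean of one numeric column per group.
#'
#' @param data  A data frame.
#' @param value Name of the numeric column to average.
#' @param group Name of the grouping column.
#' @return A named numeric array with one mean per group.
calculate_group_mean <- function(data, value, group) {
  # tapply() keeps the group names, so results can be indexed by group
  tapply(data[[value]], data[[group]], mean)
}

# ---- Analysis ----

sepal_means <- calculate_group_mean(iris, "Sepal.Length", "Species")
sepal_means
```

The section markers follow the "# ---- Name ----" pattern, which RStudio also recognises as foldable code sections.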

Syntax practices

  • Place spaces around all operators (=, +, -, <-, logical operators, etc.).
  • Use <-, not =, for assignment.
  • When packages export functions with the same name, prefix the function with its package name, for example dplyr::filter() versus stats::filter().
  • To improve readability, indent the code inside curly braces. You can also use the formatR package to help you tidy and indent your code.
  • Factor out common operations rather than repeating them, and keep your code in small chunks. If a single function or loop gets too long, look for ways to break it into smaller pieces.
  • Keep lines within 80 characters so that code fits comfortably on a printed page at a reasonable size. If you find yourself running out of room, consider encapsulating some of the work in a separate function.
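
Several of these points fit into one short sketch: a repeated calculation is factored into a single function, operators are spaced, and a function is called with its package prefix. The helper name rescale_to_unit() is an assumption made for this example.

```r
# Hypothetical helper: the rescaling is written once instead of per column.
rescale_to_unit <- function(x) {
  value_range <- range(x, na.rm = TRUE)
  (x - value_range[1]) / (value_range[2] - value_range[1])
}

# Apply the helper to every numeric column rather than repeating the formula.
numeric_columns <- vapply(iris, is.numeric, logical(1))
iris_scaled <- iris
iris_scaled[numeric_columns] <- lapply(iris[numeric_columns], rescale_to_unit)

# An explicit package prefix removes any doubt about which function is meant.
median_scaled_length <- stats::median(iris_scaled$Sepal.Length)
```

If the rescaling rule ever changes, only rescale_to_unit() has to be edited, and every column stays consistent.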

Naming convention

There are many naming conventions to choose from, and all are fine as long as you use the selected one consistently. I will list just a few:

  • alllowercase: e.g. irisdataset
  • period.separated: e.g. iris.dataset
  • underscore_separated: e.g. iris_dataset
  • lowerCamelCase: e.g. addIrisDataset
  • UpperCamelCase: e.g. AddIrisDataset

Keep names concise and meaningful. Name functions with verbs, e.g. add, calculate, reduce, and variables with nouns, e.g. calculatedNumbers, vectorOfValues.

In general, you can mark helper functions with a “.” prefix. Also distinguish between local and global variables, data objects and functions.

Also, store your files under meaningful names and always save them with the .R extension.
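
As a small illustration of these naming rules, the sketch below uses the underscore_separated convention throughout, a verb for the function, nouns for the variables and a “.” prefix for the helper. All names are made up for this example.

```r
# Helper function, marked with a "." prefix.
.drop_missing <- function(values) {
  values[!is.na(values)]
}

# Function named with a verb: it calculates something.
calculate_mean_length <- function(lengths) {
  mean(.drop_missing(lengths))
}

# Variables named with nouns: they hold something.
sepal_lengths <- c(iris$Sepal.Length, NA)
mean_sepal_length <- calculate_mean_length(sepal_lengths)
```

Objects whose names start with “.” are not listed by ls() unless all.names = TRUE is set, which keeps helpers out of the way when you inspect the workspace.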

Posit — RStudio tips

  • Choose your IDE. Consider using RStudio (Posit). My second favourite for writing R code is Visual Studio Code.
  • There is no need to save the current workspace if you are writing reproducible code. You should be able to reproduce the workspace by re-running your script.
  • Keep track of versions of data, variables and functions, and use the integrated facilities to access SVN or GitHub.
  • R projects are a great way to organise your script files and outputs. Consider using Markdown to prepare finalised reports of your analysis.
  • Check the memory used, run the garbage collector (gc()) when needed, and keep session information in your project.
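
The last point can be sketched with base R alone. The output folder and the file name session_info.txt are assumptions for illustration; in a real project you would write the file into the project itself.

```r
# Assumed output location; replace tempdir() with a folder in your project.
output_folder <- tempdir()

# Run the garbage collector; the returned matrix summarises memory in use.
memory_usage <- gc()

# Size of a single object, useful for spotting what takes up memory.
iris_size <- format(object.size(iris), units = "Kb")

# Store R version, platform and loaded packages next to the results.
session_file <- file.path(output_folder, "session_info.txt")
writeLines(capture.output(sessionInfo()), session_file)
```

Keeping the session information file with the outputs makes it easier to reproduce a result later with the same R and package versions.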

The complete code is available on GitHub in this repository.

This article was originally published on Medium: https://medium.com/@tomazkastrun/tips-for-organising-your-r-code-ebbda2309b8

Comments on “Tips for organising your R code”

  1. Thanks for this post, Thomas!
    So important to organize code, and to achieve consistency across projects.

    Don’t agree with all details, though …
    Wouldn’t use setwd(). Breaks code whenever locations change (e.g. archiving projects, working on a laptop on a business trip as opposed to a desktop pc at the office, …). Prefer to use RStudio projects, the here package, and relative file paths.

    Wouldn’t automate installing packages. Hadley is strongly opposed to this – the user should actively agree to doing that. May not always be welcome in any given environment.

    • tomaztsql says:

      Thanks for your point of view. I highly appreciate the debate.

      I must say, that I absolutely agree that RStudio project is the way to go. Outside of that, I was exploring the organization of the code.

      As for installing packages, I like the Python(ian) way of automation of environment preparation and installing packages with particular version can be an advantage. Especially, when reproducing the existing code.
