Eight R Tidyverse tips for everyday data engineering

The tidyverse is a collection of R packages, primarily for data engineering and analytics: ggplot2, purrr, tibble, dplyr, tidyr, stringr, readr, and forcats. All of them share the same underlying design philosophy, grammar, and data structures.

Collection of Tidyverse resources. Source: Tidyverse

1. Use the pipe “%>%” to nest functions within a pipeline

Piping (or chaining) is a great way to link the data manipulation functions without storing intermediate results.
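
The idea in miniature: the pipe takes the value on its left and feeds it in as the first argument of the function on its right, so f(g(x)) becomes x %>% g() %>% f(). A minimal sketch (the two library calls are assumed throughout this post, since the examples use the airlines and flights datasets from nycflights13):

library(tidyverse)
library(nycflights13)

# classic nesting
round(sqrt(42), 1)

# the same computation as a pipe, reading left to right
42 %>% sqrt() %>% round(1)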

# inner piping inside mutate
airlines %>%
  mutate(name_short = name %>%
           str_to_upper() %>%
           str_replace_all(" (INC|CO)\\.?$", "") %>%
           str_replace_all(" AIR ?(LINES|WAYS)?( CORPORATION)?$", "") %>%
           str_to_title() %>%
           # note the double backslashes: both word boundaries must be escaped
           str_replace_all("\\bUs\\b", "US")
  ) %>%
  mutate(FullName_length = nchar(name)) %>%
  select(name_short, FullName_length) %>%
  arrange(desc(FullName_length))

This code is clearly readable, understandable, and easy to maintain, despite the heavy use of regular expressions and str_replace_all(). If we replaced the inner piping with ordinary function nesting, the mutate() call for name_short would be fairly unreadable.

# without inner piping
airlines %>%
  mutate(name_short = str_replace_all(
    str_to_title(
      str_replace_all(
        str_replace_all(str_to_upper(name), " (INC|CO)\\.?$", ""),
        " AIR ?(LINES|WAYS)?( CORPORATION)?$", "")),
    "\\bUs\\b", "US")) %>%
  mutate(FullName_length = nchar(name)) %>%
  select(name_short, FullName_length) %>%
  arrange(desc(FullName_length))

Both code snippets return the same result and use essentially the same functions; the first one simply expresses the chain of transformations far more clearly, thanks to pipes.

2. Operations across multiple columns at once

The tidyverse provides a handful of great functions for operating across multiple columns simultaneously. across() makes it easy to apply the same transformation to numerous columns inside summarise() and mutate().

across() accepts two main arguments: (a) the set of columns to operate on and (b) a function, or list of functions, to apply to the selected columns.

# Simple across with two selected columns
flights %>%
  mutate(across(c(dep_delay, arr_delay), abs))

# Using the helper function where() to select columns
flights %>%
  summarise(across(where(is.integer), n_distinct))

# Using where() and a purrr-style lambda function
flights %>%
  summarise(across(where(is.integer), ~ sum(.x, na.rm = TRUE)))

# using a list of functions with custom output names; NAs removed inside the lambdas
flights %>%
  group_by(carrier) %>%
  summarise(across(ends_with("time"),
                   list(AVG = ~ mean(.x, na.rm = TRUE),
                        SD = ~ sd(.x, na.rm = TRUE),
                        NA_count = ~ sum(is.na(.x))),
                   .names = "{.col}.{.fn}")) %>%
  ungroup()
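
Selection helpers inside across() also compose, so several conditions can be combined with & or |. A small sketch (the column choice is purely illustrative):

# numeric columns whose names end with "delay"
flights %>%
  summarise(across(where(is.numeric) & ends_with("delay"),
                   ~ mean(.x, na.rm = TRUE)))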

3. Case statement to create a column based on a condition

Creating case statements is an everyday task, and the case_when() function makes the transformation simple and fast. The new column is simply added alongside other tidyverse functions, such as group_by() and count().

flights %>%
  group_by(carrier) %>%
  mutate(new_classification = case_when(
    (origin == "EWR") & (dep_delay <= 0) ~ "EWR with negative delay",
    (origin == "EWR") & (dep_delay > 0) ~ "EWR with positive delay",
    (origin == "JFK") ~ "Stats for JFK Airport",
    (origin == "LGA") &  (air_time <= 220) ~ "La Guardia under 6 hours flights",
    TRUE   ~ "La Guardia above 6 hours flights"
  )) %>%
  count(new_classification) %>%
  ungroup()
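
One detail worth remembering: case_when() evaluates its conditions in order and the first match wins, which is why the catch-all TRUE branch must come last. A minimal sketch outside the flights data:

# the first condition that evaluates to TRUE determines the output
v <- c(-5, 0, 12, 130)
case_when(
  v > 100 ~ "huge",
  v > 0   ~ "positive",
  TRUE    ~ "non-positive"
)
## "non-positive" "non-positive" "positive" "huge"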

4. Using transmute

This function does the same job as mutate(): it creates new columns. The difference is that transmute() keeps only the variables it creates (or explicitly mentions), while mutate() preserves all existing ones.

library(lubridate)

# Combination of mutate and select
flights %>%
 mutate(date = make_date(year, month, day), carrier, tailnum) %>%
 select(date, carrier, tailnum)

# using transmute
flights %>%
 transmute(date = make_date(year, month, day), carrier, tailnum)

The two examples above return essentially the same result set; transmute() is simply cleaner and easier to read. If one removed the select() call from the first example, one would get the complete dataset along with the new column.

5. Lumping levels

Lumping together the levels of a factor variable is an extremely powerful technique when there are many levels but one is interested only in the most frequent ones.

Given a vector of 15 letters and their frequencies, one can immediately see that roughly 80% of the observations fall into just 5 of the 15 letters.

# a factor with 15 levels and skewed frequencies
x <- factor(rep(LETTERS[1:15], times = c(20,15,23,2,4,3,1,1,1,5,2,8,3,1,1))) 
x %>% table()

To achieve this, one can use fct_lump_n(), which keeps the n most frequent levels, or fct_lump_min(), which keeps the levels appearing at least min times.

x %>%
  fct_lump_n(5) %>%
  table()

# or alternatively
# x %>% fct_lump_min(5) %>% table()

## result:
##     A     B     C     J     L Other 
##    20    15    23     5     8    19

The long tail of low-frequency levels is now grouped into an “Other” bucket, which is great for visualising data.

# using ggplot to show top 5 frequent carriers and "other" group
flights %>%
  mutate(name = fct_lump_n(carrier, 5)) %>%
  count(name) %>%
  mutate(name = fct_reorder(name, n)) %>%
  ggplot(aes(x=name, y=n)) +
  geom_col()

Quick overview graph using fct_lump_n() function

6. Generating all possible combinations

Generating all possible combinations of a set of variables is not an everyday scenario, but the tidyverse offers the functions crossing() and expand() to do the job.

The following example creates all possible combinations of the variables age, status, values, and temperature.

# pseudo sample: returns a tibble of 5 * 2 * 4 * 3 = 120 combinations
crossing(
  age = c(30,40,50,60,70),
  status = c("New", "Used"),
  values = c("0-100EUR", "101-200E", "201-300", "301-400"),
  temperature = c(30,35,34)
)

Working with the flights dataset, let’s create all possible combinations for a couple of selected variables, using expand().

# generating all possible combinations
flights %>% expand(origin, dest, dep_time, carrier)

To keep only the unique combinations that actually occur in the dataset, either pipe the selected columns into crossing() or wrap them in nesting() inside expand():

# getting all possible combinations that are present in dataset using crossing
flights %>%
  select(origin, dest, dep_time, carrier) %>%
  crossing()
# > 97,946 more rows

# getting all possible combinations using expand and nesting
flights %>% expand(nesting(origin, dest, dep_time,carrier))
# > 97,946 more rows
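
For comparison, dplyr’s distinct() returns the same set of observed combinations, just without the sorting that expand() and crossing() apply:

# unique observed combinations, in order of first appearance
flights %>% distinct(origin, dest, dep_time, carrier)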

7. Reshaping data with pivot_wider and pivot_longer

Another important task that regularly comes up in data preparation is reshaping.

The first function one must (!) understand and be comfortable working with is pivot_wider(). It takes the values of one variable (here: origin) and reshapes (pivots) the data (dep_time and arr_time) for a given statistic (the mean of the values).

#pivot_wider
flights %>%
  group_by(carrier) %>%
  select(origin, dep_time, arr_time) %>%
  pivot_wider(
          names_from = origin, 
          values_from = c(dep_time,arr_time), 
          values_fn = ~mean(.x, na.rm = TRUE),
          names_glue = "{origin}_{.value}"
          )

#check calculation for carrier UA and origin EWR using summarise
flights %>%
  filter(carrier == 'UA' & origin == 'EWR') %>%
  group_by(carrier) %>%
    summarise(
    avg_dep_time = mean(dep_time, na.rm = TRUE)
  )

The counterpart of pivot_wider() is pivot_longer(), which does the exact opposite: the wider function adds columns and reduces the number of rows, while the longer function reduces the number of columns and creates new rows.

Let’s persist the data frame from the pivot_wider() example and name it flights_wider.

#create and persist dataframe called: flights_wider
flights_wider <- flights %>%
  group_by(carrier) %>%
  select(origin, dep_time, arr_time) %>%
  pivot_wider(
    names_from = origin, 
    values_from = c(dep_time,arr_time), 
    values_fn = ~mean(.x, na.rm = TRUE),
    names_glue = "{origin}_{.value}"
  )

# excluding carrier
flights_wider %>%
  pivot_longer(
    !carrier,
    names_to = "origin",
    values_to = "time",
    values_drop_na = TRUE
  )

#or with cols parameter and defining the pattern for column selection
flights_wider %>%
  pivot_longer(
    cols = ends_with("time"),
    names_to = "origin",
    values_to = "time",
    values_drop_na = TRUE
  )
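
Because the wide column names combine the origin and the measure (e.g. EWR_dep_time), pivot_longer() can also split them back into separate columns via the special ".value" sentinel. A sketch, assuming the "{origin}_{.value}" naming used above:

flights_wider %>%
  pivot_longer(
    cols = ends_with("time"),
    names_to = c("origin", ".value"),
    names_pattern = "([A-Z]+)_(.*)",
    values_drop_na = TRUE
  )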

8. Adding a running ID to your dataframe

The last tip is the simplest, yet it can save tons of time: adding a running ID to each row of the dataframe. The ID makes every row unique across all attributes (columns), which saves a lot of filtering and code, since one can simply ask for the row with ID = 42.

# Associate ID with every row in dataset
flights %>% 
 mutate(running_id = row_number())

# or using the tibble rowid_to_column() function
flights2 <- tibble::rowid_to_column(flights, "ID")
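
Once the ID exists, a single row can be addressed directly, e.g. the row with ID = 42:

# retrieve one specific row by its running ID
flights %>%
  mutate(running_id = row_number()) %>%
  filter(running_id == 42)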

This list is by no means exhaustive, and one could add many honourable mentions. It has, however, been curated over a long period of time, and the cherry-picking reflects everyday work with the tidyverse.

As always, the code is available at my Github repository.

Happy Tidyversing and stay healthy!

p.s.: Article was originally published on Medium.

Comments

Lukasz says:

    Oh… how I hate across() 🙂

    I switched to tidyverse to avoid nested parentheses like

    flights %>% summarise(across(where(is.integer), n_distinct))

    Old good *_if functions were much better:

    flights %>% summarise_if(is.integer, n_distinct)

    tomaztsql replies:

    I feel you 🙂
    All of the functions summarise_at(), summarise_all(), and summarise_if() have been superseded by across(). The functions with the suffixes _at(), _all(), and _if() will in future receive only critical bug fixes and nothing more, but they will stay around.

    What new functionality does across() bring:
    – it combines the capabilities of _if and _at, e.g.: across(where(is.numeric) & ends_with("EWG"))
    – it significantly reduces the number of functions that dplyr needs to provide
    – it makes it possible to express more complex and useful summaries
    – better wrangling with dataframes and structures
