Comparing the performance of CSV with RDS, Parquet, and Feather file formats in R

From the previous blog post,
CSV or alternatives? Exporting data from SQL Server to ORC, AVRO, Parquet, and Feather files and storing them in Azure Data Lake,

we created Azure blob storage, established a secure connection using Python, and started uploading files from SQL Server to the blob store. Along the way, we compared the performance of different file types: ORC, AVRO, Parquet, CSV, and Feather. The conclusion was that CSV is great for readability, but not suitable (as a file format) for all types of workloads.

We will be doing a similar benchmark with the R language. The goal is to see whether the CSV file format can be replaced by a file type that does better, both in performance and in storage.

Creating random datasets (10K and 1 million rows, each with 30 columns), we tested the speed of writing and reading on a local machine. For the benchmark we used the microbenchmark package. The file formats were: CSV, Feather, Parquet, and RData (or RDS).

These are the results of running each operation (writing and reading) 10 times on the 10K-row dataset.

The test was performed with R version 4.1.2 on an M1 macOS machine. For 10,000 rows (with 30 columns), the average CSV size was 3.3 MiB, Feather and Parquet came in around 1.4 MiB, and the RData and RDS formats under 1 MiB. The read and write timings show a clear advantage for the Feather file format, followed by Parquet and the data.table package (reading from the CSV file).

The test was conducted using a sample data frame:

#libraries used throughout the benchmark
library(stringi)         # stri_rand_strings
library(arrow)           # write_feather / write_parquet
library(microbenchmark)
library(dplyr)           # inner_join

#benchmark parameters
nof_rows   <- 10000      # number of rows in the test data frame
nof_repeat <- 10         # repetitions per operation

#file names
file_csv           <- 'test_df.csv'
file_csv_readr     <- 'test_df_readr.csv'
file_csv_datatable <- 'test_df_datatable.csv'
file_feather       <- 'test_df.feather'
file_rdata         <- 'test_df.RData'
file_rds           <- 'test_df.rds'
file_parquet       <- 'test_df.parquet'

#sample data frame: 10 integer, 10 numeric and 10 string columns
test_df <- data.frame(
                replicate(10, sample(0:10000, nof_rows, replace = TRUE)),
                replicate(10, runif(nof_rows, 0.0, 1.0)),
                replicate(10, stri_rand_strings(nof_rows, 10))
                 )

and running the write benchmark:

benchmark_write <- data.frame(summary(microbenchmark(
          "test_df.csv"           = write.csv(test_df, file = file_csv),
          "test_df_readr.csv"     = readr::write_csv(test_df, file = file_csv_readr),
          "test_df_datatable.csv" = data.table::fwrite(test_df, file = file_csv_datatable),
          "test_df.feather"       = arrow::write_feather(test_df, file_feather),
          "test_df.parquet"       = arrow::write_parquet(test_df, file_parquet),
          "test_df.RData"         = save(test_df, file = file_rdata),   # save() writes RData
          "test_df.rds"           = saveRDS(test_df, file_rds),         # saveRDS() writes rds
  times = nof_repeat)))

colnames(benchmark_write) <- c("names", "write_min", "write_lq", "write_mean", "write_median", "write_uq", "write_max", "write_repeat")

The same was repeated for the read operation, and the results were merged:

benchmark_read <- data.frame(summary(microbenchmark(
                            "test_df.csv"           = read.csv(file_csv),
                            "test_df_readr.csv"     = readr::read_csv(file_csv_readr),
                            "test_df_datatable.csv" = data.table::fread(file_csv_datatable),
                            "test_df.RData"         = load(file_rdata),    # load() reads RData
                            "test_df.rds"           = readRDS(file_rds),   # readRDS() reads rds
                            "test_df.feather"       = arrow::read_feather(file_feather),
                            "test_df.parquet"       = arrow::read_parquet(file_parquet), times = nof_repeat)))

colnames(benchmark_read) <- c("names", "read_min", "read_lq", "read_mean", "read_median", "read_uq", "read_max", "read_repeat")


#file sizes (computed after the write benchmark, so the files exist) and merge results
files <- file.info(c(file_csv, file_csv_readr, file_csv_datatable, file_feather, file_rdata, file_rds, file_parquet))
files <- data.frame(names = rownames(files), size_mb = files$size / 1024^2)

results <- inner_join(inner_join(benchmark_read, files, by = "names"), benchmark_write, by = "names")
results <- results[, c("names", "size_mb", "read_min", "read_max", "read_median", "write_min", "write_max", "write_median")]
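
To eyeball the merged table before plotting, sorting by median read time works well (a quick inspection step, not part of the original post):

#quick look at the merged table, fastest readers first
results[order(results$read_median), ]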

And finally, we plotted the results for the graph above:

library(glue)
library(ggtext)

title_lab_adjusted <- glue::glue(
  "File types comparison on<br><span style = 'color:red;'>read operation</span> and <br><span style = 'color:darkgreen;'>write operation</span>")

ggplot(results, aes(x=names, y=size_mb)) + 
     geom_bar(stat="identity", fill="lightblue") +
     geom_text(aes(label=paste0(format(round(size_mb, 2), nsmall = 2), " MiB", collapse=NULL)), vjust=-0.3, size=3.5)+
     theme(axis.text.x = element_text(angle = 45, hjust = 1.3)) +
     coord_cartesian(ylim = c(0, 5), expand = F) +
     scale_y_continuous(breaks = seq(0, 5, 1),labels = scales::label_comma(accuracy = 1)) +
    theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(size = 0.1, linetype = "solid", colour = "grey50")) +
    ylab(label = 'Time (sec.) + File_Size') +  xlab(label = 'Files') +
  labs(title = title_lab_adjusted) +
  theme(
    plot.title = element_markdown(),
    panel.background = element_rect(color = NA, fill = 'white')) +

  # overlay median write (green) and read (red) times, scaled by 1/100 to share the file-size axis
  geom_point(aes(y = write_median/100, group = names),
        col = "darkgreen",
        size = 2,
        alpha = .8) +
  geom_point(aes(y = read_median/100, group = names),
           col = "red",
           size = 2,
           alpha = .8)
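
To save the chart to disk rather than render it interactively, ggplot2's ggsave can be used (my addition; the file name and dimensions are arbitrary choices):

#save the last rendered plot to a PNG file
ggsave("file_formats_benchmark.png", width = 8, height = 5, dpi = 150)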

To put a bit more stress on the CPU, I created a data.frame with 1 million rows.
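
Only the row count changes; the rest of the script above can be re-run as-is (a sketch reusing the objects defined earlier):

#stress test: 1 million rows, same 30-column layout
nof_rows <- 1000000
test_df <- data.frame(
                replicate(10, sample(0:10000, nof_rows, replace = TRUE)),
                replicate(10, runif(nof_rows, 0.0, 1.0)),
                replicate(10, stri_rand_strings(nof_rows, 10)))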

When comparing file sizes, we can see that the CSV files all come in around 330 MiB, RData and RDS around 80 MiB, and Feather/Parquet around 140 MiB. So CSV is already more than 2x bigger than Parquet and roughly 4x bigger than the RData file.
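
Those ratios can be read straight off the files data frame, assuming it was rebuilt after the 1-million-row run:

#size of each file relative to the smallest one
setNames(round(files$size_mb / min(files$size_mb), 1), files$names)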

Looking at performance (median write/read times), we can see that Feather is by far the most efficient file format. Across the 10 runs, reading the complete dataset (1 million rows) took only 0.4 seconds, whereas CSV lagged behind significantly. Feather was also the fastest for write operations.

Exploring the CSV file format, I used three different R packages (base, readr, and data.table), and data.table significantly outperformed the other two.
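
Part of the reason is that data.table's fread and fwrite are multi-threaded; the thread count can be tuned (a sketch, the thread count here is my choice):

library(data.table)
setDTthreads(4)                   # use 4 threads for fread/fwrite
dt <- fread(file_csv_datatable)   # multi-threaded CSV read
fwrite(dt, file_csv_datatable)    # multi-threaded CSV write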

To conclude, storing data in the Feather format will give you the best read and write performance. If you want to save on storage space, use RData or RDS; these two save the most space. On the other hand, if you are stuck with the CSV file format, use data.table, as it is by far the fastest for data wrangling and data manipulation tasks.
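
If storage space matters most, note that saveRDS also accepts stronger compressors than the default gzip; xz trades write speed for an even smaller file (a sketch, the compressor choice is mine):

#smaller file, slower write: xz-compressed RDS
saveRDS(test_df, file = 'test_df_xz.rds', compress = "xz")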

As always, the complete code is available on GitHub, in the benchmark repository. In the same repository you will also find the Python code from the previous blog post; if you are looking only for the code described in this blog post, click here. Follow for more updates.

Happy R coding! And stay healthy!

Comments

  1. psyoskeptic says:

    It would be great if you added .zip to the CSV tests. R will read a zipped file directly with read.csv.


    • tomaztsql says:

      Hi,
      You are spot on. I would need to benchmark the zip/unzip process separately, but for uploading to the cloud, the ZIP format is also a very good solution.

      Best, Tomaž
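
      For completeness, a gzip round-trip needs no separate unzip step in R (a minimal sketch, assuming the test_df from the post):

      #readr compresses based on the file extension; base R reads .gz transparently
      readr::write_csv(test_df, 'test_df.csv.gz')
      df <- read.csv('test_df.csv.gz')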


  2. FS says:

    Hi Tomaž,
    Have you also experimented with the fst package (https://www.fstpackage.org/)?


  3. Big Dave says:

    You should check out the fst package, which blows all of these out of the water in terms of speed and compression.
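
    For reference, the fst API is minimal (a sketch, not benchmarked in this post; the compress level is an arbitrary choice):

    library(fst)
    write_fst(test_df, 'test_df.fst', compress = 50)   # compression level 0-100
    df <- read_fst('test_df.fst')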


  4. Emil ErikPulaBellamy Begtrup-Bright says:

    …and then there’s the qs package, which blows fst out of the water.


  5. Emil ErikPulaBellamy Begtrup-Bright says:

    It can save files as native R objects, too, just like RData but faster and more efficient. fst is fast, but it can’t save lists, for example.
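
    The qs API mirrors saveRDS/readRDS and handles arbitrary R objects, lists included (a sketch, not benchmarked in this post):

    library(qs)
    qsave(test_df, 'test_df.qs')   # serialize any R object
    df <- qread('test_df.qs')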

