Comparing performances of CSV to RDS, Parquet, and Feather file formats in R

Posted on May 8, 2022 by tomaztsql — 9 Comments

From the previous blogpost:
– CSV or alternatives? Exporting data from SQL Server data to ORC, AVRO, Parquet, Feather files and store them into Azure data lake

we have created Azure blob storage, connected secure connection using Python and started uploading files to blob store from SQL Server. Alongside, we compared the performance of different file types. ORC, AVRO, Parquet, CSV and Feather. Coming to conclusion, CSV is great for its readability, but not suitable (as a file format) for all types of workloads.

We will be doing a similar benchmark with R language. The goal is to see, if CSV file format can be replaced by a file type that better, both in performance and storage.

Creating a random dataset (10K, 1MIO rows with 30 columns) we have tested the speed of writing and speed of reading from local machine. For the benchmark, we have used package microbenchmark. File formats were: CSV, Feather, Parquet and Rdata (or RDS).

This is the result of running 10 times each operation (writing and reading) for the 10K rows.

The test was performed with R version 4.1.2 and M1 MacOS. For the 10.000 rows (with 30 columns), I have the average CSV size 3,3MiB and Feather and Parquet circa 1,4MiB, and less than 1MiB for RData and rds R format. By far the best reads and write times the tests show a big advantage with Feather file format, following with Parquet and data.table package (reading from CSV file).

Test was conducted using a sample dataframe:

#file names
file_csv <- 'test_df.csv'
file_csv_readr <- 'test_df_readr.csv'
file_csv_datatable <- 'test_df_datatable.csv'
file_feather <- 'test_df.feather'
file_rdata <- 'test_df.RData'
file_rds <- 'test_df.rds'
file_parquet <- 'test_df.parquet'

files <- file.info(c('test_df.csv','test_df_readr.csv','test_df_datatable.csv', 'test_df.feather', 'test_df.RData', 'test_df.rds', 'test_df.parquet'))

test_df <- data.frame(
                replicate(10, sample(0:10000, nof_rows, rep = TRUE)),
                replicate(10, runif(nof_rows, 0.0, 1.0)),
                replicate(10, stri_rand_strings(1000, 10))
                 )

and running benchmark:

benchmark_write <- data.frame(summary(microbenchmark(
          "test_df.csv"     = write.csv(test_df, file = file_csv),
          "test_df_readr.csv"     = readr::write_csv(test_df, file = file_csv_readr),
          "test_df_datatable.csv"     = data.table::fwrite(test_df, file = file_csv_datatable),
          "test_df.feather" = write_feather(test_df, file_feather),
          "test_df.parquet" = write_parquet(test_df, file_parquet),
          "test_df.rds"     = save(test_df, file = file_rdata),
          "test_df.RData"   = saveRDS(test_df, file_rds), 
  times = nof_repeat)))

Same was repeated for reading operation and results were merged:

benchmark_read <- data.frame(summary(microbenchmark(
                            "test_df.csv" = read.csv(file_csv),
                            "test_df_readr.csv" = readr::read_csv(file_csv_readr),
                            "test_df_datatable.csv" = data.table::fread(file_csv_datatable),
                            "test_df.rds"  = load(file_rdata),
                            "test_df.RData" = readRDS(file_rds),
                            "test_df.feather" = arrow::read_feather(file_feather),
                            "test_df.parquet" = read_parquet(file_parquet), times = nof_repeat)))

colnames(benchmark_read) <- c("names", "read_min", "read_lq", "read_mean", "read_median", "read_uq", "read_max", "read_repeat")


#merge results and create factors
results <- inner_join(inner_join(benchmark_read, files, by = "names"), benchmark_write, by = "names")
results <- results[,c("names","size_mb","read_min", "read_max", "read_median","write_min","write_max", "write_median")]

And finally, have plotted the results for the graph above:

library(glue)
library(ggtext)

title_lab_adjusted <- glue::glue(
  "File types comparison on<br><span style = 'color:red;'>read operation</span> and <br><span style='color:darkgreen';>write operation</span>") 

ggplot(results, aes(x=names, y=size_mb)) + 
     geom_bar(stat="identity", fill="lightblue") +
     geom_text(aes(label=paste0(format(round(size_mb, 2), nsmall = 2), " MiB", collapse=NULL)), vjust=-0.3, size=3.5)+
     theme(axis.text.x = element_text(angle = 45, hjust = 1.3)) +
     coord_cartesian(ylim = c(0, 5), expand = F) +
     scale_y_continuous(breaks = seq(0, 5, 1),labels = scales::label_comma(accuracy = 1)) +
    theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(size = 0.1, linetype = "solid", colour = "grey50")) +
    ylab(label = 'Time (sec.) + File_Size') +  xlab(label = 'Files') +
  labs(title = title_lab_adjusted) +
  theme(
    plot.title = element_markdown(),
    panel.background = element_rect(color = NA, fill = 'white')) +

  geom_point (aes(y=write_median/100, group=names),
        col = "darkgreen",
        size = 2,
        stat ="identity",
        alpha=.8) +
  geom_point(aes(y=read_median/100, group=names),
           col = "red",
           size = 2,
           stat ="identity",
           alpha=.8 )

To do a little bit more stress on the CPU, I have created a data.frame with 1Mio rows.

When comparing file size, we can see that CSV files are all around 330MiB in size, RData and RDS around 80MiB and Feather/Parquet around 140MiB. So CSV is already 2x bigger than Parquet and 4x bigger than RData file.

Looking into performance (median for write/read), we can see Feather is by far the most efficient file format. out of 10 runs, reading the complete dataset (1Mio rows), took only 0.4 seconds, whereas CSV was lagging behind significantly. With write operations, Feather’s file format was also the fastest.

Exploring the CSV file format, I have used three different R packages (base, readr and data.table), and data.table outperformed the other two significantly.

To conclude, storing data in feather format will yield performance statistics. If you want to save on storage space, use RData or rds, these two will save the most space. On the other side, if you are stuck with CSV file format, use data.table, as it is by far the fastest for data wrangling and data manipulation tasks.

As always, complete code is available at Github, in the benchmark repository. In the same repository, you will find also python code from the previous blog post, but if you are looking only for code, that is described in this blog post, click here. Follow up for more updates.

Happy R coding! And stay healthy!

Tagged with: arrow, benchmark, CSV, feather, file compare, Parquet, R, Spark
Posted in Uncategorized

9 comments on “Comparing performances of CSV to RDS, Parquet, and Feather file formats in R”

Comparing performances of CSV to RDS, Parquet, and Feather file formats in R | R-bloggers says:

May 9, 2022 at 8:10 am

[…] article was first published on R – TomazTsql, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) […]

LikeLike

Reply
psyoskeptic says:

May 9, 2022 at 7:15 pm

It would be great if you added .zip to the csv file. R will just read in the zip file using read.csv.

LikeLike

Reply
- tomaztsql says:
  
  May 10, 2022 at 6:45 am
  
  Hi,
  You are spot on. I would need to benchmark zip/unzip process separately, but for uploading to the cloud, ZIP format is also very good solution.
  
  Best, Tomaž
  
  LikeLiked by 1 person
  
  Reply
FS says:

May 10, 2022 at 8:17 am

Hi Tomaž,
Have you also experimented with the fst package (https://www.fstpackage.org/)?

LikeLiked by 1 person

Reply
Big Dave says:

May 10, 2022 at 12:03 pm

You should check out fst package which blows all of these out the water in terms of speed and compression.

LikeLiked by 1 person

Reply
Emil ErikPulaBellamy Begtrup-Bright says:

May 10, 2022 at 12:23 pm

…and then there’s the qs package which blows fst out if the water.

LikeLiked by 2 people

Reply
Emil ErikPulaBellamy Begtrup-Bright says:

May 10, 2022 at 12:24 pm

It can save files in native r objects, too, just like rda but faster and more efficient. Fast is good but it can’t save lists, for example.

LikeLiked by 1 person

Reply
D. says:

June 12, 2022 at 10:53 pm

Hi! The arrow package works with lazy evaluations, so there is actually NO data in memory after you run read_feather function. To check it, just add any manipulation to the benchmarked formula (head() for example) or run an another benchmark for head(test_df.csv), head(test_df.feather) etc. and you’ll see that there are no miracles with the feather.

LikeLike

Reply
Shiny and Arrow | R-bloggers says:

June 28, 2022 at 8:12 am

[…] that same data were stored in a typical delimited file. From our experience – and also backed by this great blog post by Tomaž Kaštrun – .parquet typically comes in at a little less than half the size of a data-equivalent .csv; […]

LikeLike

Reply

	tomaztsql on Retrieving user access list to…
	Paola A Zambrano on Retrieving user access list to…
	“Reverse Hello… on Little useless-useful R functi…
	Max Petter on Using R and Python in Microsof…
	detlef kissel on Using R and Python in Microsof…

Comparing performances of CSV to RDS, Parquet, and Feather file formats in R

Share this:

Related

9 comments on “Comparing performances of CSV to RDS, Parquet, and Feather file formats in R”

Leave a comment Cancel reply