Performance comparison of converting list to data.frame with R language

When you work with large datasets, performance quickly becomes a concern, especially when converting data from one type to another. Choosing the right method can make a huge difference.

So in this post, I will create a dummy list and convert its values into a data.frame.

A simple function creates a large list (approx. 46 MB with 250,000 elements, where each element consists of 10 measurements):

# helper: a vector of `len` uniform random values between start and end, rounded to 8 decimals
cre_l <- function(len, start, end) {
  round(runif(len, start, end), 8)
}

myl2 <- list()
# 250,000 elements is approx. 46 MB in size; 2,500 elements are used for the demo
for (i in 1:2500) { myl2[[i]] <- cre_l(10, 0, 50) }
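
As a quick sanity check of the size claim (my addition; with the 2,500-element demo list this reports only a few hundred KB, while the full 250,000-element list comes out to roughly the 46 MB stated above):

length(myl2)                               # 2500 in this demo
format(object.size(myl2), units = "MB")    # list size in memory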

The list will be transformed into a data.frame with 10 (ten) variables and a number of observations equal to the length of the list. To put this in perspective, the following code does exactly that:

for (i in 1:2500) { myl2[[i]] <- cre_l(10, 0, 50) }
df <- data.frame(do.call(rbind, myl2))

And you end up going from a list to a data.frame:

Fig 1: from List to Data.frame
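
A quick check of the resulting data.frame (using the df built above) confirms the shape:

dim(df)     # 2500 rows (one per list element), 10 columns
names(df)   # X1 .. X10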

There are many ways to convert a list to a data.frame, but the choice becomes important when your list is a large object. I have written 8 ways to do the conversion (and I know there are at least 20 more).

By far the fastest methods were the do.call and sapply approaches, both outperforming all the others with the following snippets:

# do.call: rbind all list elements into a matrix, then convert
data.frame(do.call(rbind, myl2))
# sapply: simplifies to a 10 x n matrix, so it needs a transpose before converting
data.frame(t(sapply(myl2, c)))

Both methods were consistent with larger list conversions.
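
As a sanity check (my addition, not part of the original benchmark), both snippets produce the same data.frame:

sol_docall <- data.frame(do.call(rbind, myl2))
sol_sapply <- data.frame(t(sapply(myl2, c)))
all.equal(sol_docall, sol_sapply)   # TRUE: same values, same X1..X10 names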

And the worst were the for loop solution, Reduce, and as.data.frame. No surprises here, but it is worth pausing on why the for loop performed so poorly: it constantly row-binds to an existing data.frame, which copies the whole object on every iteration.
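
A minimal sketch (my addition, not one of the eight benchmarked snippets) of why this matters and what the usual fix looks like, preallocating instead of growing:

# quadratic: each rbind copies every row accumulated so far
slow <- NULL
for (i in seq_along(myl2)) {
  slow <- rbind(slow, data.frame(t(myl2[[i]])))
}

# linear: preallocate the full matrix once and fill it row by row
m <- matrix(NA_real_, nrow = length(myl2), ncol = 10)
for (i in seq_along(myl2)) {
  m[i, ] <- myl2[[i]]
}
fast <- data.frame(m)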

Complete comparison and graph code:

library(data.table)
library(plyr)
library(ggplot2) 

# benchmark all eight conversions, 10 runs each; unit = "s" reports times in seconds
res <- summary(microbenchmark::microbenchmark(
  do_call_solution = {
    sol1 <- data.frame(do.call(rbind, myl2))
  },
  for_loop_solution = {
    sol2 <- NULL
    for (i in seq_along(myl2)) { sol2 <- rbind(sol2, data.frame(t(unlist(myl2[i])))) }
  },
  ldply_to_df = {
    sol3 <- ldply(myl2)        # default .fun row-binds the vectors
  },
  ldply_to_c = {
    sol4 <- ldply(myl2, c)     # note: c, not c() (c() evaluates to NULL)
  },
  sapply = {
    sol5 <- data.frame(t(sapply(myl2, c)))
  },
  reduce = {
    sol6 <- data.frame(Reduce(rbind, myl2))
  },
  data_table_rbindlist = {
    sol7 <- data.frame(t(rbindlist(list(myl2))))   # myl2 becomes one wide table, hence t()
  },
  as_data_frame = {
    sol8 <- data.frame(t(as.data.frame(myl2)))
  },
  times = 10L, unit = "s"))

# producing the graph (res$mean is already in seconds thanks to unit = "s")
ggplot(res, aes(x = expr, y = mean)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  coord_flip() +
  labs(title = "Performance comparison",
       subtitle = "Converting a list with 2,500 elements to data.frame") +
  xlab("Methods") + ylab("Conversion time (s)") +
  theme_light() +
  geom_text(aes(label = round(mean, 3)))

Fig 2: Comparison results for the different conversion methods on a list with 2,500 elements

I also removed the slowest-performing conversions, created a 250,000-element list, and compared only the fastest methods over 10 consecutive runs (using the microbenchmark library), reporting mean values.
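
A sketch of that rerun (my reconstruction; it assumes the cre_l helper from above and keeps only the two fastest candidates as an example):

# rebuild the list at full size (~46 MB)
myl2 <- lapply(1:250000, function(i) cre_l(10, 0, 50))

res_big <- summary(microbenchmark::microbenchmark(
  do_call_solution = data.frame(do.call(rbind, myl2)),
  sapply           = data.frame(t(sapply(myl2, c))),
  times = 10L, unit = "s"))
res_big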

Fig 3: Comparing the fastest methods on a list with 250,000 elements

So, using for loops is super slow, and do.call with rbind or sapply will reliably deliver the best performance.

As always, the code is available on GitHub in the Useless_R_function repository, and the file is here.

Happy R-coding!

5 comments on “Performance comparison of converting list to data.frame with R language”
  1. frenchstick says:

    Your data.table solution is far from being optimised.
    Try `sol9 <- setDT(transpose(myl2))`
    It beats all your other solutions on my machine with either 2,500 or 250,000 elements.


  2. Tony says:

    I think frenchstick has the fastest solution overall, but this one uses only base R and gets very close (and still beats all other solutions):
    sol10 <- as.data.frame(matrix(unlist(myl2), ncol = 10L, byrow = TRUE))


  3. Nice to see the creativity in the different approaches! As for timings, I have used microbenchmark as well and really liked it. But since I became aware of bench, I haven’t looked back. One reason is that bench::mark() is more explicit about garbage collection, which can have a huge impact on timings.

