The age-old battle: tidyverse vs data.table

During my PhD, I took a statistical analysis in R course – perfect for a complete newbie to R like myself. If you have used R before, you know that it gains a lot of its flexibility and functionality from packages: easily installable collections of functions that extend base R capabilities. Already in the first lesson, we were taught that certain calculations run faster with a package called data.table, which was the default for this course. Without realising it, I had been locked into one of the two teams, with no chance of stepping back…

Okay, that sounds dramatic – but there is no mercy in our internal Slack communications. If you take your own course in R, you will hopefully be taught that there is a much more approachable alternative to base R out there, called the tidyverse. This neat name represents a fantastic collection of packages that all work together to build a cohesive, easily understandable syntax for reproducible data analysis and visualization. Now, the tidyverse (with its data manipulation package dplyr) and data.table are not arch-enemies, and they can work well together. In direct comparison, they simply have different strengths (adapted from this fantastic comparison):

base R | tidyverse (dplyr) | data.table
+ stable | + intuitive verb syntax | + concise syntax
+ always available | + broad functionality and support | + very fast
– clunky | – can be verbose and lengthy | – difficult to interpret
– slow | – slower than data.table | – limited functionality scope

So, to be clear, I don’t think that one package is better than the other. I actually wish I had gotten started with the tidyverse. It’s intuitive and has a broad user base with abundant support online. But I do think that there is one good argument for getting familiar with data.table, especially when working in proteomics, and that is big data. Buzzword – check! But how big can big data be in proteomics?

Lately, I have been quite excited about the possibilities of applying data-independent acquisition (DIA) methods for measuring hundreds of samples in phospho-proteomics, the study of phosphorylated signalling peptides. In this context, report files coming out of our data analysis pipelines can easily be 15 GB or more! Even if I wanted to for some masochistic reason, Excel would just flat-out refuse to open such a file, and base R causes session crashes. This is where data.table begins to shine.

To illustrate this, I performed a benchmark experiment. On my pretty decent workstation PC, I loaded a 4.3 GB DIA report (generated by DIA-NN) and performed a range of operations using data.table and the tidyverse. I tried to keep it simple, sticking to standard operations used every day for proteomics data: 1) loading the report, 2) filtering it based on a numeric column, 3) calculating log10 intensities, 4) counting the number of peptides per sample and protein group and 5) calculating the median protein log10 intensity per sample. Each operation was performed ten times and timed by the fantastic microbenchmark package, with the results illustrated below:

This figure shows two plots that together illustrate how data.table outperforms tidyverse functions in speed for big data. The left bar plot shows that data.table is mostly limited by loading the data, which takes 23 of the 25 seconds it needs in total. The tidyverse spends 83 seconds on this step and another 11 and 24 seconds on the two grouped operations, which data.table performs almost instantaneously. The right bar plot shows the relative speed-up for data.table: 4, 96 and 120 times for data loading, grouped counting and grouped median calculation, respectively.
Grouped operations, expressed through the “by” argument in data.table, are sped up significantly compared to dplyr. The executed code with additional annotations is presented on my GitHub page.
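
For readers who have never seen data.table syntax, the five benchmarked steps can be sketched on a toy table roughly as follows. This is a minimal illustration, not the benchmark code itself: the column names (sample_id, protein_group, peptide, q_value, intensity) and the 0.01 cutoff are hypothetical stand-ins, and a real DIA-NN report uses its own column names.

```r
library(data.table)

# Toy stand-in for a DIA report; all column names are hypothetical.
toy <- data.table(
  sample_id     = rep(c("S1", "S2"), each = 4),
  protein_group = rep(c("P1", "P1", "P2", "P2"), times = 2),
  peptide       = rep(c("pepA", "pepB", "pepC", "pepD"), times = 2),
  q_value       = c(0.001, 0.20, 0.005, 0.002, 0.003, 0.004, 0.50, 0.001),
  intensity     = c(1e4, 1e5, 1e6, 1e3, 1e5, 1e7, 1e2, 1e4)
)
report_path <- tempfile(fileext = ".tsv")
fwrite(toy, report_path, sep = "\t")

# 1) Load the report
dt <- fread(report_path)

# 2) Filter on a numeric column (the cutoff is an arbitrary choice here)
dt <- dt[q_value <= 0.01]

# 3) Add log10 intensities as a new column
dt[, log10_intensity := log10(intensity)]

# 4) Count peptides per sample and protein group
peptide_counts <- dt[, .(n_peptides = uniqueN(peptide)),
                     by = .(sample_id, protein_group)]

# 5) Median log10 intensity per sample and protein group
protein_medians <- dt[, .(median_log10 = median(log10_intensity)),
                      by = .(sample_id, protein_group)]

print(peptide_counts)
print(protein_medians)
```

Steps 4 and 5 are the grouped operations: the expression in the middle is evaluated once for every combination of the columns listed in “by”.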

As illustrated, importing the data into R for the first time is by far the most time-intensive step. Keeping the test dataset at 4.3 GB on a workstation with 32 GB of RAM ensured that memory never had to be swapped out to the SSD. Although data.table already reduces this step from ~84 to ~23 seconds, it really flexes its muscles during grouped operations. While counting the peptides per protein and calculating their median intensities take a quite noticeable ~11 and ~24 seconds in the tidyverse, respectively, data.table performs both tasks almost instantaneously. This might not seem like a big deal, but executing a whole document of tidyverse operations can easily keep R busy for an afternoon.
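
A comparison of this kind can be set up with microbenchmark along the following lines. This is only a sketch on simulated data with hypothetical column names; a table this small will not reproduce the timings reported above, and the gap typically only becomes obvious on much larger inputs.

```r
library(data.table)
library(dplyr)
library(microbenchmark)

# Simulated stand-in for a report; names and sizes are arbitrary assumptions.
set.seed(1)
n <- 1e5
df <- data.frame(
  sample_id       = sample(paste0("S", 1:20), n, replace = TRUE),
  protein_group   = sample(paste0("P", 1:500), n, replace = TRUE),
  log10_intensity = rnorm(n, mean = 5)
)
dt <- as.data.table(df)

# Time the grouped median in both dialects, ten evaluations each
timings <- microbenchmark(
  dplyr = df %>%
    group_by(sample_id, protein_group) %>%
    summarise(median_log10 = median(log10_intensity), .groups = "drop"),
  data.table = dt[, .(median_log10 = median(log10_intensity)),
                  by = .(sample_id, protein_group)],
  times = 10
)
print(timings)
```

The printed summary gives the minimum, median and maximum run time per expression, which is more informative than a single timed run.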

How does data.table manage these impressive jumps in speed? There are a number of reasons, but the two most important ones here are 1) its high memory efficiency, which comes from performing operations by reference (meaning the data is not copied in the process), and 2) its highly speed-optimized grouped operations. Taken together with its concise syntax, this makes data.table a fantastic choice for my DIA datasets – and I encourage you to try it too!
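
What “by reference” means in practice can be seen in a small sketch like the one below (the column names are again made up for illustration). The := operator adds or changes columns in the existing table instead of building a modified copy, which is also why a plain assignment to a second name does not give you an independent table.

```r
library(data.table)

dt <- data.table(intensity = c(1e3, 1e4, 1e5))
addr_before <- address(dt)

# := adds the column in place, so the table should stay at the same address
dt[, log10_intensity := log10(intensity)]
addr_after <- address(dt)
print(identical(addr_before, addr_after))

# A plain assignment only creates a second name for the same table,
# so a column added through the alias also appears in dt
alias <- dt
alias[, flag := TRUE]
print("flag" %in% names(dt))

# copy() creates an independent table; removing the column there leaves dt alone
independent <- copy(dt)
independent[, flag := NULL]
print("flag" %in% names(dt))
```

The flip side of this efficiency is that copy() has to be called explicitly whenever the original table should stay untouched.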

In my following blog entries, I plan to illustrate how I perform certain operations in data.table, including data parsing, normalization and neat visualizations such as heatmaps and volcano plots. Stay tuned!