data.table syntax 1: Assigning by reference

When arguing with my colleagues about how amazing data.table is, I sometimes bring up the beauty of the data.table syntax in screenshots such as the one below. Now I assume you can immediately appreciate the elegance of the conciseness – right after working through the realization that, yes, this absolute mess is indeed one continuous line of code. A bit longer than the average German compound word, so not that long really. But before you leave me here: it’s so MESSY that it should indeed only ever be used to ironically mock one’s own favorite R package – not that my real code ever looks like this…

A very messy example of a single R data.table command – real friends don’t let each other code like this

The code works, but I’ll remain the only witness of that, because, granted, who wants to work through that when you can literally do anything else instead? One of the big criticisms of data.table is that while its syntax is concise, it isn’t necessarily intuitive. But once you get the hang of it, it’s really not that difficult – as long as things don’t… escalate… So let’s start at the beginning – how to create your first data.table object.

Creating a data.table object

The most common way to load proteomics data is surely by loading e.g. a .csv file created by a search engine or processing pipeline of choice. The corresponding command is called fread() and I used it already in my second blog post to highlight its impressive speed performance at importing several gigabytes worth of data. This time however, let’s create an artificial proteomics dataset with six columns of one million peptide intensity measurements in log10 space each, all drawn from normal distributions via rnorm().
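For reference, loading such a file is a one-liner. A minimal sketch (writing a tiny temporary file first, so the example is self-contained – in practice this would be your search-engine report):

```r
# load the data.table library
library(data.table)

# write a small stand-in report to a temporary file
tmp_file <- tempfile(fileext = ".csv")
fwrite(data.table(peptide   = c("PEPTIDEA", "PEPTIDEB"),
                  intensity = c(1.2e7, 3.4e6)), tmp_file)

# fread() auto-detects the delimiter and column types
report <- fread(tmp_file)
print(report)
```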

> #load data table library
> library(data.table)
> #define number of desired rows
> n_rows <- 1e6
> #create data table without column names
> dt <- data.table(rnorm(n_rows, 7.5, 1),
                   rnorm(n_rows, 8.25, 1),
                   rnorm(n_rows, 7, 0.25),
                   rnorm(n_rows, 7.75, 1.5),
                   rnorm(n_rows, 8, 1.25),
                   rnorm(n_rows, 7.25, 0.5))

In its simplest form, a data table can be created just like a data frame. (Importantly, every data table IS also a data frame – more on that later.) Typing dt <- data.table(column1 = vector1) will create a data table object called dt. Within its brackets, one can define individual columns separated by commas, and optionally name them using “=” in the format column1 = vector1, column2 = vector2, … Since we did not name our columns, typing dt in the console will display our new object in the data table default way: with row numbers on the left, default column names at the top, and the first and last five rows shown.
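For illustration, a minimal named version (column names chosen arbitrarily here) would look like this:

```r
library(data.table)

# a small data.table with named columns
dt_named <- data.table(peptide   = c("PEPTIDEA", "PEPTIDEB", "PEPTIDEC"),
                       intensity = c(7.2, 8.1, 6.9))

# every data.table is also a data.frame
inherits(dt_named, "data.frame")  # TRUE
```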

> dt
               V1        V2       V3        V4        V5       V6
      1: 6.939524  7.241929 6.943532  6.946577  7.333815 7.469874
      2: 7.269823  9.604939 7.122192  6.049676  6.167697 7.427711
      3: 9.058708  7.781025 6.991703  8.969706  7.655564 7.359278
      4: 7.570508  9.718194 6.940954  7.884989  6.178652 7.494571
      5: 7.629288  8.692556 6.745723  8.151116  6.193122 7.331779
 999996: 7.407785 10.055196 6.893077 10.916839  8.593864 6.808222
 999997: 8.232436  8.569065 6.979251  7.006356  8.506607 7.437157
 999998: 7.393297  8.364103 6.865374 10.078586  7.607759 7.740510
 999999: 7.176292  6.876891 7.185694  7.506618  8.112996 7.087074
1000000: 9.325311  8.133197 6.451843  6.429322 11.483790 7.735583

Our very first data.table – this worked well, but adding intuitive column names to the data.table would indeed be nice. Let’s assume that our fake experiment consists of two conditions named control and treated, combined with three biological replicates each. And here, we recall that every data table object is also a data frame, meaning that all operations that work on data frames work on data.table objects too. So if we had our column names stored in a vector called sample_val, we could just type names(dt) <- sample_val, praise our work, and finish the tutorial here… But we will do none of that, and instead dive into the fascinating world of assigning properties by reference.

The core principle: Assigning by reference

Historically, the most important reason why data.table is so fast is that it allows assigning properties by reference. Simplified, this means that when performing an operation, instead of copying the entire data table within computer memory, the result of the operation is appended to the original data table. Skipping the step of copying the entire data content saves a lot of processing time – and, incidentally, memory. Assigning by reference is done using a dedicated set of commands and operators, such as setnames(), setkey() and, most importantly, the := operator.

> #define sample names: control_1 … control_3, treated_1 … treated_3
> sample_val <- paste(rep(c("control", "treated"), each = 3), 1:3, sep = "_")
> #rename columns BY REFERENCE with column names sample_val
> setnames(dt, sample_val)
> #define peptide_ID column BY REFERENCE
> dt[, peptide_ID := 1:.N]
> #set key for dt, which will sort dt by peptide_ID
> setkey(dt, peptide_ID)
> key(dt)
[1] "peptide_ID"

What all of these do, is that they assign a property by reference:

  • setnames assigns the character vector sample_val to the names attribute of dt
  • := assigns a new column called peptide_ID with values 1 to .N to dt
    • := is called within square brackets following a comma: dt[, ...]
      • := assigns the vector right of := to a column name on its left
      • := can overwrite a column, e.g.
        dt[, intensity := log10(intensity)]
    • .N is an integer variable that represents the number of rows in dt
      • 1:.N will create a vector 1 to 1,000,000, i.e. the length of dt
  • setkey assigns column peptide_ID as the key of our data.table dt
    • Setting the key of a data table sorts it by the column(s) and enables special operations
    • key() reports the current key column(s)

The line dt[, peptide_ID := 1:.N] particularly highlights how I perceive the beauty of data.table conciseness, once one understands its syntax *insert wise and knowing nod here*. The majority of operations that manipulate columns and rows of a data table happen within these square brackets, which start off with a comma: dt[, ...]. We’ll come back to that comma in the next tutorial, but for now we need to remember that := is the main way of assigning columns by reference in data.table: dt[, new_col := old_col].
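To see that := really modifies the table in place, one can compare memory addresses before and after the assignment, using data.table’s address() helper (the small table and column names here are made up for illustration):

```r
library(data.table)

dt_small <- data.table(x = 1:5)

# address() reports where the object lives in memory
before <- address(dt_small)

# add a column by reference – no copy of dt_small is made
dt_small[, y := x * 2]

after <- address(dt_small)
identical(before, after)  # TRUE: still the very same object in memory
```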

And with that, we are done for today. We created our first fake proteomics data table that is barely distinguishable from real world data. And before we publish it, all that is left to do is to call dt again, and admire the result of our hard work.

> dt
         control_1 control_2 control_3 treated_1 treated_2 treated_3 peptide_ID
      1:  6.939524  7.241929  6.943532  6.946577  7.333815  7.469874          1
      2:  7.269823  9.604939  7.122192  6.049676  6.167697  7.427711          2
      3:  9.058708  7.781025  6.991703  8.969706  7.655564  7.359278          3
      4:  7.570508  9.718194  6.940954  7.884989  6.178652  7.494571          4
      5:  7.629288  8.692556  6.745723  8.151116  6.193122  7.331779          5
 999996:  7.407785 10.055196  6.893077 10.916839  8.593864  6.808222     999996
 999997:  8.232436  8.569065  6.979251  7.006356  8.506607  7.437157     999997
 999998:  7.393297  8.364103  6.865374 10.078586  7.607759  7.740510     999998
 999999:  7.176292  6.876891  7.185694  7.506618  8.112996  7.087074     999999
1000000:  9.325311  8.133197  6.451843  6.429322 11.483790  7.735583    1000000

The age-old battle: tidyverse vs data.table

During my PhD, I took a course on statistical analysis in R – perfect for a complete newbie to R like myself. If you have used R before, you know that it gains much of its flexibility and functionality from packages: easily installable collections of functions that extend base R’s capabilities. Already in the first lesson, we were taught that certain calculations run faster with a package called data.table, which was the default for this course. Without realising it, I had been locked into one of the two teams, with no chance of stepping back…

Okay, that sounds dramatic – but there is no mercy in our internal Slack communications. If you take your own course in R, you will hopefully be taught that there is a much more approachable alternative to base R out there, called the tidyverse. This neat name represents a fantastic collection of packages that all work together to build a cohesive, easily understandable syntax for reproducible data analysis and visualization. Now, the tidyverse (with its data manipulation package dplyr) and data.table are not arch-enemies – they can actually work well together. In direct comparison, they simply have different strengths (adapted from this fantastic comparison):

base R               tidyverse (dplyr)                  data.table
+ stable             + intuitive verb syntax            + concise syntax
+ always available   + broad functionality and support  + very fast
– clunky             – can be verbose and lengthy       – difficult to interpret
– slow               – slower than data.table           – limited functionality scope

So, to be clear, I don’t think that one package is better than the other. I actually wish I had gotten started with the tidyverse. It’s intuitive and has a broad userbase with abundant support online. But I do think that there is one good argument to get familiar with data.table, especially when working in proteomics, and that is big data. Buzzword – check! But how big can big data be in proteomics?

Lately, I have been quite excited about the possibilities of applying data-independent acquisition (DIA) methods for measuring hundreds of samples in phospho-proteomics, the study of phosphorylated signalling peptides. And in this context, report files coming out of our data analysis pipelines can easily reach 15 GB or more! Even if I wanted to open such a file for some masochistic reason, Excel would flat-out refuse, and base R crashes the session. This is where data.table begins to shine.

To illustrate this, I performed a benchmark experiment. On my pretty decent workstation PC, I loaded a 4.3 GB DIA report (generated by DIA-NN) and performed a range of operations ten times each, using data.table and the tidyverse. I kept it simple, sticking to standard operations used every day on proteomics data: 1) loading the report, 2) filtering it on a numeric column, 3) calculating log10 intensities, 4) counting the number of peptides per sample and protein group, and 5) calculating the median protein log10 intensity per sample. Each operation was timed by the fantastic microbenchmark package, with the results illustrated below:
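The grouped-median part of the benchmark can be sketched roughly like this – on a small synthetic table rather than the 4.3 GB report, and with made-up column names:

```r
library(data.table)
library(dplyr)
library(microbenchmark)

# small synthetic stand-in for a DIA report
n  <- 1e5
dt <- data.table(protein   = sample(sprintf("P%04d", 1:500), n, replace = TRUE),
                 sample_id = sample(paste0("S", 1:6), n, replace = TRUE),
                 intensity = 10^rnorm(n, 7, 1))
df <- as.data.frame(dt)

# time the grouped median in both dialects, 10 runs each
microbenchmark(
  data.table = dt[, .(m = median(log10(intensity))), by = .(protein, sample_id)],
  dplyr      = df |>
    group_by(protein, sample_id) |>
    summarise(m = median(log10(intensity)), .groups = "drop"),
  times = 10
)
```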

[Figure: two bar plots showing how data.table outperforms tidyverse functions on big data. Left, absolute times: data.table needs ~25 seconds in total, ~23 of which are spent loading the data; the tidyverse spends ~83 seconds on loading plus another ~11 and ~24 seconds on grouped operations that data.table performs almost instantaneously. Right, relative speed-ups for data.table: ~4×, ~96× and ~120× for data loading, grouped counting and grouped median calculation, respectively.]
Grouped operations, expressed by the “by” operator in data.table, are sped up significantly compared to dplyr. The executed code with additional annotations is presented on my Github page.

As illustrated, importing the data into R for the first time is by far the most time-intensive step. Keeping the test dataset at 4.3 GB on a 32 GB RAM workstation ensured that at no point memory had to be swapped to the SSD. Although data.table already reduces this step from ~84 to ~23 seconds, it really gets to flex its muscles during grouped operations. While calculating the number of peptides per protein as well as their median intensities takes a quite noticeable ~11 and ~24 seconds in the tidyverse, respectively, data.table performs both tasks almost instantaneously. This might not seem like a big deal, but executing a whole document of tidyverse operations can easily keep R busy for an afternoon.

How does data.table manage these impressive jumps in speed? There are a number of reasons, but the two most important ones here are 1) its high memory efficiency due to performing operations by reference only (meaning the data is not copied in the process), and 2) its highly speed-optimized grouped operations. Taken together with its concise syntax, this makes data.table a fantastic choice for my DIA datasets – and I encourage you to try it too!
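As a concrete taste of such a grouped operation: counting peptides and computing the median intensity per protein is a single expression in data.table. A small sketch with made-up column names:

```r
library(data.table)

# toy long-format intensity table
dt <- data.table(protein         = c("P1", "P1", "P1", "P2", "P2"),
                 log10_intensity = c(7.1, 7.3, 7.2, 8.0, 8.4))

# one grouped pass: peptide count (.N) and median intensity per protein
summary_dt <- dt[, .(n_peptides = .N,
                     med_int    = median(log10_intensity)),
                 by = protein]
summary_dt
# P1: 3 peptides, median 7.2; P2: 2 peptides, median 8.2
```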

In my following blog entries, I plan to illustrate how I perform certain operations in data.table, including data parsing, normalization and neat visualizations such as heatmaps and volcano plots. Stay tuned!

Proteomics data analysis using R data.table

In a recent Twitter poll, the Young Proteomics Investigator Club (YPIC) asked which topic early-career researchers wanted to see at a proteomics educational day – and the answer was clear: an introduction to data analysis in R.

Of course, this was by no means a representative survey – and yet, it confirmed the exploding interest in analyzing data with powerful scripting and coding languages such as R and Python. The reasons for that are broad – in contrast to Excel, Perseus or other GUI-based analysis software, data analysis in R allows users to:

  • Generate reproducible code that can easily be rerun after slight adjustments of input data or upstream processing steps
  • Gain flexibility in their analysis tools, either by using one of the many powerful packages, code shared online, or by just scripting it yourself
  • Process big data that causes other software to crash or cannot even be opened

All of these were reasons why I finally forced myself to get familiar with R in the 2nd year of my PhD. Until then, I had traveled well with the fantastic statistical analysis software Perseus, developed by the Cox lab for the specific purpose of analyzing proteomics data. But after countless times of having to redo the same analysis steps after changing a single parameter; after countless screams of fury when too many chained calculations crashed the software; and after the final realization that I needed linear modeling for a project, I knew the time had come. I signed up for an intensive course that taught statistical analysis and visualization in R in an example-oriented manner. Little did I know that this decision would lock me onto a path of becoming a pawn in a battle of epic proportions…

Going forward, I hope to share insights here on my journey into proteomics analysis with R, with a specific focus on data table. Are there specific topics you would like to see addressed? Drop me a message via the contact form or on Twitter. And for now – stay tuned!