data.table syntax 1: Assigning by reference

When arguing with my colleagues about how amazing data.table is, I sometimes bring up the beauty of the data.table syntax in screenshots such as the one below. Now I assume you can immediately appreciate the elegance of the conciseness – right after you have worked through the realization that, yes, this absolute mess is indeed one continuous line of code. A bit longer than the average German compound word, so not that long really. But before you leave me again here: it’s so MESSY that it should indeed only ever be used to ironically mock one’s favorite R package – not that my real code ever looks like this…

A very messy example of a single R data.table command – real friends don’t allow each other to code like this

The code works, but I’ll remain the only witness of that, because granted, who wants to try to work through that when you can literally do anything else instead? One of the big criticisms of data.table is that while its syntax is concise, it isn’t necessarily intuitive. But once you get the hang of it, it’s really not that difficult – as long as things don’t… escalate… So let’s start at the beginning – how to create your first data.table object.

Creating a data.table object

The most common way to get proteomics data into R is surely to read a file, e.g. a .csv created by a search engine or processing pipeline of choice. The corresponding command is called fread(), and I already used it in my second blog post to highlight its impressive speed when importing several gigabytes worth of data. This time, however, let’s create an artificial proteomics dataset with six columns of one million peptide intensity measurements each, in log10 space, all drawn from normal distributions via rnorm().

> #load data table library
> library(data.table)
> 
> #define number of desired rows
> n_rows <- 1e6
> 
> #create data table without column names
> dt <- data.table(rnorm(n_rows, 7.5, 1),
                   rnorm(n_rows, 8.25, 1),
                   rnorm(n_rows, 7, 0.25),
                   rnorm(n_rows, 7.75, 1.5),
                   rnorm(n_rows, 8, 1.25),
                   rnorm(n_rows, 7.25, 0.5))
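
As an aside, the file-based route mentioned above might look roughly like the sketch below. It is only an illustration: the toy table, its column names peptide and intensity, and the temporary file are made up here and stand in for a real search engine export.

```r
library(data.table)

# Hypothetical stand-in for a search engine export: a tiny table in a temp file
csv_path <- tempfile(fileext = ".csv")
toy <- data.table(peptide   = c("PEPTIDEK", "SAMPLER", "TESTINGK"),
                  intensity = c(7.1, 8.3, 6.9))
fwrite(toy, csv_path)

# fread() works out separator, header and column types on its own
dt_file <- fread(csv_path)
dt_file

# clean up the temporary file
unlink(csv_path)
```

The object returned by fread() is already a data.table, so everything that follows applies to imported data as well.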

In its simplest form, a data table can be created just like a data frame. (Importantly, every data table IS also a data frame – more on that later.) Typing dt <- data.table(column1 = vector1) will create a data table object called dt. Within the parentheses, one can define individual columns separated by commas, and optionally name them using “=” in the format column1 = vector1, column2 = vector2, … Since we did not name our columns, typing dt in the console will display our new object in the data.table default way: with row numbers on the left, default column names (V1 to V6) at the top, and only the first and last five rows shown.

> dt
               V1        V2       V3        V4        V5       V6
      1: 6.939524  7.241929 6.943532  6.946577  7.333815 7.469874
      2: 7.269823  9.604939 7.122192  6.049676  6.167697 7.427711
      3: 9.058708  7.781025 6.991703  8.969706  7.655564 7.359278
      4: 7.570508  9.718194 6.940954  7.884989  6.178652 7.494571
      5: 7.629288  8.692556 6.745723  8.151116  6.193122 7.331779
     ---                                                         
 999996: 7.407785 10.055196 6.893077 10.916839  8.593864 6.808222
 999997: 8.232436  8.569065 6.979251  7.006356  8.506607 7.437157
 999998: 7.393297  8.364103 6.865374 10.078586  7.607759 7.740510
 999999: 7.176292  6.876891 7.185694  7.506618  8.112996 7.087074
1000000: 9.325311  8.133197 6.451843  6.429322 11.483790 7.735583

Our very first data.table – this worked well, but adding intuitive column names to the data.table would indeed be nice. Let’s assume that our fake experiment consists of two conditions named control and treated, with three biological replicates each. And here, we recall that every data table object is also a data frame, meaning that functions written for data frames generally work on data.table objects too. So if we have our column names stored in a character vector called sample_val (control_1 to control_3, followed by treated_1 to treated_3), we could just type names(dt) <- sample_val, praise our work, and finish the tutorial here… We will do none of that, and dive instead into the fascinating world of assigning properties by reference.
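
For completeness, here is a minimal sketch of the data frame route we are about to skip, run on a tiny toy table so it finishes instantly. The object names toy and new_names and the two-column layout are assumptions made for this illustration only.

```r
library(data.table)

# Toy table with columns named at creation (names are illustrative only)
toy <- data.table(a = c(1.5, 2.5, 3.5), b = c(10, 20, 30))

# A data.table carries both classes, so data frame functions accept it
class(toy)          # "data.table" "data.frame"
is.data.frame(toy)  # TRUE

# The base R way of renaming: it works, but it is not the idiom of this post
new_names <- c("control_1", "treated_1")
names(toy) <- new_names
names(toy)
```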

The core principle: Assigning by reference

Historically, the most important reason why data.table is so fast is that it allows assigning properties by reference. Simplified, this means that an operation does not copy the entire data table within computer memory; instead, the result is written directly into the original data table. Skipping the step of copying the entire data content saves a lot of processing time and, incidentally, memory. Assigning by reference is done with a dedicated set of functions and operators, such as setnames(), setkey() and, most importantly, the := operator.

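> #define column names: two conditions with three replicates each
> sample_val <- c("control_1", "control_2", "control_3",
                  "treated_1", "treated_2", "treated_3")
> 
> #rename columns BY REFERENCE with column names sample_val
> setnames(dt, sample_val)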
> 
> #define peptide_ID column BY REFERENCE
> dt[, peptide_ID := 1:.N]
> 
> #set key for dt, which will sort dt by peptide_ID
> setkey(dt, peptide_ID)
> key(dt)
[1] "peptide_ID"

All of these assign a property by reference:

  • setnames assigns the character vector sample_val to the names attribute of dt
  • := assigns a new column called peptide_ID with values 1 to .N to dt
    • := is called within square brackets following a comma: dt[, ...]
      • := assigns the vector right of := to a column name on its left
      • := can overwrite a column, e.g.
        dt[, intensity := log10(intensity)]
    • .N is a special symbol: an integer holding the number of rows in dt
      • 1:.N will create a vector 1 to 1,000,000, i.e. one value per row of dt
  • setkey assigns column peptide_ID as the key of our data.table dt
    • Setting the key of a data table sorts it by the key column(s) and enables special operations
    • key() reports the current key column(s)
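
The points above can be checked on a table small enough to read at a glance. This is a sketch under assumptions: the three-row table toy and its values are made up, and peptide_ID is deliberately created in descending order so that the sorting effect of setkey() becomes visible.

```r
library(data.table)

# Small stand-in for dt; values are illustrative only
toy <- data.table(V1 = c(7.2, 6.8, 8.1), V2 = c(7.9, 8.4, 7.7))

setnames(toy, c("control_1", "treated_1"))  # rename in place
toy[, peptide_ID := .N:1]                   # new column, descending on purpose
toy[, control_1 := round(control_1)]        # overwrite an existing column

setkey(toy, peptide_ID)                     # reorders the rows by the key column
key(toy)
toy
```

After setkey(), the rows appear in ascending peptide_ID order, so the former last row now comes first.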

The line dt[, peptide_ID := 1:.N] particularly highlights how I perceive the beauty of data.table conciseness, once one understands its syntax *insert wise and knowing nod here*. The majority of operations that manipulate columns and rows of a data table happen within these square brackets, which start off with a comma: dt[, ...]. We’ll come back to that comma in the next tutorial, but for now we need to remember that := is the main way of assigning columns by reference in data.table: dt[, new_col := old_col].
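
One practical consequence of := working by reference is worth a small sketch: a plain <- assignment does not copy a data table, so both names see the change, whereas copy() gives an independent object. The names toy, alias, snapshot and addr_before are assumptions for this illustration; address() and copy() are data.table functions.

```r
library(data.table)

toy <- data.table(intensity = c(1e6, 1e7, 1e8))

alias    <- toy        # no copy: both names point to the same table
snapshot <- copy(toy)  # explicit, independent copy

addr_before <- address(toy)
toy[, intensity := log10(intensity)]  # overwrite in place

identical(address(toy), addr_before)  # TRUE: still the same object in memory
alias$intensity                       # 6 7 8: changed along with toy
snapshot$intensity                    # still the raw intensities
```

So whenever the untouched original is still needed later, copy() is the safe way to keep it.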

And with that, we are done for today. We created our first fake proteomics data table, one that is barely distinguishable from real-world data. And before we publish it, all that is left to do is to call dt again, and admire the result of our hard work.

> dt
         control_1 control_2 control_3 treated_1 treated_2 treated_3 peptide_ID
      1:  6.939524  7.241929  6.943532  6.946577  7.333815  7.469874          1
      2:  7.269823  9.604939  7.122192  6.049676  6.167697  7.427711          2
      3:  9.058708  7.781025  6.991703  8.969706  7.655564  7.359278          3
      4:  7.570508  9.718194  6.940954  7.884989  6.178652  7.494571          4
      5:  7.629288  8.692556  6.745723  8.151116  6.193122  7.331779          5
     ---                                                                       
 999996:  7.407785 10.055196  6.893077 10.916839  8.593864  6.808222     999996
 999997:  8.232436  8.569065  6.979251  7.006356  8.506607  7.437157     999997
 999998:  7.393297  8.364103  6.865374 10.078586  7.607759  7.740510     999998
 999999:  7.176292  6.876891  7.185694  7.506618  8.112996  7.087074     999999
1000000:  9.325311  8.133197  6.451843  6.429322 11.483790  7.735583    1000000
