C Thinking Big: Data Tables

Data frames are core elements for data handling in R. However, they suffer from several limitations. One of the major issues with data frames is that they are memory hungry and slow. This is not a problem when working with relatively small datasets (say, up to 100,000 rows). However, when your dataset size exceeds gigabytes, data frames may be infeasibly slow and too memory hungry.

C.1 Background: Passing By Value And Passing By Reference

R is (mostly) a pass-by-value language. This means that when you modify the data, at every step a new copy of the complete modified object is created and stored in memory, and the former object is freed (garbage-collected) if it is not in use any more.

The main advantage of this approach is consistency: we have the guarantee that functions do not modify their inputs. However, for large objects, copying may be slow, and moreover, it requires at least twice as much memory before the old object is freed. For more complex processing pipelines, the memory consumption may exceed twice the size of the original object.

Data tables implement a number of pass-by-reference functions. In pass-by-reference, the function is not given a fresh copy of the inputs, but is instead told where the object is in memory. Instead of copying gigabytes of data, only a single tiny memory pointer is passed. But this also means the function is now accessing and modifying the original object, not a copy of it. This may sometimes lead to bugs and unexpected behavior, but careful use of the pass-by-reference approach can improve speed and lower the memory footprint substantially.

C.2 Data Tables: Introduction

Data tables and most of the related goodies live in the data.table library, so you either have to load the library or specify the namespace when using its functions.

C.2.1 Replacement for Data Frames (Sort of)

Data tables are designed to be largely a replacement for data frames. The syntax is similar and they are largely interchangeable. For instance, we can create and play with a data table as follows:
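
A sketch of code that could produce output like the one below (the random draws will differ between runs):

```r
library(data.table)

# a small toy table: an id column and two random numeric columns
dt <- data.table(id = 1:5, x = rnorm(5), y = runif(5))
dt
```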

##    id          x         y
## 1:  1 -0.1759081 0.3587206
## 2:  2 -1.0397919 0.7036122
## 3:  3  1.3415712 0.6285581
## 4:  4 -0.7327195 0.6126032
## 5:  5 -1.7042843 0.7579853

The result looks almost identical to a similar data frame (the only difference is the colons after the row numbers). Behind the scenes these objects are almost identical too: both are lists of vectors. This structural similarity allows data tables to be used as drop-in replacements for data frames, at least in some circumstances. For instance, we can extract variables with $:
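
For instance (assuming the table is called dt, as in the creation example):

```r
dt$x   # extract column x as a vector, data-frame style
```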

## [1] -0.1759081 -1.0397919  1.3415712 -0.7327195 -1.7042843

or rows with row indices:
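
The output below is consistent with extracting the second and fourth rows, e.g.:

```r
dt[c(2, 4),]   # positional row indexing, as with data frames
```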

##    id          x         y
## 1:  2 -1.0397919 0.7036122
## 2:  4 -0.7327195 0.6126032

However, data tables use unquoted variable names (like dplyr) by default:
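
For example, the column can be named directly in j, with no quotes and no dt$ prefix:

```r
dt[, x]   # returns the column as a vector
```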

## [1] -0.1759081 -1.0397919  1.3415712 -0.7327195 -1.7042843

If the variable name is itself stored in another variable, we have to use the additional argument with:
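
A sketch, storing the column name in a (hypothetical) variable var:

```r
var <- "x"
dt[, var, with = FALSE]   # returns a one-column data.table, not a vector
```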

##             x
## 1: -0.1759081
## 2: -1.0397919
## 3:  1.3415712
## 4: -0.7327195
## 5: -1.7042843

Note also that instead of getting a vector, we now get a data.table with a single column x. This difference in behavior is the main reason why replacing data frames with data tables may require changing quite a bit of code.

C.2.2 Fast Reading and Writing

Many data frame users will appreciate the fact that the data input-output functions fread and fwrite run at least an order of magnitude faster on large files. These are largely replacements for read.table and write.table, although their syntax differs noticeably in places. In particular, fread accepts either a file name, a URL, or a shell command that prints output; it automatically detects the column separator, but it does not automatically open compressed files. The latter is not a big deal on Unix, where one can just issue
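
something like the following sketch (the file name is hypothetical; fread's cmd argument runs a shell command and reads its standard output):

```r
library(data.table)

# decompress on the fly and read the resulting csv stream
dt <- fread(cmd = "gzip -dc data.csv.gz")
```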

However, decompression is not that simple on Windows, and hence it is hard to write platform-independent code that opens compressed files.2

If your computer has enough memory and speed is not an issue, your interest in data tables may end here. You can just transform a data table into a data frame with setDF (and the other way around with setDT). Let’s transform our data table into a data frame:
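
The conversion is a single call (continuing with the dt table from above):

```r
setDF(dt)   # converts in place: no assignment needed
dt
```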

##   id          x         y
## 1  1 -0.1759081 0.3587206
## 2  2 -1.0397919 0.7036122
## 3  3  1.3415712 0.6285581
## 4  4 -0.7327195 0.6126032
## 5  5 -1.7042843 0.7579853

Do you see that the colons after the row numbers are gone? This means dt is now a data frame.

Note that this function behaves very differently from what we have learned earlier: it modifies the object in place (by reference). We do not have to assign the result to a new variable using a construct like df <- setDF(dt) (but we still can write it like this, which is handy when using magrittr pipes). This is a manifestation of the power of data tables: the object is not copied; the same object is modified in memory instead. setDF and setDT are very efficient: even huge tables are converted instantly, with virtually no additional memory needed.

However, great power comes hand-in-hand with great responsibility: it is easy to forget that setDF modifies its argument in place.

C.3 Indexing: The Major Workhorse of Data Tables

Data tables’ indexing is much more powerful than that of data frames. The single-bracket indexing is a powerful (albeit confusing) set of functions. Its general syntax is as follows:
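
Schematically:

```r
dt[i, j, by]
```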

where i specifies what to do with rows (for instance, select certain rows), j tells what to do with columns (such as select columns, compute new columns, aggregate columns), and by contains the grouping variables.

Let’s demonstrate this with the flights data from the nycflights13 package. We load the data and transform it into a data.table:
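
A sketch of the loading code; the table name flights.dt is our assumption, chosen to match the examples below:

```r
library(data.table)
library(nycflights13)

flights.dt <- as.data.table(flights)
head(flights.dt)
```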

##    year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013     1   1      517            515         2      830
## 2: 2013     1   1      533            529         4      850
## 3: 2013     1   1      542            540         2      923
## 4: 2013     1   1      544            545        -1     1004
## 5: 2013     1   1      554            600        -6      812
## 6: 2013     1   1      554            558        -4      740
##    sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1:            819        11      UA   1545  N14228    EWR  IAH      227
## 2:            830        20      UA   1714  N24211    LGA  IAH      227
## 3:            850        33      AA   1141  N619AA    JFK  MIA      160
## 4:           1022       -18      B6    725  N804JB    JFK  BQN      183
## 5:            837       -25      DL    461  N668DN    LGA  ATL      116
## 6:            728        12      UA   1696  N39463    EWR  ORD      150
##    distance hour minute           time_hour
## 1:     1400    5     15 2013-01-01 05:00:00
## 2:     1416    5     29 2013-01-01 05:00:00
## 3:     1089    5     40 2013-01-01 05:00:00
## 4:     1576    5     45 2013-01-01 05:00:00
## 5:      762    6      0 2013-01-01 06:00:00
## 6:      719    5     58 2013-01-01 05:00:00

C.3.1 i: Select Observations

Obviously, we can always just tell which observations we want:
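
For instance (assuming the table is called flights.dt):

```r
flights.dt[1:3,]   # positional indexing: rows 1 through 3
```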

##    year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013     1   1      517            515         2      830
## 2: 2013     1   1      533            529         4      850
## 3: 2013     1   1      542            540         2      923
##    sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1:            819        11      UA   1545  N14228    EWR  IAH      227
## 2:            830        20      UA   1714  N24211    LGA  IAH      227
## 3:            850        33      AA   1141  N619AA    JFK  MIA      160
##    distance hour minute           time_hour
## 1:     1400    5     15 2013-01-01 05:00:00
## 2:     1416    5     29 2013-01-01 05:00:00
## 3:     1089    5     40 2013-01-01 05:00:00

picks the first three lines from the data. Maybe more interestingly, we can use the special variable .N (the number of rows) to get the penultimate row:
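
Continuing with the (assumed) flights.dt table:

```r
flights.dt[.N - 1]   # .N is the row count, so .N - 1 is the penultimate row
```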

##    year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013     9  30       NA           1159        NA       NA
##    sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1:           1344        NA      MQ   3572  N511MQ    LGA  CLE       NA
##    distance hour minute           time_hour
## 1:      419   11     59 2013-09-30 11:00:00

We can select observations with a logical index vector in the same way as in data frames:
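
A sketch consistent with the output below (the printed row count suggests only the first three matches are shown):

```r
head(flights.dt[origin == "EWR" & dest == "SEA"], 3)
```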

##    year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013     1   1      724            725        -1     1020
## 2: 2013     1   1      857            851         6     1157
## 3: 2013     1   1     1418           1419        -1     1726
##    sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1:           1030       -10      AS     11  N594AS    EWR  SEA      338
## 2:           1222       -25      UA   1670  N45440    EWR  SEA      343
## 3:           1732        -6      UA     16  N37464    EWR  SEA      348
##    distance hour minute           time_hour
## 1:     2402    7     25 2013-01-01 07:00:00
## 2:     2402    8     51 2013-01-01 08:00:00
## 3:     2402   14     19 2013-01-01 14:00:00

will create a new data table including only flights from Newark to Seattle. However, note that we just use origin, and not flights$origin, as was the case with data frames. Data tables evaluate their arguments as if inside a with() call.

The first, integer-based indexing corresponds to dplyr’s slice function, while logical indexing is equivalent to filter.

C.3.2 j: Work with Columns

j is perhaps the most powerful (and most confusing) of all arguments for data table indexing. It allows one both to select columns and to perform more complex computations. Let’s start with selection:
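
For instance (assuming flights.dt as above; the three printed values match the first three rows):

```r
flights.dt[1:3, dest]   # unquoted column name in j returns a vector
```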

## [1] "IAH" "IAH" "MIA"

selects only the dest variable from the data. Note that this results in a vector, not a single-variable data table. If you want the latter, you can do
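
the following (wrapping the name in .() keeps the result a data.table):

```r
flights.dt[1:3, .(dest)]
```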

##    dest
## 1:  IAH
## 2:  IAH
## 3:  MIA

.() is just an alias for list(), spelled differently in data tables to improve readability and make it easier to type. If we want to select more than one variable, we can use the latter syntax:
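
For example:

```r
flights.dt[1:3, .(origin, dest)]
```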

##    origin dest
## 1:    EWR  IAH
## 2:    LGA  IAH
## 3:    JFK  MIA

Selection supports a number of goodies, such as ranges of variables with : (for instance, dep_time:arr_delay) and excluding variables with ! or - (for instance, -year).

Obviously we can combine both i and j: let’s select origin and departure delay for flights to Seattle:
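
A sketch (the printed row count suggests only the first three matches are shown):

```r
head(flights.dt[dest == "SEA", .(origin, dep_delay)], 3)
```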

##    origin dep_delay
## 1:    EWR        -1
## 2:    JFK        13
## 3:    EWR         6

The examples so far broadly correspond to dplyr’s select.

But j is not just for selecting. It is also for computing. Let’s find the mean arrival delay for flights to Seattle:
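
An expression in j is evaluated within the table, so this could be written as:

```r
flights.dt[dest == "SEA", mean(arr_delay, na.rm = TRUE)]
```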

## [1] -1.099099

Several values can be returned, and optionally named, by wrapping them in .(). For instance, let’s find the average departure and arrival delay for all flights to Seattle, given the flight was delayed on arrival, and name these dep and arr:
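
A sketch of the corresponding expression:

```r
flights.dt[dest == "SEA" & arr_delay > 0,
           .(dep = mean(dep_delay, na.rm = TRUE),
             arr = mean(arr_delay, na.rm = TRUE))]
```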

##         dep      arr
## 1: 33.98266 39.79984

The result is a data table with two variables.

We can use the special variable .N to count the rows:
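
For instance:

```r
flights.dt[dest == "SEA" & arr_delay > 0, .N]
```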

## [1] 1269

will tell us how many flights to Seattle were delayed at arrival.

Handling the case where the variable names are stored in other variables is not that hard, but it still adds a layer of complexity. We can specify the variables in the .SDcols argument. This argument determines which columns go into the special .SD (Subset of Data) variable. Afterwards we write an lapply expression in j:
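
A sketch, with the column names stored in a (hypothetical) variable vars:

```r
vars <- c("arr_delay", "dep_delay")
flights.dt[dest == "SEA" & arr_delay > 0,
           lapply(.SD, mean, na.rm = TRUE),   # mean of every .SD column
           .SDcols = vars]
```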

##    arr_delay dep_delay
## 1:  39.79984  33.98266

Let’s repeat: .SDcols determines which variables go into the special .SD list (by default, all of them). lapply in j computes the mean value of each of the variables in the .SD list. This procedure feels complex, although it is internally optimized.

These examples correspond to dplyr’s summarize function. One can argue, however, that data tables’ syntax is more confusing and harder to read. Note also that while the functionality data tables offer here is optimized for speed and memory efficiency, it still returns a new object. Aggregation does not work by reference.

C.3.3 Grouping with by

Finally, all of the above can be computed by groups using by. Let’s compute the average delays above by carrier and origin:
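
A sketch (the printed row count suggests only the first five groups are shown):

```r
head(flights.dt[dest == "SEA" & arr_delay > 0,
                .(dep = mean(dep_delay, na.rm = TRUE),
                  arr = mean(arr_delay, na.rm = TRUE)),
                by = .(carrier, origin)],
     5)
```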

##    carrier origin      dep      arr
## 1:      DL    JFK 29.82373 39.49831
## 2:      B6    JFK 28.49767 40.32093
## 3:      UA    EWR 40.24053 41.85078
## 4:      AS    EWR 31.80952 34.36508
## 5:      AA    JFK 34.04132 40.48760

We just had to specify the by argument listing the grouping variables. If there is more than one, these should be wrapped in a list with the .() function.

We can use the .N variable to get the group size. How many flights did each carrier from each origin?
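
A sketch (first three groups shown):

```r
head(flights.dt[, .N, by = .(carrier, origin)], 3)
```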

##    carrier origin     N
## 1:      UA    EWR 46087
## 2:      UA    LGA  8044
## 3:      AA    JFK 13783

Finally, we can use quoted variable names for grouping too, just by replacing .() with c():
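
For example:

```r
head(flights.dt[, .N, by = c("carrier", "origin")], 3)
```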

##    carrier origin     N
## 1:      UA    EWR 46087
## 2:      UA    LGA  8044
## 3:      AA    JFK 13783

In dplyr terms, the examples here correspond to the group_by and summarize verbs.

Read more about basic usage in the data.table vignette Data analysis using data.table.

C.4 The := Operator: Create Variables by Reference

When summarizing, the values computed in j always create a new data.table: reducing operations cannot be done in place. Computing new variables, however, can.

In-place variable computations (without summarizing) can be done with the := assignment operator in j. Let’s compute a new variable, speed, for each flight. We can do this as follows:
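
A sketch consistent with the output below (distance in miles, air_time in minutes, so speed comes out in miles per hour):

```r
flights.dt[, speed := distance / air_time * 60]   # modifies flights.dt in place
head(flights.dt, 3)
```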

##    year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013     1   1      517            515         2      830
## 2: 2013     1   1      533            529         4      850
## 3: 2013     1   1      542            540         2      923
##    sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1:            819        11      UA   1545  N14228    EWR  IAH      227
## 2:            830        20      UA   1714  N24211    LGA  IAH      227
## 3:            850        33      AA   1141  N619AA    JFK  MIA      160
##    distance hour minute           time_hour    speed
## 1:     1400    5     15 2013-01-01 05:00:00 370.0441
## 2:     1416    5     29 2013-01-01 05:00:00 374.2731
## 3:     1089    5     40 2013-01-01 05:00:00 408.3750

We see the new variable, speed, included as the last column of the data. Note that we did this operation by reference, i.e. we did not assign the result to a new data table. The existing table was modified in place.

The same assignment operator also permits us to remove variables by setting them to NULL. Let’s remove speed:
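
For example:

```r
flights.dt[, speed := NULL]   # removes the column, again in place
head(flights.dt, 3)
```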

##    year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013     1   1      517            515         2      830
## 2: 2013     1   1      533            529         4      850
## 3: 2013     1   1      542            540         2      923
##    sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1:            819        11      UA   1545  N14228    EWR  IAH      227
## 2:            830        20      UA   1714  N24211    LGA  IAH      227
## 3:            850        33      AA   1141  N619AA    JFK  MIA      160
##    distance hour minute           time_hour
## 1:     1400    5     15 2013-01-01 05:00:00
## 2:     1416    5     29 2013-01-01 05:00:00
## 3:     1089    5     40 2013-01-01 05:00:00

Indeed, there is no speed any more.

Assigning more than one variable by reference may feel somewhat more intimidating:
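
A sketch consistent with the output below (meanDelay as the average of departure and arrival delay):

```r
flights.dt[, c("speed", "meanDelay") :=
               .(distance / air_time * 60,
                 (dep_delay + arr_delay) / 2)]
head(flights.dt, 3)
```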

##    year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013     1   1      517            515         2      830
## 2: 2013     1   1      533            529         4      850
## 3: 2013     1   1      542            540         2      923
##    sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1:            819        11      UA   1545  N14228    EWR  IAH      227
## 2:            830        20      UA   1714  N24211    LGA  IAH      227
## 3:            850        33      AA   1141  N619AA    JFK  MIA      160
##    distance hour minute           time_hour    speed meanDelay
## 1:     1400    5     15 2013-01-01 05:00:00 370.0441       6.5
## 2:     1416    5     29 2013-01-01 05:00:00 374.2731      12.0
## 3:     1089    5     40 2013-01-01 05:00:00 408.3750      17.5

Assignment works together with both selection and grouping. For instance, we may want to replace negative delays with zeros:
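
A sketch; only the selected rows are modified:

```r
flights.dt[arr_delay < 0, arr_delay := 0]
head(flights.dt$arr_delay, 20)
```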

##  [1] 11 20 33  0  0 12 19  0  0  8  0  0  7  0 31  0  0  0 12  0

Indeed, we only see positive numbers and zeros. But be careful: we have now overwritten arr_delay in the original data. We cannot restore the previous state without re-loading the dataset.

As an example of groupings, let’s compute the maximum departure delay by origin:
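
A sketch of the grouped assignment:

```r
flights.dt[, maxDelay := max(dep_delay, na.rm = TRUE), by = origin]
head(flights.dt, 4)
```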

##    year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013     1   1      517            515         2      830
## 2: 2013     1   1      533            529         4      850
## 3: 2013     1   1      542            540         2      923
## 4: 2013     1   1      544            545        -1     1004
##    sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1:            819        11      UA   1545  N14228    EWR  IAH      227
## 2:            830        20      UA   1714  N24211    LGA  IAH      227
## 3:            850        33      AA   1141  N619AA    JFK  MIA      160
## 4:           1022         0      B6    725  N804JB    JFK  BQN      183
##    distance hour minute           time_hour    speed meanDelay maxDelay
## 1:     1400    5     15 2013-01-01 05:00:00 370.0441       6.5     1126
## 2:     1416    5     29 2013-01-01 05:00:00 374.2731      12.0      911
## 3:     1089    5     40 2013-01-01 05:00:00 408.3750      17.5     1301
## 4:     1576    5     45 2013-01-01 05:00:00 516.7213      -9.5     1301

We can see that by caused the delay to be computed for each group; however, the data is not summarized: the group-wise maximum is simply added to every single row.

Finally, if you do not want to modify the original data, you should use the copy function. This makes a deep copy of the data, and you can modify the copy afterwards:
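
A sketch consistent with the output below (modify the copy, then check the original):

```r
fl <- copy(flights.dt)
fl <- fl[, .(origin, dest)]    # work on the copy...
head(fl, 3)
head(flights.dt, 3)            # ...the original still has all its columns
```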

##    origin dest
## 1:    EWR  IAH
## 2:    LGA  IAH
## 3:    JFK  MIA
##    year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013     1   1      517            515         2      830
## 2: 2013     1   1      533            529         4      850
## 3: 2013     1   1      542            540         2      923
##    sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1:            819        11      UA   1545  N14228    EWR  IAH      227
## 2:            830        20      UA   1714  N24211    LGA  IAH      227
## 3:            850        33      AA   1141  N619AA    JFK  MIA      160
##    distance hour minute           time_hour    speed meanDelay maxDelay
## 1:     1400    5     15 2013-01-01 05:00:00 370.0441       6.5     1126
## 2:     1416    5     29 2013-01-01 05:00:00 374.2731      12.0      911
## 3:     1089    5     40 2013-01-01 05:00:00 408.3750      17.5     1301

As you see, the flights data has not changed.

These operations correspond to dplyr’s mutate verb. However, mutate always makes a copy of the original dataset, something that may well make your analysis slow and sluggish with large data.

Read more in the data.table vignette Reference semantics.

C.5 Keys

Data tables allow fast lookup based on a key. In its simplest form, a key is a column (or several columns) which is used to pre-sort the data table. Pre-sorting makes it much faster to look up values, perform grouping operations, and merge tables. As data can only be sorted according to one rule at a time, there can be only one key per data.table (but a key may be based on several variables).

Let’s set origin and destination as keys for the data table:
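
A sketch; setkey sorts the table in place by the key columns, and the selection that follows keeps the key:

```r
setkey(flights.dt, origin, dest)
fl <- flights.dt[, .(origin, dest, arr_delay)]
head(fl, 3)
```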

##    origin dest arr_delay
## 1:    EWR  ALB         0
## 2:    EWR  ALB        40
## 3:    EWR  ALB        44

We see that both origin and destination are now alphabetically ordered. Note that when selecting variables, the resulting data table fl retains the keys of the original one.

Once the key is set, we can easily subset by it by just feeding the key values in i:
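
For instance (the printed row count suggests only the first five rows are shown):

```r
head(fl["LGA"], 5)   # all rows whose first key column equals "LGA"
```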

##    origin dest arr_delay
## 1:    LGA  ATL         0
## 2:    LGA  ATL        12
## 3:    LGA  ATL         5
## 4:    LGA  ATL         0
## 5:    LGA  ATL        17

will extract all LaGuardia-originating flights. In terms of output, this is equivalent to fl[origin == "LGA"], just much more efficient. When you want to extract flights based on an origin-destination pair, you can just supply both key values:
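
For example:

```r
head(fl[.("EWR", "SEA")], 4)   # first key "EWR", second key "SEA"
```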

##    origin dest arr_delay
## 1:    EWR  SEA         0
## 2:    EWR  SEA         0
## 3:    EWR  SEA         0
## 4:    EWR  SEA         0

Again, this can be achieved in other ways; keys are just more efficient. Finally, if we want to subset based on the second key only, the syntax is more confusing:
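
The standard idiom supplies every value of the first key alongside the second-key value:

```r
head(fl[.(unique(origin), "SEA")], 4)   # any origin, destination "SEA"
```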

##    origin dest arr_delay
## 1:    EWR  SEA         0
## 2:    EWR  SEA         0
## 3:    EWR  SEA         0
## 4:    EWR  SEA         0

We have to tell [ that we want to extract all observations where the first key can be anything and the second one is “SEA”.

Read more in the vignette Keys and fast binary search based subset.

C.6 Resources


  1. Automatic decompression is a feature request for data tables