C Thinking Big: Data Tables
Data frames are core elements for data handling in R. However, they suffer from several limitations. One of the major issues with data frames is that they are memory hungry and slow. This is not a problem when working with relatively small datasets (say, up to 100,000 rows). However, when your dataset size grows into gigabytes, data frames may be infeasibly slow and too memory hungry.
C.1 Background: Passing By Value And Passing By Reference
R is (mostly) a pass-by-value language. This means that when you modify the data, at every step a new copy of the complete modified object is created and stored in memory, and the former object is freed (garbage-collected) once it is no longer in use.
The main advantage of this approach is consistency: we have the guarantee that functions do not modify their inputs. However, for large objects copying may be slow, and worse, it requires at least twice as much memory before the old object is freed. In more complex processing pipelines, memory consumption may exceed even twice the size of the original object.
Data tables implement a number of pass-by-reference functions. In pass-by-reference, the function is not given a fresh copy of the inputs, but is instead told where the object is in memory. Instead of copying gigabytes of data, only a single tiny memory pointer is passed. But this also means the function is now accessing and modifying the original object, not a copy of it. This may sometimes lead to bugs and unexpected behavior, but careful use of the pass-by-reference approach can substantially improve speed and lower the memory footprint.
C.2 Data Tables: Introduction
Data tables and most of the related goodies live in the data.table library, so you either have to load the library or specify the namespace when using its functions.
C.2.1 Replacement for Data Frames (Sort of)
Data tables are designed to be largely a replacement for data frames. The syntax is similar and the two are, to a large extent, interchangeable. For instance, we can create and play with a data table as follows:
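A minimal sketch of such a construction (the column names id, x, and y match the printout below, but the random draws will differ from run to run):

```r
library(data.table)

# create a small data table: an id column plus two random columns
dt <- data.table(id = 1:5, x = rnorm(5), y = runif(5))
dt
```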
## id x y
## 1: 1 -0.1759081 0.3587206
## 2: 2 -1.0397919 0.7036122
## 3: 3 1.3415712 0.6285581
## 4: 4 -0.7327195 0.6126032
## 5: 5 -1.7042843 0.7579853
The result looks almost identical to a similar data frame (the only difference is the colons after the row numbers). Behind the scenes these objects are almost identical too: both are lists of vectors. This structural similarity allows us to use data tables as drop-in replacements for data frames, at least in some circumstances. For instance, we can extract variables with $:
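A sketch, with dt re-created so the chunk stands alone:

```r
library(data.table)
dt <- data.table(id = 1:5, x = rnorm(5), y = runif(5))  # as constructed above
dt$x  # a plain numeric vector, exactly as with a data frame
```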
## [1] -0.1759081 -1.0397919 1.3415712 -0.7327195 -1.7042843
or rows with row indices:
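For instance (dt re-created so the chunk stands alone):

```r
library(data.table)
dt <- data.table(id = 1:5, x = rnorm(5), y = runif(5))  # as constructed above
dt[c(2, 4), ]  # rows 2 and 4, data-frame style
```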
## id x y
## 1: 2 -1.0397919 0.7036122
## 2: 4 -0.7327195 0.6126032
However, by default data tables use unquoted variable names (like dplyr):
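For instance (dt re-created so the chunk stands alone):

```r
library(data.table)
dt <- data.table(id = 1:5, x = rnorm(5), y = runif(5))  # as constructed above
dt[, x]  # x is unquoted; this returns the column as a vector
```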
## [1] -0.1759081 -1.0397919 1.3415712 -0.7327195 -1.7042843
In case the variable name is stored in another variable, we have to use the additional argument with = FALSE:
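A sketch (dt re-created so the chunk stands alone; var is a hypothetical name):

```r
library(data.table)
dt <- data.table(id = 1:5, x = rnorm(5), y = runif(5))  # as constructed above
var <- "x"                # the variable name, stored in another variable
dt[, var, with = FALSE]   # with = FALSE: treat var as a column name
```

Newer versions of data.table also support the shorthand dt[, ..var] for the same operation.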
## x
## 1: -0.1759081
## 2: -1.0397919
## 3: 1.3415712
## 4: -0.7327195
## 5: -1.7042843
Note also that instead of getting a vector, we now get a data.table with a single column “x”. This behavior is one of the main reasons why replacing data frames with data tables may require changing quite a bit of code.
C.2.2 Fast Reading and Writing
Many data frame users will appreciate the fact that the data input-output functions fread and fwrite run at least an order of magnitude faster on large files. These are largely replacements for read.table and write.table, although their syntax differs noticeably in places.
In particular, fread accepts a file name, an http URL, or a shell command that prints output; it automatically detects the column separator, but it does not automatically open compressed files. The latter is not a big deal on unix, where one can decompress on the fly with a shell command.
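For instance (a sketch with a hypothetical file name flights.csv.gz; fread's cmd argument runs the shell command and reads its output):

```r
library(data.table)
# decompress on the fly: zcat writes the uncompressed csv to stdout
flights <- fread(cmd = "zcat flights.csv.gz")
```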
However, decompression is not that simple on Windows, and hence it is hard to write platform-independent code that opens compressed files.2
If your computer has enough memory and speed is not an issue, your interest in data tables may end here. You can just transform a data table into a data frame with setDF (and the other way around with setDT). Let's transform our data table into a data frame:
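A sketch (dt re-created so the chunk stands alone):

```r
library(data.table)
dt <- data.table(id = 1:5, x = rnorm(5), y = runif(5))  # as constructed above
setDF(dt)  # convert to a data frame in place, without copying
dt
```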
## id x y
## 1 1 -0.1759081 0.3587206
## 2 2 -1.0397919 0.7036122
## 3 3 1.3415712 0.6285581
## 4 4 -0.7327195 0.6126032
## 5 5 -1.7042843 0.7579853
Do you see that the colons after row names are gone? This means dt
now is a data frame.
Note that this function behaves very differently from what we have learned earlier: it modifies the object in place (by reference). We do not have to assign the result to a new variable using a construct like df <- setDF(dt) (although we still can, which is handy when using magrittr pipes). This is a manifestation of the power of data.tables: the object is not copied, but the same object is modified in memory instead. setDF and setDT are very efficient: even huge tables are converted instantly, with virtually no need for additional memory.
However, great power comes hand-in-hand with great responsibility: it is easy to forget that setDF modifies its argument.
C.3 Indexing: The Major Workhorse of Data Tables
Data tables' indexing is much more powerful than that of data frames. The single-bracket indexing is a powerful (albeit confusing) tool. Its general syntax is as follows:
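In schematic form (with DT standing for any data table):

```r
DT[i, j, by]
```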
where i specifies what to do with rows (for instance, select certain rows), j tells what to do with columns (such as select columns, compute new columns, or aggregate columns), and by contains the grouping variables.
Let's demonstrate this with the flights data from the nycflights13 package. We load the data and transform it into a data.table:
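A sketch of the loading code (assuming the nycflights13 package is installed):

```r
library(data.table)
library(nycflights13)

data(flights)   # copy the flights dataset into the working environment
setDT(flights)  # convert it to a data.table in place
head(flights)
```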
## year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013 1 1 517 515 2 830
## 2: 2013 1 1 533 529 4 850
## 3: 2013 1 1 542 540 2 923
## 4: 2013 1 1 544 545 -1 1004
## 5: 2013 1 1 554 600 -6 812
## 6: 2013 1 1 554 558 -4 740
## sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1: 819 11 UA 1545 N14228 EWR IAH 227
## 2: 830 20 UA 1714 N24211 LGA IAH 227
## 3: 850 33 AA 1141 N619AA JFK MIA 160
## 4: 1022 -18 B6 725 N804JB JFK BQN 183
## 5: 837 -25 DL 461 N668DN LGA ATL 116
## 6: 728 12 UA 1696 N39463 EWR ORD 150
## distance hour minute time_hour
## 1: 1400 5 15 2013-01-01 05:00:00
## 2: 1416 5 29 2013-01-01 05:00:00
## 3: 1089 5 40 2013-01-01 05:00:00
## 4: 1576 5 45 2013-01-01 05:00:00
## 5: 762 6 0 2013-01-01 06:00:00
## 6: 719 5 58 2013-01-01 05:00:00
C.3.1 i: Select Observations
Obviously, we can always just tell which observations we want:
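For instance (flights re-created so the chunk stands alone):

```r
library(data.table)
flights <- as.data.table(nycflights13::flights)  # as loaded above
flights[1:3, ]  # the first three rows
```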
## year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013 1 1 517 515 2 830
## 2: 2013 1 1 533 529 4 850
## 3: 2013 1 1 542 540 2 923
## sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1: 819 11 UA 1545 N14228 EWR IAH 227
## 2: 830 20 UA 1714 N24211 LGA IAH 227
## 3: 850 33 AA 1141 N619AA JFK MIA 160
## distance hour minute time_hour
## 1: 1400 5 15 2013-01-01 05:00:00
## 2: 1416 5 29 2013-01-01 05:00:00
## 3: 1089 5 40 2013-01-01 05:00:00
picks the first three lines from the data. Maybe more interestingly, we can use the special variable .N (the number of rows) to get the penultimate row:
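A sketch (flights re-created so the chunk stands alone):

```r
library(data.table)
flights <- as.data.table(nycflights13::flights)  # as loaded above
flights[.N - 1]  # .N is the number of rows, so this is the penultimate row
```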
## year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013 9 30 NA 1159 NA NA
## sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1: 1344 NA MQ 3572 N511MQ LGA CLE NA
## distance hour minute time_hour
## 1: 419 11 59 2013-09-30 11:00:00
We can select observations with a logical index vector, in the same way as in data frames:
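For instance (flights re-created so the chunk stands alone):

```r
library(data.table)
flights <- as.data.table(nycflights13::flights)  # as loaded above
ewr2sea <- flights[origin == "EWR" & dest == "SEA"]
head(ewr2sea, 3)
```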
## year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013 1 1 724 725 -1 1020
## 2: 2013 1 1 857 851 6 1157
## 3: 2013 1 1 1418 1419 -1 1726
## sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1: 1030 -10 AS 11 N594AS EWR SEA 338
## 2: 1222 -25 UA 1670 N45440 EWR SEA 343
## 3: 1732 -6 UA 16 N37464 EWR SEA 348
## distance hour minute time_hour
## 1: 2402 7 25 2013-01-01 07:00:00
## 2: 2402 8 51 2013-01-01 08:00:00
## 3: 2402 14 19 2013-01-01 14:00:00
will create a new data table including only flights from Newark to Seattle. However, note that we just use origin, and not flights$origin, as would be the case with data frames. Data tables evaluate the arguments as if inside a with() call.
The first form, integer indexing, corresponds to dplyr's slice function, while the second is equivalent to filter.
C.3.2 j: Work with Columns
j is perhaps the most powerful (and most confusing) of all arguments for data table indexing. It allows us both to select columns and to perform more complex tasks. Let's start with selection:
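A sketch (flights re-created so the chunk stands alone):

```r
library(data.table)
flights <- as.data.table(nycflights13::flights)  # as loaded above
flights[1:3, dest]  # a character vector, not a data table
```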
## [1] "IAH" "IAH" "MIA"
selects only the dest variable from the data. Note that this results in a vector, not in a single-variable data table. If you want the latter, you can do
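For instance (flights re-created so the chunk stands alone):

```r
library(data.table)
flights <- as.data.table(nycflights13::flights)  # as loaded above
flights[1:3, .(dest)]  # wrapping in .() keeps the result a data table
```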
## dest
## 1: IAH
## 2: IAH
## 3: MIA
.() is just an alias for list(), encoded differently in data tables to improve readability and make it easier to type. If we want to select more than one variable, we can use the same syntax:
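For instance (flights re-created so the chunk stands alone):

```r
library(data.table)
flights <- as.data.table(nycflights13::flights)  # as loaded above
flights[1:3, .(origin, dest)]
```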
## origin dest
## 1: EWR IAH
## 2: LGA IAH
## 3: JFK MIA
Selection supports a number of goodies, such as ranges of variables with : (for instance, dep_time:arr_delay) and excluding variables with ! or - (for instance, -year).
Obviously we can combine both i and j: let's select origin and departure delay for flights to Seattle:
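A sketch (flights re-created so the chunk stands alone):

```r
library(data.table)
flights <- as.data.table(nycflights13::flights)  # as loaded above
head(flights[dest == "SEA", .(origin, dep_delay)], 3)
```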
## origin dep_delay
## 1: EWR -1
## 2: JFK 13
## 3: EWR 6
The examples so far broadly correspond to dplyr's select.
But j is not just for selecting. It is also for computing. Let's find the mean arrival delay for flights to Seattle:
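A sketch (flights re-created so the chunk stands alone):

```r
library(data.table)
flights <- as.data.table(nycflights13::flights)  # as loaded above
flights[dest == "SEA", mean(arr_delay, na.rm = TRUE)]
```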
## [1] -1.099099
Several variables can be returned by wrapping them, optionally named, in .(). For instance, let's find the average departure and arrival delay for all flights to Seattle that were delayed on arrival, and name these dep and arr:
flights[dest == "SEA" & arr_delay > 0,
.(dep = mean(dep_delay, na.rm=TRUE), arr = mean(arr_delay, na.rm=TRUE))]
## dep arr
## 1: 33.98266 39.79984
The result is a data table with two variables.
We can use the special variable .N to count the rows:
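A sketch (flights re-created so the chunk stands alone):

```r
library(data.table)
flights <- as.data.table(nycflights13::flights)  # as loaded above
flights[dest == "SEA" & arr_delay > 0, .N]
```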
## [1] 1269
will tell us how many flights to Seattle were delayed at arrival.
Handling the case where the variable names are stored in other variables is not that hard, but it adds a layer of complexity. We specify the variables via the .SDcols parameter, which determines which columns go into the special variable .SD (Subset of Data). Afterwards we use an lapply expression in j:
flights[dest == "SEA" & arr_delay > 0,
lapply(.SD, function(x) mean(x, na.rm=TRUE)),
.SDcols = c("arr_delay", "dep_delay")]
## arr_delay dep_delay
## 1: 39.79984 33.98266
Let's repeat: .SDcols determines which variables go into the special .SD list (default: all); lapply in j computes the mean of each variable in the .SD list. This procedure feels complex, although it is internally optimized.
These examples correspond to dplyr's summarize function. One can argue, however, that the data table syntax is more confusing and harder to read. Note also that although the functionality data tables offer here is optimized for speed and memory efficiency, it still returns a new object: aggregation does not work by reference.
C.3.3 Group in by
Finally, all of the above can be computed by groups using by. Let's compute the average delays above by carrier and origin:
flights[dest == "SEA" & arr_delay > 0,
.(dep = mean(dep_delay, na.rm=TRUE), arr = mean(arr_delay, na.rm=TRUE)),
by = .(carrier, origin)]
## carrier origin dep arr
## 1: DL JFK 29.82373 39.49831
## 2: B6 JFK 28.49767 40.32093
## 3: UA EWR 40.24053 41.85078
## 4: AS EWR 31.80952 34.36508
## 5: AA JFK 34.04132 40.48760
We just had to specify the by argument that lists the grouping variables. If there is more than one, these should be wrapped in a list with the .() function.
We can use the .N variable to get the group size. How many flights did each carrier operate from each origin?
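A sketch (flights re-created so the chunk stands alone):

```r
library(data.table)
flights <- as.data.table(nycflights13::flights)  # as loaded above
head(flights[, .N, by = .(carrier, origin)], 3)
```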
## carrier origin N
## 1: UA EWR 46087
## 2: UA LGA 8044
## 3: AA JFK 13783
Finally, we can use quoted variables for grouping too, just by replacing .() with c():
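For instance (flights re-created so the chunk stands alone):

```r
library(data.table)
flights <- as.data.table(nycflights13::flights)  # as loaded above
head(flights[, .N, by = c("carrier", "origin")], 3)
```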
## carrier origin N
## 1: UA EWR 46087
## 2: UA LGA 8044
## 3: AA JFK 13783
In the dplyr context, the examples here correspond to the group_by and summarize verbs.
Read more about basic usage in the data.table vignette Data analysis using data.table.
C.4 :=: Create Variables by Reference
While we can compute values in j when summarizing, this always creates a new data.table; reducing operations cannot be performed in place. Computing new variables, however, can be done in place.
In-place variable computations (without summarizing) can be done with the := assignment operator in j. Let's compute a new variable, speed, for each flight. We can do this as follows:
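A sketch (flights re-created so the chunk stands alone; distance is in miles and air_time in minutes, so this gives miles per hour):

```r
library(data.table)
flights <- as.data.table(nycflights13::flights)  # as loaded above
flights[, speed := distance / (air_time / 60)]   # add column by reference
head(flights, 3)
```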
## year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013 1 1 517 515 2 830
## 2: 2013 1 1 533 529 4 850
## 3: 2013 1 1 542 540 2 923
## sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1: 819 11 UA 1545 N14228 EWR IAH 227
## 2: 830 20 UA 1714 N24211 LGA IAH 227
## 3: 850 33 AA 1141 N619AA JFK MIA 160
## distance hour minute time_hour speed
## 1: 1400 5 15 2013-01-01 05:00:00 370.0441
## 2: 1416 5 29 2013-01-01 05:00:00 374.2731
## 3: 1089 5 40 2013-01-01 05:00:00 408.3750
We see the new variable, speed, included as the last variable in the data. Note that we did this operation by reference, i.e. we did not assign the result to a new data table; the existing table was modified in place.
The same assignment operator also permits us to remove variables by setting them to NULL. Let's remove speed:
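A sketch (the setup lines re-create the state from above so the chunk stands alone):

```r
library(data.table)
flights <- as.data.table(nycflights13::flights)  # as loaded above
flights[, speed := distance / (air_time / 60)]   # the column added earlier
flights[, speed := NULL]  # setting a column to NULL removes it by reference
head(flights, 3)
```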
## year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013 1 1 517 515 2 830
## 2: 2013 1 1 533 529 4 850
## 3: 2013 1 1 542 540 2 923
## sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1: 819 11 UA 1545 N14228 EWR IAH 227
## 2: 830 20 UA 1714 N24211 LGA IAH 227
## 3: 850 33 AA 1141 N619AA JFK MIA 160
## distance hour minute time_hour
## 1: 1400 5 15 2013-01-01 05:00:00
## 2: 1416 5 29 2013-01-01 05:00:00
## 3: 1089 5 40 2013-01-01 05:00:00
Indeed, there is no speed any more.
Assigning more than one variable by reference may feel somewhat more intimidating:
flights[, c("speed", "meanDelay") := .(distance/(air_time/60), (arr_delay + dep_delay)/2)]
flights %>% head(3)
## year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013 1 1 517 515 2 830
## 2: 2013 1 1 533 529 4 850
## 3: 2013 1 1 542 540 2 923
## sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1: 819 11 UA 1545 N14228 EWR IAH 227
## 2: 830 20 UA 1714 N24211 LGA IAH 227
## 3: 850 33 AA 1141 N619AA JFK MIA 160
## distance hour minute time_hour speed meanDelay
## 1: 1400 5 15 2013-01-01 05:00:00 370.0441 6.5
## 2: 1416 5 29 2013-01-01 05:00:00 374.2731 12.0
## 3: 1089 5 40 2013-01-01 05:00:00 408.3750 17.5
Assignment works together with both selection and grouping. For instance, we may want to replace negative delay by zeros:
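A sketch (flights re-created so the chunk stands alone; note that this modifies flights in place):

```r
library(data.table)
flights <- as.data.table(nycflights13::flights)  # as loaded above
flights[arr_delay < 0, arr_delay := 0]  # clamp negative delays to zero
head(flights$arr_delay, 20)
```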
## [1] 11 20 33 0 0 12 19 0 0 8 0 0 7 0 31 0 0 0 12 0
Indeed, we only see positive numbers and zeros. But be careful: we have now overwritten arr_delay in the original data. We cannot restore the previous state without re-loading the dataset.
As an example of groupings, let’s compute the maximum departure delay by origin:
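A sketch (flights re-created so the chunk stands alone; the speed and meanDelay columns in the printout below come from the earlier chunks):

```r
library(data.table)
flights <- as.data.table(nycflights13::flights)  # as loaded above
flights[, maxDelay := max(dep_delay, na.rm = TRUE), by = origin]
head(flights, 4)
```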
## year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013 1 1 517 515 2 830
## 2: 2013 1 1 533 529 4 850
## 3: 2013 1 1 542 540 2 923
## 4: 2013 1 1 544 545 -1 1004
## sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1: 819 11 UA 1545 N14228 EWR IAH 227
## 2: 830 20 UA 1714 N24211 LGA IAH 227
## 3: 850 33 AA 1141 N619AA JFK MIA 160
## 4: 1022 0 B6 725 N804JB JFK BQN 183
## distance hour minute time_hour speed meanDelay maxDelay
## 1: 1400 5 15 2013-01-01 05:00:00 370.0441 6.5 1126
## 2: 1416 5 29 2013-01-01 05:00:00 374.2731 12.0 911
## 3: 1089 5 40 2013-01-01 05:00:00 408.3750 17.5 1301
## 4: 1576 5 45 2013-01-01 05:00:00 516.7213 -9.5 1301
We can see that by caused the maximum delay to be computed for each group; however, the data is not summarized, the group-wise maximum is just added to every single row.
Finally, if you do not want to modify the original data, you should use the copy function. It makes a deep copy of the data, and you can modify the copy afterwards:
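One possible version of this step (flights re-created so the chunk stands alone):

```r
library(data.table)
flights <- as.data.table(nycflights13::flights)  # as loaded above
fl <- copy(flights)             # deep copy: edits to fl no longer touch flights
head(fl[, .(origin, dest)], 3)
head(flights, 3)                # flights itself is untouched
```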
## origin dest
## 1: EWR IAH
## 2: LGA IAH
## 3: JFK MIA
## year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013 1 1 517 515 2 830
## 2: 2013 1 1 533 529 4 850
## 3: 2013 1 1 542 540 2 923
## sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1: 819 11 UA 1545 N14228 EWR IAH 227
## 2: 830 20 UA 1714 N24211 LGA IAH 227
## 3: 850 33 AA 1141 N619AA JFK MIA 160
## distance hour minute time_hour speed meanDelay maxDelay
## 1: 1400 5 15 2013-01-01 05:00:00 370.0441 6.5 1126
## 2: 1416 5 29 2013-01-01 05:00:00 374.2731 12.0 911
## 3: 1089 5 40 2013-01-01 05:00:00 408.3750 17.5 1301
As you see, the flights
data has not changed.
These operations correspond to dplyr's mutate verb. However, mutate always makes a copy of the original dataset, something that may well make your analysis slow and sluggish with large data.
Read more in the data.table vignette Reference semantics.
C.5 Keys
Data tables allow fast lookup based on keys. In its simplest version, a key is a column (or several columns) used to pre-sort the data table. Pre-sorting makes it much faster to look up values, perform grouping operations, and merge tables. As data can only be sorted according to one rule at a time, there can be only one key per data.table (but a key may span several variables).
Let’s set origin and destination as keys for the data table:
setDT(flights, key = c("origin", "dest"))
fl <- flights[, .(origin, dest, arr_delay)]  # focus on a few variables only
head(fl, 3)
## origin dest arr_delay
## 1: EWR ALB 0
## 2: EWR ALB 40
## 3: EWR ALB 44
We see that both origin and destination are alphabetically ordered. Note that when selecting variables, the resulting data table fl will have the same keys as the original one.
Once the key is set, we can easily subset by key by just feeding the key values in i:
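A sketch (the keyed table fl re-created so the chunk stands alone):

```r
library(data.table)
fl <- as.data.table(nycflights13::flights)[, .(origin, dest, arr_delay)]
setkey(fl, origin, dest)   # mirrors the keyed table above
head(fl["LGA"], 5)         # binary search on the first key
```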
## origin dest arr_delay
## 1: LGA ATL 0
## 2: LGA ATL 12
## 3: LGA ATL 5
## 4: LGA ATL 0
## 5: LGA ATL 17
will extract all LaGuardia-originating flights. In terms of output, this is equivalent to fl[origin == "LGA"], just much more efficient. When you want to extract flights based on an origin-destination pair, you can just add both key columns:
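A sketch (the keyed table fl re-created so the chunk stands alone; the arr_delay values shown below reflect the earlier in-place edits):

```r
library(data.table)
fl <- as.data.table(nycflights13::flights)[, .(origin, dest, arr_delay)]
setkey(fl, origin, dest)       # mirrors the keyed table above
head(fl[.("EWR", "SEA")], 4)   # match both keys at once
```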
## origin dest arr_delay
## 1: EWR SEA 0
## 2: EWR SEA 0
## 3: EWR SEA 0
## 4: EWR SEA 0
Again, this can be achieved in other ways; keys are just more efficient. Finally, if we want to extract based on the second key only, the syntax is more confusing:
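A sketch, using the standard idiom of supplying all values of the first key (fl re-created so the chunk stands alone):

```r
library(data.table)
fl <- as.data.table(nycflights13::flights)[, .(origin, dest, arr_delay)]
setkey(fl, origin, dest)                 # mirrors the keyed table above
head(fl[.(unique(origin), "SEA")], 4)    # every origin, destination "SEA"
```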
## origin dest arr_delay
## 1: EWR SEA 0
## 2: EWR SEA 0
## 3: EWR SEA 0
## 4: EWR SEA 0
We have to tell [ that we want to extract all observations where the first key is anything and the second one is "SEA".
Read more in the vignette Keys and fast binary search based subset.
C.6 Resources
- Data Table CRAN page. Vignettes are a very valuable source of information.
Automatic decompression is a feature request for data tables↩