B R Language Control Structures
In chapter Functions we briefly described conditional execution using the if/else construct. Here we introduce the other main control structures, and cover the if/else statement in more detail.
B.1 Loops
One of the things computers are good at is executing similar tasks repeatedly. This can be achieved through loops. R implements three loops: for-loop, while-loop, and repeat-loop. These are among the most popular loops in many modern programming languages. Loops are often advised against in R. While it is true that loops in R are more demanding than in several other languages, and can often be replaced by more efficient constructs, they nevertheless have a very important role in R. Good usage of loops is easy, fast, and improves the readability of your code.
B.1.1 For Loop
For-loop is a good choice if you want to run some statements repeatedly, each time picking a different value for a variable from a sequence; or just repeat some statements a given number of times. The syntax of for-loop is the following:
for(variable in sequence) {
# do something many times. Each time 'variable' has
# a different value, taken from 'sequence'
}
The statement block inside of the curly braces {} is repeated through the sequence.
Despite the curly braces, the loop body is still evaluated in the same environment as the rest of the code. This means the variables generated before entering the loop are visible from inside, and the variables created inside will be visible after the loop ends. This is in contrast to functions where the function body, also enclosed in curly braces, is a separate scope.
A simple and perhaps the most common usage of for loop is just to run some code a given number of times:
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
Through the loop the variable i takes all the values from the sequence 1:5. Note: we do not need curly braces if the loop block only contains a single statement.
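A minimal sketch of a loop that would produce the output above. It assumes the loop variable i and the sequence 1:5 named in the text, and omits the curly braces because the body is a single statement:

```r
# Print each value of the sequence in turn.
# No curly braces are needed: the body is a single statement.
for (i in 1:5) print(i)
```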
But the sequence does not have to be a simple sequence of numbers. It may as well be a list:
data <- list("Xiaotian", 25, c(2061234567, 2069876543))
for(field in data) {
cat("field length:", length(field), "\n")
}
## field length: 1
## field length: 1
## field length: 2
As another less common example, we may run different functions over the same data:
## [1] 1.000000 1.414214 1.732051 2.000000
## [1] 2.718282 7.389056 20.085537 54.598150
## [1] 0.8414710 0.9092974 0.1411200 -0.7568025
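The three output lines above match sqrt, exp, and sin applied to the numbers 1 to 4, so the loop may have looked roughly like the sketch below. The variable names x and fun are assumptions made for illustration:

```r
x <- 1:4   # the data every function is applied to

# Loop over a list of functions; 'fun' is a different function on each pass
for (fun in list(sqrt, exp, sin)) {
  print(fun(x))
}
```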
B.1.2 While-loop
While-loop lets the programmer decide explicitly whether to execute the statements one more time: the loop body is repeated as long as the condition in while(condition) is true.
For instance, we can repeat something a given number of times (although this is more of a task for for-loop):
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
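A sketch of a while-loop that would print the numbers shown above. The counter name i and the starting value 0 are assumptions chosen to match the output:

```r
i <- 0            # the counter has to be initialized by hand
while (i < 5) {   # the condition is checked before every pass
  print(i)
  i <- i + 1      # forgetting this line gives an infinite loop
}
```

Compared with the for-loop, the programmer has to initialize and update the counter explicitly, which is why the for-loop is usually the better fit for this task.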
Note that in case the condition is not true to begin with, the loop never executes and control is immediately passed to the code after the loop:
## done
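One way to see this behavior is sketched below. The starting value 10 is an arbitrary assumption, chosen only so that the condition is false from the start:

```r
i <- 10
while (i < 5) {   # FALSE on the very first check
  print(i)        # never executed
  i <- i + 1
}
cat("done\n")     # control arrives here immediately
```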
While-loop is great for handling repeated attempts to get input in shape. One can write something like this:
This will run getData() until data is good. If it is not good at the first try, the code tries really hard next time(s).
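The pattern described above might be sketched as follows. Only the name getData() comes from the text. The helper isGood(), the argument tryReallyHard, and the counter attempts are hypothetical stand-ins, invented here so that the example runs on its own:

```r
# Hypothetical helpers, for illustration only
attempts <- 0
getData <- function(tryReallyHard = FALSE) {
  attempts <<- attempts + 1
  # pretend the plain attempt fails and the harder one succeeds
  if (tryReallyHard) c(1, 2, 3) else NULL
}
isGood <- function(x) !is.null(x)

data <- getData()                         # first attempt
while (!isGood(data)) {                   # repeat only while data is bad
  data <- getData(tryReallyHard = TRUE)   # later attempts try harder
}
```

Note that getData() is written twice: once before the loop and once inside it.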
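While can also easily do infinite loops: a loop whose condition is always true, such as while(TRUE), will execute until something breaks the loop. Another somewhat interesting usage of while is to disable execution of part of your code by wrapping it in a loop whose condition is always false, such as while(FALSE).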
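Both usages are sketched below. The counter n and the stopping rule are assumptions added so that the "infinite" loop terminates when the example is run:

```r
# An "infinite" loop: runs until something breaks it
n <- 0
while (TRUE) {
  n <- n + 1
  if (n >= 3) break   # without this the loop would never end
}

# Disabling a part of the code: the body is never executed
while (FALSE) {
  stop("this line is never reached")
}
```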
B.1.3 repeat-loop
The R way of implementing repeat-loop does not include any condition, and hence it just keeps running indefinitely, or until something breaks the execution (see below):
repeat {
## just do something indefinitely many times.
## if you want to stop, use `break` statement.
}
Repeat is useful in many of the same situations as while. For instance, we can write a modified version of getting data using repeat.
The main difference between this and the previous while-version is that we a) don't have to write the getData function twice, and b) on each try we attempt to get data in the same way.
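A sketch of the repeat-version discussed above. As before, only getData() is named in the text. The helper isGood() and the counter attempts are hypothetical, and this stand-in getData() is made to succeed on the third call purely for demonstration:

```r
# Hypothetical helpers, for illustration only
attempts <- 0
getData <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) NULL else c(1, 2, 3)   # succeeds on the third call
}
isGood <- function(x) !is.null(x)

repeat {
  data <- getData()          # the same call on every try
  if (isGood(data)) break    # leave the loop as soon as data is good
}
```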
Exactly as in the case of for and while loop, enclosing the loop body in curly braces is only necessary if it contains more than one statement. For instance, if you want to close all the graphics devices left open by buggy code, you can use the following one-liner from console:
The loop will be broken by the error that occurs when it attempts to close the null device, which is all that is left once every other device has been closed.
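On the console the one-liner is presumably just repeat dev.off(). The sketch below wraps it in tryCatch() so that it can also be run inside a script without stopping on the final error. The two pdf(NULL) calls are an assumption added only to have some devices open, and the variable name msg is hypothetical:

```r
# Open two graphics devices that write no file (for demonstration only)
pdf(NULL)
pdf(NULL)

# Close devices until dev.off() fails on the null device;
# tryCatch() turns the final error into an ordinary value
msg <- tryCatch({
  repeat dev.off()
}, error = function(e) conditionMessage(e))

dev.cur()   # only the null device (number 1) is left
```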
It should also be noted that the loop does not create a separate environment (scope). All the variables from outside of the loop are visible inside, and variables created inside of the loop remain accessible after the loop terminates.
B.1.4 Leaving Early: break and next
A straightforward way to leave a loop is the break statement. It just leaves the loop and transfers control to the code immediately after the loop body. It breaks all three loops discussed here (for, while, and repeat-loops) but not some of the other types of loops implemented outside of base R. For instance:
## 1
## 2
## 3
## 4
## 5
## done
As the loop body will use the same environment (the same variable values) as the rest of the code, the loop variables can be used right afterwards. For instance, we can see that i = 5:
## [1] 5
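A sketch of a loop that would produce the output shown above and leave i at 5. The upper limit 10 and the exact break condition are assumptions. Any sequence longer than five elements would behave the same way:

```r
for (i in 1:10) {
  cat(i, "\n")
  if (i > 4) break   # leave the loop once i reaches 5
}
cat("done\n")

# The loop variable keeps its last value after the loop
print(i)
```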
In contrast to break, next will throw the execution flow back to the head of the loop without running the commands following next. In case of for-loop, this means a new value is picked from the sequence; in case of while-loop the condition is evaluated again; the head of repeat-loop does not do any calculations. For instance, we can print only even numbers:
## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
next jumps over printing where modulo of division by 2 is not 0.
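A sketch of such a loop, assuming the sequence 1:10 implied by the output above:

```r
for (i in 1:10) {
  if (i %% 2 != 0) next   # odd number: skip the rest of the body
  print(i)                # reached only for even numbers
}
```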
B.1.5 When (Not) To Use Loops In R?
In general, use loops if it improves code readability, makes it easier to extend the code, and does not cause too big inefficiencies. In practice, it means use loops for repeating something slow. Don’t use loops if you can use vectorized operators instead.
For instance, look at the examples above where we were reading data until it was good. There is probably very little you can achieve by avoiding loops, except making your code messy. First, the input will probably be attempted only a few times (otherwise reconsider how you get your data), and second, data input is probably a far slower process than the loop overhead. Just use the loop.
Similar arguments hold if you consider looping over input files to read, output files to create, figures to plot, webpages to scrape… In all these cases the loop overheads are negligible compared to the task performed inside the loops. And there are no real vectorized alternatives anyway.
The opposite is the case where good alternatives exist. Never compute a result element by element in a loop, appending each new value to the result vector, when a vectorized alternative exists.
Such a loop introduces three inefficiencies. First, and most importantly, you are growing your results vector res when adding new values to it. This involves creating new and longer vectors again and again when the old one is filled up. This is very slow. Second, as true vectorized alternatives exist here, one can easily speed up the code by an order of magnitude or more by just switching to the vectorized code. Third, and this is a very important point again, this code is much harder to read than the pure vectorized expression.
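The contrast can be sketched as follows. The vector name res comes from the text, and the square-root computation mirrors the benchmark functions further down. The name res_vec is an assumption introduced for the vectorized result:

```r
# Slow and hard to read: grow the result vector one element at a time
res <- numeric()
for (i in 1:1000) {
  res <- c(res, sqrt(i))   # allocates a new, longer vector on every pass
}

# Fast and readable: a single vectorized expression
res_vec <- sqrt(1:1000)

all.equal(res, res_vec)    # both approaches give the same values
```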
We can easily demonstrate how much faster the vectorized expressions work. Let’s do this using microbenchmark library (it provides nanosecond timings). Let’s wrap these expressions in functions for easier handling by microbenchmark. We also include a middle version where we use a loop but avoid the terribly slow vector re-allocation by creating the right-sized result in advance:
seq <- 1:1000 # global variables are easier to handle with microbenchmark
baad <- function() {
   # grow the result vector inside the loop
   r <- numeric()
   for(i in seq)
      r <- c(r, sqrt(i))
}
soso <- function() {
   # pre-allocate the result, then fill it in a loop
   r <- numeric(length(seq))
   for(i in seq)
      r[i] <- sqrt(i)
}
good <- function()
   sqrt(seq) # truly vectorized
microbenchmark::microbenchmark(baad(), soso(), good())
## Unit: microseconds
## expr min lq mean median uq max neval
## baad() 826.782 836.8505 1121.75340 849.7555 876.0790 3884.545 100
## soso() 42.018 42.5170 58.71166 43.0100 44.0435 1576.462 100
## good() 2.259 2.4470 8.64512 2.6815 3.1675 549.440 100
When comparing the median values, we can see that the middle example is about 16x slower than the vectorized example, and the first version is more than 300x slower!
The above holds for “true” vectorized operations. R provides many pseudo-vectorizers (the lapply family) that may be handy in many circumstances, but not in case you need speed. Under the hood these functions just use a loop. We can demonstrate this by adding one more function to our family:
soso2 <- function() {
sapply(seq, sqrt)
}
microbenchmark::microbenchmark(baad(), soso(), soso2(), good())
## Unit: microseconds
## expr min lq mean median uq max neval
## baad() 823.078 853.0670 1039.05146 862.1650 874.8660 3547.314 100
## soso() 41.704 42.9815 43.67443 43.3550 43.7230 68.680 100
## soso2() 219.362 222.0515 235.62589 223.8685 230.2925 1023.763 100
## good() 2.359 2.5435 2.93111 2.8205 3.0625 4.675 100
As it turns out, sapply is much slower than plain for-loop. Lapply-family is not a substitute for true vectorization. However, one may argue that the soso2 function is easier to understand than the explicit loop in soso. In any case, it is easier to type when working on the interactive console.
Finally, it should be noted that when performing truly vectorized operations on long vectors, the overhead of the R interpreter becomes negligible and the resulting speed is almost equal to that of the corresponding C code. And optimized R code can easily beat non-optimized C.
B.2 More about if and else
The basics of if-else blocks were covered in Chapter Functions. Here we discuss some more advanced aspects.
B.2.1 Where To Put Else
Normally R wants to understand whether a line of code belongs to an earlier statement, or is the beginning of a fresh one. For instance, if the if-block ends with a closing brace on line 3 and else begins line 4, the code may fail. R finishes the if-block on line 3 and thinks line 4 is the beginning of a new statement. It then gets confused by the unaccompanied else, as it has already forgotten about the if above. However, this code works if used inside a function, or more generally, inside a {…} block.
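The difference can be demonstrated by handing the same lines to parse(), first as top-level code and then wrapped in a {…} block. This is only a sketch: the condition x > 0, the variable y, and the names code_lines, top_level, and in_block are assumptions made for illustration:

```r
# An if/else where 'else' starts a new line (line 4)
code_lines <- c("if (x > 0) {",
                "   y <- 1",
                "}",
                "else {",
                "   y <- -1",
                "}")

# At top level the parser gives up at the unaccompanied 'else'
top_level <- tryCatch(parse(text = code_lines),
                      error = function(e) "syntax error")
print(top_level)

# Inside a {...} block the very same lines parse and run fine
in_block <- parse(text = c("{", code_lines, "}"))
x <- -3
eval(in_block)
print(y)
```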
But else may always stay on the same line as the statement of the if-block. You are already familiar with the form where else follows the closing brace of the if-block, but this applies more generally: else may also follow a single statement that is not enclosed in braces, and the whole if-else construct may be written on a single line. Such a one-liner can use the fact that we can write several statements on a single line with ;. Arguably, such style is often a bad idea, but it has its place, for instance when creating anonymous functions for lapply and friends.
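A few layouts that keep else on the same line as the end of the if-part are sketched below. These are illustrative assumptions rather than the original examples, and the variable names are hypothetical:

```r
x <- -2

# Single statements, no braces, everything on one line
if (x > 0) y <- "positive" else y <- "not positive"

# A braced if-part followed by a single-statement else-part
if (x > 0) {
  y2 <- "positive"
} else y2 <- "not positive"

# A one-liner with several statements per branch, separated by ';'
if (x > 0) { a <- 1; b <- 2 } else { a <- -1; b <- -2 }
```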
B.2.2 Return Value
Like most commands in R, the if-else block has a return value. This is the last value evaluated by the block, and it can be assigned to variables. If the condition is false, and there is no else block, if returns NULL.
Using return values of if-else statement is often a recipe for hard-to-read code but may be a good choice in other circumstances, for instance where the multi-line form would take too much attention away from more crucial parts of the code. The line
## [1] "odd"
is rather easy to read. This is the closest general construct in R corresponding to the C conditional shortcut parity = n % 2 ? "odd" : "even".
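A sketch of such a line, together with the NULL case mentioned earlier. The value n <- 3 is an assumption chosen to reproduce the "odd" output above, and the variable name nothing is hypothetical:

```r
n <- 3

# The value of the if-else expression is assigned directly
parity <- if (n %% 2 == 0) "even" else "odd"
print(parity)

# A false condition without an else-part gives NULL
nothing <- if (n > 10) "big"
print(is.null(nothing))
```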
Another good place for such compressed if-else statements is inside anonymous functions in lapply and friends. For instance, the following code replaces NULL-s and NA-s in a list of characters:
emails <- list("ott@example.com", "xi@abc.net", NULL, "li@whitehouse.gov", NA)
n <- 0
sapply(emails, function(x) if(is.null(x) || is.na(x)) "-" else x)
## [1] "ott@example.com" "xi@abc.net" "-"
## [4] "li@whitehouse.gov" "-"
B.3 switch: Choosing Between Multiple Conditions
If-else is appropriate if we have a small number of conditions, potentially only one. Switch is a related construct that tests a large number of equality conditions. The best way to explain its syntax is through an example:
stock <- "AMZN"
switch(stock,
"AAPL" = "Apple Inc",
"AMZN" = "Amazon.com Inc",
"INTC" = "Intel Corp",
"unknown")
## [1] "Amazon.com Inc"
Switch attempts to match the expression, here stock = "AMZN", to a list of alternatives. If the expression matches the name of one of the arguments, the value of that argument is returned. If none matches, the nameless default argument (here “unknown”) is returned. In case there is no default argument, switch returns NULL.
Switch allows specifying the same return value for multiple cases:
## [1] "something else"
If the matching named argument is empty, the next non-empty argument after it is evaluated. In this case this is “something else”, corresponding to the name “INTC”.
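A sketch of a call that would behave as described, reusing the stock symbols from the earlier example. The exact set of cases and the variable name result are assumptions:

```r
stock <- "AMZN"
result <- switch(stock,
                 "AAPL" = "Apple Inc",
                 "AMZN" = ,                  # empty: falls through to the next case
                 "INTC" = "something else",
                 "unknown")
print(result)
```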
As in the case of if, switch only accepts length-1 expressions. Switch also allows using integers instead of character expressions. In that case the list of return values should be without names.
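A brief sketch of the integer form, with illustrative values and variable names. The integer selects the corresponding unnamed argument by position, and an integer outside the range gives NULL:

```r
# The integer picks an alternative by position
second <- switch(2, "Apple Inc", "Amazon.com Inc", "Intel Corp")
print(second)

# No fourth alternative exists, so the result is NULL
out_of_range <- switch(4, "Apple Inc", "Amazon.com Inc", "Intel Corp")
print(is.null(out_of_range))
```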