B R Language Control Structures

In chapter Functions we briefly described conditional execution using the if/else construct. Here we introduce the other main control structures, and cover the if/else statement in more detail.

B.1 Loops

One thing computers are good at is executing similar tasks many times over. This can be achieved through loops. R implements three loops: the for-loop, the while-loop, and the repeat-loop. These are among the most popular loops in many modern programming languages. Loops are often advised against in R. While it is true that loops in R are more demanding than in several other languages, and can often be replaced by more efficient constructs, they nevertheless have a very important role in R. Good usage of loops is easy, fast, and improves the readability of your code.

B.1.1 For Loop

A for-loop is a good choice if you want to run some statements repeatedly, each time picking a different value for a variable from a sequence, or if you just want to repeat some statements a given number of times. The syntax of the for-loop is the following:

for(variable in sequence) {
                           # do something many times.  Each time 'variable' has
                           # a different value, taken from 'sequence'
}

The statement block inside of the curly braces {} is repeated through the sequence.

Despite the curly braces, the loop body is still evaluated in the same environment as the rest of the code. This means the variables generated before entering the loop are visible from inside, and the variables created inside will be visible after the loop ends. This is in contrast to functions where the function body, also enclosed in curly braces, is a separate scope.
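A minimal sketch of this scoping difference. The names total, last_square, add_one, and inside_fun are made up for illustration:

```r
total <- 0                      # created before the loop
for(k in 1:3) {
   total <- total + k           # modifies the outer variable directly
   last_square <- k^2           # created inside the loop
}
total          # 6
last_square    # 9: still visible after the loop
k              # 3: the loop variable survives as well

add_one <- function(x) {
   inside_fun <- x + 1          # local to this function call
   inside_fun
}
result <- add_one(1)
exists("inside_fun")            # FALSE: the function body is a separate scope
```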

A simple, and perhaps the most common, use of the for-loop is to run some code a given number of times:

for(i in 1:5)
   print(i)

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

Through the loop the variable i takes all the values from the sequence 1:5. Note: we do not need curly braces if the loop block only contains a single statement.

But the sequence does not have to be a simple sequence of numbers. It may as well be a list:

data <- list("Xiaotian", 25, c(2061234567, 2069876543))
for(field in data) {
   cat("field length:", length(field), "\n")
}

## field length: 1 
## field length: 1 
## field length: 2

As another less common example, we may run different functions over the same data:

data <- 1:4
for(fun in c(sqrt, exp, sin)) {
   print(fun(data))
}

## [1] 1.000000 1.414214 1.732051 2.000000
## [1]  2.718282  7.389056 20.085537 54.598150
## [1]  0.8414710  0.9092974  0.1411200 -0.7568025
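When the loop body needs the position of each element rather than the element itself, a common idiom is to loop over indices. The sketch below uses base R's seq_along(), which, unlike 1:length(x), gives an empty sequence for an empty vector. The variable names are made up for illustration:

```r
fruits <- c("apple", "banana", "cherry")
labels <- character(length(fruits))      # result vector of the right size
for(idx in seq_along(fruits)) {
   labels[idx] <- paste0(idx, ": ", fruits[idx])
}
labels

empty <- character(0)
runs <- 0
for(idx in seq_along(empty)) {
   runs <- runs + 1       # never reached: seq_along(empty) has length 0
}
runs                      # 0
length(1:length(empty))   # 2, because 1:0 counts down: c(1, 0)
```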

B.1.2 While Loop

A while-loop lets the programmer decide explicitly whether to execute the statements one more time:

while(condition) {
                           # do something as long as the 'condition' is true
}

For instance, we can repeat something a given number of times (although this is more of a task for for-loop):

i <- 0
while(i < 5) {
   print(i)
   i <- i + 1
}

## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4

Note that in case the condition is not true to begin with, the loop never executes and control is immediately passed to the code after the loop:

i <- 10
while(i < 5) {
   print(i)
   i <- i + 1
}
cat("done\n")

## done

While-loop is great for handling repeated attempts to get input in shape. One can write something like this:

data <- getData()
while(!dataIsGood(data)) {
   data <- getData(tryReallyHard = TRUE)
}

This will run getData() until data is good. If it is not good at the first try, the code tries really hard next time(s).
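To try the pattern out, the sketch below supplies hypothetical stand-ins for getData() and dataIsGood(), which are not defined in the text, and adds an attempt counter (an assumption, not part of the example above) so that the loop cannot run forever:

```r
# Hypothetical stand-ins: the first, lazy attempt returns a missing value
getData <- function(tryReallyHard = FALSE) {
   if(tryReallyHard) 42 else NA
}
dataIsGood <- function(x) !is.na(x)

attempts <- 1
data <- getData()
while(!dataIsGood(data) && attempts < 5) {   # give up after 5 attempts
   data <- getData(tryReallyHard = TRUE)
   attempts <- attempts + 1
}
data        # 42
attempts    # 2
```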

While also makes it easy to write infinite loops. The statement

while(TRUE) { ... }

will execute until something breaks the loop. Another somewhat interesting usage of while is to disable execution of part of your code:

while(FALSE) {
   ## code you don't want to be executed
}

B.1.3 Repeat Loop

R's repeat-loop does not include any condition, and hence it keeps running indefinitely, or until something breaks the execution (see below):

repeat {
   ## just do something indefinitely many times.
   ## if you want to stop, use `break` statement.
}

Repeat is useful in many of the same situations as while. For instance, we can write a modified version of getting data using repeat:

repeat {
   data <- getData()
   if(dataIsGood(data))
      break
}

The main difference between this and the previous while-version is that we a) don’t have to write the getData call twice, and b) attempt to get the data in the same way on each try.

As with for- and while-loops, enclosing the loop body in curly braces is only necessary if it contains more than one statement. For instance, if you want to close all the graphics devices left open by buggy code, you can use the following one-liner from the console:

repeat dev.off()

The loop is broken by the error raised when dev.off() attempts to close the null device, the only one left once all real devices are closed.

It should also be noted that the loop does not create a separate environment (scope). All the variables from outside of the loop are visible inside, and variables created inside of the loop remain accessible after the loop terminates.

B.1.4 Leaving Early: `break` and `next`

A straightforward way to leave a loop is the break statement. It just leaves the loop and transfers control to the code immediately after the loop body. It breaks all three loops discussed here (for, while, and repeat-loops) but not some of the other types of loops implemented outside of base R. For instance:

for(i in 1:10) {
   cat(i, "\n")
   if(i > 4)
      break
}

## 1 
## 2 
## 3 
## 4 
## 5

cat("done\n")

## done

As the loop body will use the same environment (the same variable values) as the rest of the code, the loop variables can be used right afterwards. For instance, we can see that i = 5:

print(i)

## [1] 5

In contrast to break, next sends the execution flow back to the head of the loop without running the commands that follow it. In a for-loop this means a new value is picked from the sequence; in a while-loop the condition is evaluated again; the head of a repeat-loop does not do any calculations. For instance, we can print only even numbers:

for(i in 1:10) {
   if(i %% 2 != 0)
      next
   print(i)
}

## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10

next skips the print statement whenever the remainder of division by 2 is not 0, that is, for odd numbers.
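In a while-loop, next returns straight to the condition, so anything the condition depends on must be updated before next is reached. Otherwise the loop never ends. A small sketch with made-up variable names:

```r
j <- 0
n_even <- 0
while(j < 10) {
   j <- j + 1              # update first: `next` jumps back to the condition
   if(j %% 2 != 0)
      next
   n_even <- n_even + 1    # only reached for even j
}
n_even    # 5
j         # 10
```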

B.1.5 When (Not) To Use Loops In R?

In general, use loops if they improve code readability, make it easier to extend the code, and do not cause too much inefficiency. In practice, this means using loops for repeating something slow. Don’t use loops if you can use vectorized operators instead.

For instance, look at the examples above where we were reading data until it was good. There is probably very little you can achieve by avoiding loops, except making your code messy. First, the input will probably be attempted only a few times (otherwise reconsider how you get your data), and second, data input is probably a far slower process than the loop overhead. Just use the loop.

Similar arguments hold if you consider looping over input files to read, output files to create, figures to plot, web pages to scrape… In all these cases the loop overhead is negligible compared to the task performed inside the loop. And there are no real vectorized alternatives anyway.

The opposite is the case where good alternatives exist. Never do this kind of computation in R:

res <- numeric()
for(i in 1:1000) {
   res <- c(res, sqrt(i))
}

This loop introduces three inefficiencies. First, and most importantly, you are growing your results vector res when adding new values to it. Each call to c() creates a new, longer vector and copies the old values into it, again and again. This is very slow. Second, as a true vectorized alternative exists here, one can easily speed up the code by an order of magnitude or more just by switching to the vectorized code. Third, and this is a very important point again, this code is much harder to read than the pure vectorized expression:

res <- sqrt(1:1000)

We can easily demonstrate how much faster the vectorized expressions work. Let’s do this using the microbenchmark package (it provides nanosecond timings). We wrap these expressions in functions for easier handling by microbenchmark. We also include a middle version where we use a loop but avoid the terribly slow vector re-allocation by creating the right-sized result in advance:

seq <- 1:1000 # global variables are easier to handle with microbenchmark
baad <- function() {
   r <- numeric()
   for(i in seq)
      r <- c(r, sqrt(i))
}
soso <- function() {
   r <- numeric(length(seq))
   for(i in seq)
      r[i] <- sqrt(i)
}
good <- function()
   sqrt(seq)
microbenchmark::microbenchmark(baad(), soso(), good())

## Unit: microseconds
##    expr     min       lq       mean   median       uq      max neval
##  baad() 826.782 836.8505 1121.75340 849.7555 876.0790 3884.545   100
##  soso()  42.018  42.5170   58.71166  43.0100  44.0435 1576.462   100
##  good()   2.259   2.4470    8.64512   2.6815   3.1675  549.440   100

When comparing the median values, we can see that the middle example is about 16x slower than the vectorized example, and the first version is more than 300x slower!

The above holds for “true” vectorized operations. R provides many pseudo-vectorizers (the lapply family) that may be handy in many circumstances, but not in case you need speed. Under the hood these functions just use a loop. We can demonstrate this by adding one more function to our family:

soso2 <- function() {
   sapply(seq, sqrt)
}
microbenchmark::microbenchmark(baad(), soso(), soso2(), good())

## Unit: microseconds
##     expr     min       lq       mean   median       uq      max neval
##   baad() 823.078 853.0670 1039.05146 862.1650 874.8660 3547.314   100
##   soso()  41.704  42.9815   43.67443  43.3550  43.7230   68.680   100
##  soso2() 219.362 222.0515  235.62589 223.8685 230.2925 1023.763   100
##   good()   2.359   2.5435    2.93111   2.8205   3.0625    4.675   100

As it turns out, sapply is much slower than a plain for-loop. The lapply family is not a substitute for true vectorization. However, one may argue that the soso2 function is easier to understand than the explicit loop in soso. In any case, it is easier to type when working on the interactive console.
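If the apply-style form is preferred for readability, base R's vapply() is a stricter relative of sapply(): its third argument declares the expected type and length of each result, and a mismatch raises an error. A small sketch, where soso3 is a made-up name following the functions above. No timing is claimed for it here:

```r
seq <- 1:1000
soso3 <- function() {
   vapply(seq, sqrt, numeric(1))   # every call must return exactly one number
}
all.equal(soso3(), sqrt(seq))      # TRUE: same values as the vectorized form
```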

Finally, it should be noted that when performing truly vectorized operations on long vectors, the overhead of the R interpreter becomes negligible and the resulting speed is almost equal to that of the corresponding C code. And optimized R code can easily beat non-optimized C.

B.2 More about `if` and `else`

The basics of if-else blocks were covered in Chapter Functions. Here we discuss some more advanced aspects.

B.2.1 Where To Put Else

Normally R needs to work out whether a line of code belongs to an earlier statement or is the beginning of a fresh one. For instance,

if(condition) {
   ## do something
}
else {
   ## do something else
}

may fail. R finishes the if-block on line 3 and treats line 4 as the beginning of a new statement. It then gets confused by the unaccompanied else, as it has already forgotten about the if above. However, this code works if used inside a function or, more generally, inside a {…} block.
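This behavior can be checked without running any of the code by passing it as text to base R's parse(), which only checks the syntax. In the sketch below, the variable names are made up, and the two strings hold the same if/else code, once on its own and once wrapped in curly braces:

```r
top_level <- "if(TRUE) {\n   1\n}\nelse {\n   2\n}"
in_braces <- paste0("{\n", top_level, "\n}")

# parse() only checks the syntax; nothing is evaluated here
top_ok <- tryCatch({ parse(text = top_level); TRUE },
                   error = function(e) FALSE)
braces_ok <- tryCatch({ parse(text = in_braces); TRUE },
                      error = function(e) FALSE)
top_ok       # FALSE: the parser stops at the unexpected 'else'
braces_ok    # TRUE: inside {} the else may start a new line
```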

But else can always go on the same line as the end of the if-block’s statement. You are already familiar with the form

if(condition) {
   ## do something
} else {
   ## do something else
}

but this applies more generally. For instance:

if(condition) 
   x <- 1 else {
             ## do something else
          }

if(condition) { ... } else {
             ## do something else
          }

if(condition) { x <- 1; y <- 2 } else z <- 3

The last one-liner uses the fact that we can write several statements on a single line with ;. Arguably, such style is often a bad idea, but it has its place, for instance when creating anonymous functions for lapply and friends.

B.2.2 Return Value

Like most commands in R, an if-else block has a return value. This is the last value evaluated by the block, and it can be assigned to variables. If the condition is false and there is no else block, if returns NULL.

Using return values of if-else statements is often a recipe for hard-to-read code, but it may be a good choice in other circumstances, for instance where the multi-line form would take too much attention away from more crucial parts of the code. The line

n <- 3
parity <- if(n %% 2 == 0) "even" else "odd"
parity

## [1] "odd"

is rather easy to read. This is the closest general construct in R to the C conditional operator, as in parity = n % 2 == 0 ? "even" : "odd".

Another good place for such compressed if-else statements is inside anonymous functions in lapply and friends. For instance, the following code replaces NULLs and NAs in a list of character strings:

emails <- list("ott@example.com", "xi@abc.net", NULL, "li@whitehouse.gov", NA)
sapply(emails, function(x) if(is.null(x) || is.na(x)) "-" else x)

## [1] "ott@example.com"   "xi@abc.net"        "-"                
## [4] "li@whitehouse.gov" "-"
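Two related points can be checked in a few lines: the NULL returned by an if without else, and, for whole vectors rather than a single condition, base R's vectorized ifelse(). The variable names below are made up for illustration:

```r
n <- 3
no_else <- if(n %% 2 == 0) "even"    # condition is FALSE and there is no else
is.null(no_else)                     # TRUE

numbers <- 1:4
parities <- ifelse(numbers %% 2 == 0, "even", "odd")
parities                             # "odd" "even" "odd" "even"
```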

B.3 `switch`: Choosing Between Multiple Conditions

If-else is appropriate if we have a small number of conditions, potentially only one. Switch is a related construct that tests a large number of equality conditions. The best way to explain its syntax is through an example:

stock <- "AMZN"
switch(stock,
       "AAPL" = "Apple Inc",
       "AMZN" = "Amazon.com Inc",
       "INTC" = "Intel Corp",
       "unknown")

## [1] "Amazon.com Inc"

The switch expression attempts to match its first argument, here stock = "AMZN", against the names of the alternatives. If one of the names matches, the value of that argument is returned. If none matches, the nameless default argument (here “unknown”) is returned. In case there is no default argument, switch returns NULL.

Switch allows you to specify the same return value for multiple cases:

switch(stock,
       "AAPL" = "Apple Inc",
       "AMZN" =, "INTC" = "something else",
       "unknown")

## [1] "something else"

If the matching named argument is empty, the next non-empty argument after it is evaluated. In this case this is “something else”, corresponding to the name “INTC”.

As in the case of if, switch only accepts length-1 expressions. Switch also accepts an integer instead of a character expression. In that case the alternatives are selected by position, and the list of return values should be given without names.
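A short sketch of both points. The integer form selects an alternative by position. Because switch handles one value at a time, a vector of values can be processed by calling it once per element. Here company_name is a made-up helper wrapping the stock example above:

```r
switch(2, "first", "second", "third")    # "second"
out_of_range <- switch(4, "first", "second", "third")
is.null(out_of_range)                    # TRUE: there is no fourth alternative

# Hypothetical helper: look up one ticker symbol at a time
company_name <- function(ticker) {
   switch(ticker,
          "AAPL" = "Apple Inc",
          "AMZN" = "Amazon.com Inc",
          "INTC" = "Intel Corp",
          "unknown")
}
companies <- vapply(c("INTC", "AAPL", "XYZ"), company_name, character(1),
                    USE.NAMES = FALSE)
companies    # "Intel Corp" "Apple Inc" "unknown"
```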