Chapter 6 Functions

This chapter will explore how to use functions in R to perform advanced capabilities and actually ask questions about data. After considering a function in an abstract sense, it will discuss using built-in R functions, accessing additional functions by loading R packages, and writing your own functions.

6.1 What are Functions?

In a broad sense, a function is a named sequence of instructions (lines of code) that you may want to perform one or more times throughout a program. They provide a way of encapsulating multiple instructions into a single “unit” that can be used in a variety of different contexts. So rather than needing to repeatedly write down all the individual instructions for “make a sandwich” every time you’re hungry, you can define a make_sandwich() function once and then just call (execute) that function when you want to perform those steps. Typically, functions also accept inputs so you can make the things slightly differently from time to time. For instance, sometimes you may want to call make_sandwich(cheese) while another time make_sandwich(chicken).

In addition to grouping instructions, functions in programming languages also allow to model the mathematical definition of function, i.e. perform certain operations on a number of inputs that lead to an output. For instance, look at the function max() that finds the largest value among numbers:

max(1,2,3)  # 3

The inputs are numbers 1, 2, and 3 in parenthesis, usually called arguments or parameters, and we say that these arguments are passed to a function (like a football). We say that a function then returns a value, number “3” in this example, which we can either print or assign to a variable and use later.

Finally, the functions may also have side effects. An example case is the cat() function that just prints it’s arguments. For instance, in case of the following line of code

cat("The answer is", 1+1, "\n")

we call function cat() with three arguments: "The answer is", 2 (note: 1+1 will be evaluated to 2), and "\n" (the line break symbol). However, here we may not care about the return value but in the side effect—the message being printed on the screen.

6.2 How to Use Functions

R functions are referred to by name (technically, they are values like any other variable, just not atomic values). As in many programming languages, we call a function by writing the name of the function followed immediately (no space) by parentheses (). Sometimes this is enough, for instance

Sys.Date()

gives us the current date and that’s it.

But often we want the function to do something with our inputs. In this case we put the arguments (inputs) inside the parenthesis, separated by commas (,). Thus computer functions look just like multi-variable mathematical functions, although usually with fancier names than f().

# call the sqrt() function, passing it 25 as an argument
sqrt(25)  # 5, square root of 25

# count number of characters in "Hello world" as an argument
nchar("Hello world")  # 11, note: space is a character too

# call the min() function, pass it 1, 6/8, AND 4/3 as arguments
# this is an example of a function that takes multiple args
min(1, 6/8, 4/3)  # 0.75, (6/8 is the smallest value)

To keep functions and ordinary variables distinct, we include empty parentheses () when referring to a function by name. This does not mean that the function takes no arguments, it is just a useful shorthand for indicating that something is a function.

Note: You always need to supply the parenthesis if you want to call the function (force it to do what it is supposed to do). If you leave the parenthesis out, you get the function definition printed on screen instead. So cat() is actually a function call while cat is the function. You can see that it is a function if you just print it as print(cat). However, we ignore this distinction here.

If you call any of these functions interactively, R will display the returned value (the output) in the console. However, the computer is not able to “read” what is written in the console—that’s for humans to view! If you want the computer to be able to use a returned value, you will need to give that value a name so that the computer can refer to it. That is, you need to store the returned value in a variable:

# store min value in smallest.number variable
smallest_number <- min(1, 6/8, 4/3)

# we can then use the variable as normal, such as for a comparison
min_is_big <- smallest_number > 1  # FALSE

# we can also use functions directly when doing computations
phi <- .5 + sqrt(5)/2  # 1.618...

# we can even pass the result of a function as an argument to another!
# watch out for where the parentheses close!
print(min(1.5, sqrt(3)))  # prints 1.5

In the last example, the resulting value of the “inner” function (e.g., sqrt()) is immediately used as an argument for the middle function (i.e. min()), value of which is fed in turn to the outer function print(). Because that value is used immediately, we don’t have to assign it a separate variable name. It is known as an anonymous variable.
note also that in the last example, we are solely interested in the side effect of the print() function. It also returns it’s argument (min(1.5, sqrt(3)) here) but we do not store it in a variable.

R functions take two types of arguments: positional arguments and named arguments: the function has to know how to treat each of it’s arguments. For instance, we can round number e to 3 digits by round(2.718282, 3). But in order to do this, the round() function must know that 2.718282 is the number and 3 is the requested number of digits, and not the other way around. It understands this because it requires the number to be the first argument, and digits the second argument. This approach works well in case of known small number of inputs. However, this is not an option for functions with variable number of arguments, such as cat(). cat() just prints out all of it’s (potentially a large number of) inputs, except a limited number of special named arguments. One of these is sep, the string to be placed between the other pieces of output (by default just a space is printed). Note the difference in output between

cat(1, 2, "-", "\n")  # 1 2 - 
cat(1, 2, sep="-", "\n")  # 1-2-

In the first case cat() prints 1, 2, "-", and the line break "\n", all separated by a space. In the second case the name sep ensures that "-" is not printed out but instead treated as the separator between 1, 2 and "\n".

6.3 Built-in R Functions

As you have likely noticed, R comes with a variety of functions that are built into the language. In the above example, we used the print() function to print a value to the console, the min() function to find the smallest number among the arguments, and the sqrt() function to take the square root of a number. Here is a very limited list of functions you can experiment with (or see a few more here).

Function Name	Description	Example
`sum(a,b,...)`	Calculates the sum of all input values	`sum(1, 5)` returns `6`
`round(x,digits)`	Rounds the first argument to the given number of digits	`round(3.1415, 3)` returns `3.142`
`toupper(str)`	Returns the characters in uppercase	`toupper("hi there")` returns `"HI THERE"`
`paste(a,b,...)`	Concatenate (combine) characters into one value	`paste("hi", "there")` returns `"hi there"`
`nchar(str)`	Counts the number of characters in a string	`nchar("hi there")` returns `8` (space is a character!)
`c(a,b,...)`	Concatenate (combine) multiple items into a vector (see chapter 7)	`c(1, 2)` returns `1, 2`
`seq(a,b)`	Return a sequence of numbers from a to b	`seq(1, 5)` returns `1, 2, 3, 4, 5`

To learn more about any individual function, look them up in the R documentation by using ?FunctionName account as described in the previous chapter.

“Knowing” how to program in a language is to some extent simply “knowing” what provided functions are available in that language. Thus you should look around and become familiar with these functions… but do not feel that you need to memorize them! It’s enough to simply be aware “oh yeah, there was a function that sums up numbers”, and then be able to look up the name and argument for that function.

6.4 Loading Functions

Although R comes with lots of built-in functions, you can always use more! Packages (also known as libraries) are additional sets of R functions (and data and variables) that are written and published by the R community. Because many R users encounter the same data management/analysis challenges, programmers are able to use these libraries and thus benefit from the work of others (this is the amazing thing about the open-source community—people solve problems and then make those solutions available to others). Popular packages include dplyr for manipulating data, ggplot2 for visualizations, and data.table for handling large datasets.

Most of the R packages do not ship with the R software by default, and need to be downloaded (once) and then loaded into your interpreter’s environment (each time you wish to use them). While this may seem cumbersome, it is a necessary trade-off between speed and size. R software would be huge and slow if it would include all available packages.

Luckily, it is quite simple to install and load R packages from within R. To do so, you’ll need to use the built-in R functions install.packages and library. Below is an example of installing and loading the stringr package (which contains many handy functions for working with character strings):

# Install the `stringr` package. Only needs to be done once on your machine
install.packages("stringr")

We stress here that you need to install each package only once per computer. As installation may be slow and and resource-demanding, you should not do it repeatedly inside your script!. Even more, if your script is also run by other users on their computers, you should get their explicit consent before installing additional software for them! The easiest remedy is to solely rely on manual installation.

Exactly the same syntax—install.packages("stringr")—is also used for re-installing it. You may want to re-install it if a newer version comes out, of if you upgrade your R and receive warnings about the package being built under a previous version of R.

After installation, the easiest way to get access to the functions is by loading the package:

# Load the package (make stringr() functions available in this R session/program)
library("stringr") # quotes optional here

This makes all functions in the stringr package available for R (see the documentation for a list of functions included with the stringr library). For instance, if we want to pad the word “justice” from left with tildes to create a width-10 string, we can do

str_pad("justice", 10, "left", "~")  # "~~~justice"

We can use str_pad function without any additional hassle because library() command made it available.

This is an easy and popular approach. However, what happens if more than one package call a function by the same name? For instance, many packages implement function filter(). In this case the more recent package will mask the function as defined by the previous package. You will also see related warnings when you load the library. In case you want to use a masked function you can write something like package::function() in order to call it. For instance, we can do the example above with

stringr::str_pad("justice", 10, "left", "~")  # "~~~justice"

This approach—specifying namespace in front of the function—ensures we access the function in the right package. If we call all functions in this way, we don’t even need to load the package with library() command. This is the preferred approach if you only need few functions from a large library.

6.5 Writing Functions

Even more exciting than loading other peoples’ functions is writing your own. Any time you have a task that you may repeat throughout a script—or sometimes when you just want to organize your code better—it’s a good practice to write a function to perform that task. This will limit repetition and reduce the likelihood of errors as well as make things easier to read and understand (and thus identify flaws in your analysis).

Functions are values like numbers and characters, so we use the assignment operator (<-) to store a function into a variable.

Although “values” (more precisely objects), functions are not atomic objects. But they can still be stored and manipulated in many ways like other objects.

Tidyverse style guide we follow in this books suggests to use verbs as function names, written in snake_case just like other variables.

The best way to understand the syntax for defining a function is to look at an example:

# A function named `make_full_name` that takes two arguments
# and returns the "full name" made from them
make_full_name <- function(first_name, last_name) {
  # Function body: perform tasks in here
  full_name <- paste(first_name, last_name)
                           # paste joins first and last name into a
                           # single string
  # Return: what you want the function to output
  return(full_name)
}

# Call the `make_full_name` function with the values "Alice" and "Kim"
my_name <- make_full_name("Alice", "Kim")  # "Alice Kim"

Function definition contains several important pieces:

The value assigned to the function variable uses the syntax function(....) to indicate that you are creating a function (as opposed to a number or character string).
If you want to feed your function with certain inputs, these must be referred somehow inside of your function. You list these names between the parentheses in the function(....) declaration. These are called formal arguments. Formal arguments will contain the values passed in when calling the function (called actual arguments). For example, when we call make_full_name("Alice", "Kim"), the value of the first actual argument ("Alice") will be assigned to the first formal argument (first_name), and the value of the second actual argument ("Kim") will be assigned to the second formal argument (last_name). Inside the function’s body, both of these formal arguments behave exactly as ordinary variables with values “Alice” and “Kim” respectively.

Importantly, we could have made the formal argument names anything we wanted (name_first, xyz, etc.), just as long as we then use these formal argument names to refer to the argument while inside the function.

Moreover, the formal argument names only apply while inside the function. You can think of them like “nicknames” for the values. The variables first_name, last_name, and full_name only exist within this particular function.
Body: The body of the function is a block of code that falls between curly braces {} (a “block” is represented by curly braces surrounding code statements). Note that cleanest style is to put the opening { immediately after the arguments list, and the closing } on its own line.

The function body specifies all the instructions (lines of code) that your function will perform. A function can contain as many lines of code as you want—you’ll usually want more than 1 to make it worth while, but if you have more than 20 you might want to break it up into separate functions. You can use the formal arguments here as any other variables, you can create new variables, call other functions, you can even declare functions inside functions… basically any code that you would write outside of a function can be written inside of one as well!

All the variables you create in the function body are local variables. These are only visible from within the function and “will be forgotten” as soon as you return from the function. However, variables defined outside of the function are still visible from within.
Return value is what your function produces. You can specify this by calling the return() function and passing that the value that you wish your function to return. It is typically the last line of the function. Note that even though we returned a variable called full_name, that variable was local to the function and so doesn’t exist outside of it; thus we have to store the returned value into another variable if we want to use it later (as with name <- make_full_name("Alice", "Kim")).

return() statement is usually unnecessary as R implicitly returns the last object it evaluated anyway. So we may shorten the function definition into

make_full_name <- function(first_name, last_name) {
  full_name <- paste(first_name, last_name)
}

or even not store the concatenated names into full_name:

make_full_name <- function(first_name, last_name) {
   paste(first_name, last_name)
}

The last evaluation was concatenating the first and last name, and hence the full name will be implicitly returned.

We can call (execute) a function we defined exactly in the same way we call built-in functions. When we do so, R will take the actual arguments we passed in (e.g., "Alice" and "Kim") and assign them to the formal arguments. Then it executes each line of code in the function body one at a time. When it gets to the return() call, it will end the function and return the given value, which can then be assigned to a different variable outside of the function.

6.6 Conditional Statements

Functions are one way to organize and control the flow of execution (e.g., what lines of code get run in what order). There are other ways. In all languages, we can specify that different instructions will be run based on a different set of conditions. These are Conditional statements which specify which chunks of code will run depending on which conditions are true. This is valuable both within and outside of functions.

In an abstract sense, an conditional statement is saying:

IF something is true
  do some lines of code
OTHERWISE
  do some other lines of code

In R, we write these conditional statements using the keywords if and else and the following syntax:

if (condition) {
  # lines of code to run if condition is TRUE
} else {
  # lines of code to run if condition is FALSE
}

(Note that the the else needs to be on the same line as the closing } of the if block. It is also possible to omit the else and its block).

The condition can be any variable or expression that resolves to a logical value (TRUE or FALSE). Thus both of the conditional statements below are valid:

porridge_temp <- 115 # in degrees F
if (porridge_temp > 120) {
  print("This porridge is too hot!")
}

too_cold <- porridge_temp < 70
# a logical value
if (too_cold) {
  print("This porridge is too cold!")
}

Note, we can extend the set of conditions evaluated using an else if statement. For example:

# Function to determine if you should eat porridge
test_food_temp <- function(temp) {
  if (temp > 120) {
    status <- "This porridge is too hot!"
  } else if (temp < 70) {
    status <- "This porridge is too cold!"
  } else {
    status <- "This porridge is just right!"
  }
  return(status)
}
# Use function on different temperatures
test_food_temp(119)  # "This porridge is just right!"
test_food_temp(60)   # "This porridge is too cold!"
test_food_temp(150)  # "This porridge is too hot!"

See more about if, else and other related constructs in Appendix C.

Technical Foundations of Informatics