Chapter 7 Vectors

This chapter covers the foundational concepts for working with vectors in R. Vectors are the fundamental data type in R: in order to use R, you need to become comfortable with vectors. This chapter will discuss how R stores information in vectors, the way in which operations are executed in vectorized form, and how to extract subsets of vectors. These concepts are key to effectively programming in R.

7.1 What is a Vector?

Vectors are one-dimensional ordered collections of values that are all stored in a single variable. For example, you can make a vector people that contains the character strings “Sarah”, “Amit”, and “Zhang”. Alternatively, you could make a vector numbers that stores the numbers from 1 to 100. Each value in a vector is referred to as an element of that vector; thus the people vector would have 3 elements, "Sarah", "Amit", and "Zhang", and the numbers vector would have 100 elements. Ordered means that once in the vector, the elements remain there in their original order. If “Amit” was put in second place, it will stay in second place unless explicitly moved.

Unfortunately, R uses the word “vector” in at least five different, sometimes contradictory, senses. Here we focus on atomic vectors, that is, vectors that contain the atomic data types. A different class of vectors is generalized vectors, or lists, which are the topic of the chapter Lists.

An atomic vector can only contain elements of an atomic data type, such as numeric, integer, character, or logical. Importantly, all the elements in a vector need to have the same type. You can’t have an atomic vector whose elements include both numbers and character strings.

7.2 Creating Vectors

The easiest and most universal syntax for creating vectors is to use the built-in c() function, which is used to combine values into a vector. The c() function takes in any number of arguments of the same type (separated by commas as usual), and returns a vector that contains those elements:
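A minimal sketch of c() in use. The people vector follows the example named earlier in the text; the values in numbers are an illustrative assumption:

```r
# Combine three character strings into one vector
people <- c("Sarah", "Amit", "Zhang")
print(people)   # [1] "Sarah" "Amit"  "Zhang"

# Combine numbers in the same way (illustrative values)
numbers <- c(1, 2, 3, 4, 5)
print(numbers)  # [1] 1 2 3 4 5
```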

You can use the length() function to determine how many elements are in a vector:
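As a brief sketch, using the people and numbers vectors described at the start of the chapter:

```r
people <- c("Sarah", "Amit", "Zhang")
people_length <- length(people)    # 3

numbers <- seq(1, 100)             # the numbers from 1 to 100
numbers_length <- length(numbers)  # 100
```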

As atomic vectors can only contain elements of the same type, c() automatically casts (converts) one type to another if necessary (and if possible). For instance, when attempting to create a vector containing the number 1 and the character “a”,

we get a character vector where the number 1 was converted to a character “1”. This is a frequent problem when reading data where some fields contain invalid number codes.
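The conversion can be checked with a minimal sketch; the variable name mixed is an assumption made for illustration:

```r
# Mixing a number and a character string forces a common type
mixed <- c(1, "a")
print(mixed)                # [1] "1" "a"
mixed_class <- class(mixed) # "character"
```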

There are other handy ways to create vectors. For example, the seq() function mentioned in chapter 6 takes 2 (or more) arguments and produces a vector of the integers between them. An optional third argument specifies by which step to increment the numbers:
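A short sketch of seq(). The name one_to_ninety is the one the text refers to; odds is a hypothetical name chosen for this example:

```r
# All integers from 1 to 90
one_to_ninety <- seq(1, 90)
print(one_to_ninety)  # long output wraps over several lines

# From 1 to 10 in steps of 2: 1, 3, 5, 7, 9
odds <- seq(1, 10, 2)
```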

  • When you print out one_to_ninety, you’ll notice that in addition to the leading [1] that you’ve seen in all printed results, there are additional bracketed numbers at the start of each line. These bracketed numbers tell you which element number (index, see below) each line starts at. Thus the [1] means that the printed line shows elements starting at element number 1, a [20] means that the printed line shows elements starting at element number 20, and so on. This makes the output more readable, so you know where in the vector you are when looking at a printed line of elements!

As a shorthand, you can produce a sequence with the colon operator (a:b), which returns a vector a to b with the element values incrementing by 1:
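For instance, the following sketch builds the same sequence as seq(1, 90); the second variable name is an illustrative assumption:

```r
# The integers 1 through 90, incrementing by 1
one_to_ninety <- 1:90

# Any start and end values work: 5, 6, 7, 8, 9, 10
five_to_ten <- 5:10
```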

Another useful function that creates vectors is rep(), which repeats its first argument:
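A minimal sketch with made-up values; the variable names are hypothetical:

```r
# Repeat a single value three times: "hi" "hi" "hi"
greetings <- rep("hi", 3)

# Repeat a whole vector twice: 1 2 1 2
pattern <- rep(c(1, 2), 2)
```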

c() can also be used to add elements to an existing vector:
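A sketch of extending the people vector; more_people is a hypothetical name for the result:

```r
people <- c("Sarah", "Amit", "Zhang")

# Append one more element to the end of the existing vector
more_people <- c(people, "Josh")
print(more_people)  # "Sarah", "Amit", "Zhang", then "Josh"
```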

Note that c() retains the order of elements—“Josh” will be the last element in the extended vector.

All the vector creation functions we introduced here, c(), seq(), and rep(), are considerably more powerful and complex than the brief discussion above suggests. You are encouraged to read the help pages!

7.3 Vector Indices

Vectors are the fundamental structure for storing collections of data. Yet you often want to only work with some of the data in a vector. This section will discuss a few ways that you can get a subset of elements in a vector.

In particular, you can refer to individual elements in a vector by their index (more specifically numeric index), which is the number of their position in the vector. For example, in the vector:
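As an illustration, assume a vector of the five vowels (the name vowels and the last three elements are assumptions; the text only names 'a' and 'e'):

```r
# Index:    1    2    3    4    5
vowels <- c('a', 'e', 'i', 'o', 'u')
```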

The 'a' (the first element) is at index 1, 'e' (the second element) is at index 2, and so on.

Note that in R, vector elements are indexed starting with 1 (one-based indexing). This is distinct from many other programming languages, which use zero-based indexing and so reference the first element at index 0.

7.3.1 Simple Numeric Indices

You can retrieve a value from a vector using bracket notation: you refer to the element at a particular index of a vector by writing the name of the vector, followed by square brackets ([]) that contain the index of interest:
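A minimal sketch with the people vector; second_person is a hypothetical name added for illustration:

```r
people <- c("Sarah", "Amit", "Zhang")

# Extract the element at index 1
first_person <- people[1]
print(first_person)  # [1] "Sarah"

# Extract the element at index 2
second_person <- people[2]
```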

Don’t get confused by the [1] in the printed output—it doesn’t refer to which index you got from people, but what index in the extracted result (e.g., stored in first_person) is being printed!

If you specify an index that is out-of-bounds (e.g., greater than the number of elements in the vector) in the square brackets, you will get back the value NA, which stands for Not Available. Note that this is not the character string "NA", but a specific value, specially designed to denote missing data.
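A small sketch of the out-of-bounds case; the index 10 and the variable name are arbitrary choices for illustration:

```r
people <- c("Sarah", "Amit", "Zhang")

# There is no tenth element, so the result is NA
no_such_person <- people[10]
print(no_such_person)  # [1] NA

# is.na() tests for the special NA value
is_missing <- is.na(no_such_person)  # TRUE
```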

If you specify a negative index in the square-brackets, R will return all elements except the (negative) index specified:
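For example, assuming a vector of vowels (a hypothetical name and contents):

```r
vowels <- c('a', 'e', 'i', 'o', 'u')

# Everything except the second element: "a" "i" "o" "u"
all_but_e <- vowels[-2]
```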

7.3.2 Multiple Indices

Remember that in R, all atomic objects are vectors. This means that when you put a single number inside the square brackets, you’re actually putting a vector with a single element into the brackets. So what you’re really doing is specifying a vector of indices that you want R to extract from the vector. As such, you can put a vector of any length inside the brackets, and R will extract all the elements with those indices from the vector (producing a subset of the vector elements):
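A sketch of indexing with a vector of indices; the vector contents and names are illustrative assumptions:

```r
vowels <- c('a', 'e', 'i', 'o', 'u')

# Store the indices in a variable first...
indices <- c(1, 3)
some_vowels <- vowels[indices]    # "a" "i"

# ...or write the index vector directly inside the brackets
other_vowels <- vowels[c(5, 2)]   # "u" "e"
```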

It’s very handy to use the colon operator to quickly specify a range of indices to extract:
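A minimal sketch, assuming a hypothetical six-element vector colors:

```r
colors <- c("red", "green", "blue", "yellow", "purple", "orange")

# 2:5 is the index vector 2, 3, 4, 5
middle_colors <- colors[2:5]  # "green" "blue" "yellow" "purple"
```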

This easily reads as “a vector of the elements in positions 2 through 5”.

The object returned by multiple indexing (and also by a single index) is a copy of the original, unlike in some other programming languages. This is good news in terms of avoiding unexpected effects: modifying the returned copy does not affect the original. However, copying large objects may be costly and make your code slow.

7.3.3 Logical Indexing

In the above section, you used a vector of indices (numeric values) to retrieve a subset of elements from a vector. Alternatively, you can put a vector of logical values inside the square brackets to specify which ones you want to extract (TRUE in the corresponding position means extract, FALSE means don’t extract):
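A sketch using the shoe_sizes and filter vectors that the text refers to; the shoe size values are the ones quoted later in the chapter, and selected_sizes is a hypothetical name:

```r
shoe_sizes <- c(7, 6.5, 4, 11, 8)

# One logical value per element: TRUE means "extract this one"
filter <- c(TRUE, FALSE, FALSE, TRUE, TRUE)

selected_sizes <- shoe_sizes[filter]  # elements 7, 11, and 8
```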

R will go through the boolean vector and extract every item at a position that is TRUE. In the example above, since filter is TRUE at indices 1, 4, and 5, shoe_sizes[filter] returns a vector with the elements from indices 1, 4, and 5.

This may seem a bit strange, but it is actually incredibly powerful because it lets you select elements from a vector that meet a certain criterion (called filtering). You perform this filtering operation by first creating a vector of boolean values that correspond to the indices meeting that criterion, and then putting that filter vector inside the square brackets:
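A three-line sketch of this filtering pattern; the threshold 6.5 and the name shoe_is_big come from the surrounding text, while big_shoes is a hypothetical name:

```r
shoe_sizes <- c(7, 6.5, 4, 11, 8)
shoe_is_big <- shoe_sizes > 6.5       # TRUE FALSE FALSE TRUE TRUE
big_shoes <- shoe_sizes[shoe_is_big]  # elements 7, 11, and 8
```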

There is often little reason to explicitly create the index vector shoe_is_big. You can combine the second and third lines of code into a single statement with an anonymous index vector:
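The combined form might look like this (big_shoes is a hypothetical name for the result):

```r
shoe_sizes <- c(7, 6.5, 4, 11, 8)

# The comparison inside the brackets produces the logical index on the fly
big_shoes <- shoe_sizes[shoe_sizes > 6.5]  # elements 7, 11, and 8
```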

You can think of this statement as saying “shoe_sizes where shoe_sizes is greater than 6.5”. This is a valid statement because the logical expression inside the square brackets (shoe_sizes > 6.5) is evaluated first, producing an anonymous boolean vector which is then used to filter the shoe_sizes vector.

This kind of filtering is immensely popular in real-life applications.

7.3.4 Named Vectors and Character Indexing

All the vectors we created above were created without names. But vector elements can have names, and when they do, we can access the elements by name. There are two ways to create named vectors.

First, we can add names when creating vectors with the c() function:
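A minimal sketch; the vector name param, the names gamma and alpha, and all values are assumptions made for illustration (only the name "c-2" is taken from the text):

```r
# name = value pairs; "c-2" must be quoted because it is not a valid variable name
param <- c(gamma = 3, alpha = 1.7, "c-2" = -1.33)
print(param)  # names are printed above the values
```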

This creates a numeric vector of length 3 where each element has a name. Note that we have to quote the names, such as “c-2”, that are not valid R variable names. Note also that the printout differs from that of unnamed vectors, in particular the index position ([1]) is not printed.

Alternatively, we can set names for an already existing vector using the names() function (see note 1):
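A sketch of this second approach. The vector name numbers and the names "B" and "D" appear in the text; the values and the remaining names are assumptions:

```r
# Start with an unnamed vector
numbers <- 1:5

# Assign one name per element
names(numbers) <- c("A", "B", "C", "D", "E")
print(numbers)
```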

Now that we have a named vector, we can access its elements by name. For instance:
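A sketch of character indexing, assuming numbers holds the values 1 to 5 named "A" to "E" (the values and result names are illustrative assumptions):

```r
numbers <- 1:5
names(numbers) <- c("A", "B", "C", "D", "E")

# A single name
just_c <- numbers["C"]            # the element named "C" (value 3)

# Several names, in any order
d_and_b <- numbers[c("D", "B")]   # elements "D" (4) and "B" (2), in that order
```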

Note that in the latter case the names "B" and "D" are in the “wrong order”, i.e. not in the same order as they appear in the vector numbers. However, this works just fine: the elements are extracted in the order they are specified in the index. (This is only possible with character and numeric indices; a logical index can only extract elements in the “right” order.)

While most vectors we encounter in this book gain little by names, exactly the same approach also applies to lists and data frames where character indexing is one of the important workhorses.

Another important use of named vectors in R is as a substitute for maps (aka dictionaries). Maps are just lookup tables where we can look up the value that corresponds to a given name (key). For instance, the example above found the values that correspond to the names "D" and "B".

7.4 Modifying Vectors

Indexing can also be used to modify elements within the vector. To do this, put the extracted subset on the left-hand side of the assignment operator, and then assign the element a new value:
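A minimal sketch; the vector prices and its values are hypothetical:

```r
prices <- c(25, 28, 30)

# Replace the first element with a new value
prices[1] <- 20
print(prices)  # [1] 20 28 30
```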

And of course, there’s no reason that you can’t select multiple elements on the left-hand side, and assign them multiple values. The assignment operator is vectorized!
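For example, with a hypothetical prices vector:

```r
prices <- c(25, 28, 30)

# Assign new values to the first and third elements in one statement
prices[c(1, 3)] <- c(20, 35)
print(prices)  # [1] 20 28 35
```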

If your vector has names, you can use character indexing in exactly the same way.

Logical indexing offers some very powerful possibilities. Imagine you had a vector of values in which you wanted to replace all numbers greater than 10 with the number 10 (to “cap” the values). We can achieve this with a one-liner:
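A sketch of the capping one-liner; the name v1 is the one used in the text, and its values are made up for illustration:

```r
v1 <- c(1, 5, 55, 1, 3, 11, 4, 27)

# Replace every element greater than 10 with 10
v1[v1 > 10] <- 10
print(v1)  # [1]  1  5 10  1  3 10  4 10
```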

In this example, we first compute the logical index of “too large” values by v1 > 10, and thereafter assign the value 10 to all these elements in vector v1. Replacing a numeric vector by the absolute values of the elements can be done in a similar fashion:
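A sketch of the absolute-value replacement; the values in v are illustrative assumptions:

```r
v <- c(1, -1, 2, -2)

# Flip the sign of the negative elements only
v[v < 0] <- -v[v < 0]
print(v)  # [1] 1 1 2 2
```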

As a first step we find the logical index of the negative elements of v: v < 0. Next, we flip the sign of these elements in v by replacing these with -v[v < 0].

7.5 Vectorized Operations

Many R operators and functions are optimized for vectors, i.e. when fed a vector, they work on all elements of that vector. These operations are usually very fast and efficient.

7.5.1 Vectorized Operators

When performing operations (such as mathematical operations +, -, etc.) on vectors, the operation is applied to vector elements member-wise. This means that each element from the first vector operand is modified by the element in the same corresponding position in the second vector operand, in order to determine the value at the corresponding position of the resulting vector. E.g., if you want to add (+) two vectors, then the value of the first element in the result will be the sum (+) of the first elements in each vector, the second element in the result will be the sum of the second elements in each vector, and so on.
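A minimal sketch of member-wise arithmetic; the vector names and values are assumptions chosen for illustration:

```r
v1 <- c(1, 1, 1, 1, 1)
v2 <- c(1, 2, 3, 4, 5)

v_sum <- v1 + v2    # 2 3 4 5 6
v_diff <- v1 - v2   # 0 -1 -2 -3 -4
v_prod <- v1 * v2   # 1 2 3 4 5
```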

7.5.2 Vectorized Functions

Vectors In, Vector Out

Because all atomic objects are vectors, pretty much every function you’ve used so far has actually been applied to vectors, not just to single values. These are referred to as vectorized functions, and they will run significantly faster than non-vector approaches. You’ll find that functions work the same way for vectors as they do for single values, because single values are just instances of vectors! For instance, we can use paste() to concatenate the elements of two character vectors:
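A sketch with two hypothetical character vectors:

```r
colors <- c("Green", "Blue")
spaces <- c("sky", "grass")

# paste() joins the first elements, then the second elements,
# separating each pair with a space by default
band <- paste(colors, spaces)  # "Green sky" "Blue grass"
```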

Notice the same member-wise combination is occurring: the paste() function is applied to the first elements, then to the second elements, and so on.

For another example consider the round() function described in the previous chapter. This function rounds the given argument to the nearest whole number (or number of decimal places if specified).
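For example (the value 1.6 is the one the text refers to; the second call is an added illustration):

```r
# Round to the nearest whole number
rounded <- round(1.6)              # 2

# Round to two decimal places
rounded_pi <- round(3.14159, 2)    # 3.14
```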

But recall that the 1.6 in the above example is actually a vector of length 1. If we instead pass a longer vector as an argument, the function will perform the same rounding on each element in the vector.
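A sketch with an illustrative vector of values:

```r
nums <- c(1.6, 2.3, 7.8, 9.2)

# Each element is rounded individually
rounded_nums <- round(nums)  # 2 2 8 9
```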

This vectorization process is extremely powerful, and is a significant factor in what makes R an efficient language for working with large data sets (particularly in comparison to languages that require explicit iteration through elements in a collection). Thus to write really effective R code, you’ll need to be comfortable applying functions to vectors of data, and getting vectors of data back as results.

Just remember: when you use a vectorized function on a vector, you’re using that function on each item in the vector!

7.5.3 Recycling

Above we saw a number of vectorized operations, where similar operations were applied to elements of two vectors member-wise. However, what happens if the two vectors are of unequal length? Recycling refers to what R does when there is an unequal number of elements in two operand vectors. If R is tasked with performing a vectorized operation with two vectors of unequal length, it will reuse (recycle) elements from the shorter vector. For example:
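A sketch of recycling, with values chosen to match the sums described in the text; recycled_sum is a hypothetical name:

```r
v1 <- c(1, 3, 5, 8)
v2 <- c(1, 2)

# v2 is reused as 1, 2, 1, 2 to match the length of v1
recycled_sum <- v1 + v2
print(recycled_sum)  # [1]  2  5  6 10
```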

In this example, R first combined the elements in the first position of each vector (1+1=2). Then, it combined elements from the second position (3+2=5). When it got to the third element of v1 it ran out of elements of v2, so it went back to the beginning of v2 to select a value, yielding 5+1=6. Finally, it combined the 4th element of v1 (8) with the second element of v2 (2) to get 10.

If the longer object length is not a multiple of the shorter object length, R will issue a warning, notifying you that the lengths do not match. This warning doesn’t necessarily mean you did something wrong, although in practice this tends to be the case.
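The following sketch shows one way to observe this warning; the vectors are illustrative, and tryCatch() and suppressWarnings() are used here only to capture the message and the result separately:

```r
v1 <- c(1, 3, 5)
v2 <- c(1, 2)

# Capture the text of the warning instead of printing it
msg <- tryCatch(v1 + v2, warning = function(w) conditionMessage(w))
print(msg)

# The operation still completes: v2 is recycled as 1, 2, 1
result <- suppressWarnings(v1 + v2)  # 2 5 6
```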

7.5.4 R Is a Vectorized World

Actually we have already met many more examples of recycling and vectorized functions above. For instance, in the case of finding big shoes with

we first recycle the length-one vector 6.5 five times to match it with the shoe size vector c(7, 6.5, 4, 11, 8). Afterwards we use the vectorized operator > (or, strictly, the function ">") to compare each of the shoe sizes with the value 6.5. The result is a logical vector of length 5.
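The recycling step can be made explicit in a short sketch; the variable names are assumptions, and rep() is used here only to spell out what R does implicitly:

```r
shoe_sizes <- c(7, 6.5, 4, 11, 8)

# Implicit recycling of the single value 6.5
implicit <- shoe_sizes > 6.5

# The same comparison with 6.5 repeated five times by hand
explicit <- shoe_sizes > rep(6.5, 5)

same_result <- identical(implicit, explicit)  # TRUE
```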

This is also what happens if you add a vector and a “regular” single value (a scalar):
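A minimal sketch; the vector values are an illustrative assumption, while the scalar 4 follows the text:

```r
v <- c(3, 1, 4, 1, 5)

# The single value 4 is recycled for every element of v
v_plus_4 <- v + 4
print(v_plus_4)  # [1] 7 5 8 5 9
```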

As you can see (and probably expected), the operation added 4 to every element in the vector. This sensible behavior occurs because all atomic objects are vectors. Even when you thought you were creating a single value (a scalar), you were actually just creating a vector with a single element (length 1). When you create a variable storing the number 7 (with x <- 7), R creates a vector of length 1 with the number 7 as that single element:
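This can be checked with a short sketch (the result variable names are hypothetical):

```r
x <- 7

x_is_vector <- is.vector(x)  # TRUE
x_length <- length(x)        # 1
print(x)                     # [1] 7
```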

  • This is why R prints the [1] in front of all results: it’s telling you that it’s showing a vector (which happens to have 1 element) starting at element number 1.

  • This is also why you can’t use the length() function to get the length of a character string; it just returns the length of the vector containing that string (1). Instead, use the nchar() function to get the number of characters in each element in a character vector.
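The difference between length() and nchar() can be seen in a brief sketch; the strings and variable names are illustrative:

```r
greeting <- "Hello"

string_length <- length(greeting)  # 1: a vector with one element
char_count <- nchar(greeting)      # 5: characters in that element

# nchar() is itself vectorized
people <- c("Sarah", "Amit", "Zhang")
name_lengths <- nchar(people)      # 5 4 5
```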

Thus when you add a “scalar” such as 4 to a vector, what you’re really doing is adding a vector with a single element 4. As such the same recycling principle applies, and that single element is “recycled” and applied to each element of the first operand.

Note: here we are implicitly using the word vector with two different meanings. One is the way R stores objects (atomic vector); the other is a vector in the mathematical sense, as the opposite of a scalar. Similar confusion also occurs with matrices. Matrices as mathematical objects are distinct from vectors (and scalars). In R they are stored as vectors, and treated as matrices in dedicated matrix operations only.

Finally, you should also know that there are many kinds of objects in R that are not vectors. These include functions, and many other more “exotic” objects.

  1. Strictly speaking, this is the replacement function names<-, the assignment form that sets the names, in contrast to the names() function, which extracts names from an object.