Chapter 11 Accessing Web APIs

R is able to load data from external packages or read it from locally-saved .csv files, but it is also able to download data directly from web sites on the internet. This allows scripts to always work with the latest data available, performing analysis on data that may be changing rapidly (such as from social networks or other live events). Web services may make their data easily accessible to computer programs like R scripts by offering an Application Programming Interface (API). A web service’s API specifies where and how particular data may be accessed, and many web services follow a particular style known as Representational State Transfer (REST). This chapter will cover how to access and work with data from these RESTful APIs.

11.1 What is a Web API?

An interface is the point at which two different systems meet and communicate: exchanging informations and instructions. An Application Programming Interface (API) thus represents a way of communicating with a computer application by writing a computer program (a set of formal instructions understandable by a machine). APIs commonly take the form of functions that can be called to give instructions to programs—the set of functions provided by a library like dplyr make up the API for that library.

While most APIs provide an interface for utilizing functionality, other APIs provide an interface for accessing data. One of the most common sources of these data apis are web services: websites that offer an interface for accessing their data.

With web services, the interface (the set of “functions” you can call to access the data) takes the form of HTTP Requests—a request for data sent following the HyperText Transfer Protocol. This is the same protocol (way of communicating) used by your browser to view a web page! An HTTP Request represents a message that your computer sends to a web server (another computer on the internet which “serves”, or provides, information). That server, upon receiving the request, will determine what data to include in the response it sends back to the requesting computer. With a web browser, the response data takes the form of HTML files that the browser can render as web pages. With data APIs, the response data will be structured data that you can convert into R structures such as lists or data frames.

In short, loading data from a Web API involves sending an HTTP Request to a server for a particular piece of data, and then receiving and parsing the response to that request.

11.2 RESTful Requests

There are two parts to a request sent to an API: the name of the resource (data) that you wish to access, and a verb indicating what you want to do with that resource. In many ways, the verb is the function you want to call on the API, and the resource is an argument to that function.

11.2.1 URIs

Which resource you want to access is specified with a Uniform Resource Identifier (URI). A URI is a generalization of a URL (Uniform Resource Locator)—what you commonly think of as “web addresses”. URIs act a lot like the address on a postal letter sent within a large organization such as a university: you indicate the business address as well as the department and the person, and will get a different response (and different data) from Alice in Accounting than from Sally in Sales.

  • Note that the URI is the identifier (think: variable name) for the resource, while the resource is the actual data value that you want to access.

Like postal letter addresses, URIs have a very specific format used to direct the request to the right resource.

The format (schema) of a URI.

The format (schema) of a URI.

Not all parts of the format are required—for example, you don’t need a port, query, or fragment. Important parts of the format include:

  • scheme (protocol): the “language” that the computer will use to communicate the request to this resource. With web services this is normally https (secure HTTP)
  • domain: the address of the web server to request information from
  • path: which resource on that web server you wish to access. This may be the name of a file with an extension if you’re trying to access a particular file, but with web services it often just looks like a folder path!
  • query: extra parameters (arguments) about what resource to access.

The domain and path usually specify the resource. For example, www.domain.com/users might be an identifier for a resource which is a list of users. Note that web services can also have “subresources” by adding extra pieces to the path: www.domain.com/users/mike might refer to the specific “mike” user in that list.

With an API, the domain and path are often viewed as being broken up into two parts:

  • The Base URI is the domain and part of the path that is included on all resources. It acts as the “root” for any particular resource. For example, the GitHub API has a base URI of https://api.github.com/.

  • An Endpoint, or which resource on that domain you want to access. Each API will have many different endpoints.

    For example, GitHub includes endpoints such as:

    • /users/{user} to get information about a specific :user: id (the {} indicate a “variable”, in that you can put any username in there in place of the string "{user}"). Check out this example in your browser.
    • /orgs/{organization}/repos to get the repositories that are on an organization page. See an example.

Thus you can equivalently talk about accessing a particular resource and sending a request to a particular endpoint. The endpoint is appended to the end of the Base URI, so you could access a GitHub user by combining the Base URI (https://api.github.com) and endpoint (/users/mkfreeman) into a single string: https://api.github.com/users/mkfreeman. That URL will return a data structure of new releases, which you can request from your R program or simply view in your web-browser.

11.2.1.1 Query Parameters

Often in order to access only partial sets of data from a resource (e.g., to only get some users) you also include a set of query parameters. These are like extra arguments that are given to the request function. Query parameters are listed after a question mark ? in the URI, and are formed as key-value pairs similar to how you named items in lists. The key (parameter name) is listed first, followed by an equal sign =, followed by the value (parameter value); note that you can’t include any spaces in URIs! You can include multiple query parameters by putting an ampersand & between each key-value pair:

?firstParam=firstValue&secondParam=secondValue&thirdParam=thirdValue

Exactly what parameter names you need to include (and what are legal values to assign to that name) depends on the particular web service. Common examples include having parameters named q or query for searching, with a value being whatever term you want to search for: in https://www.google.com/search?q=informatics, the resource at the /search endpoint takes a query parameter q with the term you want to search for!

11.2.1.2 Access Tokens and API Keys

Many web services require you to register with them in order to send them requests. This allows them to limit access to the data, as well as to keep track of who is asking for what data (usually so that if someone starts “spamming” the service, they can be blocked).

To facilitate this tracking, many services provide Access Tokens (also called API Keys). These are unique strings of letters and numbers that identify a particular developer (like a secret password that only works for you). Web services will require you to include your access token as a query parameter in the request; the exact name of the parameter varies, but it often looks like access_token or api_key. When exploring a web service, keep an eye out for whether they require such tokens.

Access tokens act a lot like passwords; you will want to keep them secret and not share them with others. This means that you should not include them in your committed files, so that the passwords don’t get pushed to GitHub and shared with the world. The best way to get around this in R is to create a separate script file in your repo (e.g., api-keys.R) which includes exactly one line: assigning the key to a variable:

You can then include this file_name_ in a .gitignore file in your repo; that will keep it from even possibly being committed with your code!

In order to access this variable in your “main” script, you can use the source() function to load and run your api-keys.R script. This will execute the line of code that assigns the api_key variable, making it available in your environment for your use:

Anyone else who runs the script will simply need to provide an api_key variable to access the API using their key, keeping everyone’s account separate!

Watch out for APIs that mention using OAuth when explaining API keys. OAuth is a system for performing authentification—that is, letting someone log into a website from your application (like what a “Log in with Facebook” button does). OAuth systems require more than one access key, and these keys must be kept secret and usually require you to run a web server to utilize them correctly (which requires lots of extra setup, see the full httr docs for details). So for this course, we encourage you to avoid anything that needs OAuth

11.2.2 HTTP Verbs

When you send a request to a particular resource, you need to indicate what you want to do with that resource. When you load web content, you are typically sending a request to retrieve information (logically, this is a GET request). However, there are other actions you can perform to modify the data structure on the server. This is done by specifying an HTTP Verb in the request. The HTTP protocol supports the following verbs:

  • GET Return a representation of the current state of the resource
  • POST Add a new subresource (e.g., insert a record)
  • PUT Update the resource to have a new state
  • PATCH Update a portion of the resource’s state
  • DELETE Remove the resource
  • OPTIONS Return the set of methods that can be performed on the resource

By far the most common verb is GET, which is used to “get” (download) data from a web service. Depending on how you connect to your API (i.e., which programming language you are using), you’ll specify the verb of interest to indicate what we want to do to a particular resource.

Overall, this structure of treating each datum on the web as a resource which we can interact with via HTTP Requests is referred to as the REST Architecture (REST stands for REpresentational State Transfer). This is a standard way of structuring computer applications that allows them to be interacted with in the same way as everyday websites. Thus a web service that enabled data access through named resources and responds to HTTP requests is known as a RESTful service, with a RESTful API.

11.3 Accessing Web APIs

To access a Web API, you just need to send an HTTP Request to a particular URI. You can easily do this with the browser: simply navigate to a particular address (base URI + endpoint), and that will cause the browser to send a GET request and display the resulting data. For example, you can send a request to search GitHub for repositories named d3 by visiting:

https://api.github.com/search/repositories?q=d3&sort=forks

This query accesses the /search/repositories/ endpoint, and also specifies 2 query parameters:

  • q: The term(s) you are searching for, and
  • sort: The attribute of each repository that you would like to use to sort the results

(Note that the data you’ll get back is structured in JSON format. See below for details).

In R you can send GET requests using the httr library. Like dplyr, you will need to install and load it to use it:

This library provides a number of functions that reflect HTTP verbs. For example, the GET() function will send an HTTP GET Request to the URI specified as an argument:

While it is possible to include query parameters in the URI string, httr also allows you to include them as a list, making it easy to set and change variables (instead of needing to do a complex paste0() operation to produce the correct string):

If you try printing out the response variable, you’ll see information about the response:

Response [https://api.github.com/?q=d3&sort=forks]
  Date: 2017-10-22 19:05
  Status: 200
  Content-Type: application/json; charset=utf-8
  Size: 2.16 kB

This is called the response header. Each response has two parts: the header, and the body. You can think of the response as a envelope: the header contains meta-data like the address and postage date, while the body contains the actual contents of the letter (the data).

Since you’re almost always interested in working with the body, you will need to extract that data from the response (e.g., open up the envelope and pull out the letter). You can do this with the content() method:

Note the second argument "text"; this is needed to keep httr from doing it’s own processing on the body data, since we’ll be using other methods to handle that; keep reading for details!

Pro-tip: The URI shown when you print out the response is a good way to check exactly what URI you were sending the request to: copy that into your browser to make sure it goes where you expected!

11.4 JSON Data

Most APIs will return data in JavaScript Object Notation (JSON) format. Like CSV, this is a format for writing down structured data—but while .csv files organize data into rows and columns (like a data frame), JSON allows you to organize elements into key-value pairs similar to an R list! This allows the data to have much more complex structure, which is useful for web services (but can be challenging for us).

In JSON, lists of key-value pairs (called objects) are put inside braces ({ }), with the key and value separated by a colon (:) and each pair separated by a comma (,). Key-value pairs are often written on separate lines for readability, but this isn’t required. Note that keys need to be character strings (so in quotes), while values can either be character strings, numbers, booleans (written in lower-case as true and false), or even other lists! For example:

(In JavaScript the period . has special meaning, so it is not used in key names; hence the underscores _). The above is equivalent to the R list:

Additionally, JSON supports what are called arrays of data. These are like lists without keys (and so are only accessed by index). Key-less arrays are written in square brackets ([ ]), with values separated by commas. For example:

which is equvalent to the R list:

(Like objects , array elements may or may not be written on separate lines).

Just as R allows you to have nested lists of lists, and those lists may or may not have keys, JSON can have any form of nested objects and arrays. This can be arrays (unkeyed lists) within objects (keyed lists), such as a more complex set of data about Ada:

JSON can also be structured as arrays of objects (unkeyed lists of keyed lists), such as a list of data about Seahawks games:

The latter format is incredibly common in web API data: as long as each object in the array has the same set of keys, then you can easily consider this as a data table where each object (keyed list) represents an observation (row), and each key represents a feature (column) of that observation.

11.4.1 Parsing JSON

When working with a web API, the usual goal is to take the JSON data contained in the response and convert it into an R data structure you can use, such as list or data frame. While the httr package is able to parse the JSON body of a response into a list, it doesn’t do a very clean job of it (particularly for complex data structures).

A more effective solution is to use another library called jsonlite. This library provides helpful methods to convert JSON data into R data, and does a much more effective job of converting content into data frames that you can use.

As always, you will need to install and load this library:

jsonlite provides a function called fromJSON() that allows you to convert a JSON string into a list—or even a data frame if the columns have the right lengths!

The parsed_data will contain a list built out of the JSON. Depending on the complexity of the JSON, this may already be a data frame you can View()… but more likely you’ll need to explore the list more to locate the “main” data you are interested in. Good strategies for this include:

  • You can print() the data, but that is often hard to read (it requires a lot of scrolling).
  • The str() method will produce a more organized printed list, though it can still be hard to read.
  • The names() method will let you see a list of the what keys the list has, which is good for delving into the data.

As an example continuing the above code:

11.4.2 Flattening Data

Because JSON supports—and in fact encourages—nested lists (lists within lists), parsing a JSON string is likely to produce a data frame whose columns are themselves data frames. As an example:

Nested data frames make it hard to work with the data using previously established techniques. Luckily, the jsonlite package provides a helpful function for addressing this called flatten(). This function takes the columns of each nested data frame and converts them into appropriately named columns in the “outer” data frame:

Note that flatten() only works on values that are already data frames; thus you may need to find the appropriate element inside of the list (that is, the item which is the data frame you want to flatten).

In practice, you will almost always want to flatten the data returned from a web API. Thus your “algorithm” for downloading web data is as follows:

  1. Use GET() to download the data, specifying the URI (and any query parameters).
  2. Use content() to extract the data as a JSON string.
  3. Use fromJSON() to convert the JSON string into a list.
  4. Find which element in that list is your data frame of interest. You may need to go “multiple levels” deep.
  5. Use flatten() to flatten that data frame.
  6. Profit!

Pro-tip: JSON data can be quite messy when viewed in your web-browser. Installing a browser extension such as JSONView will format JSON responses in a more readable way, and even enable you to interactively explore the data structure.

Resources