Directions for all labs (read before starting)

  1. Please work together with your assigned partner. Make sure you both fully understand something before moving on.
  2. Record your answers to lab questions separately from the lab’s examples. You and your partner should only turn in responses to lab questions, nothing more and nothing less.
  3. Ask for help, clarification, or even just a check-in if anything seems unclear.

The “Lab” section is something you will work on with a partner using paired programming, a framework defined as follows:

Partners are encouraged to switch roles throughout the “Lab” section.

\(~\)

\(~\)

Preamble

Packages and Datasets

This lab will primarily use the dplyr package.

# load the following packages
library(dplyr)
library(ggplot2)

The lab’s examples will use two databases:

orders <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/merging/data/orders.csv", as.is = TRUE)
customers <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/merging/data/customers.csv", as.is = TRUE)
  • Description: A contrived example of a customer database that is small enough to visually inspect.
library(nycflights13)
data("flights")
data("planes")
data("weather")
data("airports")
  • Description: Data originating from the US Bureau of Transportation pertaining to all flights that departed from New York City in 2013.

\(~\)

Databases

So far we’ve only dealt with data that could be stored in a single spreadsheet; however, many real world applications require more flexibility.

Data are often stored in databases, or structured collections of records. Essential to a database is its relational structure. For example, here is the relational structure of the nycflights13 database:

From this diagram you should notice a few things:

  • Information about a plane could be added to flights via the “tailnum” variable
    • This variable exists in both data files
  • Information about the airport that a plane is departing from or arriving at can be added to flights via the “faa” variable
    • The corresponding key in flights is either “origin” or “dest”

Keys

Matching data across the different components of a database is done using keys:

  • A primary key is a variable (or set of variables) that uniquely define an observation in it’s own table
    • “tailnum” is a primary key in planes as no tail number appears more than once in this data frame
planes %>% count(tailnum) %>% filter(n > 1)  ## Primary key (unique to every plane)
## # A tibble: 0 × 2
## # ℹ 2 variables: tailnum <chr>, n <int>
  • “faa” is a primary key in airports as no entry appears more than once in this data frame
airports %>% count(faa) %>% filter(n > 1) ## Primary key (unique to each airport)
## # A tibble: 0 × 2
## # ℹ 2 variables: faa <chr>, n <int>

  • A foreign key is a variable (or group of variables) in another table that can be linked to a primary key.
    • “tailnum” is a foreign key in the flights data file, as it can be linked to “tailnum” in the planes data file
    • Notice that “tailnum” is not a primary key in the flights file (as many values appear more than once)
flights %>% count(tailnum) %>% filter(n > 1)  ## Foreign key (unique to planes, but not flights)
## # A tibble: 3,873 × 2
##    tailnum     n
##    <chr>   <int>
##  1 D942DN      4
##  2 N0EGMQ    371
##  3 N10156    153
##  4 N102UW     48
##  5 N103US     46
##  6 N104UW     47
##  7 N10575    289
##  8 N105UW     45
##  9 N107US     41
## 10 N108UW     60
## # ℹ 3,863 more rows

A primary key and a foreign key combine to form a relation. Relations are used to link information in one data file to another.

Question #1: Use the following data tables in the nycflights13 database:

data(flights)
data(airlines)
  • Part A: Is the variable “flight” a primary key in the flights? Why or why not?
  • Part B: What keys form the relation needed to add the full airline name (contained in airlines) to the records in flights?
  • Part C: Is the information in these data tables (excluding airlines ) wide or long form?
  • Part D: What are some advantages of storing information in wide form? What about advantages of long form?

\(~\)

Lab

At this point you will begin working with your partner. Please read through the text/examples and make sure you both understand before attempting to answer the embedded questions.

\(~\)

Mutating Joins

The goal of a mutating join is to combine variables from two different data frames, “X” and “Y”.

There are three important types of mutating joins:

  • Left Outer Join: Keep all observations in “X” and add information from any matching records in “Y”, filling with NA for records in “X” that do not have a match in “Y”.
  • Full Outer Join: Keep all observations in both “X” and “Y”, filling with NA for records in “X” without a match in “Y” and for records in “Y” without a match in “X”
  • Inner Join: Keep only observations with records in both “X” and “Y”, dropping any records that aren’t present in both data frames.

In most circumstances you can expect to use a left outer join, as this approach will preserve all of the records in “X” and attach additional variables from “Y”. The different mutating joins are summarized in the graphic below:

Blue sections illustrate observations (rows) that will be present in the combined data frame, while the white sections represent observations that will be omitted. We will not discuss Right Joins, as they are equivalent to a Left Outer Join with the roles of “X” and “Y” reversed.

\(~\)

Left Joins

Consider the data frames “orders” and “customers”:

print(orders)
##   order id   date
## 1     1  4 Jan-01
## 2     2  8 Feb-01
## 3     3 42 Apr-15
## 4     4 50 Apr-17
print(customers)
##   id    name
## 1  4   Tukey
## 2  8 Wickham
## 3 15   Mason
## 4 16  Jordan
## 5 23   Patil
## 6 42     Cox

Notice that the variable “id” is a primary key (unique identifier) in each data frame.

The left_join() function is used to perform a left outer join, adding the customer information to each order using the column “id” as the join’s key:

left_join(x = orders, y = customers, by = "id")
##   order id   date    name
## 1     1  4 Jan-01   Tukey
## 2     2  8 Feb-01 Wickham
## 3     3 42 Apr-15     Cox
## 4     4 50 Apr-17    <NA>

Because no entries in the customers data frame had an “id” of 50, a missing value, NA, appears as the name for that order in the combined data frame.

In most situations, left joins are not symmetric, so your choice of “x” and “y” matters:

left_join(x = customers, y = orders, by = "id")
##   id    name order   date
## 1  4   Tukey     1 Jan-01
## 2  8 Wickham     2 Feb-01
## 3 15   Mason    NA   <NA>
## 4 16  Jordan    NA   <NA>
## 5 23   Patil    NA   <NA>
## 6 42     Cox     3 Apr-15

As seen above, customers with “id” numbers of 15, 16, and 23 did not have any matching records in the orders data frame, resulting in missing values for “order” and “date” in the combined data frame.

Question #2: Consider the example join demonstrated below. Given our initial description of a left outer join, you’d expect the left join using table2 as the “x” argument (in the above code) to produce a data frame with 3 rows (one for each observation in table2). Instead, in this scenario, it produces a data frame with 6 rows. Why do you think this happens?

table1 <- data.frame(key = c("a", "a", "b", "b", "b", "c"), X = 1:6)
table2 <- data.frame(key = c("a", "b", "c"), Y = c("X", "Y", "Z"))

left_join(x = table2, y = table1, by = "key")
##   key Y X
## 1   a X 1
## 2   a X 2
## 3   b Y 3
## 4   b Y 4
## 5   b Y 5
## 6   c Z 6

\(~\)

Full Outer Joins

The full_join() function will perform a full outer join of two data frames:

full_join(x = orders, y = customers, by = "id")
##   order id   date    name
## 1     1  4 Jan-01   Tukey
## 2     2  8 Feb-01 Wickham
## 3     3 42 Apr-15     Cox
## 4     4 50 Apr-17    <NA>
## 5    NA 15   <NA>   Mason
## 6    NA 16   <NA>  Jordan
## 7    NA 23   <NA>   Patil

Notice how the order without a customer (id 50) as well as the three customers without orders (ids 15, 16, and 23) are all included in the resulting data frame alongside the three records that had matches (ids 4, 8, and 42).

Outer joins are symmetric, so the choice of “x” and “y” will only impact the ordering of columns in the resulting data frame:

full_join(x = customers, y = orders, by = "id")
##   id    name order   date
## 1  4   Tukey     1 Jan-01
## 2  8 Wickham     2 Feb-01
## 3 15   Mason    NA   <NA>
## 4 16  Jordan    NA   <NA>
## 5 23   Patil    NA   <NA>
## 6 42     Cox     3 Apr-15
## 7 50    <NA>     4 Apr-17

\(~\)

Inner Joins

The inner_join() function is used to perform an inner join on two data frames:

inner_join(x = orders, y = customers, by = "id")
##   order id   date    name
## 1     1  4 Jan-01   Tukey
## 2     2  8 Feb-01 Wickham
## 3     3 42 Apr-15     Cox

Inner joins also are symmetric, so your choice of “x” and “y” will only impact the ordering of columns in the resulting data frame:

inner_join(x = customers, y = orders, by = "id")
##   id    name order   date
## 1  4   Tukey     1 Jan-01
## 2  8 Wickham     2 Feb-01
## 3 42     Cox     3 Apr-15

Question #3: Consider two data frames, “A” and “B”, each containing the key variable “ID”. In “A”, there is a single record for each of the following IDs: \(\{1, 2, 3, 4\}\). In “B”, there is a single record for each of the following IDs: \(\{3, 4, 5, 6\}\). Given this information, state how many rows will be present in the result produced by the following joins:

  • A: left_join(x = A, y = B, by = "ID")
  • B: left_join(x = B, y = A, by = "ID")
  • C: full_join(x = A, y = B, by = "ID")
  • D: inner_join(x = A, y = B, by = "ID")

\(~\)

More on the by argument

In an ideal world, you have a single key with the same name that can link the information in your “x” and “y” data frames. This is not always the case; however, there are several ways to use the by argument to accommodate those scenarios.

  • The default value, by = NULL, will perform a “natural join”, which will automatically identify all variables with shared names in “x” and “y” and use that collection to link matching records. This can be useful when several variables are needed to uniquely identify an observation.
  • Alternatively, you may use named vector syntax, for example: by = c("a" = "b"). In this example, the variable named “a” in the “x” data frame will be matched with the variable named “b” in the “y” data frame to link records and perform the join.

Question #4: In the nycflights13 database, the locations of each airport (“lat” and “lon”) are stored in the airports data file. For this question, use an appropriate join to add the “lat” and “lon” of the destination airport to each flight recorded in the flights data file. Hint: the variable “faa” is a primary key in airports.

\(~\)

Filtering Joins

Filtering joins will match observations in the same way as mutating joins, but they will not add any variables from “y” to the “x” data frame.

  • semi_join() will keep the observations in “x” that have a match in “y”
  • anti_join() will drop the observations in “x” that have a match in “y”

Here a few examples:

semi_join(x = orders, y = customers, by = "id")
##   order id   date
## 1     1  4 Jan-01
## 2     2  8 Feb-01
## 3     3 42 Apr-15

Notice how semi_join() filters out the fourth order because there are no customers with an id of 50 in the customers data frame.

anti_join(x = orders, y = customers, by = "id")
##   order id   date
## 1     4 50 Apr-17

Conversely, anti_join() will filter out everything except for the fourth order because it was the only observation in the “x” data file without a match in the “y” data file.

Filtering joins are most useful in finding implicit missing values, or values that aren’t recorded as NA and instead exist only as an absence. For example, we could find missing entries in airports by looking for flights whose destination airport doesn’t have a match:

anti_join(x = flights, y = airports, by = c("dest" = "faa")) %>% 
  distinct(dest)  ## Return only the unique values of "dest"
## # A tibble: 4 × 1
##   dest 
##   <chr>
## 1 BQN  
## 2 SJU  
## 3 STT  
## 4 PSE

We can see that there are four implicitly missing airports in our database.

Question #5: The code below creates a data frame, top_dest, containing the ten most common flight destinations and how many flights arrived at them.

top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  head(10)
  • Part A: Use this data frame, along with an appropriate join, to filter the flights data file to include only flights whose destination is one of the top ten airports. Then, use dim() to print the dimensions of the resulting data frame.
  • Part B: Using your result from Part A and an appropriate join, find the number of planes whose destination was a top ten airport that do not have a record in the planes data frame.

\(~\)

Practice!

Question #6 Without saving intermediate tables, create a table containing 6 columns: the plane model, the airline name, hour, temperature, origin airport name, and destination airport name for all flights on the day October 9, 2013.

This should be done using a series of pipes, do not save intermediate tables in your final solution.

Hints: The expected output should have 974 rows, 6 columns. An example of the first 5 rows in my table are below

## # A tibble: 5 × 6
##   model    airline     timeOfDepart temperature originAirport destinationAirport
##   <chr>    <chr>              <dbl>       <dbl> <chr>         <chr>             
## 1 A321-231 US Airways…            5        51.1 Newark Liber… Charlotte Douglas…
## 2 757-222  United Air…            5        51.1 Newark Liber… George Bush Inter…
## 3 737-724  United Air…            5        55.9 La Guardia    George Bush Inter…
## 4 A320-232 JetBlue Ai…            5        53.1 John F Kenne… <NA>              
## 5 <NA>     American A…            5        53.1 John F Kenne… Miami Intl

Question #7 Using the dataset from Question 6, recreate the following figure. Colors do not need to match.