For this assignment you should record your answers in an R Markdown file and submit the compiled output as a pdf. You are allowed to work in groups of 2-3 for this Homework assignment. Should you work in a group, only one submission is necessary on gradescope.
Homework #7 is due Friday 12/05 by 10:00pm \(~\)
\(~\)
Create a SQL database called “HW7DB” and add the 4 tables from the nycflights13 library to it. Demonstrate that you have done this correctly by printing out the 5 table names and their lengths in SQL.
## [1] "airlines" "airports" "flights" "planes" "weather"
## [1] "Flights length: 336776"
Hint: look up the dbWriteTable command
\(~\)
This is Question 6 from lab 5 revisited.
Using R, without saving intermediate tables, create a table containing 6 columns: the plane model, the airline name, hour, temperature, origin airport name, and destination airport name for all flights on the day October 9, 2013.
This should be done within a series of pipes, do not save intermediate tables in your final solution.
I have printed out the first 5 rows and the dimensions of the expected data below
## model airline timeOfDepart temperature originAirport
## 1 A321-231 US Airways Inc. 5 51.08 Newark Liberty Intl
## 2 757-222 United Air Lines Inc. 5 51.08 Newark Liberty Intl
## 3 737-724 United Air Lines Inc. 5 55.94 La Guardia
## 4 A320-232 JetBlue Airways 5 53.06 John F Kennedy Intl
## 5 <NA> American Airlines Inc. 5 53.06 John F Kennedy Intl
## destinationAirport
## 1 Charlotte Douglas Intl
## 2 George Bush Intercontinental
## 3 George Bush Intercontinental
## 4 <NA>
## 5 Miami Intl
## [1] 974 6
In addition, write out a paragraph explaining the general logic of what your code is doing i.e. what we are joining and why (not just data table names), and which keys are being matched, and what information is being added in each step.
\(~\)
Using SQL, repeat Question 2 (paragraph explanation is not necessary). This should be done within a single SQL query. Then use the following code to double-check that the two datasets are the same.
For example:
# check for differences other than NAs
sum((Q2data==Q3data)==FALSE,na.rm=TRUE)
## [1] 0
\(~\)
Modify your Question 3 query to get an appropriate data table and
recreate the following graph with ggplot2. You may not use
mutate or other dplyr functions, all necessary
filtering, variable creation, and counting should be in SQL.
Hint: It may be helpful to look back at your Lab 5 Question
7 code which did this using dplyr functions.
Note: Color/factor order for Departure Time does not need to match.