For this assignment you should record your answers in an R Markdown file and submit the compiled output as a pdf. You are allowed to work in groups of 2-3 for this Homework assignment. Should you work in a group, only one submission is necessary on gradescope.
Homework #6 is due Wednesday 4/29 by 10:00pm \(~\)
\(~\)
Create a SQL database called “HW7DB” and add the 4 tables from the nycflights13 library to it. Demonstrate that you have done this correctly by printing out the 5 table names and their lengths in SQL.
## [1] "airlines" "airports" "flights" "planes" "weather"
## [1] "Flights length: 336776"
Hint: look up the dbWriteTable command
\(~\)
This is Question 6 from lab 5 revisited.
Using R, without saving intermediate tables, create a table containing 6 columns: the plane model, the airline name, hour, temperature, origin airport name, and destination airport name for all flights on the day October 9, 2013.
This should be done within a series of pipes, do not save intermediate tables in your final solution.
I have printed out the first 5 rows and the dimensions of the expected data below
## model airline timeOfDepart temperature originAirport
## 1 A321-231 US Airways Inc. 5 51.08 Newark Liberty Intl
## 2 757-222 United Air Lines Inc. 5 51.08 Newark Liberty Intl
## 3 737-724 United Air Lines Inc. 5 55.94 La Guardia
## 4 A320-232 JetBlue Airways 5 53.06 John F Kennedy Intl
## 5 <NA> American Airlines Inc. 5 53.06 John F Kennedy Intl
## destinationAirport
## 1 Charlotte Douglas Intl
## 2 George Bush Intercontinental
## 3 George Bush Intercontinental
## 4 <NA>
## 5 Miami Intl
## [1] 974 6
In addition, write out a paragraph explaining the general logic of what your code is doing i.e. what we are joining and why (not just data table names), and which keys are being matched, and what information is being added in each step.
\(~\)
Using SQL, repeat Question 2 (paragraph explanation is not necessary). This should be done within a single SQL query. Then use the following code to double-check that the two datasets are the same.
For example:
# check for differences other than NAs
sum((Q2data==Q3data)==FALSE,na.rm=TRUE)
## [1] 0
\(~\)
Modify your Question 3 query to get an appropriate data table and
recreate the following graph with ggplot2. You may not use
mutate or other dplyr functions, all necessary
filtering, variable creation, and counting should be in SQL.
Hint: It may be helpful to look back at your Lab 5 Question
7 code which did this using dplyr functions.
Note: Color/factor order for Departure Time does not need to match.
Subqueries are useful when we need to perform multiple operations, not all of which can be performed on the original dataset. Here is an example of performing operations using the pipe function in R (we’ve seen similar stuff to this before). Subqueries operate in much the same way, breaking up the operations and perhaps using intermediate data tables.
For this question we will use the 2019 college scorecard data.
Suppose I wanted to do something like the following: Find all private colleges with admission rates below the average admission rate of private colleges. The order that the filters are applied matters for this question, and I’d want to make sure to use two separate filters so R doesn’t accidentally apply them in the wrong order. Notice the following produce different outputs when you run them in your R console.
colleges %>% filter(Type == "Private") %>%
filter(Adm_Rate < mean(Adm_Rate, na.rm = T))
colleges %>% filter(Adm_Rate < mean(Adm_Rate, na.rm = T)) %>% filter(Type == "Private")
How this would work in a SQL command
dbGetQuery(mydb, "
SELECT *
FROM (
SELECT *
FROM colleges
WHERE Type = 'Private'
AND Adm_Rate IS NOT NULL
) AS private_colleges
WHERE Adm_Rate < (
SELECT AVG(Adm_Rate)
FROM (
SELECT *
FROM colleges
WHERE Type = 'Private'
AND Adm_Rate IS NOT NULL
) AS private_for_avg
);
")
Part A: Explain how subqueries are breaking up the tasks in this question. Your answer should mention how many subqueries there are and what each is doing.
Part B: Write a SQL query to answer this question: Find colleges with graduation rates above the overall average graduation rate, using both Male and Female rates.
Part C: Write a SQL query using subqueries to answer this question: Find all private colleges with admission rates below average and graduation rates above average. (It may be helpful to think about how this would be performed in R)