This lab focuses on more advanced uses of SQL:
library(DBI)
library(odbc)
library(RMySQL)
library(tidyverse)
library(dbplyr)
The format is [database name]<-dbConnect([SQL Interpreter],[connection])
Student (Read only, Campus only) Credentials
db_user <- 'studentUser'
db_password <- 'STAread!'
db_name <- 'Test'
db_table <- 'your_data_table'
db_host <- 'aiken.cs.grinnell.edu' # for local access
db_port <- 3306
mydb <- dbConnect(MySQL(), user = db_user, password = db_password,
dbname = db_name, host = db_host, port = db_port)
#dbDisconnect(mydb)
You should always disconnect when you are done.
A subquery is a query inside of a query that creates a temporary result for the main query to use. Subqueries are useful to simplify complex joins by breaking them into smaller chunks.
dbGetQuery(mydb," SELECT tab1.Name, tab1.ProfessorDepartment,
tab2.CourseName,tab2.CourseDepartment
FROM
(SELECT a.fname as Name, b.depname as ProfessorDepartment, c.cid
FROM week9professors as a
LEFT JOIN week9departments as b ON a.did=b.did
LEFT JOIN week9profcourses as c ON a.pid=c.pid)
AS tab1
LEFT JOIN
(Select d.cname as CourseName, e.depname as CourseDepartment, d.cid
FROM week9courses as d
LEFT JOIN week9departments as e ON d.did=e.did)
AS tab2
ON tab1.cid=tab2.cid;") %>% head()
## Name ProfessorDepartment CourseName CourseDepartment
## 1 Elizabeth Chemistry General Chemistry w/lab Chemistry
## 2 Maisha Chemistry General Chemistry w/lab Chemistry
## 3 Rajendra Chemistry General Chemistry w/lab Chemistry
## 4 Stephen Chemistry General Chemistry w/lab Chemistry
## 5 Mark Chemistry Analytical Chemistry w/Lab Chemistry
## 6 Molly Chemistry Analytical Chemistry w/Lab Chemistry
Question 1: Explain how this query works as best you can. What information does it provide? What problems could you answer from last week using this query? We will talk more about subqueries on Wed/Fri.
The examples listed below will use the week9... tables
we saw last week. The questions will refer to the following five tables
for this section, all of which contain information on things for the
History department here at Grinnell:
course_offering ,office_hours,
professor_info ,syllabus_info and
course.
DISTINCT returns on unique values from a
column/variable, or possibly a combination of columns/variables that
define unique observations. It is used to remove duplicates from query
results.
# Everything from week9profcourses
dbGetQuery(mydb, "SELECT * FROM week9profcourses LIMIT 5")
## pid cid
## 1 42 MAT131
## 2 42 MAT218
## 3 41 ECN111
## 4 41 ECN295
## 5 40 ECN366
# Only unique pid's from week9profcourses
dbGetQuery(mydb, "SELECT DISTINCT pid FROM week9profcourses LIMIT 5")
## pid
## 1 42
## 2 41
## 3 40
## 4 39
## 5 38
# Only unique cid's from week9profcourses
dbGetQuery(mydb, "SELECT DISTINCT cid FROM week9profcourses LIMIT 5")
## cid
## 1 MAT131
## 2 MAT218
## 3 ECN111
## 4 ECN295
## 5 ECN366
Note: The combination of pid and cid in the
week9profcourses data table is already unique, so the
following DISTINCT does nothing extra.
dbGetQuery(mydb, "SELECT DISTINCT pid, cid FROM week9profcourses
LIMIT 5")
## pid cid
## 1 42 MAT131
## 2 42 MAT218
## 3 41 ECN111
## 4 41 ECN295
## 5 40 ECN366
By default, COUNT returns the number of non-null rows in
a column when the column is specified directly. We can use
COUNT(*) to count all rows of all variables, but this will
include non-null values. COUNT() is usually accompanied by
some GROUP BY argument at the end which allows us to
specify categories which we want counted.
# counting occurences (without grouping)
dbGetQuery(mydb, "SELECT pid, count(pid) FROM week9profcourses")
## pid count(pid)
## 1 42 68
# counting occurences (with grouping)
dbGetQuery(mydb, "SELECT pid as 'professor_id', count(pid) as 'number_of_courses' FROM week9profcourses GROUP BY pid ORDER BY pid DESC LIMIT 6")
## professor_id number_of_courses
## 1 42 2
## 2 41 2
## 3 40 1
## 4 39 2
## 5 38 2
## 6 37 1
Question 2: Explain the difference between what these two queries are counting. Does the first do what we want?
We can include DISTINCT within COUNT to
count unique occurences of something. Let’s say I wanted to count the
number of professors in the week9profcourses table.
# counting number of professors in data table (incorrect)
dbGetQuery(mydb, "SELECT count(pid) as number_of_unique_professors FROM week9profcourses ORDER BY pid DESC LIMIT 6")
## number_of_unique_professors
## 1 68
# counting number of professors in data table (correct)
dbGetQuery(mydb, "SELECT count(DISTINCT pid) as number_of_unique_professors FROM week9profcourses ORDER BY pid DESC LIMIT 6")
## number_of_unique_professors
## 1 42
Question 3: List the name of the professor, the number of courses they have taught over the years and their contact emails from the History department at Grinnell College. You will first need to investigate the tables, and join two appropriate data tables. Use the variable names ‘instructor’, ‘email’, and ‘number_of_courses_taught’.
HAVING filters groups after aggregation (some sort of
calculation). This is usually used for conditions on summarized values
by some group identity. The syntax works as follows. Also notice a
convenient way to structure query formatting.
# count number of courses per professor
dbGetQuery(mydb, "SELECT pid as 'professor_id', count(pid) as 'number_of_courses' FROM week9profcourses GROUP BY pid ORDER BY pid DESC LIMIT 6")
## professor_id number_of_courses
## 1 42 2
## 2 41 2
## 3 40 1
## 4 39 2
## 5 38 2
## 6 37 1
# select only professors with 1 course listed
dbGetQuery(mydb,
"SELECT pid as 'professor_id',
count(pid) as 'number_of_courses'
FROM week9profcourses
GROUP BY pid
HAVING count(pid) = 2
ORDER BY pid DESC
LIMIT 3")
## professor_id number_of_courses
## 1 42 2
## 2 41 2
## 3 39 2
# can also use inequalities
dbGetQuery(mydb,
"SELECT pid as 'professor_id',
count(pid) as 'number_of_courses'
FROM week9profcourses
GROUP BY pid
HAVING count(pid) > 1
ORDER BY pid DESC
LIMIT 3")
## professor_id number_of_courses
## 1 42 2
## 2 41 2
## 3 39 2
Question 4: Make modifications to your SQL query in Question 3, try to list the professor details who have taught more than 20 courses in the timespan of this data.
BETWEEN is another clause that works with HAVING that we can include to specify whether a value falls within a range, including the boundary values. Works with numbers (and dates, but more complicated). The syntax is as follows.
# syntax only, doesn't do anything special because of count vals
dbGetQuery(mydb,
"SELECT pid as 'professor_id',
count(pid) as 'number_of_courses'
FROM week9profcourses
GROUP BY pid
HAVING count(pid) BETWEEN 1 AND 2
ORDER BY pid DESC
LIMIT 3")
## professor_id number_of_courses
## 1 42 2
## 2 41 2
## 3 40 1
Question 5: Write a SQL query to list the name of professors and the average number of students enrolled in their courses.
Question 6: Use your answer to Question 5, modify the query to show only professors with average enrollment values between 20 and 24.
WHERE is another type of filtering clause, which filters rows before any grouping or aggregation happens. It’s used to specify conditions that individual rows must meet to be included in the results/calculations in the first place.
# select only Chemistry courses from week9courses
dbGetQuery(mydb, "SELECT * FROM week9courses WHERE did = 'CHM' ")
## cid cname did
## 1 CHM129 General Chemistry w/lab CHM
## 2 CHM210 Analytical Chemistry w/Lab CHM
## 3 CHM221 Organic Chemistry I w/lab CHM
## 4 CHM358 Instrumental Analysis w/lab CHM
## 5 CHM363 Physical Chemistry I w/lab CHM
## 6 CHM395 ST: Adv NMR Spectroscopy CHM
# select multiple categories: Chem, STAT
dbGetQuery(mydb, "SELECT * FROM week9courses WHERE did = 'CHM' OR did = 'STA' ")
## cid cname did
## 1 CHM129 General Chemistry w/lab CHM
## 2 CHM210 Analytical Chemistry w/Lab CHM
## 3 CHM221 Organic Chemistry I w/lab CHM
## 4 CHM358 Instrumental Analysis w/lab CHM
## 5 CHM363 Physical Chemistry I w/lab CHM
## 6 CHM395 ST: Adv NMR Spectroscopy CHM
## 7 STA209 Applied Statistics STA
## 8 STA230 Introduction to Data Science STA
## 9 STA310 Statistical Modeling STA
## 10 STA335 Probability & Statistics I STA
Another helpful clause to pair with this is IN (works
like %in% in dplyr) for multiple categories.
# select Chem, STAT using IN clause
dbGetQuery(mydb, "SELECT * FROM week9courses WHERE did IN ('CHM', 'STA') ")
## cid cname did
## 1 CHM129 General Chemistry w/lab CHM
## 2 CHM210 Analytical Chemistry w/Lab CHM
## 3 CHM221 Organic Chemistry I w/lab CHM
## 4 CHM358 Instrumental Analysis w/lab CHM
## 5 CHM363 Physical Chemistry I w/lab CHM
## 6 CHM395 ST: Adv NMR Spectroscopy CHM
## 7 STA209 Applied Statistics STA
## 8 STA230 Introduction to Data Science STA
## 9 STA310 Statistical Modeling STA
## 10 STA335 Probability & Statistics I STA
The following code shows how the LIKE operator works to match
patterns in a column. The course_offering table has info on
professors and the department of courses they have taught. Notice the
syntax for how to match things. % serves as a wildcard operator in the
LIKE clause.
# pick all stat courses, regardless of number
dbGetQuery(mydb, "SELECT * from week9courses
WHERE cid LIKE 'STA%'")
## cid cname did
## 1 STA209 Applied Statistics STA
## 2 STA230 Introduction to Data Science STA
## 3 STA310 Statistical Modeling STA
## 4 STA335 Probability & Statistics I STA
# pick only 300 level stat courses
dbGetQuery(mydb, "SELECT * from week9courses
WHERE cid LIKE 'STA3%'")
## cid cname did
## 1 STA310 Statistical Modeling STA
## 2 STA335 Probability & Statistics I STA
Question 7: Select the Professors whose name starts with either ‘A’ or ‘E’ and display their titles and phone numbers
Question 8: Using the office_hours
table, find all professors who have office hours on Fridays. Then narrow
this down to office hours beginning in the afternoon on Fridays.
Hint: Times can be listed as character objects but
inequalities (> or <) still work with them.
You can rename or categorize values in SQL by using a CASE WHEN … THEN … ELSE statement, which checks conditions and assigns new labels based on those conditions.
The following code creates a new variable in our table, classifying courses as having a lab or not.
dbGetQuery(mydb, "SELECT cid, cname,
CASE
WHEN LOWER(cname) LIKE '%lab' THEN 'Yes'
ELSE 'No'
END AS lab
FROM week9courses
LIMIT 6")
## cid cname lab
## 1 CHM129 General Chemistry w/lab Yes
## 2 CHM210 Analytical Chemistry w/Lab Yes
## 3 CHM221 Organic Chemistry I w/lab Yes
## 4 CHM358 Instrumental Analysis w/lab Yes
## 5 CHM363 Physical Chemistry I w/lab Yes
## 6 CHM395 ST: Adv NMR Spectroscopy No
Question 9: Use your query for Question 8 as inspiration. Pull the instructor info, day, start_time, and then make a new column that shows whether the office hours start in the morning or afternoon.
In this part of the lab, we will pivot back to working with the
week9... data tables. Run the following code to create R
environment versions of the week9... tables.
week9courses = dbGetQuery(mydb, "SELECT * FROM week9courses;")
week9departments = dbGetQuery(mydb, "SELECT * FROM week9departments")
week9profcourses = dbGetQuery(mydb, "SELECT * FROM week9profcourses")
week9professors = dbGetQuery(mydb, "SELECT * FROM week9professors")
The rest of this is mostly for illustrative purposes. After running
the above code to store the week9... tables in the R
environment, run the following command:
dbDisconnect(mydb)
## [1] TRUE
mydb <- dbConnect(RSQLite::SQLite(), ":memory:")
## Create Table
dbExecute(mydb, "CREATE TABLE `profcourses` (
`pid` int(11) NOT NULL,
`cid` varchar(6) NOT NULL
)")
## [1] 0
## Fill Table
dbWriteTable(conn = mydb, name = 'profcourses', value = week9profcourses, append = TRUE, header = FALSE, row.names = FALSE)
## Check
dbGetQuery(mydb,"Select * from profcourses LIMIT 5;")
## pid cid
## 1 42 MAT131
## 2 42 MAT218
## 3 41 ECN111
## 4 41 ECN295
## 5 40 ECN366
This works as follows: we use dbExecute not
dbGetQuery since we are executing a function on the
database. We are creating a table called profcourses. It has 2 values, a
pid that is an integer with at most 11 characters, and a cid which is a
string of at most 6 characters. Neither of them is allowed to be
NULL.
Question 10: Using similar queries, create and fill tables for the other 3 dataframes.
Question 11: Investigating Options
dbDisconnect(mydb)
Note: I may add one small extra bit for Subqueries for class on Friday.