This lab focuses on more advanced uses of SQL:

Preamble

Packages

library(DBI)
library(odbc)
library(RMySQL)
library(tidyverse)
library(dbplyr)

Commands from Last week

  • SQL Queries:
    • SELECT: retrieve data
    • *: Wildcard, everything
      • Example: SELECT * FROM employees;
      • This returns all columns and all values from the employees table
    • DISTINCT: unique values from a column
      • Example: SELECT DISTINCT department FROM employees;
      • This returns the distinct departments in the employees table
    • WHERE: filter on conditions
    • LIMIT: limit return size
    • CASE: conditional logic
  • Filters:
    • WHERE: filter rows on conditions
    • LIKE: Match a pattern in a column
    • IN: match any value in a list
    • Between: Values in a specified range
    • IS NULL: null values
    • ORDER BY: Sort
    • AND/OR/NOT can be used in Where
  • Aggregation:
    • GROUP BY: group rows with same values
    • COUNT: count the values in a column
    • SUM: sum the values of a column
    • AVG: average the values
    • MIN: find minimum value
    • MAX: find max value
    • HAVING: filter on conditions
  • Sorting:
    • ORDER BY: sorts the output by the column(s) of interest.
    • default is ascending, use DESC to descend.

Connecting to a database

The format is [database name]<-dbConnect([SQL Interpreter],[connection])

Student (Read only, Campus only) Credentials

db_user <- 'studentUser'
db_password <- 'STAread!'
db_name <- 'Test'
db_table <- 'your_data_table'
db_host <- 'aiken.cs.grinnell.edu' # for local access
db_port <- 3306
mydb <-  dbConnect(MySQL(), user = db_user, password = db_password,
                   dbname = db_name, host = db_host, port = db_port)
#dbDisconnect(mydb)

You should always disconnect when you are done.

Subqueries

A subquery is a query inside of a query that creates a temporary result for the main query to use. Subqueries are useful to simplify complex joins by breaking them into smaller chunks.

dbGetQuery(mydb," SELECT  tab1.Name, tab1.ProfessorDepartment,
                          tab2.CourseName,tab2.CourseDepartment
                  FROM
                      (SELECT a.fname as Name, b.depname as ProfessorDepartment, c.cid
                      FROM week9professors as a 
                      LEFT JOIN week9departments as b ON a.did=b.did
                      LEFT JOIN week9profcourses as c ON a.pid=c.pid)
                    AS tab1
                  LEFT JOIN
                      (Select d.cname as CourseName, e.depname as CourseDepartment, d.cid 
                      FROM week9courses as d
                      LEFT JOIN week9departments as e ON d.did=e.did)
                    AS tab2
                  ON tab1.cid=tab2.cid;") %>% head()
##        Name ProfessorDepartment                 CourseName CourseDepartment
## 1 Elizabeth           Chemistry    General Chemistry w/lab        Chemistry
## 2    Maisha           Chemistry    General Chemistry w/lab        Chemistry
## 3  Rajendra           Chemistry    General Chemistry w/lab        Chemistry
## 4   Stephen           Chemistry    General Chemistry w/lab        Chemistry
## 5      Mark           Chemistry Analytical Chemistry w/Lab        Chemistry
## 6     Molly           Chemistry Analytical Chemistry w/Lab        Chemistry

Question 1: Explain how this query works as best you can. What information does it provide? What problems could you answer from last week using this query? We will talk more about subqueries on Wed/Fri.


Lab Part 1: More on Operations

The examples listed below will use the week9... tables we saw last week. The questions will refer to the following five tables for this section, all of which contain information on things for the History department here at Grinnell:

course_offering ,office_hours, professor_info ,syllabus_info and course.


DISTINCT Operator

DISTINCT returns on unique values from a column/variable, or possibly a combination of columns/variables that define unique observations. It is used to remove duplicates from query results.

# Everything from week9profcourses
dbGetQuery(mydb, "SELECT * FROM week9profcourses LIMIT 5")
##   pid    cid
## 1  42 MAT131
## 2  42 MAT218
## 3  41 ECN111
## 4  41 ECN295
## 5  40 ECN366
# Only unique pid's from week9profcourses
dbGetQuery(mydb, "SELECT DISTINCT pid FROM week9profcourses LIMIT 5")
##   pid
## 1  42
## 2  41
## 3  40
## 4  39
## 5  38
# Only unique cid's from week9profcourses
dbGetQuery(mydb, "SELECT DISTINCT cid FROM week9profcourses LIMIT 5")
##      cid
## 1 MAT131
## 2 MAT218
## 3 ECN111
## 4 ECN295
## 5 ECN366

Note: The combination of pid and cid in the week9profcourses data table is already unique, so the following DISTINCT does nothing extra.

dbGetQuery(mydb, "SELECT DISTINCT pid, cid FROM week9profcourses
           LIMIT 5")
##   pid    cid
## 1  42 MAT131
## 2  42 MAT218
## 3  41 ECN111
## 4  41 ECN295
## 5  40 ECN366

COUNT Operator

By default, COUNT returns the number of non-null rows in a column when the column is specified directly. We can use COUNT(*) to count all rows of all variables, but this will include non-null values. COUNT() is usually accompanied by some GROUP BY argument at the end which allows us to specify categories which we want counted.

# counting occurences (without grouping)
dbGetQuery(mydb, "SELECT pid, count(pid) FROM week9profcourses")
##   pid count(pid)
## 1  42         68
# counting occurences (with grouping)
dbGetQuery(mydb, "SELECT pid as 'professor_id', count(pid) as 'number_of_courses' FROM week9profcourses GROUP BY pid ORDER BY pid DESC LIMIT 6")
##   professor_id number_of_courses
## 1           42                 2
## 2           41                 2
## 3           40                 1
## 4           39                 2
## 5           38                 2
## 6           37                 1

Question 2: Explain the difference between what these two queries are counting. Does the first do what we want?

We can include DISTINCT within COUNT to count unique occurences of something. Let’s say I wanted to count the number of professors in the week9profcourses table.

# counting number of professors in data table (incorrect)
dbGetQuery(mydb, "SELECT count(pid) as number_of_unique_professors FROM week9profcourses ORDER BY pid DESC LIMIT 6")
##   number_of_unique_professors
## 1                          68
# counting number of professors in data table (correct)
dbGetQuery(mydb, "SELECT count(DISTINCT pid) as number_of_unique_professors FROM week9profcourses ORDER BY pid DESC LIMIT 6")
##   number_of_unique_professors
## 1                          42

Question 3: List the name of the professor, the number of courses they have taught over the years and their contact emails from the History department at Grinnell College. You will first need to investigate the tables, and join two appropriate data tables. Use the variable names ‘instructor’, ‘email’, and ‘number_of_courses_taught’.


HAVING Clause

HAVING filters groups after aggregation (some sort of calculation). This is usually used for conditions on summarized values by some group identity. The syntax works as follows. Also notice a convenient way to structure query formatting.

# count number of courses per professor
dbGetQuery(mydb, "SELECT pid as 'professor_id', count(pid) as 'number_of_courses' FROM week9profcourses GROUP BY pid ORDER BY pid DESC LIMIT 6")
##   professor_id number_of_courses
## 1           42                 2
## 2           41                 2
## 3           40                 1
## 4           39                 2
## 5           38                 2
## 6           37                 1
# select only professors with 1 course listed
dbGetQuery(mydb, 
"SELECT pid as 'professor_id', 
count(pid) as 'number_of_courses' 
           FROM week9profcourses 
           GROUP BY pid 
           HAVING count(pid) = 2
           ORDER BY pid DESC 
           LIMIT 3")
##   professor_id number_of_courses
## 1           42                 2
## 2           41                 2
## 3           39                 2
# can also use inequalities
dbGetQuery(mydb, 
"SELECT pid as 'professor_id', 
count(pid) as 'number_of_courses' 
           FROM week9profcourses 
           GROUP BY pid 
           HAVING count(pid) > 1
           ORDER BY pid DESC 
           LIMIT 3")
##   professor_id number_of_courses
## 1           42                 2
## 2           41                 2
## 3           39                 2

Question 4: Make modifications to your SQL query in Question 3, try to list the professor details who have taught more than 20 courses in the timespan of this data.


BETWEEN CLAUSE:

BETWEEN is another clause that works with HAVING that we can include to specify whether a value falls within a range, including the boundary values. Works with numbers (and dates, but more complicated). The syntax is as follows.

# syntax only, doesn't do anything special because of count vals
dbGetQuery(mydb, 
"SELECT pid as 'professor_id', 
count(pid) as 'number_of_courses' 
           FROM week9profcourses 
           GROUP BY pid 
           HAVING count(pid) BETWEEN 1 AND 2
           ORDER BY pid DESC 
           LIMIT 3")
##   professor_id number_of_courses
## 1           42                 2
## 2           41                 2
## 3           40                 1

Question 5: Write a SQL query to list the name of professors and the average number of students enrolled in their courses.

Question 6: Use your answer to Question 5, modify the query to show only professors with average enrollment values between 20 and 24.


WHERE CLAUSE:

WHERE is another type of filtering clause, which filters rows before any grouping or aggregation happens. It’s used to specify conditions that individual rows must meet to be included in the results/calculations in the first place.

# select only Chemistry courses from week9courses
dbGetQuery(mydb, "SELECT * FROM week9courses WHERE did = 'CHM' ")
##      cid                       cname did
## 1 CHM129     General Chemistry w/lab CHM
## 2 CHM210  Analytical Chemistry w/Lab CHM
## 3 CHM221   Organic Chemistry I w/lab CHM
## 4 CHM358 Instrumental Analysis w/lab CHM
## 5 CHM363  Physical Chemistry I w/lab CHM
## 6 CHM395    ST: Adv NMR Spectroscopy CHM
# select multiple categories: Chem, STAT
dbGetQuery(mydb, "SELECT * FROM week9courses WHERE did = 'CHM' OR did = 'STA' ")
##       cid                        cname did
## 1  CHM129      General Chemistry w/lab CHM
## 2  CHM210   Analytical Chemistry w/Lab CHM
## 3  CHM221    Organic Chemistry I w/lab CHM
## 4  CHM358  Instrumental Analysis w/lab CHM
## 5  CHM363   Physical Chemistry I w/lab CHM
## 6  CHM395     ST: Adv NMR Spectroscopy CHM
## 7  STA209           Applied Statistics STA
## 8  STA230 Introduction to Data Science STA
## 9  STA310         Statistical Modeling STA
## 10 STA335   Probability & Statistics I STA

Another helpful clause to pair with this is IN (works like %in% in dplyr) for multiple categories.

# select Chem, STAT using IN clause
dbGetQuery(mydb, "SELECT * FROM week9courses WHERE did IN ('CHM', 'STA') ")
##       cid                        cname did
## 1  CHM129      General Chemistry w/lab CHM
## 2  CHM210   Analytical Chemistry w/Lab CHM
## 3  CHM221    Organic Chemistry I w/lab CHM
## 4  CHM358  Instrumental Analysis w/lab CHM
## 5  CHM363   Physical Chemistry I w/lab CHM
## 6  CHM395     ST: Adv NMR Spectroscopy CHM
## 7  STA209           Applied Statistics STA
## 8  STA230 Introduction to Data Science STA
## 9  STA310         Statistical Modeling STA
## 10 STA335   Probability & Statistics I STA

LIKE Operator

The following code shows how the LIKE operator works to match patterns in a column. The course_offering table has info on professors and the department of courses they have taught. Notice the syntax for how to match things. % serves as a wildcard operator in the LIKE clause.

# pick all stat courses, regardless of number
dbGetQuery(mydb, "SELECT * from week9courses
           WHERE cid LIKE 'STA%'")
##      cid                        cname did
## 1 STA209           Applied Statistics STA
## 2 STA230 Introduction to Data Science STA
## 3 STA310         Statistical Modeling STA
## 4 STA335   Probability & Statistics I STA
# pick only 300 level stat courses
dbGetQuery(mydb, "SELECT * from week9courses
           WHERE cid LIKE 'STA3%'")
##      cid                      cname did
## 1 STA310       Statistical Modeling STA
## 2 STA335 Probability & Statistics I STA

Question 7: Select the Professors whose name starts with either ‘A’ or ‘E’ and display their titles and phone numbers

Question 8: Using the office_hours table, find all professors who have office hours on Fridays. Then narrow this down to office hours beginning in the afternoon on Fridays. Hint: Times can be listed as character objects but inequalities (> or <) still work with them.


Renaming things using CASE, THEN and ELSE

You can rename or categorize values in SQL by using a CASE WHEN … THEN … ELSE statement, which checks conditions and assigns new labels based on those conditions.

The following code creates a new variable in our table, classifying courses as having a lab or not.

dbGetQuery(mydb, "SELECT cid, cname,
           CASE
           WHEN LOWER(cname) LIKE '%lab' THEN 'Yes'
           ELSE 'No'
           END AS lab
           FROM week9courses
           LIMIT 6")
##      cid                       cname lab
## 1 CHM129     General Chemistry w/lab Yes
## 2 CHM210  Analytical Chemistry w/Lab Yes
## 3 CHM221   Organic Chemistry I w/lab Yes
## 4 CHM358 Instrumental Analysis w/lab Yes
## 5 CHM363  Physical Chemistry I w/lab Yes
## 6 CHM395    ST: Adv NMR Spectroscopy  No

Question 9: Use your query for Question 8 as inspiration. Pull the instructor info, day, start_time, and then make a new column that shows whether the office hours start in the morning or afternoon.


Lab Part 2: Writing Tables (added Wednesday)

In this part of the lab, we will pivot back to working with the week9... data tables. Run the following code to create R environment versions of the week9... tables.

week9courses = dbGetQuery(mydb, "SELECT * FROM week9courses;")
week9departments = dbGetQuery(mydb, "SELECT * FROM week9departments")
week9profcourses = dbGetQuery(mydb, "SELECT * FROM week9profcourses")
week9professors = dbGetQuery(mydb, "SELECT * FROM week9professors")

Creating a temporary database in memory

The rest of this is mostly for illustrative purposes. After running the above code to store the week9... tables in the R environment, run the following command:

dbDisconnect(mydb)
## [1] TRUE
mydb <- dbConnect(RSQLite::SQLite(), ":memory:")

Creating a Table

## Create Table
dbExecute(mydb, "CREATE TABLE `profcourses` (
  `pid` int(11) NOT NULL,
  `cid` varchar(6) NOT NULL
)")
## [1] 0
## Fill Table
dbWriteTable(conn = mydb, name = 'profcourses', value = week9profcourses, append = TRUE, header = FALSE, row.names = FALSE)

## Check
dbGetQuery(mydb,"Select * from profcourses LIMIT 5;")
##   pid    cid
## 1  42 MAT131
## 2  42 MAT218
## 3  41 ECN111
## 4  41 ECN295
## 5  40 ECN366

This works as follows: we use dbExecute not dbGetQuery since we are executing a function on the database. We are creating a table called profcourses. It has 2 values, a pid that is an integer with at most 11 characters, and a cid which is a string of at most 6 characters. Neither of them is allowed to be NULL.

Question 10: Using similar queries, create and fill tables for the other 3 dataframes.

Question 11: Investigating Options

  • What happens if you try to create a table that already exists with append = TRUE?
  • What if append=FALSE?
  • What if header=TRUE?
  • What if you rerun the writetable a second time with the original values?
  • What if you write a table to a “name” table that doesn’t exist?

Disconnecting the SQL connection

dbDisconnect(mydb)

Note: I may add one small extra bit for Subqueries for class on Friday.