This lab focuses on the basics of SQL:

Directions for all labs (read before starting)

  1. Please work together with your assigned partner. Make sure you both fully understand something before moving on.
  2. Record your answers to lab questions separately from the lab’s examples. You and your partner should only turn in responses to lab questions, nothing more and nothing less.
  3. Ask for help, clarification, or even just a check-in if anything seems unclear.

The “Lab” section is something you will work on with a partner using paired programming, a framework defined as follows:

Preamble

Packages

library(DBI)
library(odbc)
library(RMySQL)
library(tidyverse)
library(dbplyr)

Basics of SQL

Relational Databases (Review of Lab 5)

So far we’ve only dealt with data that could be stored in a single spreadsheet; however, many real world applications require more flexibility.

Data are often stored in databases, or organized collections of structured information (data). Most databases are a collection of tables of data like those we’ve been working with so far in class. For this class, and most applications in general, we will work with Relational Databases.

The essential component of a Relational Database is its relational structure. A Relational Database consists of a number of tables. Each Table consists of columns of attributes and unique rows of information. Tables may have a “Primary Key” which is a unique identifier. They may also have “Foreign Keys” which are keys that can be used to match to other tables. It is best practice that the first column be the Primary Key if it exists.

The combination of a Primary and Foreign Key is a relation and where the name comes from. The tables can be related to each other to store, access, and modify data more efficiently.

For more information on databases, check out: https://www.oracle.com/database/what-is-database

Structured Query Language

https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/sql.html

When working with databases, we normally use Structured Query Language or SQL. There are a number of different implementations, but they all share the standard commands.

Conventions

By convention all SQL syntax is written in UPPER CASE and variables are written in lower case. However it is truly case insensitive, unlike R which is case sensitive.

Using SQL

Most of the time SQL is used to read data from a database. This week, we will not worry about how to write, alter, or delete databases and tables. We will however do so next week in order to make the project feasible. However, I will still include the commands in the summary today.

Basic Commands

A basic SQL query looks like “SELECT [attribute] FROM [table] WHERE [some filter];”

There are a number of useful commands, and the cheat sheet at https://www.geeksforgeeks.org/sql-cheat-sheet/# can be very useful.

  • SQL Queries:
    • SELECT: retrieve data
    • *: Wildcard, everything
      • Example: SELECT * FROM employees;
      • This returns all columns and all values from the employees table
    • DISTINCT: unique values from a column
      • Example: SELECT DISTINCT department FROM employees;
      • This returns the distinct departments in the employees table
    • WHERE: filter on conditions
    • LIMIT: limit return size
    • CASE: conditional logic
  • Filters:
    • WHERE: filter rows on conditions
    • LIKE: Match a pattern in a column
    • IN: match any value in a list
    • Between: Values in a specified range
    • IS NULL: null values
    • ORDER BY: Sort
    • AND/OR/NOT can be used in Where
  • Aggregation:
    • GROUP BY: group rows with same values
    • COUNT: count the values in a column
    • SUM: sum the values of a column
    • AVG: average the values
    • MIN: find minimum value
    • MAX: find max value
    • HAVING: filter on conditions
  • Sorting:
    • ORDER BY: sorts the output by the column(s) of interest.
    • default is ascending, use DESC to descend.

General SQL notes

  • In my experience, more than 90% of joins that you will do are left joins.
  • As a data scientist, unless you are maintaining the database, you will rarely, if ever, use SQL by itself. You will normally use it as part of a workflow in other languages (e.g. R, Python).

For these (and other reasons) you will be learning how to utilize SQL through R packages.

Connecting to a database

The format is [database name]<-dbConnect([SQL Interpreter],[connection])

Student (Read only, Campus only) Credentials

db_user <- 'studentUser'
db_password <- 'STAread!'
db_name <- 'Test'
db_table <- 'your_data_table'
db_host <- 'aiken.cs.grinnell.edu' # for local access
db_port <- 3306
mydb <-  dbConnect(MySQL(), user = db_user, password = db_password,
                   dbname = db_name, host = db_host, port = db_port)
#dbDisconnect(mydb)

You should always disconnect when you are done.

Most common functions

When using SQL in R using the DBI package, there are two commands that you will use most often:

  • dbExecute: to change the table
  • dbGetQuery: to read from the table
  • (dbListTables): figure out what tables exist in the database

In both cases the format is generally dbFunction([database],“[SQL command]”)

Examples

If we want to see what Tables are in the database:

dbListTables(mydb)
##  [1] "course"           "course_offering"  "customers"        "office_hours"    
##  [5] "orders"           "professor_info"   "syllabus_info"    "week9courses"    
##  [9] "week9departments" "week9profcourses" "week9professors"

Caveat: This is not necessarily the best designed database since there are many practice tables in here for you all to work with.

Toy example of customers and orders:

Read everything from the tables

dbGetQuery(mydb, "SELECT * FROM customers;")
##   id    name
## 1  4   Tukey
## 2  8 Wickham
## 3 15   Mason
## 4 16  Jordan
## 5 23   Patil
## 6 42     Cox
dbGetQuery(mydb, "SELECT * FROM orders;")
##   order id   date
## 1     1  4 Jan-01
## 2     2  8 Feb-01
## 3     3 42 Apr-15
## 4     4 50 Apr-17
## 5    11  4 Apr-18
## 6    12  8 Feb-11
## 7    13 42 Jul-15
## 8    14 50 Jun-17
dbGetQuery(mydb, "SELECT date, id FROM orders;")
##     date id
## 1 Jan-01  4
## 2 Feb-01  8
## 3 Apr-15 42
## 4 Apr-17 50
## 5 Apr-18  4
## 6 Feb-11  8
## 7 Jul-15 42
## 8 Jun-17 50
dbGetQuery(mydb, "SELECT date, id as CustomerID FROM orders WHERE id<10;")
##     date CustomerID
## 1 Jan-01          4
## 2 Feb-01          8
## 3 Apr-18          4
## 4 Feb-11          8

Joins

The LEFT JOIN function is used to perform a left outer join. Note that SQL expects you to specify which keys you are comparing. This allows you to use different keys at times.

While we will mostly use left joins, SQL does have the other joins as well.

# Left join with limit
dbGetQuery(mydb, "SELECT * FROM orders as a
LEFT JOIN customers as b ON a.id = b.id
LIMIT 5;")
##   order id   date id    name
## 1     1  4 Jan-01  4   Tukey
## 2    11  4 Apr-18  4   Tukey
## 3     2  8 Feb-01  8 Wickham
## 4    12  8 Feb-11  8 Wickham
## 5     3 42 Apr-15 42     Cox
# Left join without limit
dbGetQuery(mydb, "SELECT * FROM orders as a
LEFT JOIN customers as b ON a.id = b.id;")
##   order id   date id    name
## 1     1  4 Jan-01  4   Tukey
## 2    11  4 Apr-18  4   Tukey
## 3     2  8 Feb-01  8 Wickham
## 4    12  8 Feb-11  8 Wickham
## 5     3 42 Apr-15 42     Cox
## 6    13 42 Jul-15 42     Cox
## 7     4 50 Apr-17 NA    <NA>
## 8    14 50 Jun-17 NA    <NA>
# Right join without limit
dbGetQuery(mydb, "SELECT * FROM orders as a
RIGHT JOIN customers as b ON a.id = b.id;")
##   order id   date id    name
## 1     1  4 Jan-01  4   Tukey
## 2     2  8 Feb-01  8 Wickham
## 3     3 42 Apr-15 42     Cox
## 4    11  4 Apr-18  4   Tukey
## 5    12  8 Feb-11  8 Wickham
## 6    13 42 Jul-15 42     Cox
## 7    NA NA   <NA> 15   Mason
## 8    NA NA   <NA> 16  Jordan
## 9    NA NA   <NA> 23   Patil
# Left join without limit
dbGetQuery(mydb, "SELECT * FROM customers as a
LEFT JOIN orders as b ON a.id = b.id;")
##   id    name order id   date
## 1  4   Tukey     1  4 Jan-01
## 2  8 Wickham     2  8 Feb-01
## 3 42     Cox     3 42 Apr-15
## 4  4   Tukey    11  4 Apr-18
## 5  8 Wickham    12  8 Feb-11
## 6 42     Cox    13 42 Jul-15
## 7 15   Mason    NA NA   <NA>
## 8 16  Jordan    NA NA   <NA>
## 9 23   Patil    NA NA   <NA>
# Inner join without limit
dbGetQuery(mydb, "SELECT * FROM customers as a
INNER JOIN orders as b ON a.id = b.id;")
##   id    name order id   date
## 1  4   Tukey     1  4 Jan-01
## 2  8 Wickham     2  8 Feb-01
## 3 42     Cox     3 42 Apr-15
## 4  4   Tukey    11  4 Apr-18
## 5  8 Wickham    12  8 Feb-11
## 6 42     Cox    13 42 Jul-15
# Cross Join: all possible combinations
dbGetQuery(mydb, "SELECT * FROM customers as a
CROSS JOIN orders as b ON a.id = b.id;")
##   id    name order id   date
## 1  4   Tukey     1  4 Jan-01
## 2  8 Wickham     2  8 Feb-01
## 3 42     Cox     3 42 Apr-15
## 4  4   Tukey    11  4 Apr-18
## 5  8 Wickham    12  8 Feb-11
## 6 42     Cox    13 42 Jul-15
#original join
dbGetQuery(mydb, "SELECT * FROM orders as a
LEFT JOIN customers as b ON a.id = b.id
LIMIT 5;")
##   order id   date id    name
## 1     1  4 Jan-01  4   Tukey
## 2    11  4 Apr-18  4   Tukey
## 3     2  8 Feb-01  8 Wickham
## 4    12  8 Feb-11  8 Wickham
## 5     3 42 Apr-15 42     Cox
#better join
dbGetQuery(mydb, "SELECT a.*, b.name FROM orders as a
LEFT JOIN customers as b ON a.id = b.id
ORDER BY name DESC
LIMIT 5;")
##   order id   date    name
## 1     2  8 Feb-01 Wickham
## 2    12  8 Feb-11 Wickham
## 3     1  4 Jan-01   Tukey
## 4    11  4 Apr-18   Tukey
## 5    13 42 Jul-15     Cox

Lab

At this point you will be working on the lab with your partner(s). We will be using the “week9…” tables today.

Basics

Question 1 Make sure that you have connected to the database correctly by running the following code and comparing the output

dbListTables(mydb)
##  [1] "course"           "course_offering"  "customers"        "office_hours"    
##  [5] "orders"           "professor_info"   "syllabus_info"    "week9courses"    
##  [9] "week9departments" "week9profcourses" "week9professors"

Question 2* Write queries to return the first 3 rows of the 4 tables that start week9. Describe the tables and what you think these represent.

\(~\)

You can also save the results of the query to an R dataframe. For example:

exampleQuery<-dbGetQuery(mydb, "SELECT a.*, b.name FROM orders as a
LEFT JOIN customers as b ON a.id = b.id
ORDER BY name DESC
LIMIT 5;")
print(exampleQuery)
##   order id   date    name
## 1     2  8 Feb-01 Wickham
## 2    12  8 Feb-11 Wickham
## 3     1  4 Jan-01   Tukey
## 4    11  4 Apr-18   Tukey
## 5    13 42 Jul-15     Cox

Question 3 Write queries to pull the week9 tables and store them as R dataframes with the same name.

Joins

Question 4

  • Part A: Write a left join in SQL to add the department name to the data in the professors table and return the first 5 values.
  • Part B: Do the same thing using R (dplyr) commands with the tables saved in Q3.
  • Part C: Are the resulting tables the same? If so, why? If not, what differences are there?

Question 5 Using the tables imported into R, complete the next two parts to determine the number of faculty in each department by department ID. The following tables were created with SQL queries.

  • Part A: Create a table containing only professor ids and department ids. I have included what the first 5 lines should look like.
##   pid did
## 1  42 MAT
## 2  41 ECN
## 3  40 ECN
## 4  39 CHM
## 5  38 ECN
  • Part B: Create a table containing only the number of faculty in each department by department id. Your answer should match the table below. You may leave out the “(*)” of the column name.
##   did COUNT(*)
## 1 CHM       11
## 2 CSC        8
## 3 ECN        9
## 4 MAT        9
## 5 STA        5

Question 6 (Question 5, but in SQL) Write a SQL query to determine the number of faculty in each department by department name. Use the corresponding tables in R (from Q3) to make sure your query works.

  • Part A: Write a SQL query to get a table containing only professor ids and department ids.

  • Part B: Write a SQL query to get a table containing only professor ids and the name of the department they are in.

  • Part C: Write a SQL query to determine the number of faculty in each department by department id (this should match the table in 5b).

  • Part D: Put it all together to write one SQL query to determine the number of faculty in each department by department name. The first line of the table should say “Chemistry 11”

  • Part E: What benefits does using LIMIT have for using table operations?

R vs SQL

Question 7 Using R create a table consisting of only pid, cname, and depname.

  • Part A: First create a table by left joining professors and profcourses

  • Part B: create a table by left joining the table from Part A and courses. At this point you should be able to match pid and cname.

  • Part C: Create a table by left joining the table from Part B and departments using the FOREIGN KEY from professors. Keep only the three columns we care about. Sort the Dataset by depname (ascending), then pid (descending), then cname (ascending).

  • Part D: Create a table by left joining the table from Part B and departments using the FOREIGN KEY from courses. Keep only the three columns we care about. Sort the Dataset by depname (ascending), then pid (descending), then cname (ascending).

  • Part E: Are these tables the same? If so prove it. If not, filter the data to the pid corresponding to Professor Rebelsky (3). Which courses are different? Which table is “Correct”? Justify your answer.

Question 8 Repeat part 7 using SQL. Do you get the same results in each part?

If time only (not graded)

Question 9 Investigate the other tables in the database. What do you think we will use them for?

Question 10 Practice with other commands from SQL, which ones seem most complicated?

dbDisconnect(mydb)
## [1] TRUE