This lab focuses on the basics of SQL:
The “Lab” section is something you will work on with a partner using paired programming, a framework defined as follows:
library(DBI)
library(odbc)
library(RMySQL)
library(tidyverse)
library(dbplyr)
So far we’ve only dealt with data that could be stored in a single spreadsheet; however, many real world applications require more flexibility.
Data are often stored in databases, or organized collections of structured information (data). Most databases are a collection of tables of data like those we’ve been working with so far in class. For this class, and most applications in general, we will work with Relational Databases.
The essential component of a Relational Database is its relational structure. A Relational Database consists of a number of tables. Each Table consists of columns of attributes and unique rows of information. Tables may have a “Primary Key” which is a unique identifier. They may also have “Foreign Keys” which are keys that can be used to match to other tables. It is best practice that the first column be the Primary Key if it exists.
The combination of a Primary and Foreign Key is a relation and where the name comes from. The tables can be related to each other to store, access, and modify data more efficiently.
For more information on databases, check out: https://www.oracle.com/database/what-is-database
https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/sql.html
When working with databases, we normally use Structured Query Language or SQL. There are a number of different implementations, but they all share the standard commands.
By convention all SQL syntax is written in UPPER CASE and variables are written in lower case. However it is truly case insensitive, unlike R which is case sensitive.
Most of the time SQL is used to read data from a database. This week, we will not worry about how to write, alter, or delete databases and tables. We will however do so next week in order to make the project feasible. However, I will still include the commands in the summary today.
A basic SQL query looks like “SELECT [attribute] FROM [table] WHERE [some filter];”
There are a number of useful commands, and the cheat sheet at https://www.geeksforgeeks.org/sql-cheat-sheet/# can be very useful.
For these (and other reasons) you will be learning how to utilize SQL through R packages.
The format is [database name]<-dbConnect([SQL Interpreter],[connection])
Student (Read only, Campus only) Credentials
db_user <- 'studentUser'
db_password <- 'STAread!'
db_name <- 'Test'
db_table <- 'your_data_table'
db_host <- 'aiken.cs.grinnell.edu' # for local access
db_port <- 3306
mydb <- dbConnect(MySQL(), user = db_user, password = db_password,
dbname = db_name, host = db_host, port = db_port)
#dbDisconnect(mydb)
You should always disconnect when you are done.
When using SQL in R using the DBI package, there are two commands that you will use most often:
In both cases the format is generally dbFunction([database],“[SQL command]”)
If we want to see what Tables are in the database:
dbListTables(mydb)
## [1] "course" "course_offering" "customers" "office_hours"
## [5] "orders" "professor_info" "syllabus_info" "week9courses"
## [9] "week9departments" "week9profcourses" "week9professors"
Caveat: This is not necessarily the best designed database since there are many practice tables in here for you all to work with.
Read everything from the tables
dbGetQuery(mydb, "SELECT * FROM customers;")
## id name
## 1 4 Tukey
## 2 8 Wickham
## 3 15 Mason
## 4 16 Jordan
## 5 23 Patil
## 6 42 Cox
dbGetQuery(mydb, "SELECT * FROM orders;")
## order id date
## 1 1 4 Jan-01
## 2 2 8 Feb-01
## 3 3 42 Apr-15
## 4 4 50 Apr-17
## 5 11 4 Apr-18
## 6 12 8 Feb-11
## 7 13 42 Jul-15
## 8 14 50 Jun-17
dbGetQuery(mydb, "SELECT date, id FROM orders;")
## date id
## 1 Jan-01 4
## 2 Feb-01 8
## 3 Apr-15 42
## 4 Apr-17 50
## 5 Apr-18 4
## 6 Feb-11 8
## 7 Jul-15 42
## 8 Jun-17 50
dbGetQuery(mydb, "SELECT date, id as CustomerID FROM orders WHERE id<10;")
## date CustomerID
## 1 Jan-01 4
## 2 Feb-01 8
## 3 Apr-18 4
## 4 Feb-11 8
The LEFT JOIN function is used to perform a left outer join. Note that SQL expects you to specify which keys you are comparing. This allows you to use different keys at times.
While we will mostly use left joins, SQL does have the other joins as well.
# Left join with limit
dbGetQuery(mydb, "SELECT * FROM orders as a
LEFT JOIN customers as b ON a.id = b.id
LIMIT 5;")
## order id date id name
## 1 1 4 Jan-01 4 Tukey
## 2 11 4 Apr-18 4 Tukey
## 3 2 8 Feb-01 8 Wickham
## 4 12 8 Feb-11 8 Wickham
## 5 3 42 Apr-15 42 Cox
# Left join without limit
dbGetQuery(mydb, "SELECT * FROM orders as a
LEFT JOIN customers as b ON a.id = b.id;")
## order id date id name
## 1 1 4 Jan-01 4 Tukey
## 2 11 4 Apr-18 4 Tukey
## 3 2 8 Feb-01 8 Wickham
## 4 12 8 Feb-11 8 Wickham
## 5 3 42 Apr-15 42 Cox
## 6 13 42 Jul-15 42 Cox
## 7 4 50 Apr-17 NA <NA>
## 8 14 50 Jun-17 NA <NA>
# Right join without limit
dbGetQuery(mydb, "SELECT * FROM orders as a
RIGHT JOIN customers as b ON a.id = b.id;")
## order id date id name
## 1 1 4 Jan-01 4 Tukey
## 2 2 8 Feb-01 8 Wickham
## 3 3 42 Apr-15 42 Cox
## 4 11 4 Apr-18 4 Tukey
## 5 12 8 Feb-11 8 Wickham
## 6 13 42 Jul-15 42 Cox
## 7 NA NA <NA> 15 Mason
## 8 NA NA <NA> 16 Jordan
## 9 NA NA <NA> 23 Patil
# Left join without limit
dbGetQuery(mydb, "SELECT * FROM customers as a
LEFT JOIN orders as b ON a.id = b.id;")
## id name order id date
## 1 4 Tukey 1 4 Jan-01
## 2 8 Wickham 2 8 Feb-01
## 3 42 Cox 3 42 Apr-15
## 4 4 Tukey 11 4 Apr-18
## 5 8 Wickham 12 8 Feb-11
## 6 42 Cox 13 42 Jul-15
## 7 15 Mason NA NA <NA>
## 8 16 Jordan NA NA <NA>
## 9 23 Patil NA NA <NA>
# Inner join without limit
dbGetQuery(mydb, "SELECT * FROM customers as a
INNER JOIN orders as b ON a.id = b.id;")
## id name order id date
## 1 4 Tukey 1 4 Jan-01
## 2 8 Wickham 2 8 Feb-01
## 3 42 Cox 3 42 Apr-15
## 4 4 Tukey 11 4 Apr-18
## 5 8 Wickham 12 8 Feb-11
## 6 42 Cox 13 42 Jul-15
# Cross Join: all possible combinations
dbGetQuery(mydb, "SELECT * FROM customers as a
CROSS JOIN orders as b ON a.id = b.id;")
## id name order id date
## 1 4 Tukey 1 4 Jan-01
## 2 8 Wickham 2 8 Feb-01
## 3 42 Cox 3 42 Apr-15
## 4 4 Tukey 11 4 Apr-18
## 5 8 Wickham 12 8 Feb-11
## 6 42 Cox 13 42 Jul-15
#original join
dbGetQuery(mydb, "SELECT * FROM orders as a
LEFT JOIN customers as b ON a.id = b.id
LIMIT 5;")
## order id date id name
## 1 1 4 Jan-01 4 Tukey
## 2 11 4 Apr-18 4 Tukey
## 3 2 8 Feb-01 8 Wickham
## 4 12 8 Feb-11 8 Wickham
## 5 3 42 Apr-15 42 Cox
#better join
dbGetQuery(mydb, "SELECT a.*, b.name FROM orders as a
LEFT JOIN customers as b ON a.id = b.id
ORDER BY name DESC
LIMIT 5;")
## order id date name
## 1 2 8 Feb-01 Wickham
## 2 12 8 Feb-11 Wickham
## 3 1 4 Jan-01 Tukey
## 4 11 4 Apr-18 Tukey
## 5 13 42 Jul-15 Cox
At this point you will be working on the lab with your partner(s). We will be using the “week9…” tables today.
Question 1 Make sure that you have connected to the database correctly by running the following code and comparing the output
dbListTables(mydb)
## [1] "course" "course_offering" "customers" "office_hours"
## [5] "orders" "professor_info" "syllabus_info" "week9courses"
## [9] "week9departments" "week9profcourses" "week9professors"
Question 2* Write queries to return the first 3 rows of the 4 tables that start week9. Describe the tables and what you think these represent.
\(~\)
You can also save the results of the query to an R dataframe. For example:
exampleQuery<-dbGetQuery(mydb, "SELECT a.*, b.name FROM orders as a
LEFT JOIN customers as b ON a.id = b.id
ORDER BY name DESC
LIMIT 5;")
print(exampleQuery)
## order id date name
## 1 2 8 Feb-01 Wickham
## 2 12 8 Feb-11 Wickham
## 3 1 4 Jan-01 Tukey
## 4 11 4 Apr-18 Tukey
## 5 13 42 Jul-15 Cox
Question 3 Write queries to pull the week9 tables and store them as R dataframes with the same name.
Question 4
Question 5 Using the tables imported into R, complete the next two parts to determine the number of faculty in each department by department ID. The following tables were created with SQL queries.
## pid did
## 1 42 MAT
## 2 41 ECN
## 3 40 ECN
## 4 39 CHM
## 5 38 ECN
## did COUNT(*)
## 1 CHM 11
## 2 CSC 8
## 3 ECN 9
## 4 MAT 9
## 5 STA 5
Question 6 (Question 5, but in SQL) Write a SQL query to determine the number of faculty in each department by department name. Use the corresponding tables in R (from Q3) to make sure your query works.
Part A: Write a SQL query to get a table containing only professor ids and department ids.
Part B: Write a SQL query to get a table containing only professor ids and the name of the department they are in.
Part C: Write a SQL query to determine the number of faculty in each department by department id (this should match the table in 5b).
Part D: Put it all together to write one SQL query to determine the number of faculty in each department by department name. The first line of the table should say “Chemistry 11”
Part E: What benefits does using LIMIT have for using table operations?
Question 7 Using R create a table consisting of only pid, cname, and depname.
Part A: First create a table by left joining professors and profcourses
Part B: create a table by left joining the table from Part A and courses. At this point you should be able to match pid and cname.
Part C: Create a table by left joining the table from Part B and departments using the FOREIGN KEY from professors. Keep only the three columns we care about. Sort the Dataset by depname (ascending), then pid (descending), then cname (ascending).
Part D: Create a table by left joining the table from Part B and departments using the FOREIGN KEY from courses. Keep only the three columns we care about. Sort the Dataset by depname (ascending), then pid (descending), then cname (ascending).
Part E: Are these tables the same? If so prove it. If not, filter the data to the pid corresponding to Professor Rebelsky (3). Which courses are different? Which table is “Correct”? Justify your answer.
Question 8 Repeat part 7 using SQL. Do you get the same results in each part?
Question 9 Investigate the other tables in the database. What do you think we will use them for?
Question 10 Practice with other commands from SQL, which ones seem most complicated?
dbDisconnect(mydb)
## [1] TRUE