Class Introduction

Syllabus Link

Important notes:

Class Format

Most classes will have the following format:

  1. 5-10 minutes: Class announcements, preamble to lab (together)
  2. Work on Lab (with partner)
  3. 5-10 minutes: Class wrap up: reminder of goals/what you should have learned

Notes:

  1. This semester I am experimenting with 1-2 labs per week that cover 2-3 days. I will try to give you an idea of how far you should get on each day.
  2. We will only go over the preamble on the first day of a lab.
  3. If you don’t finish the lab in class, you will need to finish it outside of class. You can either continue to work with your partner, or you can finish the lab on your own.

Lab format

The “Lab” section is something you will work on with a partner using paired programming, a framework defined as follows:

Partners are encouraged to switch roles throughout the “Lab” section, but for the first few labs the less experienced coder should spend more time as the driver.

Directions for all labs (read before starting)

  1. Please work together with your assigned partner (this will not apply to the first lab; just work with whoever you sit next to). Make sure you both fully understand something before moving on.
  2. Record your answers to lab questions separately from the lab’s examples. You and your partner should only turn in responses to lab questions, nothing more and nothing less.
  3. Ask for help, clarification, or even just a check-in if anything seems unclear.

Preamble

Today’s preamble will be longer than usual in order to review important parts of R as a class.

Goals

  • Introduce the class and class format
  • Refresh your memory of R and RStudio
  • Remind you of the differences between an R script and an R markdown file
  • Introduce additional miscellaneous topics that will make working in R better
    • File paths
    • Basic Loops
    • Row Binding

The Layout of RStudio

After you open RStudio, the first thing you’ll want to do is open a file to work in. You can do this by navigating: File -> New File -> R Script, which will open a new window in the top left of the RStudio interface for you to work in. At this point you should see four panels:

  1. Your R Script (top left)
  2. The Console (bottom left)
  3. Your Environment (top right)
  4. The Files/Plots/Help viewer (bottom right)

An R Script is like a text file that stores your code while you work on it. At any point you can send some or all of the code in your R Script to the Console to execute. You can also type commands directly into the Console. The Console will echo any code you run, and it will display any textual/numeric output generated by your code.

The Environment shows you the names of data sets, variables, and user-created functions that have been loaded into your work space and can be accessed by your code. The Files/Plots/Help Viewer will display graphics generated by your code and a few other useful entities (like help documentation and file trees).


Packages

To facilitate more complex tasks in R, many people have developed their own sets of functions known as packages. If you plan on working with a new package for the first time, it must be installed:

install.packages("ggplot2", repos = "http://cran.us.r-project.org")

Once a package is installed, it still needs to be loaded into your R session using the library() function (or require()) before its contents can be used.

You’ll need to re-load a package every time you open RStudio, but you’ll only need to install it once.

my_data <- read.csv("https://remiller1450.github.io/data/HappyPlanet.csv")
library(ggplot2)
qplot(my_data$Region)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
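
To see the difference between a package being installed and being loaded, here is a small sketch. It uses the splines package only because it ships with R (a choice of convenience, not a package this lab otherwise needs), so nothing has to be downloaded. The search() function lists the packages currently attached to your session.

```r
# 'splines' ships with R, so it is installed but not attached by default
installed <- requireNamespace("splines", quietly = TRUE)  # checks that the package is installed
loaded_before <- "package:splines" %in% search()          # is it attached yet?

library(splines)                                          # attach it for this session
loaded_after <- "package:splines" %in% search()           # now it is attached

print(c(installed = installed, before = loaded_before, after = loaded_after))
```

After restarting R, the package would still be installed, but library() would need to be run again.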


R scripts vs R markdown

R Scripts are built to contain only executable R code and comments.

RStudio supports several other types of files, some of which use the “Markdown” authoring framework. An “R Markdown” file allows you to both:

  1. Write and execute R code
  2. Generate a high quality, reproducible report

To use R Markdown, you’ll need the rmarkdown package. In order to knit to pdf instead of html, you will likely need the tinytex package:

install.packages("rmarkdown")
library("rmarkdown")
install.packages("tinytex")
tinytex::install_tinytex()

Once you have the package installed and loaded, you can create a new R Markdown file by selecting: File -> New File -> R Markdown.

At the top of the document is the header:

  • This section is initiated by three ‘-’ characters and closed by another three ‘-’ characters
  • It contains the title, author, etc. that appears at the top of the document created by your code
  • You can use it to add elements like a table of contents, page numbers, etc.

The second thing you’ll see is a code chunk:

  • Code chunks are initiated by ```{r} and closed by ```
  • The ``` wrappers tell R Markdown that what appears inside is code that should be executed. The first code chunk, initiated by ```{r setup} sets up options that will be used in executing your R code when your report is built. For now, you should keep this chunk as it appears and place your actual code inside of other code chunks.
  • You can execute the R code in a chunk by clicking the small green arrow in the upper right corner. You can also highlight individual code pieces and execute them using Ctrl-Enter.

Next you’ll see section headers:

  • Sections are created using strings of the # character.
  • The number of # characters used determines the level (size) of the header.

Finally, R Markdown allows you to type ordinary text outside of code chunks. Thus, you can easily integrate written text into the same document as your code and its output.

The primary purpose of R Markdown is to create documents that blend R code, output, and text into a polished report. To generate this document you must compile your R Markdown file using the “Knit” button (a blue yarn ball icon) located towards the upper left part of your screen.

The R Markdown cheat sheet can be found here: https://raw.githubusercontent.com/rstudio/cheatsheets/main/rmarkdown.pdf

Question 0: Create a new R Markdown file and delete all of the template code that appears beneath the “r setup” code block. Change the title to “Lab #1” and the author to your name(s). Next, create section labels for each question in the lab using three # characters followed by “Question X” (where X is the number of the question).

Question 0 (continued): R Markdown will use LaTeX typesetting for any text wrapped in $ characters. For example, $\beta$ will appear as the Greek letter \(\beta\) after you knit your document. To practice this, add a label for Question #1 and below it include $H_0: \mu = 0$ in a sentence (the sentence can say anything, but it should not be inside an R code chunk or a section header).


Using R

R is an interpreted programming language, which allows you to have the computer execute any piece of code contained in your R Script at any time without a lengthy compiling process.

To run a single piece of code, simply highlight it and either hit Ctrl-Enter or click on the “Run” button near the top right corner of your R Script. You should see an echo of the code you ran in the Console, along with any response generated by that code.

4 + 6 - (24/6)
## [1] 6
5 ^ 2 + 2 * 2
## [1] 29

The examples shown above demonstrate how R can be used as a calculator. However, most of the code we will write will rely upon functions, or pre-built units of code that translate one or more inputs into one or more outputs.

log(x = 4, base = 2)
## [1] 2

The example above demonstrates the log() function. The input named “x” is set to be 4, and the input named “base” is set to 2. The labels given to these inputs, “x” and “base”, are the function’s arguments. The function returns the output “2”, which is \(\log_2{(4)}\). Note that log(4, 2) will also produce the output “2” as any unlabeled inputs are mapped to arguments in the order defined by the creator of the function.
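
The matching rules described above can be checked directly. This is a small sketch using only log(); the object names (a, b, c_named, d) are made up for illustration.

```r
a <- log(x = 4, base = 2)         # both arguments named
b <- log(4, 2)                    # positional: the first input is x, the second is base
c_named <- log(base = 2, x = 4)   # named arguments may be given in any order
d <- log(2, 4)                    # positional but swapped: log base 4 of 2

print(c(a, b, c_named, d))
```

The first three calls agree, while the last one returns a different value because the unlabeled inputs were mapped to the arguments in a different order. Naming your arguments protects against this kind of mistake.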


Help Documentation

You’ll eventually end up memorizing the arguments of common R functions; however, while you’re learning I strongly encourage you to read the help documentation for any R function used in your code. You can access a function’s documentation by typing a ? in front of the function name and submitting to the console.

?log

In addition, if you think there should be a function but you don’t know what it is called, you can search the help documentation by keyword using ??:

??logarithm

Adding Comments

When coding, it is good practice to include comments that describe what your code is doing. In R the character “#” is used to start a comment. Everything appearing on the same line to the right of the “#” will not be executed when that line is submitted to the console.

# This entire line is a comment and will do nothing if run
1:6 # The command "1:6" appears before this comment
## [1] 1 2 3 4 5 6

In your R Script, comments appear in green. You also should remember that the “#” starts a comment only for a single line of your R Script, so long comments requiring multiple lines should each begin with their own “#”.


Lab

The remainder of the lab is to be completed by you and your lab partner. You should work at a comfortable pace that ensures both of you thoroughly understand the lab’s contents and examples.

Loading Data

An important part of data science is reproducibility, or the ability for two people to independently replicate the results of a project.

To ensure reproducibility, every data analysis should begin by importing raw data into R and manipulating it using documented (commented) code. Further, the raw data should be imported using functions, such as read.csv, instead of the point and click interface provided by the “Import Dataset” button (at the top of the environment pane).

Below are two different examples:

## Loading a CSV file from a web URL (storing it as "my_data")
my_data <- read.csv("https://some_webpage/some_data.csv")
## Loading a CSV file with a local file path
my_data <- read.csv("H:/path_to_my_data/my_data.csv")

A few things to note.

  1. Both <- and = can be used to assign something to a named object, and both create the object in the environment where the assignment is run. The main difference is that inside a function call, = names an argument rather than creating an object; for example, mean(x = 1:3) does not create x, while mean(x <- 1:3) does. For the purposes of this course, we can use the two interchangeably since our code will “live” in the global environment.
  2. File paths must use / or \\. A single \ is used by R to start an escape sequence that represents a special text character. For example, \n creates a new line in a string of text.
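
The note about file paths can be tried out without reading any file. The sketch below reuses the placeholder path from the example above (it does not need to exist on your computer) and also shows file.path(), a base R helper that joins the pieces of a path with /.

```r
# Forward slashes work on every operating system, including Windows
path1 <- "H:/path_to_my_data/my_data.csv"

# Backslashes must be doubled inside an R string
path2 <- "H:\\path_to_my_data\\my_data.csv"

# file.path() joins the pieces with "/" so separators are not typed by hand
path3 <- file.path("H:", "path_to_my_data", "my_data.csv")

cat(path2, "\n")          # cat() shows the string as stored: single backslashes
identical(path1, path3)   # the two forward-slash versions are the same string

# "\n" is a single special character (a new line), not two characters
nchar("\n")
```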

Loading a Single File

Question 2A: Add code to your script that uses the read.csv() function to create an object named my_data that contains the “Happy Planet” data stored at: https://remiller1450.github.io/data/HappyPlanet.csv

After running your Question 2A code, an entry named “my_data” should appear in the Environment panel (top right).

You can click on the small arrow icon to reveal the data’s structure, or you can click on the object’s name to view the data in spreadsheet format.

Question 2B: Inspect the structure of my_data and view the data set in spreadsheet format. In an R comment, briefly describe how this data set is structured (i.e., what does each row and column represent, what are some of the columns, etc.)


Objects and Assignments

R stores data in containers called objects. Data is assigned into an object using <- or =. After assignment, data can be referenced using the object’s name. The simplest objects are scalars, or a single element:

x <- 5 # This assigns the value '5' to an object called 'x'
x^2    # We can now reference 'x'
## [1] 25

R stores sequences of elements in objects called vectors:

x <- 1:3 # The sequence {1, 2, 3} is assigned to the vector called 'x'
print(x)
## [1] 1 2 3
y <- c(1,2,3) # The function 'c' concatenates arguments (separated by commas) into a vector
print(y)
## [1] 1 2 3
z <- c("A","B","C") # Vectors can contain many types of values
print(z)
## [1] "A" "B" "C"

The three most important types of vectors are:

  1. numeric vectors - for example: x = c(1,2,3)
  2. character vectors - for example: x = c("A","B","C")
  3. logical vectors - for example: x = c(TRUE, FALSE, TRUE)

You should always consider a vector’s type before using it. Many functions expect specific input types and will produce an error or a warning if the wrong type is used. You can check the type of an object using the typeof() function:

chars <- c("1","2","3") # Create a character vector
typeof(chars)
## [1] "character"
nums <- c(1,2,3) # Create a numeric vector
typeof(nums)
## [1] "double"
mean(chars) # This produces a warning and returns NA, mean() only works for numeric (or logical) vectors
## Warning in mean.default(chars): argument is not numeric or logical: returning
## NA
## [1] NA
mean(nums) # This works as intended
## [1] 2

Certain R functions are vectorized, meaning they can accept a scalar input, for example 1, and return the scalar output f(1), or they can accept a vector input, such as c(1,2,3), and return the vector c(f(1),f(2),f(3)). For example, sqrt() is vectorized:

nums <- c(1,2,3,4)
sqrt(nums)
## [1] 1.000000 1.414214 1.732051 2.000000

Data is usually stored in objects called data.frames, which are composed of several vectors of the same length:

DF <- data.frame(A = x, B = y, C = z) # Creates a data.frame object 'DF'
print(DF)
##   A B C
## 1 1 1 A
## 2 2 2 B
## 3 3 3 C

Functions like read.csv() will automatically store their output as a data frame:

my_data <- read.csv("https://remiller1450.github.io/data/HappyPlanet.csv")
typeof(my_data)
## [1] "list"

However, notice typeof() describes my_data as a list object. Lists are a flexible class of objects whose elements can be of any type. A data frame is a special case of a list.
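
One way to see the relationship between lists and data frames is to compare typeof() with class() on a small example. This is a sketch on a made-up data frame that mirrors DF from earlier; class(), is.list(), and is.data.frame() are base R functions not otherwise used in this lab.

```r
DF <- data.frame(A = 1:3, B = c(1, 2, 3), C = c("A", "B", "C"))

typeof(DF)          # the underlying storage type
class(DF)           # the class that most functions look at
is.list(DF)         # a data frame passes the list test
is.data.frame(DF)   # and it is also a data frame

# Each column is one component of the list, so list tools work on it
length(DF)          # number of components = number of columns
sapply(DF, typeof)  # the type of each column
```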

Shown below is an example list containing three components, two different data frames and the character string “ABC”:

my_list <- list(my_data, DF, "ABC")

Question 3: Create a data frame named my_DF containing two vectors, J and K, where J is created using the seq function to be a sequence from 0 to 100 counting by 10 and K is created using the rep function to replicate the character string “XYZ” the proper number of times. Hint: read the help documentation for each function (seq and rep) to determine the necessary arguments.


Indexing

Suppose we have a vector “x” and would like to extract the element in its second position and assign it to a new object called “b”:

x <- 5:10
b <- x[2]
b
## [1] 6

The square brackets, [ and ], are used to access a certain position (or multiple positions) within an object. In this example we access the second position of the object “x”. Note that R is a 1-indexed language: the first element is at index 1, not index 0 as in 0-indexed languages such as C.
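
Accessing multiple positions works by placing a vector of indices inside the brackets. A short sketch using the same vector:

```r
x <- 5:10

x[1]          # first element (R starts counting at 1)
x[c(1, 3)]    # first and third elements
x[2:4]        # second through fourth elements
x[-1]         # a negative index drops that position (everything except the first)
x[length(x)]  # last element
```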

Some objects, such as data frames, have multiple dimensions, requiring indices in each dimension (separated by commas) to describe a single element. A few examples are shown below:

DF <- data.frame(x = x, y = y, z = z) 
DF[2,3] # The element in row 2, column 3
## [1] "B"
DF[2,] # Everything in row 2
##   x y z
## 2 6 2 B

For list objects, double square brackets, [[, are used to access positions within the list:

my_list[[2]] ## The 2nd component of the list
##   A B C
## 1 1 1 A
## 2 2 2 B
## 3 3 3 C
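
Single square brackets also work on a list, but they return something different. The sketch below uses a small made-up list (not my_list from above) to compare the two; the names wrapped and inner are only for illustration.

```r
small_list <- list(1:3, data.frame(A = 1:2), "ABC")

wrapped <- small_list[2]    # a list of length 1 that still holds the data frame
inner   <- small_list[[2]]  # the data frame itself

class(wrapped)
class(inner)
```

In most cases you want the component itself, which is why double square brackets are the usual choice for lists.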

Question 4: Use indices to print the Happiness score (column #3) of Hong Kong (row #57) in the object my_data (the Happy Planet data from Question 2A). Be sure your code does not print any other information about this observation.


Working with Data

Suppose we want to access a single variable from a data set. There are a few different ways we can do so:

# The $ accesses the component named 'Country' within 'my_data'
countries <- my_data$Country  

# Position indexing to access the variable 'Country' (since it's the first column)
countries2 <- my_data[,1] 

# Use the name of the variable in place of an index position
countries3 <- my_data[,'Country']

Suppose we want to access a single observation (data point) in our dataset:

Albania <- my_data[1,] # This stores the entire first row

Suppose we want a range of observations:

FirstFive <- my_data[1:5,] # This stores the first five rows
head(FirstFive)
##     Country Region Happiness LifeExpectancy Footprint  HLY   HPI HPIRank
## 1   Albania      7       5.5           76.2       2.2 41.7 47.91      54
## 2   Algeria      3       5.6           71.7       1.7 40.1 51.23      40
## 3    Angola      4       4.3           41.7       0.9 17.8 26.78     130
## 4 Argentina      1       7.1           74.8       2.5 53.4 58.95      15
## 5   Armenia      7       5.0           71.7       1.4 36.1 48.28      48
##   GDPperCapita   HDI Population
## 1         5316 0.801       3.15
## 2         7062 0.733      32.85
## 3         2335 0.446      16.10
## 4        14280 0.869      38.75
## 5         4945 0.775       3.02

The head function prints the first few rows of an object (six by default). Here are a few other functions you might use when working with a new data set:

dim(my_data) # prints the dimensions of 'my_data'
## [1] 143  11
nrow(my_data) # prints the number of rows of 'my_data'
## [1] 143
ncol(my_data) # prints the number of columns of 'my_data'
## [1] 11
colnames(my_data) # prints the names of the variables (columns) of 'my_data'
##  [1] "Country"        "Region"         "Happiness"      "LifeExpectancy"
##  [5] "Footprint"      "HLY"            "HPI"            "HPIRank"       
##  [9] "GDPperCapita"   "HDI"            "Population"

Question 5A: Write code that prints the populations of the last three observations (countries) that appear in the Happy Planet data.

Question 5B: Write code that finds the median value of the “LifeExpectancy” variable for the last 10 observations in the Happy Planet data.


Logical Conditions and Subsetting

Often we want to access all data that meet certain criteria. For example, we may want to analyze all countries with a life expectancy above 80. To accomplish this, we’ll need to use logical operators:

## This returns a logical vector using the condition "> 80"
my_data$LifeExpectancy > 80

A few logical operators you should know of are:

Operator | Description
== | equal to
!= | not equal to
> | greater than
>= | greater than or equal to
< | less than
<= | less than or equal to
& | and
| | or
! | negation (“not”)

The which() function can be used to identify the indices of the elements within an object that contain the logical value TRUE, for example:

## This returns the positions where the condition evaluated to TRUE
which(my_data$LifeExpectancy > 80 )

This result could then be used as indices to subset my_data:

## sub-setting via indices
keep_idx <- which(my_data$LifeExpectancy > 80)
my_subset <- my_data[keep_idx, ]

The approach shown above is a bit cumbersome. As an alternative we can use the subset() function alongside logical expressions:

## Example #1
Ex1 <- subset(my_data, LifeExpectancy > 80)

In example #1, the data frame Ex1 will contain the subset of countries with life expectancy above 80. Notice how the subset() function knows that LifeExpectancy is a component of my_data.

## Example #2
Ex2 <- subset(my_data, LifeExpectancy <= 70 & Happiness > 6)

In example #2, the & operator is used to create a data frame, Ex2, containing all countries with a life expectancy of 70 or below and a happiness score above 6.

## Example #3
Ex3 <- subset(my_data, LifeExpectancy <= 70 | Happiness > 6)

In example #3, the | operator is used to create a data frame of all countries with a life expectancy of 70 or below or a happiness score above 6. Notice the different dimensions of Ex2 and Ex3:

dim(Ex2)
## [1]  9 11
dim(Ex3)
## [1] 118  11
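
A logical vector can also be placed directly inside the square brackets, without which(). The three approaches agree when there are no missing values, but they differ when the condition evaluates to NA. The sketch below uses a toy data frame with made-up values (not the Happy Planet data) to show the difference.

```r
# Toy data (hypothetical values) with one missing life expectancy
toy <- data.frame(Country = c("A", "B", "C", "D"),
                  LifeExpectancy = c(82, 75, NA, 81))

keep <- toy$LifeExpectancy > 80   # the comparison with NA gives NA

by_logical <- toy[keep, ]                       # the NA produces an extra all-NA row
by_which   <- toy[which(keep), ]                # which() keeps only the TRUE positions
by_subset  <- subset(toy, LifeExpectancy > 80)  # subset() also drops the NA

nrow(by_logical)
nrow(by_which)
nrow(by_subset)
```

This is one reason to prefer which() or subset() over a bare logical index when a variable might contain missing values.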

Question 6: Create a data frame named “Q6” that contains all countries with a population over 100 million that also have a happiness score of 6 or lower. Then, print the number of rows of this data frame.


Data Summaries

Descriptive summaries are an essential component of any data analysis. A few functions used to calculate several basic numerical summaries are shown below:

mean(my_data$LifeExpectancy) # mean
## [1] 67.83846
sd(my_data$LifeExpectancy) # standard deviation
## [1] 11.04193
min(my_data$LifeExpectancy) # minimum
## [1] 40.5
max(my_data$LifeExpectancy ) # maximum
## [1] 82.3
quantile(my_data$LifeExpectancy, probs = .35) # the 35th percentile
##   35% 
## 66.18

Each of these functions operates on a single variable. For a broader set of summary statistics, you can input an entire data frame into the summary() function:

summary(my_data)
##    Country              Region        Happiness     LifeExpectancy 
##  Length:143         Min.   :1.000   Min.   :2.400   Min.   :40.50  
##  Class :character   1st Qu.:2.000   1st Qu.:5.000   1st Qu.:61.90  
##  Mode  :character   Median :4.000   Median :5.900   Median :71.50  
##                     Mean   :3.832   Mean   :5.919   Mean   :67.84  
##                     3rd Qu.:6.000   3rd Qu.:7.000   3rd Qu.:76.05  
##                     Max.   :7.000   Max.   :8.500   Max.   :82.30  
##                                                                    
##    Footprint           HLY             HPI           HPIRank     
##  Min.   : 0.500   Min.   :11.60   Min.   :16.59   Min.   :  1.0  
##  1st Qu.: 1.300   1st Qu.:31.10   1st Qu.:34.47   1st Qu.: 36.5  
##  Median : 2.200   Median :41.80   Median :43.60   Median : 72.0  
##  Mean   : 2.877   Mean   :41.38   Mean   :43.38   Mean   : 72.0  
##  3rd Qu.: 3.850   3rd Qu.:53.20   3rd Qu.:52.20   3rd Qu.:107.5  
##  Max.   :10.200   Max.   :66.70   Max.   :76.12   Max.   :143.0  
##                                                                  
##   GDPperCapita        HDI           Population      
##  Min.   :  667   Min.   :0.3360   Min.   :   0.290  
##  1st Qu.: 2107   1st Qu.:0.5790   1st Qu.:   4.455  
##  Median : 6632   Median :0.7720   Median :  10.480  
##  Mean   :11275   Mean   :0.7291   Mean   :  44.145  
##  3rd Qu.:15711   3rd Qu.:0.8680   3rd Qu.:  31.225  
##  Max.   :60228   Max.   :0.9680   Max.   :1304.500  
##  NA's   :2       NA's   :2

Notice how summary() is not particularly useful for categorical variables. For these variables you should be using frequency tables.

A one-way frequency table shows the frequencies of categories in a single categorical variable, while a two-way frequency table shows the relationship between two categorical variables. Both are created by the table() function:

table(my_data$Region) # A one-way frequency table of 'region'
## 
##  1  2  3  4  5  6  7 
## 24 24 16 33  7 12 27
table(my_data$Region, my_data$LifeExpectancy > 80) # A two-way frequency table showing the number of countries w/ LifeExpectancy > 80 by region
##    
##     FALSE TRUE
##   1    24    0
##   2    16    8
##   3    15    1
##   4    33    0
##   5     7    0
##   6    10    2
##   7    27    0
# Notice how the table function can use numeric, logical, and character variables

Tables are their own type of object, and they can be used as an input to functions like barplot():

my_table <- table(my_data$Region) # Tables can be stored as objects
barplot(my_table) # Creates a bar plot from a table

They can also be used as an input to the prop.table() function to find row or column proportions:

prop.table(my_table, margin = 1) # "margin = 1" gives row props, "margin = 2" gives column props 
## 
## 1 2 3 4 5 6 7 
## 1 1 1 1 1 1 1

In the example above, the table only had a single dimension (so each row total was the same as the frequency). Shown below is a more typical example:

my_table <- table(my_data$Region, my_data$LifeExpectancy > 80)
prop.table(my_table, margin = 1)
##    
##         FALSE      TRUE
##   1 1.0000000 0.0000000
##   2 0.6666667 0.3333333
##   3 0.9375000 0.0625000
##   4 1.0000000 0.0000000
##   5 1.0000000 0.0000000
##   6 0.8333333 0.1666667
##   7 1.0000000 0.0000000

Notice how this example used a logical condition to construct a binary variable to serve as the columns in the table.
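
To compare the choices for the margin argument side by side, here is a sketch on a tiny two-way table. The counts are made up, and the dimension names group and flag are only for illustration.

```r
# Toy two-way table (hypothetical data)
tab <- table(group = c("a", "a", "a", "b"),
             flag  = c(TRUE, FALSE, FALSE, TRUE))

prop.table(tab)              # no margin: each cell divided by the grand total
prop.table(tab, margin = 1)  # row proportions: each row sums to 1
prop.table(tab, margin = 2)  # column proportions: each column sums to 1
```

Which margin to use depends on the question: row proportions describe the columns within each row category, and column proportions do the reverse.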

Question 7: Find the mean, median, and range (maximum - minimum) of the variable LifeExpectancy in the Happy Planet data. Briefly comment on whether the distribution of this variable seems to be symmetric or skewed using plain text beneath your answer’s code chunk.


Coercion

Earlier we introduced three important types of vectors:

  1. numeric vectors - for example: x = c(1,2,3)
  2. character vectors - for example: x = c("A","B","C")
  3. logical vectors - for example: x = c(TRUE, FALSE, TRUE)

Many functions require their inputs be of a certain type. Fortunately, data can be coerced into another type using the as. family of functions:

## A character vector where the text strings are numbers
x <- c("1","12","123")
typeof(x)
## [1] "character"
## Coerce 'x' to a numeric vector
x <- as.numeric(x)
x
## [1]   1  12 123
typeof(x)
## [1] "double"

Question 8: Coerce the variable “Region” into a character variable. Use the typeof function to verify the change. Hint: you should overwrite the “Region” vector within “my_data” as part of this question.


Missing Data

Real data sometimes contain missing values, which R stores as the special element NA. Missing values may be present in your raw data, but they can also be introduced by coercion or other operations/functions:

## The second element is a blank space
x <- c("1"," ","123")
typeof(x)
## [1] "character"
## Coerce to a numeric vector (stored as 'y'), notice the NA
y <- as.numeric(x)
y
## [1]   1  NA 123

Missing values can cause problems for many functions, but some functions have arguments that control how missing values are handled. The example below shows how to remove any missing values when calculating the mean of y:

mean(y) ## Doesn't handle the missing value
## [1] NA
mean(y, na.rm = TRUE) ## Removes the missing value
## [1] 62

If missing values are removed in any part of an analysis, you should track and report the identities of the cases that were excluded. You can use the is.na() function to help locate these cases.

is.na(y) ## Returns TRUE if the value is missing
## [1] FALSE  TRUE FALSE
which(is.na(y))  ## Uses the which function to return the positions where is.na() returns "TRUE"
## [1] 2

Another useful function is na.omit(), which will subset a data frame to remove any rows that contain missing data in any variable. This function is demonstrated on the Happy Planet data below:

## Store the subset without missing data
my_data_without_na <- na.omit(my_data)

## Compare dimensions
dim(my_data)
## [1] 143  11
dim(my_data_without_na)
## [1] 141  11
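
Since na.omit() removes rows silently, it helps to record which rows were dropped. One possible approach, sketched on a toy data frame with made-up values, uses the base R function complete.cases(), which returns TRUE for each row that has no missing values.

```r
# Toy data (hypothetical values) with missing entries in two rows
toy <- data.frame(Country = c("A", "B", "C", "D"),
                  GDP = c(5000, NA, 7000, 2000),
                  HDI = c(0.80, 0.73, NA, 0.45))

complete <- complete.cases(toy)   # TRUE where a row has no NA in any column

dropped <- toy$Country[!complete] # identities of the rows na.omit() would remove
print(dropped)

cleaned <- na.omit(toy)
nrow(toy) - nrow(cleaned)         # number of rows removed
```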

Question 9: Find the median value of the variable “GDPperCapita” in the Happy Planet data, removing any missing values in this variable if necessary. Report the country names corresponding to any missing values that you removed (if applicable).


Factor Variables

Many functions will coerce character variables into factors.

On the surface you might not notice any difference, but internally a factor relies upon a set of categorical labels known as levels. By default, these labels are ordered alphabetically, but in some circumstances you’ll want to organize them yourself.

## A vector containing different months
mons <- c("March","April","January","November","January", "September","October","September","November","August","January","November",
          "November","February","May","August",   "July","December","August","August","September","November", "February","April")

## Convert it to a factor
mons_unordered = factor(mons)

## Notice the factor defaults to alphabetical order
barplot(table(mons_unordered))

## Convert to a factor with ordering specified by the "levels" argument
mons_ordered = factor(mons_unordered, levels= c("January","February","March","April","May","June",
                                                "July","August","September","October","November","December"), 
                        ordered = TRUE)

## Notice the new ordering (useful for data visualization!)
barplot(table(mons_ordered))
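
The levels of a factor can be inspected directly with levels(). The sketch below uses a small made-up vector (sizes) rather than the months data, and it also shows what happens to a value that is not listed in the levels argument.

```r
sizes <- c("medium", "small", "large", "small")

default_order <- factor(sizes)
custom_order  <- factor(sizes, levels = c("small", "medium", "large"))

levels(default_order)  # alphabetical by default
levels(custom_order)   # the order given in 'levels'
table(custom_order)    # counts follow the level order

# A value that is not among the levels becomes NA
with_typo <- factor(c("small", "tiny"), levels = c("small", "medium", "large"))
is.na(with_typo)
```

Because unlisted values become NA, it is worth checking for typos in the levels argument whenever a factor unexpectedly contains missing values.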

Question 10:

The code below loads the “colleges” data set. Recall that this data set contains information pertaining to all primarily undergraduate institutions with at least 400 full-time students in the 2019-20 academic year.

colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")

  • Part A: Create a subset of these data that contains all schools that admit fewer than 50% of applicants (an “Adm_Rate” less than 50%) and are located in the “Great Lakes” region. You should use this subset in Parts B and C.

  • Part B: Using the subset created in Part A, find the average value of “Salary10yr_median”, the median salary of a school’s alumni 10 years after their graduation. Remove missing data if necessary, but be sure to report the identity of any colleges that were removed.

  • Part C: The “Great Lakes” region consists of 5 different states: IL, IN, MI, OH, and WI. Using the subset created in Part A, create a bar plot that displays the number of colleges (meeting the criteria specified in Part A) in each of these states in descending order (you may examine the frequency table to determine this ordering “by hand”).


Loading Multiple Files

Question 11A: Download the data from this folder of data files to your Lab Data folder. Run the following line (rewrite it to match your username) to verify the .xlsx files are there. You may have to extract the files from a .zip file and rename the folder from “Experiments” to “Data”.

list.files(path = "C:/Users/friedrichsen/OneDrive - Grinnell College/Documents/STA230_F25/Labs/Data")
## [1] "run18_treatment.xlsx" "run21_control.xlsx"   "run34_control.xlsx"  
## [4] "run35_treatment.xlsx"

Now let’s suppose we want to find the means of the variable “VDS.Veh.Speed” for each participant file. Note that these are Excel files (not .csv files), so we must first load (and possibly install) the readxl package in order to read them into R.

We then can iterate through these files using a for loop, storing the mean of each participant:

library(readxl)
my_dir = "C:/Users/friedrichsen/OneDrive - Grinnell College/Documents/STA230_F25/Labs/Data"
my_files <- list.files(path = my_dir)  ## List of file names in your directory

means <- numeric(length(my_files))                           ## Set up storage object
for(i in seq_along(my_files)){                               ## Loop over each file (seq_along() is safe even if no files are found)
  temp <- read_excel(paste0(my_dir, "/", my_files[i]))       ## Read by appending file name to the path prefix using paste0()
  means[i] <- mean(temp$VDS.Veh.Speed)                       ## Store the mean of the current file
}
print(means)
## [1] 24.91347 35.62761 56.93149 26.81412
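
If you want to practice this loop pattern without the participant files, the sketch below creates its own small files first. Everything in it is made up for illustration: the folder name loop_demo, the file names, and the speed column. It uses .csv files and read.csv() so that no extra package is needed.

```r
# Write three small CSV files to a temporary folder (made-up data)
my_dir <- file.path(tempdir(), "loop_demo")
dir.create(my_dir, showWarnings = FALSE)
for (k in 1:3) {
  write.csv(data.frame(speed = c(10, 20, 30) * k),
            file.path(my_dir, paste0("run", k, ".csv")),
            row.names = FALSE)
}

# pattern keeps only .csv files; full.names = TRUE returns complete paths
my_files <- list.files(path = my_dir, pattern = "\\.csv$", full.names = TRUE)

means <- numeric(length(my_files))      # set up storage object
for (i in seq_along(my_files)) {        # loop over each file
  temp <- read.csv(my_files[i])         # read the current file
  means[i] <- mean(temp$speed)          # store the mean of the current file
}
names(means) <- basename(my_files)      # label each mean with its file name
print(means)
```

Using full.names = TRUE is an alternative to pasting the folder path onto each file name, and the pattern argument guards against other file types that happen to be in the same folder.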

Question 11B: Using the example above as a template, find the standard deviations (using the sd() function) of each participant file. Store these standard deviations in an object named “sds” and print them as part of your answer.


Row Binding

Sometimes it can be useful to aggregate several data frames with the same structure into one larger data frame. For example, we might want to combine all four participant files from Question #11 into a combined data frame. Or, perhaps we want to aggregate several years of data from the same source. These tasks can be handled by the rbind() function, which will append the rows of one or more data frames to an initial data frame (provided the column names match):

df1 <- data.frame(Year = 2019, val = rnorm(3))
df2 <- data.frame(Year = 2020, val = rnorm(3, mean = 10))

rbind(df1, df2)
##   Year        val
## 1 2019 -0.9926343
## 2 2019  0.6679190
## 3 2019  0.7363896
## 4 2020  9.7745287
## 5 2020  9.3232174
## 6 2020  8.8489899

Note that this result could also be achieved using full_join() from the dplyr package (though I personally find that approach less intuitive). However, rbind() has the advantage of easily being able to bind an arbitrary number of data frames in a single command:

library(dplyr) ## full_join() comes from the dplyr package
full_join(x = df1, y = df2)
df3 <- data.frame(Year = 2021, val = rnorm(3, mean = -5)) ## How about a third year?
rbind(df1, df2, df3)
##   Year        val
## 1 2019 -0.9926343
## 2 2019  0.6679190
## 3 2019  0.7363896
## 4 2020  9.7745287
## 5 2020  9.3232174
## 6 2020  8.8489899
## 7 2021 -4.9116580
## 8 2021 -4.5849365
## 9 2021 -6.2910218
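
When the data frames are created inside a loop, typing each one into rbind() by hand is not practical. One common approach, sketched here with made-up values, is to store the pieces in a list and then hand the whole list to rbind() using the base R function do.call(). The names years, pieces, and combined are only for illustration.

```r
# Build one small data frame per year inside a loop (made-up values)
years <- 2019:2021
pieces <- vector("list", length(years))   # empty list for storage
for (i in seq_along(years)) {
  pieces[[i]] <- data.frame(Year = years[i], val = c(1, 2, 3) * i)
}

# do.call() passes every element of 'pieces' to rbind() as a separate argument
combined <- do.call(rbind, pieces)
dim(combined)
```

The same idea could be used to combine files read in a loop, provided every file has the same column names.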

Question 12: In the colleges dataset, find the rows for Iowa State University and Grinnell College. Make two data frames, one containing each row. Use rbind() to combine these into a new data frame.


Practice (Required!)

Question 13:

The College Scorecard is a government database that records various characteristics of accredited colleges and universities within the United States. A portion of this database containing 2019-2020 data on colleges that primarily award undergraduate degrees and had at least 400 full-time students is available at the URL below:

https://remiller1450.github.io/data/Colleges2019.csv

Part A: Load these data into R and store them as a data.frame object named colleges.

Part B: Create a subset of colleges that admit fewer than 25% of applicants (as measured by “Adm_Rate”). Store this subset in an object named colleges_selective.

Part C: Using the subset you created in Part B, colleges_selective, construct a table containing the proportion of private colleges within each region.

Part D: Provide a 1-sentence interpretation of one of the proportions in your table (you may choose which one seems most interesting).

Question 14:

The data available at the URL below contains information from the Ames Assessor’s Office used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010:

https://remiller1450.github.io/data/AmesHousing.csv

A more detailed description can be found at this link.

Part A: Read these data into R and store in a data frame named ames_housing. Then check the type of the variable MS.SubClass and compare it with the description of this variable given in the link above. Based upon your assessment, should this variable be coerced to a different type? Briefly explain.

Part B: Find the number of homes in this data set with missing values for the variable Garage.Type. Hint: you can use the sum() function on a logical vector to count the number of TRUE values.

Part C: Create a subset containing homes with a missing value for the variable Garage.Type. What is the average Garage.Area of these homes?

Part D: Using the variable Exter.Cond (exterior condition), create an ordered factor that goes from “Poor” condition (a value of Po) to “Excellent” condition (a value of Ex) following the order and definitions given in the detailed description for this variable. Use the barplot() and table() functions to construct a bar chart using your ordered factor variable.