This lab focuses on manipulating, cleaning, and preparing data for visualization (or other analyses) using packages from the tidyverse suite. In the first part of the lab we will use the tidyr package. In the second part of the lab, we will work with the dplyr package, which is used for “data wrangling”, or the process of cleaning, restructuring, and enriching a data set to make it more usable.

Directions for all labs (read before starting)

  1. Please work together with your assigned partner. Make sure you both fully understand something before moving on.
  2. Record your answers to lab questions separately from the lab’s examples. You and your partner should only turn in responses to lab questions, nothing more and nothing less.
  3. Ask for help, clarification, or even just a check-in if anything seems unclear.

The “Lab” section is something you will work on with a partner using paired programming, a framework in which one partner (the “driver”) writes the code while the other (the “navigator”) reviews each line and suggests what to try next.

Partners are encouraged to switch roles throughout the “Lab” section.

\(~\)

Preamble

Packages

#install.packages("tidyverse") # we will be using these packages throughout 
# the semester, so it is best to download them all now.

Motivation

Shown below are the acceptance rates over time of three different liberal arts colleges from 2018 to 2020:

As a data scientist, you must be able to connect the format of your data with the visualization you want to create.

  1. If creating an Excel spreadsheet from this graphic, how do you think most people would record the data?
  2. How should the data be formatted to match the conventions of ggplot?

\(~\)

Long vs. Wide Data

Data in a wide format record many different values or variables for a single entity in a single row (ie: acceptance rates for a college in different years, or different test scores for the same employee).

Data in a long format use multiple rows and a single column for the outcome or value of interest (ie: acceptance rate, test score, etc.) with additional columns identifying the meaning of that value.

Example:
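
As a minimal sketch, here is a small set of hypothetical employee test scores (the names and values are made up for illustration) stored first in wide format and then in long format:

## Hypothetical test scores in wide format (one row per employee)
wide_scores <- data.frame(Employee = c("A", "B", "C"),
                          PreTest  = c(70, 65, 80),
                          PostTest = c(85, 72, 91))

## The same scores in long format (one row per employee-test combination)
long_scores <- data.frame(Employee = rep(c("A", "B", "C"), times = 2),
                          Test     = rep(c("PreTest", "PostTest"), each = 3),
                          Score    = c(70, 65, 80, 85, 72, 91))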

The ggplot2 package, as well as the implementations of many statistical models, expect data in long format. However, many data manipulations are easier to do in wide format; for example, calculating the average improvement from pre-test to post-test.
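
Using the hypothetical wide-format scores sketched above, that pre-test to post-test calculation takes a single line because each test already occupies its own column:

## Average improvement from pre-test to post-test (easy in wide format)
mean(wide_scores$PostTest - wide_scores$PreTest)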

\(~\)

Tidy Data

Converting between “wide” and “long” formats is often the most challenging step in creating a “tidy” data set, or one that is fully prepared for graphing/modeling.

In general, tidy data are defined by the following criteria:

  1. Every column is a variable
  2. Every row is an observation
  3. Every cell is a single value

This lab will introduce several data manipulation functions used to help tidy a data set into a more useful format.

\(~\)

Packages and Datasets

# Please install and load the following packages
# install.packages("tidyr")
library(tidyr)
library(ggplot2)
library(dplyr)

The lab will use several data sets in its examples:

collegeAdm = read.csv("https://remiller1450.github.io/data/college_adm.csv")
  • Description: Admissions rates of three Midwestern liberal arts colleges according to acceptancerate.com
bluechips = read.csv("https://remiller1450.github.io/data/bluechips.csv")
  • Description: Closing prices on the first trading day of the year from 2010 to 2021 for four stocks that The Motley Fool calls “blue chip” investments.
polls <- read.csv("https://remiller1450.github.io/data/polls2016.csv")
  • Description: Polling data leading up to the 2016 US Presidential Elections scraped from RealClearPolitics.com
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
  • Description: Data from the 2019 college scorecard

\(~\)

Workflow and Piping

As a data scientist, you should strive to write code that is:

  1. Legible - a peer could easily determine what each line is doing
  2. Efficient - it avoids redundant, unnecessary, or computationally burdensome steps
  3. Documented - comments and formatting are used to clearly explain every important step

Below is a “bad” example that begins with the bluechips data and creates a new data frame containing the average price (of the four stocks) for the years 2013, 2017, and 2021.

## Bad example
temp1 = subset(bluechips, Year %in% c(2013, 2017, 2021))
temp2 = data.frame(Year = temp1$Year, Avg = (temp1$AAPL + temp1$KO + temp1$JNJ + temp1$AXP)/4)
temp2
##   Year       Avg
## 1 2013  46.69955
## 2 2017  65.50687
## 3 2021 114.17750
  • Line 1 creates a new data frame that is only used in the next step (efficiency issue)
  • Line 2 is doing multiple things at once without documentation (legibility and documentation issues)

We can greatly streamline our workflow in this example using a method known as piping:

## Good example
bluechips %>% 
  filter(Year %in% c(2013, 2017, 2021)) %>%    # Subset to include the target years
  mutate(Avg = (AAPL + KO + JNJ + AXP)/4) %>%  # Calculate average
  select(Year, Avg)                            # Drop everything but year and average
##   Year       Avg
## 1 2013  46.69955
## 2 2017  65.50687
## 3 2021 114.17750

The %>% symbol will “pipe” the object on its left into the function on its right, supplying it as that function’s first argument (usually the “data” argument).
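
Before walking through the example line by line, here is that behavior in isolation; the two calls below are equivalent (a quick sketch using the bluechips data loaded in the preamble):

## The pipe supplies bluechips as the first argument of head()
head(bluechips, n = 3)
bluechips %>% head(n = 3)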

Below is a description of each line within the piping example given above:

  1. The data frame bluechips is piped forward (into the filter() function on the next line)
  2. filter() subsets the data it receives to include only the target years and the resulting subset is piped forward (into the mutate() function on the next line)
  3. mutate() adds a new column called “Avg” to the data frame it received, and the resulting data frame is piped forward (into the select() function on the next line)
  4. The select() function drops all variables other than “Year” and “Avg”

Because the output of this pipeline is not stored as an object, the final data frame is simply printed. If we planned on using the data frame prepared by this pipeline in a future data visualization or model, we’d want to store it as its own object:

## Storing the manipulated data set
new_bluechips <- bluechips %>% 
                   filter(Year %in% c(2013, 2017, 2021)) %>%    # Include only target years
                   mutate(Avg = (AAPL + KO + JNJ + AXP)/4) %>%  # Calculate average
                   select(Year, Avg)                            # Drop extra vars

Note: all functions in the tidyverse suite of packages are compatible with the %>% operator, so we can include pivoting steps in a pipeline.
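
For instance, the pivot_wider() function introduced in the Lab section below drops into a pipeline just like the dplyr verbs above (a brief sketch using the collegeAdm data from the preamble):

## A tidyr pivoting step inside a pipeline
collegeAdm %>% 
  filter(College != "Oberlin") %>%      # keep two of the three colleges
  pivot_wider(id_cols = Year,           # one row per year
              names_from = College, 
              values_from = Adm_Rate)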

\(~\)

Lab

At this point you will begin working with your partner. Please read through the text/examples and make sure you both understand before attempting to answer the embedded questions.

\(~\)

Pivoting between long and wide formats

Consider the collegeAdm data frame:

head(collegeAdm)
##   Adm_Rate Year  College
## 1     28.9 2018 Grinnell
## 2     24.4 2019 Grinnell
## 3     23.1 2020 Grinnell
## 4     21.2 2018  Carlton
## 5     19.8 2019  Carlton
## 6     19.1 2020  Carlton

These data are currently in “long” format, but we could convert them to a “wide” format using the pivot_wider() function:

## Pivot from long to wide to get 1 row per Year
wideCollegeAdm <- pivot_wider(collegeAdm, 
                              id_cols = Year,
                              names_from = College, 
                              values_from = Adm_Rate)
head(wideCollegeAdm)
## # A tibble: 3 × 4
##    Year Grinnell Carlton Oberlin
##   <int>    <dbl>   <dbl>   <dbl>
## 1  2018     28.9    21.2    33.7
## 2  2019     24.4    19.8    36.2
## 3  2020     23.1    19.1    36.4

The following arguments guide this transformation:

  • id_cols determines what will be given its own row in the “wide” data set (ie: each row will be a unique value of the variable “Year”)
  • names_from defines the single column from the “long” data that should be spread into multiple distinct columns in the “wide” data (ie: each value of “College” is given a column named after it)
  • values_from defines the single column from the “long” data containing the values used to populate the cells of the “wide” data (ie: the columns created for each “College” will contain the values of “Adm_Rate”)
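
Note that id_cols is optional; when it is omitted, pivot_wider() treats every column not named in names_from or values_from as an identifier. In this data set that leaves only “Year”, so the sketch below should produce the same result as the call above:

## Equivalent call relying on the default id_cols behavior
pivot_wider(collegeAdm, 
            names_from = College, 
            values_from = Adm_Rate)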

Notice what happens when id_cols and names_from are swapped:

## Pivot from long to wide to get 1 row per College
wideCollegeAdm2 <- pivot_wider(collegeAdm, 
                              id_cols = College,
                              names_from = Year, 
                              values_from = Adm_Rate)
head(wideCollegeAdm2)
## # A tibble: 3 × 4
##   College  `2018` `2019` `2020`
##   <chr>     <dbl>  <dbl>  <dbl>
## 1 Grinnell   28.9   24.4   23.1
## 2 Carlton    21.2   19.8   19.1
## 3 Oberlin    33.7   36.2   36.4

Similarly, the pivot_longer() function will transform “wide” data into “long” data:

pivot_longer(wideCollegeAdm2, 
             cols = !College, 
             names_to = "Year", 
             values_to = "Adm_Rate")
## # A tibble: 9 × 3
##   College  Year  Adm_Rate
##   <chr>    <chr>    <dbl>
## 1 Grinnell 2018      28.9
## 2 Grinnell 2019      24.4
## 3 Grinnell 2020      23.1
## 4 Carlton  2018      21.2
## 5 Carlton  2019      19.8
## 6 Carlton  2020      19.1
## 7 Oberlin  2018      33.7
## 8 Oberlin  2019      36.2
## 9 Oberlin  2020      36.4
  • cols defines the column(s) used in the pivot (!College will include everything but the variable “College”). The values of these variables will be collapsed into a single column.
  • names_to is the name of the single column in the “long” data frame that will store the column names of the “wide” data frame
  • values_to is the name of the single column in the “long” data frame that will store the values from the cells of the “wide” data frame

Note we could interchangeably use the argument cols = c("2018", "2019", "2020") or cols = 2:4 (instead of cols = !College) to achieve the exact same result. The former approach explicitly names the columns that should be pivoted, and the latter gives their index positions.
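
For example, each of the three calls sketched below should return the same long data frame shown above:

## Three equivalent ways of specifying the columns to pivot
pivot_longer(wideCollegeAdm2, cols = !College,
             names_to = "Year", values_to = "Adm_Rate")
pivot_longer(wideCollegeAdm2, cols = c("2018", "2019", "2020"),
             names_to = "Year", values_to = "Adm_Rate")
pivot_longer(wideCollegeAdm2, cols = 2:4,
             names_to = "Year", values_to = "Adm_Rate")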

Question #1: Convert the bluechips data to a long format where each stock’s closing price on the first trading day of each year is recorded in a single column named “Price”.

Question #2: Starting with the long format data frame you created in Question #1, recreate the original bluechips data set using pivot_wider().

\(~\)

Other tidyr functions

Pivoting or reshaping is often only one of many steps needed to tidy a data set. Another common occurrence is that data will contain multiple variables in a single column. For example, consider the “Date” and “Sample” columns in the polls data set:

head(polls)
##                     Poll        Date  Sample MoE Clinton..D. Trump..R.
## 1               Monmouth 7/14 - 7/16  688 LV 3.7          45        43
## 2                CNN/ORC 7/13 - 7/16  872 RV 3.5          42        37
## 3     ABC News/Wash Post 7/11 - 7/14  816 RV 4.0          42        38
## 4 NBC News/Wall St. Jrnl  7/9 - 7/13 1000 RV 3.1          41        35
## 5       Economist/YouGov  7/9 - 7/11  932 RV 4.5          40        37
## 6   Associated Press-GfK  7/7 - 7/11  837 RV  NA          40        36
##   Johnson..L. Stein..G.
## 1           5         1
## 2          13         5
## 3           8         5
## 4          11         6
## 5           5         2
## 6           6         2

The column “Date” contains two distinct variables, the start and end of the poll’s sampling period. Similarly, “Sample” also contains two variables, the number of participants in the poll and the population that was sampled (registered voters or likely voters).

The separate() function is used to split a column into multiple new columns using a defined separator:

## Example #1
tidy_polls <- separate(polls, 
                       col = Date, 
                       into = c("Begin", "End"),
                       sep = " - ")
head(tidy_polls)
##                     Poll Begin  End  Sample MoE Clinton..D. Trump..R.
## 1               Monmouth  7/14 7/16  688 LV 3.7          45        43
## 2                CNN/ORC  7/13 7/16  872 RV 3.5          42        37
## 3     ABC News/Wash Post  7/11 7/14  816 RV 4.0          42        38
## 4 NBC News/Wall St. Jrnl   7/9 7/13 1000 RV 3.1          41        35
## 5       Economist/YouGov   7/9 7/11  932 RV 4.5          40        37
## 6   Associated Press-GfK   7/7 7/11  837 RV  NA          40        36
##   Johnson..L. Stein..G.
## 1           5         1
## 2          13         5
## 3           8         5
## 4          11         6
## 5           5         2
## 6           6         2
  • col is the single column to be separated
  • into indicates the names of the new columns produced by the separation
  • sep is the character string used to determine where to split. In this example, the split happens at each occurrence of a - with a space on each side.

In Example #2 (shown below), the “sep” argument is not explicitly given. In this situation, the default behavior of separate() is to split at any sequence of non-alphanumeric characters, which effectively amounts to guessing an appropriate separator.

## Example #2
tidy_polls <- separate(polls, 
                       col = Sample, 
                       into = c("Size", "Population"))
head(tidy_polls)
##                     Poll        Date Size Population MoE Clinton..D. Trump..R.
## 1               Monmouth 7/14 - 7/16  688         LV 3.7          45        43
## 2                CNN/ORC 7/13 - 7/16  872         RV 3.5          42        37
## 3     ABC News/Wash Post 7/11 - 7/14  816         RV 4.0          42        38
## 4 NBC News/Wall St. Jrnl  7/9 - 7/13 1000         RV 3.1          41        35
## 5       Economist/YouGov  7/9 - 7/11  932         RV 4.5          40        37
## 6   Associated Press-GfK  7/7 - 7/11  837         RV  NA          40        36
##   Johnson..L. Stein..G.
## 1           5         1
## 2          13         5
## 3           8         5
## 4          11         6
## 5           5         2
## 6           6         2

While this is not generally recommended, it can work well if there’s a clear pattern in your variable. More complex strings might require the use of regular expressions, a topic we’ll cover later this semester.

Question #3 (Part A): Using either the pivot_longer() or pivot_wider() function, create a version of the “tidy_polls” data containing the variables “Candidate” and “Percentage”, where “Candidate” is taken from the names of the last four columns of the data frame, and “Percentage” is taken from the values contained in these columns.

Question #3 (Part B): Using the separate() function, split the column “Candidate” (created in Part A) into two distinct columns containing the name of the candidate (ie: Clinton, Trump, etc.) and their political party (ie: D, R, etc.). Hint: the period, or ., is a special character in regular expressions, but you can match a literal period using the expression: [.]. You can also try letting separate() guess the proper splitting characters.

\(~\)

Overview of Data Manipulation

The dplyr package contains a suite of functions designed to make data manipulation easier. The package’s core functions can be viewed as verbs:

  • filter - pick specific observations (i.e. specific rows), similar to subset()
  • arrange - reorder the rows
  • select - pick variables by their names (i.e. specific columns)
  • mutate - add new derived columns to a data frame
  • summarize - aggregate many rows into a summary measure

Importantly, these functions can be strung together using piping. But first, let’s see a few examples of how they work individually.

\(~\)

Filter

The filter() function is nearly identical to the subset() function from base R. The main difference is that you can provide multiple logical conditions as separate arguments (rather than building a single condition using &). You should also be familiar with filter() because of its compatibility with piping.

Example:

## Colleges in Iowa w/ an ACT median above 25
colleges %>% filter(State == "IA", ACT_median > 25)
##                 Name         City State Enrollment Private Region Adm_Rate
## 1    Cornell College Mount Vernon    IA       1022 Private Plains   0.6102
## 2   Drake University   Des Moines    IA       2952 Private Plains   0.6766
## 3   Grinnell College     Grinnell    IA       1683 Private Plains   0.2438
## 4     Luther College      Decorah    IA       1974 Private Plains   0.6257
## 5 University of Iowa    Iowa City    IA      23410  Public Plains   0.8267
##   ACT_median ACT_Q1 ACT_Q3  Cost Net_Tuition Avg_Fac_Salary PercentFemale
## 1         27     23     23 55817       16457          68832     0.4674157
## 2         27     24     24 53507       21160          85563     0.6097561
## 3         32     30     30 65814       20369         101979     0.5348837
## 4         26     23     23 54045       16779          67833     0.5242718
## 5         26     23     23 22607       14547          91440     0.5623043
##   PercentWhite PercentBlack PercentHispanic PercentAsian FourYearComp_Males
## 1       0.8856       0.0456          0.0595       0.0209          0.6776860
## 2       0.9025       0.0370          0.0407       0.0250          0.6641791
## 3       0.7933       0.0971          0.0841       0.0400          0.7413793
## 4       0.9335       0.0223          0.0282       0.0172          0.6457399
## 5       0.9158       0.0303          0.0377       0.0224          0.6086310
##   FourYearComp_Females Debt_median Salary10yr_median
## 1            0.7310345       22130             43000
## 2            0.7139175       19197             58300
## 3            0.8372093       15000             49100
## 4            0.7553191       25250             47400
## 5            0.6572554       16173             51900
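
For comparison, the same rows could be obtained with base R's subset() by combining the two conditions with &, though that version does not chain as naturally with %>% (a sketch):

## Base R equivalent of the filter() call above
subset(colleges, State == "IA" & ACT_median > 25)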

\(~\)

Arrange

The arrange() function sorts the rows of your data by one or more variables:

## Sort by ACT median
colleges %>% 
  filter(State == "IA" & ACT_median > 25) %>%
  arrange(ACT_median)
##                 Name         City State Enrollment Private Region Adm_Rate
## 1     Luther College      Decorah    IA       1974 Private Plains   0.6257
## 2 University of Iowa    Iowa City    IA      23410  Public Plains   0.8267
## 3    Cornell College Mount Vernon    IA       1022 Private Plains   0.6102
## 4   Drake University   Des Moines    IA       2952 Private Plains   0.6766
## 5   Grinnell College     Grinnell    IA       1683 Private Plains   0.2438
##   ACT_median ACT_Q1 ACT_Q3  Cost Net_Tuition Avg_Fac_Salary PercentFemale
## 1         26     23     23 54045       16779          67833     0.5242718
## 2         26     23     23 22607       14547          91440     0.5623043
## 3         27     23     23 55817       16457          68832     0.4674157
## 4         27     24     24 53507       21160          85563     0.6097561
## 5         32     30     30 65814       20369         101979     0.5348837
##   PercentWhite PercentBlack PercentHispanic PercentAsian FourYearComp_Males
## 1       0.9335       0.0223          0.0282       0.0172          0.6457399
## 2       0.9158       0.0303          0.0377       0.0224          0.6086310
## 3       0.8856       0.0456          0.0595       0.0209          0.6776860
## 4       0.9025       0.0370          0.0407       0.0250          0.6641791
## 5       0.7933       0.0971          0.0841       0.0400          0.7413793
##   FourYearComp_Females Debt_median Salary10yr_median
## 1            0.7553191       25250             47400
## 2            0.6572554       16173             51900
## 3            0.7310345       22130             43000
## 4            0.7139175       19197             58300
## 5            0.8372093       15000             49100

When sorting by multiple variables, the one listed first will be given priority. Additionally, values can be arranged in descending order via the desc() function:

## Sort by ACT median first, then by descending Adm_Rate among ties
IA_selective <- colleges %>% 
  filter(State == "IA" & ACT_median > 25) %>%
  arrange(ACT_median, desc(Adm_Rate))
IA_selective
##                 Name         City State Enrollment Private Region Adm_Rate
## 1 University of Iowa    Iowa City    IA      23410  Public Plains   0.8267
## 2     Luther College      Decorah    IA       1974 Private Plains   0.6257
## 3   Drake University   Des Moines    IA       2952 Private Plains   0.6766
## 4    Cornell College Mount Vernon    IA       1022 Private Plains   0.6102
## 5   Grinnell College     Grinnell    IA       1683 Private Plains   0.2438
##   ACT_median ACT_Q1 ACT_Q3  Cost Net_Tuition Avg_Fac_Salary PercentFemale
## 1         26     23     23 22607       14547          91440     0.5623043
## 2         26     23     23 54045       16779          67833     0.5242718
## 3         27     24     24 53507       21160          85563     0.6097561
## 4         27     23     23 55817       16457          68832     0.4674157
## 5         32     30     30 65814       20369         101979     0.5348837
##   PercentWhite PercentBlack PercentHispanic PercentAsian FourYearComp_Males
## 1       0.9158       0.0303          0.0377       0.0224          0.6086310
## 2       0.9335       0.0223          0.0282       0.0172          0.6457399
## 3       0.9025       0.0370          0.0407       0.0250          0.6641791
## 4       0.8856       0.0456          0.0595       0.0209          0.6776860
## 5       0.7933       0.0971          0.0841       0.0400          0.7413793
##   FourYearComp_Females Debt_median Salary10yr_median
## 1            0.6572554       16173             51900
## 2            0.7553191       25250             47400
## 3            0.7139175       19197             58300
## 4            0.7310345       22130             43000
## 5            0.8372093       15000             49100

Question #4: Filter the “colleges” data to include only private colleges in the Mid East and New England regions. Then sort these schools according to the variable “PercentFemale” such that the school with the largest share of female students appears at the top of the list.

\(~\)

Select

The select() function is used to reduce the number of variables in a data set:

### Keep only the Name, ACT_median, Cost, and Net_Tuition variables
IA_selective <- colleges %>% 
  filter(State == "IA" & ACT_median > 25) %>%
  select(Name, ACT_median, Cost, Net_Tuition)
IA_selective
##                 Name ACT_median  Cost Net_Tuition
## 1    Cornell College         27 55817       16457
## 2   Drake University         27 53507       21160
## 3   Grinnell College         32 65814       20369
## 4     Luther College         26 54045       16779
## 5 University of Iowa         26 22607       14547

Sometimes you’ll want to keep most of your variables, dropping only a few that are no longer necessary. To drop a variable using select(), you can place a - character in front of its name:

### Keep everything but the State and City variables
IA_selective <- colleges %>% 
  filter(State == "IA" & ACT_median > 25) %>%
  select(-State, -City)
IA_selective
##                 Name Enrollment Private Region Adm_Rate ACT_median ACT_Q1
## 1    Cornell College       1022 Private Plains   0.6102         27     23
## 2   Drake University       2952 Private Plains   0.6766         27     24
## 3   Grinnell College       1683 Private Plains   0.2438         32     30
## 4     Luther College       1974 Private Plains   0.6257         26     23
## 5 University of Iowa      23410  Public Plains   0.8267         26     23
##   ACT_Q3  Cost Net_Tuition Avg_Fac_Salary PercentFemale PercentWhite
## 1     23 55817       16457          68832     0.4674157       0.8856
## 2     24 53507       21160          85563     0.6097561       0.9025
## 3     30 65814       20369         101979     0.5348837       0.7933
## 4     23 54045       16779          67833     0.5242718       0.9335
## 5     23 22607       14547          91440     0.5623043       0.9158
##   PercentBlack PercentHispanic PercentAsian FourYearComp_Males
## 1       0.0456          0.0595       0.0209          0.6776860
## 2       0.0370          0.0407       0.0250          0.6641791
## 3       0.0971          0.0841       0.0400          0.7413793
## 4       0.0223          0.0282       0.0172          0.6457399
## 5       0.0303          0.0377       0.0224          0.6086310
##   FourYearComp_Females Debt_median Salary10yr_median
## 1            0.7310345       22130             43000
## 2            0.7139175       19197             58300
## 3            0.8372093       15000             49100
## 4            0.7553191       25250             47400
## 5            0.6572554       16173             51900

\(~\)

Mutate

The mutate() function is used to add a new column to your data that is a function of one or more existing variables:

IA_selective <- colleges %>% 
  filter(State == "IA" & ACT_median > 25) %>%
  mutate(Expected_Discount = (Cost - Net_Tuition)/Cost) %>%
  select(Name, Cost, Net_Tuition, Expected_Discount)
IA_selective
##                 Name  Cost Net_Tuition Expected_Discount
## 1    Cornell College 55817       16457         0.7051615
## 2   Drake University 53507       21160         0.6045377
## 3   Grinnell College 65814       20369         0.6905066
## 4     Luther College 54045       16779         0.6895365
## 5 University of Iowa 22607       14547         0.3565267

In the example shown above we add a new variable, “Expected_Discount”, that is a function of “Cost” and “Net_Tuition”.

Question #5: Using the entire “colleges” data set, create a new data frame containing only “Name”, “State”, and a new variable named “Debt_Cost_Ratio” that describes each college’s “Debt_median” relative to its expected cumulative cost of attendance under the assumption that a student enrolls for 4 years and “Cost” increases by 3% each year. Print this new data frame as part of your answer. Hint: In year 1 the cumulative cost for a student is Cost, in year 2 the cumulative cost is Cost + 1.03*Cost, etc.

\(~\)

Summarize

The summarize() function will calculate summary statistics that require an aggregation of rows. For example:

colleges %>% 
  filter(State == "IA") %>%
  summarize(Median_Cost = median(Cost))
##   Median_Cost
## 1       43520

Without group-wise manipulation (explained in the next section), summarize() is most useful for generating a customized set of summary measures:

colleges %>% 
  filter(State == "IA") %>%
  summarize(Min_Cost = min(Cost),
            TenPer_Cost = quantile(Cost, 0.1),     ## 10th percentile
            Median_Cost = median(Cost),
            NinetyPer_Cost = quantile(Cost, 0.9),  ## 90th percentile
            Max_Cost = max(Cost))
##   Min_Cost TenPer_Cost Median_Cost NinetyPer_Cost Max_Cost
## 1    20476     22368.2       43520        54256.2    65814

Question #6: Using the summarize() function, report the interquartile range (IQR) and standard deviation of the variable “Debt_Cost_Ratio” that you created in Question #5. Hint: recall that the IQR is calculated as the 75th percentile minus the 25th percentile.

\(~\)

Group-wise Manipulation

Frequently, we’d like to perform parallel calculations for different subsets of data. For example, you might want to find the mean and standard deviation of the variable “Cost” separately for every state, or you might even want to find it separately for the private and public schools in each different state.

Group-wise manipulation involves two steps:

  1. First, the group_by() function is used to internally add grouping tags to the rows of your data. You will not see these tags, but other tidyverse functions will recognize and use them.
  2. Next, the tagged data are passed into any of the aforementioned dplyr functions (usually summarize() or mutate()), and those functions are executed separately on each group.

Shown below are a few examples.

In Example #1, we find the state-specific median cost of the colleges in each state located in the “Plains” region:

## Example #1
colleges %>% 
  filter(Region == "Plains") %>%
  group_by(State) %>%
  summarize(Median_Cost = median(Cost, na.rm = TRUE))
## # A tibble: 7 × 2
##   State Median_Cost
##   <chr>       <dbl>
## 1 IA         43520 
## 2 KS         38832 
## 3 MN         35887 
## 4 MO         30279 
## 5 ND         19299 
## 6 NE         29258.
## 7 SD         22609

In Example #2, we find each state’s median cost separately for private and public colleges located in Iowa, Minnesota, or Missouri. We also use n() to count the number of colleges, recorded as “N”, belonging to each group reported in the summary.

## Example #2
colleges %>% 
  filter(State == "IA" | State == "MN" | State == "MO") %>%
  group_by(State, Private) %>%
  summarize(Median_Cost = median(Cost, na.rm = TRUE),
            N = n())
## # A tibble: 6 × 4
## # Groups:   State [3]
##   State Private Median_Cost     N
##   <chr> <chr>         <dbl> <int>
## 1 IA    Private      44206     25
## 2 IA    Public       21295      4
## 3 MN    Private      48860     24
## 4 MN    Public       21416     12
## 5 MO    Private      37788.    32
## 6 MO    Public       19346     13

In Example #3, we find how much each college (in either IA or MN) deviates from the average cost of all colleges within the same state.

## Example #3
colleges %>% 
  filter(State == "IA" | State == "MN" ) %>%
  group_by(State) %>%
  mutate(Cost_Avg = mean(Cost, na.rm = TRUE),
         Cost_vs_Avg = Cost - mean(Cost, na.rm = TRUE)) %>%
  select(Name, State, Cost, Cost_Avg, Cost_vs_Avg)
## # A tibble: 65 × 5
## # Groups:   State [2]
##    Name                     State  Cost Cost_Avg Cost_vs_Avg
##    <chr>                    <chr> <int>    <dbl>       <dbl>
##  1 Augsburg University      MN    51251   37714.      13537.
##  2 Bemidji State University MN    20379   37714.     -17335.
##  3 Bethany Lutheran College MN    36961   37714.       -753.
##  4 Bethel University        MN    50325   37714.      12611.
##  5 Briar Cliff University   IA    42423   41987.        436.
##  6 Buena Vista University   IA    45332   41987.       3345.
##  7 Capella University       MN    20152   37714.     -17562.
##  8 Carleton College         MN    68835   37714.      31121.
##  9 Central College          IA    50547   41987.       8560.
## 10 Clarke University        IA    44891   41987.       2904.
## # ℹ 55 more rows

Notice how summarize() returns an object with 1 row per group, while mutate() returns an object with 1 row per observation.
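
The sketch below makes that contrast explicit using the same grouping as Example #3 (the comments describe the expected shapes rather than printed output):

## One row per group vs. one row per observation
grouped_IA_MN <- colleges %>% 
  filter(State == "IA" | State == "MN") %>%
  group_by(State)

grouped_IA_MN %>% summarize(Avg_Cost = mean(Cost, na.rm = TRUE))   # one row per state
grouped_IA_MN %>% mutate(Avg_Cost = mean(Cost, na.rm = TRUE))      # one row per college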

\(~\)

Practice (required)

Question #7: The “airlines” data set (loaded below) contains data used in the article Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past? that appeared on fivethirtyeight.com.

airlines <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/tidying-data/data/airline-safety.csv", as.is = TRUE)
head(airlines)
##                 airline avail_seat_km_per_week incidents.1985_1999
## 1            Aer Lingus              320906734                   2
## 2             Aeroflot*             1197672318                  76
## 3 Aerolineas Argentinas              385803648                   6
## 4           Aeromexico*              596871813                   3
## 5            Air Canada             1865253802                   2
## 6            Air France             3004002661                  14
##   fatal_accidents.1985_1999 fatalities.1985_1999 incidents.2000_2014
## 1                         0                    0                   0
## 2                        14                  128                   6
## 3                         0                    0                   1
## 4                         1                   64                   5
## 5                         0                    0                   2
## 6                         4                   79                   6
##   fatal_accidents.2000_2014 fatalities.2000_2014
## 1                         0                    0
## 2                         1                   88
## 3                         0                    0
## 4                         0                    0
## 5                         0                    0
## 6                         2                  337

Recall that a “tidy” version of these data should satisfy the following:

  • Each row is a single airline in a specific time period (ie: Air Canada in 1985-1999 or Alaska Airlines in 2000-2014)
  • Each column contains only a single variable
  • Each cell contains only a single value

Part A: Use pivot_longer() to gather the last six columns of the “airlines” data into a column named “accidents” and a column named “count”.

Part B: Use separate() to split the “accidents” column into two variables named “var” and “years”. Hint: remember that the period is a special character in regular expressions.

Part C: Use pivot_wider() to spread out the “var” column into three new columns containing the type of accident. Your data should now contain two rows per airline (one for each time period), you can check if the first few rows match those printed below.

## # A tibble: 6 × 6
##   airline      avail_seat_km_per_week years incidents fatal_accidents fatalities
##   <chr>                         <dbl> <chr>     <int>           <int>      <int>
## 1 Aer Lingus                320906734 1985…         2               0          0
## 2 Aer Lingus                320906734 2000…         0               0          0
## 3 Aeroflot*                1197672318 1985…        76              14        128
## 4 Aeroflot*                1197672318 2000…         6               1         88
## 5 Aerolineas …              385803648 1985…         6               0          0
## 6 Aerolineas …              385803648 2000…         1               0          0

\(~\)

Question #8: The iris data (from the datasets package) is a collection of measurements (in cm) of the sepal and petal dimensions of 150 flowers, 50 from each of 3 different species of iris. These data are frequently attributed to the famous statistician Ronald Fisher.

# install.packages("datasets")
data(iris)

Your goal in this question is to recreate the following graphic, which requires the use of tidyr functions covered in this lab.

  • Note: you can use the theme() function with the argument axis.text.x = element_text(angle = 45) to rotate the x-axis labels.

Intensive care units, or ICUs, are hospital units reserved for patients in critical condition. The data below are a random sample of n = 200 ICU patients from a research hospital affiliated with Carnegie Mellon University (CMU).

icu = read.csv("https://remiller1450.github.io/data/ICUAdmissions.csv")

The data dictionary below documents each variable that was recorded:

  • ID - Patient ID number
  • Status - Patient status: 0=lived or 1=died
  • Age - Patient’s age (in years)
  • Sex - 0=male or 1=female
  • Race - Patient’s race: 1=white, 2=black, or 3=other
  • Service - Type of service: 0=medical or 1=surgical
  • Cancer - Is cancer involved? 0=no or 1=yes
  • Renal - Is chronic renal failure involved? 0=no or 1=yes
  • Infection - Is infection involved? 0=no or 1=yes
  • CPR - Patient received CPR prior to admission? 0=no or 1=yes
  • Systolic - Systolic blood pressure (in mm of Hg)
  • HeartRate - Pulse rate (beats per minute)
  • Previous - Previous admission to ICU within 6 months? 0=no or 1=yes
  • Type - Admission type: 0=elective or 1=emergency
  • Fracture - Fractured bone involved? 0=no or 1=yes
  • PO2 - Partial oxygen level from blood gases under 60? 0=no or 1=yes
  • PH - pH from blood gas under 7.25? 0=no or 1=yes
  • PCO2 - Partial carbon dioxide level from blood gas over 45? 0=no or 1=yes
  • Bicarbonate - Bicarbonate from blood gas under 18? 0=no or 1=yes
  • Creatinine - Creatinine from blood gas over 2.0? 0=no or 1=yes
  • Consciousness - Level upon arrival: 0=conscious, 1=deep stupor, or 2=coma

Question #9:

Part A: Filter the ICU data to include only patients whose visit involves an infection. Then, for the “Age” variable, find the mean, standard deviation, and group size (found using the function n()) of patients with and without a previous admission in the prior 6 months.

Part B: Considering all ICU patients in these data, use the group_by() and mutate() functions to calculate a sex-specific Z-score for the variable “HeartRate” of each patient. Note: you should be using different means and standard deviations within each sex to calculate this Z-score. If you are unfamiliar with Z-scores, they take the form: \(Z = \tfrac{\text{Value} - \text{Mean}}{\text{Std. Dev}}\)

Part C: Using your results from Part B, recreate the following ggplot graphic:

Hints: The code used to generate this graph uses aes(y = after_stat(density)) to put the histograms on the density scale, and position = "identity" to prevent them from overlapping. The graphics parameter alpha was used to provide partial transparency, and scale_fill_manual() was used to modify the color scheme and color/fill labels.