This lab focuses on manipulating, cleaning, and preparing data for
visualization (or other analyses) using packages from the tidyverse suite. In the
first part of the lab we will use the
tidyr` package. In the second part of the lab, we will work withdplyr`
package, which is used for “data wrangling”, or the process of cleaning,
restructuring, and enriching a data set to make it more usable.
Directions for all labs (read before starting)
The “Lab” section is something you will work on with a partner using paired programming, a framework defined as follows:
Partners are encouraged to switch roles throughout the “Lab” section.
\(~\)
#install.packages("tidyverse") # we will be using these packages throughout
# the semester, it is best to download them all now.
Shown below are the acceptance rates over time of three different liberal arts colleges from 2018 to 2020:
As a data scientist, you must be able to connect the format of your data and a desired data visualization.
ggplot?\(~\)
Data in a wide format record many different values or variables for a single entity in a single row (ie: acceptance rates for a college in different years, or different test scores for the same employee).
Data in a long format use multiple rows and a single column for the outcome or value of interest (ie: acceptance rate, test score, etc.) with additional columns identifying the meaning of that value.
Example:
The ggplot2 package, as well as the implementations of
many statistical models, expect data in long format.
However, many data manipulations are easier to do in wide format; for
example, calculating the average improvement from pre-test to
post-test.
\(~\)
Converting between “wide” and “long” formats is often the most challenging step in creating a “tidy” data set, or one that is fully prepared for graphing/modeling.
In general, tidy data are defined by the following criteria:
This lab will introduce several data manipulation functions used to help tidy a data set into a more useful format.
\(~\)
# Please install and load the following packages
# install.packages("tidyr")
library(tidyr)
library(ggplot2)
library(dplyr)
The lab will use several data sets in its examples:
collegeAdm = read.csv("https://remiller1450.github.io/data/college_adm.csv")
bluechips = read.csv("https://remiller1450.github.io/data/bluechips.csv")
polls <- read.csv("https://remiller1450.github.io/data/polls2016.csv")
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
\(~\)
As a data scientist, you should strive to write code that is:
Below is a “bad” example that begins with the
bluechips data and creates a new data frame containing the
average price (of the four stocks) for the years 2013, 2017, and
2021.
## Bad example
temp1 = subset(bluechips, Year %in% c(2013, 2017, 2021))
temp2 = data.frame(Year = temp1$Year, Avg = (temp1$AAPL + temp1$KO + temp1$JNJ + temp1$AXP)/4)
temp2
## Year Avg
## 1 2013 46.69955
## 2 2017 65.50687
## 3 2021 114.17750
We can greatly streamline our workflow in this example using a method known as piping:
## Good example
bluechips %>%
filter(Year %in% c(2013, 2017, 2021)) %>% # Subset to include the target years
mutate(Avg = (AAPL + KO + JNJ + AXP)/4) %>% # Calculate average
select(Year, Avg) # Drop everything but year and average
## Year Avg
## 1 2013 46.69955
## 2 2017 65.50687
## 3 2021 114.17750
The %>% symbol will “pipe” the output of a preceding
function into a subsequent function (usually as the “data”
argument).
Below is a description of each line within the piping example given above:
bluechips is piped forward (into the
filter() function on the next line)filter() subsets the data it receives to include only
the target years and the resulting subset is piped forward (into the
mutate() function on the next line)mutate() adds a new column called “Avg” to the data
frame it received, and the resulting data frame is piped forward (into
the select() function on the next line)select() function drops all variables other than “Year”
and “Avg”Because the output of this pipeline is not stored as an object, the final data frame is simply printed. If we planned on using data frame prepared by this pipeline in a future data visualization or model, we’d want to store it as its own object:
## Storing the manipulated data set
new_bluechips <- bluechips %>%
filter(Year %in% c(2013, 2017, 2021)) %>% # Include only target years
mutate(Avg = (AAPL + KO + JNJ + AXP)/4) %>% # Calculate average
select(Year, Avg) # Drop extra vars
Note: all functions in the tidyverse suite of
packages are compatible with the %>% operator, so we can
including pivoting steps in a pipeline.
\(~\)
At this point you will begin working with your partner. Please read through the text/examples and make sure you both understand before attempting to answer the embedded questions.
\(~\)
Consider the collegeAdm data frame:
head(collegeAdm)
## Adm_Rate Year College
## 1 28.9 2018 Grinnell
## 2 24.4 2019 Grinnell
## 3 23.1 2020 Grinnell
## 4 21.2 2018 Carlton
## 5 19.8 2019 Carlton
## 6 19.1 2020 Carlton
These data are currently in “long” format, but we could convert them
to a “wide” format using the pivot_wider() function:
## Pivot from long to wide to get 1 row per Year
wideCollegeAdm <- pivot_wider(collegeAdm,
id_cols = Year,
names_from = College,
values_from = Adm_Rate)
head(wideCollegeAdm)
## # A tibble: 3 × 4
## Year Grinnell Carlton Oberlin
## <int> <dbl> <dbl> <dbl>
## 1 2018 28.9 21.2 33.7
## 2 2019 24.4 19.8 36.2
## 3 2020 23.1 19.1 36.4
The following arguments guide this transformation:
id_cols determines what will be given its own row in
the “wide” data set (ie: each row will be a unique value of the variable
“Year”)names_from defines the single column from the
“long” data that should be spread into multiple distinct
columns in the “wide” data (ie: each value of “College” is given a
column named after it)values_from defines the single column from the
“long” data containing the values used to populate the cells of
the “wide” data (ie: the columns created for each “College” will contain
the values of “Adm_Rate”)Notice what happens when id_cols and
names_from are swapped:
## Pivot from long to wide to get 1 row per College
wideCollegeAdm2 <- pivot_wider(collegeAdm,
id_cols = College,
names_from = Year,
values_from = Adm_Rate)
head(wideCollegeAdm2)
## # A tibble: 3 × 4
## College `2018` `2019` `2020`
## <chr> <dbl> <dbl> <dbl>
## 1 Grinnell 28.9 24.4 23.1
## 2 Carlton 21.2 19.8 19.1
## 3 Oberlin 33.7 36.2 36.4
Similarly, the pivot_longer() function will transform
“wide” data into “long” data:
pivot_longer(wideCollegeAdm2,
cols = !College,
names_to = "Year",
values_to = "Adm_Rate")
## # A tibble: 9 × 3
## College Year Adm_Rate
## <chr> <chr> <dbl>
## 1 Grinnell 2018 28.9
## 2 Grinnell 2019 24.4
## 3 Grinnell 2020 23.1
## 4 Carlton 2018 21.2
## 5 Carlton 2019 19.8
## 6 Carlton 2020 19.1
## 7 Oberlin 2018 33.7
## 8 Oberlin 2019 36.2
## 9 Oberlin 2020 36.4
cols defines the column(s) used in the pivot
(!College will include everything but the variable
“College”). The values of these variables will be collapsed into a
single column.names_to is the name of the single column in the “long”
data frame that will store the column names of the “wide” data
framevalues_to is the name of the single column in the
“long” data frame that will store the values from the cells of
the “wide” data frameNote we could interchangeably use the argument
cols = c("2018", "2019", "2020") or cols = 2:4
(instead of cols = !College) to achieve the exact same
result. The former approach explicitly names the columns that should be
pivoted, and the later gives their index positions.
Question #1: Convert the bluechips data
to a long format where each stock’s closing price on the first trading
day of each year is recorded in a single column named “Price”.
Question #2: Starting with the long format data
frame you created in Question #1, recreate the original
bluechips data set using pivot_wider().
\(~\)
tidyr functionsPivoting or reshaping is often only one of many steps needed to tidy
a data set. Another common occurrence is that data will contain multiple
variables in a single column. For example, consider the “Date” and
“Sample” columns in the polls data set:
head(polls)
## Poll Date Sample MoE Clinton..D. Trump..R.
## 1 Monmouth 7/14 - 7/16 688 LV 3.7 45 43
## 2 CNN/ORC 7/13 - 7/16 872 RV 3.5 42 37
## 3 ABC News/Wash Post 7/11 - 7/14 816 RV 4.0 42 38
## 4 NBC News/Wall St. Jrnl 7/9 - 7/13 1000 RV 3.1 41 35
## 5 Economist/YouGov 7/9 - 7/11 932 RV 4.5 40 37
## 6 Associated Press-GfK 7/7 - 7/11 837 RV NA 40 36
## Johnson..L. Stein..G.
## 1 5 1
## 2 13 5
## 3 8 5
## 4 11 6
## 5 5 2
## 6 6 2
The column “Date” contains two distinct variables, the start and end of the poll’s sampling period. Similarly, “Sample” also contains two variables, the number of participants in the poll and the population that was sampled (registered voters or likely voters).
The separate() function is used to split a column into
multiple new columns using a defined separator:
## Example #1
tidy_polls <- separate(polls,
col = Date,
into = c("Begin", "End"),
sep = " - ")
head(tidy_polls)
## Poll Begin End Sample MoE Clinton..D. Trump..R.
## 1 Monmouth 7/14 7/16 688 LV 3.7 45 43
## 2 CNN/ORC 7/13 7/16 872 RV 3.5 42 37
## 3 ABC News/Wash Post 7/11 7/14 816 RV 4.0 42 38
## 4 NBC News/Wall St. Jrnl 7/9 7/13 1000 RV 3.1 41 35
## 5 Economist/YouGov 7/9 7/11 932 RV 4.5 40 37
## 6 Associated Press-GfK 7/7 7/11 837 RV NA 40 36
## Johnson..L. Stein..G.
## 1 5 1
## 2 13 5
## 3 8 5
## 4 11 6
## 5 5 2
## 6 6 2
col is the single column to be separatedinto indicates the names of the new columns produced by
the separationsep is the character string used to determine how to
split. In this example, the split happens when - surrounded
by a space on each side is present.In Example #2 (shown below), the “sep” argument is not explicitly
given. In this situation, the default behavior of
separate() is to try and guess an appropriate
separator.
## Example #2
tidy_polls <- separate(polls,
col = Sample,
into = c("Size", "Population"))
head(tidy_polls)
## Poll Date Size Population MoE Clinton..D. Trump..R.
## 1 Monmouth 7/14 - 7/16 688 LV 3.7 45 43
## 2 CNN/ORC 7/13 - 7/16 872 RV 3.5 42 37
## 3 ABC News/Wash Post 7/11 - 7/14 816 RV 4.0 42 38
## 4 NBC News/Wall St. Jrnl 7/9 - 7/13 1000 RV 3.1 41 35
## 5 Economist/YouGov 7/9 - 7/11 932 RV 4.5 40 37
## 6 Associated Press-GfK 7/7 - 7/11 837 RV NA 40 36
## Johnson..L. Stein..G.
## 1 5 1
## 2 13 5
## 3 8 5
## 4 11 6
## 5 5 2
## 6 6 2
While this is not generally recommended, it can work well if there’s a clear pattern in your variable. More complex strings might require the use of regular expressions, a topic we’ll cover later this semester.
Question #3 (Part A): Using either the
pivot_longer() or pivot_wider() function,
create a version of the “tidy_polls” data containing the variables
“Candidate” and “Percentage”, where “Candidate” is taken from the names
of the last four columns of the data frame, and “Percentage” is taken
from the values contained in these columns.
Question #3 (Part B): Using the
separate() function, split the column “Candidate” (created
in Part A) into two distinct columns containing the name of the
candidate (ie: Clinton, Trump, etc.) and their political party (ie: D,
R, etc.). Hint: periods, or ., are a special
character in R, but you can reference one using the
expression: [.]. You can also try letting
separate() guess the proper splitting characters.
\(~\)
The dplyr package contains a suite of functions designed
to make data manipulation easier. The package’s core functions can be
viewed as verbs:
| Verb/Function | Meaning |
|---|---|
filter |
pick specific observations (i.e. specific rows) similar to
subset() |
arrange |
reorder the rows |
select |
pick variables by their names (i.e. specific columns) |
mutate |
add new derived columns to a data frame |
summarize |
aggregate many rows into a summary measure |
Importantly, these functions can be strung together using piping. But first, let’s see a few examples of how they work individually.
\(~\)
Filter
The filter() function is nearly identical to the
subset() function from base R. The only difference is that
you can provide multiple logical conditions as separate arguments
(rather than building a single condition using &). You
should also be aware of it because of its compatibility with piping.
Example:
## Colleges in Iowa w/ an ACT median above 25
colleges %>% filter(State == "IA", ACT_median > 25)
## Name City State Enrollment Private Region Adm_Rate
## 1 Cornell College Mount Vernon IA 1022 Private Plains 0.6102
## 2 Drake University Des Moines IA 2952 Private Plains 0.6766
## 3 Grinnell College Grinnell IA 1683 Private Plains 0.2438
## 4 Luther College Decorah IA 1974 Private Plains 0.6257
## 5 University of Iowa Iowa City IA 23410 Public Plains 0.8267
## ACT_median ACT_Q1 ACT_Q3 Cost Net_Tuition Avg_Fac_Salary PercentFemale
## 1 27 23 23 55817 16457 68832 0.4674157
## 2 27 24 24 53507 21160 85563 0.6097561
## 3 32 30 30 65814 20369 101979 0.5348837
## 4 26 23 23 54045 16779 67833 0.5242718
## 5 26 23 23 22607 14547 91440 0.5623043
## PercentWhite PercentBlack PercentHispanic PercentAsian FourYearComp_Males
## 1 0.8856 0.0456 0.0595 0.0209 0.6776860
## 2 0.9025 0.0370 0.0407 0.0250 0.6641791
## 3 0.7933 0.0971 0.0841 0.0400 0.7413793
## 4 0.9335 0.0223 0.0282 0.0172 0.6457399
## 5 0.9158 0.0303 0.0377 0.0224 0.6086310
## FourYearComp_Females Debt_median Salary10yr_median
## 1 0.7310345 22130 43000
## 2 0.7139175 19197 58300
## 3 0.8372093 15000 49100
## 4 0.7553191 25250 47400
## 5 0.6572554 16173 51900
\(~\)
Arrange
The arange() function sorts the rows of your data by one
or more numeric variables:
## Sort by ACT median
colleges %>%
filter(State == "IA" & ACT_median > 25) %>%
arrange(ACT_median)
## Name City State Enrollment Private Region Adm_Rate
## 1 Luther College Decorah IA 1974 Private Plains 0.6257
## 2 University of Iowa Iowa City IA 23410 Public Plains 0.8267
## 3 Cornell College Mount Vernon IA 1022 Private Plains 0.6102
## 4 Drake University Des Moines IA 2952 Private Plains 0.6766
## 5 Grinnell College Grinnell IA 1683 Private Plains 0.2438
## ACT_median ACT_Q1 ACT_Q3 Cost Net_Tuition Avg_Fac_Salary PercentFemale
## 1 26 23 23 54045 16779 67833 0.5242718
## 2 26 23 23 22607 14547 91440 0.5623043
## 3 27 23 23 55817 16457 68832 0.4674157
## 4 27 24 24 53507 21160 85563 0.6097561
## 5 32 30 30 65814 20369 101979 0.5348837
## PercentWhite PercentBlack PercentHispanic PercentAsian FourYearComp_Males
## 1 0.9335 0.0223 0.0282 0.0172 0.6457399
## 2 0.9158 0.0303 0.0377 0.0224 0.6086310
## 3 0.8856 0.0456 0.0595 0.0209 0.6776860
## 4 0.9025 0.0370 0.0407 0.0250 0.6641791
## 5 0.7933 0.0971 0.0841 0.0400 0.7413793
## FourYearComp_Females Debt_median Salary10yr_median
## 1 0.7553191 25250 47400
## 2 0.6572554 16173 51900
## 3 0.7310345 22130 43000
## 4 0.7139175 19197 58300
## 5 0.8372093 15000 49100
When sorting by multiple variables, the one listed first will be
given priority. Additionally, values can be arranged in descending order
via the desc() function:
## Sort by ACT median first, then by Adm_Rate among ties
IA_selective <- colleges %>%
filter(State == "IA" & ACT_median > 25) %>%
arrange(ACT_median, desc(Adm_Rate))
IA_selective
## Name City State Enrollment Private Region Adm_Rate
## 1 University of Iowa Iowa City IA 23410 Public Plains 0.8267
## 2 Luther College Decorah IA 1974 Private Plains 0.6257
## 3 Drake University Des Moines IA 2952 Private Plains 0.6766
## 4 Cornell College Mount Vernon IA 1022 Private Plains 0.6102
## 5 Grinnell College Grinnell IA 1683 Private Plains 0.2438
## ACT_median ACT_Q1 ACT_Q3 Cost Net_Tuition Avg_Fac_Salary PercentFemale
## 1 26 23 23 22607 14547 91440 0.5623043
## 2 26 23 23 54045 16779 67833 0.5242718
## 3 27 24 24 53507 21160 85563 0.6097561
## 4 27 23 23 55817 16457 68832 0.4674157
## 5 32 30 30 65814 20369 101979 0.5348837
## PercentWhite PercentBlack PercentHispanic PercentAsian FourYearComp_Males
## 1 0.9158 0.0303 0.0377 0.0224 0.6086310
## 2 0.9335 0.0223 0.0282 0.0172 0.6457399
## 3 0.9025 0.0370 0.0407 0.0250 0.6641791
## 4 0.8856 0.0456 0.0595 0.0209 0.6776860
## 5 0.7933 0.0971 0.0841 0.0400 0.7413793
## FourYearComp_Females Debt_median Salary10yr_median
## 1 0.6572554 16173 51900
## 2 0.7553191 25250 47400
## 3 0.7139175 19197 58300
## 4 0.7310345 22130 43000
## 5 0.8372093 15000 49100
Question #4: Filter the “colleges” data to include only private colleges in the Mid East and New England regions. Then sort these schools according to the variable “PercentFemale” such that the school with the largest share of female students appears at the top of the list.
\(~\)
Select
The select() function is used to reduce the number of
variables in a data set:
### Keep only the Name, ACT_median, and Net_Tuition variables
IA_selective <- colleges %>%
filter(State == "IA" & ACT_median > 25) %>%
select(Name, ACT_median, Cost, Net_Tuition)
IA_selective
## Name ACT_median Cost Net_Tuition
## 1 Cornell College 27 55817 16457
## 2 Drake University 27 53507 21160
## 3 Grinnell College 32 65814 20369
## 4 Luther College 26 54045 16779
## 5 University of Iowa 26 22607 14547
Sometimes you’ll want to keep most of your variables, dropping only a
few that are no longer necessary. To drop a variable using
select(), you can place a - character in front
of its name:
### Keep everything but the State and City variables
IA_selective <- colleges %>%
filter(State == "IA" & ACT_median > 25) %>%
select(-State, -City)
IA_selective
## Name Enrollment Private Region Adm_Rate ACT_median ACT_Q1
## 1 Cornell College 1022 Private Plains 0.6102 27 23
## 2 Drake University 2952 Private Plains 0.6766 27 24
## 3 Grinnell College 1683 Private Plains 0.2438 32 30
## 4 Luther College 1974 Private Plains 0.6257 26 23
## 5 University of Iowa 23410 Public Plains 0.8267 26 23
## ACT_Q3 Cost Net_Tuition Avg_Fac_Salary PercentFemale PercentWhite
## 1 23 55817 16457 68832 0.4674157 0.8856
## 2 24 53507 21160 85563 0.6097561 0.9025
## 3 30 65814 20369 101979 0.5348837 0.7933
## 4 23 54045 16779 67833 0.5242718 0.9335
## 5 23 22607 14547 91440 0.5623043 0.9158
## PercentBlack PercentHispanic PercentAsian FourYearComp_Males
## 1 0.0456 0.0595 0.0209 0.6776860
## 2 0.0370 0.0407 0.0250 0.6641791
## 3 0.0971 0.0841 0.0400 0.7413793
## 4 0.0223 0.0282 0.0172 0.6457399
## 5 0.0303 0.0377 0.0224 0.6086310
## FourYearComp_Females Debt_median Salary10yr_median
## 1 0.7310345 22130 43000
## 2 0.7139175 19197 58300
## 3 0.8372093 15000 49100
## 4 0.7553191 25250 47400
## 5 0.6572554 16173 51900
\(~\)
Mutate
The mutate() function is used to add a new column to
your data that is a function of one or more existing variables:
IA_selective <- colleges %>%
filter(State == "IA" & ACT_median > 25) %>%
mutate(Expected_Discount = (Cost - Net_Tuition)/Cost) %>%
select(Name, Cost, Net_Tuition, Expected_Discount)
IA_selective
## Name Cost Net_Tuition Expected_Discount
## 1 Cornell College 55817 16457 0.7051615
## 2 Drake University 53507 21160 0.6045377
## 3 Grinnell College 65814 20369 0.6905066
## 4 Luther College 54045 16779 0.6895365
## 5 University of Iowa 22607 14547 0.3565267
In the example shown above we add a new variable, “Expected_Discount”, that is a function of “Cost” and “New_Tuition”.
Question #5: Using the entire “colleges” data set,
create a new data frame containing only “Name”, “State”, and a
new variable named “Debt_Cost_Ratio” that describes each college’s
“Debt_median” relative to its expected cumulative cost of attendance
under the assumption that a student enrolls for 4 years and “Cost”
increases by 3% each year. Print this new data frame as part of your
answer. Hint: In year 1 the cumulative cost for a student is
Cost, in year 2 the cumulative cost is
Cost + 1.03*Cost, etc.
\(~\)
Summarize
The summarize() function will calculate summary
statistics that require an aggregation of rows. For example:
colleges %>%
filter(State == "IA") %>%
summarize(Median_Cost = median(Cost))
## Median_Cost
## 1 43520
Without group-wise manipulation (explained in the next section),
summarize() is most useful for generating a customized set
of summary measures:
colleges %>%
filter(State == "IA") %>%
summarize(Min_Cost = min(Cost),
TenPer_Cost = quantile(Cost, 0.1), ## 10th percentile
Median_Cost = median(Cost),
NinetyPer_Cost = quantile(Cost, 0.9), ## 90th percentile
Max_Cost = max(Cost))
## Min_Cost TenPer_Cost Median_Cost NinetyPer_Cost Max_Cost
## 1 20476 22368.2 43520 54256.2 65814
Question #6: Using the summarize()
function, report the interquartile range (IQR) and standard deviation of
the variable “Debt_Cost_Ratio” that you created in Question #5.
Hint: recall that the IQR is calculated as the 75th percentile
minus the 25th percentile.
\(~\)
Frequently, we’d like to perform parallel calculations for different subsets of data. For example, you might want to find the mean and standard deviation of the variable “Cost” separately for every state, or you might even want to find it separately for the private and public schools in each different state.
Group-wise manipulation involves two steps:
group_by() function is used to internally
add grouping tags to the rows of your data. You will not see these tags,
but other tidyverse functions will recognize and use
them.dplyr functions (usually summarize() or
mutate()), and those functions are executed separately on
each group.Shown below are a few examples.
In Example #1, we find the state-specific median cost of the colleges in each state located in the “Plains” region:
## Example #1
colleges %>%
filter(Region == "Plains") %>%
group_by(State) %>%
summarize(Median_Cost = median(Cost, na.rm = TRUE))
## # A tibble: 7 × 2
## State Median_Cost
## <chr> <dbl>
## 1 IA 43520
## 2 KS 38832
## 3 MN 35887
## 4 MO 30279
## 5 ND 19299
## 6 NE 29258.
## 7 SD 22609
In Example #2, we find each state’s median cost separately for
private and public colleges located in Iowa, Minnesota, or Missouri. We
also use n() to count the number of colleges, recorded as
“N”, belonging to each group reported in the summary.
## Example #2
colleges %>%
filter(State == "IA" | State == "MN" | State == "MO") %>%
group_by(State, Private) %>%
summarize(Median_Cost = median(Cost, na.rm = TRUE),
N = n())
## # A tibble: 6 × 4
## # Groups: State [3]
## State Private Median_Cost N
## <chr> <chr> <dbl> <int>
## 1 IA Private 44206 25
## 2 IA Public 21295 4
## 3 MN Private 48860 24
## 4 MN Public 21416 12
## 5 MO Private 37788. 32
## 6 MO Public 19346 13
In Example #3, we find how much each state (either IA or MN) deviates from the average cost of all colleges within the same state.
## Example #3
colleges %>%
filter(State == "IA" | State == "MN" ) %>%
group_by(State) %>%
mutate(Cost_Avg = mean(Cost, na.rm = TRUE),
Cost_vs_Avg = Cost - mean(Cost, na.rm = TRUE)) %>%
select(Name, State, Cost, Cost_Avg, Cost_vs_Avg)
## # A tibble: 65 × 5
## # Groups: State [2]
## Name State Cost Cost_Avg Cost_vs_Avg
## <chr> <chr> <int> <dbl> <dbl>
## 1 Augsburg University MN 51251 37714. 13537.
## 2 Bemidji State University MN 20379 37714. -17335.
## 3 Bethany Lutheran College MN 36961 37714. -753.
## 4 Bethel University MN 50325 37714. 12611.
## 5 Briar Cliff University IA 42423 41987. 436.
## 6 Buena Vista University IA 45332 41987. 3345.
## 7 Capella University MN 20152 37714. -17562.
## 8 Carleton College MN 68835 37714. 31121.
## 9 Central College IA 50547 41987. 8560.
## 10 Clarke University IA 44891 41987. 2904.
## # ℹ 55 more rows
Notice how summarize() returns an object with 1 row per
group, while mutate() returns an object with 1 row per
observation.
\(~\)
Question #7: The “airlines” data set (loaded below) contains data used in the article Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past? that appeared on fivethirtyeight.com.
airlines <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/tidying-data/data/airline-safety.csv", as.is = TRUE)
head(airlines)
## airline avail_seat_km_per_week incidents.1985_1999
## 1 Aer Lingus 320906734 2
## 2 Aeroflot* 1197672318 76
## 3 Aerolineas Argentinas 385803648 6
## 4 Aeromexico* 596871813 3
## 5 Air Canada 1865253802 2
## 6 Air France 3004002661 14
## fatal_accidents.1985_1999 fatalities.1985_1999 incidents.2000_2014
## 1 0 0 0
## 2 14 128 6
## 3 0 0 1
## 4 1 64 5
## 5 0 0 2
## 6 4 79 6
## fatal_accidents.2000_2014 fatalities.2000_2014
## 1 0 0
## 2 1 88
## 3 0 0
## 4 0 0
## 5 0 0
## 6 2 337
Recall that a “tidy” version of these data should satisfy the following:
Part A: Use pivot_longer() to gather
the last six columns of the “airlines” data into a column named
“accidents” and a column named “count”.
Part B: Use separate() to split the
“accidents” column into two variables named “var” and “years”.
Hint: remember that the period is a special character in
R.
Part C: Use pivot_wider() to spread out
the “var” column into three new columns containing the type of accident.
Your data should now contain two rows per airline (one for each time
period), you can check if the first few rows match those printed
below.
## # A tibble: 6 × 6
## airline avail_seat_km_per_week years incidents fatal_accidents fatalities
## <chr> <dbl> <chr> <int> <int> <int>
## 1 Aer Lingus 320906734 1985… 2 0 0
## 2 Aer Lingus 320906734 2000… 0 0 0
## 3 Aeroflot* 1197672318 1985… 76 14 128
## 4 Aeroflot* 1197672318 2000… 6 1 88
## 5 Aerolineas … 385803648 1985… 6 0 0
## 6 Aerolineas … 385803648 2000… 1 0 0
\(~\)
Question #8: The iris data (from the
datasets package) is a collection of measurements (in cm)
of the the sepal and petal dimensions of 50 different flowers coming
from 3 different species of iris. These data are frequently attributed
to the famous statistician Ronald Fisher
# install.packages("datasets")
data(iris)
Your goal in this question is to recreate the following graphic,
which requires the use of tidyr functions covered in this
lab.
theme() function with
the argument axis.text.x = element_text(angle = 45) to
rotate the x-axis labels.Intensive care units, or ICUs, are primary spaces in hospitals that are reserved for patients in critical condition. The data below is a random sample of n = 200 ICU patients from a research hospital affiliated with Carnegie Mellon University (CMU).
icu = read.csv("https://remiller1450.github.io/data/ICUAdmissions.csv")
The data dictionary below documents each variable that was recorded:
Question #9:
Part A: Filter the ICU data to include only patients
whose visit involves an infection. Then, for the “Age” variable, find
the mean, standard deviation, and group size (found using the function
n()) of patients with and without a previous admission in
the prior 6 months.
Part B: Considering all ICU patients in these data,
use the group_by() and mutate() functions to a
sex-specific Z-score for the variable “HeartRate” of each patient.
Note: you should be using different means and standard
deviations within each sex to calculate this Z-score. If you are
unfamiliar with Z-scores, they take the form: \(Z = \tfrac{\text{Value} - \text{Mean}}{\text{Std.
Dev}}\)
Part C: Using your results from Part B, recreate the
following ggplot graphic:
Hints: The code used to generate this graph uses
aes(y = after_stat(density)) to put the histograms on the
density scale, and position = "identity" to prevent them
from overlapping. The graphics parameter alpha was used to
provide partial transparency, and scale_fill_manual() was
used to modify the color scheme and color/fill labels.