Lab 6: Data Processing

This lab focuses on manipulating characters strings using the stringr package within the tidyverse suite as well as select functions in the lubridate package that are useful in working with dates and times.

# Please install and load the following packages
# install.packages("stringr")
# install.packages("lubridate")
library(stringr)
library(ggplot2)
library(lubridate)
library(dplyr)

Directions (Please read before starting)

Please work together with your assigned partner. Make sure you both fully understand each concept before you move on.
Please record your answers and any related code for all embedded lab questions. I encourage you to try out the embedded examples, but you shouldn’t turn them in.
Please ask for help, clarification, or even just a check-in if anything seems unclear.

$~$

Preamble

String basics

A “string” is a single element of a character variable (vector), or a stand alone collection of characters enclosed by quotes.

x <- "Single Element String"
y <- c("apple", "banana", "pear")  ## strings within a character vector

There are many similarities between strings and other types of data. For example, the individual characters in a string have their own positions, and strings have their own length:

str_length(x)
str_length(y)
length(y)

## [1] 21
## [1] 5 6 4
## [1] 3

Positional indices are important to many string processing functions, such as str_sub():

str_sub(x, start = 1, end = 4)  ## str_sub() will subset a string
str_sub(y, start = 1, end = 4)

## [1] "Sing"
## [1] "appl" "bana" "pear"

Positions can also be referenced from the end of a string using negative numbers:

str_sub(x, start = -4, end = -1)
str_sub(y, start = -2, end = -1)

## [1] "ring"
## [1] "le" "na" "ar"

$~$

Strings and other data types

The presence of a character string within a vector will dominate all other types of data:

x1 <- c(1, 2, 3)
typeof(x1)

x2 <- c(1, 2, 3, "A")
typeof(x2)

## [1] "double"
## [1] "character"

y1 <- c(TRUE, FALSE)
typeof(y1)

y2 <- c(TRUE, FALSE, "A")
typeof(y2)

## [1] "logical"
## [1] "character"

Note: NA’s in vectors do not get counted as strings. Why?

y3 = c(1, 2, 3, NA)
typeof(y3)

## [1] "double"

We can run into an issue if the raw data uses character strings to represent missing data or footnotes:

my_data <- read.csv("https://remiller1450.github.io/data/char_dom.csv")
head(my_data)

##   ID messy_x messy_y
## 1  1     100      50
## 2  2      90      40
## 3  3      85      55
## 4  4      90      45
## 5  5     110     55*
## 6  6 Missing      60

str(my_data)

## 'data.frame':    10 obs. of  3 variables:
##  $ ID     : int  1 2 3 4 5 6 7 8 9 10
##  $ messy_x: chr  "100" "90" "85" "90" ...
##  $ messy_y: chr  "50" "40" "55" "45" ...

Fortunately, “messy_x” can be fixed using as.numeric(), and “messy_y” can be fixed using the parse_number() function in the readr package:

## Coerce x to numeric
## any not-numeric values become NAs
as.numeric(my_data$messy_x)

##  [1] 100  90  85  90 110  NA 115  NA 105 100

## Extract only the numeric values in y
## Any others get tossed out
library(readr)
parse_number(my_data$messy_y)

##  [1] 50 40 55 45 55 60 40 35 40 50

DateTimes

Suppose you wanted to calculate the number of days that have elapsed between Dec 12th 2019 and today. You could get today’s date in R using the Sys.Date() function:

todays_date = Sys.Date()
print(todays_date)

## [1] "2025-10-05"

While this looks like a character string, we can use the class() function to see that “today” is an object of class “Date”:

class(todays_date)

## [1] "Date"

Operations involving a mixture of Date and character types are not allowed, but arithmetic operations can be applied to two Date objects:

todays_date - "2019-12-12"           ## Causes an error
todays_date - as.Date("2019-12-12")  ## Works as intended

Dates can be coerced into numeric, with “day 0” being Jan 1, 1970:

as.numeric(todays_date) - as.numeric(as.Date("2019-12-12"))
as.numeric(as.Date("1970-01-01"))

## [1] 2124
## [1] 0

R is set up to store dates internally as the number of days passed since Jan 1, 1970. Why? See here.

$~$

Date classes

The default storage mode for dates in R is ISO 8610, which uses a format of yyyy-mm-dd (and hh:mm:ss if time is known).

The lubridate package, which offers vastly improved handling of dates uses a different storage mode.We can use the now() function in the lubridate package to retrieve the current date/time (to the nearest second):

right_now = now()
print(right_now)

## [1] "2025-10-05 19:58:37 CDT"

Notice the class of “right_now”:

class(right_now)

## [1] "POSIXct" "POSIXt"

The “POSIXct” class records date/time information with an associated time zone. “POSIX” is an acronym for “Portable Operating System Interface” and “ct” stands for “calender time”, while “t” stands for “text” (POSIXt is a mixed character and date format)

A benefit of the “POSIXct” class is its handling of time zones:

as.POSIXct("05/24/2017 08:45", format = "%m/%d/%Y %H:%M", tz = "America/Chicago") - 
   as.POSIXct("05/24/2017 08:45", format = "%m/%d/%Y %H:%M", tz = "America/Denver")

## Time difference of -1 hours

You can access a full list of acceptable inputs to the tz argument using the command OlsonNames(tzdir = NULL):

head(OlsonNames(tzdir = NULL))

## [1] "Africa/Abidjan"     "Africa/Accra"       "Africa/Addis_Ababa"
## [4] "Africa/Algiers"     "Africa/Asmara"      "Africa/Asmera"

$~$

Lab

At this point you should begin working with your partner. Please read through the text/examples and make sure you both understand before attempting to answer the embedded questions.

$~$

Workhorse Functions

The stringr package contains dozens of string processing tools. We will focus our attention on the following functions:

Function	Description
`str_sub()`	Extract substring from a given start to end position
`str_detect()`	Detect presence/absence of first occurrence of substring
`str_locate()`	Give position (start, end) of first occurrence of substring
`str_locate_all()`	Give positions of all occurrences of a substring
`str_replace()`	Replace the first instance of a substring with another
`str_replace_all()`	Replace all instances of a substring with another

For illustration purposes, we will use the vector created below:

fruits <- c("Apple", "Pineapple", "Pear", "Orange", "Peach", "Banana")

String Detect:

str_detect() returns TRUE or FALSE for each element in character vector depending upon whether it contains the target pattern:

str_detect(fruits, "ap") ## returns TRUE if "ap" is found

## [1] FALSE  TRUE FALSE FALSE FALSE FALSE

Because strings are case-sensitive, "ap" is not found in "Apple", but is found in "Pineapple".

String Locate:

str_locate() returns the start and end positions of the first instance of the target pattern:

str_locate(fruits, "an")

##      start end
## [1,]    NA  NA
## [2,]    NA  NA
## [3,]    NA  NA
## [4,]     3   4
## [5,]    NA  NA
## [6,]     2   3

Notice the sixth fruit, "banana", has two instances of "an", but only the first is considered.

str_locate_all() can be used to find every instance of a target pattern, but it can be more cumbersome to work with because its output is a list object (rather than the matrix returned by str_locate()):

str_locate_all(fruits, "an")

## [[1]]
##      start end
## 
## [[2]]
##      start end
## 
## [[3]]
##      start end
## 
## [[4]]
##      start end
## [1,]     3   4
## 
## [[5]]
##      start end
## 
## [[6]]
##      start end
## [1,]     2   3
## [2,]     4   5

A common goal might be to count the instances of a pattern, in which case the unlist() function can be used to coerce the list into a vector.

out <- str_locate_all(fruits, "an")
v <- unlist(out) ## coerce the list into a vector
length(v)/2  ## total number of times "an" occurs in "fruits"

## [1] 3

String Replace:

str_replace() will replace the target pattern with another expression:

str_replace(fruits, "an", "XX")

## [1] "Apple"     "Pineapple" "Pear"      "OrXXge"    "Peach"     "BXXana"

Similar to str_locate() it also works only on the first instance of the pattern, but str_replace_all() can be used if every instance should be replaced:

str_replace_all(fruits, "an", "XX")

## [1] "Apple"     "Pineapple" "Pear"      "OrXXge"    "Peach"     "BXXXXa"

Question #1: Using appropriate stringr functions, identify which elements in the vector “fruits” contain a lower case “p”. Then, use the which() function and indices to print these fruits.

Question #2: Using appropriate stringr functions, count the total number of times a lower case “p” occurs in the vector “fruits”.

$~$

Pre-processing

With strings it can beneficial to pre-process your data to simplify later operations. For example, you might convert all of your strings to lower case via the str_to_lower() function:

str_to_lower(fruits)

## [1] "apple"     "pineapple" "pear"      "orange"    "peach"     "banana"

Similar functions exist for converting to upper, title, or sentence case:

str_to_upper(fruits)

## [1] "APPLE"     "PINEAPPLE" "PEAR"      "ORANGE"    "PEACH"     "BANANA"

str_to_title(fruits)

## [1] "Apple"     "Pineapple" "Pear"      "Orange"    "Peach"     "Banana"

str_to_sentence("aPPles AND Bananas are THE Most popular FRUits")

## [1] "Apples and bananas are the most popular fruits"

Another common problem in string processing is the presence of white space, such as excessive spacing between characters, or at the beginning and end of strings.

The string_squish() function will eliminate all leading and trailing white space and reduce any repeated spaces inside the body of the string to a single instance:

str_squish("  String     A ")

## [1] "String A"

If you wish to remove leading/trailing spaces without modifying the body of the string you can do so using the string_trim() function:

str_trim("  String     A ")

## [1] "String     A"

Question #3: Beginning with the character string given below, use the functions str_squish(), str_to_title(), and str_replace_all() to produce the string "United_States".

q3 <- c("  UNITED  STATES  ")

$~$

Regular Expressions

Regular expressions, or “regex”, are special sequences of characters used to identify patterns in strings. At its simplest, a regular expression might look exactly like the pattern you’re trying to find. For example "ea" will match the "ea" in "pear".

From here, we can increase the flexibility of the expression using ., sometimes called a wildcard, which can match any character.

To understand the wildcard, consider searching for ".a." in y = c("apple", "banana", "pear"):

## [1] │ apple
## [2] │ <ban>ana
## [3] │ p<ear>

The patterns matching ".a." are highlighted above. Notice how each match has the character “a” surrounded by exactly one character on both sides, but the exact characters surrounding the “a” can be different.

$~$

Anchoring

By default, regex will match with any portion of a string, but it’s sometimes useful to anchor an expression to only a match characters at either the start or end of a string.

^ anchors matching to the start of a string
$ anchors matching to at the end of a string

fruits[str_detect(fruits, "e$")] ## Fruits ending in "e"

## [1] "Apple"     "Pineapple" "Orange"

fruits[str_detect(fruits, "^P")] ## Fruits starting with "P"

## [1] "Pineapple" "Pear"      "Peach"

Question #4: Use the wildcard character, ., and anchoring to find all strings that start with “P”, followed by any character, followed by “a”.

$~$

Repeated Characters

The meta-characters, * and + can be used to match patterns with a flexible number of repeated characters.

* is used to match 0 or more instances of the preceding character
+ is used to match 1 or more instances of the preceding character

fruits[str_detect(fruits, "ap*")] ## "a" followed by 0 or more instances of "p"
fruits[str_detect(fruits, "ap+")] ## "a" followed by 1 or more occurrences of "p"

## [1] "Pineapple" "Pear"      "Orange"    "Peach"     "Banana"   
## [1] "Pineapple"

In these examples, the usefulness of * isn’t obvious, but consider the following examples:

strings <- c("good", "goood", "goooood", "gooooood!")
str_detect(strings, "goo*d")  ## Handles

## [1] TRUE TRUE TRUE TRUE

fruits[str_detect(fruits, "^A.*e$")]  ## Fruits starting with "A", followed by 0 or more of any char, ending in "e"

## [1] "Apple"

Question #5: Suppose you want to detect variations of “good” with excessive o’s, but you do not want to detect “god”. Why won’t the expression "goo*d: work for this aim? How can you modify it to address the problem?

$~$

Other Meta-characters

In addition to ., *, +, ^, and $, other meta-characters include:

[] - indicates the literal interpretation of a character, or a limited set of exchangeable characters.
- [Pp] will match either “P” or “p”
{} - indicates a fixed number of repetitions of the preceding character(s).
- [0-9]{2} will match any two-digit number (ie: 01, 15, 78, etc.)
\\ - an escape character used to match something that is a meta-character.
- \\. will match the character . (a period appearing in a string)
- \\\\ is needed to match the character "\" (because \ is itself a special character)

In addition to these meta-characters, there are some special pattern shortcuts that are worth knowing:

\\d will match any digit
\\s will match any white space (ie: a space, a tab, or a newline)
[^abd] will match anything character other than “a”, “b” or “d”. When used inside a square parentheses ^ operates differently (recall it’s used in anchoring).
() can be used for organization and will not influence pattern matching.

To illustrate these meta-characters, consider the task of extracting 10-digit phone numbers from text string data:

phone_strings <- c("Home: 507-645-5489", 
                   "Cell: 219.917.9871", 
                   "My work phone is 507-202-2332", 
                   "I don't have a phone")

A valid US area code is 3 digits that start with a 2 or higher. In regex, this can be expressed as [2-9]\\d{2}
The area code is separated from remaining digits using a . or -. So, we now have [2-9]\\d{2}[-.]
Finally, the phone number should 3 digits, followed by a . or -, then 4 more digits, making the full expression [2-9]\\d{2}[-.]\\d{3}[-.]\\d{4}

phone_pattern = "[2-9]\\d{2}[-.]\\d{3}[-.]\\d{4}"
str_detect(phone_strings, phone_pattern)   ## Identify strings with matches

## [1]  TRUE  TRUE  TRUE FALSE

str_extract(phone_strings, phone_pattern)  ## Extract the matches

## [1] "507-645-5489" "219.917.9871" "507-202-2332" NA

It’s also possible to use stringr functions to help make sensitive information anonymous:

str_replace(phone_strings, phone_pattern, "XXX-XXX-XXXX")

## [1] "Home: XXX-XXX-XXXX"            "Cell: XXX-XXX-XXXX"           
## [3] "My work phone is XXX-XXX-XXXX" "I don't have a phone"

$~$

Matching Brackets and HTML Tags

In some circumstances you might want to use regex to identify brackets, such as [8] or (8), or html tags, such as <tag>.

Consider the following character strings:

out <- c("abc[8]", "abc[9][20]", "abc[9]def[10][7]", "abc[]")

The regex pattern "\\[([^]]*)\\]" will produce the following matches:

## [1] │ abc<[8]>
## [2] │ abc<[9]><[20]>
## [3] │ abc<[9]>def<[10]><[7]>
## [4] │ abc<[]>

The regex pattern "\\[(.*)\\]" will produce the following:

## [1] │ abc<[8]>
## [2] │ abc<[9][20]>
## [3] │ abc<[9]def[10][7]>
## [4] │ abc<[]>

Notice how [^]] is necessary to match any character besides a the right bracket ].

Question #6: Some people might choose to wrap the area code of their phone number in ( ). Modify the regular expression provided above so that it will successfully match phone numbers whose area codes use this syntax. Then, use string_extract() to extract the valid phone numbers and string_replace() convert them to a uniform style including only numbers and dashes (-).

phone_strings <- c("Home: (507)-645-5489", 
                   "Cell: 219.917.9871", 
                   "My work phone is 507-202-2332",
                   "I don't have a phone")

$~$

Date Components

As you’d expect, any date can be decomposed into its constituent components using the following functions:

Component	Function
Year	`year()`
Month	`month()`
Day	`day()`
Hour	`hour()`
Minute	`minute()`
Second	`second()`

There is also a milliseconds() function, but it isn’t broadly compatible with the standard date/time classes.

Shown below are quick examples of these functions:

date1 = as.POSIXct("05/24/2017 08:45", format = "%m/%d/%Y %H:%M", tz = "America/Chicago")

## Year
year(date1)

## Month
month(date1)

## Day
day(date1)

## Hour
hour(date1)

## Minute
minute(date1)

## Second
second(date1)

## [1] 2017
## [1] 5
## [1] 24
## [1] 8
## [1] 45
## [1] 0

You should be aware that there are a similarly named functions: hours(), $\ldots$, seconds(); however, these functions do not provide the same output as their singular counterparts:

## Example of the "hours" function
hours(date1)

## [1] "1495633500H 0M 0S"

Question #7: Compare the output of days() and day() when date1 (defined in the examples above) is used as an input. What do you think is being returned by days()? Can you confirm this? Hint: Consider the information about how R stores dates given in the lab’s preamble.

$~$

Common Date Calculations

The lubridate package also contains a handful of functions to help perform common date/time calculations:

Function	Output
`yday()`	day of the year (number from 1-365)
`wday()`	day of week (number from 1-7 or factor label when `label=TRUE` is used)
`floor_date()`	rounds the date downward
`ceiling_date()`	rounds the date upward
`round_date()`	rounds the date upward/downward (whichever is closer)

A few examples demonstrating these functions are given below:

## Day of year
yday(date1)

## Day of week
wday(date1)
wday(date1, label = TRUE)

## Rounding
floor_date(date1, unit = "month")
ceiling_date(date1, unit = "month")
round_date(date1, unit = "month")

## [1] 144
## [1] 4
## [1] Wed
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
## [1] "2017-05-01 CDT"
## [1] "2017-06-01 CDT"
## [1] "2017-06-01 CDT"

Question #8: Use as.POSIXct() to create a date/time object representing 9:15pm in Los Angeles on February 14, 2020. Then, round this date to the nearest day and determine which day of the week the result is.

$~$

Format Conversions

Arguably the biggest challenge when working with dates/times is the multiplicity of formats that can arise. For example, the date “May 12, 2017” might get recorded as any of the following:

May 12, 2017
5/12/2017
05-12-2017
2017-05-12
20170512

lubridate provides a collection of functions to standardize different date/time formats:

Function	Expected Input Format
`mdy()`	Month - Day - Year
`ymd()`	Year - Month - Day
`dmy()`	Day - Month - Year
`myd()`	Month - Year - Day

Each of these is accompanied by a related function that incorperate time. For example mdy() has the accompanying functions mdy_h(), mdy_hm(), and mdy_hms() depending upon whether the time component contains hours, hours and minutes, or hours and minutes and seconds.

Show below are several examples:

## Examples 1-3 (mdy)
mdy("May 12, 2017")
mdy("5/12/2017")
mdy("05-12-2017")

## [1] "2017-05-12"
## [1] "2017-05-12"
## [1] "2017-05-12"

## Examples 4-5 (ymd)
ymd("2017-05-12")
ymd("20170512")

## [1] "2017-05-12"
## [1] "2017-05-12"

## Additional examples (w/out time)
dmy("3rd May, 2019")
myd("May 2019, the 30th")

## [1] "2019-05-03"
## [1] "2019-05-30"

## Additional examples (w/ time)
mdy_hm("May 12, 2017 4:45pm", tz = "America/Chicago")
mdy_hms("05-12-2017 16:45:00", tz = "America/Chicago")

## [1] "2017-05-12 16:45:00 CDT"
## [1] "2017-05-12 16:45:00 CDT"

Question #9: On January 27th at 6:31 PM, the Apollo 1 spacecraft, planned to be the first manned mission of the Apollo space program, experienced a cabin fire on the landing pad in Cape Kennedy Air Force Station, Florida during a launch simulation, killing all three crew members on board. Nearly 19 years later, on January 28, 1986 at 11:39 AM, the Challenger Shuttle exploded just off the coast of Cape Canveral, Florida. Rounding each date to the nearest day, determine how many days passed between these two events.

apollo <- "1967 Jan 27th at 6:31:19 PM UTC"
challenger <- "28 January 1986, 1139am"

$~$

Times Without Dates

Sometimes you’ll encounter data consisting of times without an attached date. These could be times within a day such as “01:30:00” or 1:30 AM, but more commonly they’ll be a duration of time such as 1 hour, 30 minutes, and 0 seconds.

The lubridate package provides a simple storage class for times without dates via the hms() function. In the example below, this function is expressed using the namespace lubridate (ie: it is called using lubridate::hms()) because there is a different function named “hms” in the hms package that doesn’t behave interchangeably.

## Example
time1 = lubridate::hms("01:10:00")
60*hour(time1) + minute(time1)

## [1] 70

Here, we create an hms object from the string “00:10:00”, then we convert it to into minutes ourselves.

Because hms objects are stored as the number of seconds since 00:00:00, we can perform arithmetic with them directly:

lubridate::hms("01:10:00") - lubridate::hms("01:05:00")

## [1] "5M 0S"

We can also exploit this fact to easily convert results to seconds using pipelines:

(lubridate::hms("01:10:00") - lubridate::hms("01:05:00")) %>% seconds()

## [1] "300S"

$~$

Practice (required)

The data below comes from a real driving simulator experiment. The “disposition” file records the experimental participants, their assigned conditions, and the driving simulator output files that record a time-series of driver/vehicle inputs in a particular simulated drive. Output from the simulator are stored as DAQ file.

disp = read.csv("https://remiller1450.github.io/data/disposition.csv")
head(disp)

##   Analyze Reduced Ignore Discard                DaqPath
## 1               X             NA    Control\\C_001_POST
## 2               X             NA    Control\\C_002__PRE
## 3               X             NA    Control\\C_003__PRE
## 4               X             NA Occasional\\O_001__PRE
## 5               X             NA    Control\\C_004_POST
## 6               X             NA   Frequent\\F_001__PRE
##                              DaqName            Date
## 1 1_RuralRedLight_20180905113244.daq  9/5/2018 23:01
## 2 3_RuralRedLight_20180912092144.daq 9/12/2018 23:11
## 3 2_RuralRedLight_20180914094223.daq 9/14/2018 23:12
## 4 1_RuralRedLight_20180917093224.daq 9/17/2018 23:10
## 5 3_RuralRedLight_20180928135832.daq 9/28/2018 23:04
## 6 4_RuralRedLight_20181003103829.daq 10/3/2018 23:11

The columns “Analyze”, “Reduced”, “Ignore” and “Discard” record how the researchers intend to handle certain DAQ files. Often problems arise during a simulated drive that prevent it from being useable. Only files with an “X” for reduced should be analyzed.
The column “DaqPath” contains information about the driver, including:
- Their group, either “Control”, “Occasional”, or “Frequent”
- Their subject ID, a prefix of C/O/F followed by a three-digit number
- Their experimental condition, either “PRE” for pre-dosing, or “POST” for post-dosing
The column “DaqName” contains information about the drive, including:
- The drive scenario, in this disposition file the two scenarios are “RuralRedLight” and “RedLight”
- The order in which the participant experienced the stated scenario. For example, “1” means this was the first of four drives by that subject, “2” means it was the second of four drives by that subject, etc.
- A time stamp for when the DAQ was recorded. For example, the first row’s DAQ was record on Sept 5th, 2018. The numbers following the date are specific to the simulator’s internal clock and generally not meaningful.

Question #10: Exclude any drives that do not have an “X” in the reduced column. Then, using stringr functions as appropriate, process the information recorded in “disposition.csv” into a data frame with the following columns:

“Group” - either “Control”, “Occasional”, or “Frequent”
“SubjectID” - a string starting with “C”,” “O”, or “F” followed by an underscore and a 3-digit numeric identifier
“Treatment” - either “PRE” or “POST”
“Scenario” - either “RuralRedLight” or “RedLight”
“DriveNumber” - an integer between 1 and 4 indicating the ordering of the drive

Hints:

Remember that the expression \\\\ is needed to match a single \
You may be find the expression _[^\\d]*$ useful. Think about what it will will match.
I recommend creating each constituent vector separately, then using data.frame to assemble them into a data frame.

Printed below are the first 10 rows of the target data.frame in the requested format.

##         Group SubjectID Treatment      Scenario DriveNumber
## 1     Control     C_001      POST RuralRedLight           1
## 2     Control     C_002       PRE RuralRedLight           3
## 3     Control     C_003       PRE RuralRedLight           2
## 4  Occasional     O_001       PRE RuralRedLight           1
## 5     Control     C_004      POST RuralRedLight           3
## 6    Frequent     F_001       PRE RuralRedLight           4
## 7    Frequent     F_002       PRE RuralRedLight           4
## 8     Control     C_005      POST RuralRedLight           3
## 9    Frequent     F_003       PRE RuralRedLight           4
## 10 Occasional     O_002       PRE RuralRedLight           1

$~$

Question #11: The 2015 Boston Marathon took place on April 20th, 2015. It was the 119th running of one of the world’s most well-known races. The data below contain information, results, and splits for each finisher:

marathon = read.csv("https://remiller1450.github.io/data/BostonMarathon2015.csv")

Part A: A marathon is approximately 26.2 miles, making the first half 13.1 miles. Using this information, calculate the per mile pace (in seconds) for each participant in the first half of the race. Be sure to store your results.

Part B: Now calculate the pace per mile in the second half of the race. Be sure to store your results.

Part C: Now create a scatterplot displaying the relationship between pace per mile in the first half of the race vs. pace per mile in the second half of the race by age and sex. To facilitate this, you should assemble your results from Parts A and B into a data frame that also includes the “Age” and “M.F” columns from the original data. A target graphic is given below. Note: scale_x_time() and scale_y_time() can be used to display your first half and second half paces on a time scale. The graph shown below uses the argument alpha = 0.2 to reduce the impact of over-plotting, and a 45-degree line is added using geom_abline().

$~$

Question #12: (Only if extra time, do not do outside of class). The two CSV files below contain results from a real experiment on cannabis impaired driving conducted in an advanced driving simulator. The file “startdose.csv” is a list of participant IDs and the time at which each began a 10-min ad libitum dose of inhaled cannabis, and the file “high_effects.csv” records each participant’s self-reported feelings of “high” (on a 0-100 scale) at various points throughout the experiment.

dose = read.csv("https://remiller1450.github.io/data/startdose.csv")
high = read.csv("https://remiller1450.github.io/data/high_effects.csv")

Your task in this question is to recreate the visualization below: