# Please load the following packages (installing if necessary)
library(tidyr)
library(dplyr)
library(readr)
library(ggplot2)
Directions for all labs (read before starting)
The “Lab” section is something you will work on with a partner using paired programming, a framework defined as follows:
Partners are encouraged to switch roles throughout the “Lab” section, but for the first few labs the less experienced coder should spend more time as the driver.
\(~\)
The first part of this lab focuses on creating functions: the why, the how, and some best practices. The second of part of this lab is focused on on iterations (for loops, while loops, etc). The final part will deal with simulations.
Much of this comes from the R for Data Science online texts online text https://r4ds.had.co.nz/iteration.html, https://r4ds.hadley.nz/iteration.html, conversations with CS faculty (Prof. Will Rebelsky), and experience (i.e. learn from my mistakes).
\(~\)
\(~\)
%in%: returns the values of x that are in y. For example c(2,3,4) %in% c(2,5,6) would return TRUE FALSE FALSE. Sometimes it is useful to use the initial vectors and only return those values:
x=c(2,3,4)
y=c(2,5,6)
x%in%y #booleans
## [1] TRUE FALSE FALSE
x[x%in%y] #boolean indexing
## [1] 2
which(boolean): returns the indices where the boolean is true
which(x%in%y)
## [1] 1
\(~\)
Iterations are used when you are continuously repeating the same operation(s) on different values that are known in advance. Like functions, utilizing iterations helps to reduce code space and errors.
\(~\)
Suppose we want to create a function that does the following on a given vector input:
Let’s work with an example that has non numeric values such as the x below:
x <- c(1,2,3, "four", 5, "six", 7, 6, 2)
We want to remove all non-numeric values. Let’s see if we can figure out how to do that using the help function
?numeric
It looks like the “as.numeric” function may help, although it will give NAs for non numeric values. The na.omit function will be useful in this course, it removes values with “NA”.
x2<-as.vector(na.omit(as.numeric(x)))
x2
## [1] 1 2 3 5 7 6 2
We can see that this kept only the values that we wanted (1,2,3,5,7,6,2). I used the as.vector() function since we don’t care about the additional output from na.omit.
** is the R function for “take to the power of”
x3<-x2**3
x3
## [1] 1 8 27 125 343 216 8
For evidence that it worked:
x3/x2/x2/x2
## [1] 1 1 1 1 1 1 1
x4<-x3+x2-5
x4
## [1] -3 5 25 125 345 217 5
We will do this using boolean indices
current_over10=x4>10
current_over10
## [1] FALSE FALSE TRUE TRUE TRUE TRUE FALSE
original_less6=x2<6
original_less6
## [1] TRUE TRUE TRUE TRUE FALSE FALSE TRUE
to_keep=current_over10 & original_less6
to_keep
## [1] FALSE FALSE TRUE TRUE FALSE FALSE FALSE
x4[to_keep]
## [1] 25 125
This is what we have so far:
x <- c(1,2,3, "four", 5, "six", 7, 6, 2)
x2<-as.vector(na.omit(as.numeric(x)))
x2
## [1] 1 2 3 5 7 6 2
x3<-x2**3
x3
## [1] 1 8 27 125 343 216 8
x4<-x3+x2-5
x4
## [1] -3 5 25 125 345 217 5
current_over10=x4>10
current_over10
## [1] FALSE FALSE TRUE TRUE TRUE TRUE FALSE
original_less6=x2<6
original_less6
## [1] TRUE TRUE TRUE TRUE FALSE FALSE TRUE
to_keep=current_over10 & original_less6
to_keep
## [1] FALSE FALSE TRUE TRUE FALSE FALSE FALSE
x4[to_keep]
## [1] 25 125
To turn this into a function, we need to use the function syntax: example_function<-function(x){}. We know that x is our input so we can rewrite the above chunk as follows
x <- c(1,2,3, "four", 5, "six", 7, 6, 2)
example_function<-function(x){
x2<-as.vector(na.omit(as.numeric(x)))
x2
x3<-x2**3
x3
x4<-x3+x2-5
x4
current_over10=x4>10
current_over10
original_less6=x2<6
original_less6
to_keep=current_over10 & original_less6
to_keep
x4[to_keep]
}
example_function(x)
## [1] 25 125
x <- c(1,2,3, "four", 5, "six", 7, 6, 2)
example_function<-function(x){
x2<-as.vector(na.omit(as.numeric(x)))
x3<-x2**3
x4<-x3+x2-5
current_over10=x4>10
original_less6=x2<6
to_keep=current_over10 & original_less6
x4[to_keep]
}
example_function(x)
## [1] 25 125
x <- c(1,2,3, "four", 5, "six", 7, 6, 2)
example_function<-function(input){
input_num<-as.vector(na.omit(as.numeric(input))) # convert the input vector to a numeric vector and drop NAs
input_cube<-input_num**3 # cube input
out<-input_cube+input_num-5 # add cubed input to input then subtract 5
# The following lines filter the lower and upper values of the output
out_over10=out>10
input_less6=input_num<6
to_keep=out_over10 & input_less6
# print the output
out[to_keep]
}
example_function(x)
## [1] 25 125
x <- c(1,2,3, "four", 5, "six", 7, 6, 2)
example_function<-function(input){
input_num<-as.vector(na.omit(as.numeric(input))) # convert the input vector to a numeric vector and drop NAs
out<-input_num**3+input_num-5 # add cubed input to input then subtract 5
# The following lines filter the lower and upper values of the output
out_over10=out>10
input_less6=input_num<6
to_keep=out_over10 & input_less6
return(out[to_keep]) # Return the output
}
example_function(x)
## [1] 25 125
x1 <- c(1,2,3, "four", 5, "six", 7, 6, 2) #initial vector. Expected output: 15 125
x2 <- c(3, 7, "four", 2) #smaller vector. Expected output: 25
x3 <- c("a","b","c") #No Numbers. Expected output: None
x4 <- c(1,2,8,9,10) #All integers, none that should be printed (both above and below). Expected output: None
x5 <- c(4,5) #All integers, all should be printed. Expected output: 63 125
x6 <- c(1,2,3,4,5,6,7,8,9,10) #All integers, some should be printed. Expected output: 25, 63, 125
x7 <- c(5.1,4.1,3.9,7.1,6.8,55/10,75/10) #Non integer numbers. Expected output: 132.751 68.021 58.219 166.875
testVectors<-list(x1,x2,x3,x4,x5,x6,x7)
for (testVect in testVectors){ #we will work with Loops more on Thursday
print(example_function(testVect))
}
## Warning in na.omit(as.numeric(input)): NAs introduced by coercion
## [1] 25 125
## Warning in na.omit(as.numeric(input)): NAs introduced by coercion
## [1] 25
## Warning in na.omit(as.numeric(input)): NAs introduced by coercion
## numeric(0)
## numeric(0)
## [1] 63 125
## [1] 25 63 125
## [1] 132.751 68.021 58.219 166.875
for (i in 1:5){#Sequence, for every i in the range 1:5 do
print(i) #Body: Print i
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
i=1
while (i <6){#Sequence, while i is < 6
print(i) #Body: Print i, increment i
i=i+1
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
\(~\)
At this point you should begin working with your partner. This lab
will continue building on the fundamental aspects of R
introduced previously.
my_fun <- function(x = 99){
d2 <- (mean(x, na.rm = TRUE) - median(x, na.rm = TRUE))^2
return(d2)
}
## Try the function
ex_data <- c(0,2,10)
my_fun(x = ex_data)
## [1] 4
x = 99 dictates that the function should
have an argument “x”, and specifies a default value of 99.
If a user of the function doesn’t supply their own “x” argument the
function will use the value 99.return() defines the function’s output, in this example
the function will output the object d2.x and d2 only exist internally when
the function is being used.We will start with vector functions: Functions that take a vector or set of vectors and return a vector.
Question #1 Write a function that takes a vector, and then by element: squares the element and then adds 5. An example input and expected output are below:
## [1] 1.0 2.0 3.0 4.0 5.5
## [1] 6.00 9.00 14.00 21.00 35.25
Mutate functions are functions that work well inside mutate and filter because they return an output of the same length as the input. we will use z scores in a number of future labs. Since this is something we will use multiple times, it makes sense to create a function to do this.
zScore <- function(x) {
(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
Summarize functions are functions that return a single value for use in summarize(). For example, we could create a squared error function:
squaredError <- function(x) {
sum((x-mean(x))**2,na.rm=TRUE)
}
Question #2 Taken from the reading. Write bothNA(), a summary function that takes two vectors of the same length and returns the number of positions that have an NA in both vectors. Test your code on the following two vectors.
# 1st vector
x <- as.numeric(c(1,"a",3, "four", 5, "six", 7, 6))
# 2nd vector
x2 <- as.numeric(c(3, 7, "four", 2,5,"six", 1, 2))
bothNA(x,x2)
## [1] 1
Data frame functions work like dplyr verbs: they take a data frame as the first argument, some extra arguments that say what to do with it, and return a data frame or a vector.
Sometimes we need to use embracing. wrapping a variable in braces {} tells dplyr to use the value stored inside the argument not the argument as the variable name.
I will give you an example below of a dataframe function and embracing, but in this class we will not be building any of these
dropAndAdd<-function(df,col1,col2){
df=df%>%
filter(!is.na({{col1}}))%>%
filter(!is.na({{col2}}))%>%
mutate(sumCol1Col2={{col1}}+{{col2}})
return(df)
}
# Testing using airquality base R dataset
airquality2=dropAndAdd(airquality,Ozone,Solar.R)
airquality3<-airquality%>%
filter(!is.na(Ozone))%>%
filter(!is.na(Solar.R))%>%
mutate(sumCol1Col2=Ozone+Solar.R)
sum((airquality3==airquality2)==FALSE,na.rm=TRUE)
## [1] 0
Plot functions take a dataframe and return a plot. Once again, I will give you an example below, but in this class we will not be building any of these
For example, I make a significant number of histograms where the color and fill are based on the same variable, the colors are the same (so the borders match the fill), there is a minimal theme, and I use stacked bars.
basicHistogramPlot<-function(df,xCol,fillName,colorsToUse,alpha=0.5,bins=10){ #note that alpha and bins have defaults
outPlot=ggplot(data=df,aes(x={{xCol}},fill={{fillName}},color={{fillName}}))+
geom_histogram(position = "stack", alpha = alpha, bins=bins) +
scale_fill_manual(values = colorsToUse)+
scale_color_manual(values = colorsToUse)+
theme_minimal()
return(outPlot)
}
diamonds_data=diamonds
diamonds2 = diamonds_data%>%
filter(cut=='Fair')%>%
filter(clarity %in% c('VVS2','VVS1','IF'))%>%
filter(color %in% c('D','H','I','J'))
basicHistogramPlot(diamonds2,price,clarity,c("red","blue","green"),bins=2) #note that I overwrote the default bin, but not the default alpha
\(~\)
We discussed in Lab 1 that one of the advantages of user-defined functions is that we can reuse them in future projects. However, we didn’t discuss how to actually do that. In order to reuse functions, the best practice is to save them in an r script and then import them using source(). These functions are available here. Please import them.
Example if it’s in the same directory. If not, you will have to use the “setwd()” from last week or include the full filepath.
source("Week2functions.R")
Question #3: Demonstrate that you were successful by running the following
exampleFunction1(Week2functionsTest)
## [1] 334.5
exampleFunction2(Week2functionsTest)
## [1] -148.5
Question #4: In your own words, what do exampleFunction1 and exampleFunction2 do?
\(~\)
Question #5: Create a function named
top_cat() that accepts a categorical variable, “y”, and
returns the frequency of the most frequent category. You can accomplish
this using the table() and max() functions.
Next, include code that tests your function on the variable “Region” in
the colleges data set (the correct output should be
400).
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
Custom functions are most useful when you’d like to repeat an action several times with slightly different inputs. For example, you might want to find the squared distance between the mean and median for every numeric variable in a data set.
\(~\)
The following example uses the “cars” dataset and each of the 3 basic ways to loop over a vector are demonstrated.
cardata<-cars
print(head(cardata))
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
cardata2<-cars
#Loop over indices
for (i in 1:2){
cardata2[,i]<-cardata2[,i]*5
}
print(head(cardata2))
## speed dist
## 1 20 10
## 2 20 50
## 3 35 20
## 4 35 110
## 5 40 80
## 6 45 50
#loop over names
cardata3<-cars
for (col in colnames(cardata3)){
cardata3[,col]<-cardata3[,col]*5
}
print(head(cardata3))
## speed dist
## 1 20 10
## 2 20 50
## 3 35 20
## 4 35 110
## 5 40 80
## 6 45 50
#Loop over elements: difficult to save elements efficiently
cardata4<-cars
for (val in cardata4){
print(val[1:6]*5)
}
## [1] 20 20 35 35 40 45
## [1] 10 50 20 110 80 50
\(~\)
In addition to looping through vectors, a for loop is one way to iterate through the relevant portions of a data frame:
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
num_data <- colleges %>% select_if(is.numeric) ## Selects all columns that return TRUE to is.numeric
for(i in 1:ncol(num_data)){
print(my_fun(num_data[,i]))
}
## [1] 10274337
## [1] 0.0006547193
## [1] 0.5254721
## [1] 0.6265691
## [1] 0.6265691
## [1] 6919279
## [1] 1653630
## [1] 12303334
## [1] 6.353848e-06
## [1] 0.001075313
## [1] 0.001652711
## [1] 0.002807293
## [1] 9.5804e-05
## [1] 5.645276e-05
## [1] 1.662577e-05
## [1] 1.995113
## [1] 2548162
The code for(i in 1:ncol(num_data)) will create a
variable, “i”, which will increment along the sequence
1:ncol(num_data) each time the loop is run. This allows for
the ith column to given as an input to the
num_dat() function during the ith repetition of
the loop.
The example above merely prints the output of num_data()
for each numeric column; however, we’ll generally want to store these
results. To do this, we can set up an empty object before the loop and
assign values into one of its positions on each iteration:
## Create a blank numeric vector with a length equal to "ncol(num_data)"
colleges_d2 <- numeric(length = ncol(num_data))
## Loop and store
for(i in 1:ncol(num_data)){
colleges_d2[i] <- my_fun(num_data[,i])
}
print(colleges_d2)
## [1] 1.027434e+07 6.547193e-04 5.254721e-01 6.265691e-01 6.265691e-01
## [6] 6.919279e+06 1.653630e+06 1.230333e+07 6.353848e-06 1.075313e-03
## [11] 1.652711e-03 2.807293e-03 9.580400e-05 5.645276e-05 1.662577e-05
## [16] 1.995113e+00 2.548162e+06
Question #6: Write code that includes a for
loop and stores the frequency in the most common category for the
first three columns in the “colleges” data set using the function you
created in Question #5. Then, print the object storing these frequencies
(the correct result should be 3 24 127)
\(~\)
Question 7 Using the Diamonds dataset and a for loop, rescale all numeric values to the range [0,1].
Notes:
\(~\)
Question 8 Which is more general: A while loop or a for loop? That is, can you rewrite any for loop as a while loop? Any while loop as a for loop? Both can always be rewritten as the other, or neither can be guaranteed to be rewritten as the other.
If you claim that one type can be rewritten as the other type: Give me a series of steps do follow that work.
If you claim that one type can not be rewritten as the other type: give me an example that would fail.
\(~\)
Apply works over array margins, lapply on lists/vectors. I will give an example of both, but in general I use lapply more in R:
The apply() function can sometimes be used as an
alternative to a for loop. Consider the example below:
apply(num_data, MARGIN = 2, FUN = my_fun)
## Enrollment Adm_Rate ACT_median
## 1.027434e+07 6.547193e-04 5.254721e-01
## ACT_Q1 ACT_Q3 Cost
## 6.265691e-01 6.265691e-01 6.919279e+06
## Net_Tuition Avg_Fac_Salary PercentFemale
## 1.653630e+06 1.230333e+07 6.353848e-06
## PercentWhite PercentBlack PercentHispanic
## 1.075313e-03 1.652711e-03 2.807293e-03
## PercentAsian FourYearComp_Males FourYearComp_Females
## 9.580400e-05 5.645276e-05 1.662577e-05
## Debt_median Salary10yr_median
## 1.995113e+00 2.548162e+06
The argument MARGIN dictates which of the object’s
dimensions the function should be iterated across, with
MARGIN = 1 going through the rows, and
MARGIN = 2 going through the columns.
We also could have used lapply to make our tests more readable in the Example 1 function in the preamble
x1 <- c(1,2,3, "four", 5, "six", 7, 6, 2) #initial vector. Expected output: 15 125
x2 <- c(3, 7, "four", 2) #smaller vector. Expected output: 25
x3 <- c("a","b","c") #No Numbers. Expected output: None
x4 <- c(1,2,8,9,10) #All integers, none that should be printed (both above and below). Expected output: None
x5 <- c(4,5) #All integers, all should be printed. Expected output: 63 125
x6 <- c(1,2,3,4,5,6,7,8,9,10) #All integers, some should be printed. Expected output: 25, 63, 125
x7 <- c(5.1,4.1,3.9,7.1,6.8,55/10,75/10) #Non integer numbers. Expected output: 132.751 68.021 58.219 166.875
testVectors<-list(x1,x2,x3,x4,x5,x6,x7)
lapply(testVectors,example_function)
## Warning in na.omit(as.numeric(input)): NAs introduced by coercion
## Warning in na.omit(as.numeric(input)): NAs introduced by coercion
## Warning in na.omit(as.numeric(input)): NAs introduced by coercion
## [[1]]
## [1] 25 125
##
## [[2]]
## [1] 25
##
## [[3]]
## numeric(0)
##
## [[4]]
## numeric(0)
##
## [[5]]
## [1] 63 125
##
## [[6]]
## [1] 25 63 125
##
## [[7]]
## [1] 132.751 68.021 58.219 166.875
another example
x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE,TRUE))
# compute the list mean for each list element
lapply(x, mean)
## $a
## [1] 5.5
##
## $beta
## [1] 4.535125
##
## $logic
## [1] 0.5
Question #9: Use apply() to find the
frequency in the most common category for the first three columns in the
“colleges” data set. You should prepare your input data beforehand by
using either the select() function or subsetting using
indices.
Question #10: Using the Diamonds dataset. Write a function to calculate the mean absolute cubed difference of the values in the numeric columns when compared to the mean. (i.e. calculate the mean by column, subtract the mean from each value, take the absolute third power, take the mean of the column) . Then use lapply() to test your function.
As an example, your number for carat should match the below:
lapply(diamonds,q10Func)$carat
## Warning in mean.default(x): argument is not numeric or logical: returning NA
## Warning in Ops.ordered(x, mean(x)): '-' is not meaningful for ordered factors
## Warning in mean.default(x): argument is not numeric or logical: returning NA
## Warning in Ops.ordered(x, mean(x)): '-' is not meaningful for ordered factors
## Warning in mean.default(x): argument is not numeric or logical: returning NA
## Warning in Ops.ordered(x, mean(x)): '-' is not meaningful for ordered factors
## [1] 0.1876116
\(~\)
References: - https://pubs.wsb.wisc.edu/academics/analytics-using-r-2019/simulation-basics.html - https://web.stanford.edu/class/bios221/labs/simulation/lab_3_simulation.html
Simulation in Statistics/Data Science is the use of computer simulations rather than collected data. This can be useful in a number of circumstances including estimating a p value. This is especially useful when collecting data or calculating a probability via the underlying distribution is complicated.
Theory:
Running a Simulation:
\(~\)
For example, how likely is it that I flip at least 700 heads out of 1000 coin flips of a weighted coin that comes up heads 65% of the time?
Note: so as to not use an explicit loop, I will be using the “replicate” function in R which is a wrapper of the “lapply” function (above)
set.seed(0) #set the seed to 0 for replicability
prob_heads=0.65 #65% probability of heads
x=replicate(10000, #Replicate the simulation 10,000 times
sum(runif(1000)<=prob_heads)) #simulate 1000 head flips of the weighted coin
Let’s look at the plot
ggplot(data.frame(x),aes(x=x))+
geom_histogram() +
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
It doesn’t look like there are very many cases where this is true. In
fact, when we check directly with our simulation, we see that
approximately 0.06% of the time would we expect 700 or more heads.
sum(x>=700)/10000
## [1] 6e-04
\(~\)
Proof of the use of the seed: 100% of x and y should match, minimal x and z should match.
set.seed(0) #set the seed to 0 for replicability
prob_heads=0.65 #65% probability of heads
y=replicate(10000, #Replicate the simulation 10,000 times
sum(runif(1000)<=prob_heads)) #simulate 1000 head flips of the weighted coin
sum((x==y)==TRUE,na.rm=TRUE)/10000
## [1] 1
z=replicate(10000, #Replicate the simulation 10,000 times
sum(runif(1000)<=prob_heads)) #simulate 1000 head flips of the weighted coin
sum((x==z)==TRUE,na.rm=TRUE)/10000
## [1] 0.0179
\(~\)
You can also run simulations with multiple underlying datasets.
Question #11 Determine the likelihood of winning the following game:
Part a Using Simulation with the following parameters:
Part b Using what you remember from STA 209, calculate it directly.
Part c Are these values the same? If not, are they close? (within 1%)