# Please load the following packages (installing if necessary)
library(tidyr)
library(dplyr)
library(readr)
library(ggplot2)

Directions for all labs (read before starting)

  1. Please work together with your assigned partner. Make sure you both fully understand something before moving on.
  2. Record your answers to lab questions separately from the lab’s examples. You and your partner should only turn in responses to lab questions, nothing more and nothing less.
  3. Ask for help, clarification, or even just a check-in if anything seems unclear.

The “Lab” section is something you will work on with a partner using paired programming, a framework defined as follows:

Partners are encouraged to switch roles throughout the “Lab” section, but for the first few labs the less experienced coder should spend more time as the driver.

\(~\)

Preamble

The first part of this lab focuses on creating functions: the why, the how, and some best practices. The second of part of this lab is focused on on iterations (for loops, while loops, etc). The final part will deal with simulations.

Much of this comes from the R for Data Science online texts online text https://r4ds.had.co.nz/iteration.html, https://r4ds.hadley.nz/iteration.html, conversations with CS faculty (Prof. Will Rebelsky), and experience (i.e. learn from my mistakes).

\(~\)

Goals

  • Functions
    • What are they
    • Why should we use them
    • How should we use them
    • Best practices
    • Importing User-Defined Functions
  • Iterations
    • For Loops
    • While Loops
    • Apply
  • Simulations

\(~\)

Useful built in functions:

%in%: returns the values of x that are in y. For example c(2,3,4) %in% c(2,5,6) would return TRUE FALSE FALSE. Sometimes it is useful to use the initial vectors and only return those values:

x=c(2,3,4)
y=c(2,5,6)
x%in%y #booleans
## [1]  TRUE FALSE FALSE
x[x%in%y] #boolean indexing
## [1] 2

which(boolean): returns the indices where the boolean is true

which(x%in%y)
## [1] 1

Functions

Why create and use functions:

  • Readability: a well named function tells the reader what is going on. It also makes reading long sections of code easier if much of it is condensed to a single line/name.
  • Ease of updates and minimization of mistakes: you only need to change one place not many if your use changes
  • Ease of use: it is much easier to reuse a function from project to project.

When to use functions:

  • If you find yourself copy-pasting the same code more than twice
  • When you believe you will be reusing the code in the future

When not to use functions:

  • If this is a one time use. Don’t over-generalize

How to create functions:

  • It is generally easiest to start with a working example and then generalize (e.g. the magic numbers from Homework 1)
  • A function consists of a name, the arguments, and the body of the function.
    • Naming conventions:
      • Names should be consistent using either camelCase or snake_case
      • Similar functions should have similar prefixes to make Rs auto completion easier for you
      • DO NOT Overwrite existing functions: e.g. don’t name a new function “sum”
    • Arguments:
      • The inputs to the function
      • you can include default values with the normal way we assign values (e.g. var1=1)
    • Body:
      • What the function does
    • General Format:
      • functionName=function(arguments){Body return(output)}
  • In R we tend to have 3 types of functions:
    • Vector functions:
      • Input: Vector(s)
      • Output: Vector(s)
    • Data Frame Functions:
      • Input: Data Frame
      • Output: Data Frame
    • Plot Functions:
      • Input: Dataframe
      • Output: plot
  • It is good practice to test your code
    • In this class: informal tests will normally be sufficient
    • In general: test suites, including edge cases, are very helpful
  • Comments are important
    • I disagree with the text, explicitly commenting what a line should do can make bug finding easier
    • I like to include a comment with the expected inputs and outputs

\(~\)

Iterations

Iterations are used when you are continuously repeating the same operation(s) on different values that are known in advance. Like functions, utilizing iterations helps to reduce code space and errors.

  • Imperative Loops:
    • For Loops
    • While Loops
    • Contents:
      • Sequence (what you iterate over)
      • Body (What you do)
      • (output) (what you return, not always explicit)
  • Functionals:
    • In R you can call functions within functions.
    • You could also turn a loop into a general function and then call the function
    • map functions: loop over a vector, do something, save the results
      • map() makes a list. There is a map_X() for every standard data type x
      • input: vector, function
      • output: vector where the function has been applied to every input
      • map() comes from the purrr package (written in c: faster but less legible)
    • Base R equivalent: lapply

\(~\)

Examples

Example 1: Writing a function

Suppose we want to create a function that does the following on a given vector input:

  1. remove all non-numeric values
  2. create a new vector with the cubed numeric values of the original vector
  3. adds the vectors together then subtracts 5 from every value
  4. returns the values of the resulting vector greater than 10 with inputs less than 6

Step 1 Remove non-numeric values

Let’s work with an example that has non numeric values such as the x below:

x <- c(1,2,3, "four", 5, "six", 7, 6, 2)

We want to remove all non-numeric values. Let’s see if we can figure out how to do that using the help function

?numeric

It looks like the “as.numeric” function may help, although it will give NAs for non numeric values. The na.omit function will be useful in this course, it removes values with “NA”.

x2<-as.vector(na.omit(as.numeric(x)))
x2
## [1] 1 2 3 5 7 6 2

We can see that this kept only the values that we wanted (1,2,3,5,7,6,2). I used the as.vector() function since we don’t care about the additional output from na.omit.

Step 2: cube the new vector

** is the R function for “take to the power of”

x3<-x2**3
x3
## [1]   1   8  27 125 343 216   8

For evidence that it worked:

x3/x2/x2/x2
## [1] 1 1 1 1 1 1 1

Step 3: add the vectors together, subtract 5

x4<-x3+x2-5
x4
## [1]  -3   5  25 125 345 217   5

Step 4 subset to original values <6 and current values > 10

We will do this using boolean indices

current_over10=x4>10
current_over10
## [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
original_less6=x2<6
original_less6
## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
to_keep=current_over10 & original_less6
to_keep
## [1] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
x4[to_keep]
## [1]  25 125

Combine and Generalize

This is what we have so far:

x <- c(1,2,3, "four", 5, "six", 7, 6, 2)
x2<-as.vector(na.omit(as.numeric(x)))
x2
## [1] 1 2 3 5 7 6 2
x3<-x2**3
x3
## [1]   1   8  27 125 343 216   8
x4<-x3+x2-5
x4
## [1]  -3   5  25 125 345 217   5
current_over10=x4>10
current_over10
## [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
original_less6=x2<6
original_less6
## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
to_keep=current_over10 & original_less6
to_keep
## [1] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
x4[to_keep]
## [1]  25 125

To turn this into a function, we need to use the function syntax: example_function<-function(x){}. We know that x is our input so we can rewrite the above chunk as follows

x <- c(1,2,3, "four", 5, "six", 7, 6, 2)
example_function<-function(x){
x2<-as.vector(na.omit(as.numeric(x)))
x2
x3<-x2**3
x3
x4<-x3+x2-5
x4
current_over10=x4>10
current_over10
original_less6=x2<6
original_less6

to_keep=current_over10 & original_less6
to_keep
x4[to_keep]
}
example_function(x)
## [1]  25 125

Good Practices

  • We shouldn’t print the intermediate values
x <- c(1,2,3, "four", 5, "six", 7, 6, 2)
example_function<-function(x){
x2<-as.vector(na.omit(as.numeric(x)))
x3<-x2**3
x4<-x3+x2-5
current_over10=x4>10
original_less6=x2<6
to_keep=current_over10 & original_less6
x4[to_keep]
}
example_function(x)
## [1]  25 125
  • names should be interpretable, and we should have comments
x <- c(1,2,3, "four", 5, "six", 7, 6, 2)

example_function<-function(input){
input_num<-as.vector(na.omit(as.numeric(input))) # convert the input vector to a numeric vector and drop NAs
input_cube<-input_num**3 # cube input
out<-input_cube+input_num-5 # add cubed input to input then subtract 5
# The following lines filter the lower and upper values of the output
out_over10=out>10
input_less6=input_num<6
to_keep=out_over10 & input_less6
# print the output
out[to_keep]
}
example_function(x)
## [1]  25 125
  • We shouldn’t store unnecessary values, we should return the output as a vector so it can be used
x <- c(1,2,3, "four", 5, "six", 7, 6, 2)

example_function<-function(input){
input_num<-as.vector(na.omit(as.numeric(input))) # convert the input vector to a numeric vector and drop NAs
out<-input_num**3+input_num-5 # add cubed input to input then subtract 5
# The following lines filter the lower and upper values of the output
out_over10=out>10
input_less6=input_num<6
to_keep=out_over10 & input_less6
return(out[to_keep]) # Return the output
}
example_function(x)
## [1]  25 125
  • We should test the function on other inputs to check that it works
x1 <- c(1,2,3, "four", 5, "six", 7, 6, 2) #initial vector. Expected output: 15 125
x2 <- c(3, 7, "four", 2) #smaller vector. Expected output: 25
x3 <- c("a","b","c") #No Numbers. Expected output: None
x4 <- c(1,2,8,9,10) #All integers, none that should be printed (both above and below). Expected output: None
x5 <- c(4,5) #All integers, all should be printed. Expected output: 63 125
x6 <- c(1,2,3,4,5,6,7,8,9,10) #All integers, some should be printed. Expected output: 25, 63, 125
x7 <- c(5.1,4.1,3.9,7.1,6.8,55/10,75/10) #Non integer numbers. Expected output: 132.751  68.021  58.219 166.875
testVectors<-list(x1,x2,x3,x4,x5,x6,x7)

for (testVect in testVectors){ #we will work with Loops more on Thursday
  print(example_function(testVect))
}
## Warning in na.omit(as.numeric(input)): NAs introduced by coercion
## [1]  25 125
## Warning in na.omit(as.numeric(input)): NAs introduced by coercion
## [1] 25
## Warning in na.omit(as.numeric(input)): NAs introduced by coercion
## numeric(0)
## numeric(0)
## [1]  63 125
## [1]  25  63 125
## [1] 132.751  68.021  58.219 166.875

Example 2: For loop

for (i in 1:5){#Sequence, for every i in the range 1:5 do
  print(i) #Body: Print i
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

Example 3: While loop

i=1
while (i <6){#Sequence, while i is < 6
  print(i) #Body: Print i, increment i
  i=i+1
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

\(~\)


Lab

At this point you should begin working with your partner. This lab will continue building on the fundamental aspects of R introduced previously.

Part 1: Functions

Default Values

my_fun <- function(x = 99){
  d2 <- (mean(x, na.rm = TRUE) - median(x, na.rm = TRUE))^2
  return(d2)
}

## Try the function
ex_data <- c(0,2,10)
my_fun(x = ex_data)
## [1] 4
  • The syntax x = 99 dictates that the function should have an argument “x”, and specifies a default value of 99. If a user of the function doesn’t supply their own “x” argument the function will use the value 99.
  • return() defines the function’s output, in this example the function will output the object d2.
  • Note that functions use their own local environment, so the objects x and d2 only exist internally when the function is being used.

Vector Functions

Basic Vector Functions

We will start with vector functions: Functions that take a vector or set of vectors and return a vector.

Question #1 Write a function that takes a vector, and then by element: squares the element and then adds 5. An example input and expected output are below:

## [1] 1.0 2.0 3.0 4.0 5.5
## [1]  6.00  9.00 14.00 21.00 35.25
Mutate Functions

Mutate functions are functions that work well inside mutate and filter because they return an output of the same length as the input. we will use z scores in a number of future labs. Since this is something we will use multiple times, it makes sense to create a function to do this.

zScore <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
Summarize Functions

Summarize functions are functions that return a single value for use in summarize(). For example, we could create a squared error function:

squaredError <- function(x) {
  sum((x-mean(x))**2,na.rm=TRUE)
}

Question #2 Taken from the reading. Write bothNA(), a summary function that takes two vectors of the same length and returns the number of positions that have an NA in both vectors.

## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## [1] "Vector1: " "1"         "NA"        "3"         "NA"        "5"        
## [7] "NA"        "7"         "6"
## [1] "Vector2: " "3"         "7"         "NA"        "2"         "5"        
## [7] "NA"        "1"         "2"
## [1] "Expected Output: " "1"

Dataframe Functions

Data frame functions work like dplyr verbs: they take a data frame as the first argument, some extra arguments that say what to do with it, and return a data frame or a vector.

Sometimes we need to use embracing. wrapping a variable in braces {} tells dplyr to use the value stored inside the argument not the argument as the variable name.

I will give you an example below of a dataframe function and embracing, but in this class we will not be building any of these

dropAndAdd<-function(df,col1,col2){
  df=df%>%
    filter(!is.na({{col1}}))%>%
    filter(!is.na({{col2}}))%>%
    mutate(sumCol1Col2={{col1}}+{{col2}})
  return(df)
}

# Testing using airquality base R dataset

airquality2=dropAndAdd(airquality,Ozone,Solar.R)

airquality3<-airquality%>%
  filter(!is.na(Ozone))%>%
  filter(!is.na(Solar.R))%>%
  mutate(sumCol1Col2=Ozone+Solar.R)

sum((airquality3==airquality2)==FALSE,na.rm=TRUE)
## [1] 0

Plot Functions

Plot functions take a dataframe and return a plot. Once again, I will give you an example below, but in this class we will not be building any of these

For example, I make a significant number of histograms where the color and fill are based on the same variable, the colors are the same (so the borders match the fill), there is a minimal theme, and I use stacked bars.

basicHistogramPlot<-function(df,xCol,fillName,colorsToUse,alpha=0.5,bins=10){ #note that alpha and bins have defaults
  outPlot=ggplot(data=df,aes(x={{xCol}},fill={{fillName}},color={{fillName}}))+
    geom_histogram(position = "stack", alpha = alpha, bins=bins) +
    scale_fill_manual(values = colorsToUse)+
    scale_color_manual(values = colorsToUse)+
    theme_minimal()
  return(outPlot)
}
diamonds_data=diamonds
diamonds2 = diamonds_data%>%
  filter(cut=='Fair')%>%
  filter(clarity %in% c('VVS2','VVS1','IF'))%>%
  filter(color %in% c('D','H','I','J'))

basicHistogramPlot(diamonds2,price,clarity,c("red","blue","green"),bins=2) #note that I overwrote the default bin, but not the default alpha

\(~\)

Importing User-Defined Functions

We discussed in Lab 1 that one of the advantages of user-defined functions is that we can reuse them in future projects. However, we didn’t discuss how to actually do that. In order to reuse functions, the best practice is to save them in an r script and then import them using source(). These functions are available here. Please import them.

Example if it’s in the same directory. If not, you will have to use the “setwd()” from last week or include the full filepath.

source("Week2functions.R")

Question #3: Demonstrate that you were successful by running the following

exampleFunction1(Week2functionsTest)
## [1] 334.5
exampleFunction2(Week2functionsTest)
## [1] -148.5

Question #4: In your own words, what do exampleFunction1 and exampleFunction2 do?

\(~\)

More Function Practice

Question #5: Create a function named top_cat() that accepts a categorical variable, “y”, and returns the frequency of the most frequent category. You can accomplish this using the table() and max() functions. Next, include code that tests your function on the variable “Region” in the colleges data set (the correct output should be 400).

colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")

Custom functions are most useful when you’d like to repeat an action several times with slightly different inputs. For example, you might want to find the squared distance between the mean and median for every numeric variable in a data set.

\(~\)

Part 2: Iteration

Loops

  • There are 4 types of loops you will generally use in R:
    • For Loops
      • Modifying an existing object
      • Looping over names instead of indices
    • While Loops
      • Handling outputs of unknown length
      • Handling sequences of unknown length
  • There are 3 basic ways to loop over a vector:
    • Loop over indices (e.g. i in 1:10)
    • Loop over elements
    • Loop over names

The following example uses the “cars” dataset and each of the 3 basic ways to loop over a vector are demonstrated.

cardata<-cars
print(head(cardata))
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
cardata2<-cars
#Loop over indices
for (i in 1:2){
  cardata2[,i]<-cardata2[,i]*5
}
print(head(cardata2))
##   speed dist
## 1    20   10
## 2    20   50
## 3    35   20
## 4    35  110
## 5    40   80
## 6    45   50
#loop over names
cardata3<-cars
for (col in colnames(cardata3)){
  cardata3[,col]<-cardata3[,col]*5
}
print(head(cardata3))
##   speed dist
## 1    20   10
## 2    20   50
## 3    35   20
## 4    35  110
## 5    40   80
## 6    45   50
#Loop over elements: difficult to save elements efficiently
cardata4<-cars
for (val in cardata4){
  print(val[1:6]*5)
}
## [1] 20 20 35 35 40 45
## [1]  10  50  20 110  80  50

\(~\)

In addition to looping through vectors, a for loop is one way to iterate through the relevant portions of a data frame:

colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
num_data <- colleges %>% select_if(is.numeric)  ## Selects all columns that return TRUE to is.numeric

for(i in 1:ncol(num_data)){ 
   print(my_fun(num_data[,i]))
}
## [1] 10274337
## [1] 0.0006547193
## [1] 0.5254721
## [1] 0.6265691
## [1] 0.6265691
## [1] 6919279
## [1] 1653630
## [1] 12303334
## [1] 6.353848e-06
## [1] 0.001075313
## [1] 0.001652711
## [1] 0.002807293
## [1] 9.5804e-05
## [1] 5.645276e-05
## [1] 1.662577e-05
## [1] 1.995113
## [1] 2548162

The code for(i in 1:ncol(num_data)) will create a variable, “i”, which will increment along the sequence 1:ncol(num_data) each time the loop is run. This allows for the ith column to given as an input to the num_dat() function during the ith repetition of the loop.

The example above merely prints the output of num_data() for each numeric column; however, we’ll generally want to store these results. To do this, we can set up an empty object before the loop and assign values into one of its positions on each iteration:

## Create a blank numeric vector with a length equal to "ncol(num_data)"
colleges_d2 <- numeric(length = ncol(num_data))

## Loop and store
for(i in 1:ncol(num_data)){ 
  colleges_d2[i] <- my_fun(num_data[,i])
}

print(colleges_d2)
##  [1] 1.027434e+07 6.547193e-04 5.254721e-01 6.265691e-01 6.265691e-01
##  [6] 6.919279e+06 1.653630e+06 1.230333e+07 6.353848e-06 1.075313e-03
## [11] 1.652711e-03 2.807293e-03 9.580400e-05 5.645276e-05 1.662577e-05
## [16] 1.995113e+00 2.548162e+06

Question #6: Write code that includes a for loop and stores the frequency in the most common category for the first three columns in the “colleges” data set using the function you created in Question #5. Then, print the object storing these frequencies (the correct result should be 3 24 127)

\(~\)

Loop Practice

Question 7 Using the Diamonds dataset and a for loop, rescale all numeric values to the range [0,1].

Notes:

  • Carat is numeric, Cut is not.
  • I don’t care which of the standard ways to normalize/rescale data you use as long as you are consistent.
  • DO NOT write or use a built in function to rescale the values.
  • I do not care if you keep the non numeric values, if they become NA, or if you drop them for this question

\(~\)

Question 8 Which is more general: A while loop or a for loop? That is, can you rewrite any for loop as a while loop? Any while loop as a for loop? Both can always be rewritten as the other, or neither can be guaranteed to be rewritten as the other.

If you claim that one type can be rewritten as the other type: Give me a series of steps do follow that work.

If you claim that one type can not be rewritten as the other type: give me an example that would fail.

\(~\)

Apply/lapply

Apply works over array margins, lapply on lists/vectors. I will give an example of both, but in general I use lapply more in R:

  • input: vector/list, function
  • output: list where the function has been applied to every input

The apply() function can sometimes be used as an alternative to a for loop. Consider the example below:

apply(num_data, MARGIN = 2, FUN = my_fun)
##           Enrollment             Adm_Rate           ACT_median 
##         1.027434e+07         6.547193e-04         5.254721e-01 
##               ACT_Q1               ACT_Q3                 Cost 
##         6.265691e-01         6.265691e-01         6.919279e+06 
##          Net_Tuition       Avg_Fac_Salary        PercentFemale 
##         1.653630e+06         1.230333e+07         6.353848e-06 
##         PercentWhite         PercentBlack      PercentHispanic 
##         1.075313e-03         1.652711e-03         2.807293e-03 
##         PercentAsian   FourYearComp_Males FourYearComp_Females 
##         9.580400e-05         5.645276e-05         1.662577e-05 
##          Debt_median    Salary10yr_median 
##         1.995113e+00         2.548162e+06

The argument MARGIN dictates which of the object’s dimensions the function should be iterated across, with MARGIN = 1 going through the rows, and MARGIN = 2 going through the columns.

We also could have used lapply to make our tests more readable in the Example 1 function in the preamble

x1 <- c(1,2,3, "four", 5, "six", 7, 6, 2) #initial vector. Expected output: 15 125
x2 <- c(3, 7, "four", 2) #smaller vector. Expected output: 25
x3 <- c("a","b","c") #No Numbers. Expected output: None
x4 <- c(1,2,8,9,10) #All integers, none that should be printed (both above and below). Expected output: None
x5 <- c(4,5) #All integers, all should be printed. Expected output: 63 125
x6 <- c(1,2,3,4,5,6,7,8,9,10) #All integers, some should be printed. Expected output: 25, 63, 125
x7 <- c(5.1,4.1,3.9,7.1,6.8,55/10,75/10) #Non integer numbers. Expected output: 132.751  68.021  58.219 166.875
testVectors<-list(x1,x2,x3,x4,x5,x6,x7)

lapply(testVectors,example_function)
## Warning in na.omit(as.numeric(input)): NAs introduced by coercion
## Warning in na.omit(as.numeric(input)): NAs introduced by coercion
## Warning in na.omit(as.numeric(input)): NAs introduced by coercion
## [[1]]
## [1]  25 125
## 
## [[2]]
## [1] 25
## 
## [[3]]
## numeric(0)
## 
## [[4]]
## numeric(0)
## 
## [[5]]
## [1]  63 125
## 
## [[6]]
## [1]  25  63 125
## 
## [[7]]
## [1] 132.751  68.021  58.219 166.875

another example

x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE,TRUE))
# compute the list mean for each list element
lapply(x, mean)
## $a
## [1] 5.5
## 
## $beta
## [1] 4.535125
## 
## $logic
## [1] 0.5
lapply practice

Question #9: Use apply() to find the frequency in the most common category for the first three columns in the “colleges” data set. You should prepare your input data beforehand by using either the select() function or subsetting using indices.

Question #10: Using the Diamonds dataset. Write a function to calculate the mean absolute cubed difference of the values in the numeric columns when compared to the mean. (i.e. calculate the mean by column, subtract the mean from each value, take the absolute third power, take the mean of the column) . Then use lapply() to test your function.

As an example, your number for carat should match the below:

lapply(diamonds,q10Func)$carat
## Warning in mean.default(x): argument is not numeric or logical: returning NA
## Warning in Ops.ordered(x, mean(x)): '-' is not meaningful for ordered factors
## Warning in mean.default(x): argument is not numeric or logical: returning NA
## Warning in Ops.ordered(x, mean(x)): '-' is not meaningful for ordered factors
## Warning in mean.default(x): argument is not numeric or logical: returning NA
## Warning in Ops.ordered(x, mean(x)): '-' is not meaningful for ordered factors
## [1] 0.1876116

\(~\)

Part 3: Simulation

References: - https://pubs.wsb.wisc.edu/academics/analytics-using-r-2019/simulation-basics.html - https://web.stanford.edu/class/bios221/labs/simulation/lab_3_simulation.html

Simulation in Statistics/Data Science is the use of computer simulations rather than collected data. This can be useful in a number of circumstances including estimating a p value. This is especially useful when collecting data or calculating a probability via the underlying distribution is complicated.

Theory:

  • Law of Large Numbers: (STA 209)
    • If you’ve forgotten this, the wiki article is useful https://en.wikipedia.org/wiki/Law_of_large_numbers
    • Given a sample of independent and identically distributed values, the sample mean converges to the true mean.
    • As long as a finite expected value exists for the underlying distribution, in the long run sample average converges to the expected value
  • Central Limit Theorem: (STA 209?)
  • Overall:
    • We know that a large sample should have the same mean as the expected value of the underlying distribution (LLN)
    • We know that repeated sample means should be normally distributed (CLT)
    • Therefore we can simulate multiple samples and analyze the distribution of the results using the statistical methods we have for normal distributions

Running a Simulation:

  • Assumptions:
    • Choice of Distribution
    • Choice of Parameters within the distribution
  • Good Practice:
    • Set a seed

\(~\)

For example, how likely is it that I flip at least 700 heads out of 1000 coin flips of a weighted coin that comes up heads 65% of the time?

Note: so as to not use an explicit loop, I will be using the “replicate” function in R which is a wrapper of the “lapply” function (above)

set.seed(0) #set the seed to 0 for replicability
prob_heads=0.65 #65% probability of heads
x=replicate(10000, #Replicate the simulation 10,000 times
            sum(runif(1000)<=prob_heads)) #simulate 1000 head flips of the weighted coin

Let’s look at the plot

ggplot(data.frame(x),aes(x=x))+
  geom_histogram() +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It doesn’t look like there are very many cases where this is true. In fact, when we check directly with our simulation, we see that approximately 0.06% of the time would we expect 700 or more heads.

sum(x>=700)/10000
## [1] 6e-04

\(~\)

Proof of the use of the seed: 100% of x and y should match, minimal x and z should match.

set.seed(0) #set the seed to 0 for replicability
prob_heads=0.65 #65% probability of heads
y=replicate(10000, #Replicate the simulation 10,000 times
            sum(runif(1000)<=prob_heads)) #simulate 1000 head flips of the weighted coin
sum((x==y)==TRUE,na.rm=TRUE)/10000
## [1] 1
z=replicate(10000, #Replicate the simulation 10,000 times
            sum(runif(1000)<=prob_heads)) #simulate 1000 head flips of the weighted coin
sum((x==z)==TRUE,na.rm=TRUE)/10000
## [1] 0.0179

\(~\)

You can also run simulations with multiple underlying datasets.

Question #11 Determine the likelihood of winning the following game:

  • Roll a die, then flip a coin (assume 6 sided fair die, 2 sided fair coin)
  • If the die is 5 or 6, or the coin is heads you win
  • You play this game with your friend 100 times. What percent of the games would you expect to win

Part a Using Simulation with the following parameters:

  • Seed of 2024
  • 1000 replications
  • hint: use sample on the vector c(1:6)

Part b Using what you remember from STA 209, calculate it directly.

Part c Are these values the same? If not, are they close? (within 1%)