The R function pnorm() will be very helpful for calculating probabilities. In order to use it, we need to tell the function a few specific things: the value we are interested in, the mean and standard deviation of the distribution, and which tail we want (via the lower.tail argument).
Here is an example of using the pnorm() function to calculate the probability \(P(X > 25)\) for a normal distribution with mean 30 and standard deviation 10.
# P(X > 25) for a normal distribution with mean 30 and sd 10
pnorm(25, mean=30, sd=10, lower.tail = FALSE)
## [1] 0.6914625
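Note that lower.tail = TRUE is the default in pnorm(), so leaving that argument off gives the area below 25 instead; the two probabilities are complements:
# P(X < 25): lower.tail = TRUE is the default
pnorm(25, mean=30, sd=10)
## [1] 0.3085375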
We will use a standard normal distribution for practice. For all of these question parts, show your R code and your calculations. I also recommend drawing a rough sketch to visually illustrate each probability on the normal distribution curve.
Part A: Use R to find the probability of randomly picking a value greater than 0.5.
Part B: Use R to find the probability of randomly picking a value less than -0.5.
Part C: Do your answers to Parts A and B match?
Part D: Use Parts A and B to find the probability of randomly picking a value between -0.5 and 0.5.
Part E: Verify the 68-95-99.7% rule using 1, 2, and 3 standard deviations, and state the results in your own words. (A starting sketch is given below.)
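For Part E, pnorm() uses mean = 0 and sd = 1 by default, so the middle area within one standard deviation can be found by subtracting one tail area from the other; a minimal sketch for the 1 standard deviation case (repeat the idea with 2 and 3):
# P(-1 < Z < 1) for a standard normal
pnorm(1) - pnorm(-1)
## [1] 0.6826895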
Mercury is a chemical that is toxic to humans (and many other animals). From the smokestacks of power plants to the discharges of wastewater treatment plants, and other sources, mercury in the form of the compound methylmercury enters the environment, where it can settle to the seafloor and be taken up by tiny organisms that live or feed on bottom sediments.
These compounds aren't digested; they accumulate within the animals that ingest them, becoming more and more concentrated as they pass along the food chain as animals eat and are then eaten in turn. This is biomagnification, and it means that higher-level predators (fish, birds, and marine mammals) build up greater and more dangerous amounts of toxic materials than animals lower on the food chain. (Info taken from here)
The U.S. Food and Drug Administration recommends avoiding eating fish with mercury levels higher than 0.46 \(\mu\)g/g (micrograms of mercury per gram of fish), as they may be harmful (source).
Suppose the population of yellowfin tuna follows a Normal distribution with an average mercury level of 0.354 \(\mu\)g/g and a variance of 0.02. (Mean value taken from here; I made up the variance amount, as it is hard to find.)
Part A: What is the standard deviation of this population?
Part B: What is the probability of catching a yellowfin tuna with an unsafe amount of mercury? Suppose you enjoy eating yellowfin tuna: would you frequently eat it knowing this extra information?
Part C: Suppose the standard deviation is actually 0.03 instead of 0.14 (your rounded answer from Part A). Find the probability of catching a yellowfin tuna with an unsafe amount of mercury. Does this change your answer about whether or not you would frequently eat yellowfin tuna? (One way to set up these calculations in R is sketched below.)
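As a starting point, here is a sketch of how Parts A through C might be set up with the same pnorm() pattern as before, plugging in the 0.46 cutoff and the parameters above; the outputs are omitted so you can compute them yourself:
# Part A: the standard deviation is the square root of the variance
sqrt(0.02)
# Parts B and C: P(X > 0.46), swapping in the appropriate sd
pnorm(0.46, mean = 0.354, sd = sqrt(0.02), lower.tail = FALSE)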
For this problem we will return to the college dataset we’ve used previously.
We are going to use the Net_Tuition variable and practice some ideas related to variability.
college %>% ggplot(aes(Net_Tuition)) + geom_histogram(color = 'black', fill = 'gray') + facet_wrap(~Type)
Part A: Why would we not want to use the mean and standard deviation to describe the histograms for each of these groups? (Regardless, we will use them to practice our concepts.)
Part B: Compare the variability of both groups. Which would have a larger standard deviation?
Part C: The mean and variance for the Private colleges are given by the following. Write an interpretation of the standard deviation in this context.
# mean of Net_Tuition for Private colleges
college %>% filter(Type=="Private") %>% pull(Net_Tuition) %>% mean()
## [1] 17243.76
# variance of Net_Tuition for Private colleges
college %>% filter(Type=="Private") %>% pull(Net_Tuition) %>% var()
## [1] 46634166
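Since the second output above is the variance, not the standard deviation, take its square root before interpreting; using the rounded value printed above:
# standard deviation = square root of the variance
sqrt(46634166)
## [1] 6828.921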
Part D: Suppose I add an observation for another college to the Private college data, which is an outlier with a large Net Tuition of $50,000. What would happen to the standard deviation?
Part E: Suppose I add an observation for another college to the Private college data, which is an outlier with a small Net Tuition of $2,000. What would happen to the standard deviation? (A toy demonstration related to Parts D and E follows below.)
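To build intuition for Parts D and E, here is a toy example with made-up numbers (not the college data) showing how a single extreme value affects the standard deviation; try the small-outlier case yourself:
# small made-up sample
x <- c(10, 12, 11, 13)
sd(x)
## [1] 1.290994
# add one large outlier and recompute
sd(c(x, 50))
## [1] 17.25399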
EXTRA (no work is needed): Net_Tuition as we just looked at it is right-skewed for both groups. A technique that is sometimes used to make right-skewed distributions look more 'Normal' is a log-transformation: we take the natural log of the variable, i.e. log(variable), and graph that instead.
college %>% ggplot(aes(log(Net_Tuition))) + geom_histogram(color = 'black', fill = 'gray') + facet_wrap(~Type)
Both of these groups now look more 'Normal' (with small outliers for Private colleges). Interpreting the means and standard deviations becomes much more difficult, though. What the heck does it mean to say 'the mean of the natural log of net tuition of private colleges is 9.68 with a standard deviation of 0.39'?
# mean of log(net_tuition)
college %>% filter(Type=="Private") %>% pull(Net_Tuition) %>% log() %>% mean()
## [1] 9.680216
# sd of log(net_tuition)
college %>% filter(Type=="Private") %>% pull(Net_Tuition) %>% log() %>% sd()
## [1] 0.3948205
Another complication is that working backwards to get our means and standard deviations in terms of the original units ($) is not perfect, and in fact it fails miserably for the standard deviation. You can see this below, where I have switched the order of computing the log and the mean/standard deviation.
# mean of log(net_tuition)
college %>% filter(Type=="Private") %>% pull(Net_Tuition) %>% log() %>% mean()
## [1] 9.680216
# log of mean of net_tuition
college %>% filter(Type=="Private") %>% pull(Net_Tuition) %>% mean() %>% log()
## [1] 9.755206
# sd of log(net_tuition)
college %>% filter(Type=="Private") %>% pull(Net_Tuition) %>% log() %>% sd()
## [1] 0.3948205
# log of sd() of net_tuition
college %>% filter(Type=="Private") %>% pull(Net_Tuition) %>% sd() %>% log()
## [1] 8.828922
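For what it's worth, exponentiating the mean of the logs does give a meaningful number, just not the usual mean: exp(mean(log(x))) is the geometric mean, which here lands below the arithmetic mean of $17,243.76. No comparably simple back-transformation exists for the standard deviation. A quick check using the rounded mean printed above:
# exp of the mean of the logs = geometric mean of net tuition
exp(9.680216)
## [1] 15997.95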