Normal Probabilities

Practice

The R function pnorm() will be very helpful for calculating probabilities. In order to use it, we need to tell the function a few specific things:

  • The value we are checking a probability for
  • The mean of the normal distribution
  • The standard deviation of the normal distribution
  • Whether to give the ‘less than’ or ‘greater than’ probability (by default R returns ‘less than’ probabilities; use lower.tail=FALSE to get ‘greater than’ probabilities)

Here is an example of using the pnorm() function to calculate the probability P(X > 25) for a normal distribution with mean 30 and std.dev. 10.

pnorm(25, mean=30, sd=10, lower.tail = FALSE)
## [1] 0.6914625
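
To see the default behavior, here is a quick check (reusing the example above): pnorm() returns the ‘less than’ probability unless lower.tail = FALSE is given, and the two tails sum to 1.

pnorm(25, mean=30, sd=10)       # P(X < 25), the default lower tail
## [1] 0.3085375
1 - pnorm(25, mean=30, sd=10)   # complement gives P(X > 25) again
## [1] 0.6914625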

Question 1

We will use a standard normal distribution for practice. For all of these question parts, show your R code and your calculations. I also recommend drawing a rough sketch to visually illustrate the probability on the normal distribution curve.
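
For example, here is one way to make such a sketch in R (a minimal illustration of my own, not part of the assignment), shading P(Z > 0.5) under the standard normal curve:

library(ggplot2)
z <- seq(-4, 4, by = 0.01)                # grid of z values
dens <- data.frame(z = z, d = dnorm(z))   # standard normal density
ggplot(dens, aes(z, d)) +
  geom_line() +
  geom_area(data = subset(dens, z > 0.5), fill = 'gray', alpha = 0.5) +
  labs(y = 'density')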

Part A: Use R to find the probability of randomly picking a value greater than 0.5.

pnorm(0.5, lower.tail=F)
## [1] 0.3085375

Part B: Use R to find the probability of randomly picking a value less than -0.5.

pnorm(-0.5)
## [1] 0.3085375

Part C: Do your answers to Parts A and B match?

# Yes, they are the same, by the symmetry of the normal distribution about its mean.
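
A quick numerical check of this symmetry:

c(pnorm(-0.5), pnorm(0.5, lower.tail = FALSE))
## [1] 0.3085375 0.3085375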

Part D: Use Parts A and B to find the probability of randomly picking a value between -0.5 and 0.5.

pnorm(0.5) - pnorm(-0.5)
## [1] 0.3829249
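
Equivalently, using the symmetry from Part C, we can subtract the two equal tails from 1:

1 - 2 * pnorm(-0.5)
## [1] 0.3829249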

Part E: Verify the 68-95-99.7% rule using 1, 2, and 3 standard deviations, and state the results in your own words.

pnorm(1) - pnorm(-1)
## [1] 0.6826895
pnorm(2) - pnorm(-2)
## [1] 0.9544997
pnorm(3) - pnorm(-3)
## [1] 0.9973002
# About 68% of values fall within 1 standard deviation of the mean, about 95%
# within 2, and about 99.7% within 3, matching the 68-95-99.7% rule.
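
The rule is not special to the standard normal. As a quick sketch (mean 30 and standard deviation 10 chosen arbitrarily), the same proportions hold for any normal distribution:

pnorm(30 + 10 * 1:3, mean=30, sd=10) - pnorm(30 - 10 * 1:3, mean=30, sd=10)
## [1] 0.6826895 0.9544997 0.9973002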

Practical Example – Mercury in Fish

Mercury is a chemical that is toxic to humans (and many other animals). From the smokestacks of power plants to the discharges of wastewater treatment plants, and other sources, mercury in the form of the compound methylmercury enters the environment, where it can settle to the seafloor and be taken up by tiny organisms that live or feed on bottom sediments.

These compounds aren’t digested; they accumulate within the animals that ingest them and become more and more concentrated as they pass along the food chain, as animals eat and are then eaten in turn. This is biomagnification, and it means that higher-level predators (fish, birds, and marine mammals) build up greater and more dangerous amounts of toxic materials than animals lower on the food chain. (Info taken from here)

The U.S. Food and Drug Administration recommends avoiding eating fish with mercury levels higher than 0.46 \(\mu\)g/g (micrograms of mercury per gram of fish), as they may be harmful (source).

Suppose the population of yellowfin tuna follows a Normal distribution with an average mercury level of 0.354 \(\mu\)g/g and a variance of 0.02. (Mean value taken from here; I made up the variance amount, as this was hard to find.)

Question 2:

Part A: What is the standard deviation of this population?

sqrt(0.02)
## [1] 0.1414214

Part B: What is the probability of catching a yellowfin tuna with an unsafe amount of mercury? Suppose you enjoy eating yellowfin tuna. Would you frequently eat yellowfin tuna knowing this extra information?

pnorm(0.46, mean=0.354, sd=0.14, lower.tail = FALSE)  # sd = sqrt(0.02) rounded to 0.14
## [1] 0.2244821
# Roughly a 22% chance of catching an unsafe fish. Knowing this, I would
# probably eat yellowfin tuna less frequently.
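
The same answer can be found by standardizing first, a sketch of the z-score approach:

z <- (0.46 - 0.354) / 0.14   # how many sd's the cutoff is above the mean
pnorm(z, lower.tail = FALSE)
## [1] 0.2244821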

Part C: Suppose the standard deviation is actually 0.03 instead of 0.14. Find the probability of catching a yellowfin tuna with an unsafe amount of mercury. Does this change your answer to whether or not you would frequently eat yellowfin tuna?

pnorm(0.46, mean=0.354, sd=0.03, lower.tail = FALSE)
## [1] 0.0002051774
# Now only about a 0.02% chance of an unsafe fish, so this would make me much
# more comfortable eating yellowfin tuna frequently.

Question 3 – Standard Dev. Practice

For this problem we will return to the college dataset we’ve used previously.

We are going to use the Net_Tuition variable and practice some ideas related to variability.

college %>% ggplot(aes(Net_Tuition)) + geom_histogram(color = 'black', fill = 'gray') + facet_wrap(~Type)

Part A: Why would we not want to use mean and standard deviation to describe the histograms for each of these groups? (Regardless, we will use these to practice our concepts)

# The distributions are skewed right, so the mean and standard deviation are
# not good measures of center and spread in this scenario.

Part B: Compare the variability of both groups. Which would have a larger standard deviation?

# The private group has more variability and will have a larger s.d.
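
We can check this numerically (a sketch, assuming the same college data frame and that dplyr is loaded):

college %>% group_by(Type) %>% summarize(sd_net_tuition = sd(Net_Tuition))
# (output omitted; the Private group should show the larger standard deviation)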

Part C: The mean and variance for the Private colleges are given by the following. Write an interpretation of the standard deviation in this context.

college %>% filter(Type=="Private") %>% pull(Net_Tuition) %>% mean()
## [1] 17243.76
college %>% filter(Type=="Private") %>% pull(Net_Tuition) %>% var()
## [1] 46634166
sqrt(46634166)
## [1] 6828.921
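
Equivalently, sd() computes the standard deviation in one step:

college %>% filter(Type=="Private") %>% pull(Net_Tuition) %>% sd()
## [1] 6828.921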

For Private colleges, a typical college’s net tuition differs from the mean of $17,244 by about $6,829.

Part D: Suppose I add an observation for another college to the Private college data, which is an outlier with a large Net Tuition of $50,000. What would happen to the std. dev.?

# The standard deviation will go up, as this value is much larger than the mean

Part E: Suppose I add an observation for another college to the Private college data, which is an outlier with a small Net Tuition of $2,000. What would happen to the std. dev.?

# The standard deviation will also go up: although this value is much smaller
# than the mean, it is still far away from it.
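
As a toy illustration of Parts D and E (made-up numbers, not the college data), adding a point far from the mean in either direction inflates the standard deviation:

x <- c(10, 12, 14, 16, 18)
sd(x)            # about 3.16
sd(c(x, 50))     # with a large outlier: about 15.0
sd(c(x, -20))    # with a far-below outlier: about 14.2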

EXTRA: (no work is needed) Net_Tuition as we just looked at it is right-skewed for both groups. A technique that is sometimes used for right-skewed distributions to get them to look more ‘Normal’ is to do a log-transformation. We take the natural log of the variable i.e. log(variable) and graph it instead.

college %>% ggplot(aes(log(Net_Tuition))) + geom_histogram(color = 'black', fill = 'gray') + facet_wrap(~Type)

Both of these groups now look more ‘Normal’ (with small outliers for Private colleges). Interpreting the means and standard deviations becomes much more difficult, though. What the heck does it mean to say ‘the mean of the natural log of net tuition of private colleges is 9.68 with a standard deviation of 0.39’?

# mean of log(net_tuition)
college %>% filter(Type=="Private") %>% pull(Net_Tuition) %>% log() %>% mean()
## [1] 9.680216
# sd of log(net_tuition)
college %>% filter(Type=="Private") %>% pull(Net_Tuition) %>% log() %>% sd()
## [1] 0.3948205

Another complication is that working backwards to get our means and standard deviations in terms of the original unit of measurement ($) is not perfect, and in fact this fails miserably for the std. dev. You can see this with the following, where I have switched the order of computing the log and the mean/std. dev.

# mean of log(net_tuition)
college %>% filter(Type=="Private") %>% pull(Net_Tuition) %>% log() %>% mean()
## [1] 9.680216
# log of mean of net_tuition
college %>% filter(Type=="Private") %>% pull(Net_Tuition) %>% mean() %>% log()
## [1] 9.755206
# sd of log(net_tuition)
college %>% filter(Type=="Private") %>% pull(Net_Tuition) %>% log() %>% sd()
## [1] 0.3948205
# log of sd() of net_tuition
college %>% filter(Type=="Private") %>% pull(Net_Tuition) %>% sd() %>% log()
## [1] 8.828922
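
One reason the back-transformed mean is at least interpretable: exponentiating the mean of the logs gives the geometric mean, not the arithmetic mean (a sketch):

college %>% filter(Type=="Private") %>% pull(Net_Tuition) %>% log() %>% mean() %>% exp()
# roughly $16,000, below the arithmetic mean of about $17,244

No comparable back-transform exists for the standard deviation, which is why the last value above (about 8.83) is meaningless in dollar terms.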