Part A Create a frequency table using the titanic data set to find how many children and adults were on board the Titanic.
with(titanic, table(Age))
## Age
## Child Adult
## 109 2092
Part B Determine what percentage of the passengers on board the Titanic were adults.
with(titanic, table(Age)) %>% proportions()
## Age
## Child Adult
## 0.04952294 0.95047706
The percentage of passengers that were adults is 95.0%. You could also have used the counts in the previous table to get \(\frac{2092}{2092 + 109} \approx 0.95\).
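The arithmetic above can be checked directly from the counts. This is a minimal sketch, assuming `titanic` has one row per person; since the original data file is not shown, it is rebuilt here from base R's built-in `Titanic` table, and `counts_df`, `age_counts`, and `adult_share` are illustrative names.

```r
# Assumption: rebuild a one-row-per-person `titanic` data frame
# from the built-in `Titanic` contingency table.
counts_df <- as.data.frame(Titanic)
titanic <- counts_df[rep(seq_len(nrow(counts_df)), counts_df$Freq),
                     c("Class", "Sex", "Age", "Survived")]

age_counts <- table(titanic$Age)                         # counts of Child and Adult
adult_share <- age_counts[["Adult"]] / sum(age_counts)   # adults / everyone
round(100 * adult_share, 1)                              # share of adults as a percentage
```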
Part C Determine what percentage of the passengers on board the Titanic were members of the crew.
with(titanic, table(Class)) %>% proportions()
## Class
## 1st 2nd 3rd Crew
## 0.1476602 0.1294866 0.3207633 0.4020900
40.2% of passengers were members of the crew.
Part A How many children were included in second class?
with(titanic, table(Age, Class))
## Class
## Age 1st 2nd 3rd Crew
## Child 6 24 79 0
## Adult 319 261 627 885
24 children.
Part B What percentage of the crew survived? How about children?
with(titanic, table(Survived, Class)) %>% proportions(margin = 2)
## Class
## Survived 1st 2nd 3rd Crew
## No 0.3753846 0.5859649 0.7478754 0.7604520
## Yes 0.6246154 0.4140351 0.2521246 0.2395480
with(titanic, table(Survived, Age)) %>% proportions(margin=2)
## Age
## Survived Child Adult
## No 0.4770642 0.6873805
## Yes 0.5229358 0.3126195
24% of the crew survived. 52% of children survived. These are conditional probabilities since we are narrowing our focus to one group for each table (crew and children). We need survived and not survived to add up to 100% within each group. Because I arranged my tables with Survived in the rows and the groups in the columns, each column has to sum to 1, so I needed the margin = 2 argument in the proportions() function.
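One way to confirm that the right margin was chosen is to check which direction sums to 1. The sketch below assumes the same one-row-per-person `titanic` data frame, rebuilt from base R's `Titanic` table; `by_class` is an illustrative name.

```r
# Assumption: rebuild `titanic` from the built-in `Titanic` table.
counts_df <- as.data.frame(Titanic)
titanic <- counts_df[rep(seq_len(nrow(counts_df)), counts_df$Freq),
                     c("Class", "Sex", "Age", "Survived")]

# Proportions within each column (each class), because margin = 2
by_class <- proportions(with(titanic, table(Survived, Class)), margin = 2)

colSums(by_class)         # every class column sums to 1
rowSums(by_class)         # the rows do not, so these are per-class rates
by_class["Yes", "Crew"]   # survival rate among the crew
```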
Part C What proportion of individuals who survived were members of the crew? Construct the plot associated with the table you create.
with(titanic, table(Survived, Class)) %>% proportions(margin=1) %>% addmargins(2)
## Class
## Survived 1st 2nd 3rd Crew Sum
## No 0.08187919 0.11208054 0.35436242 0.45167785 1.00000000
## Yes 0.28551336 0.16596343 0.25035162 0.29817159 1.00000000
The proportion of survivors who were members of the crew is 0.298. This is a conditional probability since we are restricting ourselves to only looking at the Survived = Yes row. We need the rows to add to 100%, so I used proportions(margin = 1).
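The two conditional proportions involving crew and survival use the same count in the numerator but different denominators. A small sketch of that arithmetic, again assuming `titanic` is rebuilt from base R's `Titanic` table (variable names are illustrative):

```r
# Assumption: rebuild `titanic` from the built-in `Titanic` table.
counts_df <- as.data.frame(Titanic)
titanic <- counts_df[rep(seq_len(nrow(counts_df)), counts_df$Freq),
                     c("Class", "Sex", "Age", "Survived")]

tab <- with(titanic, table(Survived, Class))
crew_survivors <- tab["Yes", "Crew"]

# Condition on the row: of all survivors, what share were crew?
p_crew_given_survived <- crew_survivors / sum(tab["Yes", ])

# Condition on the column: of all crew, what share survived?
p_survived_given_crew <- crew_survivors / sum(tab[, "Crew"])

c(p_crew_given_survived, p_survived_given_crew)
```

The first value corresponds to margin = 1 and the second to margin = 2.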
ggplot(titanic, aes(Class)) +
geom_bar() +
facet_grid(Survived ~ Sex)
Part A: Amongst female passengers, which class had the most who did not survive? How many female passengers in this class did not survive?
3rd class. 106 female passengers in 3rd class did not survive.
Part B: Amongst male passengers, which class had the fewest people survive? How many male passengers in this class survived?
2nd class. 25 male passengers in 2nd class survived.
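Counts read off a faceted bar chart can be cross-checked with a three-way table. This is a sketch under the assumption that `titanic` matches base R's `Titanic` table; `tab3`, `female_died`, and `male_survived` are illustrative names.

```r
# Assumption: rebuild `titanic` from the built-in `Titanic` table.
counts_df <- as.data.frame(Titanic)
titanic <- counts_df[rep(seq_len(nrow(counts_df)), counts_df$Freq),
                     c("Class", "Sex", "Age", "Survived")]

# Same breakdown as the facets: Class within each Sex and Survived panel
tab3 <- with(titanic, table(Class, Sex, Survived))

female_died   <- tab3[, "Female", "No"]    # females who did not survive, by class
male_survived <- tab3[, "Male", "Yes"]     # males who survived, by class

female_died[which.max(female_died)]        # class with the most female deaths
male_survived[which.min(male_survived)]    # class with the fewest male survivors
```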
Part A: Where does the name ‘order statistics’ come from?
We need to order the data from smallest to largest to calculate them.
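As a small illustration with made-up numbers, sorting the data is the first step, and the order statistics are then read off by position:

```r
# Made-up values, purely for illustration
x <- c(7, 2, 9, 4, 5)

sorted <- sort(x)                           # smallest to largest
smallest <- sorted[1]                       # minimum is the first ordered value
middle <- sorted[(length(sorted) + 1) / 2]  # median is the middle ordered value (odd n)
largest <- sorted[length(sorted)]           # maximum is the last ordered value

c(smallest, middle, largest)
```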
Part B: Why do we use median and IQR (instead of mean and standard deviation) for the center and spread when we have skews, outliers, or both?
Skews and outliers can change the mean and standard deviation a lot, so they are unreliable measures of center and spread when skews or outliers are present.
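A quick sketch with made-up numbers shows the effect: adding a single extreme value moves the mean and standard deviation a lot, while the median and IQR barely change.

```r
# Made-up values, purely for illustration
x <- c(2, 3, 4, 5, 6)
x_out <- c(x, 100)   # same data plus one extreme outlier

# Mean and standard deviation react strongly to the outlier
c(mean(x), mean(x_out))
c(sd(x), sd(x_out))

# Median and IQR move only slightly
c(median(x), median(x_out))
c(IQR(x), IQR(x_out))
```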
Part C: What is a reason one may want to use z-scores to compare variables?
If the variables use a different scale or unit, then z-scores still let us compare them when we otherwise couldn’t.
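As a sketch of the idea, the helper below (`z_score` is a hypothetical name, not a built-in) converts each value to the number of standard deviations it sits from its own variable's mean, using two columns of the built-in `iris` data that have quite different ranges:

```r
# Hypothetical helper: standardize a numeric vector
z_score <- function(x) (x - mean(x)) / sd(x)

z_sepal <- z_score(iris$Sepal.Length)
z_petal <- z_score(iris$Petal.Width)

# After standardizing, both variables have mean 0 and standard deviation 1,
# so a value of, say, 1.5 means the same thing for either variable.
round(c(mean(z_sepal), sd(z_sepal), mean(z_petal), sd(z_petal)), 3)

# Compare the first flower on both variables in standard-deviation units
c(z_sepal[1], z_petal[1])
```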
Part A: Describe the distribution of Sepal.Length for ‘versicolor’ iris flowers using the histogram and output below. (You will not need all of these numbers.)
iris[iris$Species=='versicolor',] %>% ggplot(aes(x=Sepal.Length)) + geom_histogram(bins = 8, color = 'black', fill = 'gray') +
ggtitle("Species = Versicolor")
iris[iris$Species=='versicolor',]$Sepal.Length %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.900 5.600 5.900 5.936 6.300 7.000
iris[iris$Species=='versicolor',]$Sepal.Length %>% sd()
## [1] 0.5161711
Sepal.Length for Versicolor iris flowers is roughly symmetric and unimodal, with a mean of 5.936 and standard deviation of 0.516 (units unclear). There are no outliers according to the histogram. We prefer to use mean and standard deviation when we can. Since there is no skew and there are no outliers, we can use the mean and standard deviation.
Part B: Describe the distribution of Petal.Length for ‘versicolor’ iris flowers using the histogram and boxplot below.
iris[iris$Species=='versicolor',] %>% ggplot(aes(x=Petal.Length)) + geom_histogram(bins = 14, color = 'black', fill = 'gray')+ ggtitle("Species = Versicolor")
iris[iris$Species=='versicolor',] %>% ggplot(aes(x=Petal.Length)) + geom_boxplot() + ggtitle("Species = Versicolor")
Petal length for Versicolor iris flowers is skewed left with outliers at low values. The median is around 4.3 and the IQR is about 0.6 (units unclear). Since the distribution is skewed with outliers, we want to use median and IQR for center and spread respectively.
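Values read off a boxplot can be cross-checked numerically. The sketch below uses the built-in `iris` data and the usual 1.5 × IQR rule for flagging outliers; the variable names are illustrative.

```r
# Petal lengths for versicolor flowers only
x <- iris$Petal.Length[iris$Species == "versicolor"]

med <- median(x)                    # center
iqr <- IQR(x)                       # spread: Q3 - Q1
q <- quantile(x, c(0.25, 0.75))     # first and third quartiles

# 1.5 * IQR fences; points beyond them are flagged as outliers
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr
low_outliers  <- x[x < lower_fence]
high_outliers <- x[x > upper_fence]

c(median = med, IQR = iqr)
low_outliers
mean(x) < med   # a mean below the median is consistent with left skew
```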
Part C: Use the boxplot below to compare Sepal.Length of the various Species to each other in terms of center and spread. (Hint: What measures of center and spread are easiest to see in the boxplots?)
ggplot(iris, aes(x=Sepal.Length, y=Species)) + geom_boxplot()
Virginica has the largest median at 6.5, followed by Versicolor with a median around 5.9, and Setosa has the lowest median at 5. Virginica and Versicolor have similar IQRs, which are larger than the IQR for Setosa.
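The medians and IQRs that the boxplots display can also be computed per species. A minimal sketch with the built-in `iris` data (`medians` and `iqrs` are illustrative names):

```r
# Median (center) and IQR (spread) of Sepal.Length within each species
medians <- tapply(iris$Sepal.Length, iris$Species, median)
iqrs    <- tapply(iris$Sepal.Length, iris$Species, IQR)

# Side-by-side summary, one row per species
cbind(median = medians, IQR = iqrs)
```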
Part D: Is there an association between Sepal.Length and Species of iris flower?
There is an association between these two variables. If you know the species of flower, you can tell roughly what values of Sepal.Length that species will have. This means Species provides information about Sepal.Length, so the two variables are associated.