Titanic Data

One-way tables

Question 1

Part A Create a frequency table using the titanic data set to find how many children and adults were on board the Titanic.

with(titanic, table(Age))
## Age
## Child Adult 
##   109  2092

Part B Determine what percentage of the passengers on-board the Titanic were adults.

with(titanic, table(Age)) %>% proportions()
## Age
##      Child      Adult 
## 0.04952294 0.95047706

95.0% of the people on board were adults. You could also compute this from the counts in Part A: \(\frac{2092}{2092 + 109} \approx 0.950\).
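The same arithmetic can be done by hand from the Part A counts. This is a minimal sketch; the names `age_counts`, `age_share`, and `adult_pct` are made up for illustration.

```r
# Counts copied from the Part A frequency table.
age_counts <- c(Child = 109, Adult = 2092)

# Share of each group: its count divided by the total number on board.
age_share <- age_counts / sum(age_counts)

# Adult share as a percentage, rounded to one decimal place.
adult_pct <- round(100 * age_share[["Adult"]], 1)
adult_pct
```

Dividing a vector of counts by its sum is what proportions() does for a one-way table.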

Part C Determine what percentage of the passengers on-board the Titanic were members of the crew.

with(titanic, table(Class)) %>% proportions()
## Class
##       1st       2nd       3rd      Crew 
## 0.1476602 0.1294866 0.3207633 0.4020900

40.2% of the people on board were members of the crew.


Two-way Tables

Question 2

Part A How many children were included in second class?

with(titanic, table(Age, Class))
##        Class
## Age     1st 2nd 3rd Crew
##   Child   6  24  79    0
##   Adult 319 261 627  885

24 children.

Part B What percentage of the crew survived? How about children?

with(titanic, table(Survived, Class)) %>% proportions(margin = 2)
##         Class
## Survived       1st       2nd       3rd      Crew
##      No  0.3753846 0.5859649 0.7478754 0.7604520
##      Yes 0.6246154 0.4140351 0.2521246 0.2395480
with(titanic, table(Survived, Age)) %>% proportions(margin=2)
##         Age
## Survived     Child     Adult
##      No  0.4770642 0.6873805
##      Yes 0.5229358 0.3126195

24% of the crew survived and 52% of children survived. These are conditional proportions because each one restricts attention to a single group (the crew in the first table, children in the second). Within each group, the survived and did-not-survive proportions must add up to 100%. Because I put Survived on the rows and the grouping variable on the columns, each column has to sum to 1, which is what the margin = 2 argument of proportions() does.
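One way to check the choice of margin is to confirm which direction sums to 1. The sketch below assumes the built-in `Titanic` table from the datasets package holds the same counts as the `titanic` data frame used here; the names `surv_by_class` and `within_class` are illustrative.

```r
# Assumption: the built-in Titanic table matches the titanic data frame.
# Keep dimensions 4 (Survived) and 1 (Class): rows are Survived, columns are Class.
surv_by_class <- margin.table(Titanic, c(4, 1))

# margin = 2 divides each cell by its column total,
# so every Class column sums to 1.
within_class <- proportions(surv_by_class, margin = 2)

colSums(within_class)                    # each entry should be 1
round(within_class["Yes", "Crew"], 2)    # share of the crew who survived
```

If the column sums were not all 1, the table would be answering a different question than "what share of each class survived".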

Part C What proportion of individuals who survived were members of the crew? Construct the plot associated with the table you create.

with(titanic, table(Survived, Class)) %>% proportions(margin=1) %>% addmargins(2)
##         Class
## Survived        1st        2nd        3rd       Crew        Sum
##      No  0.08187919 0.11208054 0.35436242 0.45167785 1.00000000
##      Yes 0.28551336 0.16596343 0.25035162 0.29817159 1.00000000

Of the people who survived, a proportion of 0.298 were members of the crew. This is a conditional proportion because we restrict ourselves to the Survived = Yes row. The rows need to add up to 100%, so I used proportions(margin = 1).
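A plot that matches this row-wise table could be a filled bar chart with one bar per Survived group. This is only a sketch: it rebuilds a one-row-per-person data frame from the built-in `Titanic` table, which is an assumption about how `titanic` was created, and `titanic_rows` is a made-up name.

```r
library(ggplot2)

# Assumption: expand the built-in Titanic counts to one row per person.
counts <- as.data.frame(Titanic)
titanic_rows <- counts[rep(seq_len(nrow(counts)), counts$Freq),
                       c("Class", "Survived")]

# position = "fill" rescales each bar to a total of 1,
# which corresponds to proportions(margin = 1) in the table above.
p <- ggplot(titanic_rows, aes(x = Survived, fill = Class)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion within Survived group")
p
```

The Crew segment of the Yes bar is the 0.298 read from the table.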


Three-way tables

Question 3

ggplot(titanic, aes(Class)) + 
  geom_bar() + 
  facet_grid(Survived ~ Sex)

Part A: Amongst female passengers, which class had the most who did not survive? How many female passengers in this class did not survive?

3rd class. 106 female passengers in 3rd class did not survive.

Part B: Amongst male passengers, which class had the fewest people survive? How many male passengers in this class survived?

2nd class. 25 male passengers in 2nd class survived.
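The bar heights can be hard to read exactly, so the counts behind the plot can also be tabulated. This sketch assumes the built-in `Titanic` table matches the `titanic` data frame plotted above; `female_died` and `male_survived` are illustrative names.

```r
# Assumption: the built-in Titanic table matches the titanic data frame.
# Collapse over Age, keeping Class (1), Sex (2), and Survived (4).
counts <- margin.table(Titanic, c(1, 2, 4))

female_died   <- counts[, "Female", "No"]   # one count per class
male_survived <- counts[, "Male", "Yes"]

female_died[which.max(female_died)]         # class with the most female deaths
male_survived[which.min(male_survived)]     # class with the fewest male survivors
```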


Numerical Summaries

Question 4 (Conceptual Questions)

Part A: Where does the name ‘order statistics’ come from?

We need to order the data from smallest to largest to calculate them.
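A small illustration with made-up numbers: once the values are sorted, the minimum, median, and maximum are read off by position.

```r
x <- c(7, 2, 9, 4, 5)    # made-up values

ordered <- sort(x)       # smallest to largest
ordered[1]               # minimum: first position
ordered[3]               # median: middle position of five values
ordered[length(x)]       # maximum: last position
```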

Part B: Why do we use median and IQR (instead of mean and standard deviation) for the center and spread when we have skews, outliers, or both?

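Skew and outliers can shift the mean and inflate the standard deviation a lot, so they are unreliable measures of center and spread when skew or outliers are present. The median and IQR depend only on the ordered middle of the data, so they change very little.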
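A quick way to see this is to change one value into an outlier and compare the summaries. The data and the helper `center_spread` below are made up for illustration.

```r
clean        <- c(2, 3, 4, 5, 6)     # made-up values
with_outlier <- c(2, 3, 4, 5, 60)    # same data, largest value replaced

# Hypothetical helper: both pairs of center/spread measures side by side.
center_spread <- function(v) {
  c(mean = mean(v), sd = sd(v), median = median(v), IQR = IQR(v))
}

rbind(clean = center_spread(clean),
      with_outlier = center_spread(with_outlier))
```

In this example the median and IQR are identical in both rows, while the mean and standard deviation move substantially.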

Part C: What is a reason one may want to use z-scores to compare variables?

If the variables are measured on different scales or in different units, z-scores put them on a common scale (standard deviations from the mean), so we can compare them when we otherwise couldn’t.
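As a sketch with made-up measurements, a z-score subtracts the mean and divides by the standard deviation, which removes the original units. The helper `z_score` is illustrative; base R's scale() does the same centering and scaling.

```r
# Made-up measurements of the same five people in different units.
height_cm <- c(150, 160, 170, 180, 190)
weight_kg <- c(55, 60, 62, 70, 90)

# Hypothetical helper: standard deviations above or below the mean.
z_score <- function(v) (v - mean(v)) / sd(v)

z_height <- z_score(height_cm)
z_weight <- z_score(weight_kg)

# For person 5: which measurement is more unusual relative to the group?
c(height = z_height[5], weight = z_weight[5])
```

Centimeters and kilograms cannot be compared directly, but the two z-scores can.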


Question 5

Part A: Describe the distribution of Sepal.Length for ‘versicolor’ iris flowers using the histogram and output below. (You will not need all of these numbers)

iris[iris$Species=='versicolor',] %>% ggplot(aes(x=Sepal.Length)) + geom_histogram(bins = 8, color = 'black', fill = 'gray') +
  ggtitle("Species = Versicolor")

iris[iris$Species=='versicolor',]$Sepal.Length %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.900   5.600   5.900   5.936   6.300   7.000
iris[iris$Species=='versicolor',]$Sepal.Length %>% sd()
## [1] 0.5161711

Sepal.Length for Versicolor iris flowers is roughly symmetric and unimodal, with a mean of 5.936 and a standard deviation of 0.516 (units unclear). There are no outliers according to the histogram. We prefer to use the mean and standard deviation when we can. Since there is no skew and there are no outliers, we can use the mean and standard deviation.
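A rough numerical check of symmetry, offered as a sketch rather than a formal test, is that the mean and median nearly coincide relative to the spread. The name `sepal` is illustrative; `iris` is the built-in data set.

```r
sepal <- iris$Sepal.Length[iris$Species == "versicolor"]

c(mean = mean(sepal), median = median(sepal), sd = sd(sepal))

# Gap between mean and median, measured in standard deviations.
# A small value is consistent with a roughly symmetric distribution.
abs(mean(sepal) - median(sepal)) / sd(sepal)
```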

Part B: Describe the distribution of Petal.Length for ‘versicolor’ iris flowers using the histogram and boxplot below.

iris[iris$Species=='versicolor',] %>% ggplot(aes(x=Petal.Length)) + geom_histogram(bins = 14, color = 'black', fill = 'gray')+ ggtitle("Species = Versicolor")

iris[iris$Species=='versicolor',] %>% ggplot(aes(x=Petal.Length)) + geom_boxplot() + ggtitle("Species = Versicolor")

Petal length for Versicolor iris flowers is skewed left with a low outlier. The median is around 4.3 and the IQR is about 0.6 (units unclear). Since the distribution is skewed with an outlier, we want to use the median and IQR for center and spread, respectively.
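The values read off the boxplot can be checked directly. This sketch applies the usual 1.5 × IQR boxplot rule for flagging outliers; the names `petal`, `lower_fence`, and `upper_fence` are illustrative.

```r
petal <- iris$Petal.Length[iris$Species == "versicolor"]

q   <- quantile(petal, c(0.25, 0.5, 0.75))   # quartiles and median
iqr <- IQR(petal)                            # Q3 minus Q1

# Boxplot convention: flag points more than 1.5 * IQR beyond the quartiles.
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
outliers <- petal[petal < lower_fence | petal > upper_fence]

q
iqr
outliers
```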

Part C: Use the boxplot below to compare Sepal.Length of the various Species to each other in terms of center and spread. (Hint: What measures of center and spread are easiest to see in the boxplots?)

ggplot(iris, aes(x=Sepal.Length, y=Species)) + geom_boxplot()

Virginica has the largest median at about 6.5, followed by Versicolor with a median around 5.9, and Setosa has the lowest median at about 5. Virginica and Versicolor have similar IQRs, which are larger than the IQR for Setosa.
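The medians and IQRs estimated from the boxplots can be computed per species with tapply(). A minimal sketch; `medians` and `iqrs` are illustrative names.

```r
# Median and IQR of Sepal.Length within each Species.
medians <- tapply(iris$Sepal.Length, iris$Species, median)
iqrs    <- tapply(iris$Sepal.Length, iris$Species, IQR)

rbind(median = medians, IQR = iqrs)
```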

Part D: Is there an association between Sepal.Length and Species of iris flower?

There is an association between these two variables. If you know the species of a flower, you can tell roughly what values of Sepal.Length it is likely to have, because the distributions differ by species. Since Species provides information about Sepal.Length, the two variables are associated.