Visualization

Data:

diamonds_data=diamonds

Plot:

GoalPlot

How I approach the problem

Step 1: Figure out what data is plotted.

Based on the plot we can see the following:

The data plotted consists of:
- Fair diamonds (from the title)
- something to do with the letters D, H, I, and J
- clarity of VVS2, VVS1, IF
- approximately 40 diamonds plotted
Need to figure out:
- What “Fair” means
- What the letters mean

Next I import the dataset and check the first few values:

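library(ggplot2)  # provides the diamonds dataset and the plotting functions
library(dplyr)    # provides %>% and filter()
diamonds_data <- diamonds
head(diamonds_data, 10)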

## # A tibble: 10 × 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39

From this data, it looks like the word “Fair” refers to the cut, and the letters refer to the color. If we filter the dataset by those values, we get the following:

diamonds2 = diamonds_data%>%
  filter(cut=='Fair')%>%
  filter(clarity %in% c('VVS2','VVS1','IF'))%>%
  filter(color %in% c('D','H','I','J'))
dim(diamonds2)

## [1] 38 10

38 diamonds is similar to our approximation of 40, so this is probably the dataset used in the plot.
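A quick way to double-check this reading is to list the levels of the three categorical columns and tabulate the filtered rows. The sketch below assumes `ggplot2` and `dplyr` are installed; the names `fair_subset` and `panel_counts` are introduced here for illustration and are not part of the original walkthrough.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)    # provides %>%, filter(), count()

# Which column does each label from the plot belong to?
levels(diamonds$cut)      # includes "Fair"
levels(diamonds$color)    # the letters D through J
levels(diamonds$clarity)  # includes VVS2, VVS1, IF

# Same subset as above, written as a single filter() call
fair_subset <- diamonds %>%
  filter(cut == "Fair",
         clarity %in% c("VVS2", "VVS1", "IF"),
         color %in% c("D", "H", "I", "J"))

# Diamonds per color and clarity, i.e. per subplot and fill group
panel_counts <- fair_subset %>% count(color, clarity)
panel_counts
```

Passing several conditions to one `filter()` call keeps only rows that satisfy all of them, so it should select the same rows as the three chained calls.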

Step 2: Get a similar plot

Values: Based on the axes
- x value is price
- y value is a count
Geom:
- This is either a stacked histogram or bar chart
- Price is continuous, not discrete, so we should probably use a histogram
- There are 2 bins
Scales: Based on the legend
- The bars are filled based on the value of “clarity”
Facet
- There are four subplots based on “color”

Using this we can generate the following graph:

ggplot(diamonds2,aes(x=price,fill=clarity))+
  geom_histogram(position = "stack", alpha = 0.5, bins = 2) +
  facet_wrap(~color)

This plot looks very similar to the original plot, but now we need to work on the fine details.
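To compare the bars against the original more precisely than by eye, we can look at the numbers ggplot2 computed for the histogram layer. This is only a sketch: it rebuilds the filtered data so it runs on its own, and the names `p` and `bins` are chosen here for illustration.

```r
library(ggplot2)
library(dplyr)

# Rebuild the filtered dataset from Step 1
diamonds2 <- diamonds %>%
  filter(cut == "Fair",
         clarity %in% c("VVS2", "VVS1", "IF"),
         color %in% c("D", "H", "I", "J"))

p <- ggplot(diamonds2, aes(x = price, fill = clarity)) +
  geom_histogram(position = "stack", alpha = 0.5, bins = 2) +
  facet_wrap(~color)

# layer_data() returns the values ggplot2 computed for a layer
bins <- layer_data(p, 1)

# One row per subplot (PANEL), fill group, and bin
bins[, c("PANEL", "fill", "xmin", "xmax", "count")]

# Every diamond in the subset should be counted in exactly one bin
sum(bins$count)
```

If the counts per panel match the bar heights in the goal plot, the choice of histogram and bin count is probably right.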

Step 3: Fine details

The differences between the graphs seem to be related to the following:

Axes
- Both axis titles are different
- Both axis values are different
Colors/Fill
- Colors are different, look like red, blue, green
Theme
- Different theme
Title
- Our plot does not have a title

Start by fixing the colors:

ggplot(diamonds2,aes(x=price,fill=clarity))+
  geom_histogram(position = "stack", alpha = 0.5, bins = 2) +
  facet_wrap(~color)+
  scale_fill_manual(values=c("red","blue","green"))
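The unnamed vector above is matched to the clarity levels by position. A named vector makes the pairing explicit. The sketch below assumes the pairing that the positional call should produce (VVS2 red, VVS1 blue, IF green); `clarity_colors` is a name introduced here for illustration.

```r
library(ggplot2)
library(dplyr)

# Rebuild the filtered dataset from Step 1
diamonds2 <- diamonds %>%
  filter(cut == "Fair",
         clarity %in% c("VVS2", "VVS1", "IF"),
         color %in% c("D", "H", "I", "J"))

# Names are clarity levels, values are fill colors (assumed pairing)
clarity_colors <- c(VVS2 = "red", VVS1 = "blue", IF = "green")

p <- ggplot(diamonds2, aes(x = price, fill = clarity)) +
  geom_histogram(position = "stack", alpha = 0.5, bins = 2) +
  facet_wrap(~color) +
  scale_fill_manual(values = clarity_colors)
p
```

With names, the colors stay attached to the same clarity levels even if the filter later changes which levels are present.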

Since that looks right, let’s work on the axes and title. Both price and count are continuous variables.

X
- breaks: 0, 5000, 15000
- name: “Price”
Y
- breaks: 0, 1, 5, 10, 15
- name: “Count of Diamonds”
Title
- “Prices of Fair Diamonds”

Making these changes gives us:

ggplot(diamonds2,aes(x=price,fill=clarity))+
  geom_histogram(position = "stack", alpha = 0.5, bins = 2) +
  facet_wrap(~color)+
  scale_fill_manual(values=c("red","blue","green"))+
  scale_y_continuous(name="Count of Diamonds",breaks=c(0,1,5,10,15))+
  scale_x_continuous(name="Price",breaks=c(0,5000,15000))+
  ggtitle("Prices of Fair Diamonds")

From here we see that:

Our theme is not correct
- it looks similar to theme_minimal so we will start there
Our title needs to be centered
- This is also a theme option

Making these changes:

ggplot(diamonds2,aes(x=price,fill=clarity))+
  geom_histogram(position = "stack", alpha = 0.5, bins = 2) +
  facet_wrap(~color)+
  scale_fill_manual(values=c("red","blue","green"))+
  scale_y_continuous(name="Count of Diamonds",breaks=c(0,1,5,10,15))+
  scale_x_continuous(name="Price",breaks=c(0,5000,15000))+
  ggtitle("Prices of Fair Diamonds")+
  theme_minimal()+
  theme(plot.title = element_text(hjust = 0.5))

This looks almost right, but the extra gridlines make the plot harder to read (where is “1”, for example?). These lines don’t all line up with the axis labels, while the ones in the actual plot do, so we probably need to remove the minor gridlines.

ggplot(diamonds2,aes(x=price,fill=clarity))+
  geom_histogram(position = "stack", alpha = 0.5, bins = 2) +
  facet_wrap(~color)+
  scale_fill_manual(values=c("red","blue","green"))+
  scale_y_continuous(name="Count of Diamonds",breaks=c(0,1,5,10,15))+
  scale_x_continuous(name="Price",breaks=c(0,5000,15000))+
  ggtitle("Prices of Fair Diamonds")+
  theme_minimal()+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(panel.grid.minor = element_blank())
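To see why the minor gridlines were confusing: as far as I understand ggplot2's defaults, a continuous scale draws one minor gridline halfway between each pair of neighboring major breaks. The sketch below just does that arithmetic for our y breaks; the halfway rule is my assumption about the default, and `major_breaks` and `minor_breaks` are names introduced here.

```r
# Major breaks passed to scale_y_continuous() in the plot above
major_breaks <- c(0, 1, 5, 10, 15)

# Midpoints between neighboring major breaks (assumed default minor positions)
minor_breaks <- head(major_breaks, -1) + diff(major_breaks) / 2
minor_breaks
```

The midpoints fall at 0.5, 3, 7.5, and 12.5, so an unlabeled line sits right next to the labeled line at 1. Setting `minor_breaks = NULL` in `scale_y_continuous()` is another way to drop them for one axis only.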

And we’ve recreated the original plot!

GoalPlot
