Lab 3: Data Visualization with ggplot2

This lab focuses on data visualization and how to create high-quality graphics using the ggplot2 package. We will use cleaned data here; subsequent labs cover how to manipulate data prior to graphing. The second part of the lab focuses on the concepts and strategy behind creating effective visualizations.

Directions for all labs (read before starting)

Please work together with your assigned partner. Make sure you both fully understand something before moving on.
Record your answers to lab questions separately from the lab’s examples. You and your partner should only turn in responses to lab questions, nothing more and nothing less.
Ask for help, clarification, or even just a check-in if anything seems unclear.

You will work on the “Lab” section with a partner using pair programming, a framework defined as follows:

One partner is the driver, who physically writes code and operates the computer
One partner is the navigator, who reviews the actions of the driver and provides feedback and guidance

Partners are encouraged to switch roles throughout the “Lab” section, but for the first few labs the less experienced coder should spend more time as the driver.

\(~\)

Preamble

Packages and Datasets

This lab will use the ggplot2 package:

# install.packages("ggplot2")
library(ggplot2)

It will also use data from The College Scorecard:

colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")

Description: The colleges data set contains attributes and outcomes for all primarily undergraduate institutions in the United States with at least 400 full-time students for the year 2019.

\(~\)

How `ggplot2` creates graphics

The ggplot2 package builds graphics in a structured manner using layers. Layers are added to a graph sequentially, with each serving a particular purpose, such as:

Displaying raw data
Displaying a statistical summary
Adding metadata (e.g., annotations, context, references)

Consider the following examples:

## Example #1.1 - nothing (just a base layer)
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition))

## Example #1.2 - add a layer displaying raw data
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition)) +
  geom_point()

## Example #1.3 - add another layer (smoother)
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition)) +
  geom_point() +
  geom_smooth()

## Example #1.4 - add another layer (reference text)
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition)) +
  geom_point() +
  geom_smooth() + 
  annotate("text", x = 20000, y = 40000, label = "Data from 2019")

Note: A high-quality graphic doesn’t require any particular number of layers. In fact, more layers can sometimes detract from the clarity of a data visualization.
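
One way to see that layers are added one at a time is to store a plot in a variable and inspect it. The sketch below is only an illustration: it assumes the mpg data that ships with ggplot2 (swapped in so the code runs without a download) and counts the layers stored in the plot object after each step.

```r
library(ggplot2)

## Base layer only: no geoms have been added yet
p0 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

## Add one layer, then a second
p1 <- p0 + geom_point()
p2 <- p1 + geom_smooth(method = "lm", se = FALSE)

## Each "+" appends to the list of layers stored in the plot
length(p0$layers)
length(p1$layers)
length(p2$layers)

## Display the final plot
p2
```

Storing intermediate plots like this is also a convenient way to try out alternative layers without retyping the base layer.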

\(~\)

More on the base layer

The mapping and data arguments provided in the base layer are carried forward to all subsequent layers, which is often desirable. However, any layer can override an inherited value by setting it locally, as Example #2.2 does for color.

## Example #2.1
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Region)) +
  geom_point()

## Example #2.2 (local override of color aesthetic)
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Region)) +
  geom_point(color = "red")

Specifying aesthetics locally within layers:

## Example #3.1
ggplot(data = colleges) +
  geom_point(mapping = aes(x = Cost, y = Net_Tuition)) 

## Example #3.2
ggplot(data = colleges) +
  geom_point(mapping = aes(x = Cost, y = Net_Tuition)) +
  geom_point(mapping = aes(x = Cost, y = Enrollment), color = "red")

Local specification is most useful when you want to add layers that use the same aesthetics (“x” and “y” in the example above) but map them to different variables or data. Common examples are drawing a line through several different group means of an ordinal categorical variable, or displaying both polygons and points on a map.
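
The data argument can be overridden locally in the same way as an aesthetic. The following is a minimal sketch, assuming the built-in mpg data rather than the colleges data; the subset name two_seaters is made up for this illustration.

```r
library(ggplot2)

## A second data frame containing only the rows to highlight
two_seaters <- subset(mpg, class == "2seater")

## x and y are inherited from the base layer; only "data" is overridden locally
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(color = "grey60") +
  geom_point(data = two_seaters, color = "red", size = 3)
p
```

Because the second layer inherits the x and y mappings, the highlighted points land exactly on top of their grey counterparts.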

\(~\)

Creating Effective Visualizations

The fundamental principles of creating effective data visualizations are quite simple. In short, an effective visualization should:

Clearly convey a particular message
Let the data speak for itself with minimal distortion
Avoid “sales tactics” and unnecessary frills

It’s helpful to understand these principles with a few examples of effective and ineffective visuals:

Example #1

Consider survey data on the popularity of different internet browsers over time:

Which graph is more effective? Why?

\(~\)

Example #2

Consider the heights of men and women in the NHANES sample:

Which graph is more effective? Why?

\(~\)

Lab

At this point you will begin working with your partner. Please read through the text/examples and make sure you both understand before attempting to answer the embedded questions.

ggplot graphics are a very expansive topic and not everything can be self-contained in this lab. Throughout the lab, I encourage you to reference the following resources if you need to figure out the proper function or syntax for a particular task:

\(~\)

Terminology

The ggplot framework is unique because all graphics are grammatically defined using the following terminology:

Aesthetics (or “aes”) - mappings of variables to visual cues representing their values (e.g., position on the x-axis)
Geometric elements (or “geom”) - what you actually see in the plot (e.g., points, lines)
Scales - guidelines for how aesthetic mappings should be displayed (e.g., logarithmic x-axis, red to blue color palette)
Guides (or “legends”) - references to help a human reader interpret the aesthetics
Facets - rules specifying how to break up and separately display subsets of data

Question #1: Identify and briefly describe each term mentioned above in the graphic created by the code below.

ggplot(data = colleges, mapping = aes(x = Adm_Rate, color = Private)) + 
  geom_density() +
  scale_x_continuous(transform= "reverse") +
  facet_wrap(~Region)

Question #2: Create a histogram of the variable “Enrollment” displayed on the log2-scale and faceted by the variable “Private”. Use the ggplot2 cheatsheet to help you identify the necessary functions and arguments.

\(~\)

Themes

Themes are pre-built style templates used to better tailor a graphic to the mode of publication.

The example below applies a black and white theme to Example 2.1 from the preamble.

## Example #2.1 w/ black and white theme
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Region)) +
  geom_point() +
  theme_bw()

Other pre-built themes:

theme_bw()
theme_linedraw(), theme_light() and theme_dark()
theme_minimal()
theme_classic()
theme_void()

You can judge the differences in these themes below:

Any theme can be further customized using theme(). Most commonly this function is used to remove a graph’s legend:

## Example #2.1 w/ black and white theme and no legend
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Region)) +
  geom_point() +
  theme_bw() +
  theme(legend.position = "none")
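
Removing the legend is only one of many adjustments theme() can make. As a hedged sketch (using the built-in mpg data as a stand-in, with settings chosen purely for illustration), the code below moves the legend, rotates the x-axis text, and hides the minor grid lines.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  theme_bw() +
  theme(legend.position = "bottom",                        # move the legend below the panel
        axis.text.x = element_text(angle = 45, hjust = 1), # rotate x-axis tick labels
        panel.grid.minor = element_blank())                # hide minor grid lines
p
```

Note that theme() is added after theme_bw(); a complete theme added later would overwrite these customizations.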

Question #3: The code below creates a line graph that depicts (via a smoothed moving average) the approval of former presidents Jimmy Carter, Ronald Reagan, and Barack Obama. Modify the second portion of this code to try out a few different non-default themes. Then, briefly discuss (1-2 sentences) which themes you feel are most effective and least effective for this type of graph. Include the graph with your preferred theme in your lab write-up.

## Data processing
approval <- read.csv("https://bit.ly/398YR6M")
approval$Week = as.numeric(difftime(as.Date(approval$End.Date, "%m/%d/%y"), as.Date(approval$Inaug.Date, "%m/%d/%y"), units = "weeks"))
approval2 = subset(approval, President %in% c("Reagan", "Carter", "Obama"))

## Creating the graph
ggplot(data = approval2, mapping = aes(x = Week, y = Approving, color = President)) +
 geom_smooth(method = "loess", span = 0.6, se = FALSE)

\(~\)

Labels and Annotations

Labels and annotations are important aspects of well-constructed data visualizations. They are used to provide context, or draw the viewer’s attention towards particular aspects of the graphic.

Labels corresponding to aesthetics (such as x, y, color, etc.) are controlled using the labs() function:

ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
  geom_point() +
  labs(x = "Sticker Cost", y = "Price Paid", color = "Admissions Rate")

Annotations are added to the graphic as a layer using the annotate() function:

ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
  geom_point() + 
  annotate(geom = "rect", xmin = 65000, xmax = 77000, ymin = 20000, ymax = 52000, color = "red", alpha = 0.2)

The example above annotates a scatter plot by drawing a rectangle with a red border and a mostly transparent fill (alpha = 0.2 means 20% opacity) around the cluster of colleges with high costs, high net tuition, and low admissions rates.
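
In addition to aesthetic labels, labs() accepts title, subtitle, and caption arguments. The sketch below combines these with a text annotation; it assumes the built-in mpg data, and the annotation coordinates were picked by eye for this data set only.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  labs(title = "Highway mileage falls as engine size grows",
       subtitle = "Each point is one car model in the mpg data",
       caption = "Source: mpg data set from ggplot2",
       x = "Engine displacement (liters)",
       y = "Highway miles per gallon",
       color = "Drive train") +
  ## Text annotation placed by hand in data coordinates
  annotate(geom = "text", x = 6.5, y = 30, label = "2-seater sports cars")
p
```

Annotation positions are given in the units of the data, so they usually need to be adjusted whenever the data or the axis scales change.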

Question #4: Use the subset() function to create a data frame containing only colleges located in the state of Iowa. Using these data, create a box plot of admissions rates, rename the x-axis label to “Admissions Rate”, and add a text annotation above the outlier saying “Grinnell”.

\(~\)

Scales

Scales map values in the data space to the aesthetic space (e.g., where and how should data with Adm_Rate=0.2, Cost=40000, and Net_Tuition=10000 appear on the graphic?).

Scales can be modified by adding layers using functions whose names follow the general format: “scale_aesthetic_function()”. Shown below are a few examples:

## Default Scales
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
  geom_point()

## Put cost (the "x" aesthetic) on the log2 scale
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
  geom_point() +
  scale_x_continuous(transform= "log2")

## Use a gradient from purple to yellow to display Adm_Rate
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
  geom_point() +
  scale_color_gradient(low = "purple", high = "yellow")

## Use the popular "viridis" color scale, reversing the default direction via "-1"
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
  geom_point() +
  scale_color_continuous(type = "viridis", direction = -1)

There are two things you should note at this point:

There are variants of scale functions depending upon the type of the variable being used by that aesthetic. In these examples, scale_color_continuous() is used because “Adm_Rate” is a continuous numeric variable. However, if colors were being determined by a character string you’d want to use scale_color_brewer() instead (or one of the related color scale functions designed for discrete variables).
The color aesthetic is not synonymous with all of the color you see on the graph. For example, the fill aesthetic can be used to add color to a bar chart, and you must use a function like scale_fill_brewer() to change the fill color.
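
Both points can be seen in one small example. This is a sketch under the assumption that the built-in mpg data is an acceptable stand-in: the variable drv is categorical, and the bars are colored through the fill aesthetic, so a discrete fill scale is the appropriate choice.

```r
library(ggplot2)

## "drv" is categorical, so a discrete (brewer) scale is used;
## bars are colored via the fill aesthetic, so the scale must be scale_fill_*
p <- ggplot(data = mpg, mapping = aes(x = class, fill = drv)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set2")
p
```

Swapping in scale_color_brewer() here would leave the bar colors unchanged, because no variable is mapped to the color aesthetic in this plot.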

Question #5: Create a 2-dimensional filled density plot (using geom_density_2d_filled()) with the aesthetics “x = Adm_Rate” and “y = Net_Tuition” using the “viridis” color scale and a “reverse” x-axis that goes from 1.00 to 0.00. Then, write a sentence or two describing the combinations of admissions rate and net tuition that occur most frequently among US colleges.

\(~\)

Stats

Sometimes we’d like to display a statistical transformation (mean, error bars, etc.) alongside the data itself. While this could be accomplished by creating a separate data frame, it’s generally better to use a stat_ function:

ggplot(data = colleges[1:30,], mapping = aes(x = Net_Tuition, y = Private)) +
  geom_point() + 
  stat_summary(fun = mean, geom = "point", color = "red", size = 3) +
  stat_summary(fun.data = mean_se, geom = "errorbar", color = "red")

The example above adds error bars extending 1 standard error above and below the mean net tuition for the private and public colleges in the first 30 rows of the colleges data frame.

Notice how the first instance of stat_summary() uses the fun argument (a simple option designed to return a single number/vector), while the second uses the fun.data argument (a complex option designed to return a data frame, which contains the lower and upper endpoints of the interval in this example).
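
To make the difference concrete, it can help to call both kinds of function directly on a small vector. The values in x below are arbitrary numbers chosen for illustration, not data from the lab.

```r
library(ggplot2)

x <- c(1, 2, 3, 4, 5)

## A function suitable for "fun": returns a single number
mean(x)

## A function suitable for "fun.data": returns a data frame with columns y, ymin, ymax
out <- mean_se(x)
out

## ymin and ymax sit one standard error (sd / sqrt(n)) below and above the mean
se <- sd(x) / sqrt(length(x))
c(lower = mean(x) - se, upper = mean(x) + se)
```

Any function that returns a data frame with these column names can be supplied to fun.data in the same way.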

Question #6: Using the data subset containing only colleges located in Iowa (you created this in Question #4), create a graph similar to the example above but using the arguments geom = "linerange", alpha = 0.3, size = 4, and color = "red" to depict 1 standard error above/below the means of private and public colleges.

\(~\)

Facets

Relying on large numbers of aesthetics to include additional variables on a graph can quickly become overwhelming. Facets allow you to display multiple side-by-side graphs according to one or more categorical variables.

facet_wrap() is designed to display the data broken down by a single categorical variable
facet_grid() is designed to display the data broken down by all combinations of two categorical variables

Two examples of faceting are shown below:

## create a subset for example purposes
reduced_colleges = subset(colleges, Region %in% c("Great Lakes", "Far West", "Plains"))

## facet_wrap
ggplot(data = reduced_colleges, mapping = aes(x = Cost)) +
  geom_density() +
  facet_wrap(~Region, nrow = 1)

## facet_grid
ggplot(data = reduced_colleges, mapping = aes(x = Cost)) +
  geom_density() +
  facet_grid(Private~Region)

The facet_wrap() function will wrap a series of plots according to a fixed number of rows or columns, while facet_grid() will construct a 2-dimensional grid whose panels correspond to the unique combinations of values in the variables used in the grid formula.
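
The difference in layout can be checked on a data set that needs no download. The sketch below assumes the built-in mpg data: facet_wrap() produces one panel per level of class, while facet_grid() produces one panel for every combination of drv and cyl, including combinations that contain no cars.

```r
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  geom_histogram(bins = 15)

## facet_wrap: one panel per level of "class", wrapped into 3 columns
p_wrap <- base + facet_wrap(~class, ncol = 3)

## facet_grid: rows are levels of "drv", columns are levels of "cyl"
p_grid <- base + facet_grid(drv ~ cyl)

p_wrap
p_grid
```

Empty panels in the grid are informative in their own right, since they show which combinations never occur in the data.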

\(~\)

Visual Cues for Encoding Data

The fundamental idea of data visualization is to convey a particular message by exploiting human understanding of visual cues. The most common strategies represent differences in the observed data by visual differences in:

Position
Length
Angle
Area
Color hue
Color brightness/intensity

However, as we saw in Example #1 from the preamble, not all of these are equally effective. For example, bar charts (lengths) are more effective than pie charts (angles and areas).

In fact, we could make the pie chart from Example #1 even less effective by removing any possible comparison of angles (using a graph called a “donut chart”):

Research by Cleveland and McGill has shown that assessments based upon position and length are the most accurate. Judgments of angle or color are somewhat less accurate but still acceptable, judgments of area and brightness are substantially less accurate, and judgments of volume are the least accurate.
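
The same counts can be drawn with different visual cues in ggplot2, which makes it easy to compare them for yourself. This is only a sketch using the built-in mpg data: the bar chart encodes counts by length along a common baseline, while the pie chart (a single stacked bar wrapped into polar coordinates) encodes them by angle and area.

```r
library(ggplot2)

## Bar chart: counts are encoded by length along a common baseline
p_bar <- ggplot(data = mpg, mapping = aes(x = class)) +
  geom_bar()

## Pie chart: the same counts encoded by angle and area
## (a single stacked bar wrapped into polar coordinates)
p_pie <- ggplot(data = mpg, mapping = aes(x = "", fill = class)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")

p_bar
p_pie
```

Try ranking the seven classes from most to least common using each graph and notice which one makes the task easier.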

Question #7: Consider the following data visualizations, whose goal is to compare the population of Suffolk County, MA (where Boston is located) with all other counties in the northeastern United States.

Part A: Which visual cues are used in each visualization?
Part B: Which graph does the research of Cleveland and McGill suggest would be more effective? Do you agree?

Note: you do not need to write any R code for this question

\(~\)

Histograms and Density Plots

Histograms and density plots display the distribution of a quantitative variable.

Faceting provides a strategy to show distributional differences across subsets of data. Effective use of faceting will align the scales of your axes (using scales = "fixed" in the faceting function, which is the default) throughout the graph to facilitate accurate comparisons:

library(dslabs)
data(heights)

ggplot(heights, aes(x = height)) + geom_histogram(aes(y = after_stat(density)), bins = 20) + 
    facet_wrap(~sex, nrow = 1, scales = "free") + labs(title = "Graph #1 (Difficult)")
ggplot(heights, aes(x = height)) + geom_histogram(aes(y = after_stat(density)), bins = 20) + 
    facet_wrap(~sex, nrow = 2, scales = "fixed") + labs(title = "Graph #2 (Effective)")

Notice how vertical alignment combined with a common x-axis facilitates a clear comparison in Graph #2, while Graph #1 makes it very difficult to judge whether the male and female distributions differ.

Alternatively, depending upon the number of comparisons, you might want to consider a single plot:

ggplot(heights, aes(x = height, color = sex, fill = sex)) + geom_histogram(aes(y = after_stat(density)), alpha = 0.5, bins = 20, position = "identity") + labs(title = "Graph #3 (Also effective?)")

Question #8: In the examples shown above, the additional aesthetic mapping y = after_stat(density) instructs geom_histogram() to use the density scale for the heights of the histogram bars (displayed using the y-axis). What happens when you remove this mapping from the code used to produce Graph #3? What happens when you remove it from the code used to produce Graph #2? Based upon what you’ve observed, briefly describe when and how this additional mapping should be used.

Question #9: For the colleges data, create a collection of density plots that effectively display the distribution of the variable “Enrollment” across the following regions: “South East”, “Plains”, “Great Lakes”, and “New England”. Do not display enrollments in any other regions. Hint: you may use the %in% operator to check if a region is contained in a set of target character strings.

\(~\)

Dotplots, Boxplots, and Violin Plots

Histograms and density plots are great at showing the shape of a variable’s distribution, but they don’t scale well when many distributions are to be compared.

In contrast, dot plots, box plots, and violin plots are suitable alternatives when comparing more than 2-3 groupings of data.

However, simply switching the geom is not enough to produce an effective graph. Consider the following three dot plots, each of which displays the relationship between the mean of “Avg_Fac_Salary” and the variables “Region” and “Private”.

## Data subset used for these graphs
datasub = subset(colleges, Region %in% c("South East", "South West", "Far West", "Mid East", "Great Lakes", "Plains", "New England"))

## Dot plot #1
ggplot(datasub, aes(x = Avg_Fac_Salary, y = Private, color = Region)) +
  labs(title = "Dot Plot #1") +
  stat_summary(fun.data = mean_se)

## Dot plot #2
ggplot(datasub, aes(x = Avg_Fac_Salary, y = Private)) +
  labs(title = "Dot Plot #2") +
  stat_summary(fun.data = mean_se, color = "red", alpha = 0.3) +
  facet_wrap(~Region)

## Dot plot #3
ggplot(datasub, aes(x = Avg_Fac_Salary,
                    y = reorder(Region, X = Avg_Fac_Salary, FUN = mean, na.rm = TRUE),
                    color = Private)) +
  labs(title = "Dot Plot #3", y = "Region") +
  stat_summary(fun.data = mean_se)

Dot plot #1 is ineffective because it’s too difficult to quickly compare within a region.
Dot plot #2 effectively allows for comparisons within a region, but comparing between regions is difficult because the facet panels are unorganized and take up too much space.
Dot plot #3 is the most effective of these graphs because it allows for comparisons within a region, and it also facilitates comparisons between regions through compactness and reordering of the y-axis.

For simplicity these graphs only show the mean \(\pm\) 1 standard error for each region. But you could attempt to show the entire distribution using geom_violin() in place of or in addition to stat_summary(), or you could show a more detailed statistical summary using geom_boxplot():

ggplot(datasub, aes(x = Avg_Fac_Salary,
                    y = reorder(Region, X = Avg_Fac_Salary, FUN = mean, na.rm = TRUE),
                    color = Private)) +
  geom_violin() +
  labs(title = "Using geom_violin", y = "Region")

ggplot(datasub, aes(x = Avg_Fac_Salary,
                    y = reorder(Region, X = Avg_Fac_Salary, FUN = mean, na.rm = TRUE),
                    color = Private)) +
  geom_boxplot() +
  labs(title = "Using geom_boxplot", y = "Region")

Note: The examples above also illustrate the reorder() function, which can rearrange the categories of a variable according to a function of another variable in the data. In these examples, the categories of “Region” are rearranged by the mean value of “Avg_Fac_Salary” within each region.
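
The effect of reorder() is easiest to see outside of a plot. The sketch below uses the built-in iris data (an assumption made so the example is self-contained) and compares the default category order with the order produced by reorder().

```r
## Default ordering of a factor's categories is alphabetical
levels(iris$Species)

## Mean sepal width within each species
tapply(iris$Sepal.Width, iris$Species, mean)

## reorder() rearranges the categories from smallest to largest mean
species_by_width <- reorder(iris$Species, X = iris$Sepal.Width, FUN = mean)
levels(species_by_width)
```

The data values themselves are unchanged; only the order of the categories (and therefore the order in which they appear on an axis) is different.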

Question #10: In the United States, federal law defines a Hispanic-serving institution (HSI) as a college or university where 25% or more of the total undergraduate student body is Hispanic or Latino. The code below uses this definition and the ifelse() function to add a new binary categorical variable, “HSI”, to the colleges data. For this question, your goal is to create a graph that effectively compares the variable “Net_Tuition” for HSI and non-HSI colleges in the states of “CA”, “FL”, “NY” and “TX”. That is, your graph should allow for easy comparisons of the variables “Net_Tuition” (or a summary of it), “HSI”, and “State” (displaying only the four aforementioned states). Hint: You should remove any colleges with missing values of “HSI” by including a logical condition involving !is.na() (which will return TRUE if a college is not missing that variable).

colleges$HSI = ifelse(colleges$PercentHispanic >= 0.25, "HSI", "No")
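
To see why the hint about missing values matters, consider how ifelse() treats NA. The sketch below uses three hypothetical proportions (not values from the colleges data) and a made-up data frame named toy.

```r
## Hypothetical proportions, including one missing value
pct <- c(0.10, 0.30, NA)

## A comparison involving NA is NA, so ifelse() returns NA for that element
hsi <- ifelse(pct >= 0.25, "HSI", "No")
hsi

## !is.na() is TRUE only where a value is present, so it can be used to drop missing rows
toy <- data.frame(pct = pct, hsi = hsi)
subset(toy, !is.na(hsi))
```

If rows with a missing category are not removed, ggplot2 will typically display NA as its own group on the graph.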

\(~\)

Scatterplots

Scatter plots are used to display relationships between two numeric variables using position. However, aesthetics like color, point character, or brightness allow for additional variables to be included in the graph. Below are several tips for creating more effective scatter plots:

1. Use annotations rather than legends

The most natural way to add a third variable into a scatter plot is the color aesthetic. By default, adding color into aes will create a legend on the side of the plot describing how the chosen variable is mapped to the colors seen on the graph.

From a visual processing perspective this is less efficient than placing color annotations near the relevant regions of the plot:

data("iris")
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) + geom_point() + labs(title = "Using a legend")

ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) + geom_point() + labs(title = "Using annotations") + 
  guides(color="none") +  ## This removes the "guides" for the color aesthetic
  annotate(geom = "text", x = c(2.5, 4, 6), y = c(0.25,0.75,1.25), label = c("Setosa", "Versicolor", "Virginica"), color = c("red", "darkgreen", "blue"))

The example above demonstrates how annotations can allow a viewer to understand the essence of a scatter plot more quickly.
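
Hand-picked label coordinates work for a single graph, but they can also be computed from the data. The following is one possible approach, not the only one: it places each species name near the mean of its group using geom_text() with a small summary data frame (named centers here purely for illustration).

```r
library(ggplot2)

## Mean petal length and width for each species (one row per species)
centers <- aggregate(cbind(Petal.Length, Petal.Width) ~ Species, data = iris, FUN = mean)

p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point(alpha = 0.4) +
  ## Labels inherit x, y, and color from the base layer but use the summary data
  geom_text(data = centers, mapping = aes(label = Species),
            fontface = "bold", vjust = -1.5) +
  guides(color = "none")
p
```

Because the labels inherit the color mapping, each one automatically matches the color of its points, at the cost of less control over exact placement than annotate() offers.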

\(~\)

2. Use scale transformations to show more of the data

The goal of any graph is to show your data. If outliers or skew are leading to a lot of blank space in your graph, you might consider a scale transformation:

ggplot(colleges, aes(x = Enrollment, y = Net_Tuition)) + geom_point() + labs(title = "A few outliers dominate your attention")

Sometimes scale transformations can lead to new insights:

ggplot(colleges, aes(x = Enrollment, y = Net_Tuition)) + geom_point() + labs(title = "Two distinct clusters?") + scale_x_continuous(transform = "log2")

ggplot(colleges, aes(x = Enrollment, y = Net_Tuition, color = Private)) + geom_point() + labs(title = "Simpson's Paradox!") + scale_x_continuous(transform= "log2") + 
  guides(color="none") + 
  annotate(geom = "text", x = c(2000, 52000), y = c(52000,25000), label = c("Private", "Public"), color = c("red","cyan3"))

\(~\)

3. Avoid using scale transformations that hide or obscure important information

In some situations, outliers distract from the main purposes of a graph, but in other situations the outliers are the most interesting aspect of the data. Shown below is a counter example to Tip #2:

In this application, people generally have the greatest interest in knowing the military expenditures of the small number of countries that are currently viewed as having strong geopolitical aspirations. The smaller countries are less interesting, and the main purpose of displaying them is to highlight just how extreme the spending of the world’s top military powers is relative to the average nation.

\(~\)

4. Know when to use a diverging color gradient

A third numeric variable can be included in a scatter plot using color, size or brightness, with color being the most effective choice. When mapping numeric values to different colors there are two possible options:

sequential scales - most useful in distinguishing high values from low values (the “viridis” scale is an example)
diverging scales - used to put equal emphasis on both the high and low ends of the data range

The examples above use pre-built color palettes via the function scale_color_distiller(). You should read the help documentation of this function to see a list of possible color palette choices.
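
As a hedged illustration of the two kinds of scale, the sketch below applies scale_color_distiller() to the built-in mtcars data (an assumption; these are not the lab's original examples). The derived column mpg_diff and the symmetric limits are choices made for this sketch so that the middle of the diverging palette corresponds to the average.

```r
library(ggplot2)

## Sequential palette: emphasizes the low-to-high ordering of horsepower
p_seq <- ggplot(mtcars, aes(x = wt, y = mpg, color = hp)) +
  geom_point(size = 3) +
  scale_color_distiller(palette = "Blues", direction = 1)

## Diverging palette: color shows distance above or below the average mpg
cars2 <- mtcars
cars2$mpg_diff <- cars2$mpg - mean(cars2$mpg)
lim <- max(abs(cars2$mpg_diff))   # symmetric limits so the palette's midpoint falls at 0

p_div <- ggplot(cars2, aes(x = wt, y = hp, color = mpg_diff)) +
  geom_point(size = 3) +
  scale_color_distiller(palette = "RdBu", limits = c(-lim, lim))

p_seq
p_div
```

A diverging scale is only meaningful when the variable has a natural midpoint, such as zero or an average; otherwise a sequential scale is usually the safer choice.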

\(~\)

Question #11: Using the “colleges” data, create a scatter plot displaying the relationship between the variables “Cost”, “ACT_median”, and “Private”. Use annotations rather than a legend to display the color aesthetic.

Question #12: Create a scatter plot that displays three numeric variables from the “colleges” data. Then, write 2-3 sentences justifying the choices (e.g., diverging vs. sequential scales) you made in constructing the plot.

\(~\)

Practice

The code below will load a data set containing 970 Hollywood films released between 2007 and 2011, then reduce these data to only include variables that could be known prior to a film’s opening weekend. The data are then simplified further to only include the four largest studios (Warner Bros, Fox, Paramount, and Universal) in the three most common genres (action, comedy, drama). You will use the resulting data (ie: movies_subset) for Question #13.

movies = read.csv("https://remiller1450.github.io/data/HollywoodMovies.csv")
movies_subset = subset(movies, LeadStudio %in% c("Warner Bros", "Fox", "Paramount", "Universal") & 
                               Genre %in% c("Action", "Comedy", "Drama"),
                       select = c("Movie", "LeadStudio", "Story", "Genre","Budget",
                                  "TheatersOpenWeek","Year","OpeningWeekend"))

Question #13: Create a graphic that shows the relationship between the outcome variable “OpeningWeekend” and the explanatory variables “Budget”, “Genre”, and “LeadStudio”, and that satisfies the following requirements:

It uses a non-default theme.
It changes the label of at least one variable to make it appear more professional (for example, add a space so that your graph shows “Opening Weekend” instead of “OpeningWeekend”)
It uses the color aesthetic in some capacity, and it uses a non-default color scale.

\(~\)

Question #14:

For the following questions you should use the data contained at the link below:

mass = read.csv("https://remiller1450.github.io/data/MassShootings.csv")

These data come from a database of mass shootings maintained by the Mother Jones news organization. They consider two types of incidents, Type = "Mass" defined by at least 4 fatalities in a single location and Type = "Spree" defined by at least 4 fatalities across a set of related locations/incidents.

Part A: Following the example(s) at this reference page create a bar chart showing the number of mass shootings by “Place” and use the argument aes(label = after_stat(count)) to add a label above each bar to more clearly display the count.

Part B: Construct a scatterplot relating the variables “Year” and “Victims” and notice the outlier in 2017. Use an annotation to highlight this outlier as interesting.

Part C: Use a dot plot to effectively display the mean and standard error for the number of fatalities by “Place” and “Type”. Try to use strategies from the lab’s “Dotplots, Boxplots, and Violin Plots” section. Pay special attention to the use of the reorder() function in these examples. You may ignore any warnings related to missing values being removed (as some category combinations do not have enough data to plot).

2025-09-15
