Webscraping refers to the act of extracting data from
websites. We have previously pulled data from the web using functions
such as read.csv() in R, but only to load a data file that happens to be
hosted on a website (example shown below). That is not usually
considered webscraping.
# example only, do not read into R
pres_approval_data = read.csv("https://nfriedrichsen.github.io/data/pres_approval_data.csv")
In many cases, websites contain useful data in the form of tables, lists, articles, or links that would take a long time to copy by hand. Webscraping is the use of tools (packages, functions) that allow us to read the HTML structure of a webpage and extract the pieces of information we want in a more organized way.
There are different levels of web scraping. Manual web scraping usually means collecting information from a single webpage or a small number of pages by directly running code yourself. This is useful for quick data collection and learning how websites are structured.
Automated web scraping is more advanced and involves writing programs that repeatedly scrape many pages on a schedule or loop through large websites automatically.
Note: This lab will focus only on manual web scraping.
Many websites provide a file called robots.txt that
gives instructions to automated tools about which parts of the site
should or should not be accessed. While robots.txt is not
legally binding in most cases, it is considered good practice to check
and follow it. For example, a website may allow scraping of public blog
posts but block access to login pages or search features. You can often
view this file by adding /robots.txt to the end of a website’s root URL. For example, Reddit’s
file is at https://www.reddit.com/robots.txt.
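As a rough illustration of what a robots.txt file contains, the sketch below pulls the blocked paths out of a small hand-typed example. The file contents here are hypothetical and not taken from any real site; the commented-out readLines() call shows how a real file could be read instead.

```r
# Hypothetical robots.txt contents, typed in by hand for illustration
robots_txt <- c(
  "User-agent: *",
  "Disallow: /login",
  "Disallow: /search",
  "Allow: /blog/"
)

# For a real site, the file could be read directly, for example:
# robots_txt <- readLines("https://en.wikipedia.org/robots.txt")

# Keep only the Disallow lines, then strip the "Disallow:" prefix
disallow_lines <- grep("^Disallow:", robots_txt, value = TRUE)
disallowed <- sub("^Disallow:[[:space:]]*", "", disallow_lines)
disallowed
```

Any path listed in disallowed is one the site asks automated tools to stay away from.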
Some websites do not scrape cleanly because modern webpages are often
built dynamically using JavaScript. Packages like rvest
primarily read the raw HTML sent by the server, but some content is only
loaded later by your browser after the page is opened. In those cases,
the data may not appear in the HTML that rvest sees.
Websites may also intentionally make scraping difficult through login
systems, CAPTCHAs, or changing page structures.
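The sketch below mimics a dynamic page on a small scale. It uses rvest's minimal_html() to build a tiny hand-written page (hypothetical content) in which one paragraph is sent in the HTML and another would only be created by JavaScript in a browser. Because rvest does not run the script, only the first paragraph can be found.

```r
library(rvest)

# A tiny hand-written page: one paragraph is in the HTML itself,
# the other would only exist after a browser runs the script
page <- minimal_html('
  <div id="static"><p>Sent by the server</p></div>
  <div id="dynamic"></div>
  <script>
    var p = document.createElement("p");
    p.textContent = "Added by JavaScript";
    document.getElementById("dynamic").appendChild(p);
  </script>
')

static_text <- page %>% html_elements("#static p") %>% html_text2()
dynamic_text <- page %>% html_elements("#dynamic p") %>% html_text2()

length(static_text)   # one paragraph found
length(dynamic_text)  # nothing found, the script never ran
```

An empty result like dynamic_text is often the first sign that a page builds its content with JavaScript.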
We will be using the rvest package (pronounced like
‘harvest’ without the ‘h’ sound, clever). This package is one of many
that can be used for webscraping. It has functions that allow us to
extract tables, text, or links from webpages and store them in R.
# install.packages("rvest")
library(rvest)
library(dplyr)
Some common functions in rvest include read_html(),
which reads a webpage into R, html_elements(), which
selects parts of the page, and html_text2(), which extracts
readable text from those elements. The function
html_table() is especially useful because it can
automatically convert HTML tables on a webpage into R data frames. Other
functions such as html_attr() can extract things like URLs
from links or image sources from pictures.
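To see these functions side by side without needing an internet connection, here is a minimal sketch on a tiny hand-written page. The page contents are made up for illustration, and minimal_html() stands in for read_html() so that no URL is needed.

```r
library(rvest)

# A small hand-written page (hypothetical content) instead of a real URL
page <- minimal_html('
  <h1>Fruit prices</h1>
  <p class="note">Prices are <b>per pound</b>.</p>
  <a href="/apples">Apples</a>
  <a href="https://example.com/pears">Pears</a>
  <table>
    <tr><th>fruit</th><th>price</th></tr>
    <tr><td>apple</td><td>1.50</td></tr>
    <tr><td>pear</td><td>2.25</td></tr>
  </table>
')

# html_elements() selects pieces of the page; "p.note" means
# <p> elements with class "note"
note_text <- page %>% html_elements("p.note") %>% html_text2()

# The same <a> elements give link text or link targets,
# depending on which extractor is used
link_text <- page %>% html_elements("a") %>% html_text2()
link_urls <- page %>% html_elements("a") %>% html_attr("href")

# html_table() converts an HTML table into a data frame (tibble)
fruit_tbl <- page %>% html_element("table") %>% html_table()
fruit_tbl
```

The pattern is always the same: read the page, select elements, then extract text, attributes, or tables from them.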
We are going to start with a very basic example using a nicely formatted Wikipedia page: the list of the largest cities in the world. Open the page in your browser (the URL is in the code below) and take a look at it. The page contains a helpful data table that we want to get into R.
To start, we will save the URL in R and use the read_html()
function to download and parse the page.
url <- "https://en.wikipedia.org/wiki/List_of_largest_cities"
page <- read_html(url)
page
## {html_document}
## <html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-menu-disabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available skin-theme-clientpref-thumb-standard" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="skin--responsive skin-vector skin-vector-search-vue mediawik ...
Printing the page object shows an html_document with two top-level
pieces, <head> and <body>. Internally it is stored as a list
(similar to working with API results). To see what is stored in
this object, go back to the actual Wikipedia page, right-click, and choose
Inspect. This will pull up the HTML code that our page
object is storing. Hovering over elements in the inspector highlights the
matching part of the page, which helps us find where the table is stored
in the code, if we cared to do this.
Conveniently, html_table() can find and extract every data table on the webpage for us.
# this accesses all of the data tables
tables <- page %>%
  html_table()
# there are 5 tables on this page
length(tables)
## [1] 5
# inspect the tables
tables
Not all of these tables are going to be helpful, as some are just for webpage formatting. Table 2 is the one we want.
cities <- tables[[2]]
glimpse(cities)
## Rows: 84
## Columns: 13
## $ `City[a]` <chr> "City[a]", "Jakarta", "Dhaka", "Tok…
## $ Country <chr> "Country", "Indonesia", "Bangladesh…
## $ `UN 2025 population estimates[11]` <chr> "UN 2025 population estimates[11]",…
## $ `City proper[b]` <chr> "Definition", "Special region", "Ca…
## $ `City proper[b]` <chr> "Population", "10,154,134", "10,295…
## $ `City proper[b]` <chr> "Area.mw-parser-output .nobold{font…
## $ `City proper[b]` <chr> "Density(/km2)", "15,292[13]", "30,…
## $ `Urban area[12]` <chr> "Population", "33,756,000", "19,134…
## $ `Urban area[12]` <chr> "Area(km2)", "3,546", "619", "8,231…
## $ `Urban area[12]` <chr> "Density(/km2)", "9,519[d]", "30,91…
## $ `Metropolitan area[c]` <chr> "Population", "33,430,285", "36,585…
## $ `Metropolitan area[c]` <chr> "Area(km2)", "7,063", "2,570", "13,…
## $ `Metropolitan area[c]` <chr> "Density(/km2)", "4,733[14]", "14,2…
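Picking a table by position, as with tables[[2]], can break if the page layout changes. One alternative is to select the table by its CSS class before calling html_table(). The sketch below uses a hand-written page with two tables; the class names layout and wikitable are assumptions for illustration (Wikipedia commonly marks its data tables with the wikitable class, but check with Inspect before relying on it).

```r
library(rvest)

# A hand-written page with one formatting table and one data table
page <- minimal_html('
  <table class="layout"><tr><td>navigation</td><td>box</td></tr></table>
  <table class="wikitable">
    <tr><th>city</th><th>population</th></tr>
    <tr><td>A</td><td>100</td></tr>
    <tr><td>B</td><td>250</td></tr>
  </table>
')

# html_table() on the whole page returns a list with every table
all_tables <- page %>% html_table()
length(all_tables)

# Selecting by class first returns just the data table
data_tbl <- page %>%
  html_element("table.wikitable") %>%
  html_table()
data_tbl
```

Note that html_element() (singular) returns the first match, so html_table() gives back a single data frame rather than a list.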
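Exercise 1: Reading this data table into R has caused some issues. The header names are broken: several columns share the same name, and the second header row was read in as the first row of data. Clean the data table by giving each column a proper, unique name and making sure the duplicate header is no longer the first row of the dataset.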
Exercise 2: Create a graphic using
ggplot2 that shows something interesting about this data.
We can use rvest to extract other things from the
webpage if we really cared to. For example, the following code extracts
the text of the clickable links on this webpage, and then the paths
(href attributes) those links point to.
links <- page %>%
  html_elements("a") %>%
  html_text2()
head(links, 5)
## [1] "Jump to content" "Main page" "Contents" "Current events"
## [5] "Random article"
urls <- page %>%
  html_elements("a") %>%
  html_attr("href")
head(urls, 5)
## [1] "#bodyContent" "/wiki/Main_Page"
## [3] "/wiki/Wikipedia:Contents" "/wiki/Portal:Current_events"
## [5] "/wiki/Special:Random"
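Most of the href values above are relative paths, so they cannot be opened or passed to read_html() as they are. A minimal sketch of turning them into full URLs is shown below, using url_absolute() from the xml2 package (installed alongside rvest). The urls vector is retyped by hand from the output above so the example runs on its own.

```r
library(xml2)

# The page the links were scraped from
base_url <- "https://en.wikipedia.org/wiki/List_of_largest_cities"

# A few of the relative paths from the output above, retyped by hand
urls <- c("#bodyContent", "/wiki/Main_Page", "/wiki/Wikipedia:Contents")

# Resolve each path against the page it came from
full_urls <- url_absolute(urls, base_url)
full_urls
```

Paths that start with / are resolved against the site root, while a fragment such as #bodyContent stays attached to the original page.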
We can access weather-related info about Grinnell (and any other city, really) from the National Weather Service (NWS). The forecast page URL is in the code below. Take a look at the website and see what features are on this page.
The following code can be used to make a small data table of weather forecasts for us. Note: tombstone-container is a CSS class name specific to the NWS website (the leading . in a selector means ‘class’). Why they chose this? I have no clue. You can verify the name by inspecting the website’s HTML elements.
url <- "https://forecast.weather.gov/MapClick.php?lat=41.7433&lon=-92.7274"
page <- read_html(url)
period <- page %>%
  html_elements(".tombstone-container .period-name") %>%
  html_text2()
short_desc <- page %>%
  html_elements(".tombstone-container .short-desc") %>%
  html_text2()
temp <- page %>%
  html_elements(".tombstone-container .temp") %>%
  html_text2()
forecast <- data.frame(
  period = period,
  short_desc = short_desc,
  temp = temp
)
forecast
## period short_desc temp
## 1 Today Mostly Sunny High: 72 °F
## 2 Tonight Mostly Clear Low: 45 °F
## 3 Saturday Increasing\nClouds High: 75 °F
## 4 Saturday Night Partly Cloudy Low: 44 °F
## 5 Sunday Sunny High: 64 °F
## 6 Sunday Night Mostly Clear Low: 39 °F
## 7 Monday Sunny High: 68 °F
## 8 Monday Night Partly Cloudy\nthen Chance\nShowers and\nBreezy Low: 49 °F
## 9 Tuesday Mostly Sunny\nand Breezy High: 77 °F
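The scraped forecast is all text: the descriptions contain line breaks (\n) and the temperatures are strings such as "High: 72 °F". One possible clean-up using base R string functions is sketched below. The first three rows are retyped by hand from the output above so the example runs without the website, and the new column names temp_type and temp_f are made up for illustration.

```r
# First three rows of the forecast, retyped by hand from the output above
# ("\u00b0" is the degree sign)
forecast <- data.frame(
  period = c("Today", "Tonight", "Saturday"),
  short_desc = c("Mostly Sunny", "Mostly Clear", "Increasing\nClouds"),
  temp = c("High: 72 \u00b0F", "Low: 45 \u00b0F", "High: 75 \u00b0F")
)

# Replace line breaks inside the descriptions with spaces
forecast$short_desc <- gsub("\n", " ", forecast$short_desc, fixed = TRUE)

# Everything before the colon is the type of temperature (High or Low)
forecast$temp_type <- sub(":.*$", "", forecast$temp)

# Drop every character that is not a digit or minus sign, then convert
forecast$temp_f <- as.numeric(gsub("[^0-9-]", "", forecast$temp))

forecast
```

With temp_f stored as a number, the forecast can be sorted, summarized, or plotted like any other numeric column.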
We can combine webscraping with the string tools we worked with
earlier this semester. The goal of this next example is to parse through
a Reddit thread and perform a sort of ‘sentiment analysis’: here, simply
using the frequency of certain words to gauge their popularity. The
following code uses a slightly different approach than what we did with
stringr functions.
# install.packages("tidytext")
library(tidytext)
url <- "https://www.reddit.com/r/DnD/comments/1cipr1c/what_areis_your_favorite_dd_classesclass/"
page <- read_html(url)
# select paragraph elements, which use the "p" tag
text <- page %>%
  html_elements("p") %>%
  html_text2()
text_df <- data.frame(text = text)
text_df
## [1] text
## <0 rows> (or 0-length row.names)
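This code runs without error but produces an empty data frame. Reddit purposefully makes its current site hard to parse, and the comment text is not in the raw HTML that read_html() receives (see the earlier discussion of dynamic pages and sites that make scraping difficult). We can instead use old.reddit.com, which serves the thread in a form that is easier to scrape.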
url <- "https://old.reddit.com/r/DnD/comments/1cipr1c/what_areis_your_favorite_dd_classesclass/"
page <- read_html(url)
# select paragraph elements, which use the "p" tag
text <- page %>%
  html_elements("p") %>%
  html_text2()
text_df <- data.frame(text = text)
text_df[6:15,]
## [1] "Wizards of the Coast, Dungeons & Dragons, and their logos are trademarks of Wizards of the Coast LLC in the United States and other countries. © 2025 Wizards. All Rights Reserved."
## [2] "This subreddit is not affiliated with, endorsed, sponsored, or specifically approved by Wizards of the Coast LLC. This subreddit may use the trademarks and other intellectual property of Wizards of the Coast LLC, which is permitted under Wizards' Fan Site Policy. For example, Dungeons & Dragons® is a trademark of Wizards of the Coast. For more information about Wizards of the Coast or any of Wizards' trademarks or other intellectual property, please visit their website at www.wizards.com."
## [3] "the front page of the internet."
## [4] "and join one of thousands of communities."
## [5] ""
## [6] "5th EditionWhat are/is your favorite D&D classes/class (self.DnD)"
## [7] "submitted 2 years ago by spino02_"
## [8] "Mine for example are druid and monk If there are enough comments I will make a table with percentages for each class"
## [9] "Post a comment!"
## [10] ""
This scrapes all paragraph text on the page, not just the comments, but it may still prove useful. The following counts the occurrences of each word, which is not exactly helpful on its own.
word_counts <- text_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)
head(word_counts)
## word n
## 1 0 276
## 2 2 218
## 3 years 203
## 4 ago 200
## 5 points 197
## 6 the 182
Question 3: Why is this particular count not that useful? Why do these ‘words’ keep showing up?
Question 4: The following code cleans this up a bit, but is not perfect. Spend a minute looking online for what the various DnD classes are, then take a glance through the dataset and try to find which DnD classes are the most popular according to word occurrence in this thread.
word_counts_clean <- text_df |>
  unnest_tokens(word, text) |>
  anti_join(stop_words, by = "word") |>
  count(word, sort = TRUE)
head(word_counts_clean)
## word n
## 1 0 276
## 2 2 218
## 3 ago 200
## 4 children 180
## 5 point2 125
## 6 points1 115
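Tokens that contain digits survive the stop-word filter above. One possible extra cleaning step, sketched here on a tiny made-up data frame rather than the Reddit thread, is to drop any token containing a digit before counting. The toy_df text is invented for illustration.

```r
library(dplyr)
library(tidytext)

# Made-up text standing in for the scraped paragraphs
toy_df <- data.frame(text = c(
  "12 points 3 years ago",
  "the cat sat 5 points1",
  "cat and dog"
))

toy_counts <- toy_df %>%
  unnest_tokens(word, text) %>%        # one lowercase token per row
  filter(!grepl("[0-9]", word)) %>%    # drop tokens containing any digit
  count(word, sort = TRUE)

toy_counts
```

This is a blunt filter: it also removes tokens where a digit is glued to a real word (such as points1), which may or may not be what you want.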
Question 5: This method is not perfect. Explain an issue in relying on just the occurrence of a class in the body of comments here. Think about frequency in a particular comment vs. frequency across many comments.
Question 6: Do you think social media opinions are representative of all DnD players? Explain.