Webscraping General Info

Webscraping refers to the act of extracting data from websites. We have previously used read functions in R (such as read.csv()) to load a data file hosted on a website, as in the example shown below, but this is not usually considered webscraping.

# example only, do not read into R
pres_approval_data = read.csv("https://nfriedrichsen.github.io/data/pres_approval_data.csv")

In many cases, websites contain useful data in the form of tables, lists, articles, or links that would take a long time to copy by hand. Webscraping is the use of tools (packages, functions) that allow us to read the HTML structure of a webpage and extract the pieces of information we want in a more organized way.

Manual vs. Automation

There are different levels of web scraping. Manual web scraping usually means collecting information from a single webpage or a small number of pages by directly running code yourself. This is useful for quick data collection and learning how websites are structured.

Automated web scraping is more advanced and involves writing programs that repeatedly scrape many pages on a schedule or loop through large websites automatically.

Note: This lab will focus only on manual web scraping.

robots.txt

Many websites provide a file called robots.txt that gives instructions to automated tools about which parts of the site should or should not be accessed. While robots.txt is not legally binding in most cases, it is considered good practice to check and follow it. For example, a website may allow scraping of public blog posts but block access to login pages or search features. You can often view this file by adding /robots.txt to the end of a website’s base URL. For example, Reddit’s robots.txt is at https://www.reddit.com/robots.txt.
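As a rough sketch of what checking this file can look like in R, the example below reads a small, made-up robots.txt (the rules are hypothetical, not any real site's) and lists the disallowed paths. It deliberately ignores the fact that real files group rules by User-agent, so treat it as a starting point only. To look at a real file, you could pass its URL to readLines() instead of typing the lines by hand.

```r
# A made-up robots.txt for illustration only (not any real site's rules)
robots_txt <- c(
  "User-agent: *",
  "Disallow: /login",
  "Disallow: /search",
  "Allow: /blog"
)

# Keep only the lines that start with "Disallow:" and strip that prefix
disallow_lines <- grep("^Disallow:", robots_txt, value = TRUE)
disallowed <- trimws(sub("^Disallow:", "", disallow_lines))

disallowed
```

Any path in this list is one the site owner has asked automated tools to stay away from.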

Other Webscraping Considerations

Some websites do not scrape cleanly because modern webpages are often built dynamically using JavaScript. Packages like rvest primarily read the raw HTML sent by the server, but some content is only loaded later by your browser after the page is opened. In those cases, the data may not appear in the HTML that rvest sees. Websites may also intentionally make scraping difficult through login systems, CAPTCHAs, or changing page structures.
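To make the JavaScript issue concrete, here is a minimal sketch using a made-up page (the element id and the script are assumptions for illustration). In a browser, the script would fill in the score, but rvest only reads the HTML as written and never runs the script.

```r
library(rvest)

# Hypothetical page: the <div> is empty in the raw HTML and would only be
# filled in by the script once a browser runs it
raw_html <- '
<html><body>
  <h1>Scores</h1>
  <div id="scores"></div>
  <script>
    document.getElementById("scores").textContent = "Team A: 3";
  </script>
</body></html>'

page_js <- read_html(raw_html)

# rvest does not run JavaScript, so the score text never shows up here
scores_text <- page_js %>%
  html_element("#scores") %>%
  html_text2()

scores_text
```

If a scrape comes back unexpectedly empty, comparing the browser's Inspect view with what read_html() returns is a reasonable first check.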

R Packages

We will be using the rvest package (pronounced like ‘harvest’ without the ‘h’ sound, clever). This package is one of many that can be used for webscraping. It has functions that allow us to extract tables, text, or links from webpages and store them in R.

# install.packages("rvest")
library(rvest)
library(dplyr)

Some common functions in rvest include read_html(), which reads a webpage into R, html_elements(), which selects parts of the page, and html_text2(), which extracts readable text from those elements. The function html_table() is especially useful because it can automatically convert HTML tables on a webpage into R data frames. Other functions such as html_attr() can extract things like URLs from links or image sources from pictures.
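The functions described above can be tried out on a tiny, made-up page before moving to a real website. Everything in this sketch (the page contents, the class name intro, the object names) is invented for illustration; read_html() accepts a string of HTML, so no internet connection is needed.

```r
library(rvest)

# A tiny made-up page, so the example runs without internet access
demo_page <- read_html('
<html><body>
  <p class="intro">Hello, <b>scraper</b>!</p>
  <a href="/about">About us</a>
  <table>
    <tr><th>name</th><th>score</th></tr>
    <tr><td>a</td><td>1</td></tr>
    <tr><td>b</td><td>2</td></tr>
  </table>
</body></html>')

# html_elements() selects nodes; html_text2() returns their readable text
intro_text <- demo_page %>%
  html_elements("p.intro") %>%
  html_text2()
intro_text

# html_attr() pulls an attribute, here the target of the link
link_href <- demo_page %>%
  html_elements("a") %>%
  html_attr("href")
link_href

# html_table() on a whole page returns a list with one data frame per <table>
demo_tables <- demo_page %>%
  html_table()
demo_tables[[1]]
```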

Wikipedia Example

We are going to start with a very basic example using a nicely formatted Wikipedia page, specifically the list of the largest cities in the world. Open the page in your browser (the URL is in the code below) and take a look at it. On this page we have a helpful data table that can be used to navigate the information. We want to get this into R.

To start we will save the URL in R and use the read_html function to parse through it.

url <- "https://en.wikipedia.org/wiki/List_of_largest_cities"
page <- read_html(url)
page

## {html_document}
## <html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-menu-disabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available skin-theme-clientpref-thumb-standard" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="skin--responsive skin-vector skin-vector-search-vue mediawik ...

Inspecting the page object shows it is stored as a list-like object (similar to what we saw when working with APIs). To see what is stored in this object, go back to the actual Wikipedia page, right-click, and choose Inspect. This pulls up the HTML code that our page object is storing. Hovering over elements in the browser's inspector helps us find where the table sits in the code, if we care to do this.

Accessing the Data

Conveniently, html_table() retrieves all of the data tables from the webpage at once.

# this accesses all of the data tables
tables <- page %>% 
  html_table()
# there are 5 tables on this page
length(tables)

## [1] 5

# inspect the tables
tables

Not all of these tables are going to be helpful, as some are just for webpage formatting. Table 2 is the one we want.

cities <- tables[[2]]
glimpse(cities)

## Rows: 84
## Columns: 13
## $ `City[a]`                          <chr> "City[a]", "Jakarta", "Dhaka", "Tok…
## $ Country                            <chr> "Country", "Indonesia", "Bangladesh…
## $ `UN 2025 population estimates[11]` <chr> "UN 2025 population estimates[11]",…
## $ `City proper[b]`                   <chr> "Definition", "Special region", "Ca…
## $ `City proper[b]`                   <chr> "Population", "10,154,134", "10,295…
## $ `City proper[b]`                   <chr> "Area.mw-parser-output .nobold{font…
## $ `City proper[b]`                   <chr> "Density(/km2)", "15,292[13]", "30,…
## $ `Urban area[12]`                   <chr> "Population", "33,756,000", "19,134…
## $ `Urban area[12]`                   <chr> "Area(km2)", "3,546", "619", "8,231…
## $ `Urban area[12]`                   <chr> "Density(/km2)", "9,519[d]", "30,91…
## $ `Metropolitan area[c]`             <chr> "Population", "33,430,285", "36,585…
## $ `Metropolitan area[c]`             <chr> "Area(km2)", "7,063", "2,570", "13,…
## $ `Metropolitan area[c]`             <chr> "Density(/km2)", "4,733[14]", "14,2…
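Rather than printing every table to find the useful one, it can help to compare their dimensions first. The sketch below does this on a made-up page with one layout-style table and one data table; the page and object names are assumptions for illustration, and the largest table is often, but not always, the one you want.

```r
library(rvest)

# Made-up page with one layout-style table and one data table
toy_page <- read_html('
<html><body>
  <table><tr><td>navigation box</td></tr></table>
  <table>
    <tr><th>name</th><th>value</th></tr>
    <tr><td>a</td><td>1</td></tr>
    <tr><td>b</td><td>2</td></tr>
    <tr><td>c</td><td>3</td></tr>
  </table>
</body></html>')

toy_tables <- toy_page %>%
  html_table()

# One row per table: number of rows, then number of columns
table_dims <- t(sapply(toy_tables, dim))
table_dims

# Index of the table with the most rows
biggest <- which.max(table_dims[, 1])
biggest
```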

Exercise 1: Reading this data table into R has caused some issues: the header names are broken, and the first row of the dataset repeats the header. Clean the data table by giving the columns proper names and removing the duplicate header row.

Exercise 2: Make some graphic using ggplot2 to say something interesting with this data.

Some Other Stuff

We can use rvest to extract other things from the webpage if we really cared to. For example, the following code extracts the text of the clickable links on this webpage, and then the URL paths (the href attributes) those links point to.

links <- page %>% 
  html_elements("a") %>%  
  html_text2()

head(links, 5)

## [1] "Jump to content" "Main page"       "Contents"        "Current events" 
## [5] "Random article"

urls <- page %>%  
  html_elements("a") %>% 
  html_attr("href")

head(urls, 5)

## [1] "#bodyContent"                "/wiki/Main_Page"            
## [3] "/wiki/Wikipedia:Contents"    "/wiki/Portal:Current_events"
## [5] "/wiki/Special:Random"
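Most of the paths above are relative, meaning they only make sense when combined with the address of the page they came from. As a small sketch, the url_absolute() function from the xml2 package (installed alongside rvest) can resolve them into full URLs. The object names here are made up for the example.

```r
library(xml2)

# A few hrefs copied from the output above, so the example is self-contained
urls_demo <- c("#bodyContent", "/wiki/Main_Page", "/wiki/Wikipedia:Contents")
base_url <- "https://en.wikipedia.org/wiki/List_of_largest_cities"

# url_absolute() resolves each relative path against the page's own URL
full_urls <- url_absolute(urls_demo, base_url)
full_urls
```

The resulting full URLs could then be passed to read_html() one at a time.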

National Weather Service Example

We can access weather-related info about Grinnell (and any other city, really) from the National Weather Service (NWS). Open the URL used in the code below in your browser and see what features are on the page.

The following code can be used to make a small data table of weather forecasts for us. Note: tombstone-container is a CSS class name specific to the NWS website. Why they chose this? I have no clue. You can verify this by inspecting the website's HTML elements.

url <- "https://forecast.weather.gov/MapClick.php?lat=41.7433&lon=-92.7274"

page <- read_html(url)

period <- page %>%
  html_elements(".tombstone-container .period-name") %>%
  html_text2()

short_desc <- page %>%
  html_elements(".tombstone-container .short-desc") %>%
  html_text2()

temp <- page %>% 
  html_elements(".tombstone-container .temp") %>% 
  html_text2()

forecast <- data.frame(
  period = period,
  short_desc = short_desc,
  temp = temp
)

forecast

##           period                                      short_desc        temp
## 1          Today                                    Mostly Sunny High: 72 °F
## 2        Tonight                                    Mostly Clear  Low: 45 °F
## 3       Saturday                              Increasing\nClouds High: 75 °F
## 4 Saturday Night                                   Partly Cloudy  Low: 44 °F
## 5         Sunday                                           Sunny High: 64 °F
## 6   Sunday Night                                    Mostly Clear  Low: 39 °F
## 7         Monday                                           Sunny High: 68 °F
## 8   Monday Night Partly Cloudy\nthen Chance\nShowers and\nBreezy  Low: 49 °F
## 9        Tuesday                        Mostly Sunny\nand Breezy High: 77 °F

DND Class Popularity

We can combine webscraping with the tools we used to work with strings earlier this semester. The goal of this next example is to parse through a Reddit thread and perform a simple word-frequency analysis (a loose stand-in for ‘sentiment analysis’), using how often certain words appear to gauge their popularity. The following code uses a slightly different approach than what we did with stringr functions.

# install.packages("tidytext")
library(tidytext)

url <- "https://www.reddit.com/r/DnD/comments/1cipr1c/what_areis_your_favorite_dd_classesclass/"

page <- read_html(url)

# select the paragraph elements, which use the "p" tag
text <- page %>%
  html_elements("p") %>%
  html_text2()

text_df <- data.frame(text = text)
text_df

## [1] text
## <0 rows> (or 0-length row.names)

This code runs, but it produces an empty data frame because of the way Reddit serves its pages, which (purposefully) makes them hard to parse. We can instead use old.reddit.com, which displays the same thread in a format that is easier to scrape.

url <- "https://old.reddit.com/r/DnD/comments/1cipr1c/what_areis_your_favorite_dd_classesclass/"

page <- read_html(url)

# select the paragraph elements, which use the "p" tag
text <- page %>%
  html_elements("p") %>%
  html_text2()

text_df <- data.frame(text = text)
text_df[6:15,]

##  [1] "Wizards of the Coast, Dungeons & Dragons, and their logos are trademarks of Wizards of the Coast LLC in the United States and other countries. © 2025 Wizards. All Rights Reserved."                                                                                                                                                                                                                                                                                                                          
##  [2] "This subreddit is not affiliated with, endorsed, sponsored, or specifically approved by Wizards of the Coast LLC. This subreddit may use the trademarks and other intellectual property of Wizards of the Coast LLC, which is permitted under Wizards' Fan Site Policy. For example, Dungeons & Dragons® is a trademark of Wizards of the Coast. For more information about Wizards of the Coast or any of Wizards' trademarks or other intellectual property, please visit their website at www.wizards.com."
##  [3] "the front page of the internet."                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
##  [4] "and join one of thousands of communities."                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
##  [5] ""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
##  [6] "5th EditionWhat are/is your favorite D&D classes/class (self.DnD)"                                                                                                                                                                                                                                                                                                                                                                                                                                            
##  [7] "submitted 2 years ago by spino02_"                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
##  [8] "Mine for example are druid and monk If there are enough comments I will make a table with percentages for each class"                                                                                                                                                                                                                                                                                                                                                                                         
##  [9] "Post a comment!"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [10] ""

This scrapes all paragraph text on the page, not just the comments, but it may still prove useful. The following code counts the occurrence of each word, which is not exactly helpful on its own.

word_counts <- text_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

head(word_counts)

##     word   n
## 1      0 276
## 2      2 218
## 3  years 203
## 4    ago 200
## 5 points 197
## 6    the 182

Question 3: Why is this particular count not that useful? Why do these ‘words’ keep showing up?

Question 4: The following code cleans this up a bit, but is not perfect. Spend a minute looking online for what the various DnD classes are, then take a glance through the dataset and try to find which DnD classes are the most popular according to word occurrence in this thread.

word_counts_clean <- text_df |>
  unnest_tokens(word, text) |>
  anti_join(stop_words, by = "word") |>
  count(word, sort = TRUE)

head(word_counts_clean)

##       word   n
## 1        0 276
## 2        2 218
## 3      ago 200
## 4 children 180
## 5   point2 125
## 6  points1 115
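One further cleaning step, offered here as a sketch rather than as part of the lab's method, is to drop any token that contains a digit. The example uses the rows printed above, copied into a small data frame with a made-up name so that it runs on its own.

```r
library(dplyr)

# Rows copied from the output above so the example is self-contained
word_counts_demo <- data.frame(
  word = c("0", "2", "ago", "children", "point2", "points1"),
  n = c(276, 218, 200, 180, 125, 115)
)

# Drop any token that contains a digit
no_digits <- word_counts_demo %>%
  filter(!grepl("[0-9]", word))

no_digits
```

This removes tokens such as "0" and "point2", but ordinary words that come from the page layout rather than from the comments remain.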

Question 5: This method is not perfect. Explain an issue in relying on just the occurrence of a class in the body of comments here. Think about frequency in a particular comment vs. frequency across many comments.

Question 6: Do you think social media opinions are representative of all DnD players? Explain.

Web-Scraping

Nathan Friedrichsen
