Extracting, cleaning, and organizing data from multiple web pages into a structured format.
Categories: Quarto, R
Author: Matt Babb
Published: June 7, 2025
Goal: Scrape information from https://www.cheese.com to obtain a dataset of characteristics about different cheeses, and gain deeper insight into your coding process. 🪤
Part 1: Locate and examine the robots.txt file for this website. Summarize what you learn from it.
The `robots.txt` file has only two lines: `User-agent: *` and `Sitemap: https://www.cheese.com/sitemap.xml`. It declares no Disallow rules for any crawler, so scraping is unrestricted; the file simply points to the sitemap. Presumably this is because it is a relatively low-traffic website.
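For reference, the file can be inspected directly from R (a one-off check, not part of the scraping pipeline):

```r
# Fetch and print the site's robots.txt as plain text
readLines("https://www.cheese.com/robots.txt")
```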
Part 2: Learn about the html_attr() function from rvest. Describe how this function works with a small example.
`html_attr()` extracts the value of a single named attribute (such as `href`, `src`, or `class`) from each element in a set of selected HTML nodes, returning `NA` for elements that lack that attribute.
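As a minimal self-contained sketch (the HTML fragment below is invented purely for illustration):

```r
library(rvest)

# A tiny hand-written HTML fragment
page <- minimal_html('<a href="https://example.com" class="external">Example</a>')

# html_attr() returns the requested attribute for each matched element
page |>
  html_elements("a") |>
  html_attr("href")
#> [1] "https://example.com"
```

Applied to the real site, the same idea pulls the class attribute of every cheese image: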
```r
library(tidyverse)
library(rvest)

read_html("https://www.cheese.com/alphabetical/?per_page=100") |>
  html_nodes(".product-img a img") |>
  html_attr("class")
```
This small example pulls the class attribute of each cheese's image element; that class indicates whether the cheese has an actual photo or only a placeholder, which is exactly the "has a picture" information needed in Part 4.
Part 3: (Do this alongside Part 4 below.) I used ChatGPT to start the process of scraping cheese information with the following prompt:
Write R code using the rvest package that allows me to scrape cheese information from cheese.com.
Fully document your process of checking this code. Record any observations you make about where ChatGPT is useful / not useful.
```r
# Load required libraries
library(rvest)
library(dplyr)

# Define the URL
url <- "https://www.cheese.com/alphabetical"

# Read the HTML content from the webpage
webpage <- read_html(url)

# Extract the cheese names and URLs
cheese_data <- webpage %>%
  html_nodes(".cheese-item") %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  paste0("https://cheese.com", .)

cheese_names <- webpage %>%
  html_nodes(".cheese-item h3") %>%
  html_text()

# Create a data frame to store the results
cheese_df <- data.frame(
  Name = cheese_names,
  URL = cheese_data,
  stringsAsFactors = FALSE
)

# Print the data frame
print(cheese_df)
```
The overall structure is nominally useful, but the LLM appears to be guessing: it cannot actually inspect the page, and the selectors it proposes (such as `.cheese-item`) do not correspond to the site's real HTML.

The code is easy to read and transparent for a reader to follow, but it is not very efficient. Its verbosity is communicative, though ultimately unnecessary.
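One quick sanity check (not part of the final pipeline) is simply to count how many elements the suggested selector matches on the live page; a length of zero would confirm the selector was guessed rather than taken from the real page structure:

```r
library(rvest)

# Count how many elements the ChatGPT-suggested selector matches
read_html("https://www.cheese.com/alphabetical") |>
  html_nodes(".cheese-item") |>
  length()
```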
Part 4: Obtain the following information for all cheeses in the database:

- cheese name
- URL for the cheese's webpage (e.g., https://www.cheese.com/gouda/)
- whether or not the cheese has a picture (e.g., gouda has a picture, but bianco does not)

To be kind to the website owners, please add a 1 second pause between page queries. (Note that you can view 100 cheeses at a time.)
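The `parse_page()` helper scrapes one listing page of 100 cheeses: it pauses before the request, then returns a tibble with each cheese's name, URL, and whether an image exists.

```r
parse_page <- function(url, delay = 1) {
  Sys.sleep(delay)

  # cheese product listings
  cheeses <- read_html(url) |>
    html_nodes(".product-item")

  # temp object that's used twice
  nodes <- cheeses |>
    html_nodes("h3 a")

  # building tibble
  tibble(
    name = nodes |> html_text(),
    url = nodes |>
      html_attr("href") |>
      (\(slug) glue::glue("https://www.cheese.com{slug}"))(),
    has_image = cheeses |>
      html_nodes(".product-img a img") |>
      html_attr("class") |>
      str_detect("image-exists")
  )
}
```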
```r
# Find the number of pages to parse automatically rather than hard-coding it
num_pages <- "https://www.cheese.com/alphabetical/?per_page=100" |>
  read_html() |>
  html_nodes(".pagination a") |>
  html_text() |>
  last() |>
  as.numeric()

cheeses <- paste0(
  "https://www.cheese.com/alphabetical/?per_page=100&page=",
  1:num_pages
) |>
  map_df(parse_page)
```
Part 5: When you go to a particular cheese’s page (like gouda), you’ll see more detailed information about the cheese. For just 10 of the cheeses in the database, obtain the following detailed information:
milk information
country of origin
family
type
flavour
(Just 10 to avoid overtaxing the website! Continue adding a 1 second pause between page queries.)
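The ten cheese pages used:

```r
cheese_list <- c(
  "https://www.cheese.com/paneer/",
  "https://www.cheese.com/chevre-en-marinade/",
  "https://www.cheese.com/chevre-log/",
  "https://www.cheese.com/goat-gouda/",
  "https://www.cheese.com/gotcha-gouda/",
  "https://www.cheese.com/gouda/",
  "https://www.cheese.com/camembert-des-camarades/",
  "https://www.cheese.com/camembert-de-portneuf/",
  "https://www.cheese.com/camembert-de-normandie/",
  "https://www.cheese.com/camembert/"
)
```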
```r
# Exploratory scratch work (not evaluated in the final pipeline):
# inspect the raw bullet-point text on one cheese page before
# writing the full parser.
parse_cheese <- function(url, delay = 1) {}

"https://www.cheese.com/camembert-de-normandie/" |>
  read_html() |>
  html_nodes(".panel-body ul li") |>
  html_text(trim = TRUE)

# sep wider into tibble with NAs and rowbind or use hashmap or something?
```
```r
parse_cheese <- function(url, delay = 1) {
  Sys.sleep(delay)

  # Read page
  page <- read_html(url)

  # Extract bullet point text
  raw_info <- page %>%
    html_nodes(".panel-body ul li") %>%
    html_text(trim = TRUE)

  # Extract milk line separately (doesn't contain ":")
  milk_line <- raw_info[grepl("^Made from", raw_info)]
  milk <- str_remove(milk_line, "^Made from ")

  # Extract remaining fields that use a colon
  info_pairs <- raw_info[grepl(":", raw_info)] %>%
    str_split_fixed(":", 2) %>%
    as_tibble() %>%
    rename(field = V1, value = V2) %>%
    mutate(
      field = str_trim(tolower(field)),
      value = str_trim(value)
    ) %>%
    filter(field %in% c("country of origin", "family", "type", "flavour"))

  # Convert to wide format
  cheese_info <- pivot_wider(info_pairs, names_from = field, values_from = value)

  # Add milk column
  cheese_info <- cheese_info %>%
    mutate(milk = milk) %>%
    relocate(milk)

  return(cheese_info)
}

cheese_details <- map_dfr(cheese_list, parse_cheese)
```
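Rendering `cheese_details` as a formatted table with knitr and kableExtra:

```r
# Nice kable table
library(knitr)
library(kableExtra)

cheese_details %>%
  rename(
    Milk = milk,
    Country = `country of origin`,
    Family = family,
    Type = type,
    Flavour = flavour
  ) %>%
  kable(
    caption = "10 Cheeses",
    format = "html",
    escape = TRUE
  ) %>%
  kable_styling(
    full_width = FALSE,
    bootstrap_options = c("striped", "hover", "condensed"),
    position = "left"
  )
```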
| Milk | Country | Family | Type | Flavour |
|------|---------|--------|------|---------|
| pasteurized or unpasteurized cow's, goat's or sheep's milk | Netherlands | Gouda | hard | full-flavored |
| pasteurized cow's milk | Canada | Camembert | soft, soft-ripened | buttery, nutty, subtle, sweet |
| pasteurized cow's milk | Canada | Camembert | soft, soft-ripened | buttery, creamy |
| unpasteurized cow's milk | France | Camembert | soft, soft-ripened | creamy |
| cow's milk | France | Camembert | soft, artisan | sweet |
Part 6: Evaluate the code that you wrote in terms of efficiency. To what extent do your function(s) adhere to the principles for writing good functions? To what extent are your functions efficient? To what extent is your iteration of these functions efficient?
The code is reasonably efficient: few intermediate objects are saved, and pipelines are used throughout. The functions also follow the usual principles for good functions: parse_page() and parse_cheese() each do one job (parse a single listing page or a single cheese page), take the URL as their only required argument with the polite delay as a default, and return a tibble, which makes iterating over many URLs with map_df() and map_dfr() straightforward. Leveraging Quarto's design, we can bolster efficiency further by selectively caching the output of the scraping cells, trading a little disk space for execution time: scraping is slow because of the number of pages involved, but the targets appear to be static, so the code only has to run once. Because we cache, the two scraped objects (cheeses and cheese_details) are stored and the site never needs to be scraped again on re-render.
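As a sketch of how that caching is applied (the chunk label here is illustrative; `cache: true` is a standard Quarto execution option):

```r
#| label: scrape-all-cheeses
#| cache: true

# With cache: true, Quarto stores this chunk's result on disk and only
# re-runs the scrape when the chunk's code changes.
cheeses <- paste0(
  "https://www.cheese.com/alphabetical/?per_page=100&page=",
  1:num_pages
) |>
  map_df(parse_page)
```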