Web Scraping with R: Data Collection Tutorial

R is a natural choice for web scraping when your end goal is data analysis, statistical modeling, or visualization. Python offers more scraping libraries, but R’s tidyverse integration means scraped data flows directly into dplyr, ggplot2, and modeling functions without format conversion, and the rvest package makes HTML parsing nearly as simple as reading a CSV file.

This tutorial covers rvest for HTML parsing, httr2 for HTTP requests, RSelenium for JavaScript-rendered pages, and best practices for turning web data into tidy data frames.

Why R for Web Scraping

  • Direct data frame output — Scraped data becomes tibbles instantly
  • Tidyverse integration — Pipe scraped data directly into dplyr, tidyr, stringr
  • Statistical analysis — Apply models immediately after collection
  • Visualization — Plot scraped data with ggplot2 in the same script
  • Reproducibility — R Markdown documents combine scraping, analysis, and reporting
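
The points above can be sketched in a few lines. Using an inline HTML fragment via rvest’s minimal_html() as a stand-in for a live page, scraped values land directly in a tibble and flow into dplyr verbs with no conversion step:

```r
library(rvest)
library(dplyr)

# Inline HTML fragment standing in for a fetched page
html <- minimal_html('
  <ul>
    <li class="item"><span class="name">A</span><span class="price">3.50</span></li>
    <li class="item"><span class="name">B</span><span class="price">5.25</span></li>
  </ul>')

# Scraped values go straight into a tibble
items <- tibble(
  name  = html %>% html_elements(".name")  %>% html_text2(),
  price = html %>% html_elements(".price") %>% html_text2() %>% as.numeric()
)

# ...and directly into dplyr, no format conversion needed
items %>% summarise(n = n(), avg_price = mean(price))
```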

Setting Up

# Install packages
install.packages(c("rvest", "httr2", "dplyr", "purrr", "stringr", "jsonlite"))

# For JavaScript rendering
install.packages("RSelenium")

# Load libraries
library(rvest)
library(httr2)
library(dplyr)
library(purrr)
library(stringr)

rvest: HTML Scraping

rvest is the primary R scraping package, designed to work like Python’s BeautifulSoup but with tidyverse integration:

Basic Scraping

library(rvest)
library(dplyr)   # tibble()
library(purrr)   # map_dfr()
library(stringr) # str_remove()

# Read the page
page <- read_html("https://books.toscrape.com/")

# Extract book data using CSS selectors
books <- page %>%
  html_elements("article.product_pod") %>%
  map_dfr(function(book) {
    tibble(
      title  = book %>% html_element("h3 a") %>% html_attr("title"),
      price  = book %>% html_element(".price_color") %>% html_text2(),
      rating = book %>% html_element("p.star-rating") %>% html_attr("class") %>%
               str_remove("star-rating ")
    )
  })

print(books)

CSS Selectors

# By class
page %>% html_elements(".product_pod")

# By ID
page %>% html_element("#main-content")

# By attribute
page %>% html_elements("a[href]")

# Hierarchical
page %>% html_elements("div.sidebar > ul > li > a")

# Pseudo selectors
page %>% html_elements("tr:nth-child(odd)")

# Combining
page %>% html_elements("article.product_pod .price_color")

Extracting Data

# Text content
page %>% html_element("h1") %>% html_text2()

# Attribute values
page %>% html_element("a") %>% html_attr("href")

# All text from multiple elements
page %>% html_elements(".price_color") %>% html_text2()

# Table extraction (automatic)
page %>% html_element("table") %>% html_table()

# Inner HTML
page %>% html_element(".content") %>% html_children() %>% as.character()

Simplified Extraction

library(rvest)
library(dplyr)

page <- read_html("https://books.toscrape.com/")

# One-line extraction
titles <- page %>% html_elements("h3 a") %>% html_attr("title")
prices <- page %>% html_elements(".price_color") %>% html_text2()

# Create data frame
books <- tibble(title = titles, price = prices)
print(books)

httr2: HTTP Requests

For more control over HTTP requests:

library(httr2)

# Basic request
response <- request("https://books.toscrape.com/") %>%
  req_headers(
    `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    Accept = "text/html"
  ) %>%
  req_timeout(30) %>%
  req_retry(max_tries = 3) %>%
  req_perform()

html <- resp_body_string(response)
page <- read_html(html)

JSON API Scraping

library(httr2)
library(jsonlite)

response <- request("https://api.example.com/products") %>%
  req_headers(Accept = "application/json") %>%
  req_url_query(page = 1, limit = 50) %>%
  req_perform()

data <- resp_body_json(response)

# Convert to data frame
products <- data$results %>%
  map_dfr(~ tibble(
    name  = .x$name,
    price = .x$price,
    category = .x$category
  ))

Session Management

library(rvest)

# Create a session (maintains cookies)
session <- session("https://example.com/login")

# Submit login form
form <- session %>%
  html_form() %>%
  .[[1]] %>%
  html_form_set(username = "user", password = "pass")

session <- session_submit(session, form)

# Navigate to protected pages
session <- session_jump_to(session, "https://example.com/dashboard")
dashboard_data <- session %>%
  read_html() %>%
  html_elements(".data-row") %>%
  html_text2()

Scraping Tables

R excels at scraping HTML tables:

library(rvest)
library(dplyr)
library(stringr) # str_replace_all(), str_trim()

# Scrape a Wikipedia table
page <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)")

# Extract all tables
tables <- page %>% html_elements("table.wikitable") %>% html_table()

# Get the first table
gdp_table <- tables[[1]]
print(gdp_table)

# Clean the data
gdp_clean <- gdp_table %>%
  rename_with(~ str_replace_all(.x, "\\[.*\\]", "")) %>%
  mutate(across(where(is.character), str_trim))

Handling Pagination

library(rvest)
library(purrr)
library(dplyr)
library(stringr)

scrape_page <- function(page_num) {
  url <- sprintf("https://books.toscrape.com/catalogue/page-%d.html", page_num)

  tryCatch({
    page <- read_html(url)

    books <- page %>%
      html_elements("article.product_pod") %>%
      map_dfr(function(book) {
        tibble(
          title  = book %>% html_element("h3 a") %>% html_attr("title"),
          price  = book %>% html_element(".price_color") %>% html_text2(),
          rating = book %>% html_element("p.star-rating") %>% html_attr("class") %>%
                   str_remove("star-rating "), # rating word, e.g. "Three"
          page   = page_num
        )
      })

    Sys.sleep(runif(1, 1, 2)) # Polite delay
    return(books)
  }, error = function(e) {
    message(sprintf("Error on page %d: %s", page_num, e$message))
    return(tibble())
  })
}

# Scrape all 50 pages
all_books <- map_dfr(1:50, scrape_page, .progress = TRUE)

cat(sprintf("Total: %d books from %d pages\n",
            nrow(all_books), max(all_books$page)))

RSelenium: JavaScript Pages

For JavaScript-rendered content:

library(RSelenium)
library(rvest)

# Start Selenium server
rD <- rsDriver(browser = "chrome",
               chromever = "latest",
               extraCapabilities = list(
                 chromeOptions = list(args = list("--headless", "--no-sandbox"))
               ))

remDr <- rD$client

# Navigate
remDr$navigate("https://books.toscrape.com/")

# Wait for content
Sys.sleep(3)

# Get page source and parse with rvest
page_source <- remDr$getPageSource()[[1]]
page <- read_html(page_source)

books <- page %>%
  html_elements("article.product_pod h3 a") %>%
  html_attr("title")

print(books)

# Interact with elements (illustrative selectors; adjust for your target site)
search_box <- remDr$findElement(using = "css", "input[name='q']")
search_box$sendKeysToElement(list("web scraping", key = "enter"))

Sys.sleep(2)

# Click elements
next_button <- remDr$findElement(using = "css", "li.next a")
next_button$clickElement()

# Clean up
remDr$close()
rD$server$stop()

Data Cleaning with tidyverse

library(dplyr)
library(stringr)
library(tidyr)

# Clean scraped book data
clean_books <- all_books %>%
  mutate(
    # Extract numeric price
    price_num = str_extract(price, "[\\d.]+") %>% as.numeric(),

    # Convert rating words to numbers
    rating_num = case_when(
      str_detect(rating, "One")   ~ 1,
      str_detect(rating, "Two")   ~ 2,
      str_detect(rating, "Three") ~ 3,
      str_detect(rating, "Four")  ~ 4,
      str_detect(rating, "Five")  ~ 5,
      TRUE ~ NA_real_
    ),

    # Clean title
    title_clean = str_trim(title)
  ) %>%
  filter(!is.na(price_num)) %>%
  arrange(desc(price_num))

# Summary statistics
clean_books %>%
  summarise(
    total_books = n(),
    avg_price   = mean(price_num),
    median_price = median(price_num),
    avg_rating  = mean(rating_num, na.rm = TRUE)
  )

# Visualize
library(ggplot2)

ggplot(clean_books, aes(x = price_num)) +
  geom_histogram(bins = 30, fill = "steelblue") +
  labs(title = "Book Price Distribution",
       x = "Price (GBP)", y = "Count") +
  theme_minimal()

Proxy Integration

With httr2

library(httr2)

response <- request("https://httpbin.org/ip") %>%
  req_proxy("http://proxy.example.com", 8080,
            username = "user", password = "pass") %>%
  req_perform()

print(resp_body_json(response))

With httr (legacy)

library(rvest)
library(httr)

resp <- httr::GET("https://httpbin.org/ip",
                  httr::use_proxy("http://proxy.example.com", 8080,
                                  username = "user", password = "pass"))
page <- read_html(httr::content(resp, as = "text"))

For proxy types and selection, see our web scraping proxy guide and proxy glossary.

Polite Scraping

The polite package enforces rate limits and respects robots.txt:

install.packages("polite")
library(polite)

# Create a polite session
session <- bow("https://books.toscrape.com/",
               user_agent = "R research bot (contact@example.com)")

# Check if scraping is allowed
print(session)

# Scrape politely (auto rate-limited)
page <- scrape(session)

books <- page %>%
  html_elements("article.product_pod h3 a") %>%
  html_attr("title")

# Navigate to another page politely
session2 <- nod(session, "catalogue/page-2.html")
page2 <- scrape(session2)

FAQ

Is R good for web scraping?

R is excellent for web scraping when your goal is data analysis. The rvest package makes HTML parsing straightforward, and the tidyverse integration means scraped data flows directly into analysis pipelines. For complex crawling projects, Python’s Scrapy has more features, but R handles most scraping tasks well.

What is the best R package for web scraping?

rvest is the standard choice for HTML scraping. Pair it with httr2 for advanced HTTP control. For JavaScript-rendered sites, use RSelenium. The polite package adds automatic rate limiting and robots.txt compliance.

Can R scrape JavaScript-rendered pages?

Yes, using RSelenium which controls a real browser. Alternatively, check the Network tab for API endpoints that you can call directly with httr2, which is much faster than browser automation.

How does R compare to Python for web scraping?

Python has more scraping libraries and a larger community. R’s advantage is direct integration with statistical analysis and visualization. Use R when scraping is a step in an analysis pipeline; use Python for dedicated scraping projects.

How do I avoid getting blocked when scraping with R?

Use the polite package to respect robots.txt and rate limits. Add realistic User-Agent headers, introduce random delays with Sys.sleep(), and use rotating proxies for large-scale scraping.
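
Several of those defenses can be bundled into a single httr2 request pipeline; a sketch (the endpoint URL is a placeholder):

```r
library(httr2)

req <- request("https://example.com/data") %>%               # placeholder endpoint
  req_headers(`User-Agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)") %>%
  req_throttle(rate = 30 / 60) %>%                           # at most 30 requests per minute
  req_retry(max_tries = 3,
            backoff = function(attempt) runif(1, 1, 3))      # randomized wait between retries

# resp <- req_perform(req)  # uncomment to actually send the request
```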


Explore web scraping in other languages: Python, Java, Node.js. For proxy setup, see our web scraping proxy guide.
