Web Scraping with R: Data Collection Tutorial
R is the natural choice for web scraping when your end goal is data analysis, statistical modeling, or visualization. While Python has more scraping libraries, R’s tidyverse integration means scraped data flows directly into dplyr, ggplot2, and modeling functions without format conversion. The rvest package makes HTML parsing as simple as reading a CSV file.
This tutorial covers rvest for HTML parsing, httr2 for HTTP requests, RSelenium for JavaScript-rendered pages, and best practices for turning web data into tidy data frames.
Table of Contents
- Why R for Web Scraping
- Setting Up
- rvest: HTML Scraping
- httr2: HTTP Requests
- Scraping Tables
- Handling Pagination
- RSelenium: JavaScript Pages
- Data Cleaning with tidyverse
- Proxy Integration
- Polite Scraping
- FAQ
Why R for Web Scraping
- Direct data frame output — Scraped data becomes tibbles instantly
- Tidyverse integration — Pipe scraped data directly into dplyr, tidyr, stringr
- Statistical analysis — Apply models immediately after collection
- Visualization — Plot scraped data with ggplot2 in the same script
- Reproducibility — R Markdown documents combine scraping, analysis, and reporting
Setting Up
# Install packages
install.packages(c("rvest", "httr2", "dplyr", "purrr", "stringr", "jsonlite"))
# For JavaScript rendering
install.packages("RSelenium")
# Load libraries
library(rvest)
library(httr2)
library(dplyr)
library(purrr)
library(stringr)
rvest: HTML Scraping
rvest is the primary R scraping package, designed to work like Python’s BeautifulSoup but with tidyverse integration:
Basic Scraping
library(rvest)
library(purrr)
library(dplyr)
library(stringr)
# Read the page
page <- read_html("https://books.toscrape.com/")
# Extract book data using CSS selectors
books <- page %>%
html_elements("article.product_pod") %>%
map_dfr(function(book) {
tibble(
title = book %>% html_element("h3 a") %>% html_attr("title"),
price = book %>% html_element(".price_color") %>% html_text2(),
rating = book %>% html_element("p.star-rating") %>% html_attr("class") %>%
str_remove("star-rating ")
)
})
print(books)
CSS Selectors
# By class
page %>% html_elements(".product_pod")
# By ID
page %>% html_element("#main-content")
# By attribute
page %>% html_elements("a[href]")
# Hierarchical
page %>% html_elements("div.sidebar > ul > li > a")
# Pseudo selectors
page %>% html_elements("tr:nth-child(odd)")
# Combining
page %>% html_elements("article.product_pod .price_color")Extracting Data
# Text content
page %>% html_element("h1") %>% html_text2()
# Attribute values
page %>% html_element("a") %>% html_attr("href")
# All text from multiple elements
page %>% html_elements(".price_color") %>% html_text2()
# Table extraction (automatic)
page %>% html_element("table") %>% html_table()
# Inner HTML
page %>% html_element(".content") %>% html_children() %>% as.character()Simplified Extraction
library(rvest)
library(dplyr)
page <- read_html("https://books.toscrape.com/")
# One-line extraction
titles <- page %>% html_elements("h3 a") %>% html_attr("title")
prices <- page %>% html_elements(".price_color") %>% html_text2()
# Create data frame
books <- tibble(title = titles, price = prices)
print(books)
httr2: HTTP Requests
For more control over HTTP requests:
library(httr2)
# Basic request
response <- request("https://books.toscrape.com/") %>%
req_headers(
`User-Agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
Accept = "text/html"
) %>%
req_timeout(30) %>%
req_retry(max_tries = 3) %>%
req_perform()
html <- resp_body_string(response)
page <- read_html(html)
JSON API Scraping
library(httr2)
library(jsonlite)
library(purrr)
library(dplyr)
response <- request("https://api.example.com/products") %>%
req_headers(Accept = "application/json") %>%
req_url_query(page = 1, limit = 50) %>%
req_perform()
data <- resp_body_json(response)
# Convert to data frame
products <- data$results %>%
map_dfr(~ tibble(
name = .x$name,
price = .x$price,
category = .x$category
))
Session Management
library(rvest)
# Create a session (maintains cookies)
session <- session("https://example.com/login")
# Submit login form
form <- session %>%
html_form() %>%
.[[1]] %>%
html_form_set(username = "user", password = "pass")
session <- session_submit(session, form)
# Navigate to protected pages
session <- session_jump_to(session, "https://example.com/dashboard")
dashboard_data <- session %>%
read_html() %>%
html_elements(".data-row") %>%
html_text2()
Scraping Tables
R excels at scraping HTML tables:
library(rvest)
library(dplyr)
library(stringr)
# Scrape a Wikipedia table
page <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)")
# Extract all tables
tables <- page %>% html_elements("table.wikitable") %>% html_table()
# Get the first table
gdp_table <- tables[[1]]
print(gdp_table)
# Clean the data
gdp_clean <- gdp_table %>%
rename_with(~ str_replace_all(.x, "\\[.*\\]", "")) %>%
mutate(across(where(is.character), str_trim))
Handling Pagination
library(rvest)
library(purrr)
library(dplyr)
library(stringr)
scrape_page <- function(page_num) {
url <- sprintf("https://books.toscrape.com/catalogue/page-%d.html", page_num)
tryCatch({
page <- read_html(url)
books <- page %>%
html_elements("article.product_pod") %>%
map_dfr(function(book) {
tibble(
title = book %>% html_element("h3 a") %>% html_attr("title"),
price = book %>% html_element(".price_color") %>% html_text2(),
page = page_num
)
})
Sys.sleep(runif(1, 1, 2)) # Polite delay
return(books)
}, error = function(e) {
message(sprintf("Error on page %d: %s", page_num, e$message))
return(tibble())
})
}
# Scrape all 50 pages
all_books <- map(1:50, scrape_page, .progress = TRUE) %>% list_rbind()  # .progress and list_rbind() require purrr >= 1.0
cat(sprintf("Total: %d books from %d pages\n",
nrow(all_books), max(all_books$page)))
RSelenium: JavaScript Pages
For JavaScript-rendered content:
library(RSelenium)
library(rvest)
# Start Selenium server
rD <- rsDriver(browser = "chrome",
chromever = "latest",
extraCapabilities = list(
chromeOptions = list(args = list("--headless", "--no-sandbox"))
))
remDr <- rD$client
# Navigate
remDr$navigate("https://books.toscrape.com/")
# Wait for content
Sys.sleep(3)
# Get page source and parse with rvest
page_source <- remDr$getPageSource()[[1]]
page <- read_html(page_source)
books <- page %>%
html_elements("article.product_pod h3 a") %>%
html_attr("title")
print(books)
# Interact with elements (illustrative; books.toscrape.com has no search box, so adjust selectors for your target site)
search_box <- remDr$findElement(using = "css", "input[name='q']")
search_box$sendKeysToElement(list("web scraping", key = "enter"))
Sys.sleep(2)
# Click elements
next_button <- remDr$findElement(using = "css", "li.next a")
next_button$clickElement()
# Clean up
remDr$close()
rD$server$stop()
Data Cleaning with tidyverse
library(dplyr)
library(stringr)
library(tidyr)
# Clean scraped book data
clean_books <- all_books %>%
mutate(
# Extract numeric price
price_num = str_extract(price, "[\\d.]+") %>% as.numeric(),
# Convert rating words to numbers
rating_num = case_when(
str_detect(rating, "One") ~ 1,
str_detect(rating, "Two") ~ 2,
str_detect(rating, "Three") ~ 3,
str_detect(rating, "Four") ~ 4,
str_detect(rating, "Five") ~ 5,
TRUE ~ NA_real_
),
# Clean title
title_clean = str_trim(title)
) %>%
filter(!is.na(price_num)) %>%
arrange(desc(price_num))
# Summary statistics
clean_books %>%
summarise(
total_books = n(),
avg_price = mean(price_num),
median_price = median(price_num),
avg_rating = mean(rating_num, na.rm = TRUE)
)
# Visualize
library(ggplot2)
ggplot(clean_books, aes(x = price_num)) +
geom_histogram(bins = 30, fill = "steelblue") +
labs(title = "Book Price Distribution",
x = "Price (GBP)", y = "Count") +
theme_minimal()
Proxy Integration
With httr2
library(httr2)
response <- request("https://httpbin.org/ip") %>%
req_proxy("http://proxy.example.com", 8080,
username = "user", password = "pass") %>%
req_perform()
print(resp_body_json(response))
With cURL Options
library(rvest)
page <- read_html(
httr::GET("https://httpbin.org/ip",
httr::use_proxy("http://proxy.example.com", 8080,
username = "user", password = "pass"))
)
For proxy types and selection, see our web scraping proxy guide and proxy glossary.
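If you need to spread requests across several proxies, one simple pattern is to cycle through a pool on each request. The sketch below is a minimal illustration with httr2; the proxy hosts, ports, and credentials are placeholders for whatever your provider supplies.
library(httr2)
library(purrr)
# Hypothetical proxy pool; replace with your provider's hosts and ports
proxy_pool <- list(
  list(host = "http://proxy1.example.com", port = 8080),
  list(host = "http://proxy2.example.com", port = 8080)
)
fetch_with_proxy <- function(url, i) {
  p <- proxy_pool[[(i - 1) %% length(proxy_pool) + 1]]  # cycle through the pool
  request(url) %>%
    req_proxy(p$host, p$port, username = "user", password = "pass") %>%
    req_timeout(30) %>%
    req_perform() %>%
    resp_body_string()
}
# Rotate proxies across a set of URLs
urls <- sprintf("https://books.toscrape.com/catalogue/page-%d.html", 1:4)
html_pages <- imap(urls, fetch_with_proxy)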
Polite Scraping
The polite package enforces rate limits and respects robots.txt:
install.packages("polite")
library(polite)
library(rvest)
# Create a polite session
session <- bow("https://books.toscrape.com/",
user_agent = "R research bot (contact@example.com)")
# Check if scraping is allowed
print(session)
# Scrape politely (auto rate-limited)
page <- scrape(session)
books <- page %>%
html_elements("article.product_pod h3 a") %>%
html_attr("title")
# Navigate to another page politely
session2 <- nod(session, "catalogue/page-2.html")
page2 <- scrape(session2)
FAQ
Is R good for web scraping?
R is excellent for web scraping when your goal is data analysis. The rvest package makes HTML parsing straightforward, and the tidyverse integration means scraped data flows directly into analysis pipelines. For complex crawling projects, Python’s Scrapy has more features, but R handles most scraping tasks well.
What is the best R package for web scraping?
rvest is the standard choice for HTML scraping. Pair it with httr2 for advanced HTTP control. For JavaScript-rendered sites, use RSelenium. The polite package adds automatic rate limiting and robots.txt compliance.
Can R scrape JavaScript-rendered pages?
Yes, using RSelenium which controls a real browser. Alternatively, check the Network tab for API endpoints that you can call directly with httr2, which is much faster than browser automation.
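For example, once you spot a JSON endpoint in the browser's Network tab, you can replay it with httr2. The endpoint URL and response fields below are hypothetical stand-ins for whatever the page actually requests.
library(httr2)
library(purrr)
library(dplyr)
# Hypothetical endpoint found in the Network tab
response <- request("https://example.com/api/search") %>%
  req_url_query(q = "laptops", page = 1) %>%
  req_headers(Accept = "application/json") %>%
  req_perform()
# Flatten the JSON into a tibble (field names depend on the real API)
items <- resp_body_json(response)$items %>%
  map_dfr(~ tibble(name = .x$name, price = .x$price))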
How does R compare to Python for web scraping?
Python has more scraping libraries and a larger community. R’s advantage is direct integration with statistical analysis and visualization. Use R when scraping is a step in an analysis pipeline; use Python for dedicated scraping projects.
How do I avoid getting blocked when scraping with R?
Use the polite package to respect robots.txt and rate limits. Add realistic User-Agent headers, introduce random delays with Sys.sleep(), and use rotating proxies for large-scale scraping.
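As a minimal sketch of those ideas with httr2 (the User-Agent strings are placeholders; combine with req_proxy() from the Proxy Integration section for rotation at scale):
library(httr2)
# Small pool of realistic User-Agent strings (placeholders)
user_agents <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
)
polite_fetch <- function(url) {
  Sys.sleep(runif(1, 1, 3))  # random delay between requests
  request(url) %>%
    req_user_agent(sample(user_agents, 1)) %>%  # rotate the User-Agent
    req_retry(max_tries = 3) %>%                # retry transient failures
    req_perform() %>%
    resp_body_string()
}
html <- polite_fetch("https://books.toscrape.com/")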
Explore web scraping in other languages: Python, Java, Node.js. For proxy setup, see our web scraping proxy guide.
External Resources:
- rvest Documentation
- httr2 Documentation
- polite Package
Related Reading
- aiohttp + BeautifulSoup: Async Python Scraping
- Axios + Cheerio: Lightweight Node.js Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix
- How to Build an Ethical Web Scraping Policy for Your Company