Web Scraping with R Complete Guide: rvest and Beyond

R is the language of choice for many statisticians, data scientists, and researchers. While Python dominates web scraping discussions, R has a solid scraping ecosystem centered on the rvest package. If your end goal is data analysis in R, scraping directly into R data frames eliminates the data handoff between languages.

This guide covers web scraping with R from the basics to production patterns, with proxy integration and tidyverse workflow examples.

Why Scrape with R

R makes sense for web scraping when:

  • your analysis is in R: scraping directly into tibbles/data frames means no format conversion
  • you use the tidyverse: rvest integrates seamlessly with dplyr, tidyr, purrr, and ggplot2
  • you are a researcher: R’s statistical modeling and visualization are unmatched
  • you need reproducible research: R Markdown combines scraping, analysis, and reporting in one document
  • your team knows R: keeping everything in one language reduces complexity

When to use Python instead:

  • high-volume scraping: Python has more scalable frameworks like Scrapy
  • JavaScript-heavy sites: Python’s Playwright integration is more mature
  • anti-bot evasion: Python has more anti-detection libraries
  • production systems: Python scrapers are easier to deploy and schedule

R Scraping Libraries

Package     Purpose             Key Feature
rvest       HTML scraping       CSS selectors, tidyverse integration
httr2       HTTP requests       modern HTTP client with retry logic
httr        HTTP requests       classic HTTP client
xml2        XML/HTML parsing    underlying parser for rvest
polite      ethical scraping    respects robots.txt automatically
RSelenium   browser automation  Selenium bindings for R
chromote    headless Chrome     Chrome DevTools Protocol

Getting Started with rvest

Installation

# install core packages
install.packages("rvest")
install.packages("httr2")
install.packages("dplyr")
install.packages("purrr")
install.packages("tibble")

# or install the entire tidyverse
install.packages("tidyverse")

Basic Scraping

library(rvest)
library(dplyr)
library(purrr)  # for map_df()

# read a web page
page <- read_html("https://quotes.toscrape.com/")

# extract quotes using CSS selectors
quotes <- page %>%
  html_elements("div.quote") %>%
  map_df(function(quote) {
    tibble(
      text = quote %>% html_element("span.text") %>% html_text(),
      author = quote %>% html_element("small.author") %>% html_text(),
      tags = quote %>% html_elements("div.tags a.tag") %>% html_text() %>% paste(collapse = ", ")
    )
  })

# view the data
print(quotes)

CSS Selectors with rvest

library(rvest)

page <- read_html("https://example.com")

# basic selectors
titles <- page %>% html_elements("h2") %>% html_text()
links <- page %>% html_elements("a") %>% html_attr("href")
images <- page %>% html_elements("img") %>% html_attr("src")

# class selectors
products <- page %>% html_elements(".product-card")
prices <- page %>% html_elements("span.price") %>% html_text()

# attribute selectors
external_links <- page %>% html_elements("a[target='_blank']") %>% html_attr("href")
product_links <- page %>% html_elements("a[href*='product']") %>% html_attr("href")

# nested selectors
product_names <- page %>%
  html_elements("div.product-card h2") %>%
  html_text()

# nth-child
first_item <- page %>% html_element("ul li:first-child") %>% html_text()

Extracting Tables

rvest has a dedicated function for extracting HTML tables:

library(rvest)

page <- read_html("https://example.com/data-table")

# extract all tables as data frames
tables <- page %>% html_table()

# get the first table
df <- tables[[1]]
print(df)

# or target a specific table
pricing_table <- page %>%
  html_element("table.pricing") %>%
  html_table()

Adding Proxy Support

Using httr2 with Proxies

httr2 is the modern HTTP client for R with built-in proxy support:

library(httr2)
library(rvest)

# create a request that routes through a proxy
fetch_with_proxy <- function(url, proxy_url, proxy_user = NULL, proxy_pass = NULL) {
  req <- request(url) %>%
    req_headers(
      "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
      "Accept" = "text/html,application/xhtml+xml",
      "Accept-Language" = "en-US,en;q=0.9"
    ) %>%
    req_timeout(30)

  # attach the proxy once, with credentials when provided
  if (!is.null(proxy_user) && !is.null(proxy_pass)) {
    req <- req %>% req_proxy(proxy_url, username = proxy_user, password = proxy_pass)
  } else {
    req <- req %>% req_proxy(proxy_url)
  }

  resp <- req %>% req_perform()
  resp %>% resp_body_html()
}

# usage
page <- fetch_with_proxy(
  "https://httpbin.org/ip",
  "http://proxy.example.com:8080",
  "username",
  "password"
)

print(page %>% html_text())

Using httr with Proxies (Classic)

library(httr)
library(rvest)

# set proxy
fetch_page <- function(url, proxy_host, proxy_port, proxy_user = NULL, proxy_pass = NULL) {
  if (!is.null(proxy_user)) {
    response <- GET(
      url,
      use_proxy(proxy_host, proxy_port, proxy_user, proxy_pass),
      user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"),
      timeout(30)
    )
  } else {
    response <- GET(
      url,
      use_proxy(proxy_host, proxy_port),
      user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"),
      timeout(30)
    )
  }

  if (status_code(response) == 200) {
    return(read_html(content(response, as = "text", encoding = "UTF-8")))
  } else {
    stop(paste("HTTP error:", status_code(response)))
  }
}

# usage
page <- fetch_page(
  "https://example.com",
  "proxy.example.com", 8080,
  "user", "pass"
)

Rotating Proxies

library(httr2)
library(rvest)
library(dplyr)  # for bind_rows()

# proxy rotation class using R6
library(R6)

ProxyRotator <- R6Class("ProxyRotator",
  public = list(
    proxies = NULL,
    current_index = 0,

    initialize = function(proxies) {
      self$proxies <- proxies
    },

    get_next = function() {
      self$current_index <- (self$current_index %% length(self$proxies)) + 1
      self$proxies[[self$current_index]]
    },

    get_random = function() {
      self$proxies[[sample(length(self$proxies), 1)]]
    }
  )
)

# scraper with proxy rotation
scrape_with_rotation <- function(urls, proxies, delay = 2) {
  rotator <- ProxyRotator$new(proxies)
  results <- list()

  for (i in seq_along(urls)) {
    proxy <- rotator$get_next()
    cat(sprintf("[%d/%d] %s (proxy: %s)\n", i, length(urls), urls[i], proxy$url))

    tryCatch({
      page <- request(urls[i]) %>%
        req_proxy(proxy$url, username = proxy$user, password = proxy$pass) %>%
        req_headers("User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)") %>%
        req_timeout(30) %>%
        req_perform() %>%
        resp_body_html()

      results[[i]] <- list(
        url = urls[i],
        title = page %>% html_element("title") %>% html_text(),
        success = TRUE
      )
    }, error = function(e) {
      cat(sprintf("  error: %s\n", e$message))
      results[[i]] <<- list(url = urls[i], error = e$message, success = FALSE)
    })

    Sys.sleep(delay + runif(1, 0, 1))
  }

  bind_rows(results)
}

# usage
proxies <- list(
  list(url = "http://proxy1.example.com:8080", user = "user", pass = "pass"),
  list(url = "http://proxy2.example.com:8080", user = "user", pass = "pass"),
  list(url = "http://proxy3.example.com:8080", user = "user", pass = "pass")
)

urls <- paste0("https://example.com/page/", 1:10)
results <- scrape_with_rotation(urls, proxies, delay = 2)

Building a Complete Scraping Pipeline

Here is a full pipeline that scrapes, cleans, and analyzes data:

library(rvest)
library(httr2)
library(dplyr)
library(purrr)
library(stringr)
library(readr)

# step 1: define the scraper
scrape_product_page <- function(url, proxy = NULL) {
  req <- request(url) %>%
    req_headers(
      "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
      "Accept" = "text/html"
    ) %>%
    req_timeout(30) %>%
    req_retry(max_tries = 3, backoff = ~ 2)

  if (!is.null(proxy)) {
    req <- req %>% req_proxy(proxy$url, username = proxy$user, password = proxy$pass)
  }

  page <- req %>% req_perform() %>% resp_body_html()

  tibble(
    url = url,
    name = page %>% html_element("h1") %>% html_text(trim = TRUE),
    price = page %>% html_element("[class*='price']") %>% html_text(trim = TRUE),
    rating = page %>% html_element("[class*='rating']") %>% html_text(trim = TRUE),
    description = page %>% html_element("[class*='description']") %>% html_text(trim = TRUE),
    scraped_at = Sys.time()
  )
}

# step 2: scrape multiple pages with purrr
scrape_all_products <- function(urls, proxy = NULL, delay = 2) {
  results <- map_df(seq_along(urls), function(i) {
    cat(sprintf("[%d/%d] %s\n", i, length(urls), urls[i]))

    result <- tryCatch(
      scrape_product_page(urls[i], proxy),
      error = function(e) {
        tibble(url = urls[i], error = e$message)
      }
    )

    Sys.sleep(delay)
    result
  })

  results
}

# step 3: clean and analyze
clean_price <- function(price_text) {
  # extract numeric price from text like "$29.99" or "USD 29.99"
  as.numeric(str_extract(price_text, "[0-9]+\\.?[0-9]*"))
}

clean_rating <- function(rating_text) {
  as.numeric(str_extract(rating_text, "[0-9]+\\.?[0-9]*"))
}

# step 4: run the pipeline
proxy <- list(url = "http://proxy.example.com:8080", user = "user", pass = "pass")

urls <- paste0("https://example.com/product/", 1:20)

raw_data <- scrape_all_products(urls, proxy = proxy, delay = 2)

# clean the data (keep only rows that scraped successfully)
if ("error" %in% names(raw_data)) {
  raw_data <- raw_data %>% filter(is.na(error))
}

clean_data <- raw_data %>%
  mutate(
    price_numeric = clean_price(price),
    rating_numeric = clean_rating(rating)
  )

# basic analysis
summary_stats <- clean_data %>%
  summarise(
    n_products = n(),
    avg_price = mean(price_numeric, na.rm = TRUE),
    min_price = min(price_numeric, na.rm = TRUE),
    max_price = max(price_numeric, na.rm = TRUE),
    avg_rating = mean(rating_numeric, na.rm = TRUE)
  )

print(summary_stats)

# save results
write_csv(clean_data, "products.csv")

Handling Pagination

library(rvest)
library(httr2)  # the loop below uses request()/req_perform()
library(purrr)
library(dplyr)

scrape_paginated <- function(base_url, max_pages = 10, proxy = NULL) {
  all_data <- tibble()

  for (page_num in 1:max_pages) {
    url <- paste0(base_url, "?page=", page_num)
    cat(sprintf("scraping page %d...\n", page_num))

    page <- tryCatch({
      req <- request(url) %>%
        req_headers("User-Agent" = "Mozilla/5.0") %>%
        req_timeout(30)

      if (!is.null(proxy)) {
        req <- req %>% req_proxy(proxy$url, username = proxy$user, password = proxy$pass)
      }

      req %>% req_perform() %>% resp_body_html()
    }, error = function(e) {
      cat(sprintf("  error on page %d: %s\n", page_num, e$message))
      return(NULL)
    })

    if (is.null(page)) break

    # extract items from this page
    items <- page %>% html_elements("div.item")

    if (length(items) == 0) {
      cat("no items found, stopping\n")
      break
    }

    page_data <- map_df(items, function(item) {
      tibble(
        title = item %>% html_element("h2") %>% html_text(trim = TRUE),
        price = item %>% html_element(".price") %>% html_text(trim = TRUE),
        link = item %>% html_element("a") %>% html_attr("href")
      )
    })

    all_data <- bind_rows(all_data, page_data)
    cat(sprintf("  found %d items (total: %d)\n", nrow(page_data), nrow(all_data)))

    Sys.sleep(2)
  }

  all_data
}

# usage
data <- scrape_paginated("https://example.com/products", max_pages = 5)

Polite Scraping with the polite Package

The polite package automatically respects robots.txt and adds delays:

install.packages("polite")
library(polite)
library(rvest)

# introduce yourself to the website
session <- bow("https://example.com/", user_agent = "R Research Bot (academic)")

# check what is allowed
print(session)

# scrape politely (respects robots.txt and crawl-delay)
page <- scrape(session)

# navigate to specific pages
product_session <- nod(session, "https://example.com/products")
product_page <- scrape(product_session)

products <- product_page %>%
  html_elements(".product") %>%
  map_df(function(item) {
    tibble(
      name = item %>% html_element("h2") %>% html_text(trim = TRUE),
      price = item %>% html_element(".price") %>% html_text(trim = TRUE)
    )
  })

JavaScript-Rendered Pages with RSelenium

For pages that require JavaScript:

install.packages("RSelenium")
library(RSelenium)
library(rvest)

# start a Selenium server with Chrome
driver <- rsDriver(
  browser = "chrome",
  chromever = "latest",
  extraCapabilities = list(
    chromeOptions = list(
      args = list(
        "--headless",
        "--disable-gpu",
        "--proxy-server=http://proxy.example.com:8080"
      )
    )
  )
)

client <- driver$client

# navigate to a page
client$navigate("https://example.com/dynamic-page")

# wait for content to load
Sys.sleep(5)

# get the rendered HTML
page_source <- client$getPageSource()[[1]]
page <- read_html(page_source)

# extract data with rvest
data <- page %>%
  html_elements(".dynamic-content") %>%
  html_text()

print(data)

# clean up
client$close()
driver$server$stop()
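
If you prefer not to run a Selenium server, the chromote package (listed in the libraries table above) drives headless Chrome directly over the Chrome DevTools Protocol. The sketch below is a minimal, untested outline of that approach; the URL and the ".dynamic-content" selector are the same placeholders used in the RSelenium example.

library(chromote)
library(rvest)

# start a headless Chrome session
b <- ChromoteSession$new()

# navigate and wait for the page load event
b$Page$navigate("https://example.com/dynamic-page")
b$Page$loadEventFired()

# pull the rendered HTML out of the browser
html <- b$Runtime$evaluate("document.documentElement.outerHTML")$result$value
page <- read_html(html)

# extract data with rvest as usual
data <- page %>%
  html_elements(".dynamic-content") %>%
  html_text()

b$close()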

Visualization After Scraping

One of R’s biggest strengths is going directly from scraping to visualization:

library(ggplot2)
library(dplyr)
library(readr)  # for read_csv()

# assume we have scraped product data with price, rating, and category columns
products <- read_csv("products.csv")

# price distribution
ggplot(products, aes(x = price_numeric)) +
  geom_histogram(bins = 30, fill = "#2ecc71", color = "white") +
  labs(title = "product price distribution", x = "price (USD)", y = "count") +
  theme_minimal()

# price by category
ggplot(products, aes(x = category, y = price_numeric, fill = category)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "price by category", x = "", y = "price (USD)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# rating vs price
ggplot(products, aes(x = price_numeric, y = rating_numeric)) +
  geom_point(alpha = 0.5, color = "#e74c3c") +
  geom_smooth(method = "lm", color = "#2ecc71") +
  labs(title = "price vs rating", x = "price (USD)", y = "rating") +
  theme_minimal()

R vs Python for Web Scraping

Aspect                R (rvest)              Python (BeautifulSoup)
syntax                tidyverse pipe (%>%)   method chaining
data frames           tibble (built-in)      pandas (separate)
visualization         ggplot2 (excellent)    matplotlib/seaborn
statistical analysis  native strength        via scipy/statsmodels
async scraping        limited                asyncio, aiohttp
anti-detection        basic                  many libraries
browser automation    RSelenium              Playwright, Selenium
community size        smaller                much larger
deployment            less common            standard

Best Practices

  1. Use the polite package for academic and research scraping. It respects robots.txt automatically.
  2. Pipe into tibbles to keep your data in tidy format from the start.
  3. Use purrr::map_df for extracting multiple items into a data frame.
  4. Add error handling with tryCatch around each request.
  5. Cache raw HTML during development to avoid re-fetching pages (see the sketch after this list).
  6. Set realistic delays between requests (2+ seconds).
  7. Rotate proxies for any scraping beyond simple research.
  8. Validate extracted data before analysis. Check for NAs and unexpected formats.
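
As referenced in point 5, here is a minimal caching sketch for development work. It is illustrative only: the cache/ directory, the read_html_cached() helper name, and the example URL are assumptions, not part of any library.

library(httr2)
library(rvest)

# read a page from a local cache if we already fetched it,
# otherwise fetch it once and save the raw HTML to disk
read_html_cached <- function(url, cache_dir = "cache") {
  dir.create(cache_dir, showWarnings = FALSE)
  cache_file <- file.path(cache_dir, paste0(gsub("[^A-Za-z0-9]", "_", url), ".html"))

  if (!file.exists(cache_file)) {
    html_text <- request(url) %>%
      req_headers("User-Agent" = "Mozilla/5.0") %>%
      req_timeout(30) %>%
      req_perform() %>%
      resp_body_string()
    writeLines(html_text, cache_file)
  }

  read_html(cache_file)
}

# during development, repeated calls hit the local copy instead of the site
page <- read_html_cached("https://quotes.toscrape.com/")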

Conclusion

Web scraping with R using rvest is the natural choice when your goal is data analysis. The seamless integration with the tidyverse means you can go from raw web page to cleaned data frame to visualization in a single script. For research and academic data collection, the polite package adds ethical scraping practices with minimal effort.

For large-scale production scraping, Python is still the better tool. But for research projects, competitive analysis, and data science workflows where scraping is just the first step in a longer analytical pipeline, R keeps everything in one ecosystem with less friction.
