Web Scraping with R Complete Guide: rvest and Beyond

R is the language of choice for many statisticians, data scientists, and researchers. While Python dominates web scraping discussions, R has a solid scraping ecosystem centered on the rvest package. If your end goal is data analysis in R, scraping directly into R data frames eliminates the data handoff between languages.

This guide covers web scraping with R from the basics to production patterns, with proxy integration and tidyverse workflow examples.

Why Scrape with R

R makes sense for web scraping when:

  • your analysis is in R: scraping directly into tibbles/data frames means no format conversion
  • you use the tidyverse: rvest integrates seamlessly with dplyr, tidyr, purrr, and ggplot2
  • you are a researcher: R’s statistical modeling and visualization are unmatched
  • you need reproducible research: R Markdown combines scraping, analysis, and reporting in one document
  • your team knows R: keeping everything in one language reduces complexity

When to use Python instead:

  • high-volume scraping: Python has more scalable frameworks like Scrapy
  • JavaScript-heavy sites: Python’s Playwright integration is more mature
  • anti-bot evasion: Python has more anti-detection libraries
  • production systems: Python scrapers are easier to deploy and schedule

R Scraping Libraries

Package     Purpose             Key Feature
rvest       HTML scraping       CSS selectors, tidyverse integration
httr2       HTTP requests       modern HTTP client with retry logic
httr        HTTP requests       classic HTTP client
xml2        XML/HTML parsing    underlying parser for rvest
polite      ethical scraping    respects robots.txt automatically
RSelenium   browser automation  Selenium bindings for R
chromote    headless Chrome     Chrome DevTools Protocol

Getting Started with rvest

Installation

# install core packages
install.packages("rvest")
install.packages("httr2")
install.packages("dplyr")
install.packages("purrr")
install.packages("tibble")

# or install the entire tidyverse
install.packages("tidyverse")

Basic Scraping

library(rvest)
library(dplyr)
library(purrr)  # for map_df()

# read a web page
page <- read_html("https://quotes.toscrape.com/")

# extract quotes using CSS selectors
quotes <- page %>%
  html_elements("div.quote") %>%
  map_df(function(quote) {
    tibble(
      text = quote %>% html_element("span.text") %>% html_text(),
      author = quote %>% html_element("small.author") %>% html_text(),
      tags = quote %>% html_elements("div.tags a.tag") %>% html_text() %>% paste(collapse = ", ")
    )
  })

# view the data
print(quotes)

CSS Selectors with rvest

library(rvest)

page <- read_html("https://example.com")

# basic selectors
titles <- page %>% html_elements("h2") %>% html_text()
links <- page %>% html_elements("a") %>% html_attr("href")
images <- page %>% html_elements("img") %>% html_attr("src")

# class selectors
products <- page %>% html_elements(".product-card")
prices <- page %>% html_elements("span.price") %>% html_text()

# attribute selectors
external_links <- page %>% html_elements("a[target='_blank']") %>% html_attr("href")
product_links <- page %>% html_elements("a[href*='product']") %>% html_attr("href")

# nested selectors
product_names <- page %>%
  html_elements("div.product-card h2") %>%
  html_text()

# nth-child
first_item <- page %>% html_element("ul li:first-child") %>% html_text()

Extracting Tables

rvest has a dedicated function for extracting HTML tables:

library(rvest)

page <- read_html("https://example.com/data-table")

# extract all tables as data frames
tables <- page %>% html_table()

# get the first table
df <- tables[[1]]
print(df)

# or target a specific table
pricing_table <- page %>%
  html_element("table.pricing") %>%
  html_table()

Adding Proxy Support

Using httr2 with Proxies

httr2 is the modern HTTP client for R with built-in proxy support:

library(httr2)
library(rvest)

# create a request that routes through a proxy
fetch_with_proxy <- function(url, proxy_url, proxy_user = NULL, proxy_pass = NULL) {
  req <- request(url) %>%
    req_headers(
      "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
      "Accept" = "text/html,application/xhtml+xml",
      "Accept-Language" = "en-US,en;q=0.9"
    ) %>%
    req_timeout(30)

  # attach the proxy once, with credentials when provided
  if (!is.null(proxy_user) && !is.null(proxy_pass)) {
    req <- req %>% req_proxy(proxy_url, username = proxy_user, password = proxy_pass)
  } else {
    req <- req %>% req_proxy(proxy_url)
  }

  resp <- req %>% req_perform()
  resp %>% resp_body_html()
}

# usage
page <- fetch_with_proxy(
  "https://httpbin.org/ip",
  "http://proxy.example.com:8080",
  "username",
  "password"
)

print(page %>% html_text())

Using httr with Proxies (Classic)

library(httr)
library(rvest)

# set proxy
fetch_page <- function(url, proxy_host, proxy_port, proxy_user = NULL, proxy_pass = NULL) {
  if (!is.null(proxy_user)) {
    response <- GET(
      url,
      use_proxy(proxy_host, proxy_port, proxy_user, proxy_pass),
      user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"),
      timeout(30)
    )
  } else {
    response <- GET(
      url,
      use_proxy(proxy_host, proxy_port),
      user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"),
      timeout(30)
    )
  }

  if (status_code(response) == 200) {
    return(read_html(content(response, as = "text", encoding = "UTF-8")))
  } else {
    stop(paste("HTTP error:", status_code(response)))
  }
}

# usage
page <- fetch_page(
  "https://example.com",
  "proxy.example.com", 8080,
  "user", "pass"
)

Rotating Proxies

library(httr2)
library(rvest)
library(dplyr)  # for bind_rows()

# proxy rotation class using R6
library(R6)

ProxyRotator <- R6Class("ProxyRotator",
  public = list(
    proxies = NULL,
    current_index = 0,

    initialize = function(proxies) {
      self$proxies <- proxies
    },

    get_next = function() {
      self$current_index <- (self$current_index %% length(self$proxies)) + 1
      self$proxies[[self$current_index]]
    },

    get_random = function() {
      self$proxies[[sample(length(self$proxies), 1)]]
    }
  )
)

# scraper with proxy rotation
scrape_with_rotation <- function(urls, proxies, delay = 2) {
  rotator <- ProxyRotator$new(proxies)
  results <- list()

  for (i in seq_along(urls)) {
    proxy <- rotator$get_next()
    cat(sprintf("[%d/%d] %s (proxy: %s)\n", i, length(urls), urls[i], proxy$url))

    tryCatch({
      page <- request(urls[i]) %>%
        req_proxy(proxy$url, username = proxy$user, password = proxy$pass) %>%
        req_headers("User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)") %>%
        req_timeout(30) %>%
        req_perform() %>%
        resp_body_html()

      results[[i]] <- list(
        url = urls[i],
        title = page %>% html_element("title") %>% html_text(),
        success = TRUE
      )
    }, error = function(e) {
      cat(sprintf("  error: %s\n", e$message))
      results[[i]] <<- list(url = urls[i], error = e$message, success = FALSE)
    })

    Sys.sleep(delay + runif(1, 0, 1))
  }

  bind_rows(results)
}

# usage
proxies <- list(
  list(url = "http://proxy1.example.com:8080", user = "user", pass = "pass"),
  list(url = "http://proxy2.example.com:8080", user = "user", pass = "pass"),
  list(url = "http://proxy3.example.com:8080", user = "user", pass = "pass")
)

urls <- paste0("https://example.com/page/", 1:10)
results <- scrape_with_rotation(urls, proxies, delay = 2)

Building a Complete Scraping Pipeline

Here is a full pipeline that scrapes, cleans, and analyzes data:

library(rvest)
library(httr2)
library(dplyr)
library(purrr)
library(stringr)
library(readr)

# step 1: define the scraper
scrape_product_page <- function(url, proxy = NULL) {
  req <- request(url) %>%
    req_headers(
      "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
      "Accept" = "text/html"
    ) %>%
    req_timeout(30) %>%
    req_retry(max_tries = 3, backoff = ~ 2)

  if (!is.null(proxy)) {
    req <- req %>% req_proxy(proxy$url, username = proxy$user, password = proxy$pass)
  }

  page <- req %>% req_perform() %>% resp_body_html()

  tibble(
    url = url,
    name = page %>% html_element("h1") %>% html_text(trim = TRUE),
    price = page %>% html_element("[class*='price']") %>% html_text(trim = TRUE),
    rating = page %>% html_element("[class*='rating']") %>% html_text(trim = TRUE),
    description = page %>% html_element("[class*='description']") %>% html_text(trim = TRUE),
    scraped_at = Sys.time()
  )
}

# step 2: scrape multiple pages with purrr
scrape_all_products <- function(urls, proxy = NULL, delay = 2) {
  results <- map_df(seq_along(urls), function(i) {
    cat(sprintf("[%d/%d] %s\n", i, length(urls), urls[i]))

    result <- tryCatch(
      scrape_product_page(urls[i], proxy),
      error = function(e) {
        tibble(url = urls[i], error = e$message)
      }
    )

    Sys.sleep(delay)
    result
  })

  results
}

# step 3: clean and analyze
clean_price <- function(price_text) {
  # extract numeric price from text like "$29.99" or "USD 29.99"
  as.numeric(str_extract(price_text, "[0-9]+\\.?[0-9]*"))
}

clean_rating <- function(rating_text) {
  as.numeric(str_extract(rating_text, "[0-9]+\\.?[0-9]*"))
}

# step 4: run the pipeline
proxy <- list(url = "http://proxy.example.com:8080", user = "user", pass = "pass")

urls <- paste0("https://example.com/product/", 1:20)

raw_data <- scrape_all_products(urls, proxy = proxy, delay = 2)

# clean the data (keep only rows that scraped successfully)
if ("error" %in% names(raw_data)) {
  raw_data <- raw_data %>% filter(is.na(error))
}

clean_data <- raw_data %>%
  mutate(
    price_numeric = clean_price(price),
    rating_numeric = clean_rating(rating)
  )

# basic analysis
summary_stats <- clean_data %>%
  summarise(
    n_products = n(),
    avg_price = mean(price_numeric, na.rm = TRUE),
    min_price = min(price_numeric, na.rm = TRUE),
    max_price = max(price_numeric, na.rm = TRUE),
    avg_rating = mean(rating_numeric, na.rm = TRUE)
  )

print(summary_stats)

# save results
write_csv(clean_data, "products.csv")

Handling Pagination

library(rvest)
library(httr2)  # the loop below uses request()/req_perform()
library(purrr)
library(dplyr)

scrape_paginated <- function(base_url, max_pages = 10, proxy = NULL) {
  all_data <- tibble()

  for (page_num in 1:max_pages) {
    url <- paste0(base_url, "?page=", page_num)
    cat(sprintf("scraping page %d...\n", page_num))

    page <- tryCatch({
      req <- request(url) %>%
        req_headers("User-Agent" = "Mozilla/5.0") %>%
        req_timeout(30)

      if (!is.null(proxy)) {
        req <- req %>% req_proxy(proxy$url, username = proxy$user, password = proxy$pass)
      }

      req %>% req_perform() %>% resp_body_html()
    }, error = function(e) {
      cat(sprintf("  error on page %d: %s\n", page_num, e$message))
      return(NULL)
    })

    if (is.null(page)) break

    # extract items from this page
    items <- page %>% html_elements("div.item")

    if (length(items) == 0) {
      cat("no items found, stopping\n")
      break
    }

    page_data <- map_df(items, function(item) {
      tibble(
        title = item %>% html_element("h2") %>% html_text(trim = TRUE),
        price = item %>% html_element(".price") %>% html_text(trim = TRUE),
        link = item %>% html_element("a") %>% html_attr("href")
      )
    })

    all_data <- bind_rows(all_data, page_data)
    cat(sprintf("  found %d items (total: %d)\n", nrow(page_data), nrow(all_data)))

    Sys.sleep(2)
  }

  all_data
}

# usage
data <- scrape_paginated("https://example.com/products", max_pages = 5)

Polite Scraping with the polite Package

The polite package automatically respects robots.txt and adds delays:

install.packages("polite")
library(polite)
library(rvest)

# introduce yourself to the website
session <- bow("https://example.com/", user_agent = "R Research Bot (academic)")

# check what is allowed
print(session)

# scrape politely (respects robots.txt and crawl-delay)
page <- scrape(session)

# navigate to specific pages
product_session <- nod(session, "https://example.com/products")
product_page <- scrape(product_session)

products <- product_page %>%
  html_elements(".product") %>%
  map_df(function(item) {
    tibble(
      name = item %>% html_element("h2") %>% html_text(trim = TRUE),
      price = item %>% html_element(".price") %>% html_text(trim = TRUE)
    )
  })

JavaScript-Rendered Pages with RSelenium

For pages that require JavaScript:

install.packages("RSelenium")
library(RSelenium)
library(rvest)

# start a Selenium server with Chrome
driver <- rsDriver(
  browser = "chrome",
  chromever = "latest",
  extraCapabilities = list(
    chromeOptions = list(
      args = list(
        "--headless",
        "--disable-gpu",
        "--proxy-server=http://proxy.example.com:8080"
      )
    )
  )
)

client <- driver$client

# navigate to a page
client$navigate("https://example.com/dynamic-page")

# wait for content to load
Sys.sleep(5)

# get the rendered HTML
page_source <- client$getPageSource()[[1]]
page <- read_html(page_source)

# extract data with rvest
data <- page %>%
  html_elements(".dynamic-content") %>%
  html_text()

print(data)

# clean up
client$close()
driver$server$stop()
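
If you prefer not to run a Selenium server, the chromote package (listed in the libraries table above) drives headless Chrome directly over the Chrome DevTools Protocol. The sketch below is a minimal, untested outline of that approach; the URL and the ".dynamic-content" selector are the same placeholders used in the RSelenium example.

library(chromote)
library(rvest)

# start a headless Chrome session
b <- ChromoteSession$new()

# navigate and wait for the page load event
b$Page$navigate("https://example.com/dynamic-page")
b$Page$loadEventFired()

# pull the rendered HTML out of the browser
html <- b$Runtime$evaluate("document.documentElement.outerHTML")$result$value
page <- read_html(html)

# extract data with rvest as usual
data <- page %>%
  html_elements(".dynamic-content") %>%
  html_text()

b$close()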

Visualization After Scraping

One of R’s biggest strengths is going directly from scraping to visualization:

library(ggplot2)
library(dplyr)
library(readr)  # for read_csv()

# assume we have scraped product data with price, rating, and category columns
products <- read_csv("products.csv")

# price distribution
ggplot(products, aes(x = price_numeric)) +
  geom_histogram(bins = 30, fill = "#2ecc71", color = "white") +
  labs(title = "product price distribution", x = "price (USD)", y = "count") +
  theme_minimal()

# price by category
ggplot(products, aes(x = category, y = price_numeric, fill = category)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "price by category", x = "", y = "price (USD)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# rating vs price
ggplot(products, aes(x = price_numeric, y = rating_numeric)) +
  geom_point(alpha = 0.5, color = "#e74c3c") +
  geom_smooth(method = "lm", color = "#2ecc71") +
  labs(title = "price vs rating", x = "price (USD)", y = "rating") +
  theme_minimal()

R vs Python for Web Scraping

Aspect                R (rvest)              Python (BeautifulSoup)
syntax                tidyverse pipe (%>%)   method chaining
data frames           tibble (built-in)      pandas (separate)
visualization         ggplot2 (excellent)    matplotlib/seaborn
statistical analysis  native strength        via scipy/statsmodels
async scraping        limited                asyncio, aiohttp
anti-detection        basic                  many libraries
browser automation    RSelenium              Playwright, Selenium
community size        smaller                much larger
deployment            less common            standard

Best Practices

  1. Use the polite package for academic and research scraping. It respects robots.txt automatically.
  2. Pipe into tibbles to keep your data in tidy format from the start.
  3. Use purrr::map_df for extracting multiple items into a data frame.
  4. Add error handling with tryCatch around each request.
  5. Cache raw HTML during development to avoid re-fetching pages (see the sketch after this list).
  6. Set realistic delays between requests (2+ seconds).
  7. Rotate proxies for any scraping beyond simple research.
  8. Validate extracted data before analysis. Check for NAs and unexpected formats.
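
As referenced in point 5, here is a minimal caching sketch for development work. It is illustrative only: the cache/ directory, the read_html_cached() helper name, and the example URL are assumptions, not part of any library.

library(httr2)
library(rvest)

# read a page from a local cache if we already fetched it,
# otherwise fetch it once and save the raw HTML to disk
read_html_cached <- function(url, cache_dir = "cache") {
  dir.create(cache_dir, showWarnings = FALSE)
  cache_file <- file.path(cache_dir, paste0(gsub("[^A-Za-z0-9]", "_", url), ".html"))

  if (!file.exists(cache_file)) {
    html_text <- request(url) %>%
      req_headers("User-Agent" = "Mozilla/5.0") %>%
      req_timeout(30) %>%
      req_perform() %>%
      resp_body_string()
    writeLines(html_text, cache_file)
  }

  read_html(cache_file)
}

# during development, repeated calls hit the local copy instead of the site
page <- read_html_cached("https://quotes.toscrape.com/")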

Conclusion

Web scraping with R using rvest is the natural choice when your goal is data analysis. The seamless integration with the tidyverse means you can go from raw web page to cleaned data frame to visualization in a single script. For research and academic data collection, the polite package adds ethical scraping practices with minimal effort.

For large-scale production scraping, Python is still the better tool. But for research projects, competitive analysis, and data science workflows where scraping is just the first step in a longer analytical pipeline, R keeps everything in one ecosystem with less friction.
