R web scraping uses rvest for static HTML parsing and RSelenium for JavaScript-rendered pages, and both slot cleanly into tidyverse workflows. this guide covers installation, CSS selectors, table extraction, and headless browser setup for data scientists working in R.
why scrape in R?
data scientists already working in R can collect web data without switching to Python. rvest, part of the tidyverse ecosystem, offers a clean, pipe-friendly API for HTML parsing. RSelenium provides full browser automation when JavaScript rendering is needed. together they cover the vast majority of scraping use cases without leaving the R environment.
recent rvest releases integrate with the httr2 package for modern HTTP handling. older tutorials that call html_session() should be updated to use session() instead.
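the change is a simple rename; a minimal sketch (the URL is a placeholder):
library(rvest)
# before rvest 1.0: s <- html_session("https://example.com")
s <- session("https://example.com") # current equivalent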
installation
# install from CRAN
install.packages("rvest") # v1.0.3
install.packages("httr2") # v1.0.0
install.packages("RSelenium") # v1.7.9
install.packages("tidyverse") # for data processingbasic scraping with rvest
library(rvest)
library(dplyr)
url <- "https://news.ycombinator.com"
page <- read_html(url)
# CSS selectors
titles <- page |>
  html_elements(".titleline > a") |>
  html_text2()
links <- page |>
  html_elements(".titleline > a") |>
  html_attr("href")
results <- tibble(title = titles, url = links)
head(results)
scraping HTML tables
rvest’s html_table() function converts HTML tables directly into R data frames. this is the fastest way to extract tabular data from pages like Wikipedia or financial sites.
library(rvest)
page <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_population")
# extract all tables on the page
tables <- page |> html_table() # the fill argument is deprecated in rvest 1.x; cells are filled automatically
# get the first table
population_df <- tables[[1]]
head(population_df)
# extract a specific table by ID
specific_table <- page |>
  html_element("#mw-content-text table.wikitable") |>
  html_table()
handling pagination and multiple pages
library(rvest)
library(purrr)
scrape_page <- function(page_num) {
  url <- paste0("https://example.com/listings?page=", page_num)
  Sys.sleep(runif(1, 2, 4)) # polite delay
  page <- read_html(url)
  page |>
    html_elements(".item-title") |>
    html_text2()
}
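# a failed request on one page would currently abort the whole run;
# purrr::possibly() returns a fallback value instead (a sketch -- swap
# safe_scrape into the map() call below to use it)
safe_scrape <- possibly(scrape_page, otherwise = character(0))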
all_items <- map(1:10, scrape_page) |> unlist()
cat("total items:", length(all_items))session management and cookies
rvest’s session() function maintains cookies and session state across requests, which is useful for authenticated scraping or for keeping a consistent set of cookies and headers across a crawl.
library(rvest)
library(purrr) # for pluck()
s <- session("https://site.com/login")
form <- s |> html_form() |> pluck(1)
filled_form <- html_form_set(form,
  username = "your_username",
  password = "your_password"
)
s <- session_submit(s, filled_form)
# now access authenticated pages
dashboard <- session_jump_to(s, "https://site.com/dashboard")
data <- dashboard |> html_element(".user-data") |> html_text2()
JavaScript rendering with RSelenium
RSelenium starts a Selenium WebDriver server (Chrome or Firefox) and controls it from R. install Java first (the bundled Selenium server needs it), then use the rsDriver() function to download and manage ChromeDriver automatically.
library(RSelenium)
library(rvest)
# start Chrome driver
rD <- rsDriver(browser = "chrome",
               chromever = "latest",
               verbose = FALSE,
               # rsDriver() has no headless argument; pass it via Chrome options
               extraCapabilities = list(
                 "goog:chromeOptions" = list(args = list("--headless"))
               ))
remDr <- rD[["client"]]
remDr$navigate("https://spa-site.com/products")
Sys.sleep(3) # wait for JS
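# a fixed sleep is fragile; an alternative is to poll until the rendered
# content appears (a sketch -- assumes the same .product-card selector used below)
waited <- 0
while (length(remDr$findElements(using = "css selector", ".product-card")) == 0 &&
       waited < 15) {
  Sys.sleep(0.5)
  waited <- waited + 0.5
}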
# get rendered page source and parse with rvest
page_source <- remDr$getPageSource()[[1]]
doc <- read_html(page_source)
products <- doc |>
  html_elements(".product-card .name") |>
  html_text2()
remDr$close()
rD$server$stop()
for proxy and scraping fundamentals applicable to R workflows, see what is a proxy server, SOCKS5 vs HTTP proxy, and what is web scraping.
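one way to route requests through a proxy before parsing with rvest is httr2's req_proxy(). a minimal sketch, assuming a plain HTTP proxy (the proxy host, port, and selector below are placeholders):
library(httr2)
library(rvest)
resp <- request("https://example.com/listings") |>
  req_proxy("http://proxy.example.com", port = 8080) |> # placeholder proxy
  req_perform()
page <- resp_body_html(resp)
titles <- page |> html_elements(".item-title") |> html_text2()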
last updated: April 1, 2026