C# Web Scraping Guide: HtmlAgilityPack and Beyond
C# is a strong choice for web scraping when you are building data pipelines that integrate with .NET applications, need high performance, or work in an enterprise Windows environment. HtmlAgilityPack is the most popular HTML parsing library in the .NET ecosystem, and paired with HttpClient, it gives you a fast, type-safe scraping toolkit.
This guide covers HtmlAgilityPack, AngleSharp, Playwright for .NET, proxy integration, and patterns for building production-grade scrapers.
Why C# for Web Scraping
C# offers several advantages for scraping:
- .NET integration: scrape data directly into your .NET applications, databases, and APIs
- async/await: built-in async support makes concurrent scraping clean and efficient
- strong typing: catch errors at compile time, not when your scraper is running at 3 AM
- performance: compiled .NET code is typically much faster than CPython for CPU-bound parsing and processing
- enterprise ecosystem: if your organization runs on .NET, keeping scraping in C# reduces complexity
The main drawback is a smaller scraping community than Python's: fewer tutorials, fewer scraping-specific libraries, and fewer Stack Overflow answers when you get stuck.
Library Overview
| Library | Type | Best For |
|---|---|---|
| HtmlAgilityPack | HTML parser | static pages, XPath-based extraction |
| AngleSharp | HTML/CSS parser | CSS selector-based extraction, modern API |
| Playwright for .NET | browser automation | JavaScript-rendered pages |
| Selenium WebDriver | browser automation | legacy browser automation |
| HttpClient | HTTP client | making requests (built into .NET) |
Getting Started with HtmlAgilityPack
Installation
dotnet add package HtmlAgilityPack
or via NuGet Package Manager:
Install-Package HtmlAgilityPack
Basic Scraping
using HtmlAgilityPack;
using System;
using System.Net.Http;
using System.Threading.Tasks;

class BasicScraper
{
    static async Task Main()
    {
        var httpClient = new HttpClient();
        httpClient.DefaultRequestHeaders.Add("User-Agent",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");

        string html = await httpClient.GetStringAsync("https://quotes.toscrape.com/");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // select elements using XPath
        var quotes = doc.DocumentNode.SelectNodes("//div[@class='quote']");
        if (quotes != null)
        {
            foreach (var quote in quotes)
            {
                string text = quote.SelectSingleNode(".//span[@class='text']")?.InnerText ?? "";
                string author = quote.SelectSingleNode(".//small[@class='author']")?.InnerText ?? "";
                Console.WriteLine($"author: {author}");
                Console.WriteLine($"quote: {text}");
                Console.WriteLine();
            }
        }
    }
}
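If you do not need custom headers or proxies, HtmlAgilityPack can also fetch pages itself through its HtmlWeb helper. A minimal sketch, using the same target site as above:

using HtmlAgilityPack;
using System;
using System.Threading.Tasks;

class HtmlWebExample
{
    static async Task Main()
    {
        // HtmlWeb wraps the HTTP request and parsing in one step
        var web = new HtmlWeb
        {
            UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        };
        var doc = await web.LoadFromWebAsync("https://quotes.toscrape.com/");
        Console.WriteLine(doc.DocumentNode.SelectSingleNode("//title")?.InnerText);
    }
}

For anything beyond simple fetches, the HttpClient approach above gives you far more control, so the rest of this guide sticks with it.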
XPath Cheat Sheet for HtmlAgilityPack
var doc = new HtmlDocument();
doc.LoadHtml(html);
var root = doc.DocumentNode;
// basic selection
var allDivs = root.SelectNodes("//div");
var byId = root.SelectSingleNode("//div[@id='main']");
var byClass = root.SelectNodes("//div[@class='product']");
// partial attribute matching
var containsClass = root.SelectNodes("//div[contains(@class, 'product')]");
var startsWith = root.SelectNodes("//a[starts-with(@href, '/product')]");
// text content
var byText = root.SelectNodes("//button[contains(text(), 'Buy')]");
// relationships
var children = root.SelectNodes("//div[@class='parent']/div");
var descendants = root.SelectNodes("//div[@class='parent']//span");
var following = root.SelectNodes("//h2/following-sibling::p");
// multiple conditions
var combo = root.SelectNodes("//div[@class='product' and @data-available='true']");
// get attribute values
var links = root.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (var link in links)
    {
        string href = link.GetAttributeValue("href", "");
        string text = link.InnerText.Trim();
        Console.WriteLine($"{text}: {href}");
    }
}
Proxy Integration
HttpClient with Proxy
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class ProxyScraper
{
    static HttpClient CreateClientWithProxy(string proxyUrl, string username = null, string password = null)
    {
        var proxy = new WebProxy(proxyUrl);
        if (username != null && password != null)
        {
            proxy.Credentials = new NetworkCredential(username, password);
        }

        var handler = new HttpClientHandler
        {
            Proxy = proxy,
            UseProxy = true,
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        };

        var client = new HttpClient(handler);
        client.DefaultRequestHeaders.Add("User-Agent",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
        client.Timeout = TimeSpan.FromSeconds(30);
        return client;
    }

    static async Task Main()
    {
        using var client = CreateClientWithProxy(
            "http://proxy.example.com:8080",
            "username",
            "password"
        );
        string html = await client.GetStringAsync("https://httpbin.org/ip");
        Console.WriteLine(html);
    }
}
Rotating Proxies
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using HtmlAgilityPack;

class RotatingProxyScraper
{
    private readonly List<ProxyConfig> _proxies;
    private int _currentIndex;
    private readonly object _lock = new object();

    public RotatingProxyScraper(List<ProxyConfig> proxies)
    {
        _proxies = proxies;
        _currentIndex = 0;
    }

    private ProxyConfig GetNextProxy()
    {
        lock (_lock)
        {
            var proxy = _proxies[_currentIndex % _proxies.Count];
            _currentIndex++;
            return proxy;
        }
    }

    public async Task<HtmlDocument> FetchAsync(string url, int maxRetries = 3)
    {
        for (int attempt = 0; attempt < maxRetries; attempt++)
        {
            var proxyConfig = GetNextProxy();
            try
            {
                var proxy = new WebProxy(proxyConfig.Host, proxyConfig.Port);
                if (proxyConfig.Username != null)
                {
                    proxy.Credentials = new NetworkCredential(
                        proxyConfig.Username, proxyConfig.Password);
                }

                var handler = new HttpClientHandler { Proxy = proxy, UseProxy = true };
                using var client = new HttpClient(handler);
                client.DefaultRequestHeaders.Add("User-Agent",
                    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
                client.Timeout = TimeSpan.FromSeconds(30);

                string html = await client.GetStringAsync(url);
                var doc = new HtmlDocument();
                doc.LoadHtml(html);
                return doc;
            }
            catch (Exception ex)
            {
                Console.WriteLine($"attempt {attempt + 1} failed ({proxyConfig.Host}): {ex.Message}");
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
            }
        }
        return null;
    }
}

class ProxyConfig
{
    public string Host { get; set; }
    public int Port { get; set; }
    public string Username { get; set; }
    public string Password { get; set; }
}
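One caveat with the pattern above: it constructs a new HttpClient and handler for every request, which can exhaust sockets under sustained load. A common fix is to create one client per proxy and reuse it. A sketch under that assumption, reusing the ProxyConfig type defined above (the cache key format is an arbitrary choice):

using System;
using System.Collections.Concurrent;
using System.Net;
using System.Net.Http;

class ProxyClientCache
{
    // one long-lived HttpClient per proxy endpoint
    private readonly ConcurrentDictionary<string, HttpClient> _clients = new();

    public HttpClient GetClient(ProxyConfig config)
    {
        string key = $"{config.Host}:{config.Port}";
        return _clients.GetOrAdd(key, _ =>
        {
            var proxy = new WebProxy(config.Host, config.Port);
            if (config.Username != null)
                proxy.Credentials = new NetworkCredential(config.Username, config.Password);
            var handler = new HttpClientHandler { Proxy = proxy, UseProxy = true };
            return new HttpClient(handler) { Timeout = TimeSpan.FromSeconds(30) };
        });
    }
}

In an ASP.NET Core application, IHttpClientFactory with named clients solves the same problem more idiomatically.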
AngleSharp: The Modern Alternative
AngleSharp provides CSS selector support (which HtmlAgilityPack lacks natively) and a more modern API:
dotnet add package AngleSharp
using AngleSharp;
using AngleSharp.Html.Dom;
using System;
using System.Linq;
using System.Threading.Tasks;

class AngleSharpScraper
{
    static async Task Main()
    {
        var config = Configuration.Default.WithDefaultLoader();
        var context = BrowsingContext.New(config);

        // load page directly (AngleSharp handles HTTP)
        var document = await context.OpenAsync("https://quotes.toscrape.com/");

        // use CSS selectors (much cleaner than XPath)
        var quotes = document.QuerySelectorAll("div.quote");
        foreach (var quote in quotes)
        {
            string text = quote.QuerySelector("span.text")?.TextContent ?? "";
            string author = quote.QuerySelector("small.author")?.TextContent ?? "";
            var tags = quote.QuerySelectorAll("div.tags a.tag")
                .Select(t => t.TextContent);

            Console.WriteLine($"author: {author}");
            Console.WriteLine($"quote: {text}");
            Console.WriteLine($"tags: {string.Join(", ", tags)}");
            Console.WriteLine();
        }
    }
}
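If you already have the HTML as a string (for example, fetched through your own proxied HttpClient), AngleSharp can also parse it directly instead of loading the page itself. A minimal sketch with inline sample markup:

using AngleSharp.Html.Parser;
using System;

class ParseStringExample
{
    static void Main()
    {
        // parse an in-memory HTML string without any network access
        var parser = new HtmlParser();
        var document = parser.ParseDocument(
            "<div class='quote'><span class='text'>Hello</span></div>");
        Console.WriteLine(document.QuerySelector("span.text")?.TextContent);
    }
}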
AngleSharp with Proxy
// HttpClientRequester lives in the separate AngleSharp.Io package:
// dotnet add package AngleSharp.Io
using AngleSharp;
using AngleSharp.Io.Network;
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class AngleSharpProxyScraper
{
    static async Task Main()
    {
        var proxy = new WebProxy("http://proxy.example.com:8080")
        {
            Credentials = new NetworkCredential("user", "pass")
        };
        var handler = new HttpClientHandler { Proxy = proxy, UseProxy = true };
        var httpClient = new HttpClient(handler);

        // route AngleSharp's requests through the proxied HttpClient
        var requester = new HttpClientRequester(httpClient);
        var config = Configuration.Default.With(requester).WithDefaultLoader();
        var context = BrowsingContext.New(config);

        var document = await context.OpenAsync("https://example.com");
        var products = document.QuerySelectorAll(".product-card");
        foreach (var product in products)
        {
            Console.WriteLine(product.QuerySelector("h2")?.TextContent);
        }
    }
}
Async Concurrent Scraping
C#’s async/await model makes concurrent scraping clean and efficient:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using HtmlAgilityPack;

class ConcurrentScraper
{
    private readonly HttpClient _client;
    private readonly SemaphoreSlim _semaphore;
    private readonly int _delayMs;

    public ConcurrentScraper(HttpClient client, int maxConcurrency = 5, int delayMs = 1000)
    {
        _client = client;
        _semaphore = new SemaphoreSlim(maxConcurrency);
        _delayMs = delayMs;
    }

    public async Task<List<ScrapedPage>> ScrapeAllAsync(List<string> urls)
    {
        var tasks = urls.Select(url => ScrapeWithThrottleAsync(url));
        var results = await Task.WhenAll(tasks);
        return results.ToList();
    }

    private async Task<ScrapedPage> ScrapeWithThrottleAsync(string url)
    {
        await _semaphore.WaitAsync();
        try
        {
            var html = await _client.GetStringAsync(url);
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var title = doc.DocumentNode.SelectSingleNode("//h1")?.InnerText?.Trim() ?? "";
            var price = doc.DocumentNode.SelectSingleNode("//*[contains(@class, 'price')]")
                ?.InnerText?.Trim() ?? "";

            await Task.Delay(_delayMs);
            return new ScrapedPage
            {
                Url = url,
                Title = title,
                Price = price,
                Success = true
            };
        }
        catch (Exception ex)
        {
            return new ScrapedPage
            {
                Url = url,
                Error = ex.Message,
                Success = false
            };
        }
        finally
        {
            _semaphore.Release();
        }
    }
}

class ScrapedPage
{
    public string Url { get; set; }
    public string Title { get; set; }
    public string Price { get; set; }
    public string Error { get; set; }
    public bool Success { get; set; }
}
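Driving the class above might look like this (the URLs are placeholders):

static async Task Main()
{
    using var client = new HttpClient();
    client.DefaultRequestHeaders.Add("User-Agent",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");

    var scraper = new ConcurrentScraper(client, maxConcurrency: 5, delayMs: 1000);
    var urls = new List<string>
    {
        "https://example.com/product/1",
        "https://example.com/product/2"
    };

    var results = await scraper.ScrapeAllAsync(urls);
    foreach (var page in results.Where(p => p.Success))
        Console.WriteLine($"{page.Title}: {page.Price}");
}

Note that the semaphore caps how many requests are in flight at once, while Task.WhenAll still lets all URLs queue up immediately.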
Playwright for .NET: JavaScript-Rendered Pages
When pages require JavaScript rendering, use Playwright:
dotnet add package Microsoft.Playwright
using Microsoft.Playwright;
using System;
using System.Threading.Tasks;

class PlaywrightScraper
{
    static async Task Main()
    {
        // install browsers (run once)
        // dotnet tool install --global Microsoft.Playwright.CLI
        // playwright install chromium

        using var playwright = await Playwright.CreateAsync();
        var launchOptions = new BrowserTypeLaunchOptions
        {
            Headless = true,
            Proxy = new Proxy
            {
                Server = "http://proxy.example.com:8080",
                Username = "user",
                Password = "pass"
            }
        };

        await using var browser = await playwright.Chromium.LaunchAsync(launchOptions);
        var page = await browser.NewPageAsync(new BrowserNewPageOptions
        {
            UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        });

        await page.GotoAsync("https://example.com/dynamic-page",
            new PageGotoOptions { WaitUntil = WaitUntilState.NetworkIdle });

        // wait for specific element
        await page.WaitForSelectorAsync("div.results");

        // extract data using Playwright's built-in selectors
        var items = await page.QuerySelectorAllAsync("div.result-item");
        foreach (var item in items)
        {
            string title = await item.InnerTextAsync();
            Console.WriteLine(title);
        }

        // or get the HTML and parse with HtmlAgilityPack
        string html = await page.ContentAsync();
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        // use XPath on the rendered DOM
        var products = doc.DocumentNode.SelectNodes("//div[@class='product']");
    }
}
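Newer Playwright code tends to favor the locator API over QuerySelectorAllAsync. The same extraction with locators might look like this (reusing the page object from above, with the same placeholder selector):

// locators are resolved lazily, at the moment you interact with them
var rows = page.Locator("div.result-item");
int count = await rows.CountAsync();
for (int i = 0; i < count; i++)
{
    Console.WriteLine(await rows.Nth(i).InnerTextAsync());
}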
Building a Production Scraper
Here is a complete, production-ready scraping application:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;
using HtmlAgilityPack;

class ProductionScraper
{
    private readonly List<ProxyConfig> _proxies;
    private readonly string[] _userAgents = {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0.0.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0"
    };

    public ProductionScraper(List<ProxyConfig> proxies)
    {
        _proxies = proxies;
    }

    private HttpClient CreateClient()
    {
        // Random.Shared is thread-safe (.NET 6+); a shared Random instance is not
        var proxyConfig = _proxies[Random.Shared.Next(_proxies.Count)];
        var proxy = new WebProxy(proxyConfig.Host, proxyConfig.Port);
        if (proxyConfig.Username != null)
            proxy.Credentials = new NetworkCredential(proxyConfig.Username, proxyConfig.Password);

        var handler = new HttpClientHandler
        {
            Proxy = proxy,
            UseProxy = true,
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        };

        var client = new HttpClient(handler);
        client.DefaultRequestHeaders.Add("User-Agent", _userAgents[Random.Shared.Next(_userAgents.Length)]);
        client.DefaultRequestHeaders.Add("Accept", "text/html,application/xhtml+xml");
        client.DefaultRequestHeaders.Add("Accept-Language", "en-US,en;q=0.9");
        client.Timeout = TimeSpan.FromSeconds(30);
        return client;
    }

    public async Task<HtmlDocument> FetchWithRetryAsync(string url, int maxRetries = 3)
    {
        Exception lastException = null;
        for (int i = 0; i < maxRetries; i++)
        {
            try
            {
                using var client = CreateClient();
                var response = await client.GetAsync(url);
                response.EnsureSuccessStatusCode();
                string html = await response.Content.ReadAsStringAsync();

                // check for block pages
                if (html.Contains("Access Denied") || html.Contains("captcha"))
                {
                    throw new Exception("blocked by target site");
                }

                var doc = new HtmlDocument();
                doc.LoadHtml(html);
                return doc;
            }
            catch (Exception ex)
            {
                lastException = ex;
                Console.WriteLine($"  retry {i + 1}/{maxRetries}: {ex.Message}");
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, i)));
            }
        }
        throw lastException;
    }

    public async Task<List<Dictionary<string, string>>> ScrapeProductsAsync(
        List<string> urls, int concurrency = 3, int delayMs = 2000)
    {
        var results = new List<Dictionary<string, string>>();
        var semaphore = new SemaphoreSlim(concurrency);

        var tasks = urls.Select(async (url, index) =>
        {
            await semaphore.WaitAsync();
            try
            {
                Console.WriteLine($"[{index + 1}/{urls.Count}] {url}");
                var doc = await FetchWithRetryAsync(url);
                var product = new Dictionary<string, string>
                {
                    ["url"] = url,
                    ["title"] = GetText(doc, "//h1"),
                    ["price"] = GetText(doc, "//*[contains(@class, 'price')]"),
                    ["description"] = GetText(doc, "//*[contains(@class, 'description')]"),
                    ["rating"] = GetText(doc, "//*[contains(@class, 'rating')]")
                };
                lock (results) { results.Add(product); }
                await Task.Delay(delayMs + Random.Shared.Next(1000));
            }
            catch (Exception ex)
            {
                Console.WriteLine($"  failed: {ex.Message}");
                lock (results)
                {
                    results.Add(new Dictionary<string, string>
                    {
                        ["url"] = url,
                        ["error"] = ex.Message
                    });
                }
            }
            finally
            {
                semaphore.Release();
            }
        });

        await Task.WhenAll(tasks);
        return results;
    }

    private string GetText(HtmlDocument doc, string xpath)
    {
        return doc.DocumentNode.SelectSingleNode(xpath)?.InnerText?.Trim() ?? "";
    }

    public async Task SaveResultsAsync(List<Dictionary<string, string>> results, string filename)
    {
        var options = new JsonSerializerOptions { WriteIndented = true };
        string json = JsonSerializer.Serialize(results, options);
        await File.WriteAllTextAsync(filename, json);
        Console.WriteLine($"saved {results.Count} results to {filename}");
    }

    static async Task Main()
    {
        var proxies = new List<ProxyConfig>
        {
            new() { Host = "proxy1.example.com", Port = 8080, Username = "user", Password = "pass" },
            new() { Host = "proxy2.example.com", Port = 8080, Username = "user", Password = "pass" }
        };
        var scraper = new ProductionScraper(proxies);

        var urls = Enumerable.Range(1, 20)
            .Select(i => $"https://example.com/product/{i}")
            .ToList();

        var results = await scraper.ScrapeProductsAsync(urls, concurrency: 3, delayMs: 2000);
        await scraper.SaveResultsAsync(results, "products.json");
    }
}
Handling Common Challenges
Following Pagination
public async Task<List<string>> GetAllPageUrlsAsync(string baseUrl, int maxPages = 50)
{
    var allProductUrls = new List<string>();
    string currentUrl = baseUrl;
    int pageCount = 0;

    while (currentUrl != null && pageCount < maxPages)
    {
        pageCount++;
        var doc = await FetchWithRetryAsync(currentUrl);

        // extract product URLs from this page
        var productLinks = doc.DocumentNode.SelectNodes("//a[contains(@class, 'product')]");
        if (productLinks != null)
        {
            foreach (var link in productLinks)
            {
                string href = link.GetAttributeValue("href", "");
                if (!string.IsNullOrEmpty(href))
                {
                    var absoluteUrl = new Uri(new Uri(currentUrl), href).ToString();
                    allProductUrls.Add(absoluteUrl);
                }
            }
        }

        // find next page
        var nextLink = doc.DocumentNode.SelectSingleNode(
            "//a[contains(@class, 'next')] | //a[@rel='next'] | //li[contains(@class, 'next')]/a");
        currentUrl = nextLink?.GetAttributeValue("href", null);
        if (currentUrl != null && !currentUrl.StartsWith("http"))
        {
            currentUrl = new Uri(new Uri(baseUrl), currentUrl).ToString();
        }

        await Task.Delay(2000);
    }

    return allProductUrls;
}
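One caveat: "next" links occasionally point back at an earlier page, which turns this loop infinite until maxPages kicks in. A simple guard is to track visited URLs in the loop condition; a sketch that replaces the while line in the method above:

// HashSet<string>.Add returns false on a repeat, ending the loop
var visited = new HashSet<string>();
while (currentUrl != null && pageCount < maxPages && visited.Add(currentUrl))
{
    // ... same loop body as above ...
}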
Decoding HTML Entities
using System.Net;
string raw = "Price: &#36;29.99 &amp; free shipping!";
string decoded = WebUtility.HtmlDecode(raw);
// result: "Price: $29.99 & free shipping!"
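Note that HtmlAgilityPack's InnerText does not decode entities for you; the library ships its own helper, HtmlEntity.DeEntitize, which handles named and numeric entities. A small sketch with inline sample markup:

using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<h1>Fish &amp; Chips &#8211; &pound;9.99</h1>");

// InnerText returns the text with entities still encoded
string raw = doc.DocumentNode.SelectSingleNode("//h1").InnerText;

string clean = HtmlEntity.DeEntitize(raw);
// "Fish & Chips – £9.99"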
C# vs Python for Web Scraping
| Aspect | C# | Python |
|---|---|---|
| setup complexity | higher (project file, NuGet) | lower (pip install) |
| parsing speed | faster | slower |
| async support | excellent (built-in) | good (asyncio) |
| selector style | XPath (HtmlAgilityPack) or CSS (AngleSharp) | CSS (BeautifulSoup) or both |
| enterprise integration | excellent (.NET ecosystem) | good |
| community resources | smaller for scraping | much larger |
| type safety | compile-time | runtime |
| development speed | slower | faster |
Conclusion
C# web scraping with HtmlAgilityPack is a solid choice for .NET teams and enterprise environments. The combination of strong typing, excellent async support, and high performance makes it particularly well suited for production scrapers that need to run reliably at scale.
If you prefer CSS selectors over XPath, AngleSharp is the better library choice. For JavaScript-rendered pages, Playwright for .NET gives you full browser automation with the same API quality you would expect from the Playwright project.
Pair any of these libraries with rotating residential proxies through .NET’s HttpClient, and you have a complete scraping infrastructure that integrates seamlessly with the rest of your .NET stack.