C# Web Scraping: Complete Guide with HtmlAgilityPack and Proxies

TL;DR
C# can scrape the web using HtmlAgilityPack for HTML parsing and HttpClient for requests. add a rotating proxy and you have a production-grade .NET scraper in under 100 lines.

most scraping tutorials default to Python, but C# is a solid choice for teams already running .NET infrastructure. the ecosystem has mature HTTP and HTML parsing libraries, strong typing catches bugs early, and async/await makes concurrent scraping clean to write.

this guide covers the full stack: fetching pages with HttpClient, parsing HTML with HtmlAgilityPack, handling pagination, and routing requests through a proxy.

setting up the project

create a new .NET console project and install the required packages:

dotnet new console -n CSharpScraper
cd CSharpScraper
dotnet add package HtmlAgilityPack
dotnet add package HtmlAgilityPack.CssSelectors.NetCore

HtmlAgilityPack is the most widely used HTML parser in the .NET ecosystem. it supports XPath out of the box, and the CssSelectors.NetCore add-on package adds CSS selector support, which makes it easier to port logic straight from browser DevTools.
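
as a quick sanity check, you can parse an inline string and grab the same node with either style. the markup below is made up purely for illustration:

using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<div class='book'><h3><a title='Dune'>Dune</a></h3></div>");

// XPath (built in)
var byXPath = doc.DocumentNode.SelectSingleNode("//div[@class='book']//a");
Console.WriteLine(byXPath?.GetAttributeValue("title", ""));

// CSS selector (needs HtmlAgilityPack.CssSelectors.NetCore)
var byCss = doc.DocumentNode.QuerySelector("div.book a");
Console.WriteLine(byCss?.InnerText);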

basic http request with httpclient

using System.Net.Http;
using HtmlAgilityPack;

var handler = new HttpClientHandler
{
    Proxy = new System.Net.WebProxy("http://your-proxy:port", false),
    UseProxy = true
};

using var client = new HttpClient(handler);
client.DefaultRequestHeaders.Add("User-Agent",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");

var html = await client.GetStringAsync("https://books.toscrape.com");
var doc = new HtmlDocument();
doc.LoadHtml(html);

always create HttpClient as a singleton or use IHttpClientFactory in production. creating a new instance per request exhausts socket connections. learn more about routing requests in our guide to what is a proxy server.
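
in apps with a service container, IHttpClientFactory (dotnet add package Microsoft.Extensions.Http) handles handler pooling and lifetime for you. a minimal sketch, with a placeholder proxy address and client name:

using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();
services.AddHttpClient("scraper", c =>
{
    c.DefaultRequestHeaders.Add("User-Agent",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
})
.ConfigurePrimaryHttpMessageHandler(() => new HttpClientHandler
{
    Proxy = new System.Net.WebProxy("http://your-proxy:port", false),
    UseProxy = true
});

var factory = services.BuildServiceProvider().GetRequiredService<IHttpClientFactory>();
var scraperClient = factory.CreateClient("scraper"); // pooled handler, cheap to call per request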

parsing html with htmlagilitypack

// XPath selection
var titles = doc.DocumentNode.SelectNodes("//article[@class='product_pod']//h3/a");
foreach (var node in titles ?? Enumerable.Empty<HtmlNode>())
{
    Console.WriteLine(node.GetAttributeValue("title", ""));
}

// CSS selectors (with add-on package)
var prices = doc.DocumentNode.QuerySelectorAll(".price_color");
foreach (var price in prices)
{
    Console.WriteLine(price.InnerText.Trim());
}

XPath is verbose but powerful for deeply nested structures. CSS selectors are easier to read and match what you’d write in browser DevTools.
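
for comparison, here is the same price selection on the books.toscrape.com markup written both ways:

// XPath
var xpathPrices = doc.DocumentNode.SelectNodes("//article[@class='product_pod']//p[@class='price_color']");

// CSS
var cssPrices = doc.DocumentNode.QuerySelectorAll("article.product_pod p.price_color");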

handling pagination

int page = 1;
while (true)
{
    var url = $"https://books.toscrape.com/catalogue/page-{page}.html";
    var html = await client.GetStringAsync(url);
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    var books = doc.DocumentNode.QuerySelectorAll("article.product_pod");
    if (!books.Any()) break;

    foreach (var book in books)
    {
        var title = book.QuerySelector("h3 a").GetAttributeValue("title", "");
        var price = book.QuerySelector(".price_color").InnerText.Trim();
        Console.WriteLine($"{title} - {price}");
    }

    // check for next page
    var next = doc.DocumentNode.QuerySelector("li.next a");
    if (next == null) break;
    page++;
    await Task.Delay(1000); // polite delay
}

proxy rotation in c#

for large-scale scraping you need rotating proxies. see our overview of SOCKS5 vs HTTP proxy to choose the right protocol for your use case.

var proxies = new List<string>
{
    "http://user:pass@proxy1:port",
    "http://user:pass@proxy2:port",
    "http://user:pass@proxy3:port"
};

int i = -1;
HttpClient GetClient()
{
    // Interlocked keeps the round-robin counter safe when clients are created from concurrent tasks
    var proxy = proxies[Interlocked.Increment(ref i) % proxies.Count];
    var handler = new HttpClientHandler
    {
        Proxy = new System.Net.WebProxy(proxy, false),
        UseProxy = true
    };
    // a proxy is bound to its handler, so each rotation gets a fresh client;
    // dispose these promptly (or cache one client per proxy) to avoid socket exhaustion
    return new HttpClient(handler);
}

async concurrent scraping

var urls = Enumerable.Range(1, 50)
    .Select(p => $"https://books.toscrape.com/catalogue/page-{p}.html")
    .ToList();

var semaphore = new SemaphoreSlim(5); // max 5 concurrent
var tasks = urls.Select(async url =>
{
    await semaphore.WaitAsync();
    try
    {
        using var client = GetClient();
        var html = await client.GetStringAsync(url);
        return html;
    }
    finally { semaphore.Release(); }
});

var results = await Task.WhenAll(tasks);

SemaphoreSlim limits concurrency without blocking threads: at most five requests are in flight at once, so socket and memory usage stay bounded even when the URL list runs to thousands of pages.
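
results is just an array of raw HTML strings, so a short second pass turns the pages into structured objects. the Book record below is an illustration (not something either library defines) and also feeds the JSON example in the next section:

var books = new List<Book>();
foreach (var pageHtml in results)
{
    var pageDoc = new HtmlDocument();
    pageDoc.LoadHtml(pageHtml);

    foreach (var node in pageDoc.DocumentNode.QuerySelectorAll("article.product_pod"))
    {
        books.Add(new Book(
            node.QuerySelector("h3 a").GetAttributeValue("title", ""),
            node.QuerySelector(".price_color").InnerText.Trim()));
    }
}

// in a top-level Program.cs, type declarations go after the statements
record Book(string Title, string Price);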

serializing scraped data to json

using System.Text.Json;

// serialize the books list built in the previous section

var json = JsonSerializer.Serialize(books, new JsonSerializerOptions
{
    WriteIndented = true
});
File.WriteAllText("books.json", json);

common issues in c# scraping

  • SSL certificate errors: set ServerCertificateCustomValidationCallback = HttpClientHandler.DangerousAcceptAnyServerCertificateValidator during development only
  • encoding problems: GetStringAsync decodes using the response headers, so pages that declare a different charset can come out mangled; load the response as a stream and let HtmlDocument.DetectEncoding() pick up the declaration
  • JavaScript-rendered pages: use Playwright for .NET (Microsoft.Playwright) for dynamic content
  • rate limiting: add exponential backoff with Polly (dotnet add package Polly); see the sketch after this list
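
for the rate-limiting case, a minimal sketch with Polly's classic retry policy, reusing the client and url variables from earlier; the three attempts and 2/4/8 second delays are arbitrary choices:

using Polly;

// retry transient HTTP failures with exponential backoff: 2s, 4s, 8s
var retry = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

var body = await retry.ExecuteAsync(() => client.GetStringAsync(url));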

playwright for .net (js-rendered pages)

using Microsoft.Playwright;

// one-time setup after adding the package: install the browsers with the generated
// playwright.ps1 script or by calling Microsoft.Playwright.Program.Main(new[] { "install" })
var playwright = await Playwright.CreateAsync();
var browser = await playwright.Chromium.LaunchAsync(); // headless by default
var page = await browser.NewPageAsync();
await page.GotoAsync("https://example.com");
var content = await page.ContentAsync();
await browser.CloseAsync();

Playwright for .NET mirrors the Python API almost exactly. see what is web scraping for a comparison of headless browser approaches across languages.
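
if you keep going from the snippet above (before CloseAsync), you can wait for the JS-rendered nodes and hand the markup straight back to HtmlAgilityPack; the .product_pod selector is only a placeholder for whatever the target page renders:

// wait until the dynamic content exists, then parse the rendered HTML as usual
await page.WaitForSelectorAsync(".product_pod");
var rendered = await page.ContentAsync();

var renderedDoc = new HtmlDocument();
renderedDoc.LoadHtml(rendered);
Console.WriteLine(renderedDoc.DocumentNode.QuerySelectorAll(".product_pod").Count);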
