C# Web Scraping Guide: HtmlAgilityPack and Beyond

C# is a strong choice for web scraping when you are building data pipelines that integrate with .NET applications, need high performance, or work in an enterprise Windows environment. HtmlAgilityPack is the most popular HTML parsing library in the .NET ecosystem, and paired with HttpClient, it gives you a fast, type-safe scraping toolkit.

This guide covers HtmlAgilityPack, AngleSharp, Playwright for .NET, proxy integration, and patterns for building production-grade scrapers.

Why C# for Web Scraping

C# offers several advantages for scraping:

  • .NET integration: scrape data directly into your .NET applications, databases, and APIs
  • async/await: built-in async support makes concurrent scraping clean and efficient
  • strong typing: catch errors at compile time, not when your scraper is running at 3 AM
  • performance: the .NET runtime is significantly faster than Python for parsing and processing
  • enterprise ecosystem: if your organization runs on .NET, keeping scraping in C# reduces complexity

The main drawback is a smaller scraping community compared to Python: fewer tutorials, fewer scraping-specific libraries, and fewer Stack Overflow answers when you get stuck.

Library Overview

Library              Type                Best For
HtmlAgilityPack      HTML parser         static pages, XPath-based extraction
AngleSharp           HTML/CSS parser     CSS selector-based extraction, modern API
Playwright for .NET  browser automation  JavaScript-rendered pages
Selenium WebDriver   browser automation  legacy browser automation
HttpClient           HTTP client         making requests (built into .NET)

Getting Started with HtmlAgilityPack

Installation

dotnet add package HtmlAgilityPack

or via the NuGet Package Manager Console:

Install-Package HtmlAgilityPack

Basic Scraping

using HtmlAgilityPack;
using System;
using System.Net.Http;
using System.Threading.Tasks;

class BasicScraper
{
    static async Task Main()
    {
        var httpClient = new HttpClient();
        httpClient.DefaultRequestHeaders.Add("User-Agent",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");

        string html = await httpClient.GetStringAsync("https://quotes.toscrape.com/");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // select elements using XPath
        var quotes = doc.DocumentNode.SelectNodes("//div[@class='quote']");

        if (quotes != null)
        {
            foreach (var quote in quotes)
            {
                string text = quote.SelectSingleNode(".//span[@class='text']")?.InnerText ?? "";
                string author = quote.SelectSingleNode(".//small[@class='author']")?.InnerText ?? "";

                Console.WriteLine($"author: {author}");
                Console.WriteLine($"quote: {text}");
                Console.WriteLine();
            }
        }
    }
}
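
HtmlAgilityPack also ships its own loader, HtmlWeb, which fetches and parses a URL in one call. It offers less control over headers and proxies than HttpClient, so the approach above is usually better for production work, but for quick scripts it is a convenient sketch:

```csharp
using HtmlAgilityPack;
using System;

class HtmlWebScraper
{
    static void Main()
    {
        // HtmlWeb wraps the fetch and parse steps in a single call
        var web = new HtmlWeb
        {
            UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        };

        var doc = web.Load("https://quotes.toscrape.com/");

        var firstQuote = doc.DocumentNode.SelectSingleNode("//span[@class='text']");
        Console.WriteLine(firstQuote?.InnerText ?? "no quote found");
    }
}
```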

XPath Cheat Sheet for HtmlAgilityPack

var doc = new HtmlDocument();
doc.LoadHtml(html);
var root = doc.DocumentNode;

// basic selection
var allDivs = root.SelectNodes("//div");
var byId = root.SelectSingleNode("//div[@id='main']");
var byClass = root.SelectNodes("//div[@class='product']");

// partial attribute matching
var containsClass = root.SelectNodes("//div[contains(@class, 'product')]");
var startsWith = root.SelectNodes("//a[starts-with(@href, '/product')]");

// text content
var byText = root.SelectNodes("//button[contains(text(), 'Buy')]");

// relationships
var children = root.SelectNodes("//div[@class='parent']/div");
var descendants = root.SelectNodes("//div[@class='parent']//span");
var following = root.SelectNodes("//h2/following-sibling::p");

// multiple conditions
var combo = root.SelectNodes("//div[@class='product' and @data-available='true']");

// get attribute values
var links = root.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (var link in links)
    {
        string href = link.GetAttributeValue("href", "");
        string text = link.InnerText.Trim();
        Console.WriteLine($"{text}: {href}");
    }
}

Proxy Integration

HttpClient with Proxy

using System.Net;
using System.Net.Http;

class ProxyScraper
{
    static HttpClient CreateClientWithProxy(string proxyUrl, string username = null, string password = null)
    {
        var proxy = new WebProxy(proxyUrl);

        if (username != null && password != null)
        {
            proxy.Credentials = new NetworkCredential(username, password);
        }

        var handler = new HttpClientHandler
        {
            Proxy = proxy,
            UseProxy = true,
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        };

        var client = new HttpClient(handler);
        client.DefaultRequestHeaders.Add("User-Agent",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
        client.Timeout = TimeSpan.FromSeconds(30);

        return client;
    }

    static async Task Main()
    {
        using var client = CreateClientWithProxy(
            "http://proxy.example.com:8080",
            "username",
            "password"
        );

        string html = await client.GetStringAsync("https://httpbin.org/ip");
        Console.WriteLine(html);
    }
}

Rotating Proxies

using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using HtmlAgilityPack;

class RotatingProxyScraper
{
    private readonly List<ProxyConfig> _proxies;
    private int _currentIndex;
    private readonly object _lock = new object();

    public RotatingProxyScraper(List<ProxyConfig> proxies)
    {
        _proxies = proxies;
        _currentIndex = 0;
    }

    private ProxyConfig GetNextProxy()
    {
        lock (_lock)
        {
            var proxy = _proxies[_currentIndex % _proxies.Count];
            _currentIndex++;
            return proxy;
        }
    }

    public async Task<HtmlDocument> FetchAsync(string url, int maxRetries = 3)
    {
        for (int attempt = 0; attempt < maxRetries; attempt++)
        {
            var proxyConfig = GetNextProxy();

            try
            {
                var proxy = new WebProxy(proxyConfig.Host, proxyConfig.Port);
                if (proxyConfig.Username != null)
                {
                    proxy.Credentials = new NetworkCredential(
                        proxyConfig.Username, proxyConfig.Password);
                }

                var handler = new HttpClientHandler { Proxy = proxy, UseProxy = true };
                using var client = new HttpClient(handler);
                client.DefaultRequestHeaders.Add("User-Agent",
                    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
                client.Timeout = TimeSpan.FromSeconds(30);

                string html = await client.GetStringAsync(url);

                var doc = new HtmlDocument();
                doc.LoadHtml(html);
                return doc;
            }
            catch (Exception ex)
            {
                Console.WriteLine($"attempt {attempt + 1} failed ({proxyConfig.Host}): {ex.Message}");
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
            }
        }

        // all retries exhausted
        return null;
    }
}

class ProxyConfig
{
    public string Host { get; set; }
    public int Port { get; set; }
    public string Username { get; set; }
    public string Password { get; set; }
}
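
The rotating scraper above can be wired up like this; the proxy hosts are placeholders for your own provider's endpoints:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        // placeholder proxy endpoints: substitute your provider's hosts
        var proxies = new List<ProxyConfig>
        {
            new ProxyConfig { Host = "proxy1.example.com", Port = 8080, Username = "user", Password = "pass" },
            new ProxyConfig { Host = "proxy2.example.com", Port = 8080, Username = "user", Password = "pass" }
        };

        var scraper = new RotatingProxyScraper(proxies);
        var doc = await scraper.FetchAsync("https://quotes.toscrape.com/");

        if (doc == null)
        {
            Console.WriteLine("all retries exhausted");
            return;
        }

        Console.WriteLine(doc.DocumentNode.SelectSingleNode("//title")?.InnerText);
    }
}
```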

AngleSharp: the Modern Alternative

AngleSharp provides CSS selector support (which HtmlAgilityPack lacks natively) and a more modern API:

dotnet add package AngleSharp

using AngleSharp;
using AngleSharp.Html.Dom;
using System;
using System.Linq;
using System.Threading.Tasks;

class AngleSharpScraper
{
    static async Task Main()
    {
        var config = Configuration.Default.WithDefaultLoader();
        var context = BrowsingContext.New(config);

        // load page directly (AngleSharp handles HTTP)
        var document = await context.OpenAsync("https://quotes.toscrape.com/");

        // use CSS selectors (much cleaner than XPath)
        var quotes = document.QuerySelectorAll("div.quote");

        foreach (var quote in quotes)
        {
            string text = quote.QuerySelector("span.text")?.TextContent ?? "";
            string author = quote.QuerySelector("small.author")?.TextContent ?? "";
            var tags = quote.QuerySelectorAll("div.tags a.tag")
                           .Select(t => t.TextContent);

            Console.WriteLine($"author: {author}");
            Console.WriteLine($"quote: {text}");
            Console.WriteLine($"tags: {string.Join(", ", tags)}");
            Console.WriteLine();
        }
    }
}

AngleSharp with Proxy

// HttpClientRequester comes from the separate AngleSharp.Io package:
// dotnet add package AngleSharp.Io
using AngleSharp;
using AngleSharp.Io;
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class AngleSharpProxyScraper
{
    static async Task Main()
    {
        var proxy = new WebProxy("http://proxy.example.com:8080")
        {
            Credentials = new NetworkCredential("user", "pass")
        };

        var handler = new HttpClientHandler { Proxy = proxy, UseProxy = true };
        var httpClient = new HttpClient(handler);

        var requester = new HttpClientRequester(httpClient);
        var config = Configuration.Default.With(requester).WithDefaultLoader();
        var context = BrowsingContext.New(config);

        var document = await context.OpenAsync("https://example.com");

        var products = document.QuerySelectorAll(".product-card");
        foreach (var product in products)
        {
            Console.WriteLine(product.QuerySelector("h2")?.TextContent);
        }
    }
}

Async Concurrent Scraping

C#’s async/await model makes concurrent scraping clean and efficient:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using HtmlAgilityPack;

class ConcurrentScraper
{
    private readonly HttpClient _client;
    private readonly SemaphoreSlim _semaphore;
    private readonly int _delayMs;

    public ConcurrentScraper(HttpClient client, int maxConcurrency = 5, int delayMs = 1000)
    {
        _client = client;
        _semaphore = new SemaphoreSlim(maxConcurrency);
        _delayMs = delayMs;
    }

    public async Task<List<ScrapedPage>> ScrapeAllAsync(List<string> urls)
    {
        var tasks = urls.Select(url => ScrapeWithThrottleAsync(url));
        var results = await Task.WhenAll(tasks);
        return results.ToList();
    }

    private async Task<ScrapedPage> ScrapeWithThrottleAsync(string url)
    {
        await _semaphore.WaitAsync();
        try
        {
            var html = await _client.GetStringAsync(url);
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var title = doc.DocumentNode.SelectSingleNode("//h1")?.InnerText?.Trim() ?? "";
            var price = doc.DocumentNode.SelectSingleNode("//*[contains(@class, 'price')]")
                           ?.InnerText?.Trim() ?? "";

            await Task.Delay(_delayMs);

            return new ScrapedPage
            {
                Url = url,
                Title = title,
                Price = price,
                Success = true
            };
        }
        catch (Exception ex)
        {
            return new ScrapedPage
            {
                Url = url,
                Error = ex.Message,
                Success = false
            };
        }
        finally
        {
            _semaphore.Release();
        }
    }
}

class ScrapedPage
{
    public string Url { get; set; }
    public string Title { get; set; }
    public string Price { get; set; }
    public string Error { get; set; }
    public bool Success { get; set; }
}
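
A minimal driver for the ConcurrentScraper above might look like this; the product URLs are placeholders:

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        using var client = new HttpClient();
        client.DefaultRequestHeaders.Add("User-Agent",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");

        // placeholder URLs: swap in the pages you actually need
        var urls = Enumerable.Range(1, 10)
            .Select(i => $"https://example.com/product/{i}")
            .ToList();

        var scraper = new ConcurrentScraper(client, maxConcurrency: 5, delayMs: 1000);
        var pages = await scraper.ScrapeAllAsync(urls);

        Console.WriteLine($"scraped {pages.Count(p => p.Success)} of {pages.Count} pages");
    }
}
```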

Playwright for .NET: JavaScript-Rendered Pages

When pages require JavaScript rendering, use Playwright:

dotnet add package Microsoft.Playwright

using Microsoft.Playwright;
using System;
using System.Threading.Tasks;

class PlaywrightScraper
{
    static async Task Main()
    {
        // install browsers (run once)
        // dotnet tool install --global Microsoft.Playwright.CLI
        // playwright install chromium

        using var playwright = await Playwright.CreateAsync();

        var launchOptions = new BrowserTypeLaunchOptions
        {
            Headless = true,
            Proxy = new Proxy
            {
                Server = "http://proxy.example.com:8080",
                Username = "user",
                Password = "pass"
            }
        };

        await using var browser = await playwright.Chromium.LaunchAsync(launchOptions);
        var page = await browser.NewPageAsync(new BrowserNewPageOptions
        {
            UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        });

        await page.GotoAsync("https://example.com/dynamic-page",
            new PageGotoOptions { WaitUntil = WaitUntilState.NetworkIdle });

        // wait for specific element
        await page.WaitForSelectorAsync("div.results");

        // extract data using Playwright's built-in selectors
        var items = await page.QuerySelectorAllAsync("div.result-item");

        foreach (var item in items)
        {
            string title = await item.InnerTextAsync();
            Console.WriteLine(title);
        }

        // or get the HTML and parse with HtmlAgilityPack
        string html = await page.ContentAsync();
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        // use XPath on the rendered DOM
        var products = doc.DocumentNode.SelectNodes("//div[@class='product']");
    }
}
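
A common optimization with browser-based scraping is to abort requests for resources you do not need. Playwright for .NET supports this through RouteAsync; a sketch that blocks images, fonts, and stylesheets to cut bandwidth and speed up page loads (the target URL is a placeholder):

```csharp
using Microsoft.Playwright;
using System;
using System.Threading.Tasks;

class ResourceBlockingScraper
{
    static async Task Main()
    {
        using var playwright = await Playwright.CreateAsync();
        await using var browser = await playwright.Chromium.LaunchAsync(
            new BrowserTypeLaunchOptions { Headless = true });
        var page = await browser.NewPageAsync();

        // abort requests for heavy resources the scraper does not need
        await page.RouteAsync("**/*", async route =>
        {
            var type = route.Request.ResourceType;
            if (type == "image" || type == "font" || type == "stylesheet")
                await route.AbortAsync();
            else
                await route.ContinueAsync();
        });

        await page.GotoAsync("https://example.com/dynamic-page");
        Console.WriteLine(await page.TitleAsync());
    }
}
```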

Building a Production Scraper

Here is a complete, production-ready scraping application:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;
using HtmlAgilityPack;

class ProductionScraper
{
    private readonly List<ProxyConfig> _proxies;
    private readonly Random _random = new Random();
    private readonly string[] _userAgents = {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0.0.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0"
    };

    public ProductionScraper(List<ProxyConfig> proxies)
    {
        _proxies = proxies;
    }

    private HttpClient CreateClient()
    {
        var proxyConfig = _proxies[_random.Next(_proxies.Count)];
        var proxy = new WebProxy(proxyConfig.Host, proxyConfig.Port);

        if (proxyConfig.Username != null)
            proxy.Credentials = new NetworkCredential(proxyConfig.Username, proxyConfig.Password);

        var handler = new HttpClientHandler
        {
            Proxy = proxy,
            UseProxy = true,
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        };

        var client = new HttpClient(handler);
        client.DefaultRequestHeaders.Add("User-Agent", _userAgents[_random.Next(_userAgents.Length)]);
        client.DefaultRequestHeaders.Add("Accept", "text/html,application/xhtml+xml");
        client.DefaultRequestHeaders.Add("Accept-Language", "en-US,en;q=0.9");
        client.Timeout = TimeSpan.FromSeconds(30);

        return client;
    }

    public async Task<HtmlDocument> FetchWithRetryAsync(string url, int maxRetries = 3)
    {
        Exception lastException = null;

        for (int i = 0; i < maxRetries; i++)
        {
            try
            {
                using var client = CreateClient();
                var response = await client.GetAsync(url);
                response.EnsureSuccessStatusCode();

                string html = await response.Content.ReadAsStringAsync();

                // check for block pages
                if (html.Contains("Access Denied") || html.Contains("captcha"))
                {
                    throw new Exception("blocked by target site");
                }

                var doc = new HtmlDocument();
                doc.LoadHtml(html);
                return doc;
            }
            catch (Exception ex)
            {
                lastException = ex;
                Console.WriteLine($"  retry {i + 1}/{maxRetries}: {ex.Message}");
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, i)));
            }
        }

        throw lastException;
    }

    public async Task<List<Dictionary<string, string>>> ScrapeProductsAsync(
        List<string> urls, int concurrency = 3, int delayMs = 2000)
    {
        var results = new List<Dictionary<string, string>>();
        var semaphore = new System.Threading.SemaphoreSlim(concurrency);

        var tasks = urls.Select(async (url, index) =>
        {
            await semaphore.WaitAsync();
            try
            {
                Console.WriteLine($"[{index + 1}/{urls.Count}] {url}");
                var doc = await FetchWithRetryAsync(url);

                var product = new Dictionary<string, string>
                {
                    ["url"] = url,
                    ["title"] = GetText(doc, "//h1"),
                    ["price"] = GetText(doc, "//*[contains(@class, 'price')]"),
                    ["description"] = GetText(doc, "//*[contains(@class, 'description')]"),
                    ["rating"] = GetText(doc, "//*[contains(@class, 'rating')]")
                };

                lock (results) { results.Add(product); }

                await Task.Delay(delayMs + _random.Next(1000));
            }
            catch (Exception ex)
            {
                Console.WriteLine($"  failed: {ex.Message}");
                lock (results)
                {
                    results.Add(new Dictionary<string, string>
                    {
                        ["url"] = url,
                        ["error"] = ex.Message
                    });
                }
            }
            finally
            {
                semaphore.Release();
            }
        });

        await Task.WhenAll(tasks);
        return results;
    }

    private string GetText(HtmlDocument doc, string xpath)
    {
        return doc.DocumentNode.SelectSingleNode(xpath)?.InnerText?.Trim() ?? "";
    }

    public async Task SaveResultsAsync(List<Dictionary<string, string>> results, string filename)
    {
        var options = new JsonSerializerOptions { WriteIndented = true };
        string json = JsonSerializer.Serialize(results, options);
        await File.WriteAllTextAsync(filename, json);
        Console.WriteLine($"saved {results.Count} results to {filename}");
    }

    static async Task Main()
    {
        var proxies = new List<ProxyConfig>
        {
            new() { Host = "proxy1.example.com", Port = 8080, Username = "user", Password = "pass" },
            new() { Host = "proxy2.example.com", Port = 8080, Username = "user", Password = "pass" }
        };

        var scraper = new ProductionScraper(proxies);

        var urls = Enumerable.Range(1, 20)
            .Select(i => $"https://example.com/product/{i}")
            .ToList();

        var results = await scraper.ScrapeProductsAsync(urls, concurrency: 3, delayMs: 2000);
        await scraper.SaveResultsAsync(results, "products.json");
    }
}

Handling Common Challenges

Following Pagination

public async Task<List<string>> GetAllPageUrlsAsync(string baseUrl, int maxPages = 50)
{
    var allProductUrls = new List<string>();
    string currentUrl = baseUrl;
    int pageCount = 0;

    while (currentUrl != null && pageCount < maxPages)
    {
        pageCount++;
        var doc = await FetchWithRetryAsync(currentUrl);

        // extract product URLs from this page
        var productLinks = doc.DocumentNode.SelectNodes("//a[contains(@class, 'product')]");
        if (productLinks != null)
        {
            foreach (var link in productLinks)
            {
                string href = link.GetAttributeValue("href", "");
                if (!string.IsNullOrEmpty(href))
                {
                    var absoluteUrl = new Uri(new Uri(currentUrl), href).ToString();
                    allProductUrls.Add(absoluteUrl);
                }
            }
        }

        // find next page
        var nextLink = doc.DocumentNode.SelectSingleNode(
            "//a[contains(@class, 'next')] | //a[@rel='next'] | //li[contains(@class, 'next')]/a");

        currentUrl = nextLink?.GetAttributeValue("href", null);
        if (currentUrl != null && !currentUrl.StartsWith("http"))
        {
            currentUrl = new Uri(new Uri(baseUrl), currentUrl).ToString();
        }

        await Task.Delay(2000);
    }

    return allProductUrls;
}

Decoding HTML Entities

using System.Net;

string raw = "Price: &#36;29.99 &mdash; Sale!";
string decoded = WebUtility.HtmlDecode(raw);
// result: "Price: $29.99 — Sale!"
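
HtmlAgilityPack also bundles its own decoder, HtmlEntity.DeEntitize, which is convenient when you are already working with HAP nodes:

```csharp
using HtmlAgilityPack;
using System;

class EntityDemo
{
    static void Main()
    {
        // DeEntitize resolves both named and numeric HTML entities
        string raw = "Ben &amp; Jerry&#39;s";
        Console.WriteLine(HtmlEntity.DeEntitize(raw)); // Ben & Jerry's
    }
}
```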

C# vs Python for Web Scraping

Aspect                  C#                                           Python
setup complexity        higher (project file, NuGet)                 lower (pip install)
parsing speed           faster                                       slower
async support           excellent (built-in)                         good (asyncio)
selector style          XPath (HtmlAgilityPack) or CSS (AngleSharp)  CSS (BeautifulSoup) or both
enterprise integration  excellent (.NET ecosystem)                   good
community resources     smaller for scraping                         much larger
type safety             compile-time                                 runtime
development speed       slower                                       faster

Conclusion

C# web scraping with HtmlAgilityPack is a solid choice for .NET teams and enterprise environments. The combination of strong typing, excellent async support, and high performance makes it particularly well suited for production scrapers that need to run reliably at scale.

If you prefer CSS selectors over XPath, AngleSharp is the better library choice. For JavaScript-rendered pages, Playwright for .NET gives you full browser automation with the same API quality you would expect from the Playwright project.

Pair any of these libraries with rotating residential proxies through .NET's HttpClient, and you have a complete scraping infrastructure that integrates seamlessly with the rest of your .NET stack.
