Web Scraping with PHP: Tutorial & Libraries

PHP is a practical choice for web scraping when your existing infrastructure runs on PHP — WordPress plugins, Laravel applications, or cron-based data collection scripts. While Python dominates the scraping landscape, PHP’s built-in cURL support, Goutte framework, and Symfony DomCrawler make it fully capable of handling scraping tasks from simple page parsing to multi-step crawling workflows.

This tutorial covers every major PHP scraping approach, from raw cURL to the Goutte framework, with working code examples.

Why PHP for Web Scraping

  • Existing infrastructure — If your site runs PHP, scraping integrates directly
  • cURL built-in — PHP ships with cURL, no additional HTTP library needed
  • DOMDocument built-in — Native HTML/XML parsing without dependencies
  • Shared hosting friendly — Runs on almost any web server
  • Composer ecosystem — Goutte, Guzzle, and Symfony components are mature

Setting Up

# Create project
mkdir php-scraper && cd php-scraper
composer init --no-interaction

# Install libraries
composer require symfony/dom-crawler
composer require symfony/css-selector
composer require guzzlehttp/guzzle
composer require fabpot/goutte

cURL: Basic HTTP Requests

PHP’s built-in cURL extension handles HTTP without any dependencies:

<?php

function fetchPage(string $url): string {
    $ch = curl_init();

    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        CURLOPT_HTTPHEADER     => [
            'Accept: text/html,application/xhtml+xml',
            'Accept-Language: en-US,en;q=0.9',
        ],
        CURLOPT_SSL_VERIFYPEER => true,
    ]);

    $html = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($html === false) {
        $error = curl_error($ch);
        curl_close($ch); // release the handle before throwing
        throw new Exception('cURL error: ' . $error);
    }

    curl_close($ch);

    if ($httpCode !== 200) {
        throw new Exception("HTTP {$httpCode} for {$url}");
    }

    return $html;
}

// Usage
$html = fetchPage('https://books.toscrape.com/');
echo "Page length: " . strlen($html) . " bytes\n";
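Transient network failures are routine when scraping, so it pays to retry with backoff instead of failing on the first timeout. The wrapper below is a sketch: it retries any fetch callable (such as the fetchPage function above) with exponential delays; the attempt count and base delay are illustrative defaults.

```php
<?php

// Retry a fetch callable with exponential backoff.
// $fetch is any callable that returns a string or throws on failure.
function fetchWithRetry(callable $fetch, int $maxAttempts = 3, int $baseDelayMs = 500): string {
    $attempt = 0;
    while (true) {
        try {
            return $fetch();
        } catch (Exception $e) {
            $attempt++;
            if ($attempt >= $maxAttempts) {
                throw $e; // give up after the final attempt
            }
            // Sleep 500ms, then 1000ms, then 2000ms, ...
            usleep($baseDelayMs * 1000 * (2 ** ($attempt - 1)));
        }
    }
}

// Usage with fetchPage() from above:
// $html = fetchWithRetry(fn() => fetchPage('https://books.toscrape.com/'));
```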

POST Requests

<?php

$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL            => 'https://example.com/api/search',
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => json_encode(['query' => 'web scraping', 'page' => 1]),
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => [
        'Content-Type: application/json',
        'Accept: application/json',
    ],
]);

$response = curl_exec($ch);
curl_close($ch);

$data = json_decode($response, true);
print_r($data);
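Not every endpoint accepts JSON; classic HTML form handlers expect an application/x-www-form-urlencoded body, which http_build_query produces from an array. A sketch (the URL and field names are placeholders):

```php
<?php

$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL            => 'https://example.com/login',
    CURLOPT_POST           => true,
    // http_build_query() encodes the array as key=value&key=value
    CURLOPT_POSTFIELDS     => http_build_query([
        'username' => 'user',
        'password' => 'pass',
    ]),
    CURLOPT_RETURNTRANSFER => true,
]);

$response = curl_exec($ch);
curl_close($ch);
```

Note that passing the array to CURLOPT_POSTFIELDS directly, without http_build_query, makes cURL send multipart/form-data instead, which some form handlers reject.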

DOMDocument: Built-in HTML Parsing

PHP’s DOMDocument and DOMXPath provide native HTML parsing:

<?php

$html = file_get_contents('https://books.toscrape.com/'); // requires allow_url_fopen; use fetchPage() otherwise

// Suppress HTML5 warnings
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Extract books using XPath
$books = $xpath->query("//article[contains(@class, 'product_pod')]");

foreach ($books as $book) {
    $titleNode = $xpath->query(".//h3/a", $book)->item(0);
    $priceNode = $xpath->query(".//*[contains(@class, 'price_color')]", $book)->item(0);

    $title = $titleNode->getAttribute('title');
    $price = $priceNode->textContent;

    echo "{$title}: {$price}\n";
}
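One DOMDocument gotcha: loadHTML() assumes ISO-8859-1 when the markup does not declare its encoding, which mangles accented characters on UTF-8 pages. Prepending an XML encoding hint is a dependency-free workaround (a sketch; the other common fix, mb_convert_encoding with HTML-ENTITIES, is deprecated since PHP 8.2):

```php
<?php

$html = '<p>Café au lait</p>'; // pretend this came from a UTF-8 page

libxml_use_internal_errors(true);
$dom = new DOMDocument();
// The fake XML prolog tells libxml the real encoding; it is not kept in the tree
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();

echo $dom->getElementsByTagName('p')->item(0)->textContent;
```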

DOMXPath Selectors

<?php

// All links
$links = $xpath->query("//a[@href]");

// Elements by class
$products = $xpath->query("//*[contains(@class, 'product')]");

// Elements by attribute
$dataItems = $xpath->query("//*[@data-id]");

// Text content
$prices = $xpath->query("//span[@class='price']/text()");

// Conditional
$cheapItems = $xpath->query("//div[@class='product'][.//span[@class='price'][number(substring(text(),2)) < 20]]");

Symfony DomCrawler

DomCrawler provides a cleaner API with CSS selector support:

<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = file_get_contents('https://books.toscrape.com/');
$crawler = new Crawler($html);

// CSS selectors
$books = $crawler->filter('article.product_pod')->each(function (Crawler $node) {
    return [
        'title'  => $node->filter('h3 a')->attr('title'),
        'price'  => $node->filter('.price_color')->text(),
        'rating' => str_replace('star-rating ', '', $node->filter('p.star-rating')->attr('class')),
    ];
});

foreach ($books as $book) {
    echo "{$book['title']} — {$book['price']} ({$book['rating']})\n";
}

echo "Total: " . count($books) . " books\n";

DomCrawler Methods

<?php

// Get text content
$title = $crawler->filter('h1')->text();
$title = $crawler->filter('h1')->text('Default value'); // With fallback

// Get attribute
$href = $crawler->filter('a.link')->attr('href');

// Get all matching elements
$allPrices = $crawler->filter('.price')->each(fn($node) => $node->text());

// Check existence
$exists = $crawler->filter('.element')->count() > 0;

// Get inner HTML
$html = $crawler->filter('.content')->html();

// Traverse
$crawler->filter('ul.menu li')->each(function (Crawler $node, $i) {
    echo "Item {$i}: " . $node->text() . "\n";
});

// Filter within results
$firstProduct = $crawler->filter('.product')->first();
$lastProduct = $crawler->filter('.product')->last();
$thirdProduct = $crawler->filter('.product')->eq(2);
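Scraped href attributes are often relative (page-2.html, ../index.html) and must be resolved against the page URL before they can be fetched. DomCrawler ships UriResolver::resolve() for exactly this; the dependency-free sketch below shows the same idea for the common cases (it does not cover every RFC 3986 corner):

```php
<?php

function absoluteUrl(string $href, string $base): string {
    // Already absolute?
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }

    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    if (str_starts_with($href, '//')) {   // protocol-relative
        return $parts['scheme'] . ':' . $href;
    }
    if (str_starts_with($href, '/')) {    // root-relative
        return $origin . $href;
    }

    // Path-relative: drop the filename from the base path, then collapse ../
    $dir  = preg_replace('#/[^/]*$#', '/', $parts['path'] ?? '/');
    $path = $dir . $href;
    while (preg_match('#/[^/]+/\.\./#', $path)) {
        $path = preg_replace('#/[^/]+/\.\./#', '/', $path, 1);
    }
    return $origin . $path;
}

echo absoluteUrl('../index.html', 'https://books.toscrape.com/catalogue/page-1.html');
// https://books.toscrape.com/index.html
```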

Goutte: Scraping Framework

Goutte combines Guzzle HTTP and DomCrawler into a web scraping client. Note that the Goutte project has since been archived: version 4 is a thin proxy for Symfony's HttpBrowser, so the code below keeps working, but new projects may prefer symfony/browser-kit and symfony/http-client directly.

<?php
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://books.toscrape.com/');

// Extract books
$books = $crawler->filter('article.product_pod')->each(function ($node) {
    return [
        'title' => $node->filter('h3 a')->attr('title'),
        'price' => $node->filter('.price_color')->text(),
    ];
});

// Follow links
$detailCrawler = $client->click($crawler->filter('article.product_pod h3 a')->first()->link());

$description = $detailCrawler->filter('#product_description + p')->text('No description');
echo "First book description: {$description}\n";

Form Submission with Goutte

<?php
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://example.com/login');

// Fill and submit form
$form = $crawler->selectButton('Login')->form([
    'username' => 'user',
    'password' => 'pass',
]);

$crawler = $client->submit($form);

// Now scrape authenticated content
$dashboard = $client->request('GET', 'https://example.com/dashboard');
$data = $dashboard->filter('.data-row')->each(fn($node) => $node->text());
print_r($data);

Guzzle: Advanced HTTP

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

$client = new Client([
    'timeout'  => 30,
    'headers'  => [
        'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Accept'     => 'text/html',
    ],
    'cookies'  => true,
]);

$response = $client->get('https://books.toscrape.com/');
$html = $response->getBody()->getContents();

$crawler = new Crawler($html);
$books = $crawler->filter('article.product_pod')->each(function ($node) {
    return [
        'title' => $node->filter('h3 a')->attr('title'),
        'price' => $node->filter('.price_color')->text(),
    ];
});

echo json_encode($books, JSON_PRETTY_PRINT) . "\n";

Handling Pagination

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

$client = new Client(['timeout' => 30]);
$allBooks = [];

for ($page = 1; $page <= 50; $page++) {
    $url = "https://books.toscrape.com/catalogue/page-{$page}.html";

    try {
        $response = $client->get($url);
        $crawler = new Crawler($response->getBody()->getContents());

        $books = $crawler->filter('article.product_pod')->each(function ($node) {
            return [
                'title' => $node->filter('h3 a')->attr('title'),
                'price' => $node->filter('.price_color')->text(),
            ];
        });

        if (empty($books)) break;

        $allBooks = array_merge($allBooks, $books);
        echo "Page {$page}: " . count($books) . " books\n";

        usleep(random_int(500000, 1500000)); // 0.5-1.5s delay

    } catch (\Exception $e) {
        echo "Error on page {$page}: " . $e->getMessage() . "\n";
        break;
    }
}

echo "Total: " . count($allBooks) . " books\n";
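The scraped prices are strings like £51.77; for sorting or arithmetic you usually want floats. A small normalizer handles the currency symbol (a sketch that assumes a dot decimal separator, as on books.toscrape.com):

```php
<?php

// Strip everything except digits and the decimal point, then cast.
// Assumes "." as the decimal separator; adapt for "1.234,56"-style locales.
function parsePrice(string $price): float {
    return (float) preg_replace('/[^0-9.]/', '', $price);
}

echo parsePrice('£51.77'); // 51.77
```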

Proxy Integration

cURL with Proxy

<?php

$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL            => 'https://httpbin.org/ip',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_PROXY          => 'http://proxy.example.com:8080',
    CURLOPT_PROXYUSERPWD   => 'username:password',
]);

$response = curl_exec($ch);
curl_close($ch);

echo $response;

Guzzle with Proxy

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

$client = new Client([
    'proxy' => [
        'http'  => 'http://user:pass@proxy.example.com:8080',
        'https' => 'http://user:pass@proxy.example.com:8080',
    ],
]);

$response = $client->get('https://httpbin.org/ip');
echo $response->getBody();

Rotating Proxies

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

$proxies = [
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
    'http://user:pass@proxy3.example.com:8080',
];

function getWithRotatingProxy(string $url, array $proxies): string {
    $proxy = $proxies[array_rand($proxies)];

    $client = new Client([
        'proxy'   => ['http' => $proxy, 'https' => $proxy],
        'timeout' => 30,
    ]);

    $response = $client->get($url);
    return $response->getBody()->getContents();
}
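Random selection alone does not recover from a dead proxy. A failover variant tries each proxy in turn until one succeeds (a sketch; assumes Guzzle is installed, and the proxy URLs are placeholders):

```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

function fetchWithProxyFailover(string $url, array $proxies): string {
    $lastError = null;

    foreach ($proxies as $proxy) {
        try {
            $client = new Client([
                'proxy'   => ['http' => $proxy, 'https' => $proxy],
                'timeout' => 15,
            ]);
            return $client->get($url)->getBody()->getContents();
        } catch (\Exception $e) {
            $lastError = $e; // this proxy failed; try the next one
        }
    }

    throw new \RuntimeException("All proxies failed for {$url}", 0, $lastError);
}
```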

For proxy selection, see our web scraping proxy guide and proxy glossary.

Concurrent Requests

Guzzle supports concurrent requests with promises:

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Promise\Utils;
use Symfony\Component\DomCrawler\Crawler;

$client = new Client(['timeout' => 30]);
$promises = [];

for ($page = 1; $page <= 50; $page++) {
    $url = "https://books.toscrape.com/catalogue/page-{$page}.html";
    $promises[$page] = $client->getAsync($url);
}

$results = Utils::settle($promises)->wait();

$allBooks = [];
foreach ($results as $page => $result) {
    if ($result['state'] === 'fulfilled') {
        $html = $result['value']->getBody()->getContents();
        $crawler = new Crawler($html);

        $books = $crawler->filter('article.product_pod')->each(function ($node) {
            return [
                'title' => $node->filter('h3 a')->attr('title'),
                'price' => $node->filter('.price_color')->text(),
            ];
        });

        $allBooks = array_merge($allBooks, $books);
    }
}

echo "Total: " . count($allBooks) . " books\n";
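Firing 50 promises at once can overwhelm the target site (or get you blocked). Guzzle's Pool caps how many requests are in flight; this sketch limits concurrency to 5:

```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;

$client = new Client(['timeout' => 30]);

// Lazy generator: requests are only created as the pool has capacity.
// Yielding with a key makes that key available in the callbacks.
$requests = function () {
    for ($page = 1; $page <= 50; $page++) {
        yield $page => new Request('GET', "https://books.toscrape.com/catalogue/page-{$page}.html");
    }
};

$pool = new Pool($client, $requests(), [
    'concurrency' => 5, // at most 5 requests in flight
    'fulfilled'   => function ($response, $page) {
        echo "Page {$page}: HTTP " . $response->getStatusCode() . "\n";
    },
    'rejected'    => function ($reason, $page) {
        echo "Page {$page} failed: " . $reason->getMessage() . "\n";
    },
]);

$pool->promise()->wait();
```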

Storing Data

CSV

<?php

$fp = fopen('books.csv', 'w');
fputcsv($fp, ['Title', 'Price']);

foreach ($allBooks as $book) {
    fputcsv($fp, [$book['title'], $book['price']]);
}

fclose($fp);

JSON

<?php

file_put_contents('books.json',
    json_encode($allBooks, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE));

MySQL/Database

<?php

$pdo = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION, // throw on SQL errors
]);
$stmt = $pdo->prepare("INSERT INTO books (title, price) VALUES (?, ?)");

foreach ($allBooks as $book) {
    $stmt->execute([$book['title'], $book['price']]);
}
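Re-running a scraper with plain INSERTs duplicates rows. If the table has a unique key (assumed here on title, which depends on your schema), MySQL's ON DUPLICATE KEY UPDATE turns the insert into an upsert:

```php
<?php

// Assumes: ALTER TABLE books ADD UNIQUE KEY uniq_title (title);
$pdo = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION, // throw on SQL errors
]);

$stmt = $pdo->prepare(
    "INSERT INTO books (title, price) VALUES (?, ?)
     ON DUPLICATE KEY UPDATE price = VALUES(price)"
);

foreach ($allBooks as $book) {
    $stmt->execute([$book['title'], $book['price']]);
}
```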

FAQ

Is PHP good for web scraping?

PHP is perfectly capable of web scraping, especially if your existing stack is PHP-based. It has built-in cURL and DOMDocument, plus excellent libraries like Goutte and Guzzle. However, Python has a larger scraping ecosystem and more community resources.

What is the best PHP library for web scraping?

For simple projects, Symfony DomCrawler with Guzzle provides the best balance of power and simplicity. Goutte wraps these into a convenient scraping client. For maximum control, use raw cURL with DOMDocument.

Can PHP scrape JavaScript-rendered pages?

PHP alone cannot render JavaScript. You need to use a headless browser. Options include calling Puppeteer/Playwright via Node.js from PHP, using the chrome-php/chrome package to control Chrome, or finding the API endpoint that JavaScript calls and hitting it directly with cURL.
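A sketch of the chrome-php route (assumes the chrome-php/chrome Composer package and a local Chrome/Chromium install; the URL is a placeholder):

```php
<?php
require 'vendor/autoload.php';

use HeadlessChromium\BrowserFactory;

$browserFactory = new BrowserFactory();
$browser = $browserFactory->createBrowser();

try {
    $page = $browser->createPage();
    $page->navigate('https://example.com/js-heavy-page')->waitForNavigation();

    // The DOM after JavaScript has run; feed it into DomCrawler as usual
    $html = $page->getHtml();
} finally {
    $browser->close();
}
```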

How does PHP compare to Python for web scraping?

Python has more scraping-specific libraries (Scrapy, BeautifulSoup, etc.) and a larger community. PHP’s advantage is integration with existing PHP applications and shared hosting support. Use PHP when it fits your existing stack; use Python for dedicated scraping projects.


Explore scraping in other languages: Python, Java, Node.js. For proxy setup, see our web scraping proxy guide.
