Web Scraping with PHP: Tutorial & Libraries
PHP is a practical choice for web scraping when your existing infrastructure runs on PHP — WordPress plugins, Laravel applications, or cron-based data collection scripts. While Python dominates the scraping landscape, PHP’s built-in cURL support, the Goutte framework, and Symfony's DomCrawler component make it fully capable of handling everything from simple page parsing to multi-step crawling workflows.
This tutorial covers every major PHP scraping approach, from raw cURL to the Goutte framework, with working code examples.
Table of Contents
- Why PHP for Web Scraping
- Setting Up
- cURL: Basic HTTP Requests
- DOMDocument: Built-in HTML Parsing
- Symfony DomCrawler
- Goutte: Scraping Framework
- Guzzle: Advanced HTTP
- Handling Pagination
- Proxy Integration
- Concurrent Requests
- Storing Data
- FAQ
Why PHP for Web Scraping
- Existing infrastructure — If your site runs PHP, scraping integrates directly
- cURL built-in — PHP ships with cURL, no additional HTTP library needed
- DOMDocument built-in — Native HTML/XML parsing without dependencies
- Shared hosting friendly — Runs on almost any web server
- Composer ecosystem — Goutte, Guzzle, and Symfony components are mature
Setting Up
# Create project
mkdir php-scraper && cd php-scraper
composer init --no-interaction
# Install libraries
composer require symfony/dom-crawler
composer require symfony/css-selector
composer require guzzlehttp/guzzle
composer require fabpot/goutte
cURL: Basic HTTP Requests
PHP’s built-in cURL extension handles HTTP without any dependencies:
<?php
function fetchPage(string $url): string {
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT => 30,
        CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        CURLOPT_HTTPHEADER => [
            'Accept: text/html,application/xhtml+xml',
            'Accept-Language: en-US,en;q=0.9',
        ],
        CURLOPT_SSL_VERIFYPEER => true,
    ]);
    $html = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($html === false) {
        $error = curl_error($ch);
        curl_close($ch); // release the handle before throwing
        throw new Exception("cURL error: {$error}");
    }
    curl_close($ch);
    if ($httpCode !== 200) {
        throw new Exception("HTTP {$httpCode} for {$url}");
    }
    return $html;
}
// Usage
$html = fetchPage('https://books.toscrape.com/');
echo "Page length: " . strlen($html) . " bytes\n";POST Requests
<?php
$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL => 'https://example.com/api/search',
    CURLOPT_POST => true,
    CURLOPT_POSTFIELDS => json_encode(['query' => 'web scraping', 'page' => 1]),
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER => [
        'Content-Type: application/json',
        'Accept: application/json',
    ],
]);
$response = curl_exec($ch);
curl_close($ch);
$data = json_decode($response, true);
print_r($data);
DOMDocument: Built-in HTML Parsing
PHP’s DOMDocument and DOMXPath provide native HTML parsing:
<?php
$html = file_get_contents('https://books.toscrape.com/');
// Suppress HTML5 warnings
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
// Extract books using XPath
$books = $xpath->query("//article[contains(@class, 'product_pod')]");
foreach ($books as $book) {
    $titleNode = $xpath->query(".//h3/a", $book)->item(0);
    $priceNode = $xpath->query(".//*[contains(@class, 'price_color')]", $book)->item(0);
    if ($titleNode === null || $priceNode === null) {
        continue; // skip entries missing the expected markup
    }
    $title = $titleNode->getAttribute('title');
    $price = $priceNode->textContent;
    echo "{$title}: {$price}\n";
}
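One gotcha: DOMDocument::loadHTML() assumes ISO-8859-1 when the page does not declare a charset, which can mangle UTF-8 text. A commonly used workaround is to prepend an XML encoding declaration before parsing; a sketch, assuming $html holds UTF-8 markup:
<?php
libxml_use_internal_errors(true);
$dom = new DOMDocument();
// Hint to the parser that the input is UTF-8 (workaround for pages without a charset meta tag)
$dom->loadHTML('<?xml encoding="utf-8"?>' . $html);
libxml_clear_errors();
DOMXPath Selectors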
<?php
// All links
$links = $xpath->query("//a[@href]");
// Elements by class
$products = $xpath->query("//*[contains(@class, 'product')]");
// Elements by attribute
$dataItems = $xpath->query("//*[@data-id]");
// Text content
$prices = $xpath->query("//span[@class='price']/text()");
// Conditional
$cheapItems = $xpath->query("//div[@class='product'][.//span[@class='price'][number(substring(text(),2)) < 20]]");
Symfony DomCrawler
DomCrawler provides a cleaner API with CSS selector support:
<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = file_get_contents('https://books.toscrape.com/');
$crawler = new Crawler($html);
// CSS selectors
$books = $crawler->filter('article.product_pod')->each(function (Crawler $node) {
    return [
        'title' => $node->filter('h3 a')->attr('title'),
        'price' => $node->filter('.price_color')->text(),
        'rating' => str_replace('star-rating ', '', $node->filter('p.star-rating')->attr('class')),
    ];
});
foreach ($books as $book) {
    echo "{$book['title']} — {$book['price']} ({$book['rating']})\n";
}
echo "Total: " . count($books) . " books\n";DomCrawler Methods
<?php
// Get text content
$title = $crawler->filter('h1')->text();
$title = $crawler->filter('h1')->text('Default value'); // With fallback
// Get attribute
$href = $crawler->filter('a.link')->attr('href');
// Get all matching elements
$allPrices = $crawler->filter('.price')->each(fn($node) => $node->text());
// Check existence
$exists = $crawler->filter('.element')->count() > 0;
// Get inner HTML
$html = $crawler->filter('.content')->html();
// Traverse
$crawler->filter('ul.menu li')->each(function (Crawler $node, $i) {
    echo "Item {$i}: " . $node->text() . "\n";
});
// Filter within results
$firstProduct = $crawler->filter('.product')->first();
$lastProduct = $crawler->filter('.product')->last();
$thirdProduct = $crawler->filter('.product')->eq(2);
Goutte: Scraping Framework
Goutte combines Guzzle HTTP and DomCrawler into a web scraping client. Note that the fabpot/goutte package is now deprecated in favor of Symfony's HttpBrowser; a drop-in alternative is sketched after the example below:
<?php
require 'vendor/autoload.php';
use Goutte\Client;
$client = new Client();
$crawler = $client->request('GET', 'https://books.toscrape.com/');
// Extract books
$books = $crawler->filter('article.product_pod')->each(function ($node) {
    return [
        'title' => $node->filter('h3 a')->attr('title'),
        'price' => $node->filter('.price_color')->text(),
    ];
});
// Follow links
$firstBookLink = $crawler->filter('article.product_pod h3 a')->first()->link();
$detailCrawler = $client->click($firstBookLink);
$description = $detailCrawler->filter('#product_description + p')->text('No description');
echo "First book description: {$description}\n";Form Submission with Goutte
<?php
$client = new Client();
$crawler = $client->request('GET', 'https://example.com/login');
// Fill and submit form
$form = $crawler->selectButton('Login')->form([
    'username' => 'user',
    'password' => 'pass',
]);
$crawler = $client->submit($form);
// Now scrape authenticated content
$dashboard = $client->request('GET', 'https://example.com/dashboard');
$data = $dashboard->filter('.data-row')->each(fn($node) => $node->text());
print_r($data);
Guzzle: Advanced HTTP
<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;
$client = new Client([
    'timeout' => 30,
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Accept' => 'text/html',
    ],
    'cookies' => true,
]);
$response = $client->get('https://books.toscrape.com/');
$html = $response->getBody()->getContents();
$crawler = new Crawler($html);
$books = $crawler->filter('article.product_pod')->each(function ($node) {
    return [
        'title' => $node->filter('h3 a')->attr('title'),
        'price' => $node->filter('.price_color')->text(),
    ];
});
echo json_encode($books, JSON_PRETTY_PRINT) . "\n";
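Guzzle's middleware stack can also retry failed requests for you. A sketch using the built-in retry middleware to re-attempt rate-limited or 5xx responses (the status codes, attempt limit, and delays are illustrative choices):
<?php
use GuzzleHttp\Client;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;
$stack = HandlerStack::create();
$stack->push(Middleware::retry(
    // Decider: retry up to 3 times on 429/5xx responses
    function (int $retries, RequestInterface $request, ?ResponseInterface $response = null): bool {
        return $retries < 3
            && $response !== null
            && in_array($response->getStatusCode(), [429, 500, 502, 503], true);
    },
    // Delay in milliseconds: 1s, 2s, 4s
    fn (int $retries): int => 1000 * (2 ** ($retries - 1))
));
$client = new Client(['handler' => $stack, 'timeout' => 30]);
Handling Pagination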
<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;
$client = new Client(['timeout' => 30]);
$allBooks = [];
for ($page = 1; $page <= 50; $page++) {
    $url = "https://books.toscrape.com/catalogue/page-{$page}.html";
    try {
        $response = $client->get($url);
        $crawler = new Crawler($response->getBody()->getContents());
        $books = $crawler->filter('article.product_pod')->each(function ($node) {
            return [
                'title' => $node->filter('h3 a')->attr('title'),
                'price' => $node->filter('.price_color')->text(),
            ];
        });
        if (empty($books)) {
            break;
        }
        $allBooks = array_merge($allBooks, $books);
        echo "Page {$page}: " . count($books) . " books\n";
        usleep(random_int(500000, 1500000)); // 0.5-1.5s delay
    } catch (\Exception $e) {
        echo "Error on page {$page}: " . $e->getMessage() . "\n";
        break;
    }
}
echo "Total: " . count($allBooks) . " books\n";Proxy Integration
cURL with Proxy
<?php
$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL => 'https://httpbin.org/ip',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_PROXY => 'http://proxy.example.com:8080',
    CURLOPT_PROXYUSERPWD => 'username:password',
]);
$response = curl_exec($ch);
curl_close($ch);
echo $response;
Guzzle with Proxy
<?php
$client = new Client([
    'proxy' => [
        'http' => 'http://user:pass@proxy.example.com:8080',
        'https' => 'http://user:pass@proxy.example.com:8080',
    ],
]);
$response = $client->get('https://httpbin.org/ip');
echo $response->getBody();
Rotating Proxies
<?php
$proxies = [
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
    'http://user:pass@proxy3.example.com:8080',
];
function getWithRotatingProxy(string $url, array $proxies): string {
    $proxy = $proxies[array_rand($proxies)];
    $client = new Client([
        'proxy' => ['http' => $proxy, 'https' => $proxy],
        'timeout' => 30,
    ]);
    $response = $client->get($url);
    return $response->getBody()->getContents();
}
}
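A usage sketch: retry a failed request through a different randomly chosen proxy (the attempt limit is an arbitrary choice):
<?php
$maxAttempts = 3;
for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
    try {
        echo getWithRotatingProxy('https://httpbin.org/ip', $proxies) . "\n";
        break; // success, stop retrying
    } catch (\Exception $e) {
        echo "Attempt {$attempt} failed: " . $e->getMessage() . "\n";
    }
}
For proxy selection, see our web scraping proxy guide and proxy glossary.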
Concurrent Requests
Guzzle supports concurrent requests with promises:
<?php
use GuzzleHttp\Client;
use GuzzleHttp\Promise\Utils;
use Symfony\Component\DomCrawler\Crawler;
$client = new Client(['timeout' => 30]);
$promises = [];
for ($page = 1; $page <= 50; $page++) {
    $url = "https://books.toscrape.com/catalogue/page-{$page}.html";
    $promises[$page] = $client->getAsync($url);
}
$results = Utils::settle($promises)->wait();
$allBooks = [];
foreach ($results as $page => $result) {
    if ($result['state'] === 'fulfilled') {
        $html = $result['value']->getBody()->getContents();
        $crawler = new Crawler($html);
        $books = $crawler->filter('article.product_pod')->each(function ($node) {
            return [
                'title' => $node->filter('h3 a')->attr('title'),
                'price' => $node->filter('.price_color')->text(),
            ];
        });
        $allBooks = array_merge($allBooks, $books);
    }
}
echo "Total: " . count($allBooks) . " books\n";Storing Data
CSV
<?php
$fp = fopen('books.csv', 'w');
fputcsv($fp, ['Title', 'Price']);
foreach ($allBooks as $book) {
    fputcsv($fp, [$book['title'], $book['price']]);
}
fclose($fp);
JSON
<?php
file_put_contents('books.json',
    json_encode($allBooks, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE));
MySQL/Database
<?php
$pdo = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass');
$stmt = $pdo->prepare("INSERT INTO books (title, price) VALUES (?, ?)");
foreach ($allBooks as $book) {
    $stmt->execute([$book['title'], $book['price']]);
}
FAQ
Is PHP good for web scraping?
PHP is perfectly capable for web scraping, especially if your existing stack is PHP-based. It has built-in cURL and DOMDocument, plus excellent libraries like Goutte and Guzzle. However, Python has a larger scraping ecosystem and more community resources.
What is the best PHP library for web scraping?
For simple projects, Symfony DomCrawler with Guzzle provides the best balance of power and simplicity. Goutte wraps these into a convenient scraping client, though the package is now deprecated in favor of Symfony's HttpBrowser. For maximum control, use raw cURL with DOMDocument.
Can PHP scrape JavaScript-rendered pages?
PHP alone cannot render JavaScript. You need to use a headless browser. Options include calling Puppeteer/Playwright via Node.js from PHP, using the chrome-php/chrome package to control Chrome, or finding the API endpoint that JavaScript calls and hitting it directly with cURL.
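For example, with the chrome-php/chrome package (composer require chrome-php/chrome; assumes a local Chrome/Chromium binary), a minimal sketch looks like this:
<?php
require 'vendor/autoload.php';
use HeadlessChromium\BrowserFactory;
$browserFactory = new BrowserFactory();
$browser = $browserFactory->createBrowser(['headless' => true]);
try {
    $page = $browser->createPage();
    $page->navigate('https://example.com')->waitForNavigation();
    $html = $page->getHtml(); // HTML after JavaScript has executed
    echo strlen($html) . " bytes\n";
} finally {
    $browser->close();
}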
How does PHP compare to Python for web scraping?
Python has more scraping-specific libraries (Scrapy, BeautifulSoup, etc.) and a larger community. PHP’s advantage is integration with existing PHP applications and shared hosting support. Use PHP when it fits your existing stack; use Python for dedicated scraping projects.
Explore scraping in other languages: Python, Java, Node.js. For proxy setup, see our web scraping proxy guide.
External Resources:
- Goutte Documentation
- Symfony DomCrawler Documentation
- Guzzle Documentation
Related Reading
- aiohttp + BeautifulSoup: Async Python Scraping
- Axios + Cheerio: Lightweight Node.js Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix
- How to Build an Ethical Web Scraping Policy for Your Company