Server-Sent Events (SSE) Scraping Guide
Server-Sent Events (SSE) provide a simpler alternative to WebSockets for server-to-client streaming. Many AI applications (ChatGPT, Claude), live dashboards, and real-time feeds use SSE to push data to browsers. Unlike WebSockets, SSE works over standard HTTP, making it easier to proxy and scrape.
How SSE Works
Client sends:
```
GET /events HTTP/1.1
Accept: text/event-stream
```
Server responds:
```
HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive

data: {"price": 45230.50, "symbol": "BTC"}

data: {"price": 45231.20, "symbol": "BTC"}

event: update
data: {"price": 45229.80, "symbol": "BTC"}
id: 12345

event: heartbeat
data: ping
```

SSE format rules:
- Lines starting with `data:` contain the payload
- `event:` specifies the event type (the default is "message")
- `id:` sets the last event ID, used for reconnection
- An empty line signals the end of an event
- Multi-line data uses multiple `data:` lines
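The field rules above can be sketched as a small standalone parser (stdlib only; the sample stream below is illustrative, not from a real endpoint):

```python
def parse_sse(raw: str):
    """Split a raw text/event-stream body into (event, data, id) tuples."""
    events = []
    event_type, data_lines, event_id = "message", [], None
    for line in raw.splitlines():
        if line.startswith("data:"):
            # successive data: lines accumulate into one multi-line payload
            data_lines.append(line[len("data:"):].lstrip())
        elif line.startswith("event:"):
            event_type = line[len("event:"):].strip()
        elif line.startswith("id:"):
            event_id = line[len("id:"):].strip()
        elif line == "" and data_lines:
            # empty line terminates the event
            events.append((event_type, "\n".join(data_lines), event_id))
            event_type, data_lines = "message", []
    return events

raw = "data: first\ndata: second\n\nevent: update\nid: 7\ndata: ping\n\n"
parse_sse(raw)
# → [("message", "first\nsecond", None), ("update", "ping", "7")]
```

Note that the last event ID persists across events, mirroring how browsers track `Last-Event-ID` for reconnection.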
Scraping SSE with Python
```python
import asyncio
import json

import httpx


class SSEScraper:
    """Scrape Server-Sent Events through a proxy."""

    def __init__(self, proxy_url=None):
        self.proxy_url = proxy_url

    async def connect(self, url, headers=None, max_events=None, timeout=None):
        """Connect to an SSE endpoint and yield parsed events."""
        default_headers = {
            'Accept': 'text/event-stream',
            'Cache-Control': 'no-cache',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        }
        if headers:
            default_headers.update(headers)

        event_count = 0
        # timeout=None by default: SSE connections stay open indefinitely
        async with httpx.AsyncClient(proxy=self.proxy_url, timeout=timeout) as client:
            async with client.stream('GET', url, headers=default_headers) as response:
                current_event = {'event': 'message', 'data': '', 'id': None}
                async for line in response.aiter_lines():
                    if line.startswith('data: '):
                        data = line[6:]
                        # successive data: lines form a multi-line payload
                        if current_event['data']:
                            current_event['data'] += '\n' + data
                        else:
                            current_event['data'] = data
                    elif line.startswith('event: '):
                        current_event['event'] = line[7:]
                    elif line.startswith('id: '):
                        current_event['id'] = line[4:]
                    elif line == '':
                        # Empty line = end of event
                        if current_event['data']:
                            try:
                                current_event['data'] = json.loads(current_event['data'])
                            except json.JSONDecodeError:
                                pass  # keep non-JSON payloads as plain text
                            yield current_event.copy()
                            event_count += 1
                        current_event = {'event': 'message', 'data': '', 'id': None}
                        if max_events and event_count >= max_events:
                            return


# Usage
async def main():
    scraper = SSEScraper(proxy_url='http://user:pass@proxy.example.com:8080')
    async for event in scraper.connect(
        'https://api.example.com/stream/prices',
        max_events=100,
    ):
        print(f"Event: {event['event']}, Data: {event['data']}")

asyncio.run(main())
```

SSE with Auto-Reconnection
```python
import asyncio
import json

import httpx


class ResilientSSEClient:
    """SSE client with automatic reconnection."""

    def __init__(self, url, proxy=None, reconnect_delay=3):
        self.url = url
        self.proxy = proxy
        self.reconnect_delay = reconnect_delay
        self.last_event_id = None

    async def stream(self):
        """Stream events, reconnecting automatically on connection loss."""
        while True:
            try:
                headers = {'Accept': 'text/event-stream'}
                if self.last_event_id:
                    # ask the server to resume after the last event we saw
                    headers['Last-Event-ID'] = self.last_event_id
                async with httpx.AsyncClient(proxy=self.proxy, timeout=None) as client:
                    async with client.stream('GET', self.url, headers=headers) as response:
                        async for line in response.aiter_lines():
                            if line.startswith('data: '):
                                data = line[6:]
                                try:
                                    yield json.loads(data)
                                except json.JSONDecodeError:
                                    yield data
                            elif line.startswith('id: '):
                                self.last_event_id = line[4:]
            except (httpx.ReadTimeout, httpx.ConnectError) as e:
                print(f"Connection lost: {e}. Reconnecting in {self.reconnect_delay}s...")
                await asyncio.sleep(self.reconnect_delay)
```

Use Cases
| Source | SSE Endpoint Type | Data |
|---|---|---|
| AI chat (ChatGPT, Claude) | Token streaming | Real-time responses |
| Financial feeds | Price updates | Stock/crypto prices |
| Sports scores | Live updates | Game scores |
| Social dashboards | Activity streams | Notifications |
| CI/CD pipelines | Build logs | Log streaming |
Internal Links
- WebSocket Proxying — alternative real-time protocol
- AJAX Request Interception — discover SSE endpoints
- Bandwidth Optimization — SSE uses minimal bandwidth
- Building a Stock/Crypto Price Monitor — apply SSE scraping
- Proxy Connection Pooling — manage long-lived SSE connections
FAQ
What is the difference between SSE and WebSocket?
SSE is unidirectional (server to client only) over HTTP, while WebSocket is bidirectional over a custom protocol. SSE is simpler to proxy since it uses standard HTTP. Choose SSE scraping when you only need to receive data from the server.
Can I use SSE through a regular HTTP proxy?
Yes. SSE uses standard HTTP with text/event-stream content type. Any HTTP proxy that supports long-lived connections (keep-alive) works with SSE. Set appropriate timeouts to prevent the proxy from closing the connection.
How much bandwidth does SSE scraping use?
Very little. SSE sends only the data payload as text, with minimal protocol overhead. A typical SSE stream might use 1-10 KB/s depending on update frequency, far less than loading full web pages.
How do I handle SSE reconnection through proxies?
Use the Last-Event-ID header when reconnecting. This tells the server to resume from where you left off. Implement exponential backoff for reconnection delays to avoid overwhelming the server.
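A minimal sketch of exponential backoff with jitter, which could replace the fixed `reconnect_delay` above (the base and cap values are illustrative, not from this guide):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before reconnect attempt N: base * 2**N, capped, with jitter."""
    delay = min(cap, base * (2 ** attempt))
    # jitter spreads reconnects out so many clients don't retry in lockstep
    return delay * random.uniform(0.5, 1.0)

# attempt 0 → 0.5-1 s, attempt 3 → 4-8 s, attempt 10+ → capped near 60 s
```

Reset `attempt` to zero after a successful connection so a healthy stream reconnects quickly.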
Can anti-bot systems detect SSE scraping?
SSE connections look like normal HTTP requests. The main detection vector is the connection duration and request headers. Use realistic headers and rotate proxies between reconnections to avoid detection.
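Rotating proxies between reconnections can be as simple as cycling through a pool; the proxy URLs below are placeholders:

```python
import itertools

PROXIES = [
    "http://user:pass@proxy-1.example.com:8080",  # placeholders
    "http://user:pass@proxy-2.example.com:8080",
    "http://user:pass@proxy-3.example.com:8080",
]
proxy_pool = itertools.cycle(PROXIES)

def next_proxy() -> str:
    """Hand out the next proxy; call once per reconnection attempt."""
    return next(proxy_pool)
```

Calling `next_proxy()` before each reconnect gives every long-lived connection a different exit IP, so no single address accumulates suspiciously long sessions.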
Related Reading
- AJAX Request Interception: Scraping API Calls Directly
- Azure Functions for Serverless Web Scraping: the Complete Guide
- Build an Anti-Detection Test Suite: Verify Browser Stealth
- Build a News Crawler in Python: Step-by-Step Tutorial
- How to Configure Proxies on iPhone and Android
- How to Use Proxies in Node.js (Axios, Fetch, Puppeteer)