Server-Sent Events (SSE) Scraping Guide

Server-Sent Events (SSE) provide a simpler alternative to WebSockets for server-to-client streaming. Many AI applications (ChatGPT, Claude), live dashboards, and real-time feeds use SSE to push data to browsers. Unlike WebSockets, SSE works over standard HTTP, making it easier to proxy and scrape.

How SSE Works

Client sends:
GET /events HTTP/1.1
Accept: text/event-stream

Server responds:
HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive

data: {"price": 45230.50, "symbol": "BTC"}

data: {"price": 45231.20, "symbol": "BTC"}

event: update
data: {"price": 45229.80, "symbol": "BTC"}
id: 12345

event: heartbeat
data: ping

SSE format rules:

  • Lines starting with data: contain the payload
  • event: specifies event type (default is “message”)
  • id: sets the last event ID for reconnection
  • Empty line signals end of an event
  • Multi-line data uses multiple data: lines
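The rules above can be sketched as a small parser. This is a simplified illustration: a spec-complete parser would also handle comment lines starting with a colon, the retry: field, and fields written without a space after the colon.

```python
def parse_sse(raw: str):
    """Parse a raw text/event-stream body into a list of events.

    Follows the rules above: data: lines accumulate the payload,
    event: sets the type (default "message"), id: sets the event ID,
    and a blank line dispatches the accumulated event.
    """
    events = []
    event = {'event': 'message', 'data': '', 'id': None}
    for line in raw.splitlines():
        if line.startswith('data: '):
            # Multi-line data: successive data: lines join with a newline
            payload = line[6:]
            event['data'] = (event['data'] + '\n' + payload) if event['data'] else payload
        elif line.startswith('event: '):
            event['event'] = line[7:]
        elif line.startswith('id: '):
            event['id'] = line[4:]
        elif line == '' and event['data']:
            events.append(event)
            event = {'event': 'message', 'data': '', 'id': None}
    return events

stream = 'data: line one\ndata: line two\n\nevent: update\ndata: ping\nid: 7\n\n'
print(parse_sse(stream))
```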

Scraping SSE with Python

import httpx
import json
import asyncio

class SSEScraper:
    """Scrape Server-Sent Events through a proxy."""
    
    def __init__(self, proxy_url=None):
        self.proxy_url = proxy_url
    
    async def connect(self, url, headers=None, max_events=None, timeout=None):
        """Connect to an SSE endpoint and collect events."""
        default_headers = {
            'Accept': 'text/event-stream',
            'Cache-Control': 'no-cache',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        }
        if headers:
            default_headers.update(headers)
        
        event_count = 0
        
        async with httpx.AsyncClient(proxy=self.proxy_url, timeout=timeout) as client:
            async with client.stream('GET', url, headers=default_headers) as response:
                current_event = {'event': 'message', 'data': '', 'id': None}
                
                async for line in response.aiter_lines():
                    if line.startswith('data: '):
                        data = line[6:]
                        if current_event['data']:
                            # Multi-line data: successive data: lines join with \n
                            current_event['data'] += '\n' + data
                        else:
                            current_event['data'] = data
                    
                    elif line.startswith('event: '):
                        current_event['event'] = line[7:]
                    
                    elif line.startswith('id: '):
                        current_event['id'] = line[4:]
                    
                    elif line == '':
                        # Empty line = end of event
                        if current_event['data']:
                            try:
                                current_event['data'] = json.loads(current_event['data'])
                            except json.JSONDecodeError:
                                pass  # non-JSON payloads are yielded as plain text
                            
                            event_count += 1
                            yield current_event.copy()
                            
                            current_event = {'event': 'message', 'data': '', 'id': None}
                            
                            if max_events and event_count >= max_events:
                                return

# Usage
async def main():
    scraper = SSEScraper(proxy_url='http://user:pass@proxy.example.com:8080')
    
    async for event in scraper.connect(
        'https://api.example.com/stream/prices',
        max_events=100
    ):
        print(f"Event: {event['event']}, Data: {event['data']}")

asyncio.run(main())

SSE with Auto-Reconnection

class ResilientSSEClient:
    """SSE client with automatic reconnection."""
    
    def __init__(self, url, proxy=None, reconnect_delay=3):
        self.url = url
        self.proxy = proxy
        self.reconnect_delay = reconnect_delay
        self.last_event_id = None
    
    async def stream(self):
        """Stream events with automatic reconnection."""
        while True:
            try:
                headers = {'Accept': 'text/event-stream'}
                if self.last_event_id:
                    headers['Last-Event-ID'] = self.last_event_id
                
                async with httpx.AsyncClient(proxy=self.proxy, timeout=None) as client:
                    async with client.stream('GET', self.url, headers=headers) as response:
                        # Simplified parsing: yields each data line as it
                        # arrives rather than buffering until the blank
                        # line that terminates a full event.
                        async for line in response.aiter_lines():
                            if line.startswith('data: '):
                                data = line[6:]
                                try:
                                    yield json.loads(data)
                                except json.JSONDecodeError:
                                    yield data
                            elif line.startswith('id: '):
                                self.last_event_id = line[4:]
            
            except httpx.HTTPError as e:
                print(f"Connection lost: {e}. Reconnecting in {self.reconnect_delay}s...")
            
            # Pause before reconnecting, whether the stream errored out
            # or the server closed it cleanly.
            await asyncio.sleep(self.reconnect_delay)
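A fixed reconnect delay works, but repeated failures are gentler on the server with exponential backoff, as the FAQ below recommends. A minimal helper (a sketch; the base, cap, and full-jitter strategy are illustrative choices):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay in seconds before reconnect attempt N: base * 2^N, capped at cap.
    Full jitter spreads reconnects so many clients don't retry in lockstep."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

In ResilientSSEClient.stream, count consecutive failures, sleep for backoff_delay(attempt) instead of the fixed delay, and reset the counter once events flow again.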

Use Cases

Source                    | SSE Endpoint Type | Data
AI chat (ChatGPT, Claude) | Token streaming   | Real-time responses
Financial feeds           | Price updates     | Stock/crypto prices
Sports scores             | Live updates      | Game scores
Social dashboards         | Activity streams  | Notifications
CI/CD pipelines           | Build logs        | Log streaming

FAQ

What is the difference between SSE and WebSocket?

SSE is unidirectional (server to client only) over plain HTTP, while WebSocket is bidirectional over its own protocol, negotiated via an HTTP upgrade. SSE is simpler to proxy since it uses standard HTTP. Choose SSE scraping when you only need to receive data from the server.

Can I use SSE through a regular HTTP proxy?

Yes. SSE uses standard HTTP with text/event-stream content type. Any HTTP proxy that supports long-lived connections (keep-alive) works with SSE. Set appropriate timeouts to prevent the proxy from closing the connection.
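With httpx, for example, the read timeout can be disabled while keeping the other timeouts strict, so a stream that goes quiet between events is not dropped. The proxy URL and values here are placeholders:

```python
import httpx

# No read timeout: an SSE stream may legitimately stay silent for minutes.
# Connect, write, and pool operations still time out after 10 seconds.
sse_timeout = httpx.Timeout(10.0, read=None)

client = httpx.AsyncClient(
    proxy='http://user:pass@proxy.example.com:8080',  # placeholder proxy
    timeout=sse_timeout,
)
```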

How much bandwidth does SSE scraping use?

Very little. SSE sends only the data payload as text, with minimal protocol overhead. A typical SSE stream might use 1-10 KB/s depending on update frequency, far less than loading full web pages.
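As a rough back-of-the-envelope check (the event size and rate here are assumptions, not measurements):

```python
# A payload like {"price": 45230.50, "symbol": "BTC"} plus the "data: "
# prefix and the blank-line separator is roughly 50 bytes on the wire.
event_bytes = 50
events_per_second = 20  # a fairly busy price feed

bytes_per_second = event_bytes * events_per_second
print(f"{bytes_per_second / 1024:.2f} KB/s")  # well under 1 KB/s
```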

How do I handle SSE reconnection through proxies?

Use the Last-Event-ID header when reconnecting. This tells the server to resume from where you left off. Implement exponential backoff for reconnection delays to avoid overwhelming the server.

Can anti-bot systems detect SSE scraping?

SSE connections look like normal HTTP requests. The main detection vector is the connection duration and request headers. Use realistic headers and rotate proxies between reconnections to avoid detection.

