Server-Sent Events (SSE) Scraping Guide

Server-Sent Events (SSE) provide a simpler alternative to WebSockets for server-to-client streaming. Many AI applications (ChatGPT, Claude), live dashboards, and real-time feeds use SSE to push data to browsers. Unlike WebSockets, SSE works over standard HTTP, making it easier to proxy and scrape.

How SSE Works

Client sends:
GET /events HTTP/1.1
Accept: text/event-stream

Server responds:
HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive

data: {"price": 45230.50, "symbol": "BTC"}

data: {"price": 45231.20, "symbol": "BTC"}

event: update
data: {"price": 45229.80, "symbol": "BTC"}
id: 12345

event: heartbeat
data: ping

SSE format rules:

  • Lines starting with data: contain the payload
  • event: specifies event type (default is “message”)
  • id: sets the last event ID for reconnection
  • Empty line signals end of an event
  • Multi-line data uses multiple data: lines
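The rules above can be sketched as a small parser. This is a simplified illustration: a spec-complete parser would also handle comment lines starting with a colon, the retry: field, and fields written without a space after the colon.

```python
def parse_sse(raw: str):
    """Parse a raw text/event-stream body into a list of events.

    Follows the rules above: data: lines accumulate the payload,
    event: sets the type (default "message"), id: sets the event ID,
    and a blank line dispatches the accumulated event.
    """
    events = []
    event = {'event': 'message', 'data': '', 'id': None}
    for line in raw.splitlines():
        if line.startswith('data: '):
            # Multi-line data: successive data: lines join with a newline
            payload = line[6:]
            event['data'] = (event['data'] + '\n' + payload) if event['data'] else payload
        elif line.startswith('event: '):
            event['event'] = line[7:]
        elif line.startswith('id: '):
            event['id'] = line[4:]
        elif line == '' and event['data']:
            events.append(event)
            event = {'event': 'message', 'data': '', 'id': None}
    return events

stream = 'data: line one\ndata: line two\n\nevent: update\ndata: ping\nid: 7\n\n'
print(parse_sse(stream))
```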

Scraping SSE with Python

import httpx
import json
import asyncio

class SSEScraper:
    """Scrape Server-Sent Events through a proxy."""
    
    def __init__(self, proxy_url=None):
        self.proxy_url = proxy_url
    
    async def connect(self, url, headers=None, max_events=None, timeout=None):
        """Connect to an SSE endpoint and collect events."""
        default_headers = {
            'Accept': 'text/event-stream',
            'Cache-Control': 'no-cache',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        }
        if headers:
            default_headers.update(headers)
        
        event_count = 0
        
        async with httpx.AsyncClient(proxy=self.proxy_url, timeout=timeout) as client:
            async with client.stream('GET', url, headers=default_headers) as response:
                current_event = {'event': 'message', 'data': '', 'id': None}
                
                async for line in response.aiter_lines():
                    if line.startswith('data: '):
                        data = line[6:]
                        if current_event['data']:
                            # Multi-line data: successive data: lines join with \n
                            current_event['data'] += '\n' + data
                        else:
                            current_event['data'] = data
                    
                    elif line.startswith('event: '):
                        current_event['event'] = line[7:]
                    
                    elif line.startswith('id: '):
                        current_event['id'] = line[4:]
                    
                    elif line == '':
                        # Empty line = end of event
                        if current_event['data']:
                            try:
                                current_event['data'] = json.loads(current_event['data'])
                            except json.JSONDecodeError:
                                pass  # non-JSON payloads are yielded as plain text
                            
                            event_count += 1
                            yield current_event.copy()
                            
                            current_event = {'event': 'message', 'data': '', 'id': None}
                            
                            if max_events and event_count >= max_events:
                                return

# Usage
async def main():
    scraper = SSEScraper(proxy_url='http://user:pass@proxy.example.com:8080')
    
    async for event in scraper.connect(
        'https://api.example.com/stream/prices',
        max_events=100
    ):
        print(f"Event: {event['event']}, Data: {event['data']}")

asyncio.run(main())

SSE with Auto-Reconnection

class ResilientSSEClient:
    """SSE client with automatic reconnection."""
    
    def __init__(self, url, proxy=None, reconnect_delay=3):
        self.url = url
        self.proxy = proxy
        self.reconnect_delay = reconnect_delay
        self.last_event_id = None
    
    async def stream(self):
        """Stream events with automatic reconnection."""
        while True:
            try:
                headers = {'Accept': 'text/event-stream'}
                if self.last_event_id:
                    headers['Last-Event-ID'] = self.last_event_id
                
                async with httpx.AsyncClient(proxy=self.proxy, timeout=None) as client:
                    async with client.stream('GET', self.url, headers=headers) as response:
                        # Simplified parsing: yields each data line as it
                        # arrives rather than buffering until the blank
                        # line that terminates a full event.
                        async for line in response.aiter_lines():
                            if line.startswith('data: '):
                                data = line[6:]
                                try:
                                    yield json.loads(data)
                                except json.JSONDecodeError:
                                    yield data
                            elif line.startswith('id: '):
                                self.last_event_id = line[4:]
            
            except httpx.HTTPError as e:
                print(f"Connection lost: {e}. Reconnecting in {self.reconnect_delay}s...")
            
            # Pause before reconnecting, whether the stream errored out
            # or the server closed it cleanly.
            await asyncio.sleep(self.reconnect_delay)
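A fixed reconnect delay works, but repeated failures are gentler on the server with exponential backoff, as the FAQ below recommends. A minimal helper (a sketch; the base, cap, and full-jitter strategy are illustrative choices):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay in seconds before reconnect attempt N: base * 2^N, capped at cap.
    Full jitter spreads reconnects so many clients don't retry in lockstep."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

In ResilientSSEClient.stream, count consecutive failures, sleep for backoff_delay(attempt) instead of the fixed delay, and reset the counter once events flow again.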

Use Cases

Source                    | SSE Endpoint Type | Data
AI chat (ChatGPT, Claude) | Token streaming   | Real-time responses
Financial feeds           | Price updates     | Stock/crypto prices
Sports scores             | Live updates      | Game scores
Social dashboards         | Activity streams  | Notifications
CI/CD pipelines           | Build logs        | Log streaming

FAQ

What is the difference between SSE and WebSocket?

SSE is unidirectional (server to client only) over plain HTTP, while WebSocket is bidirectional over its own protocol, negotiated via an HTTP upgrade. SSE is simpler to proxy since it uses standard HTTP. Choose SSE scraping when you only need to receive data from the server.

Can I use SSE through a regular HTTP proxy?

Yes. SSE uses standard HTTP with text/event-stream content type. Any HTTP proxy that supports long-lived connections (keep-alive) works with SSE. Set appropriate timeouts to prevent the proxy from closing the connection.
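With httpx, for example, the read timeout can be disabled while keeping the other timeouts strict, so a stream that goes quiet between events is not dropped. The proxy URL and values here are placeholders:

```python
import httpx

# No read timeout: an SSE stream may legitimately stay silent for minutes.
# Connect, write, and pool operations still time out after 10 seconds.
sse_timeout = httpx.Timeout(10.0, read=None)

client = httpx.AsyncClient(
    proxy='http://user:pass@proxy.example.com:8080',  # placeholder proxy
    timeout=sse_timeout,
)
```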

How much bandwidth does SSE scraping use?

Very little. SSE sends only the data payload as text, with minimal protocol overhead. A typical SSE stream might use 1-10 KB/s depending on update frequency, far less than loading full web pages.
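As a rough back-of-the-envelope check (the event size and rate here are assumptions, not measurements):

```python
# A payload like {"price": 45230.50, "symbol": "BTC"} plus the "data: "
# prefix and the blank-line separator is roughly 50 bytes on the wire.
event_bytes = 50
events_per_second = 20  # a fairly busy price feed

bytes_per_second = event_bytes * events_per_second
print(f"{bytes_per_second / 1024:.2f} KB/s")  # well under 1 KB/s
```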

How do I handle SSE reconnection through proxies?

Use the Last-Event-ID header when reconnecting. This tells the server to resume from where you left off. Implement exponential backoff for reconnection delays to avoid overwhelming the server.

Can anti-bot systems detect SSE scraping?

SSE connections look like normal HTTP requests. The main detection vector is the connection duration and request headers. Use realistic headers and rotate proxies between reconnections to avoid detection.

