Web Scraping Behind Login: Session Management
Many valuable datasets live behind authentication walls — dashboards, member directories, account settings, and gated content. Scraping these requires managing login flows, session cookies, CSRF tokens, and sometimes multi-factor authentication. This guide covers the core techniques for authenticated web scraping.
Understanding Authentication Mechanisms
Cookie-Based Sessions
The most common approach. After login, the server sets a session cookie that authenticates subsequent requests:
POST /login → Set-Cookie: session_id=abc123
GET /dashboard (Cookie: session_id=abc123) → 200 OK
Token-Based (JWT)
Modern SPAs often use JWT tokens stored in localStorage or sent via headers:
POST /api/auth/login → {"token": "eyJhbGci..."}
GET /api/data (Authorization: Bearer eyJhbGci...) → 200 OK
OAuth 2.0
Third-party authentication through providers like Google, GitHub, Facebook. Requires handling redirect flows and token exchange.
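Where the provider supports a programmatic flow, you can complete the token exchange directly. Below is a minimal sketch of the authorization-code exchange defined in RFC 6749, assuming you have already captured the code from the redirect callback; the token URL and client credentials are placeholders, and some providers use slightly different parameter names:

import requests

def exchange_code_for_token(token_url, client_id, client_secret,
                            code, redirect_uri):
    """Exchange an OAuth 2.0 authorization code for an access token."""
    response = requests.post(token_url, data={
        "grant_type": "authorization_code",
        "code": code,                  # from the redirect callback
        "client_id": client_id,
        "client_secret": client_secret,
        "redirect_uri": redirect_uri,  # must match the registered URI
    })
    response.raise_for_status()
    return response.json()["access_token"]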
Method 1: Requests with Session Cookies
For traditional server-rendered sites:
import requests
from bs4 import BeautifulSoup
class AuthenticatedScraper:
def __init__(self, proxy=None):
self.session = requests.Session()
self.session.headers.update({
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
})
if proxy:
self.session.proxies = {"http": proxy, "https": proxy}
def login(self, login_url, username, password):
"""Handle standard form-based login."""
# Step 1: Get the login page (for CSRF token)
login_page = self.session.get(login_url)
soup = BeautifulSoup(login_page.text, "lxml")
# Step 2: Extract CSRF token
csrf_token = self._extract_csrf(soup)
# Step 3: Submit login form
login_data = {
"username": username,
"password": password,
}
if csrf_token:
login_data["_csrf"] = csrf_token
# or "csrf_token", "authenticity_token", etc.
response = self.session.post(
login_url,
data=login_data,
allow_redirects=True
)
# Step 4: Verify login success
if response.status_code == 200 and "logout" in response.text.lower():
print("Login successful")
return True
else:
print(f"Login failed: {response.status_code}")
return False
def _extract_csrf(self, soup):
"""Find CSRF token in common locations."""
# Meta tag
meta = soup.find("meta", {"name": "csrf-token"})
if meta:
            return meta.get("content")
# Hidden input field
for name in ["_csrf", "csrf_token", "authenticity_token", "_token"]:
field = soup.find("input", {"name": name})
if field:
return field.get("value")
return None
def scrape(self, url):
"""Scrape a page using the authenticated session."""
response = self.session.get(url)
        if response.status_code in (401, 403):
raise Exception("Session expired or unauthorized")
return response
def save_cookies(self, filepath):
"""Save session cookies for reuse."""
import pickle
with open(filepath, "wb") as f:
pickle.dump(self.session.cookies, f)
def load_cookies(self, filepath):
"""Load previously saved cookies."""
import pickle
with open(filepath, "rb") as f:
self.session.cookies.update(pickle.load(f))
Usage
scraper = AuthenticatedScraper(proxy="http://user:pass@proxy:8080")
scraper.login("https://example.com/login", "myuser", "mypass")
response = scraper.scrape("https://example.com/dashboard")
Method 2: Browser-Based Login (Playwright)
For sites with complex login flows, JavaScript challenges, or CAPTCHAs:
from playwright.sync_api import sync_playwright
import json
class BrowserAuthScraper:
def __init__(self, proxy=None):
self.playwright = sync_playwright().start()
launch_options = {"headless": True}
if proxy:
launch_options["proxy"] = {"server": proxy}
self.browser = self.playwright.chromium.launch(**launch_options)
self.context = self.browser.new_context()
self.page = self.context.new_page()
def login(self, login_url, username, password):
"""Login using browser automation."""
self.page.goto(login_url, wait_until="networkidle")
# Fill login form
self.page.fill('input[name="username"], input[type="email"]', username)
self.page.fill('input[name="password"], input[type="password"]', password)
# Click login button
self.page.click('button[type="submit"], input[type="submit"]')
# Wait for navigation after login
self.page.wait_for_load_state("networkidle")
# Verify login
if "dashboard" in self.page.url or "account" in self.page.url:
print("Login successful")
return True
print(f"Login may have failed. Current URL: {self.page.url}")
return False
def save_session(self, filepath):
"""Save browser session state (cookies + localStorage)."""
storage = self.context.storage_state()
with open(filepath, "w") as f:
json.dump(storage, f)
def load_session(self, filepath):
"""Restore a previously saved session."""
self.context.close()
self.context = self.browser.new_context(storage_state=filepath)
self.page = self.context.new_page()
def scrape(self, url):
"""Navigate to a page and extract content."""
self.page.goto(url, wait_until="networkidle")
return self.page.content()
def extract_cookies(self):
"""Get cookies for use with requests library."""
cookies = self.context.cookies()
return {c["name"]: c["value"] for c in cookies}
def close(self):
self.browser.close()
self.playwright.stop()
Usage
scraper = BrowserAuthScraper(proxy="http://user:pass@proxy:8080")
scraper.login("https://example.com/login", "user@email.com", "password123")
scraper.save_session("session.json")
Later, reuse the session
scraper2 = BrowserAuthScraper()
scraper2.load_session("session.json")
html = scraper2.scrape("https://example.com/protected-data")
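A useful hybrid: log in once with the browser, then transfer the cookies via extract_cookies() to a plain requests session for faster, lighter scraping. A sketch, assuming the site's auth relies only on cookies (it won't work if the site also checks browser fingerprints on every request):

import requests

scraper = BrowserAuthScraper()
scraper.login("https://example.com/login", "user@email.com", "password123")

# Hand the authenticated cookies to a lightweight requests session
fast_session = requests.Session()
fast_session.cookies.update(scraper.extract_cookies())
scraper.close()

html = fast_session.get("https://example.com/protected-data").text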
Method 3: API Authentication
For sites with REST/GraphQL APIs:
import requests
class APIAuthScraper:
def __init__(self, base_url, proxy=None):
self.base_url = base_url
self.session = requests.Session()
if proxy:
self.session.proxies = {"http": proxy, "https": proxy}
    def login_jwt(self, email, password):
        """Log in and store the JWT access token."""
        response = self.session.post(
            f"{self.base_url}/api/auth/login",
            json={"email": email, "password": password}
        )
        response.raise_for_status()
data = response.json()
self.token = data["token"]
self.refresh_token = data.get("refresh_token")
self.session.headers["Authorization"] = f"Bearer {self.token}"
return True
def refresh_auth(self):
"""Refresh expired JWT token."""
response = self.session.post(
f"{self.base_url}/api/auth/refresh",
json={"refresh_token": self.refresh_token}
)
data = response.json()
self.token = data["token"]
self.session.headers["Authorization"] = f"Bearer {self.token}"
def get(self, endpoint, params=None):
"""Make authenticated GET request with auto-refresh."""
response = self.session.get(
f"{self.base_url}{endpoint}",
params=params
)
if response.status_code == 401:
self.refresh_auth()
response = self.session.get(
f"{self.base_url}{endpoint}",
params=params
)
return response.json()
def get_paginated(self, endpoint, page_param="page", max_pages=100):
"""Fetch all pages of paginated API data."""
all_data = []
page = 1
while page <= max_pages:
data = self.get(endpoint, params={page_param: page})
items = data.get("results", data.get("data", []))
if not items:
break
all_data.extend(items)
page += 1
return all_data
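Usage mirrors the earlier methods; the base URL, credentials, and the "/api/orders" endpoint are placeholders for whatever the target API exposes:

scraper = APIAuthScraper("https://example.com", proxy="http://user:pass@proxy:8080")
scraper.login_jwt("user@email.com", "password123")
orders = scraper.get_paginated("/api/orders")
print(f"Fetched {len(orders)} records")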
Handling CSRF Tokens
CSRF tokens prevent forged form submissions. Many sites rotate them per session or even per page load, so always extract a fresh token immediately before submitting:
def handle_csrf_workflow(session, form_url, submit_url, form_data):
"""Complete CSRF-protected form submission workflow."""
# 1. Load the page with the form
page = session.get(form_url)
soup = BeautifulSoup(page.text, "lxml")
# 2. Extract CSRF token
csrf = None
# Check meta tags
meta = soup.find("meta", {"name": ["csrf-token", "_csrf"]})
if meta:
csrf = meta.get("content")
# Check hidden inputs
if not csrf:
for input_tag in soup.find_all("input", {"type": "hidden"}):
if "csrf" in input_tag.get("name", "").lower():
csrf = input_tag.get("value")
break
# Check cookies
if not csrf:
csrf = session.cookies.get("XSRF-TOKEN")
# 3. Submit with CSRF token
if csrf:
form_data["_csrf"] = csrf
session.headers["X-CSRF-Token"] = csrf
response = session.post(submit_url, data=form_data)
return response
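A usage sketch for the workflow above, with placeholder URLs and form fields:

import requests

session = requests.Session()
response = handle_csrf_workflow(
    session,
    form_url="https://example.com/settings",
    submit_url="https://example.com/settings/update",
    form_data={"display_name": "New Name"},
)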
Session Persistence Strategies
Cookie Jar Persistence
import http.cookiejar
import os
def create_persistent_session(cookie_file="cookies.txt"):
"""Create a session that persists cookies to disk."""
session = requests.Session()
cookie_jar = http.cookiejar.MozillaCookieJar(cookie_file)
if os.path.exists(cookie_file):
cookie_jar.load(ignore_discard=True, ignore_expires=True)
session.cookies = cookie_jar
return session
def save_session_cookies(session, cookie_file="cookies.txt"):
    """Save current cookies to the given file."""
    session.cookies.save(cookie_file, ignore_discard=True, ignore_expires=True)
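Together, the two helpers let a scraper survive restarts without re-logging in. A minimal sketch:

session = create_persistent_session("cookies.txt")
# ... log in only if the loaded cookies are no longer valid, then scrape ...
save_session_cookies(session)  # persists cookies for the next run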
Session Validation
def is_session_valid(session, check_url):
"""Check if the current session is still authenticated."""
response = session.get(check_url, allow_redirects=False)
# If redirected to login page, session expired
    if response.status_code in (301, 302, 303, 307):
location = response.headers.get("Location", "")
if "login" in location.lower():
return False
# If 401/403, session expired
if response.status_code in (401, 403):
return False
return True
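Pairing the check with a re-login makes sessions self-healing. A minimal sketch, assuming the AuthenticatedScraper class from Method 1; check_url is any page that requires authentication:

def ensure_authenticated(scraper, check_url, login_url, username, password):
    """Re-login only when the saved session has expired."""
    if not is_session_valid(scraper.session, check_url):
        scraper.login(login_url, username, password)
        scraper.save_cookies("cookies.pkl")
    return scraper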
Handling Multi-Factor Authentication
TOTP (Time-Based OTP)
import pyotp
def login_with_mfa(session, login_url, mfa_url, username, password, totp_secret):
"""Handle login with TOTP-based MFA."""
# Step 1: Submit credentials
session.post(login_url, data={
"username": username,
"password": password
})
# Step 2: Generate TOTP code
totp = pyotp.TOTP(totp_secret)
code = totp.now()
# Step 3: Submit MFA code
response = session.post(mfa_url, data={
"code": code
})
return response.status_code == 200
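The totp_secret is the base32 seed shown once during authenticator enrollment (often behind a "can't scan the QR code?" link). A usage sketch with placeholder URLs and the pyotp documentation's example seed:

import requests

session = requests.Session()
ok = login_with_mfa(
    session,
    login_url="https://example.com/login",
    mfa_url="https://example.com/mfa/verify",
    username="myuser",
    password="mypass",
    totp_secret="JBSWY3DPEHPK3PXP",  # example seed; never hard-code a real one
)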
Email/SMS OTP
For email or SMS-based OTP, you need access to the inbox or phone:
import imaplib
import email
import re
import time
def get_otp_from_email(imap_server, email_addr, password,
sender_filter, timeout=120):
"""Fetch OTP code from email inbox."""
mail = imaplib.IMAP4_SSL(imap_server)
mail.login(email_addr, password)
mail.select("inbox")
start = time.time()
while time.time() - start < timeout:
        _, messages = mail.search(None, f'(FROM "{sender_filter}" UNSEEN)')
msg_nums = messages[0].split()
if msg_nums:
_, msg_data = mail.fetch(msg_nums[-1], "(RFC822)")
            msg = email.message_from_bytes(msg_data[0][1])
            # OTP emails are often multipart; extract the text/plain part
            if msg.is_multipart():
                parts = [p for p in msg.walk()
                         if p.get_content_type() == "text/plain"]
                body = parts[0].get_payload(decode=True).decode() if parts else ""
            else:
                body = msg.get_payload(decode=True).decode()
# Extract 6-digit OTP
otp_match = re.search(r'\b(\d{6})\b', body)
if otp_match:
mail.logout()
return otp_match.group(1)
time.sleep(5)
mail.logout()
raise TimeoutError("OTP not received within timeout")
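Wiring the email fetch into a login flow might look like this; the URLs, form fields, and IMAP host are placeholders, and the IMAP password should come from the environment, not from code:

import os
import requests

session = requests.Session()
session.post("https://example.com/login",
             data={"username": "myuser", "password": "mypass"})
otp = get_otp_from_email("imap.example.com", "user@email.com",
                         os.environ["IMAP_PASSWORD"],
                         sender_filter="noreply@example.com")
session.post("https://example.com/mfa/verify", data={"code": otp})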
Proxy Considerations for Authenticated Scraping
When scraping behind login with proxies, keep these in mind:
- Sticky sessions: Use the same proxy IP for the entire login session. Switching IPs mid-session often triggers re-authentication or security alerts.
- Residential proxies: Sites are more likely to challenge datacenter IPs on login pages. Residential proxies look like regular users.
- Geographic consistency: If the account is registered in the US, use a US proxy. Logging in from a different country triggers security alerts.
# Sticky-session syntax varies by provider; many encode the session ID
# in the proxy username instead. One illustrative format:
proxy = "http://user:pass@gate.proxy.com:8080?session=user123&sticky=true"
Security and Ethical Considerations
- Never share or expose credentials in logs, version control, or error messages
- Use environment variables for all credentials
- Respect rate limits — authenticated scraping should be even more conservative
- Review Terms of Service — many sites explicitly prohibit automated access to logged-in areas
- Store credentials securely using a secrets manager (not in code)
import os
username = os.environ.get("SCRAPER_USERNAME")
password = os.environ.get("SCRAPER_PASSWORD")
if not username or not password:
raise ValueError("Set SCRAPER_USERNAME and SCRAPER_PASSWORD environment variables")
FAQ
How long do session cookies last?
It varies by site — from 30 minutes to several days. Check the cookie expiry time and implement auto-renewal logic. Most session cookies expire when the browser closes unless they’re “persistent” cookies.
Can I use one login session across multiple scrapers?
Generally no — sharing cookies across different IPs triggers security flags. Each scraper instance should maintain its own session with its own proxy IP.
How do I handle CAPTCHA on login pages?
Options: (1) Use residential proxies to reduce CAPTCHA triggers, (2) integrate a CAPTCHA-solving service, (3) solve CAPTCHAs manually once and save the session for reuse.
What if the site uses fingerprinting alongside authentication?
Combine browser-based login (Playwright) with anti-detection techniques. Maintain consistent browser fingerprints across sessions.
Should I scrape while logged in or use the public API?
Always prefer official APIs when available. Authenticated scraping should be a last resort when no API exists and the data is legitimately needed.
Conclusion
Authenticated web scraping requires careful session management, CSRF handling, and security awareness. Start with requests.Session() for simple cookie-based auth, use Playwright for complex login flows with JavaScript, and always persist sessions to avoid repeated logins. Pair with residential proxies and consistent IP sessions for the most reliable results.