Web Scraping Behind Login: Session Management

Many valuable datasets live behind authentication walls — dashboards, member directories, account settings, and gated content. Scraping these requires managing login flows, session cookies, CSRF tokens, and multi-factor authentication. This guide covers the core techniques for authenticated web scraping.

Understanding Authentication Mechanisms

Cookie-Based Sessions

The most common approach. After login, the server sets a session cookie that authenticates subsequent requests:

POST /login → Set-Cookie: session_id=abc123

GET /dashboard (Cookie: session_id=abc123) → 200 OK

Token-Based (JWT)

Modern SPAs often use JWT tokens stored in localStorage or sent via headers:

POST /api/auth/login → {"token": "eyJhbGci..."}

GET /api/data (Authorization: Bearer eyJhbGci...) → 200 OK
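Because a JWT carries its expiry in the payload, a scraper can decode it locally and refresh the token before requests start failing. A stdlib-only sketch; it reads the `exp` claim without verifying the signature, which is sufficient client-side:

```python
import base64
import json
import time

def jwt_expiry(token):
    """Return the 'exp' claim (unix timestamp) from a JWT, without verifying it."""
    payload_b64 = token.split(".")[1]
    # Restore the base64 padding that JWT encoding strips
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload.get("exp")

def is_token_expired(token, leeway=30):
    """True if the token expires within `leeway` seconds (or has no exp check possible)."""
    exp = jwt_expiry(token)
    return exp is not None and time.time() > exp - leeway
```

Calling `is_token_expired` before each batch of requests avoids burning a request just to discover a 401.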

OAuth 2.0

Third-party authentication through providers like Google, GitHub, Facebook. Requires handling redirect flows and token exchange.
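Scraping an OAuth-protected account usually means completing the authorization-code flow yourself. A minimal sketch with hypothetical endpoint URLs and client credentials; in practice the authorization step happens in a browser (or browser automation), and only the token exchange is plain HTTP:

```python
import secrets
from urllib.parse import urlencode

import requests

def build_authorize_url(auth_base, client_id, redirect_uri, scope):
    """Build the authorization URL the user (or browser automation) must visit."""
    state = secrets.token_urlsafe(16)  # CSRF protection for the callback
    params = {
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,
    }
    return f"{auth_base}?{urlencode(params)}", state

def exchange_code_for_token(token_url, client_id, client_secret, code, redirect_uri):
    """Trade the authorization code from the redirect for an access token."""
    response = requests.post(token_url, data={
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": redirect_uri,
        "client_id": client_id,
        "client_secret": client_secret,
    })
    return response.json()["access_token"]
```

The `state` value returned by `build_authorize_url` should be compared against the `state` parameter on the callback before exchanging the code.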

Method 1: Requests with Session Cookies

For traditional server-rendered sites:

import requests
from bs4 import BeautifulSoup

class AuthenticatedScraper:
    def __init__(self, proxy=None):
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
        })
        if proxy:
            self.session.proxies = {"http": proxy, "https": proxy}

    def login(self, login_url, username, password):
        """Handle standard form-based login."""
        # Step 1: Get the login page (for CSRF token)
        login_page = self.session.get(login_url)
        soup = BeautifulSoup(login_page.text, "lxml")

        # Step 2: Extract CSRF token
        csrf_token = self._extract_csrf(soup)

        # Step 3: Submit login form
        login_data = {
            "username": username,
            "password": password,
        }
        if csrf_token:
            login_data["_csrf"] = csrf_token  # or "csrf_token", "authenticity_token", etc.

        response = self.session.post(
            login_url,
            data=login_data,
            allow_redirects=True
        )

        # Step 4: Verify login success
        if response.status_code == 200 and "logout" in response.text.lower():
            print("Login successful")
            return True
        else:
            print(f"Login failed: {response.status_code}")
            return False

    def _extract_csrf(self, soup):
        """Find CSRF token in common locations."""
        # Meta tag
        meta = soup.find("meta", {"name": "csrf-token"})
        if meta:
            return meta["content"]
        # Hidden input field
        for name in ["_csrf", "csrf_token", "authenticity_token", "_token"]:
            field = soup.find("input", {"name": name})
            if field:
                return field.get("value")
        return None

    def scrape(self, url):
        """Scrape a page using the authenticated session."""
        response = self.session.get(url)
        if response.status_code in (401, 403):
            raise Exception("Session expired or unauthorized")
        return response

    def save_cookies(self, filepath):
        """Save session cookies for reuse."""
        import pickle
        with open(filepath, "wb") as f:
            pickle.dump(self.session.cookies, f)

    def load_cookies(self, filepath):
        """Load previously saved cookies."""
        import pickle
        with open(filepath, "rb") as f:
            self.session.cookies.update(pickle.load(f))

Usage

scraper = AuthenticatedScraper(proxy="http://user:pass@proxy:8080")
scraper.login("https://example.com/login", "myuser", "mypass")
response = scraper.scrape("https://example.com/dashboard")

Method 2: Browser-Based Login (Playwright)

For sites with complex login flows, JavaScript challenges, or CAPTCHAs:

from playwright.sync_api import sync_playwright
import json

class BrowserAuthScraper:
    def __init__(self, proxy=None):
        self.playwright = sync_playwright().start()
        launch_options = {"headless": True}
        if proxy:
            launch_options["proxy"] = {"server": proxy}
        self.browser = self.playwright.chromium.launch(**launch_options)
        self.context = self.browser.new_context()
        self.page = self.context.new_page()

    def login(self, login_url, username, password):
        """Login using browser automation."""
        self.page.goto(login_url, wait_until="networkidle")

        # Fill login form
        self.page.fill('input[name="username"], input[type="email"]', username)
        self.page.fill('input[name="password"], input[type="password"]', password)

        # Click login button
        self.page.click('button[type="submit"], input[type="submit"]')

        # Wait for navigation after login
        self.page.wait_for_load_state("networkidle")

        # Verify login
        if "dashboard" in self.page.url or "account" in self.page.url:
            print("Login successful")
            return True
        print(f"Login may have failed. Current URL: {self.page.url}")
        return False

    def save_session(self, filepath):
        """Save browser session state (cookies + localStorage)."""
        storage = self.context.storage_state()
        with open(filepath, "w") as f:
            json.dump(storage, f)

    def load_session(self, filepath):
        """Restore a previously saved session."""
        self.context.close()
        self.context = self.browser.new_context(storage_state=filepath)
        self.page = self.context.new_page()

    def scrape(self, url):
        """Navigate to a page and extract content."""
        self.page.goto(url, wait_until="networkidle")
        return self.page.content()

    def extract_cookies(self):
        """Get cookies for use with the requests library."""
        cookies = self.context.cookies()
        return {c["name"]: c["value"] for c in cookies}

    def close(self):
        self.browser.close()
        self.playwright.stop()

Usage

scraper = BrowserAuthScraper(proxy="http://user:pass@proxy:8080")
scraper.login("https://example.com/login", "user@email.com", "password123")
scraper.save_session("session.json")

# Later, reuse the session
scraper2 = BrowserAuthScraper()
scraper2.load_session("session.json")
html = scraper2.scrape("https://example.com/protected-data")
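A useful hybrid is to log in once with the browser, then scrape at volume with plain `requests`, which is far lighter than driving a full browser per page. A sketch of the handoff, assuming cookies in Playwright's export format (a list of dicts with `name`, `value`, `domain`, `path`, as returned by `context.cookies()`):

```python
import requests

def cookies_to_requests_session(playwright_cookies, user_agent=None):
    """Build a requests.Session carrying cookies exported from a Playwright context."""
    session = requests.Session()
    for c in playwright_cookies:
        session.cookies.set(
            c["name"], c["value"],
            domain=c.get("domain", ""),
            path=c.get("path", "/"),
        )
    if user_agent:
        # Keep the User-Agent consistent with the browser that logged in
        session.headers["User-Agent"] = user_agent
    return session
```

With the class above, the handoff would look like `fast_session = cookies_to_requests_session(scraper.context.cookies())`. Note this only works for cookie-authenticated sites; tokens held in localStorage do not transfer this way.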

Method 3: API Authentication

For sites with REST/GraphQL APIs:

import requests

class APIAuthScraper:
    def __init__(self, base_url, proxy=None):
        self.base_url = base_url
        self.session = requests.Session()
        if proxy:
            self.session.proxies = {"http": proxy, "https": proxy}

    def login_jwt(self, username, password):
        """Login and get a JWT token."""
        response = self.session.post(
            f"{self.base_url}/api/auth/login",
            json={"email": username, "password": password}
        )
        data = response.json()
        self.token = data["token"]
        self.refresh_token = data.get("refresh_token")
        self.session.headers["Authorization"] = f"Bearer {self.token}"
        return True

    def refresh_auth(self):
        """Refresh an expired JWT token."""
        response = self.session.post(
            f"{self.base_url}/api/auth/refresh",
            json={"refresh_token": self.refresh_token}
        )
        data = response.json()
        self.token = data["token"]
        self.session.headers["Authorization"] = f"Bearer {self.token}"

    def get(self, endpoint, params=None):
        """Make an authenticated GET request with auto-refresh."""
        response = self.session.get(
            f"{self.base_url}{endpoint}",
            params=params
        )
        if response.status_code == 401:
            self.refresh_auth()
            response = self.session.get(
                f"{self.base_url}{endpoint}",
                params=params
            )
        return response.json()

    def get_paginated(self, endpoint, page_param="page", max_pages=100):
        """Fetch all pages of paginated API data."""
        all_data = []
        page = 1
        while page <= max_pages:
            data = self.get(endpoint, params={page_param: page})
            items = data.get("results", data.get("data", []))
            if not items:
                break
            all_data.extend(items)
            page += 1
        return all_data
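The class above targets REST endpoints. GraphQL APIs instead expose a single POST endpoint that takes a JSON body with `query` and optional `variables`. A sketch with a hypothetical `/api/graphql` path; the commented usage reuses the authenticated session from `APIAuthScraper`:

```python
def build_graphql_payload(query, variables=None):
    """Assemble the JSON body a GraphQL endpoint expects."""
    payload = {"query": query}
    if variables:
        payload["variables"] = variables
    return payload

# With an authenticated APIAuthScraper instance, a query might run as:
# data = scraper.session.post(
#     f"{scraper.base_url}/api/graphql",
#     json=build_graphql_payload(
#         "query($n: Int!) { members(first: $n) { name } }",
#         {"n": 50},
#     ),
# ).json()
```

Pagination in GraphQL is usually cursor-based (`pageInfo.endCursor`), so the page-counter loop above would be replaced by passing the last cursor back as a variable.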

Handling CSRF Tokens

CSRF tokens prevent unauthorized form submissions. Many sites rotate them per session or even per page load, so extract a fresh token right before each submission:

def handle_csrf_workflow(session, form_url, submit_url, form_data):
    """Complete a CSRF-protected form submission workflow."""
    # 1. Load the page with the form
    page = session.get(form_url)
    soup = BeautifulSoup(page.text, "lxml")

    # 2. Extract the CSRF token
    csrf = None

    # Check meta tags
    meta = soup.find("meta", {"name": ["csrf-token", "_csrf"]})
    if meta:
        csrf = meta.get("content")

    # Check hidden inputs
    if not csrf:
        for input_tag in soup.find_all("input", {"type": "hidden"}):
            if "csrf" in input_tag.get("name", "").lower():
                csrf = input_tag.get("value")
                break

    # Check cookies
    if not csrf:
        csrf = session.cookies.get("XSRF-TOKEN")

    # 3. Submit with the CSRF token
    if csrf:
        form_data["_csrf"] = csrf
        session.headers["X-CSRF-Token"] = csrf

    response = session.post(submit_url, data=form_data)
    return response

Session Persistence Strategies

Cookie Jar Persistence

import http.cookiejar
import os

import requests

def create_persistent_session(cookie_file="cookies.txt"):
    """Create a session that persists cookies to disk."""
    session = requests.Session()
    cookie_jar = http.cookiejar.MozillaCookieJar(cookie_file)
    if os.path.exists(cookie_file):
        cookie_jar.load(ignore_discard=True, ignore_expires=True)
    session.cookies = cookie_jar
    return session

def save_session_cookies(session, cookie_file="cookies.txt"):
    """Save the current cookies to file."""
    session.cookies.save(cookie_file, ignore_discard=True, ignore_expires=True)

Session Validation

def is_session_valid(session, check_url):
    """Check if the current session is still authenticated."""
    response = session.get(check_url, allow_redirects=False)

    # If redirected to the login page, the session expired
    if response.status_code in (301, 302):
        location = response.headers.get("Location", "")
        if "login" in location.lower():
            return False

    # If 401/403, the session expired
    if response.status_code in (401, 403):
        return False

    return True
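Validation is most useful when paired with automatic recovery: if a fetch fails because the session expired, log in once and retry. A minimal sketch, assuming a scraper object exposing `login()` and `scrape()` like the `AuthenticatedScraper` above (whose `scrape()` raises on 401/403):

```python
def scrape_with_relogin(scraper, url, login_url, username, password):
    """Fetch a page, re-authenticating once if the session has expired."""
    try:
        return scraper.scrape(url)
    except Exception:
        # Session likely expired: log in again and retry exactly once
        if not scraper.login(login_url, username, password):
            raise
        return scraper.scrape(url)
```

Retrying only once matters: if re-login also fails, the credentials or the flow are broken, and looping would just hammer the login endpoint.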

Handling Multi-Factor Authentication

TOTP (Time-Based OTP)

import pyotp

def login_with_mfa(session, login_url, mfa_url, username, password, totp_secret):
    """Handle login with TOTP-based MFA."""
    # Step 1: Submit credentials
    session.post(login_url, data={
        "username": username,
        "password": password
    })

    # Step 2: Generate the TOTP code
    totp = pyotp.TOTP(totp_secret)
    code = totp.now()

    # Step 3: Submit the MFA code
    response = session.post(mfa_url, data={
        "code": code
    })
    return response.status_code == 200

Email/SMS OTP

For email or SMS-based OTP, you need access to the inbox or phone:

import imaplib
import email
import re
import time

def get_otp_from_email(imap_server, email_addr, password,
                       sender_filter, timeout=120):
    """Fetch an OTP code from an email inbox."""
    mail = imaplib.IMAP4_SSL(imap_server)
    mail.login(email_addr, password)
    mail.select("inbox")

    start = time.time()
    while time.time() - start < timeout:
        _, messages = mail.search(None, f'FROM "{sender_filter}" UNSEEN')
        msg_nums = messages[0].split()
        if msg_nums:
            _, msg_data = mail.fetch(msg_nums[-1], "(RFC822)")
            msg = email.message_from_bytes(msg_data[0][1])
            # Handle both multipart and plain messages
            if msg.is_multipart():
                body = "".join(
                    part.get_payload(decode=True).decode(errors="ignore")
                    for part in msg.walk()
                    if part.get_content_type() == "text/plain"
                )
            else:
                body = msg.get_payload(decode=True).decode(errors="ignore")
            # Extract a 6-digit OTP
            otp_match = re.search(r'\b(\d{6})\b', body)
            if otp_match:
                mail.logout()
                return otp_match.group(1)
        time.sleep(5)

    mail.logout()
    raise TimeoutError("OTP not received within timeout")
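The fetcher above slots into a two-step login flow. A sketch with hypothetical form-field names (`username`, `password`, `code`) and a separate verification URL; `fetch_otp` is any zero-argument callable, e.g. a lambda wrapping `get_otp_from_email`, so SMS gateways plug in the same way:

```python
def login_with_otp(session, login_url, verify_url, username, password, fetch_otp):
    """Two-step login: submit credentials, then the OTP from a callable source."""
    # Step 1: Credentials trigger the site to send the OTP
    session.post(login_url, data={"username": username, "password": password})
    # Step 2: Retrieve the code from the configured source
    code = fetch_otp()
    # Step 3: Submit the code to complete authentication
    response = session.post(verify_url, data={"code": code})
    return response.status_code == 200
```

Usage might look like `login_with_otp(session, login_url, verify_url, user, pw, lambda: get_otp_from_email(server, inbox, inbox_pw, "noreply@example.com"))`.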

Proxy Considerations for Authenticated Scraping

When scraping behind login with proxies, keep these in mind:

  1. Sticky sessions: Use the same proxy IP for the entire login session. Switching IPs mid-session often triggers re-authentication or security alerts.
  2. Residential proxies: Sites are more likely to challenge datacenter IPs on login pages. Residential proxies look like regular users.
  3. Geographic consistency: If the account is registered in the US, use a US proxy. Logging in from a different country triggers security alerts.

# Use sticky sessions with your proxy provider
proxy = "http://user:pass@gate.proxy.com:8080?session=user123&sticky=true"

Security and Ethical Considerations

  • Never share or expose credentials in logs, version control, or error messages
  • Use environment variables for all credentials
  • Respect rate limits — authenticated scraping should be even more conservative
  • Review Terms of Service — many sites explicitly prohibit automated access to logged-in areas
  • Store credentials securely using a secrets manager (not in code)
import os

username = os.environ.get("SCRAPER_USERNAME")
password = os.environ.get("SCRAPER_PASSWORD")

if not username or not password:
    raise ValueError("Set SCRAPER_USERNAME and SCRAPER_PASSWORD environment variables")

FAQ

How long do session cookies last?

It varies by site — from 30 minutes to several days. Check the cookie expiry time and implement auto-renewal logic. Most session cookies expire when the browser closes unless they’re “persistent” cookies.

Can I use one login session across multiple scrapers?

Generally no — sharing cookies across different IPs triggers security flags. Each scraper instance should maintain its own session with its own proxy IP.

How do I handle CAPTCHA on login pages?

Options: (1) Use residential proxies to reduce CAPTCHA triggers, (2) integrate a CAPTCHA-solving service, (3) solve CAPTCHAs manually once and save the session for reuse.

What if the site uses fingerprinting alongside authentication?

Combine browser-based login (Playwright) with anti-detection techniques. Maintain consistent browser fingerprints across sessions.

Should I scrape while logged in or use the public API?

Always prefer official APIs when available. Authenticated scraping should be a last resort when no API exists and the data is legitimately needed.

Conclusion

Authenticated web scraping requires careful session management, CSRF handling, and security awareness. Start with requests.Session() for simple cookie-based auth, use Playwright for complex login flows with JavaScript, and always persist sessions to avoid repeated logins. Pair with residential proxies and consistent IP sessions for the most reliable results.
