403 Forbidden in Web Scraping: How to Fix It

The 403 Forbidden status code is the most common error web scrapers encounter. It means the server understood your request but refuses to fulfill it. In the context of web scraping, this almost always means the server has identified your request as automated and is blocking it.

This guide covers every common cause of 403 errors and provides tested solutions for each scenario.

What Causes 403 Forbidden in Web Scraping?

1. Missing or Incorrect User-Agent

The most frequent cause. Python’s requests library sends python-requests/2.x.x as its default User-Agent, a value no real browser would ever use. (A quick way to see exactly what your client sends is shown at the end of this list.)

2. Missing Browser Headers

Real browsers send 15-20 headers with every request. Sending just a User-Agent is suspicious.

3. IP Reputation

Your IP address is flagged — either because it’s a datacenter IP, it’s on a blacklist, or you’ve made too many requests.

4. TLS Fingerprint Mismatch

Your TLS handshake doesn’t match any real browser, even if your headers look correct.

5. Geographic Restrictions

The content is restricted to specific countries or regions.

6. Authentication Required

The page requires login or API authentication.

7. Rate Limiting

You’ve exceeded the server’s request rate threshold.

8. WAF Rules

Web Application Firewall rules are blocking your specific request pattern.

9. robots.txt Enforcement

Some servers actively block requests to URLs disallowed by robots.txt.

10. Referrer Checking

The server expects a valid Referer header from navigation within the site.
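
To see how little a default HTTP client actually sends, you can echo your own request back. The snippet below uses httpbin.org, a public request-echo service (any equivalent endpoint works), to print the headers the server receives from a bare requests call:

import requests

# Ask a public echo service to report the headers it received from us.
response = requests.get("https://httpbin.org/headers", timeout=15)
for name, value in response.json()["headers"].items():
    print(f"{name}: {value}")
# Expect a User-Agent like "python-requests/2.x.x" and only a handful of
# headers -- far fewer than the 15-20 a real browser sends.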

Diagnosing the Cause

Before applying fixes, identify the specific cause:

import requests

def diagnose_403(url):
    """Diagnose why a URL returns 403."""

    tests = {
        "default": {},
        "with_ua": {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"},
        "with_full_headers": {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "none",
        }
    }

    for test_name, headers in tests.items():
        try:
            response = requests.get(url, headers=headers, timeout=15)
            print(f"{test_name}: {response.status_code} ({len(response.text)} bytes)")

            if response.status_code == 200:
                print(f"  -> Fix found: {test_name}")
                return test_name
        except Exception as e:
            print(f"{test_name}: Error - {e}")

    print("Headers alone don't fix it. Try TLS matching or proxies.")
    return None

diagnose_403("https://target-site.com")

Fix 1: Set Proper Browser Headers

The simplest and most common fix. Replace minimal headers with a complete browser header set:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Ch-Ua": '"Chromium";v="122", "Not(A:Brand";v="24", "Google Chrome";v="122"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Cache-Control": "max-age=0",
}

response = requests.get("https://target-site.com", headers=headers)
print(response.status_code)

Success rate: This alone fixes 30-40% of 403 errors.

Fix 2: Match TLS Fingerprint

If headers alone don’t work, the server is likely checking TLS fingerprints:

from curl_cffi import requests

response = requests.get(
    "https://target-site.com",
    impersonate="chrome"
)

print(response.status_code)

Success rate: Fixes an additional 20-30% of 403 errors.

For a detailed explanation, see our TLS/JA3 fingerprinting guide.
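
To confirm that TLS fingerprinting is the culprit, compare what a plain HTTP client and curl_cffi present. The sketch below assumes a public fingerprint echo endpoint (tls.peet.ws is one such service at the time of writing; substitute any similar one) and simply prints what each client is reported as:

from curl_cffi import requests as curl_requests
import requests as plain_requests

# Example fingerprint echo endpoint -- swap in whichever service you prefer.
FP_URL = "https://tls.peet.ws/api/all"

plain = plain_requests.get(FP_URL, timeout=15)
impersonated = curl_requests.get(FP_URL, impersonate="chrome", timeout=15)

# The two clients should report clearly different TLS fingerprints:
# plain requests won't match any browser, while curl_cffi should look
# like a recent Chrome.
print("plain requests:", plain.text[:300])
print("curl_cffi:     ", impersonated.text[:300])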

Fix 3: Use Residential Proxies

If TLS matching doesn’t help, the issue is likely IP-based:

from curl_cffi import requests

proxies = {
    "http": "http://user:pass@residential-proxy.com:7777",
    "https": "http://user:pass@residential-proxy.com:7777"
}

response = requests.get(
    "https://target-site.com",
    impersonate="chrome",
    proxies=proxies
)

print(response.status_code)

Success rate: Fixes most remaining 403 errors when combined with proper headers and TLS matching.

See our proxy provider reviews for recommendations.
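
If a single exit IP keeps getting blocked, rotating through a small pool helps. A minimal sketch, assuming placeholder proxy URLs that you replace with your provider's endpoints:

from curl_cffi import requests
import random

# Placeholder proxy URLs -- replace with your provider's credentials.
proxy_pool = [
    "http://user:pass@residential-proxy.com:7777",
    "http://user:pass@residential-proxy.com:7778",
    "http://user:pass@residential-proxy.com:7779",
]

def fetch_with_rotation(url, max_attempts=3):
    """Retry a URL through different proxies until one gets past the block."""
    response = None
    for proxy in random.sample(proxy_pool, k=min(max_attempts, len(proxy_pool))):
        response = requests.get(
            url,
            impersonate="chrome",
            proxies={"http": proxy, "https": proxy},
            timeout=15,
        )
        if response.status_code != 403:
            break
    return response

response = fetch_with_rotation("https://target-site.com/products")
print(response.status_code)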

Fix 4: Add Referrer Header

Some sites check that requests come from within their own site:

headers["Referer"] = "https://target-site.com/"
headers["Origin"] = "https://target-site.com"

response = requests.get("https://target-site.com/products", headers=headers)

For AJAX/API requests, the referrer should be the page that triggered the request:

# When accessing an API endpoint called by a specific page
api_headers = {
    **headers,
    "Referer": "https://target-site.com/search-page",
    "X-Requested-With": "XMLHttpRequest",
    "Accept": "application/json"
}
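
Continuing the snippet above, the request itself then looks like a normal in-page API call (the endpoint path here is illustrative):

response = requests.get(
    "https://target-site.com/api/search?q=keyword",
    headers=api_headers,
    timeout=15,
)
print(response.status_code, response.headers.get("Content-Type"))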

Fix 5: Handle Cookies Properly

Many sites set required cookies on the homepage that must be present for subsequent requests:

from curl_cffi import requests

session = requests.Session(impersonate="chrome")

# Step 1: Visit homepage to get cookies
home_response = session.get("https://target-site.com/")
print(f"Homepage: {home_response.status_code}")
print(f"Cookies: {dict(session.cookies)}")

# Step 2: Now access the target page (with cookies)
target_response = session.get("https://target-site.com/products")
print(f"Target: {target_response.status_code}")

Fix 6: Use Browser Automation

When all HTTP-level approaches fail, use a real browser:

import undetected_chromedriver as uc
import time

driver = uc.Chrome()
driver.get("https://target-site.com/products")
time.sleep(5)

if "403" not in driver.title:
    print("Success!")
    content = driver.page_source
    # Extract cookies for future use
    cookies = {c['name']: c['value'] for c in driver.get_cookies()}

driver.quit()
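
Once the browser has a working session, you can usually reuse its cookies in a faster HTTP client for subsequent requests, as long as the site does not also bind them to the browser's exact fingerprint. A minimal sketch using the cookies dict extracted above:

from curl_cffi import requests

# Hand the browser's cookies to a lightweight HTTP session.
session = requests.Session(impersonate="chrome")
response = session.get("https://target-site.com/products", cookies=cookies)
print(response.status_code)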

See our Undetected ChromeDriver tutorial for full setup.

Fix 7: Rotate User-Agents

Using the same User-Agent for thousands of requests gets flagged:

import random

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.3 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
]

def get_random_headers():
    ua = random.choice(user_agents)
    return {
        "User-Agent": ua,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
    }

For a comprehensive approach, see our User-Agent rotation guide.

Fix 8: Slow Down Request Rate

403 can be a response to rate limiting (instead of the more explicit 429):

import time
import random

for url in urls:
    response = session.get(url, headers=get_random_headers())

    if response.status_code == 403:
        print("Got 403, backing off...")
        time.sleep(random.uniform(30, 60))  # Long pause
        continue

    # Normal delay between requests
    time.sleep(random.uniform(2, 5))
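
Note that the loop above skips a blocked URL after backing off. If you need every URL, retrying with exponential backoff is a reasonable pattern; a minimal sketch reusing the same session and headers:

import time
import random

def fetch_with_backoff(session, url, headers, max_retries=4):
    """Retry with exponentially growing pauses while the server returns 403."""
    response = None
    for attempt in range(max_retries):
        response = session.get(url, headers=headers)
        if response.status_code != 403:
            break
        sleep_for = 5 * (2 ** attempt) + random.uniform(0, 3)
        print(f"403 on attempt {attempt + 1}, sleeping {sleep_for:.0f}s")
        time.sleep(sleep_for)
    return response

# Usage: response = fetch_with_backoff(session, url, get_random_headers())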

For detailed rate limiting strategies, see our rate limiting guide.

Fix 9: Use the Site’s API

Check for API endpoints that might be less restricted:

# Open browser DevTools > Network tab
# Navigate the site normally
# Look for XHR/Fetch requests to API endpoints

# Common patterns:
api_endpoints = [
    "https://target-site.com/api/v1/products",
    "https://target-site.com/api/search",
    "https://api.target-site.com/v2/items",
    "https://target-site.com/graphql",
]

for endpoint in api_endpoints:
    response = session.get(endpoint, headers={
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",
    })
    print(f"{endpoint}: {response.status_code}")

Fix 10: Change Request Method

Some WAF rules only block specific HTTP methods:

# If GET is blocked, try POST
response = session.post(
    "https://target-site.com/api/search",
    json={"query": "keyword"},
    headers=headers
)

# Or try HEAD first to check accessibility
head_response = session.head("https://target-site.com/page", headers=headers)
print(f"HEAD: {head_response.status_code}")

Systematic Fix Approach

Apply fixes in order from simplest to most complex:

from curl_cffi import requests
import undetected_chromedriver as uc
import time

def resilient_fetch(url, proxy=None):
    """Try increasingly sophisticated methods to fetch a URL."""

    proxies = {"http": proxy, "https": proxy} if proxy else None

    # Level 1: curl_cffi with browser impersonation
    try:
        response = requests.get(
            url,
            impersonate="chrome",
            proxies=proxies,
            timeout=15
        )
        if response.status_code == 200:
            return {"method": "curl_cffi", "content": response.text}
    except Exception:
        pass

    # Level 2: curl_cffi with session (cookie accumulation)
    try:
        session = requests.Session(impersonate="chrome")
        if proxies:
            session.proxies = proxies

        from urllib.parse import urlparse
        parsed = urlparse(url)
        base_url = f"{parsed.scheme}://{parsed.netloc}"

        session.get(base_url)  # Get cookies
        time.sleep(1)

        response = session.get(url)
        if response.status_code == 200:
            return {"method": "curl_cffi_session", "content": response.text}
    except Exception:
        pass

    # Level 3: Browser automation
    try:
        options = uc.ChromeOptions()
        if proxy:
            options.add_argument(f"--proxy-server={proxy}")

        driver = uc.Chrome(options=options)
        driver.get(url)
        time.sleep(8)

        content = driver.page_source
        driver.quit()

        if len(content) > 1000:
            return {"method": "browser", "content": content}
    except Exception:
        pass

    return None

result = resilient_fetch(
    "https://target-site.com/products",
    proxy="http://user:pass@residential-proxy:7777"
)

if result:
    print(f"Success with {result['method']}: {len(result['content'])} bytes")

Common 403 Patterns by Website Type

E-Commerce Sites (Amazon, eBay, etc.)

Typically check User-Agent, rate, and IP reputation. Fix with residential proxies + proper headers.

Social Media (LinkedIn, Facebook, etc.)

Strict authentication requirements. Usually need logged-in sessions + browser automation.

News Sites

Often use CDN-level blocks. Fix with TLS fingerprint matching + geographic proxies.

Government/Public Data Sites

Frequently block non-standard User-Agents. Fix with proper headers alone.

Conclusion

403 Forbidden errors in web scraping have many causes, but they follow a predictable hierarchy. Start with the simplest fix (proper headers), escalate to TLS fingerprint matching, add residential proxies if needed, and fall back to browser automation for the most protected sites.

The systematic approach outlined above handles 95%+ of 403 errors. For the remaining edge cases, site-specific analysis of headers, cookies, and API endpoints is needed.

For related guides, see our articles on Cloudflare Error 1020, IP bans, and how websites detect bots.

