WhatWaf: Detect WAF for Web Scraping Success
when your web scraper hits a wall, the first question is: what is blocking you? Web Application Firewalls (WAFs) like Cloudflare, Akamai, and DataDome are the most common obstacles, and each one requires a different bypass strategy. WhatWaf is an open-source tool that identifies which WAF is protecting a target website, so you can choose the right approach instead of wasting time on trial and error.
this guide shows you how to use WhatWaf for WAF detection, interpret its results, and plan your scraping strategy based on what you find.
What is a WAF and Why It Matters for Scraping
a Web Application Firewall sits between your scraper and the website’s server. it analyzes incoming requests and decides whether they come from a real user or an automated tool. if it suspects automation, it blocks the request, serves a CAPTCHA, or returns a challenge page.
the problem for scrapers is that different WAFs use different detection methods:
- Cloudflare: JavaScript challenges, browser fingerprinting, behavioral analysis
- Akamai Bot Manager: device fingerprinting, sensor data collection, ML-based detection
- DataDome: real-time behavioral analysis, CAPTCHA challenges
- PerimeterX (now HUMAN): behavioral biometrics, proof-of-work challenges
- AWS WAF: rule-based filtering, rate limiting, IP reputation
- Imperva (Incapsula): cookie challenges, JavaScript validation
knowing which WAF you are dealing with determines your entire scraping strategy: which headers to send, whether you need a real browser, what proxy type to use, and whether you need to solve CAPTCHAs.
Installing WhatWaf
WhatWaf is a Python-based tool available on GitHub.
# clone the repository
git clone https://github.com/Ekultek/WhatWaf.git
cd WhatWaf
# install dependencies
pip install -r requirements.txt
# or install directly with pip
pip install whatwaf
verify the installation:
python whatwaf.py --help
Basic WAF Detection
the simplest usage is pointing WhatWaf at a URL:
python whatwaf.py -u https://target-website.com
WhatWaf sends a series of test requests with payloads designed to trigger WAF responses. it then analyzes the responses to identify which WAF is in place.
example output:
[*] checking https://target-website.com
[+] results for target-website.com:
[+] detected firewall: Cloudflare (Cloudflare Inc.)
[+] detection method: response headers, status codes
[+] confidence: high
Common Options
# scan with custom headers
python whatwaf.py -u https://target.com --headers "User-Agent: Mozilla/5.0"
# scan multiple URLs from a file
python whatwaf.py -l urls.txt
# use a proxy for scanning
python whatwaf.py -u https://target.com --proxy http://user:pass@proxy.com:8080
# increase verbosity
python whatwaf.py -u https://target.com -v
# specify request timeout
python whatwaf.py -u https://target.com --timeout 30
# use random user agent
python whatwaf.py -u https://target.com --ra
Integrating WhatWaf with Python Scraping Workflows
you can automate WAF detection as part of your scraping pipeline to choose the right strategy automatically.
Method 1: Command-Line Integration
import subprocess
import json
import re
def detect_waf_cli(url, proxy=None):
"""detect WAF using WhatWaf CLI."""
cmd = ["python", "whatwaf.py", "-u", url, "--json"]
if proxy:
cmd.extend(["--proxy", proxy])
result = subprocess.run(
cmd,
capture_output=True,
text=True,
timeout=60
)
# parse output
output = result.stdout
waf_info = {
"url": url,
"waf_detected": False,
"waf_name": None,
"confidence": None
}
if "detected firewall" in output.lower():
waf_info["waf_detected"] = True
# extract WAF name from output
match = re.search(r"detected firewall:\s*(.+?)(?:\n|$)", output, re.IGNORECASE)
if match:
waf_info["waf_name"] = match.group(1).strip()
return waf_info
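a quick usage sketch; the URL and proxy are placeholders, and it assumes whatwaf.py sits in your working directory:

# example call (placeholder URL and proxy)
info = detect_waf_cli("https://target-website.com", proxy="http://user:pass@proxy.example.com:8080")
if info["waf_detected"]:
    print(f"WAF identified: {info['waf_name']}")
else:
    print("no WAF reported by WhatWaf")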
Method 2: HTTP Header Analysis (DIY Detection)
you do not always need WhatWaf. many WAFs leave identifiable signatures in HTTP headers.
import httpx
class WAFDetector:
"""detect common WAFs by analyzing HTTP response headers."""
WAF_SIGNATURES = {
"Cloudflare": {
"headers": {"server": "cloudflare", "cf-ray": "*"},
"cookies": ["__cf_bm", "cf_clearance"],
},
"Akamai": {
"headers": {"server": "akamaighost", "x-akamai-transformed": "*"},
"cookies": ["_abck", "ak_bmsc"],
},
"DataDome": {
"headers": {"x-datadome": "*", "server": "datadome"},
"cookies": ["datadome"],
},
"PerimeterX": {
"headers": {},
"cookies": ["_px3", "_pxhd", "_pxvid"],
},
"Imperva": {
"headers": {"x-iinfo": "*", "x-cdn": "imperva"},
"cookies": ["incap_ses_", "visid_incap_"],
},
"AWS WAF": {
"headers": {"x-amzn-waf-action": "*"},
"cookies": ["aws-waf-token"],
},
"Sucuri": {
"headers": {"server": "sucuri", "x-sucuri-id": "*"},
"cookies": ["sucuri_cloudproxy_uuid"],
},
"Fastly": {
"headers": {"x-fastly-request-id": "*", "via": "varnish"},
"cookies": [],
}
}
def __init__(self, proxy=None):
self.proxy = proxy
self.headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}
def detect(self, url):
"""detect WAF by checking response headers and cookies."""
try:
with httpx.Client(
proxy=self.proxy,
headers=self.headers,
follow_redirects=True,
timeout=30
) as client:
response = client.get(url)
return self._analyze_response(url, response)
except Exception as e:
return {"url": url, "error": str(e), "waf_detected": False}
def _analyze_response(self, url, response):
"""analyze response for WAF signatures."""
detected = []
resp_headers = {k.lower(): v.lower() for k, v in response.headers.items()}
resp_cookies = [c.lower() for c in response.cookies.keys()]
for waf_name, signatures in self.WAF_SIGNATURES.items():
score = 0
# check headers
for header, expected in signatures["headers"].items():
if header in resp_headers:
if expected == "*" or expected in resp_headers[header]:
score += 2
# check cookies
for cookie_pattern in signatures["cookies"]:
for actual_cookie in resp_cookies:
if cookie_pattern in actual_cookie:
score += 1
if score >= 2:
detected.append({
"waf": waf_name,
"confidence": "high" if score >= 3 else "medium",
"score": score
})
return {
"url": url,
"status_code": response.status_code,
"waf_detected": len(detected) > 0,
"waf_services": detected,
"all_headers": dict(response.headers),
}
# usage
detector = WAFDetector(proxy="http://user:pass@proxy.example.com:8080")
result = detector.detect("https://target-website.com")
if result["waf_detected"]:
for waf in result["waf_services"]:
print(f"detected: {waf['waf']} (confidence: {waf['confidence']})")
else:
print("no WAF detected")
Scraping Strategies by WAF Type
once you know which WAF protects your target, here is how to approach it.
Cloudflare
Cloudflare is the most common WAF you will encounter. it serves JavaScript challenges and Turnstile challenges (its CAPTCHA replacement).
# strategy: use a stealth browser + residential proxies
from playwright.sync_api import sync_playwright
def scrape_cloudflare_site(url, proxy):
"""scrape a Cloudflare-protected site."""
with sync_playwright() as p:
browser = p.chromium.launch(
headless=False, # headed mode passes more checks
proxy={"server": proxy}
)
context = browser.new_context(
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
viewport={"width": 1920, "height": 1080},
)
page = context.new_page()
page.goto(url)
# wait for Cloudflare challenge to resolve
page.wait_for_timeout(5000)
        # check whether the challenge resolved, then close the browser before returning
        if "challenge" not in page.url:
            print("successfully passed Cloudflare")
            content = page.content()
        else:
            print("Cloudflare challenge not resolved")
            content = None
        browser.close()
        return content
recommended proxy type: residential proxies with sticky sessions.
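if your proxy needs authentication, Playwright takes the credentials as separate fields; a minimal sketch, assuming a hypothetical provider whose sticky sessions are keyed by a session id in the proxy username:

from playwright.sync_api import sync_playwright

# hypothetical provider: sticky sessions are often keyed by a session id in the username
proxy_config = {
    "server": "http://gw.proxy-provider.example:8080",   # placeholder gateway
    "username": "customer-user-session-abc123",          # placeholder sticky-session credential
    "password": "secret",
}

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, proxy=proxy_config)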
Akamai Bot Manager
Akamai uses sophisticated device fingerprinting through sensor data collection.
# strategy: use an anti-detect browser or undetected-chromedriver
import time

import undetected_chromedriver as uc
def scrape_akamai_site(url, proxy=None):
"""scrape an Akamai-protected site."""
options = uc.ChromeOptions()
if proxy:
options.add_argument(f"--proxy-server={proxy}")
driver = uc.Chrome(options=options)
driver.get(url)
    # akamai needs time to run its sensor scripts
    time.sleep(8)
content = driver.page_source
driver.quit()
return content
recommended proxy type: residential proxies with low abuse score.
DataDome
DataDome analyzes behavior patterns in real time.
# strategy: mimic human behavior patterns
from playwright.sync_api import sync_playwright
import random
import time
def scrape_datadome_site(url, proxy):
"""scrape a DataDome-protected site with human-like behavior."""
with sync_playwright() as p:
browser = p.chromium.launch(
headless=False,
proxy={"server": proxy}
)
page = browser.new_page()
# navigate with human-like timing
page.goto(url)
time.sleep(random.uniform(2, 4))
# simulate mouse movements
page.mouse.move(random.randint(100, 800), random.randint(100, 600))
time.sleep(random.uniform(0.5, 1.5))
# scroll like a human
page.mouse.wheel(0, random.randint(200, 500))
time.sleep(random.uniform(1, 3))
content = page.content()
browser.close()
return content
recommended proxy type: mobile proxies or ISP proxies with clean IP reputation.
AWS WAF
AWS WAF is primarily rule-based, focusing on rate limiting and IP reputation.
# strategy: rate limiting + proxy rotation
import httpx
import time
import random
def scrape_aws_waf_site(urls, proxies, requests_per_minute=10):
"""scrape AWS WAF-protected site with rate limiting."""
delay = 60.0 / requests_per_minute
results = []
for url in urls:
proxy = random.choice(proxies)
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
"Accept": "text/html,application/xhtml+xml",
"Accept-Language": "en-US,en;q=0.9",
}
try:
with httpx.Client(proxy=proxy, headers=headers) as client:
resp = client.get(url, follow_redirects=True)
results.append({"url": url, "status": resp.status_code, "content": resp.text})
except Exception as e:
results.append({"url": url, "error": str(e)})
time.sleep(delay + random.uniform(0, 1))
return results
recommended proxy type: datacenter proxies with rotation are often sufficient.
Batch WAF Detection
scan multiple websites at once to plan your scraping infrastructure:
import csv
import time
def batch_detect(urls, proxy=None, output_file="waf_results.csv"):
"""detect WAFs for a list of URLs."""
detector = WAFDetector(proxy=proxy)
results = []
for i, url in enumerate(urls):
print(f"[{i+1}/{len(urls)}] scanning {url}...")
result = detector.detect(url)
waf_names = ", ".join([w["waf"] for w in result.get("waf_services", [])])
results.append({
"url": url,
"waf_detected": result["waf_detected"],
"waf_names": waf_names or "none",
"status_code": result.get("status_code", "error"),
})
        # small delay between scans to avoid hammering targets
        time.sleep(2)
# save to CSV
with open(output_file, "w", newline="") as f:
writer = csv.DictWriter(f, fieldnames=["url", "waf_detected", "waf_names", "status_code"])
writer.writeheader()
writer.writerows(results)
# print summary
waf_count = sum(1 for r in results if r["waf_detected"])
print(f"\nresults: {waf_count}/{len(results)} sites have WAF protection")
return results
# scan your target sites
urls = [
"https://site1.com",
"https://site2.com",
"https://site3.com",
]
batch_detect(urls, proxy="http://user:pass@proxy.com:8080")
Choosing the Right Proxy Based on WAF
| WAF | Best Proxy Type | Session Type | Notes |
|---|---|---|---|
| Cloudflare | residential | sticky (5-10 min) | needs browser with JS |
| Akamai | residential | sticky (10+ min) | sensor data requires consistency |
| DataDome | mobile/ISP | sticky (15+ min) | behavioral analysis needs stable IP |
| PerimeterX | residential | sticky | requires browser |
| AWS WAF | datacenter (rotating) | rotating | rule-based, rotation works |
| Imperva | residential | sticky | cookie-based challenges |
| Sucuri | datacenter | rotating | basic protection |
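the table translates directly into a configuration lookup; a minimal sketch, with placeholder proxy pool URLs you would swap for your own:

# placeholder proxy pools keyed by type -- swap in your provider's endpoints
PROXY_POOLS = {
    "residential": "http://user:pass@res.proxy.example:8080",
    "mobile": "http://user:pass@mobile.proxy.example:8080",
    "datacenter": "http://user:pass@dc.proxy.example:8080",
}

# proxy type and session style per WAF, mirroring the table above
WAF_PROXY_PLAN = {
    "Cloudflare": ("residential", "sticky"),
    "Akamai": ("residential", "sticky"),
    "DataDome": ("mobile", "sticky"),
    "PerimeterX": ("residential", "sticky"),
    "AWS WAF": ("datacenter", "rotating"),
    "Imperva": ("residential", "sticky"),
    "Sucuri": ("datacenter", "rotating"),
}

def proxy_for_waf(waf_name):
    """return (proxy_url, session_style) for a detected WAF, defaulting to residential/sticky."""
    proxy_type, session = WAF_PROXY_PLAN.get(waf_name, ("residential", "sticky"))
    return PROXY_POOLS[proxy_type], session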
Common Mistakes in WAF Detection
1. testing from a known datacenter IP.
WAFs may behave differently for datacenter IPs versus residential IPs. test from the same type of IP you plan to scrape from.
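one way to catch this in practice is to run the same detection through both a datacenter and a residential proxy and compare the results; the proxy URLs below are placeholders and the example reuses the WAFDetector class from earlier:

# compare WAF detection results across proxy types (placeholder proxy URLs)
proxies = {
    "datacenter": "http://user:pass@dc.proxy.example:8080",
    "residential": "http://user:pass@res.proxy.example:8080",
}

for proxy_type, proxy_url in proxies.items():
    detector = WAFDetector(proxy=proxy_url)
    result = detector.detect("https://target-website.com")
    wafs = [w["waf"] for w in result.get("waf_services", [])]
    print(f"{proxy_type}: status={result.get('status_code')} wafs={wafs or 'none'}")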
2. ignoring the response body.
some WAFs do not set special headers but return challenge pages. always check the response body for JavaScript challenges or CAPTCHA forms.
def check_challenge_page(html):
"""check if the response is a WAF challenge page."""
challenge_indicators = [
"cf-browser-verification",
"challenge-platform",
"captcha",
"_datadome",
"px-captcha",
"bot-protection",
"access denied",
]
html_lower = html.lower()
for indicator in challenge_indicators:
if indicator in html_lower:
return True, indicator
return False, None
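a short sketch of wiring the body check into a plain httpx request, so a response with clean headers still gets flagged when the body is a challenge page:

import httpx

# fetch the page and run the body check alongside any header-based detection
resp = httpx.get("https://target-website.com", follow_redirects=True, timeout=30)
is_challenge, indicator = check_challenge_page(resp.text)
if is_challenge:
    print(f"blocked by a challenge page (matched indicator: {indicator})")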
3. assuming no WAF means no protection.
some websites use custom anti-bot solutions that WhatWaf and header analysis will not detect. rate limiting, IP blocking, and behavioral analysis can exist without a commercial WAF product.
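a crude probe for that case is to send a short burst of requests and watch for 403/429/503 responses; a minimal sketch against a placeholder URL:

import time
import httpx

def probe_rate_limiting(url, burst=10, delay=0.5):
    """send a small burst of requests and report status codes that hint at custom anti-bot rules."""
    statuses = []
    with httpx.Client(follow_redirects=True, timeout=30) as client:
        for _ in range(burst):
            try:
                statuses.append(client.get(url).status_code)
            except httpx.HTTPError as e:
                statuses.append(f"error: {e}")
            time.sleep(delay)
    blocked = [s for s in statuses if s in (403, 429, 503)]
    print(f"statuses: {statuses}")
    if blocked:
        print("burst triggered blocking responses; expect rate limiting even without a commercial WAF")

probe_rate_limiting("https://target-website.com")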
Conclusion
WAF detection is the first step in any serious scraping project. knowing what you are up against saves hours of debugging failed requests and lets you allocate the right resources from the start. WhatWaf gives you a quick identification, but combining it with HTTP header analysis and response body inspection gives you the full picture.
once you identify the WAF, you need proxies that can handle it. our Singapore mobile proxy service uses genuine 4G/5G connections, which WAFs are reluctant to block because each mobile IP is shared by many real users behind carrier-grade NAT.
the key takeaway: do not use the same scraping strategy for every site. detect the WAF first, then match your proxy type, browser configuration, and request patterns to the specific protection you are facing.