WhatWaf: Detect WAF for Web Scraping Success
when your web scraper hits a wall, the first question is: what is blocking you? Web Application Firewalls (WAFs) like Cloudflare, Akamai, and DataDome are the most common obstacles, and each one requires a different bypass strategy. WhatWaf is an open-source tool that identifies which WAF is protecting a target website, so you can choose the right approach instead of wasting time on trial and error.
this guide shows you how to use WhatWaf for WAF detection, interpret its results, and plan your scraping strategy based on what you find.
What is a WAF and Why It Matters for Scraping
a Web Application Firewall sits between your scraper and the website’s server. it analyzes incoming requests and decides whether they come from a real user or an automated tool. if it suspects automation, it blocks the request, serves a CAPTCHA, or returns a challenge page.
the problem for scrapers is that different WAFs use different detection methods:
- Cloudflare: JavaScript challenges, browser fingerprinting, behavioral analysis
- Akamai Bot Manager: device fingerprinting, sensor data collection, ML-based detection
- DataDome: real-time behavioral analysis, CAPTCHA challenges
- PerimeterX (now HUMAN): behavioral biometrics, proof-of-work challenges
- AWS WAF: rule-based filtering, rate limiting, IP reputation
- Imperva (Incapsula): cookie challenges, JavaScript validation
knowing which WAF you are dealing with determines your entire scraping strategy: which headers to send, whether you need a real browser, what proxy type to use, and whether you need to solve CAPTCHAs.
Installing WhatWaf
WhatWaf is a Python-based tool available on GitHub.
# clone the repository
git clone https://github.com/Ekultek/WhatWaf.git
cd WhatWaf
# install dependencies
pip install -r requirements.txt
# or install directly with pip
pip install whatwaf
verify the installation:
python whatwaf.py --help
Basic WAF Detection
the simplest usage is pointing WhatWaf at a URL:
python whatwaf.py -u https://target-website.com
WhatWaf sends a series of test requests with payloads designed to trigger WAF responses. it then analyzes the responses to identify which WAF is in place.
example output:
[*] checking https://target-website.com
[+] results for target-website.com:
[+] detected firewall: Cloudflare (Cloudflare Inc.)
[+] detection method: response headers, status codes
[+] confidence: high
Common Options
# scan with custom headers
python whatwaf.py -u https://target.com --headers "User-Agent: Mozilla/5.0"
# scan multiple URLs from a file
python whatwaf.py -l urls.txt
# use a proxy for scanning
python whatwaf.py -u https://target.com --proxy http://user:pass@proxy.com:8080
# increase verbosity
python whatwaf.py -u https://target.com -v
# specify request timeout
python whatwaf.py -u https://target.com --timeout 30
# use random user agent
python whatwaf.py -u https://target.com --ra
Integrating WhatWaf with Python Scraping Workflows
you can automate WAF detection as part of your scraping pipeline to choose the right strategy automatically.
Method 1: Command-Line Integration
import subprocess
import json
import re
def detect_waf_cli(url, proxy=None):
"""detect WAF using WhatWaf CLI."""
cmd = ["python", "whatwaf.py", "-u", url, "--json"]
if proxy:
cmd.extend(["--proxy", proxy])
result = subprocess.run(
cmd,
capture_output=True,
text=True,
timeout=60
)
# parse output
output = result.stdout
waf_info = {
"url": url,
"waf_detected": False,
"waf_name": None,
"confidence": None
}
if "detected firewall" in output.lower():
waf_info["waf_detected"] = True
# extract WAF name from output
match = re.search(r"detected firewall:\s*(.+?)(?:\n|$)", output, re.IGNORECASE)
if match:
waf_info["waf_name"] = match.group(1).strip()
return waf_info
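a quick usage sketch; the URL and proxy are placeholders, and it assumes whatwaf.py sits in your working directory:

# example call (placeholder URL and proxy)
info = detect_waf_cli("https://target-website.com", proxy="http://user:pass@proxy.example.com:8080")
if info["waf_detected"]:
    print(f"WAF identified: {info['waf_name']}")
else:
    print("no WAF reported by WhatWaf")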
Method 2: HTTP Header Analysis (DIY Detection)
you do not always need WhatWaf. many WAFs leave identifiable signatures in HTTP headers.
import httpx
class WAFDetector:
"""detect common WAFs by analyzing HTTP response headers."""
WAF_SIGNATURES = {
"Cloudflare": {
"headers": {"server": "cloudflare", "cf-ray": "*"},
"cookies": ["__cf_bm", "cf_clearance"],
},
"Akamai": {
"headers": {"server": "akamaighost", "x-akamai-transformed": "*"},
"cookies": ["_abck", "ak_bmsc"],
},
"DataDome": {
"headers": {"x-datadome": "*", "server": "datadome"},
"cookies": ["datadome"],
},
"PerimeterX": {
"headers": {},
"cookies": ["_px3", "_pxhd", "_pxvid"],
},
"Imperva": {
"headers": {"x-iinfo": "*", "x-cdn": "imperva"},
"cookies": ["incap_ses_", "visid_incap_"],
},
"AWS WAF": {
"headers": {"x-amzn-waf-action": "*"},
"cookies": ["aws-waf-token"],
},
"Sucuri": {
"headers": {"server": "sucuri", "x-sucuri-id": "*"},
"cookies": ["sucuri_cloudproxy_uuid"],
},
"Fastly": {
"headers": {"x-fastly-request-id": "*", "via": "varnish"},
"cookies": [],
}
}
def __init__(self, proxy=None):
self.proxy = proxy
self.headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}
def detect(self, url):
"""detect WAF by checking response headers and cookies."""
try:
with httpx.Client(
proxy=self.proxy,
headers=self.headers,
follow_redirects=True,
timeout=30
) as client:
response = client.get(url)
return self._analyze_response(url, response)
except Exception as e:
return {"url": url, "error": str(e), "waf_detected": False}
def _analyze_response(self, url, response):
"""analyze response for WAF signatures."""
detected = []
resp_headers = {k.lower(): v.lower() for k, v in response.headers.items()}
resp_cookies = [c.lower() for c in response.cookies.keys()]
for waf_name, signatures in self.WAF_SIGNATURES.items():
score = 0
# check headers
for header, expected in signatures["headers"].items():
if header in resp_headers:
if expected == "*" or expected in resp_headers[header]:
score += 2
# check cookies
for cookie_pattern in signatures["cookies"]:
for actual_cookie in resp_cookies:
if cookie_pattern in actual_cookie:
score += 1
if score >= 2:
detected.append({
"waf": waf_name,
"confidence": "high" if score >= 3 else "medium",
"score": score
})
return {
"url": url,
"status_code": response.status_code,
"waf_detected": len(detected) > 0,
"waf_services": detected,
"all_headers": dict(response.headers),
}
# usage
detector = WAFDetector(proxy="http://user:pass@proxy.example.com:8080")
result = detector.detect("https://target-website.com")
if result["waf_detected"]:
for waf in result["waf_services"]:
print(f"detected: {waf['waf']} (confidence: {waf['confidence']})")
else:
print("no WAF detected")
Scraping Strategies by WAF Type
once you know which WAF protects your target, here is how to approach it.
Cloudflare
Cloudflare is the most common WAF you will encounter. it serves JavaScript challenges and Turnstile challenges (its CAPTCHA replacement).
# strategy: use a stealth browser + residential proxies
from playwright.sync_api import sync_playwright
def scrape_cloudflare_site(url, proxy):
"""scrape a Cloudflare-protected site."""
with sync_playwright() as p:
browser = p.chromium.launch(
headless=False, # headed mode passes more checks
proxy={"server": proxy}
)
context = browser.new_context(
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
viewport={"width": 1920, "height": 1080},
)
page = context.new_page()
page.goto(url)
# wait for Cloudflare challenge to resolve
page.wait_for_timeout(5000)
        # check whether the challenge resolved, then close the browser before returning
        if "challenge" not in page.url:
            print("successfully passed Cloudflare")
            content = page.content()
        else:
            print("Cloudflare challenge not resolved")
            content = None
        browser.close()
        return content
recommended proxy type: residential proxies with sticky sessions.
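if your proxy needs authentication, Playwright takes the credentials as separate fields; a minimal sketch, assuming a hypothetical provider whose sticky sessions are keyed by a session id in the proxy username:

from playwright.sync_api import sync_playwright

# hypothetical provider: sticky sessions are often keyed by a session id in the username
proxy_config = {
    "server": "http://gw.proxy-provider.example:8080",   # placeholder gateway
    "username": "customer-user-session-abc123",          # placeholder sticky-session credential
    "password": "secret",
}

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, proxy=proxy_config)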
Akamai Bot Manager
Akamai uses sophisticated device fingerprinting through sensor data collection.
# strategy: use an anti-detect browser or undetected-chromedriver
import time

import undetected_chromedriver as uc
def scrape_akamai_site(url, proxy=None):
"""scrape an Akamai-protected site."""
options = uc.ChromeOptions()
if proxy:
options.add_argument(f"--proxy-server={proxy}")
driver = uc.Chrome(options=options)
driver.get(url)
    # akamai needs time to run its sensor scripts
    time.sleep(8)
content = driver.page_source
driver.quit()
return content
recommended proxy type: residential proxies with low abuse score.
DataDome
DataDome analyzes behavior patterns in real time.
# strategy: mimic human behavior patterns
from playwright.sync_api import sync_playwright
import random
import time
def scrape_datadome_site(url, proxy):
"""scrape a DataDome-protected site with human-like behavior."""
with sync_playwright() as p:
browser = p.chromium.launch(
headless=False,
proxy={"server": proxy}
)
page = browser.new_page()
# navigate with human-like timing
page.goto(url)
time.sleep(random.uniform(2, 4))
# simulate mouse movements
page.mouse.move(random.randint(100, 800), random.randint(100, 600))
time.sleep(random.uniform(0.5, 1.5))
# scroll like a human
page.mouse.wheel(0, random.randint(200, 500))
time.sleep(random.uniform(1, 3))
content = page.content()
browser.close()
return content
recommended proxy type: mobile proxies or ISP proxies with clean IP reputation.
AWS WAF
AWS WAF is primarily rule-based, focusing on rate limiting and IP reputation.
# strategy: rate limiting + proxy rotation
import httpx
import time
import random
def scrape_aws_waf_site(urls, proxies, requests_per_minute=10):
"""scrape AWS WAF-protected site with rate limiting."""
delay = 60.0 / requests_per_minute
results = []
for url in urls:
proxy = random.choice(proxies)
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
"Accept": "text/html,application/xhtml+xml",
"Accept-Language": "en-US,en;q=0.9",
}
try:
with httpx.Client(proxy=proxy, headers=headers) as client:
resp = client.get(url, follow_redirects=True)
results.append({"url": url, "status": resp.status_code, "content": resp.text})
except Exception as e:
results.append({"url": url, "error": str(e)})
time.sleep(delay + random.uniform(0, 1))
return results
recommended proxy type: datacenter proxies with rotation are often sufficient.
Batch WAF Detection
scan multiple websites at once to plan your scraping infrastructure:
import csv
import time
def batch_detect(urls, proxy=None, output_file="waf_results.csv"):
"""detect WAFs for a list of URLs."""
detector = WAFDetector(proxy=proxy)
results = []
for i, url in enumerate(urls):
print(f"[{i+1}/{len(urls)}] scanning {url}...")
result = detector.detect(url)
waf_names = ", ".join([w["waf"] for w in result.get("waf_services", [])])
results.append({
"url": url,
"waf_detected": result["waf_detected"],
"waf_names": waf_names or "none",
"status_code": result.get("status_code", "error"),
})
        # small delay between scans to avoid hammering targets
        time.sleep(2)
# save to CSV
with open(output_file, "w", newline="") as f:
writer = csv.DictWriter(f, fieldnames=["url", "waf_detected", "waf_names", "status_code"])
writer.writeheader()
writer.writerows(results)
# print summary
waf_count = sum(1 for r in results if r["waf_detected"])
print(f"\nresults: {waf_count}/{len(results)} sites have WAF protection")
return results
# scan your target sites
urls = [
"https://site1.com",
"https://site2.com",
"https://site3.com",
]
batch_detect(urls, proxy="http://user:pass@proxy.com:8080")
Choosing the Right Proxy Based on WAF
| WAF | Best Proxy Type | Session Type | Notes |
|---|---|---|---|
| Cloudflare | residential | sticky (5-10 min) | needs browser with JS |
| Akamai | residential | sticky (10+ min) | sensor data requires consistency |
| DataDome | mobile/ISP | sticky (15+ min) | behavioral analysis needs stable IP |
| PerimeterX | residential | sticky | requires browser |
| AWS WAF | datacenter (rotating) | rotating | rule-based, rotation works |
| Imperva | residential | sticky | cookie-based challenges |
| Sucuri | datacenter | rotating | basic protection |
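the table translates directly into a configuration lookup; a minimal sketch, with placeholder proxy pool URLs you would swap for your own:

# placeholder proxy pools keyed by type -- swap in your provider's endpoints
PROXY_POOLS = {
    "residential": "http://user:pass@res.proxy.example:8080",
    "mobile": "http://user:pass@mobile.proxy.example:8080",
    "datacenter": "http://user:pass@dc.proxy.example:8080",
}

# proxy type and session style per WAF, mirroring the table above
WAF_PROXY_PLAN = {
    "Cloudflare": ("residential", "sticky"),
    "Akamai": ("residential", "sticky"),
    "DataDome": ("mobile", "sticky"),
    "PerimeterX": ("residential", "sticky"),
    "AWS WAF": ("datacenter", "rotating"),
    "Imperva": ("residential", "sticky"),
    "Sucuri": ("datacenter", "rotating"),
}

def proxy_for_waf(waf_name):
    """return (proxy_url, session_style) for a detected WAF, defaulting to residential/sticky."""
    proxy_type, session = WAF_PROXY_PLAN.get(waf_name, ("residential", "sticky"))
    return PROXY_POOLS[proxy_type], session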
Common Mistakes in WAF Detection
1. testing from a known datacenter IP.
WAFs may behave differently for datacenter IPs versus residential IPs. test from the same type of IP you plan to scrape from.
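one way to catch this in practice is to run the same detection through both a datacenter and a residential proxy and compare the results; the proxy URLs below are placeholders and the example reuses the WAFDetector class from earlier:

# compare WAF detection results across proxy types (placeholder proxy URLs)
proxies = {
    "datacenter": "http://user:pass@dc.proxy.example:8080",
    "residential": "http://user:pass@res.proxy.example:8080",
}

for proxy_type, proxy_url in proxies.items():
    detector = WAFDetector(proxy=proxy_url)
    result = detector.detect("https://target-website.com")
    wafs = [w["waf"] for w in result.get("waf_services", [])]
    print(f"{proxy_type}: status={result.get('status_code')} wafs={wafs or 'none'}")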
2. ignoring the response body.
some WAFs do not set special headers but return challenge pages. always check the response body for JavaScript challenges or CAPTCHA forms.
def check_challenge_page(html):
"""check if the response is a WAF challenge page."""
challenge_indicators = [
"cf-browser-verification",
"challenge-platform",
"captcha",
"_datadome",
"px-captcha",
"bot-protection",
"access denied",
]
html_lower = html.lower()
for indicator in challenge_indicators:
if indicator in html_lower:
return True, indicator
return False, None
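a short sketch of wiring the body check into a plain httpx request, so a response with clean headers still gets flagged when the body is a challenge page:

import httpx

# fetch the page and run the body check alongside any header-based detection
resp = httpx.get("https://target-website.com", follow_redirects=True, timeout=30)
is_challenge, indicator = check_challenge_page(resp.text)
if is_challenge:
    print(f"blocked by a challenge page (matched indicator: {indicator})")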
3. assuming no WAF means no protection.
some websites use custom anti-bot solutions that WhatWaf and header analysis will not detect. rate limiting, IP blocking, and behavioral analysis can exist without a commercial WAF product.
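a crude probe for that case is to send a short burst of requests and watch for 403/429/503 responses; a minimal sketch against a placeholder URL:

import time
import httpx

def probe_rate_limiting(url, burst=10, delay=0.5):
    """send a small burst of requests and report status codes that hint at custom anti-bot rules."""
    statuses = []
    with httpx.Client(follow_redirects=True, timeout=30) as client:
        for _ in range(burst):
            try:
                statuses.append(client.get(url).status_code)
            except httpx.HTTPError as e:
                statuses.append(f"error: {e}")
            time.sleep(delay)
    blocked = [s for s in statuses if s in (403, 429, 503)]
    print(f"statuses: {statuses}")
    if blocked:
        print("burst triggered blocking responses; expect rate limiting even without a commercial WAF")

probe_rate_limiting("https://target-website.com")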
Conclusion
WAF detection is the first step in any serious scraping project. knowing what you are up against saves hours of debugging failed requests and lets you allocate the right resources from the start. WhatWaf gives you a quick identification, but combining it with HTTP header analysis and response body inspection gives you the full picture.
once you identify the WAF, you need proxies that can handle it. our Singapore mobile proxy service uses genuine 4G/5G connections, which WAFs are reluctant to block because each mobile IP is shared by many real users behind carrier-grade NAT.
the key takeaway: do not use the same scraping strategy for every site. detect the WAF first, then match your proxy type, browser configuration, and request patterns to the specific protection you are facing.