How to Scrape Walmart Product Pages with Residential Proxies
Walmart is the world’s largest retailer by revenue and the second-largest e-commerce platform in the United States. With millions of products listed on walmart.com, the platform is a critical data source for competitive pricing, market research, product monitoring, and e-commerce intelligence.
Walmart has significantly strengthened its anti-scraping defenses in recent years, employing bot detection technologies that challenge even experienced scrapers. This guide provides a complete Python framework for extracting Walmart product data using residential proxies to maintain reliable access.
Why Walmart Scraping Requires Robust Proxies
Walmart employs multiple layers of anti-bot protection:
- PerimeterX (HUMAN Security): Walmart uses PerimeterX, one of the most sophisticated bot detection solutions, which analyzes browser fingerprints, mouse movements, and request patterns.
- IP reputation scoring: Datacenter IPs and known VPN/proxy ranges are flagged immediately.
- Rate limiting: Aggressive per-IP request quotas that trigger blocks after sustained activity.
- JavaScript challenges: Client-side scripts that require full browser execution to pass.
- Cookie validation: Complex cookie chains that must be maintained across requests.
Residential and mobile proxies are the most effective choice because they use IP addresses assigned by real ISPs and mobile carriers, which bot detection systems treat as legitimate consumer traffic.
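To make that concrete, here is a minimal sketch of routing a requests session through a rotating residential gateway; the hostname, port, and credentials are placeholders, not a real endpoint:

import requests

# Placeholder gateway: each request can exit through a different residential IP
PROXY_URL = "http://username:password@residential-gateway.example.com:8000"

session = requests.Session()
session.proxies = {"http": PROXY_URL, "https": PROXY_URL}

response = session.get("https://www.walmart.com", timeout=25)
print(response.status_code)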
Setting Up Your Environment
pip install requests beautifulsoup4 lxml pandas cloudscraper

We include cloudscraper as an alternative to plain requests for handling JavaScript challenges.
Building the Walmart Scraper
Step 1: Configure Session with Anti-Detection
import requests
import cloudscraper
from bs4 import BeautifulSoup
import json
import time
import random
import re
import pandas as pd
from datetime import datetime
class WalmartScraper:
"""Scrape Walmart product data with anti-detection measures."""
BASE_URL = "https://www.walmart.com"
SEARCH_URL = "https://www.walmart.com/search"
API_URL = "https://www.walmart.com/orchestra/home/graphql"
def __init__(self, proxy_url, use_cloudscraper=True):
if use_cloudscraper:
self.session = cloudscraper.create_scraper(
browser={"browser": "chrome", "platform": "windows", "mobile": False}
)
else:
self.session = requests.Session()
self.session.proxies = {
"http": proxy_url,
"https": proxy_url,
}
self.session.headers.update({
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/120.0.0.0 Safari/537.36"
),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
})
    def _fetch_page(self, url, params=None, max_retries=3):
        """Fetch a page with retry logic and anti-detection."""
        for attempt in range(max_retries):
            try:
                response = self.session.get(url, params=params, timeout=25)
                if response.status_code == 200:
                    # Check for bot detection page
                    if "blocked" in response.text.lower() and len(response.text) < 5000:
                        print(f"Bot detection triggered, attempt {attempt + 1}")
                        time.sleep(random.uniform(15, 30))
                        continue
                    return response.text
                elif response.status_code == 403:
                    print(f"Blocked (403), rotating proxy recommended. Attempt {attempt + 1}")
                    time.sleep(random.uniform(10, 20))
                elif response.status_code == 429:
                    print("Rate limited, waiting...")
                    time.sleep(random.uniform(30, 60))
                else:
                    print(f"Status {response.status_code}, attempt {attempt + 1}")
                    time.sleep(random.uniform(5, 10))
            except requests.exceptions.RequestException as e:
                print(f"Request error: {e}")
                time.sleep(random.uniform(5, 10))
        return None
    def _extract_json_data(self, html):
        """Extract the __NEXT_DATA__ JSON from Walmart's page."""
        match = re.search(
            r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
            html,
            re.DOTALL,
        )
        if match:
            try:
                return json.loads(match.group(1))
            except json.JSONDecodeError:
                pass
        return None

Step 2: Search for Products
    # (continuing the WalmartScraper class)
    def search_products(self, query, num_pages=3, sort="best_match"):
        """Search Walmart for products matching a query."""
        all_products = []
        for page in range(1, num_pages + 1):
            params = {
                "q": query,
                "page": page,
                "sort": sort,
                "affinityOverride": "default",
            }
            print(f"Scraping search page {page} for '{query}'...")
            html = self._fetch_page(self.SEARCH_URL, params=params)
            if not html:
                print(f"Failed to fetch page {page}")
                continue
            products = self._parse_search_results(html)
            if not products:
                print(f"No products on page {page}")
                break
            all_products.extend(products)
            print(f"  Found {len(products)} products (total: {len(all_products)})")
            time.sleep(random.uniform(3, 7))
        return all_products
    def _parse_search_results(self, html):
        """Parse product listings from search results."""
        products = []
        # Try extracting from __NEXT_DATA__
        next_data = self._extract_json_data(html)
        if next_data:
            products = self._parse_next_data_search(next_data)
        # Fallback to HTML parsing
        if not products:
            products = self._parse_html_search(html)
        return products
    def _parse_next_data_search(self, data):
        """Parse search results from __NEXT_DATA__ JSON."""
        products = []
        try:
            # Navigate to search results in the JSON structure
            props = data.get("props", {}).get("pageProps", {})
            initial_data = props.get("initialData", {})
            search_result = initial_data.get("searchResult", {})
            items = search_result.get("itemStacks", [])
            for stack in items:
                for item in stack.get("items", []):
                    if item.get("__typename") != "Product":
                        continue
                    product = {
                        "item_id": item.get("usItemId"),
                        "product_id": item.get("productId"),
                        "title": item.get("name"),
                        "brand": item.get("brand"),
                        "price": item.get("priceInfo", {}).get("currentPrice", {}).get("price"),
                        "price_text": item.get("priceInfo", {}).get("currentPrice", {}).get("priceString"),
                        "was_price": item.get("priceInfo", {}).get("wasPrice", {}).get("price"),
                        "unit_price": item.get("priceInfo", {}).get("unitPrice"),
                        "rating": item.get("averageRating"),
                        "review_count": item.get("numberOfReviews"),
                        "seller": item.get("sellerName"),
                        "fulfillment": item.get("fulfillmentType"),
                        "in_stock": item.get("availabilityStatusV2", {}).get("value") == "IN_STOCK",
                        "url": f"https://www.walmart.com{item.get('canonicalUrl', '')}",
                        "image_url": item.get("imageInfo", {}).get("thumbnailUrl"),
                        "badges": [b.get("text") for b in item.get("badges", {}).get("flags", []) if b.get("text")],
                        "scraped_at": datetime.now().isoformat(),
                    }
                    products.append(product)
        except (KeyError, TypeError) as e:
            print(f"Error parsing JSON search results: {e}")
        return products
    def _parse_html_search(self, html):
        """Fallback HTML parser for search results."""
        soup = BeautifulSoup(html, "lxml")
        products = []
        cards = soup.select("div[data-item-id]")
        for card in cards:
            product = {}
            product["item_id"] = card.get("data-item-id")
            title_el = card.select_one("span[data-automation-id='product-title']")
            product["title"] = title_el.get_text(strip=True) if title_el else None
            price_el = card.select_one("div[data-automation-id='product-price'] span")
            if price_el:
                price_text = price_el.get_text(strip=True)
                product["price_text"] = price_text
                match = re.search(r"\$([\d,]+\.?\d*)", price_text)
                product["price"] = float(match.group(1).replace(",", "")) if match else None
            rating_el = card.select_one("span.w_iUH7")
            product["rating"] = rating_el.get_text(strip=True) if rating_el else None
            link_el = card.select_one("a[link-identifier]")
            if link_el:
                href = link_el.get("href", "")
                product["url"] = f"https://www.walmart.com{href}" if href.startswith("/") else href
            if product.get("title"):
                products.append(product)
        return products

Step 3: Extract Detailed Product Information
    # (continuing the WalmartScraper class)
    def get_product_details(self, product_url):
        """Fetch detailed information for a single product."""
        html = self._fetch_page(product_url)
        if not html:
            return None
        next_data = self._extract_json_data(html)
        if not next_data:
            return self._parse_product_html(html)
        return self._parse_product_json(next_data)
    def _parse_product_json(self, data):
        """Parse detailed product data from __NEXT_DATA__."""
        try:
            props = data.get("props", {}).get("pageProps", {})
            initial_data = props.get("initialData", {}).get("data", {})
            product = initial_data.get("product", {})
            details = {
                "item_id": product.get("usItemId"),
                "product_id": product.get("productId"),
                "title": product.get("name"),
                "brand": product.get("brand"),
                "short_description": product.get("shortDescription"),
                "long_description": product.get("detailedDescription"),
                # Pricing
                "price": product.get("priceInfo", {}).get("currentPrice", {}).get("price"),
                "price_text": product.get("priceInfo", {}).get("currentPrice", {}).get("priceString"),
                "was_price": product.get("priceInfo", {}).get("wasPrice", {}).get("price"),
                "savings": product.get("priceInfo", {}).get("savings"),
                "price_per_unit": product.get("priceInfo", {}).get("unitPrice"),
                # Ratings and reviews
                "rating": product.get("averageRating"),
                "review_count": product.get("numberOfReviews"),
                # Availability
                "in_stock": product.get("availabilityStatus") == "IN_STOCK",
                "fulfillment_type": product.get("fulfillmentType"),
                "seller_name": product.get("sellerName"),
                "seller_id": product.get("sellerId"),
                # Category
                "category_path": [
                    cat.get("name") for cat in product.get("category", {}).get("path", [])
                ],
                # Specifications (filled in below)
                "specifications": {},
                # Images
                "images": [
                    img.get("url") for img in product.get("imageInfo", {}).get("allImages", [])
                    if img.get("url")
                ],
                # URL
                "url": f"https://www.walmart.com{product.get('canonicalUrl', '')}",
                "scraped_at": datetime.now().isoformat(),
            }
            # Extract specifications
            specs = product.get("specifications", [])
            for spec_group in specs:
                for spec in spec_group.get("specifications", []):
                    key = spec.get("name")
                    value = spec.get("value")
                    if key and value:
                        details["specifications"][key] = value
            return details
        except (KeyError, TypeError) as e:
            print(f"Error parsing product JSON: {e}")
            return None
    def _parse_product_html(self, html):
        """Fallback parser for product details from HTML."""
        soup = BeautifulSoup(html, "lxml")
        details = {}
        title_el = soup.select_one("h1[itemprop='name']")
        details["title"] = title_el.get_text(strip=True) if title_el else None
        price_el = soup.select_one("span[itemprop='price']")
        if price_el:
            details["price_text"] = price_el.get_text(strip=True)
        rating_el = soup.select_one("span.rating-number")
        details["rating"] = rating_el.get_text(strip=True) if rating_el else None
        desc_el = soup.select_one("div.about-desc")
        details["description"] = desc_el.get_text(strip=True) if desc_el else None
        return details

Step 4: Extract Product Reviews
    # (continuing the WalmartScraper class)
    def get_product_reviews(self, item_id, num_pages=3):
        """Fetch reviews for a product."""
        all_reviews = []
        for page in range(1, num_pages + 1):
            url = (
                f"https://www.walmart.com/reviews/product/{item_id}"
                f"?page={page}&sort=relevancy"
            )
            html = self._fetch_page(url)
            if not html:
                break
            soup = BeautifulSoup(html, "lxml")
            review_elements = soup.select("div[itemprop='review']")
            if not review_elements:
                # Try extracting from JSON
                next_data = self._extract_json_data(html)
                if next_data:
                    reviews = self._parse_reviews_json(next_data)
                    if reviews:
                        all_reviews.extend(reviews)
                    else:
                        break
                else:
                    break
            else:
                for el in review_elements:
                    review = {}
                    title_el = el.select_one("h3")
                    review["title"] = title_el.get_text(strip=True) if title_el else None
                    body_el = el.select_one("span[itemprop='reviewBody']")
                    review["body"] = body_el.get_text(strip=True) if body_el else None
                    rating_el = el.select_one("meta[itemprop='ratingValue']")
                    review["rating"] = rating_el.get("content") if rating_el else None
                    author_el = el.select_one("span[itemprop='author']")
                    review["author"] = author_el.get_text(strip=True) if author_el else None
                    date_el = el.select_one("meta[itemprop='datePublished']")
                    review["date"] = date_el.get("content") if date_el else None
                    if review.get("body"):
                        all_reviews.append(review)
            time.sleep(random.uniform(2, 5))
        return all_reviews
    def _parse_reviews_json(self, data):
        """Parse reviews from __NEXT_DATA__."""
        reviews = []
        try:
            props = data.get("props", {}).get("pageProps", {})
            review_data = props.get("initialData", {}).get("data", {}).get("reviews", {})
            customer_reviews = review_data.get("customerReviews", [])
            for rev in customer_reviews:
                review = {
                    "title": rev.get("reviewTitle"),
                    "body": rev.get("reviewText"),
                    "rating": rev.get("rating"),
                    "author": rev.get("userNickname"),
                    "date": rev.get("reviewSubmissionTime"),
                    "verified_purchase": rev.get("badges", {}).get("verifiedPurchaser", False),
                    "positive_feedback": rev.get("positiveFeedback"),
                    "negative_feedback": rev.get("negativeFeedback"),
                }
                reviews.append(review)
        except (KeyError, TypeError):
            pass
        return reviews

Step 5: Run the Complete Pipeline
def main():
    proxy_url = "http://user:pass@proxy.dataresearchtools.com:8080"
    scraper = WalmartScraper(proxy_url)

    # Search for products
    search_queries = [
        "wireless earbuds",
        "laptop stand",
        "USB-C hub",
    ]
    all_products = []
    for query in search_queries:
        print(f"\nSearching Walmart for: {query}")
        products = scraper.search_products(query, num_pages=2)
        for p in products:
            p["search_query"] = query
        all_products.extend(products)
        time.sleep(random.uniform(5, 10))
    print(f"\nTotal search results: {len(all_products)}")

    # Get details for top products
    detailed = []
    for product in all_products[:10]:
        url = product.get("url")
        if url:
            print(f"Fetching details: {(product.get('title') or 'Unknown')[:50]}...")
            details = scraper.get_product_details(url)
            if details:
                detailed.append(details)
            time.sleep(random.uniform(4, 8))

    # Get reviews for top products
    for product in detailed[:5]:
        item_id = product.get("item_id")
        if item_id:
            print(f"Fetching reviews for item {item_id}...")
            reviews = scraper.get_product_reviews(item_id, num_pages=2)
            product["reviews"] = reviews
            product["reviews_scraped"] = len(reviews)
            print(f"  Got {len(reviews)} reviews")
            time.sleep(random.uniform(4, 8))

    # Save results
    with open("walmart_search_results.json", "w", encoding="utf-8") as f:
        json.dump(all_products, f, indent=2, ensure_ascii=False)
    with open("walmart_detailed.json", "w", encoding="utf-8") as f:
        json.dump(detailed, f, indent=2, ensure_ascii=False)

    # Analysis
    df = pd.DataFrame(all_products)
    df.to_csv("walmart_products.csv", index=False)
    print("\nResults Summary:")
    print(f"  Total products: {len(all_products)}")
    print(f"  Detailed products: {len(detailed)}")
    prices = [p["price"] for p in all_products if p.get("price")]
    if prices:
        print(f"  Price range: ${min(prices):.2f} - ${max(prices):.2f}")
        print(f"  Average price: ${sum(prices)/len(prices):.2f}")

if __name__ == "__main__":
    main()

Handling PerimeterX Bot Detection
PerimeterX (now HUMAN Security) is the primary challenge when scraping Walmart. Here are strategies to bypass it:
Use CloudScraper
The cloudscraper library simulates browser TLS fingerprints and solves basic JavaScript challenges automatically, without requiring a full browser.
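As a minimal standalone sketch (the proxy URL is a placeholder), cloudscraper drops in as a requests-compatible session:

import cloudscraper

scraper = cloudscraper.create_scraper(
    browser={"browser": "chrome", "platform": "windows", "mobile": False}
)
# cloudscraper returns a requests-compatible session, so proxies work the same way
scraper.proxies = {
    "http": "http://user:pass@proxy.example.com:8000",
    "https": "http://user:pass@proxy.example.com:8000",
}
html = scraper.get("https://www.walmart.com", timeout=25).text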
Browser-Level Scraping
For the most difficult scenarios, use Playwright or Selenium with stealth plugins:
import time
import random

from playwright.sync_api import sync_playwright

def scrape_with_playwright(url, proxy_config):
    """Use Playwright for JavaScript-heavy pages."""
    with sync_playwright() as pw:
        browser = pw.chromium.launch(
            headless=True,
            proxy=proxy_config,
        )
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
            ),
        )
        page = context.new_page()
        page.goto(url, wait_until="networkidle")
        time.sleep(random.uniform(2, 4))  # brief human-like pause before reading the DOM
        content = page.content()
        browser.close()
        return content

Cookie Persistence
Maintain cookies across requests. PerimeterX sets tracking cookies that must persist throughout your scraping session. Clearing cookies mid-session immediately flags you as a bot.
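A simple way to honor this between runs is to pickle the session's cookie jar to disk and restore it at startup. This sketch assumes the requests-style session used throughout this guide; the file path is arbitrary:

import pickle

def save_cookies(session, path="walmart_cookies.pkl"):
    # requests cookie jars are picklable, so a plain dump works
    with open(path, "wb") as f:
        pickle.dump(session.cookies, f)

def load_cookies(session, path="walmart_cookies.pkl"):
    try:
        with open(path, "rb") as f:
            session.cookies.update(pickle.load(f))
    except FileNotFoundError:
        pass  # first run: no saved cookies yet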
Proxy Best Practices for Walmart
- Residential over datacenter: Always use residential or mobile proxies. Datacenter IPs are blocked almost instantly.
- US-based IPs: Walmart.com primarily serves US customers. Use US-based proxy IPs for best results.
- Rotation frequency: Rotate IPs every 10-15 requests, but maintain cookies within each IP session (see the sketch after this list).
- Concurrent limits: Limit concurrent requests to 2-3 through different proxies. Mass concurrent requests trigger alerts.
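The rotation guidance above can be implemented with a small wrapper that retires a session after a fixed number of requests and opens a fresh one (new IP, new cookie chain) on the next proxy. The proxy URLs and the 12-request threshold below are illustrative assumptions, not DataResearchTools specifics:

import itertools
import requests

# Hypothetical pool of sticky residential endpoints
PROXIES = [
    "http://user:pass@res-1.example.com:8000",
    "http://user:pass@res-2.example.com:8000",
    "http://user:pass@res-3.example.com:8000",
]

class RotatingSession:
    """One session (and cookie jar) per proxy, rotated every N requests."""

    def __init__(self, proxies, requests_per_ip=12):
        self._pool = itertools.cycle(proxies)
        self._limit = requests_per_ip
        self._rotate()

    def _rotate(self):
        proxy = next(self._pool)
        self.session = requests.Session()
        self.session.proxies = {"http": proxy, "https": proxy}
        self._count = 0

    def get(self, url, **kwargs):
        if self._count >= self._limit:
            self._rotate()  # new IP, new cookie chain
        self._count += 1
        return self.session.get(url, **kwargs)

rotating = RotatingSession(PROXIES)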
Data Applications
Walmart product data supports numerous business operations:
- Competitive pricing: Compare your prices against Walmart’s marketplace sellers for the same products.
- Inventory monitoring: Track stock levels for products you sell or source from Walmart.
- Review analysis: Mine customer reviews for product quality insights and feature requests.
- Seller intelligence: Monitor third-party sellers on Walmart Marketplace for competitive positioning.
- Category trends: Analyze bestseller rankings and new product launches across categories.
Conclusion
Scraping Walmart product pages requires overcoming PerimeterX bot detection, JavaScript rendering challenges, and aggressive rate limiting. The Python framework in this guide provides multiple approaches — from API-based extraction to HTML parsing — with proper anti-detection measures at each layer.
Residential proxies from DataResearchTools provide the foundation for sustainable Walmart scraping by delivering trusted IP addresses that bypass bot detection systems. For additional web scraping techniques and proxy terminology, explore our proxy glossary.
Related Reading
- How to Scrape Amazon Product Data with Proxies: 2026 Python Guide
- How to Scrape Bing Search Results with Python and Proxies
- aiohttp + BeautifulSoup: Async Python Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix
- How to Build an Ethical Web Scraping Policy for Your Company