How to Scrape Walmart Product Pages with Residential Proxies

Walmart is the world’s largest retailer by revenue and the second-largest e-commerce platform in the United States. With millions of products listed on walmart.com, the platform is a critical data source for competitive pricing, market research, product monitoring, and e-commerce intelligence.

Walmart has significantly strengthened its anti-scraping defenses in recent years, employing bot detection technologies that challenge even experienced scrapers. This guide provides a complete Python framework for extracting Walmart product data using residential proxies to maintain reliable access.

Why Walmart Scraping Requires Robust Proxies

Walmart employs multiple layers of anti-bot protection:

  • PerimeterX (HUMAN Security): one of the most sophisticated bot detection solutions, analyzing browser fingerprints, mouse movements, and request patterns.
  • IP reputation scoring: Datacenter IPs and known VPN/proxy ranges are flagged immediately.
  • Rate limiting: Aggressive per-IP request quotas that trigger blocks after sustained activity.
  • JavaScript challenges: Client-side scripts that require full browser execution to pass.
  • Cookie validation: Complex cookie chains that must be maintained across requests.

Residential and mobile proxies are the most effective choice because they use IP addresses assigned by real ISPs and mobile carriers, which bot detection systems treat as legitimate consumer traffic.

Setting Up Your Environment

pip install requests beautifulsoup4 lxml pandas cloudscraper

We include cloudscraper as an alternative to plain requests for handling JavaScript challenges.
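
Before wiring the proxy into the scraper, it helps to confirm the connection works end to end. A minimal check, assuming your provider's endpoint and credentials (the URL below is a placeholder, not a real gateway):

import requests

# Placeholder proxy URL -- substitute the host, port, and credentials
# supplied by your residential proxy provider.
proxy_url = "http://username:password@gate.example-provider.com:8080"
proxies = {"http": proxy_url, "https": proxy_url}

# httpbin echoes back the IP it sees; a successful response confirms that
# traffic is routed through the proxy rather than your own connection.
response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=15)
print(response.status_code, response.json())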

Building the Walmart Scraper

Step 1: Configure Session with Anti-Detection

import requests
import cloudscraper
from bs4 import BeautifulSoup
import json
import time
import random
import re
import pandas as pd
from datetime import datetime

class WalmartScraper:
    """Scrape Walmart product data with anti-detection measures."""

    BASE_URL = "https://www.walmart.com"
    SEARCH_URL = "https://www.walmart.com/search"
    API_URL = "https://www.walmart.com/orchestra/home/graphql"

    def __init__(self, proxy_url, use_cloudscraper=True):
        if use_cloudscraper:
            self.session = cloudscraper.create_scraper(
                browser={"browser": "chrome", "platform": "windows", "mobile": False}
            )
        else:
            self.session = requests.Session()

        self.session.proxies = {
            "http": proxy_url,
            "https": proxy_url,
        }

        self.session.headers.update({
            "User-Agent": (
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/120.0.0.0 Safari/537.36"
            ),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Connection": "keep-alive",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
        })

    def _fetch_page(self, url, params=None, max_retries=3):
        """Fetch a page with retry logic and anti-detection."""
        for attempt in range(max_retries):
            try:
                response = self.session.get(url, params=params, timeout=25)

                if response.status_code == 200:
                    # Check for bot detection page
                    if "blocked" in response.text.lower() and len(response.text) < 5000:
                        print(f"Bot detection triggered, attempt {attempt + 1}")
                        time.sleep(random.uniform(15, 30))
                        continue
                    return response.text
                elif response.status_code == 403:
                    print(f"Blocked (403), rotating proxy recommended. Attempt {attempt + 1}")
                    time.sleep(random.uniform(10, 20))
                elif response.status_code == 429:
                    print(f"Rate limited, waiting...")
                    time.sleep(random.uniform(30, 60))
                else:
                    print(f"Status {response.status_code}, attempt {attempt + 1}")
                    time.sleep(random.uniform(5, 10))

            except requests.exceptions.RequestException as e:
                print(f"Request error: {e}")
                time.sleep(random.uniform(5, 10))

        return None

    def _extract_json_data(self, html):
        """Extract the __NEXT_DATA__ JSON from Walmart's page."""
        match = re.search(
            r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
            html,
            re.DOTALL,
        )
        if match:
            try:
                return json.loads(match.group(1))
            except json.JSONDecodeError:
                pass
        return None
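
Walmart's pages are rendered with Next.js, so most of the structured data lives in the __NEXT_DATA__ script tag that _extract_json_data pulls out. The key layout shifts over time, so it is worth inspecting the payload before relying on a fixed path. A quick exploratory sketch (the proxy URL, product URL, and key names are illustrative):

# Inspect the __NEXT_DATA__ payload before hard-coding JSON paths.
scraper = WalmartScraper("http://username:password@gate.example-provider.com:8080")
html = scraper._fetch_page("https://www.walmart.com/ip/example-product/123456789")
if html:
    data = scraper._extract_json_data(html)
    if data:
        # Typical top-level keys include "props", "page", and "buildId",
        # but treat this as an assumption and verify against the live payload.
        print(list(data.keys()))
        print(list(data.get("props", {}).get("pageProps", {}).keys()))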

Step 2: Search for Products

    def search_products(self, query, num_pages=3, sort="best_match"):
        """Search Walmart for products matching a query."""
        all_products = []

        for page in range(1, num_pages + 1):
            params = {
                "q": query,
                "page": page,
                "sort": sort,
                "affinityOverride": "default",
            }

            print(f"Scraping search page {page} for '{query}'...")
            html = self._fetch_page(self.SEARCH_URL, params=params)
            if not html:
                print(f"Failed to fetch page {page}")
                continue

            products = self._parse_search_results(html)
            if not products:
                print(f"No products on page {page}")
                break

            all_products.extend(products)
            print(f"  Found {len(products)} products (total: {len(all_products)})")
            time.sleep(random.uniform(3, 7))

        return all_products

    def _parse_search_results(self, html):
        """Parse product listings from search results."""
        products = []

        # Try extracting from __NEXT_DATA__
        next_data = self._extract_json_data(html)
        if next_data:
            products = self._parse_next_data_search(next_data)

        # Fallback to HTML parsing
        if not products:
            products = self._parse_html_search(html)

        return products

    def _parse_next_data_search(self, data):
        """Parse search results from __NEXT_DATA__ JSON."""
        products = []

        try:
            # Navigate to search results in the JSON structure
            props = data.get("props", {}).get("pageProps", {})
            initial_data = props.get("initialData", {})
            search_result = initial_data.get("searchResult", {})
            items = search_result.get("itemStacks", [])

            for stack in items:
                for item in stack.get("items", []):
                    if item.get("__typename") != "Product":
                        continue

                    product = {
                        "item_id": item.get("usItemId"),
                        "product_id": item.get("productId"),
                        "title": item.get("name"),
                        "brand": item.get("brand"),
                        "price": item.get("priceInfo", {}).get("currentPrice", {}).get("price"),
                        "price_text": item.get("priceInfo", {}).get("currentPrice", {}).get("priceString"),
                        "was_price": item.get("priceInfo", {}).get("wasPrice", {}).get("price"),
                        "unit_price": item.get("priceInfo", {}).get("unitPrice"),
                        "rating": item.get("averageRating"),
                        "review_count": item.get("numberOfReviews"),
                        "seller": item.get("sellerName"),
                        "fulfillment": item.get("fulfillmentType"),
                        "in_stock": item.get("availabilityStatusV2", {}).get("value") == "IN_STOCK",
                        "url": f"https://www.walmart.com{item.get('canonicalUrl', '')}",
                        "image_url": item.get("imageInfo", {}).get("thumbnailUrl"),
                        "badges": [b.get("text") for b in item.get("badges", {}).get("flags", []) if b.get("text")],
                        "scraped_at": datetime.now().isoformat(),
                    }
                    products.append(product)

        except (KeyError, TypeError) as e:
            print(f"Error parsing JSON search results: {e}")

        return products

    def _parse_html_search(self, html):
        """Fallback HTML parser for search results."""
        soup = BeautifulSoup(html, "lxml")
        products = []

        cards = soup.select("div[data-item-id]")
        for card in cards:
            product = {}

            product["item_id"] = card.get("data-item-id")

            title_el = card.select_one("span[data-automation-id='product-title']")
            product["title"] = title_el.get_text(strip=True) if title_el else None

            price_el = card.select_one("div[data-automation-id='product-price'] span")
            if price_el:
                price_text = price_el.get_text(strip=True)
                product["price_text"] = price_text
                match = re.search(r"\$([\d,]+\.?\d*)", price_text)
                product["price"] = float(match.group(1).replace(",", "")) if match else None

            rating_el = card.select_one("span.w_iUH7")
            product["rating"] = rating_el.get_text(strip=True) if rating_el else None

            link_el = card.select_one("a[link-identifier]")
            if link_el:
                href = link_el.get("href", "")
                product["url"] = f"https://www.walmart.com{href}" if href.startswith("/") else href

            if product.get("title"):
                products.append(product)

        return products
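
With the search methods in place, a standalone run looks like the sketch below (the proxy URL is a placeholder for your provider's endpoint). Note that class-based selectors such as span.w_iUH7 in the HTML fallback come from Walmart's build tooling and change frequently; prefer the __NEXT_DATA__ path and treat the HTML parser as a best-effort backup.

scraper = WalmartScraper("http://username:password@gate.example-provider.com:8080")
results = scraper.search_products("wireless earbuds", num_pages=1)

for item in results[:5]:
    # Some fields may be None when the HTML fallback was used, since it
    # extracts fewer attributes than the JSON path.
    print(item.get("title"), "-", item.get("price_text"))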

Step 3: Extract Detailed Product Information

    def get_product_details(self, product_url):
        """Fetch detailed information for a single product."""
        html = self._fetch_page(product_url)
        if not html:
            return None

        next_data = self._extract_json_data(html)
        if not next_data:
            return self._parse_product_html(html)

        return self._parse_product_json(next_data)

    def _parse_product_json(self, data):
        """Parse detailed product data from __NEXT_DATA__."""
        try:
            props = data.get("props", {}).get("pageProps", {})
            initial_data = props.get("initialData", {}).get("data", {})
            product = initial_data.get("product", {})

            details = {
                "item_id": product.get("usItemId"),
                "product_id": product.get("productId"),
                "title": product.get("name"),
                "brand": product.get("brand"),
                "short_description": product.get("shortDescription"),
                "long_description": product.get("detailedDescription"),

                # Pricing
                "price": product.get("priceInfo", {}).get("currentPrice", {}).get("price"),
                "price_text": product.get("priceInfo", {}).get("currentPrice", {}).get("priceString"),
                "was_price": product.get("priceInfo", {}).get("wasPrice", {}).get("price"),
                "savings": product.get("priceInfo", {}).get("savings"),
                "price_per_unit": product.get("priceInfo", {}).get("unitPrice"),

                # Ratings and reviews
                "rating": product.get("averageRating"),
                "review_count": product.get("numberOfReviews"),

                # Availability
                "in_stock": product.get("availabilityStatus") == "IN_STOCK",
                "fulfillment_type": product.get("fulfillmentType"),
                "seller_name": product.get("sellerName"),
                "seller_id": product.get("sellerId"),

                # Category
                "category_path": [
                    cat.get("name") for cat in product.get("category", {}).get("path", [])
                ],

                # Specifications
                "specifications": {},

                # Images
                "images": [
                    img.get("url") for img in product.get("imageInfo", {}).get("allImages", [])
                    if img.get("url")
                ],

                # URL
                "url": f"https://www.walmart.com{product.get('canonicalUrl', '')}",

                "scraped_at": datetime.now().isoformat(),
            }

            # Extract specifications
            specs = product.get("specifications", [])
            for spec_group in specs:
                for spec in spec_group.get("specifications", []):
                    key = spec.get("name")
                    value = spec.get("value")
                    if key and value:
                        details["specifications"][key] = value

            return details

        except (KeyError, TypeError) as e:
            print(f"Error parsing product JSON: {e}")
            return None

    def _parse_product_html(self, html):
        """Fallback parser for product details from HTML."""
        soup = BeautifulSoup(html, "lxml")
        details = {}

        title_el = soup.select_one("h1[itemprop='name']")
        details["title"] = title_el.get_text(strip=True) if title_el else None

        price_el = soup.select_one("span[itemprop='price']")
        if price_el:
            details["price_text"] = price_el.get_text(strip=True)

        rating_el = soup.select_one("span.rating-number")
        details["rating"] = rating_el.get_text(strip=True) if rating_el else None

        desc_el = soup.select_one("div.about-desc")
        details["description"] = desc_el.get_text(strip=True) if desc_el else None

        return details
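
Continuing from the scraper instance above, fetching a single product's details is a one-liner; the URL below is illustrative and would normally come from search_products:

# Fetch one product and print a few of the parsed fields.
details = scraper.get_product_details(
    "https://www.walmart.com/ip/example-product/123456789"
)
if details:
    print(details.get("title"))
    print(details.get("price_text"), "| rating:", details.get("rating"))
    print("specifications parsed:", len(details.get("specifications", {})))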

Step 4: Extract Product Reviews

    def get_product_reviews(self, item_id, num_pages=3):
        """Fetch reviews for a product."""
        all_reviews = []

        for page in range(1, num_pages + 1):
            url = (
                f"https://www.walmart.com/reviews/product/{item_id}"
                f"?page={page}&sort=relevancy"
            )

            html = self._fetch_page(url)
            if not html:
                break

            soup = BeautifulSoup(html, "lxml")
            review_elements = soup.select("div[itemprop='review']")

            if not review_elements:
                # Try extracting from JSON
                next_data = self._extract_json_data(html)
                if next_data:
                    reviews = self._parse_reviews_json(next_data)
                    if reviews:
                        all_reviews.extend(reviews)
                    else:
                        break
                else:
                    break
            else:
                for el in review_elements:
                    review = {}

                    title_el = el.select_one("h3")
                    review["title"] = title_el.get_text(strip=True) if title_el else None

                    body_el = el.select_one("span[itemprop='reviewBody']")
                    review["body"] = body_el.get_text(strip=True) if body_el else None

                    rating_el = el.select_one("meta[itemprop='ratingValue']")
                    review["rating"] = rating_el.get("content") if rating_el else None

                    author_el = el.select_one("span[itemprop='author']")
                    review["author"] = author_el.get_text(strip=True) if author_el else None

                    date_el = el.select_one("meta[itemprop='datePublished']")
                    review["date"] = date_el.get("content") if date_el else None

                    if review.get("body"):
                        all_reviews.append(review)

            time.sleep(random.uniform(2, 5))

        return all_reviews

    def _parse_reviews_json(self, data):
        """Parse reviews from __NEXT_DATA__."""
        reviews = []
        try:
            props = data.get("props", {}).get("pageProps", {})
            review_data = props.get("initialData", {}).get("data", {}).get("reviews", {})
            customer_reviews = review_data.get("customerReviews", [])

            for rev in customer_reviews:
                review = {
                    "title": rev.get("reviewTitle"),
                    "body": rev.get("reviewText"),
                    "rating": rev.get("rating"),
                    "author": rev.get("userNickname"),
                    "date": rev.get("reviewSubmissionTime"),
                    "verified_purchase": rev.get("badges", {}).get("verifiedPurchaser", False),
                    "positive_feedback": rev.get("positiveFeedback"),
                    "negative_feedback": rev.get("negativeFeedback"),
                }
                reviews.append(review)

        except (KeyError, TypeError):
            pass

        return reviews
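
A quick way to exercise the review scraper is to take an item ID from a search result and compute a simple average rating. Ratings may come back as strings from the HTML path, so coerce defensively:

reviews = scraper.get_product_reviews("123456789", num_pages=2)  # placeholder item ID

ratings = []
for review in reviews:
    try:
        ratings.append(float(review["rating"]))
    except (KeyError, TypeError, ValueError):
        pass

if ratings:
    print(f"{len(reviews)} reviews, average rating {sum(ratings) / len(ratings):.2f}")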

Step 5: Run the Complete Pipeline

def main():
    proxy_url = "http://user:pass@proxy.dataresearchtools.com:8080"
    scraper = WalmartScraper(proxy_url)

    # Search for products
    search_queries = [
        "wireless earbuds",
        "laptop stand",
        "USB-C hub",
    ]

    all_products = []
    for query in search_queries:
        print(f"\nSearching Walmart for: {query}")
        products = scraper.search_products(query, num_pages=2)
        for p in products:
            p["search_query"] = query
        all_products.extend(products)
        time.sleep(random.uniform(5, 10))

    print(f"\nTotal search results: {len(all_products)}")

    # Get details for top products
    detailed = []
    for product in all_products[:10]:
        url = product.get("url")
        if url:
            print(f"Fetching details: {product.get('title', 'Unknown')[:50]}...")
            details = scraper.get_product_details(url)
            if details:
                detailed.append(details)
            time.sleep(random.uniform(4, 8))

    # Get reviews for top products
    for product in detailed[:5]:
        item_id = product.get("item_id")
        if item_id:
            print(f"Fetching reviews for item {item_id}...")
            reviews = scraper.get_product_reviews(item_id, num_pages=2)
            product["reviews"] = reviews
            product["reviews_scraped"] = len(reviews)
            print(f"  Got {len(reviews)} reviews")
            time.sleep(random.uniform(4, 8))

    # Save results
    with open("walmart_search_results.json", "w", encoding="utf-8") as f:
        json.dump(all_products, f, indent=2, ensure_ascii=False)

    with open("walmart_detailed.json", "w", encoding="utf-8") as f:
        json.dump(detailed, f, indent=2, ensure_ascii=False)

    # Analysis
    df = pd.DataFrame(all_products)
    df.to_csv("walmart_products.csv", index=False)

    print(f"\nResults Summary:")
    print(f"  Total products: {len(all_products)}")
    print(f"  Detailed products: {len(detailed)}")
    prices = [p["price"] for p in all_products if p.get("price")]
    if prices:
        print(f"  Price range: ${min(prices):.2f} - ${max(prices):.2f}")
        print(f"  Average price: ${sum(prices)/len(prices):.2f}")


if __name__ == "__main__":
    main()

Handling PerimeterX Bot Detection

PerimeterX (now HUMAN Security) is the primary challenge when scraping Walmart. Here are strategies to bypass it:

Use CloudScraper

The cloudscraper library handles JavaScript challenges automatically. It simulates browser TLS fingerprints and solves basic JavaScript challenges without requiring a full browser.
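
A standalone cloudscraper session mirrors the setup from Step 1; the sketch below uses a placeholder proxy URL. Heavier PerimeterX checks may still require a full browser, as covered in the next section.

import cloudscraper

scraper = cloudscraper.create_scraper(
    browser={"browser": "chrome", "platform": "windows", "mobile": False}
)
scraper.proxies = {
    "http": "http://username:password@gate.example-provider.com:8080",
    "https": "http://username:password@gate.example-provider.com:8080",
}

response = scraper.get("https://www.walmart.com", timeout=25)
print(response.status_code)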

Browser-Level Scraping

For the most difficult scenarios, use Playwright or Selenium with stealth plugins:

import time
import random

from playwright.sync_api import sync_playwright

def scrape_with_playwright(url, proxy_config):
    """Use Playwright for JavaScript-heavy pages."""
    with sync_playwright() as pw:
        browser = pw.chromium.launch(
            headless=True,
            proxy=proxy_config,
        )
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
            ),
        )

        page = context.new_page()
        page.goto(url, wait_until="networkidle")
        time.sleep(random.uniform(2, 4))

        content = page.content()
        browser.close()
        return content
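
Playwright is an extra dependency (pip install playwright, then playwright install chromium). Its proxy parameter takes a dictionary with a server key and optional username/password, so a call with placeholder credentials looks like this:

proxy_config = {
    "server": "http://gate.example-provider.com:8080",  # placeholder endpoint
    "username": "username",
    "password": "password",
}

html = scrape_with_playwright(
    "https://www.walmart.com/ip/example-product/123456789", proxy_config
)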

Cookie Persistence

Maintain cookies across requests. PerimeterX sets tracking cookies that must persist throughout your scraping session. Clearing cookies mid-session immediately flags you as a bot.
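
A simple way to persist cookies with a requests or cloudscraper session is to serialize the cookie jar to disk between runs. A minimal sketch (the filename is arbitrary):

import pickle

COOKIE_FILE = "walmart_cookies.pkl"

def save_cookies(session):
    # Persist the current jar so the next run reuses the same PerimeterX
    # cookies instead of starting a fresh, more suspicious session.
    with open(COOKIE_FILE, "wb") as f:
        pickle.dump(session.cookies, f)

def load_cookies(session):
    try:
        with open(COOKIE_FILE, "rb") as f:
            session.cookies.update(pickle.load(f))
    except FileNotFoundError:
        pass

Load cookies right after building the session and save them periodically. Because the cookies were earned by a specific IP, reuse them only with the same sticky proxy session.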

Proxy Best Practices for Walmart

  • Residential over datacenter: Always use residential or mobile proxies. Datacenter IPs are blocked almost instantly.
  • US-based IPs: Walmart.com primarily serves US customers. Use US-based proxy IPs for best results.
  • Rotation frequency: Rotate IPs every 10-15 requests, but maintain cookies within each IP session (see the sketch after this list).
  • Concurrent limits: Limit concurrent requests to 2-3 through different proxies. Mass concurrent requests trigger alerts.
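
A sketch of the rotation pattern above, assuming your provider supports sticky sessions via a token in the proxy username (a common but provider-specific convention; check your provider's documentation for the exact syntax):

import random
import string

def make_sticky_proxy_url(user, password, host="gate.example-provider.com", port=8080):
    # Hypothetical sticky-session format: the session token in the username
    # pins requests to one residential IP until the token changes.
    session_id = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
    return f"http://{user}-session-{session_id}:{password}@{host}:{port}"

product_urls = []  # fill with product page URLs gathered from search_products

scraper = WalmartScraper(make_sticky_proxy_url("username", "password"))
requests_on_ip = 0

for url in product_urls:
    if requests_on_ip >= 12:  # rotate roughly every 10-15 requests
        # New scraper = new session: fresh cookies tied to the fresh IP.
        scraper = WalmartScraper(make_sticky_proxy_url("username", "password"))
        requests_on_ip = 0
    scraper.get_product_details(url)
    requests_on_ip += 1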

Data Applications

Walmart product data supports numerous business operations:

  • Competitive pricing: Compare your prices against Walmart’s marketplace sellers for the same products.
  • Inventory monitoring: Track stock levels for products you sell or source from Walmart.
  • Review analysis: Mine customer reviews for product quality insights and feature requests.
  • Seller intelligence: Monitor third-party sellers on Walmart Marketplace for competitive positioning.
  • Category trends: Analyze bestseller rankings and new product launches across categories.

Conclusion

Scraping Walmart product pages requires overcoming PerimeterX bot detection, JavaScript rendering challenges, and aggressive rate limiting. The Python framework in this guide provides multiple approaches, from embedded-JSON extraction to fallback HTML parsing, with anti-detection measures at each layer.

Residential proxies from DataResearchTools provide the foundation for sustainable Walmart scraping by delivering trusted IP addresses that bypass bot detection systems. For additional web scraping techniques and proxy terminology, explore our proxy glossary.

