How to scrape Auchan Drive France grocery data
Scrape Auchan Drive France and you tap into one of the largest hypermarket grocery catalogues in Europe, served through a click-and-collect model that varies prices by store. Auchan operates a federation of physical hypermarkets and Drive pickup points across France, with each Drive store maintaining its own catalogue and its own price list within the corporate pricing framework. The scraping landscape is shaped by three things: postcode-driven store selection that gates every product page, a JSON catalogue API that returns store-specific availability, and a Cloudflare front end that profiles non-French traffic aggressively.
This guide focuses on Auchan Drive specifically, which is the click-and-collect property at www.auchan.fr/drive. Auchan also operates a separate home delivery property and a hypermarket banner catalogue with different APIs.
The store-selection prerequisite
Every Auchan product page requires a store context before it returns prices. The store context is set by selecting a postal code or a specific Drive location, which sets a mag cookie that all subsequent requests must include. Without the store cookie, you get either redirects to the homepage or a stripped-down view with no pricing.
import httpx

async def select_store(postcode: str, proxy: str):
    async with httpx.AsyncClient(proxy=proxy, follow_redirects=True, timeout=20) as c:
        # Hit the store selector endpoint to establish a session for this postcode
        r = await c.get(f"https://www.auchan.fr/store-selector?postalCode={postcode}")
        # The mag cookie identifying the selected Drive store should now be set
        return c.cookies
Different postcodes resolve to different Drive stores. If you want catalogue coverage across all of France, you need to enumerate the Drive store list and rotate through stores in your scraping pipeline. The store list is published as a static JSON endpoint at https://www.auchan.fr/api/stores, which returns roughly 200 active Drive locations.
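As a sketch of that enumeration step, the snippet below pulls the store list and keeps only the identifiers the pipeline needs. It reuses the httpx setup from the earlier snippet, and the id and postalCode field names are assumptions; inspect one live payload before relying on them.
async def list_drive_stores(proxy: str) -> list[dict]:
    # Fetch the published store list; the response shape is assumed, not documented
    async with httpx.AsyncClient(proxy=proxy, timeout=20) as c:
        r = await c.get("https://www.auchan.fr/api/stores", headers={"Accept": "application/json"})
        r.raise_for_status()
        stores = r.json()
    # Keep only the fields needed to select and identify each Drive store
    return [{"id": s.get("id"), "postal_code": s.get("postalCode")} for s in stores]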
The Auchan product API
Once a store is selected, product details come from the catalogue API at https://www.auchan.fr/api/v1/products/<gtin>. The endpoint returns the canonical product object along with the store-specific price, promotion, and stock state.
async def fetch_product(gtin: str, store_cookies, proxy: str):
    url = f"https://www.auchan.fr/api/v1/products/{gtin}"
    async with httpx.AsyncClient(proxy=proxy, cookies=store_cookies, timeout=20) as c:
        r = await c.get(url, headers={"Accept": "application/json"})
        if r.status_code == 200:
            return r.json()
        return None
The response includes gtin, name, brand, price, pricePerUnit, unit, inStock, promotion, nutritionalInfo, and categoryPath. The pricePerUnit field is the most analytically useful for grocery work because grocery products are sold in highly variable pack sizes and the per-unit price is the only fair comparison across formats.
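A thin normalization layer keeps the downstream schema stable even if the API adds fields. The sketch below is illustrative only: it maps the field names listed above into the flat record the snapshot table expects, and those key names should be verified against a live response.
def normalize_product(payload: dict, store_id: str) -> dict:
    # Flatten the API payload into snapshot-table columns; field names mirror
    # the list above and are assumptions until verified against a real response
    return {
        "store_id": store_id,
        "gtin": payload.get("gtin"),
        "name": payload.get("name"),
        "brand": payload.get("brand"),
        "price_eur": payload.get("price"),
        "price_per_unit_eur": payload.get("pricePerUnit"),
        "unit": payload.get("unit"),
        "in_stock": payload.get("inStock"),
        "promotion_text": payload.get("promotion"),
        "category_path": payload.get("categoryPath"),
    }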
Crawling the category tree
Auchan’s category tree is exposed at https://www.auchan.fr/api/v1/categories. It returns a nested JSON of department, sub-department, and category nodes. Each terminal category has a slug that is used in listing URLs.
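To feed the listing crawl, you typically flatten that nested tree into its terminal categories. A minimal sketch, assuming each node carries id, slug, and a children list (the slug is documented above; the other key names are assumptions to check against one real response):
def terminal_categories(tree: list[dict]) -> list[dict]:
    # Walk the nested category tree and collect leaf nodes only,
    # since those are the ids used by the listing endpoint
    leaves = []
    stack = list(tree)
    while stack:
        node = stack.pop()
        children = node.get("children") or []
        if children:
            stack.extend(children)
        else:
            leaves.append({"id": node.get("id"), "slug": node.get("slug")})
    return leaves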
Listing endpoints look like https://www.auchan.fr/api/v1/products?categoryId=<id>&page=<n>&size=50. The pagination cap is 1,000 products per category, which is enough for most categories but requires faceted decomposition for the largest ones (épicerie sucrée, boissons).
async def crawl_category(category_id: int, store_cookies, proxy_pool, max_pages: int = 20):
    all_items = []
    for page in range(1, max_pages + 1):
        proxy = proxy_pool.next()
        url = "https://www.auchan.fr/api/v1/products"
        params = {"categoryId": category_id, "page": page, "size": 50}
        async with httpx.AsyncClient(proxy=proxy, cookies=store_cookies, timeout=20) as c:
            r = await c.get(url, params=params, headers={"Accept": "application/json"})
        if r.status_code != 200:
            break
        data = r.json()
        items = data.get("products", [])
        if not items:
            break
        all_items.extend(items)
    return all_items
For a daily catalogue snapshot across all stores, the work scales linearly with the number of stores you cover. A national snapshot covering 200 stores at 50,000 SKUs each is 10M product reads per day. That is achievable on a small mobile proxy pool with reasonable parallelism, but it requires careful work scheduling.
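As a sanity check on that claim, here is the arithmetic, using the roughly 2 requests/second per-IP ceiling described in the rate-limit section below:
# Rough sizing for a daily national snapshot (numbers from the text above)
stores = 200
skus_per_store = 50_000
reads_per_day = stores * skus_per_store          # 10,000,000 product reads
avg_rps_needed = reads_per_day / 86_400          # ~116 requests/second sustained
ips_needed = avg_rps_needed / 2                  # ~58 IPs at 2 req/s per IP
print(round(avg_rps_needed), round(ips_needed))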
Multi-store price comparison
The most analytically interesting Auchan dataset is the price variation across stores for the same SKU. Even within Auchan, prices for identical products differ by store based on local competitive pressure, promotional schedules, and regional sourcing. For brand owners, this surfaces where Auchan is using a SKU as a loss leader versus where they are pricing at full margin.
CREATE TABLE auchan_snapshot (
snapshot_at TIMESTAMP NOT NULL,
store_id VARCHAR(32) NOT NULL,
gtin VARCHAR(14) NOT NULL,
price_eur DECIMAL(10,2),
price_per_unit_eur DECIMAL(12,4),
unit VARCHAR(8),
in_stock BOOLEAN,
promotion_text TEXT,
PRIMARY KEY (snapshot_at, store_id, gtin)
);
A price-variance view across stores reveals interesting patterns. Stores in dense urban Paris consistently price higher than stores in suburban regions. Stores near a major Lidl or Aldi competitor consistently price 5-10% lower on the categories where the competitor is strongest.
Proxy strategy for Auchan
Auchan’s bot detection sits on Cloudflare with additional behavioral analysis. French residential or mobile IPs are required for stable scraping. EU residential pools sometimes work but the success rate is meaningfully lower than France-specific inventory.
For workloads under 5,000 product reads per day, a small French residential pool with sticky sessions of 10-20 minutes is sufficient. For higher volumes, a dedicated mobile port on Orange or SFR is the cleaner path. The mobile IP costs more but its trust score is high enough to sustain 10+ requests per second for hours without challenges.
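The proxy_pool object used in the earlier crawl loop can be as simple as a round-robin list with sticky sessions. A minimal sketch, assuming your provider exposes session-pinned proxy URLs; the 900-second default sits inside the 10-20 minute sticky window described above:
import itertools
import time
class ProxyPool:
    def __init__(self, proxy_urls: list[str], sticky_seconds: int = 900):
        # Round-robin over provider endpoints, holding each one for a sticky window
        self._cycle = itertools.cycle(proxy_urls)
        self._sticky = sticky_seconds
        self._current = next(self._cycle)
        self._since = time.time()
    def next(self) -> str:
        # Rotate only when the sticky window has elapsed
        if time.time() - self._since > self._sticky:
            self._current = next(self._cycle)
            self._since = time.time()
        return self._current
Pass an instance of this into crawl_category as proxy_pool and tune sticky_seconds to your provider's session behavior.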
Rate limits and request shaping
Auchan does not publish rate limits but the observed behavior is consistent: roughly 2 requests per second per IP is the safe sustained rate. Bursts up to 10 requests per second work for short periods. Beyond that you hit Cloudflare challenges that take 10-30 minutes to clear per IP.
The request shaping that performs best on Auchan is store-by-store sequential scraping rather than SKU-by-SKU parallel scraping. Each store has its own session cookie, and rotating store cookies on every request adds overhead without improving throughput. Stick with one store cookie per IP for the duration of that store’s catalogue sweep, then rotate.
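To stay under those observed limits, a small token bucket per IP works well: refill at the sustained rate, cap the bucket at the burst size. A sketch using the 2 req/s sustained and 10-request burst figures above:
import asyncio
import time
class TokenBucket:
    def __init__(self, rate: float = 2.0, burst: int = 10):
        self.rate = rate          # sustained requests per second
        self.capacity = burst     # short-burst allowance
        self.tokens = float(burst)
        self.updated = time.monotonic()
    async def acquire(self):
        while True:
            now = time.monotonic()
            # Refill tokens based on elapsed time, capped at the burst size
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            await asyncio.sleep((1 - self.tokens) / self.rate)
Keep one bucket per IP and await acquire() before each request routed to that IP.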
Detecting and routing around CAPTCHA challenges on Auchan
When Auchan flags your traffic, the response is usually a Cloudflare interstitial challenge page rather than a clean HTTP error. Your scraper needs to detect this content swap explicitly. Look for the cf-mitigated response header, the presence of __cf_chl_ cookies, or HTML containing "Just a moment...". Treat any of these as a soft block.
def is_challenged(response) -> bool:
    if response.status_code in (403, 503):
        return True
    if "cf-mitigated" in response.headers:
        return True
    if "__cf_chl_" in response.headers.get("set-cookie", ""):
        return True
    body = response.text[:2000].lower()
    return "just a moment" in body or "checking your browser" in body
When you detect a challenge, do not retry on the same IP for at least 30 minutes. Mark that IP as cooling and route subsequent requests to a different IP in your pool. Aggressive retries on a flagged IP cause the cooling window to extend and can lead to long-term blacklisting of your subnet.
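A cooldown registry that enforces the 30-minute rule is only a few lines. A sketch your request router can consult before picking an IP:
import time
class CooldownRegistry:
    def __init__(self, cooldown_seconds: int = 1800):
        self.cooldown = cooldown_seconds
        self._flagged = {}   # ip -> time the challenge was seen
    def flag(self, ip: str):
        # Called whenever is_challenged() returns True for a response from this IP
        self._flagged[ip] = time.time()
    def is_cooling(self, ip: str) -> bool:
        seen = self._flagged.get(ip)
        return seen is not None and (time.time() - seen) < self.cooldown
Call flag() wherever is_challenged() fires, and have the router skip any IP where is_cooling() is true.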
For pages that absolutely must be fetched, have a fallback path that uses a headless browser with a real France residential IP. The browser path costs more per page but solves the small percentage of challenges that the API path cannot handle. Most production setups maintain a 95/5 split between the lightweight HTTP path and the browser fallback path.
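The routing between the two paths can stay simple: try the HTTP path first and only hand challenged URLs to the browser path. A sketch, where fetch_via_api and fetch_via_browser are placeholders for your own two fetchers rather than real library calls:
async def fetch_with_fallback(url: str, ip: str, cooldowns: CooldownRegistry):
    # Lightweight HTTP path first; browser fallback only for challenged pages
    response = await fetch_via_api(url, ip)      # placeholder: your httpx-based fetcher
    if not is_challenged(response):
        return response
    cooldowns.flag(ip)                           # start the 30-minute cooling window
    return await fetch_via_browser(url)          # placeholder: headless browser on a French residential IP
Tracking the fraction of URLs that take the browser path tells you whether you are staying near the 95/5 split.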
Working with EUR pricing and FX normalization
Pricing in France is denominated in EUR, and any cross-market analysis requires careful FX normalization. The naive approach of converting at scrape time using a live FX feed introduces noise into your trend lines because exchange rate movements get conflated with real price changes. Store the price in local EUR and apply FX conversion at query time using a daily reference rate.
CREATE TABLE fx_rates (
rate_date DATE NOT NULL,
base_ccy VARCHAR(3) NOT NULL,
quote_ccy VARCHAR(3) NOT NULL,
rate DECIMAL(18,8) NOT NULL,
PRIMARY KEY (rate_date, base_ccy, quote_ccy)
);
Source the daily rates from a reliable feed such as the European Central Bank reference rates or your bank’s wholesale feed. Avoid scraping retail FX rates because they include the bank’s spread and produce inconsistent comparisons.
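If you go with the ECB reference rates, the daily feed is straightforward to load into the fx_rates table. A minimal sketch against the publicly published eurofxref-daily.xml feed (verify the URL is still current before depending on it):
import httpx
import xml.etree.ElementTree as ET
def fetch_ecb_rates() -> dict[str, float]:
    # Daily EUR reference rates from the ECB's published feed
    url = "https://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml"
    r = httpx.get(url, timeout=20)
    r.raise_for_status()
    root = ET.fromstring(r.text)
    rates = {}
    # Walk every element and keep the ones carrying currency/rate attributes,
    # which avoids hard-coding the feed's XML namespace
    for node in root.iter():
        if "currency" in node.attrib and "rate" in node.attrib:
            rates[node.attrib["currency"]] = float(node.attrib["rate"])
    return rates
Upsert the returned dict into fx_rates once per day with EUR as the base currency.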
Comparing Auchan to other regional marketplaces
| Marketplace | Country focus | Catalogue scale | Bot strictness |
|---|---|---|---|
| Auchan Drive | France | Large | High |
| Carrefour Drive | France | Medium | Medium |
| Leclerc Drive | France | Smaller | Lower |
Cross-marketplace analyses help separate platform-specific dynamics from genuine market trends. If a price drops on Auchan Drive but stays flat across the comparable competitors, that is a platform-driven event rather than a market-wide signal. Your scraping pipeline should ingest from at least three platforms in any market where you intend to publish category insights.
Operational monitoring and alerting
Every production scraper needs three monitoring layers regardless of target. The first is per-IP success rate over a 5-minute window, alerting if any IP drops below 80%. The second is parser error rate, alerting if more than 1% of fetched pages fail to extract the canonical fields. The third is data freshness, alerting if your downstream consumers see snapshots more than 24 hours old.
import time
from collections import deque
class IPHealthTracker:
    def __init__(self, window_seconds: int = 300):
        self.window = window_seconds
        self.events = {}
    def record(self, ip: str, success: bool):
        bucket = self.events.setdefault(ip, deque())
        now = time.time()
        bucket.append((now, success))
        while bucket and bucket[0][0] < now - self.window:
            bucket.popleft()
    def success_rate(self, ip: str) -> float:
        bucket = self.events.get(ip)
        if not bucket:
            return 1.0
        successes = sum(1 for _, ok in bucket if ok)
        return successes / len(bucket)
Wire this into Prometheus or your existing observability stack so the on-call engineer sees IP degradation as it happens rather than after the daily snapshot fails.
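As a minimal sketch of that wiring, using the prometheus_client library (the metric name and port are arbitrary choices, not conventions Auchan or Prometheus mandate):
from prometheus_client import Gauge, start_http_server
ip_success_rate = Gauge("scraper_ip_success_rate", "Rolling success rate per proxy IP", ["ip"])
def export_ip_health(tracker: IPHealthTracker, ips: list[str]):
    # Push the rolling-window success rate for each IP into the gauge;
    # call this periodically from the scraping loop or a background task
    for ip in ips:
        ip_success_rate.labels(ip=ip).set(tracker.success_rate(ip))
# Expose /metrics for Prometheus to scrape, on an arbitrary port
start_http_server(9108)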
Legal and compliance considerations for France
Public product, price, and availability data are generally treated as fair to scrape in most jurisdictions, but France has its own consumer protection and personal data frameworks. Confine your collection to non-personal data: SKU identifiers, prices, descriptions, ratings as aggregates, and seller display names. Avoid collecting individual buyer reviews with names, phone numbers, or email addresses attached, and avoid pulling any data behind a login.
For commercial deployment, document your basis for processing, your data retention period, and your purpose limitation. Most data protection regimes treat scraped public data more favorably when there is a clear lawful basis and the data is not used for direct marketing to identified individuals. Published guidance on responsible web data collection, such as industry codes of conduct, remains a useful starting point for documenting your approach.
Pipeline orchestration and scheduling
For any non-trivial scraping operation, a dedicated orchestration layer is the difference between a script you babysit and a service that runs unattended. The two strong open-source choices in 2026 are Prefect 3 and Dagster. Both handle the patterns you need: DAG dependencies, retries, observability, secret management, and dynamic fan-out across IPs and categories.
from prefect import flow, task
@task(retries=3, retry_delay_seconds=60)
def fetch_category(category_id: int, page: int):
    return crawl_one_page(category_id, page)
@task
def store_pages(pages: list):
    write_to_db(pages)
@flow(name="auchan-daily-sweep")
def daily_sweep(category_ids: list):
    futures = []
    for cid in category_ids:
        for page in range(1, 50):
            futures.append(fetch_category.submit(cid, page))
    pages = [f.result() for f in futures]
    store_pages(pages)
Run the flow on a 6-hour or 24-hour schedule depending on how dynamic the underlying catalogue is. For fast-moving segments like fresh produce, a 6-hour cadence catches the meaningful movements without driving up proxy costs unnecessarily.
Sample analytics queries on the collected dataset
Once your snapshots are landing reliably, the analytics layer is where the value materializes. A few queries that consistently come up across Auchan datasets, written against the auchan_snapshot table defined earlier (the stock-out query assumes you also persist a category_id derived from the API's categoryPath):
-- Top 50 GTINs by price drop in the last 7 days
SELECT gtin, MIN(price_eur) - MAX(price_eur) AS price_drop
FROM auchan_snapshot
WHERE snapshot_at > now() - interval '7 days'
GROUP BY gtin
ORDER BY price_drop ASC
LIMIT 50;
-- Stock-out frequency per category over the last 30 days
SELECT category_id,
       SUM(CASE WHEN NOT in_stock THEN 1 ELSE 0 END)::float / COUNT(*) AS oos_rate
FROM auchan_snapshot
WHERE snapshot_at > now() - interval '30 days'
GROUP BY category_id
ORDER BY oos_rate DESC;
Add a brand share view, a seller concentration view, and a campaign-frequency view and you have a competitive intelligence product. The collection layer is the prerequisite; the analytics layer is where you create defensible value.
Building robust deduplication across noisy listings
When you scrape any ecommerce marketplace at scale, the long-tail catalogue is full of near-duplicate listings. The same physical product appears under different titles, different sellers, slight variations in pack size, and slightly different image sets. For analytics that try to compute brand share or category trends, deduplication is mandatory and it is harder than it looks.
The standard approach uses a three-pass funnel. The first pass groups by exact match on EAN or GTIN where present. The second pass groups by normalized title plus brand using a TF-IDF cosine similarity threshold of 0.85. The third pass groups by image hash similarity using perceptual hashing. Each pass merges groups produced by the previous pass.
from PIL import Image
import imagehash
def perceptual_hash(image_path: str) -> str:
    img = Image.open(image_path)
    return str(imagehash.phash(img, hash_size=16))
def normalize_title(title: str) -> str:
    title = title.lower()
    for token in ["[free shipping]", "[same day]", "(new)", "*sale*"]:
        title = title.replace(token, "")
    return " ".join(title.split())
Tune the similarity thresholds against a hand-labeled gold set of 500 to 1,000 known duplicate clusters. Without a gold set, you will either over-merge (collapsing distinct variants) or under-merge (leaving the same product in many groups). Both failure modes break downstream analytics.
Versioning your scraper for catalogue evolution
Every ecommerce site evolves its catalogue structure regularly. New attribute fields appear, old fields are deprecated, category trees are reorganized, and pricing display logic changes. Your scraper code has to evolve with these changes, and a versioning pattern that keeps old data interpretable is critical.
The pattern that works best is to stamp every snapshot row with the scraper version that produced it. When you deploy a new version of the parser, increment the version number. Downstream analytics can filter by version when they need consistent semantics across a time range, or join across versions when they want long-running trend analysis.
ALTER TABLE auchan_snapshot ADD COLUMN scraper_version VARCHAR(16);
CREATE INDEX scraper_version_idx ON auchan_snapshot(scraper_version);
Pair this with a small registry table that documents what each scraper version did differently. When a downstream user asks why a particular metric jumped on a specific date, the version registry usually has the answer.
Caching strategy and incremental crawls
Full daily snapshots scale linearly with catalogue size, which becomes expensive at multi-million SKU scale. Most production deployments shift from full snapshots to incremental refreshes after the initial ramp. The pattern uses three signals to decide what to refetch on each cycle.
The first signal is freshness deadline. Every SKU has a maximum staleness budget (say 24 hours), and any SKU older than its budget gets refreshed.
The second signal is volatility. SKUs that have changed price recently get higher refresh priority because they are more likely to change again. SKUs that have been stable for weeks can drop to a longer refresh interval.
The third signal is business priority. SKUs that downstream users actually query (tracked by query logs) get higher refresh priority than dormant SKUs that nobody has looked at in months.
from datetime import datetime, timezone
def schedule_refresh(sku_id: int, last_changed_at, last_queried_at, last_fetched_at) -> float:
    """Returns a priority score; higher = refresh sooner."""
    now = datetime.now(timezone.utc)
    age = (now - last_fetched_at).total_seconds() / 3600
    volatility = 10 if (now - last_changed_at).days < 7 else 1
    relevance = 5 if (now - last_queried_at).days < 1 else 1
    return age * volatility * relevance
This kind of priority-driven scheduler reduces total request volume by 60-80% compared to blind full snapshots, while keeping the data fresh on the SKUs that actually matter to the business.
Common pitfalls when scraping Auchan Drive
Three issues recur. The first is store-level price variance. Auchan Drive prices are set per fulfillment store, not nationally. The same SKU can vary by 5-15% between Lille and Marseille. A scraper that does not pin a store_id averages across stores and loses the geographic price signal that makes Drive data analytically interesting. Always store the store_id alongside every price.
The second is unit-of-sale ambiguity. Fresh produce is sold by weight (per kilo) but packaged in approximate units. The product page shows both prix au kilo and prix à la pièce. The realized cost depends on the actual weight at checkout. For trend analysis, pin to prix au kilo and ignore the per-piece estimate.
The third is loyalty-program price drift. Auchan Waaoh card holders see different prices for some promotional SKUs. Anonymous scraping captures the public price. Cardholder scraping requires session cookies that expire every few hours. Decide which population your analysis serves and stay consistent.
FAQ
Why does the same product show different prices on different Auchan Drive stores?
Auchan Drive stores set their own pricing within a corporate framework. Local competition (proximity to a Leclerc, Lidl, or Carrefour) and regional sourcing costs both influence the per-store price. Multi-store snapshots are the only way to see the full pricing landscape.
Does Auchan offer an affiliate or partner API?
Auchan participates in affiliate networks that expose a subset of the catalogue with commission tracking, but the affiliate feeds are not designed for competitive intelligence. They lag the live catalogue by 24-48 hours and exclude promotions. For real-time data, scraping the public catalogue is the only path.
Can I scrape Auchan from a Belgian or Spanish residential IP?
Auchan operates in multiple European countries with separate sites. Auchan Belgium uses a different domain (auchan.be) and a different catalogue. For French data specifically, French IPs are strongly preferred. Belgian or Spanish IPs work for short bursts but degrade quickly.
How fresh are the prices in the API response?
Auchan refreshes prices in batches. The catalogue API reflects the current published price with a CDN cache lifetime of 60-120 seconds. For most monitoring use cases that is real-time enough. For high-frequency competitor tracking, hitting the API every 5-10 minutes catches all meaningful changes.
Are there legal considerations specific to scraping French ecommerce sites?
France enforces GDPR strictly and has additional consumer protection rules through the DGCCRF. Limit your collection to non-personal product, price, and availability data. Avoid scraping any personal data, any logged-in pages, and any data that includes individual customer reviews with identifiable information.
How often do Drive prices change?
Most SKUs reprice weekly on Sunday night for the new promotional period. Fresh produce and fish reprice daily. Snapshots every Monday morning capture the canonical weekly state.
Does Auchan Drive expose stock counts?
Only as a binary in_stock flag and an occasional ‘low stock’ badge. True quantity is hidden. For demand-signal analytics, treat the in_stock flag as the only reliable inventory feature.
To build a broader European grocery intelligence stack, browse the ecommerce scraping category for tooling reviews, proxy comparisons, and framework deep dives that pair with the patterns above.