GDPR compliance for web scraping in 2026: a practical guide

GDPR compliance is among the most commonly cited legal blockers that engineering teams hit when they try to scale data collection across European targets. The regulation does not ban scraping. It does not even mention scraping. What it does is impose a strict regime on how personal data is collected, processed, and stored, and almost every meaningful scraping project sweeps up at least some personal data along the way. The result is a gap between what teams think they are doing (collecting public information) and what regulators see (processing personal data without a documented lawful basis). This guide walks through what GDPR actually says, how the EU enforcement environment shifted in 2024 and 2025, what scraping operators in 2026 must put in place to stay defensible, and where the live court rulings draw the line.

The guide is built for technical leads, data engineers, and product owners who already run or are planning a scraping pipeline that touches EU traffic. It is not a substitute for counsel. It is a working framework so you walk into that conversation with the right map.

What GDPR actually covers in scraping context

The General Data Protection Regulation (Regulation (EU) 2016/679) applies to the processing of personal data of individuals in the EU and EEA, regardless of where the controller or processor sits. That extraterritorial reach (Article 3) is the part most non-EU teams underestimate. If you scrape a dataset that contains EU residents’ personal data, GDPR can apply even if your servers are in Singapore, your team is in the US, and the website you scraped is hosted in Brazil. For operators with no EU establishment, the Article 3(2) trigger is offering goods or services to people in the EU or monitoring their behaviour there.

Personal data under Article 4(1) is any information relating to an identified or identifiable natural person. The bar is very low. A name, an email, a username, an IP address, a cookie identifier, a device fingerprint, a profile photo, a location signal, a job title combined with a company, even a forum post that reveals an opinion under a re-identifiable pseudonym: all of these qualify. The European Data Protection Board (EDPB) has consistently taken an expansive view. If a human can plausibly be re-identified from your dataset, even after combination with other public sources, the data is personal.

Processing under Article 4(2) is also broad: collection, storage, structuring, retrieval, dissemination, alignment, and erasure all count. Scraping is collection. Storing the result in a database is storage. Running it through an LLM is processing. Sending it to a customer is dissemination. Each step needs a lawful basis.

The lawful bases are set out in Article 6. The two that matter for scraping are consent (almost always impossible to obtain at scrape time) and legitimate interest (the workhorse for B2B and research scraping, but it requires a documented Legitimate Interest Assessment, often called an LIA). The remaining bases (contract, legal obligation, vital interest, public interest) rarely apply.

For a wider tour of the personal-versus-public-data question, see the personal vs public data scraping framework. For US-side parallels, the CCPA compliance guide for scrapers is the right next read.

The legitimate interest pathway in practice

Article 6(1)(f) allows processing where it is necessary for the legitimate interests pursued by the controller, except where those interests are overridden by the rights and freedoms of the data subject. That balancing test is the entire game.

A practical LIA has three parts. First, the purpose test: is the interest you are pursuing genuine, lawful, and clearly articulated? Market research, fraud prevention, journalism, and competitive intelligence all generally pass. Mass profile harvesting for cold outbound spam usually does not.

Second, the necessity test: can the same outcome be achieved with less personal data? If your downstream use case only needs aggregated counts, you should not be storing names. If you need company-level signals, you should not be storing individual contributor profiles. Data minimisation under Article 5(1)(c) is not optional.
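
One way to make the necessity test concrete is a field allowlist applied before anything is written to storage. The sketch below is illustrative only: the field names and the `ALLOWED_FIELDS` set are assumptions for a company-level use case, not part of any standard schema.

```python
from typing import Any

# Hypothetical allowlist: the only fields a company-level use case needs.
ALLOWED_FIELDS = {"company_name", "company_domain", "headcount_band", "source_url"}


def minimise(raw_record: dict[str, Any], allowed: set[str] = ALLOWED_FIELDS) -> dict[str, Any]:
    """Keep only allowlisted fields; everything else is dropped before storage."""
    return {key: value for key, value in raw_record.items() if key in allowed}


raw = {
    "company_name": "Acme Corp",
    "company_domain": "acmecorp.com",
    "contact_name": "Jane Doe",               # personal data, not needed
    "contact_email": "jane.doe@acmecorp.com",  # personal data, not needed
    "source_url": "https://example.com/directory/acme",
}
stored = minimise(raw)
print(sorted(stored))
```

An allowlist fails closed: a new field that appears on the target site is dropped until someone decides it is needed, which is easier to defend than a blocklist that has to anticipate every identifier.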

Third, the balancing test: do the data subjects’ interests, rights, or fundamental freedoms override yours? This is where you weigh expectations. A user who posted on a public forum reasonably expects that post to be readable by other humans. They probably do not expect it to be ingested by a competitor’s LLM training pipeline at industrial scale and re-surfaced in unrelated contexts. The test is contextual.

Document the LIA. Date it. Sign it. Re-run it whenever the scope, source list, or downstream use changes. EU regulators in 2025 increasingly asked for the LIA on first contact during investigations, and absence of one is itself evidence of non-compliance.
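
The "re-run it whenever the scope changes" rule can be checked mechanically if the LIA is stored as structured data next to the pipeline configuration. The following is a minimal sketch under that assumption; the `LIARecord` fields and the `needs_rerun` helper are hypothetical names, and a real LIA carries far more narrative than three strings.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LIARecord:
    """One dated, signed assessment for one scraping activity (hypothetical schema)."""
    purpose: str                      # purpose test
    necessity: str                    # necessity test
    balancing: str                    # balancing test
    sources: tuple[str, ...]          # source list the assessment covered
    downstream_uses: tuple[str, ...]  # uses the assessment covered
    signed_by: str
    signed_on: date


def needs_rerun(lia: LIARecord, current_sources: list[str], current_uses: list[str]) -> bool:
    """True when the live pipeline no longer matches what the LIA assessed."""
    return (
        set(current_sources) != set(lia.sources)
        or set(current_uses) != set(lia.downstream_uses)
    )


lia = LIARecord(
    purpose="Competitive pricing research",
    necessity="Only product and price fields are collected",
    balancing="No individual-level data retained",
    sources=("shop.example.com",),
    downstream_uses=("pricing-dashboard",),
    signed_by="data-protection-lead",
    signed_on=date(2026, 1, 15),
)
print(needs_rerun(lia, ["shop.example.com"], ["pricing-dashboard", "llm-training"]))
```

Running a check like this in CI, or before a new target is queued, turns a silent scope expansion into a visible failure.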

Compliance checklist for scraping operators

| Control | What it requires | Why it matters |
| --- | --- | --- |
| Lawful basis documented | Written LIA per scraping target | Article 6 evidence |
| Privacy notice published | Public page describing your processing | Article 14 transparency |
| Data minimisation by design | Scrape only fields you actually use | Article 5(1)(c) |
| Purpose limitation | Define use cases; do not silently expand | Article 5(1)(b) |
| Storage limits | Set retention windows; auto-delete | Article 5(1)(e) |
| Right-to-erasure mechanism | Public email or form for deletion requests | Article 17 |
| DPO or contact appointed | If at scale, appoint a Data Protection Officer | Articles 37-39 |
| Records of processing | Article 30 register listing each scraping activity | Article 30 |
| DPIA where high risk | Required for large-scale profiling or sensitive data | Article 35 |
| Vendor and processor agreements | Article 28 contracts with proxy and storage providers | Article 28 |
| Cross-border transfer mechanism | SCCs or adequacy if data leaves EEA | Chapter V |
| Breach response plan | 72-hour notification protocol to supervisory authority | Article 33 |
| Pseudonymisation | Strip direct identifiers where not strictly needed | Article 25 |

A team that ticks every row above is genuinely defensible. A team that ticks fewer than half is one regulator letter away from a problem.

How EU enforcement shifted in 2024 and 2025

Two trends matter. First, regulators moved from reactive enforcement (responding to complaints) to proactive sweeps targeting AI training data, news scraping, and B2B contact databases. The Italian Garante, the French CNIL, and the Dutch Autoriteit Persoonsgegevens all opened investigations of scraping operators in 2024 and 2025, several of which closed with seven-figure fines. Second, the line between data controller and data processor in scraping pipelines tightened. Buying scraped datasets from third parties no longer insulates you. If you process the data, you are a controller for that processing, full stop, and your Article 28 contract with the seller does not change that.

The Meta v. Bright Data ruling from the US District Court for the Northern District of California (January 2024), while not a GDPR case, was widely cited by EU regulators in 2025 because it addressed the same fact pattern: scraping public profiles at scale. The court found that scraping logged-out public data did not violate Meta’s terms (they only bind logged-in users), but EU regulators were quick to point out that the absence of a contract violation does not equal a lawful basis under GDPR. Public availability is not a get-out-of-GDPR card.

For the deeper US-side analysis of similar fact patterns, see the hiQ Labs v. LinkedIn ruling explainer.

Decision tree for an EU-touching scrape

Use this before you queue a new target.

Q1: Does the target site host personal data of EU/EEA residents?
    ├── No  -> GDPR likely not in scope. Document the assessment.
    └── Yes -> Q2
Q2: Do you have a documented lawful basis (LIA preferred)?
    ├── No  -> Stop. Write the LIA first.
    └── Yes -> Q3
Q3: Have you minimised fields to only what you need?
    ├── No  -> Trim the schema before launching.
    └── Yes -> Q4
Q4: Is there a published privacy notice describing this processing?
    ├── No  -> Publish the notice; link it in your contact form.
    └── Yes -> Q5
Q5: Is data leaving the EEA after processing?
    ├── Yes -> Confirm SCCs or adequacy decision are in place.
    └── No  -> Q6
Q6: Is the volume or sensitivity high enough to trigger DPIA?
    ├── Yes -> Run the DPIA before launch.
    └── No  -> Proceed; log the assessment in your Article 30 register.

Each branch produces an artefact. That paper trail is what defends you in an investigation.
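
The tree above can be encoded as a small pre-flight function so that every new target produces the same recorded outcome. This is a sketch, not a compliance engine: the `TargetAssessment` fields are hypothetical, and the `transfer_mechanism_in_place` and `dpia_completed` flags are an added assumption so that a target can move past Q5 and Q6 once the required artefact exists.

```python
from dataclasses import dataclass


@dataclass
class TargetAssessment:
    """Answers gathered for one scraping target (hypothetical field names)."""
    has_eu_personal_data: bool         # Q1
    lawful_basis_documented: bool      # Q2
    fields_minimised: bool             # Q3
    privacy_notice_published: bool     # Q4
    data_leaves_eea: bool              # Q5
    transfer_mechanism_in_place: bool  # SCCs or adequacy decision confirmed
    dpia_triggered: bool               # Q6
    dpia_completed: bool


def next_action(a: TargetAssessment) -> str:
    """Walk Q1 to Q6 in order and return the first outstanding action."""
    if not a.has_eu_personal_data:
        return "GDPR likely not in scope. Document the assessment."
    if not a.lawful_basis_documented:
        return "Stop. Write the LIA first."
    if not a.fields_minimised:
        return "Trim the schema before launching."
    if not a.privacy_notice_published:
        return "Publish the notice; link it in your contact form."
    if a.data_leaves_eea and not a.transfer_mechanism_in_place:
        return "Confirm SCCs or adequacy decision are in place."
    if a.dpia_triggered and not a.dpia_completed:
        return "Run the DPIA before launch."
    return "Proceed; log the assessment in your Article 30 register."


target = TargetAssessment(
    has_eu_personal_data=True,
    lawful_basis_documented=True,
    fields_minimised=True,
    privacy_notice_published=False,
    data_leaves_eea=True,
    transfer_mechanism_in_place=False,
    dpia_triggered=False,
    dpia_completed=False,
)
print(next_action(target))
```

Storing the returned string alongside the target and a timestamp gives a simple version of the paper trail.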

Special cases: AI training data, journalism, and B2B

AI training is the highest-risk scraping use case in 2026. The European AI Act now layers obligations on top of GDPR for any training pipeline that touches personal data. Transparency about training data sources is becoming an audit expectation, not a nice-to-have. If you scrape to train a model, document each source, the lawful basis, the consent state, and the opt-out mechanism. Several large model providers were fined in 2025 for scraping personal data without an LIA.

Journalism and academic research enjoy a partial exemption under Article 85. Member states implement this differently, so a French journalist and a German academic operate under different practical rules even though both rely on the same Article 85 carve-out. The exemption is not a blanket immunity; it requires the processing to be genuinely for journalism or research purposes, not commercial repackaging.

B2B scraping is the gray zone where most commercial teams live. Contact data of identified individuals at companies (jane.doe@acmecorp.com plus a job title) is personal data, full stop. The professional context does not change that. Jurisdictions diverge on how strictly this is enforced (Germany strict, Spain in between, and the UK, which applies its own UK GDPR rather than the EU regulation, pragmatic), but the safe assumption for cross-border B2B scrapers is that contact-level processing requires an LIA plus an opt-out path.

Data subject rights and how to honour them

Articles 12 to 22 grant data subjects rights that survive scraping. The two that bite scrapers hardest are the right to erasure (Article 17) and the right to object (Article 21).

You must publish a clear way for any individual to request deletion of their data from your stored corpus. Email is acceptable. A public form is better. The response window is one month, extendable by two further months (three in total) for complex or numerous requests. You must verify the request (to prevent malicious deletion) but you cannot use verification as a delay tactic.
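
A request tracker needs both dates: the one-month deadline and the outer limit of three months in total if the extension is invoked. The sketch below uses plain calendar-month arithmetic, which is an assumption for illustration; how a deadline is counted in a specific case is a question for counsel. The helper names are hypothetical.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Same day-of-month N months later, clamped to the end of a shorter month."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def response_deadlines(received: date) -> tuple[date, date]:
    """Return (standard deadline, latest deadline if the extension is used)."""
    return add_months(received, 1), add_months(received, 3)


standard, extended = response_deadlines(date(2026, 1, 31))
print(standard, extended)
```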

Right to object is broader. Anyone can object to processing based on legitimate interest, and you must stop unless you can demonstrate compelling legitimate grounds that override their interests. In practice, most teams just delete and move on, because litigating each objection is more expensive than the marginal data point.

For internal policy guidance on how to operationalise these rights, see building an ethics-first scraping policy.

External references

For the canonical regulation text, the official source is EUR-Lex (Regulation (EU) 2016/679); gdpr.eu is a convenient but unofficial mirror. EDPB guidance lives in the guidelines library at edpb.europa.eu. For the Meta v. Bright Data ruling text and EU regulator commentary on it, the EDPB published a working note in late 2024 that summarises the cross-border implications.

Comparison: GDPR vs other major privacy regimes for scrapers

| Regime | Personal data definition | Public data carve-out | Right to erasure | Extraterritorial reach |
| --- | --- | --- | --- | --- |
| GDPR (EU) | Very broad | None | Yes (Article 17) | Yes |
| CCPA (California) | Broad, household level | Partial (publicly available) | Yes (limited) | Yes if doing business in CA |
| PDPA (Singapore) | Identifiable individual | Broader carve-out (publicly available) | Limited | Yes |
| DPDP (India) | Digital personal data | Limited carve-out | Yes (correction and erasure) | Yes |
| LGPD (Brazil) | Mirrors GDPR | None | Yes | Yes |

The pattern is clear: GDPR is the strictest, CCPA is the most enforcement-active, and the Asia-Pacific regimes are catching up fast. Build for GDPR and the rest follow.

FAQ

Is scraping public data legal under GDPR?
Public availability does not exempt data from GDPR. If a webpage is publicly readable but contains personal data, processing that data still requires a lawful basis. The European Data Protection Board has been explicit on this.

Do I need consent to scrape?
Consent is rarely workable for scraping because you cannot meaningfully obtain it from the data subject before collection. Most teams rely on legitimate interest (Article 6(1)(f)), which requires a documented Legitimate Interest Assessment.

What if the website’s terms forbid scraping?
Terms of service are a contract issue, not a GDPR issue. Even if scraping breaches terms, GDPR analysis is independent. You can have a clean GDPR position and still face a contract claim, or vice versa.

Does GDPR apply if I am outside the EU?
Yes, if the data subjects are in the EU or EEA. Article 3 extends the regulation extraterritorially. Hosting your servers offshore does not move the data outside scope.

What is the typical fine range in 2026?
For administrative fines, the range starts in the low six figures for first-time scraping breaches and reaches into the tens of millions for systematic, large-scale, or AI-training violations. The statutory cap is the higher of EUR 20 million or 4 percent of global annual turnover.
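
The statutory cap is simple arithmetic, and working it through shows where the percentage limb takes over. The function name below is illustrative; the figures come straight from the rule stated above.

```python
def statutory_cap_eur(global_annual_turnover_eur: int) -> float:
    """Higher of EUR 20 million or 4 percent of global annual turnover."""
    return max(20_000_000, global_annual_turnover_eur * 4 / 100)


# 4 percent of EUR 100 million is EUR 4 million, so the fixed EUR 20 million limb applies.
print(statutory_cap_eur(100_000_000))
# 4 percent of EUR 1 billion is EUR 40 million, so the turnover limb applies.
print(statutory_cap_eur(1_000_000_000))
```

The two limbs cross at EUR 500 million of turnover; above that figure the cap scales with the size of the business.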

Extended enforcement analysis 2024-2026

The pace of GDPR scraping enforcement accelerated sharply between 2024 and 2026. The Italian Garante’s Replika proceedings (a provisional restriction in February 2023, with a fine following in 2025) and its OpenAI fine (December 2024, EUR 15 million) both pivoted on the same lever: Article 6 lawful basis combined with Article 5(1)(a) transparency. Neither case turned on whether the data was technically public. Both turned on whether the scraper documented a Legitimate Interest Assessment that survived a balancing test, and whether data subjects had a realistic path to object.

The Hamburg Commissioner’s 2024 guidance on AI training data explicitly states that publicly accessible does not mean publicly licensable for retraining. The French CNIL’s 2025 sandbox guidance for generative AI training reaches the same conclusion. The Dutch DPA’s August 2025 enforcement note added a third leg, saying scrapers operating under legitimate interest must apply purpose limitation at the chunking and embedding stage, not just at ingest.

For a scraping operator the practical takeaway is that the LIA must now be a living document. Reviewers will expect to see at minimum the original LIA, an updated LIA every twelve months, evidence of the rights-honouring workflow firing in production, and proof that purpose limitation propagated downstream. A static one-page LIA from 2023 will not survive 2026 supervision.

Implementation patterns that pass scrutiny

A compliant 2026 scraping pipeline typically includes seven controls.

  1. A robots.txt and AI-crawler header check at fetch time, with the choice logged.
  2. A purpose tag attached to every record at ingest, propagated through embeddings and downstream tables.
  3. A retention TTL applied at the row level, enforced by a daily sweeper.
  4. A pseudonymisation pass that strips direct identifiers before vectorisation.
  5. A source URL and timestamp stored per record so deletion requests can be honoured by URL.
  6. A data subject request inbox with a measured median response time below seventy-two hours.
  7. A quarterly LIA review with a one-page diff against the prior version.
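
Control 1 can be sketched with Python's standard-library robots.txt parser. The example parses a robots.txt body passed in as a string so it runs without network access; in a real fetcher the body would come from the target site. The user-agent name is hypothetical, and the AI-crawler header check is left out because those headers are not standardised.

```python
import logging
from urllib.robotparser import RobotFileParser

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("fetch_policy")

USER_AGENT = "example-research-bot"  # hypothetical crawler name


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Check one URL against a robots.txt body and log the decision."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    decision = parser.can_fetch(user_agent, url)
    log.info("robots check url=%s agent=%s allowed=%s", url, user_agent, decision)
    return decision


SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /members/
"""

print(allowed_by_robots(SAMPLE_ROBOTS, "https://example.com/about"))
print(allowed_by_robots(SAMPLE_ROBOTS, "https://example.com/members/jane"))
```

Logging the decision on both branches matters more than the check itself: the log is what shows a reviewer that the choice was made deliberately at fetch time.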

These controls cost roughly two engineering weeks to set up and a quarter of one engineer ongoing. That is materially cheaper than a single regulator inquiry, which typically consumes four to six engineering weeks of unplanned work.
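
Controls 2, 3, and 5 fit naturally on a single record type. The sketch below keeps everything in memory to stay self-contained; the `ScrapedRecord` schema and the helper names are assumptions, and a production version would run the same logic against a database.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ScrapedRecord:
    source_url: str       # control 5: enables deletion by URL
    fetched_at: datetime  # control 5: timestamp per record
    purpose: str          # control 2: purpose tag set at ingest
    retention: timedelta  # control 3: row-level TTL
    payload: dict

    def expired(self, now: datetime) -> bool:
        return now >= self.fetched_at + self.retention


def derive(parent: ScrapedRecord, payload: dict) -> ScrapedRecord:
    """Derived rows (for example embeddings) inherit purpose, source, and TTL."""
    return ScrapedRecord(parent.source_url, parent.fetched_at,
                         parent.purpose, parent.retention, payload)


def sweep(store: list[ScrapedRecord], now: datetime) -> list[ScrapedRecord]:
    """Daily sweeper: keep only rows whose retention window is still open."""
    return [record for record in store if not record.expired(now)]


def delete_by_url(store: list[ScrapedRecord], url: str) -> list[ScrapedRecord]:
    """Honour a deletion request by removing every row from that source URL."""
    return [record for record in store if record.source_url != url]


fetched = datetime(2026, 1, 1, tzinfo=timezone.utc)
page = ScrapedRecord("https://example.com/p/1", fetched, "market-research",
                     timedelta(days=90), {"title": "Example listing"})
embedding = derive(page, {"vector": [0.1, 0.2]})
store = [page, embedding]
print(len(sweep(store, fetched + timedelta(days=30))))
```

Because derived rows copy the parent's source URL and TTL, one deletion request or one sweeper run removes the original and everything built from it.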

Cross-jurisdiction comparison expanded

| Question | EU GDPR | UK GDPR | California CCPA | Singapore PDPA | India DPDP |
| --- | --- | --- | --- | --- | --- |
| Public data exempt? | No | No | Partial (publicly available carve-out) | No (still personal data) | No |
| Lawful basis required? | Yes (six options) | Yes (six options) | Notice plus opt-out | Consent or deemed consent | Consent or legitimate uses |
| Right to object | Yes | Yes | Right to opt-out of sale or share | Withdrawal of consent | Yes |
| Max fine | EUR 20M or 4 percent | GBP 17.5M or 4 percent | USD 7,500 per intentional violation | SGD 1M | INR 250 crore |
| Cross-border transfer rules | SCCs, adequacy | UK SCCs, adequacy | Limited | Transfer limitation obligation | Notified countries only |

Additional FAQ

Does pseudonymisation remove GDPR scope?
No. Article 4(5) defines pseudonymisation, and Recital 26 confirms that pseudonymised data remains personal data because re-identification is still possible with additional information. Anonymisation in the strict sense (irreversible) does remove scope, but most scraping pipelines do not achieve true anonymisation.
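
A keyed hash is a common pseudonymisation technique, and it shows why the output stays in scope: anyone holding the key can recompute the token for a known identifier and re-link the records. This is a sketch under assumptions; the key value and function name are placeholders, and key management is out of scope here.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-secret"  # placeholder, not a real key


def pseudonymise(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace a direct identifier with a keyed HMAC-SHA256 token."""
    normalised = identifier.strip().lower().encode("utf-8")
    return hmac.new(key, normalised, hashlib.sha256).hexdigest()


token = pseudonymise("jane.doe@acmecorp.com")

# Re-identification: the key holder recomputes the token for a known email
# and finds the same value in the dataset.
relinked = pseudonymise(" Jane.Doe@AcmeCorp.com ") == token
print(relinked)
```

The token is stable across records, which keeps joins working, but that same stability is what makes the data pseudonymous rather than anonymous.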

What about scraping that runs entirely outside the EU?
Article 3(2) extraterritoriality means the scraper is in scope if it targets EU data subjects or monitors their behaviour. Hosting infrastructure outside the EU does not by itself remove scope.

Can I rely on contract as a lawful basis instead of legitimate interest?
Only when a contract with the data subject genuinely requires the scrape. For most third-party scraping there is no contract with the data subject, so Article 6(1)(b) does not apply. Legitimate interest under Article 6(1)(f) is the usual basis.

What does proportionate mean in the LIA balancing test?
Proportionate means the scrape collects only what is necessary, runs at a frequency justified by the purpose, applies de-identification where possible, and stops when the purpose is met. Indefinite retention rarely passes the test.

Practical lawful basis selection

The choice of GDPR lawful basis sits at the centre of every scraping decision. Six bases are listed in Article 6, but only three are even worth testing for a scraping operator, and two of those usually fall away. Consent under Article 6(1)(a) requires affirmative action by the data subject, which is rarely obtainable for third-party scraping. Contract under Article 6(1)(b) requires a contract with the data subject, which scrapers typically do not have. Legitimate interest under Article 6(1)(f) is the workhorse, requiring a documented assessment that weighs the legitimate interest of the controller against the rights and freedoms of the data subject.

Public interest under Article 6(1)(e) is occasionally relevant for journalism, academic research, or specific regulatory functions. Vital interest under Article 6(1)(d) almost never applies to scraping. Legal obligation under Article 6(1)(c) applies when a specific law requires the processing, which is unusual for commercial scraping.

The Legitimate Interest Assessment is therefore the document that determines whether a scrape is lawful. A 2026-quality LIA includes a stated purpose, an identification of the legitimate interest, a necessity test, a balancing test against data subject rights, a description of safeguards, and a conclusion. Each section should be specific to the scrape, not boilerplate. Regulator inquiries routinely call out boilerplate LIAs as evidence of bad faith.

What changes when special category data is involved

Article 9 of the GDPR creates a separate regime for special category data: racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, genetic data, biometric data used to identify a person, health data, and data concerning a person’s sex life or sexual orientation. Special category data may not be processed at all unless one of the ten exceptions in Article 9(2) applies.

For scrapers the practical implication is that scraping that picks up special category data inadvertently still falls under Article 9. Indirect inference (for example name plus location plus organisation suggesting religious affiliation) can also trigger Article 9. The recommended posture is to detect and exclude special category signals at ingest, with a documented filter and a periodic audit.
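
A documented ingest filter can start as a simple term screen. The sketch below is deliberately naive: the term lists are illustrative and far from complete, keyword matching will miss indirect inference entirely, and a real filter would need tuning per language and source. It shows the shape of the control, not a sufficient implementation.

```python
import re

# Illustrative and incomplete term lists, keyed by Article 9 category.
SPECIAL_CATEGORY_TERMS = {
    "health": ["diagnosis", "diagnosed", "chemotherapy", "hiv"],
    "religion": ["church", "mosque", "synagogue"],
    "trade_union": ["trade union", "union member"],
    "politics": ["party member", "voted for"],
}


def special_category_hits(text: str) -> set[str]:
    """Return the categories whose terms appear as whole words in the text."""
    lowered = text.lower()
    hits = set()
    for category, terms in SPECIAL_CATEGORY_TERMS.items():
        if any(re.search(rf"\b{re.escape(term)}\b", lowered) for term in terms):
            hits.add(category)
    return hits


def admit(record: dict) -> bool:
    """Admit a record to storage only if the screen finds nothing."""
    return not special_category_hits(record.get("text", ""))


print(admit({"text": "Quarterly revenue grew by ten percent."}))
print(sorted(special_category_hits("She was diagnosed last year and is a trade union rep.")))
```

Counting the returned categories for rejected records, without storing the text, gives the periodic audit something to review.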

The Italian Garante’s 2024 investigation of OpenAI explicitly cited the scraping of special category data as one factor in the eventual fine. The takeaway is that scrapers cannot rely on the data being public to escape Article 9. The lawful basis must come from one of the Article 9(2) exceptions, none of which neatly fits commercial scraping.

Documenting the right to object

Article 21 gives data subjects the right to object at any time to processing based on legitimate interest. The controller must stop processing unless it can demonstrate compelling legitimate grounds that override the rights and freedoms of the data subject, or processing is for the establishment, exercise, or defence of legal claims.

A scraper relying on Article 6(1)(f) must therefore have an objection workflow. The 2026 best practice is a public-facing form that accepts an identifier (name, email, profile URL) and routes the request to a queue with a measured response time. The response should confirm that processing has stopped or explain why a compelling legitimate ground continues to apply. The latter response is rare and should be reviewed by counsel before sending.

The objection workflow must operate in addition to the Article 17 right to erasure. The two rights are related but distinct. Erasure is a request to delete the data. Objection is a request to stop the processing. A controller may need to respond to both for the same individual.
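
The difference between the two rights shows up clearly in code: erasure removes the stored data, while objection marks the individual so that processing stops. The sketch below follows that distinction as described above. The `Corpus` class and its suppression set are hypothetical, and whether an objection should also trigger deletion in a given case is a question for counsel.

```python
from dataclasses import dataclass, field
from enum import Enum


class RequestType(Enum):
    ERASURE = "erasure"      # Article 17: delete the data
    OBJECTION = "objection"  # Article 21: stop the processing


@dataclass
class Corpus:
    records: dict[str, dict] = field(default_factory=dict)  # identifier -> stored data
    suppressed: set[str] = field(default_factory=set)       # identifiers excluded from processing

    def handle(self, identifier: str, kind: RequestType) -> str:
        """Apply one data subject request and return the outcome to report back."""
        if kind is RequestType.ERASURE:
            self.records.pop(identifier, None)
            return "erased"
        self.suppressed.add(identifier)
        return "processing stopped"

    def may_process(self, identifier: str) -> bool:
        return identifier in self.records and identifier not in self.suppressed


corpus = Corpus(records={
    "a@example.com": {"title": "CTO"},
    "b@example.com": {"title": "CFO"},
})
print(corpus.handle("a@example.com", RequestType.OBJECTION))
print(corpus.handle("b@example.com", RequestType.ERASURE))
print(corpus.may_process("a@example.com"), corpus.may_process("b@example.com"))
```

Every downstream job would call `may_process` before touching a record, so a single objection takes effect across the whole pipeline.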

Next steps

The fastest way to move from exposure to compliance is to write the LIA first, publish the privacy notice second, and stand up the right-to-erasure inbox third. Everything else (DPIA, Article 30 register, processor contracts) builds on top of those three. For broader policy and team-rollout guidance, head to the DRT compliance and ethics hub and start with the ethics-first policy guide.

This guide is informational, not legal advice.
