Personal data vs public data in scraping: a 2026 framework

Whether scraped data is personal or merely public is the single most consequential classification a scraping team makes about each target. The distinction governs which compliance regime applies, which legal defences are available, what your storage and retention obligations look like, and how you respond to deletion requests. In 2026 the line is more contested than ever: regulators have consistently held that public availability does not exempt data from privacy regulation, while courts in some jurisdictions have held that scraping public pages is broadly permissible under contract and computer-misuse law. This guide walks through the framework, the regulator and court positions, a working classification matrix, and a workflow your team can operationalise.

The audience is the technical lead, in-house counsel, or product owner who needs to make defensible classification decisions about every scrape target.

Why the classification matters operationally

Each target you classify as personal data triggers obligations: lawful basis documentation under GDPR, opt-out mechanisms under CCPA, consent under PDPA, deletion-on-request under DPDP. Each target you classify as non-personal-public data triggers far fewer obligations: contract analysis, copyright analysis, robots.txt courtesy.

Misclassification is the most common compliance failure. Teams routinely classify data as “public” because the URL was readable while logged out, when in fact the data identifies natural persons. Others over-classify, treating every dataset as personal and adding overhead that is not legally required.

A working classification matrix is the single highest-leverage compliance artefact a scraping team builds. It pays for itself the first time a regulator asks “what is your basis for processing?”

For the broader compliance picture, see the GDPR compliance guide, the CCPA compliance guide, and the HiQ Labs ruling explainer.

How major regimes define personal data

The definitions converge but the edges diverge. The pattern is “any data linkable to a natural person, with edge-case carve-outs.”

| Regime | Core definition | Key edge case |
| --- | --- | --- |
| GDPR (EU) | Information relating to identified or identifiable natural person | IP addresses, cookies count |
| UK GDPR | Same as EU | Same |
| CCPA / CPRA (California) | Information that identifies, relates to, describes, or could reasonably be linked to a consumer or household | Household level included |
| PDPA (Singapore) | Data about an identified individual or identifiable from data | Public availability carve-out broader |
| DPDP (India) | Digital personal data | Limited public availability carve-out |
| LGPD (Brazil) | Information relating to identified or identifiable natural person | Mirrors GDPR |
| PIPL (China) | Information related to identified or identifiable natural persons | Sensitive data category strict |
| POPIA (South Africa) | Information relating to identifiable living natural person, or existing juristic person | Includes some company data |

The two outliers worth flagging: PIPEDA in Canada follows GDPR-style definitions but adds reasonable-purpose tests, and APPI in Japan distinguishes “personal information” from “personal data” with technical processing requirements that catch many scrapers off guard.

The “public” question regulators consistently reject

Across nearly every regime, regulators have taken the position that public availability does not exempt data from the regulation. The clearest statements:

The European Data Protection Board's Opinion 28/2024, adopted in December 2024, makes clear that processing publicly available personal data, including data collected by scraping, still requires a lawful basis under GDPR. The opinion was requested by the Irish supervisory authority and addresses, among other things, AI developers' reliance on legitimate interest for web-scraped training data.

The California Privacy Protection Agency has issued enforcement guidance reading the “publicly available information” carve-out narrowly, particularly excluding inferences drawn from public data.

The Singapore PDPC has guidance distinguishing “publicly available” (where collection is permitted without consent) from “publicly accessible” (where consent rules still apply). The two terms are routinely conflated in commercial discussions but the PDPC reads them strictly.

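India's DPDP Act 2023, which the 2025 rules implement, carries a narrow carve-out for personal data made publicly available by the data principal, or by another person under a legal obligation to publish it. On that wording, data that became public through breach, leak, or third-party aggregation falls outside the carve-out.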

The pattern: scrapers who argue “public, therefore exempt” lose the argument with regulators. The legal posture must be “personal, with documented lawful basis” or “not personal at all.”

Classification matrix: what counts as personal

| Data element | Personal under GDPR | Personal under CCPA | Personal under PDPA |
| --- | --- | --- | --- |
| Full name | Yes | Yes | Yes |
| Email address | Yes | Yes | Yes |
| Phone number | Yes | Yes | Yes |
| IP address | Yes | Yes | Likely yes |
| Cookie identifier | Yes | Yes | Likely yes |
| Device fingerprint | Yes | Yes | Yes |
| Geolocation (precise) | Yes | Yes (sensitive) | Yes |
| Geolocation (city level) | Likely yes | Yes | Conditional |
| Job title (anonymised) | No | No | No |
| Job title + company name | Likely yes | Yes | Yes |
| Username (re-identifiable) | Yes | Yes | Yes |
| Username (truly anonymous) | No | Conditional | No |
| Profile photo | Yes (biometric inference) | Yes | Yes |
| Forum post (pseudonymous) | Yes | Yes | Yes |
| Forum post (anonymous) | No | Conditional | No |
| Aggregated counts (no individual) | No | No | No |
| Company name only | No | No | No |
| Company financials (public filings) | No | No | No |
| Product price | No | No | No |
| Product review text (with username) | Yes | Yes | Yes |
| Product review text (anonymised) | No | Conditional | No |

Where a cell says “conditional” or “likely yes,” the classification depends on context, combination with other fields, and the realistic re-identification risk. The conservative move is to treat the field as personal and document the assessment.
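The conservative rule can be encoded directly. The following is a minimal sketch, assuming a hypothetical `MATRIX` dictionary that transcribes a handful of rows from the matrix; the field keys and the helper names (`verdict`, `treat_as_personal`, `needs_documented_assessment`) are illustrative and not part of any library.

```python
# Hypothetical transcription of a few rows of the matrix; the field keys
# are illustrative names, not a standard vocabulary.
MATRIX = {
    "full_name":          {"GDPR": "yes",        "CCPA": "yes",         "PDPA": "yes"},
    "ip_address":         {"GDPR": "yes",        "CCPA": "yes",         "PDPA": "likely yes"},
    "geolocation_city":   {"GDPR": "likely yes", "CCPA": "yes",         "PDPA": "conditional"},
    "username_anonymous": {"GDPR": "no",         "CCPA": "conditional", "PDPA": "no"},
    "product_price":      {"GDPR": "no",         "CCPA": "no",          "PDPA": "no"},
}


def verdict(field, regime):
    """Return the matrix cell, or None if the field or regime is not listed."""
    return MATRIX.get(field, {}).get(regime)


def treat_as_personal(field, regime):
    """Conservative reading: only a clear 'no' counts as non-personal.

    Unlisted fields default to personal until someone assesses them.
    """
    return verdict(field, regime) != "no"


def needs_documented_assessment(field, regime):
    """Flag the context-dependent cells that need a written assessment."""
    return verdict(field, regime) in ("conditional", "likely yes", None)


print(treat_as_personal("geolocation_city", "PDPA"))      # conditional cell
print(treat_as_personal("product_price", "GDPR"))         # clear 'no' cell
print(needs_documented_assessment("ip_address", "PDPA"))  # 'likely yes' cell
```

Defaulting unlisted fields to personal is a design choice in this sketch: a new column added to a pipeline stays under the stricter treatment until a person has looked at it.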

The combination problem

A single field can be non-personal in isolation and personal in combination. This is the hardest part of the classification.

Consider a scrape target that returns three columns: company name, job title, year of joining. None of the three is personal data on its own. Combined, they may identify a single individual at a small company. GDPR Recital 26 frames the test as “the means reasonably likely to be used” to identify a person, and combination with other available datasets counts.

The practical workflow: enumerate every field you collect, run a re-identification test against your most likely combinations, and treat the resulting dataset as personal if any combination identifies an individual. The enumeration is one-time work per pipeline; the test pays off forever.
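One way to approximate the re-identification test is to count how many records are singled out by each combination of fields. The sketch below is illustrative only: `unique_combinations` is a hypothetical helper and the sample records are invented. It checks uniqueness inside your own dataset, so it gives a lower bound on risk; it does not model linkage against external datasets.

```python
from collections import Counter
from itertools import combinations


def unique_combinations(records, fields, max_size=3):
    """Map each field combination to the number of records it singles out.

    A record is singled out when no other record shares its values for
    every field in the combination. Combinations that single out nobody
    are left out of the result.
    """
    findings = {}
    for size in range(1, max_size + 1):
        for combo in combinations(fields, size):
            counts = Counter(tuple(r.get(f) for f in combo) for r in records)
            singled_out = sum(1 for n in counts.values() if n == 1)
            if singled_out:
                findings[combo] = singled_out
    return findings


# Invented sample: no single field is unique, but title + year of joining is.
records = [
    {"company": "Acme", "title": "Engineer", "joined": 2019},
    {"company": "Acme", "title": "Engineer", "joined": 2021},
    {"company": "Acme", "title": "Analyst", "joined": 2019},
    {"company": "Acme", "title": "Analyst", "joined": 2021},
]
findings = unique_combinations(records, ["company", "title", "joined"])
for combo, count in findings.items():
    print(combo, count)
```

Any non-empty result is a signal to treat the dataset as personal, in line with the workflow above.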

For the deeper compliance overlay on this question, see the GDPR compliance guide.

Decision tree for classifying a new scrape target

Q1: Does the dataset contain any direct identifier (name, email, phone, ID)?
    ├── Yes -> Personal data. Apply full compliance regime.
    └── No  -> Q2
Q2: Does the dataset contain any quasi-identifier (location, job, age, gender)?
    ├── No  -> Q4
    └── Yes -> Q3
Q3: Can a realistic combination of fields identify an individual?
    ├── Yes -> Personal data. Apply full compliance regime.
    └── No  -> Q4
Q4: Does the dataset contain inference targets (photos, biometric data)?
    ├── Yes -> Personal data. Sensitive category likely.
    └── No  -> Q5
Q5: Is the data linkable to other data you hold or could acquire?
    ├── Yes -> Personal data. Apply full compliance regime.
    └── No  -> Non-personal. Document and proceed.

Every “yes” other than the one at Q2 ends in personal-data treatment. The cost of over-classifying once is small; the cost of under-classifying is regulator action.
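The tree is small enough to express as a function. The sketch below assumes a hypothetical `TargetProfile` whose five booleans are answered by a person for each target; the names and the return labels are illustrative. In this sketch the Q4 and Q5 checks run even when a dataset has no quasi-identifiers, a deliberately conservative assumption so that photo-only datasets are still caught.

```python
from dataclasses import dataclass


@dataclass
class TargetProfile:
    """Answers to the five questions, filled in per scrape target."""
    has_direct_identifier: bool    # Q1: name, email, phone, ID
    has_quasi_identifier: bool     # Q2: location, job, age, gender
    combination_identifies: bool   # Q3: realistic combination singles someone out
    has_inference_target: bool     # Q4: photos, biometric data
    linkable_to_other_data: bool   # Q5: linkable to data held or obtainable


def classify_target(profile):
    """Walk the questions in order and return a classification label."""
    if profile.has_direct_identifier:
        return "personal"
    if profile.has_quasi_identifier and profile.combination_identifies:
        return "personal"
    if profile.has_inference_target:
        return "personal (sensitive category likely)"
    if profile.linkable_to_other_data:
        return "personal"
    return "non-personal"


price_feed = TargetProfile(False, False, False, False, False)
photo_gallery = TargetProfile(False, False, False, True, False)
print(classify_target(price_feed))
print(classify_target(photo_gallery))
```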

Pseudonymisation and anonymisation as compliance levers

Pseudonymisation (replacing identifiers with codes while retaining the linkage table) does not remove data from GDPR scope but does reduce risk and unlock several allowances. Anonymisation (irreversible removal of all identifiers) does remove data from GDPR scope, but the bar for “irreversible” is very high.

The standard reference is the Article 29 Working Party's Opinion 05/2014 on anonymisation techniques, which lists three risks a dataset must withstand to be considered anonymised: singling out (no individual can be isolated in the dataset), linkability (records concerning the same individual cannot be linked, within or across datasets), and inference (no attribute can be deduced about an individual with significant probability).

Most scraped datasets fail at least one test. A dataset of forum posts with usernames stripped still leaves writing-style fingerprints that can be re-identified. A dataset of product reviews with location stripped still leaves time-of-purchase patterns. Treat anonymisation claims sceptically and engineer accordingly.

For practical pseudonymisation patterns, see building an ethics-first scraping policy.
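As a concrete illustration, one common pseudonymisation technique is a keyed hash. Note that this is a variant of the linkage-table approach described above, not the same thing: the secret key plays the role of the table, and anyone holding it can re-link a known identifier to its code. The sketch uses Python's standard `hmac` and `hashlib` modules; `pseudonymise` and the placeholder key are illustrative.

```python
import hashlib
import hmac


def pseudonymise(identifier, secret_key):
    """Replace an identifier with a keyed SHA-256 code.

    The same identifier and key always give the same code, so records stay
    linkable to each other without the raw value being stored.
    """
    normalised = identifier.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalised, hashlib.sha256).hexdigest()


# Placeholder key for the example; a real key belongs in a secrets manager.
key = b"example-key-not-for-production"

a = pseudonymise("Jane.Doe@example.com", key)
b = pseudonymise(" jane.doe@example.com ", key)
print(a == b)   # normalisation makes both spellings map to one code
print(len(a))   # hex digest length
```

The output is still personal data: it remains linkable to an individual for anyone who holds the key, which is exactly why pseudonymised data stays in GDPR scope.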

A working classification register

Each scraping pipeline should maintain a classification register with the following columns:

| Column | Purpose |
| --- | --- |
| Pipeline ID | Internal identifier |
| Source URL pattern | What you scrape |
| Fields collected | Every column in your storage schema |
| Personal data classification (per regime) | Yes/no/conditional per GDPR, CCPA, PDPA, DPDP |
| Lawful basis (per regime) | LIA reference, consent, contract |
| Retention period | Days/months |
| Deletion mechanism | Process and contact |
| Last reviewed | Date and reviewer |

The register doubles as the core of your Article 30 record of processing under GDPR and of the comparable record under most other regimes. Build it once, review it monthly, and you have most of your compliance documentation.
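If the register lives in code rather than a spreadsheet, one entry could look like the sketch below. The `RegisterEntry` dataclass, its field names, and the sample values are all hypothetical; the 31-day review window is an assumption derived from the monthly cadence mentioned above.

```python
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class RegisterEntry:
    """One row of the classification register, one per pipeline."""
    pipeline_id: str
    source_url_pattern: str
    fields_collected: list
    personal_classification: dict  # regime -> "yes" / "no" / "conditional"
    lawful_basis: dict             # regime -> LIA reference, consent, contract
    retention_days: int
    deletion_mechanism: str
    last_reviewed: date
    reviewer: str

    def is_review_overdue(self, today, max_age_days=31):
        """True when the last review is older than the review window."""
        return (today - self.last_reviewed).days > max_age_days


entry = RegisterEntry(
    pipeline_id="jobs-board-01",
    source_url_pattern="https://jobs.example.com/profiles/*",
    fields_collected=["name", "title", "company", "city"],
    personal_classification={"GDPR": "yes", "CCPA": "yes", "PDPA": "yes", "DPDP": "yes"},
    lawful_basis={"GDPR": "LIA-2026-004"},
    retention_days=365,
    deletion_mechanism="Web form, handled by the privacy team",
    last_reviewed=date(2026, 1, 10),
    reviewer="privacy team",
)
print(entry.is_review_overdue(date(2026, 3, 1)))
print(sorted(asdict(entry)))
```

`asdict` makes it straightforward to export the register as JSON or CSV when someone outside the engineering team needs to read it.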

External references

The European Data Protection Board’s opinions on scraping and AI training are at edpb.europa.eu/our-work-tools/our-documents. The CPPA enforcement guidance archive is at cppa.ca.gov. The Singapore PDPC advisory on publicly available data is at pdpc.gov.sg.

Comparison: classification regimes side by side

| Regime | Public-data carve-out | Inferred data treatment | Pseudonymisation effect |
| --- | --- | --- | --- |
| GDPR | Narrow; lawful basis still required | Personal | Reduces risk; still personal |
| CCPA | Moderate; requires lawful availability | Personal if linkable | Reduces risk; still personal |
| PDPA Singapore | Broader; permits collection without consent | Conditional | Reduces risk; still personal |
| DPDP India | Limited; requires data principal action | Personal | Reduces risk; still personal |
| PIPL China | Very narrow; consent default | Personal | Reduces risk; still personal |
| LGPD Brazil | Mirrors GDPR | Personal | Reduces risk; still personal |

Build for GDPR and you have most of the surface covered. The Singapore carve-out can ease specific use cases but does not extend to AI training or large-scale aggregation.

A worked example: scraping a public job board

Suppose you scrape a public job board for talent intelligence. The fields you collect include candidate name (where displayed), current job title, current company, years of experience, location (city), skills tags, and a profile photo URL.

Classification: every field, including the optional photo URL, contributes to identification of an individual. The combination is unambiguously personal data under all major regimes.

Lawful basis: legitimate interest, with a Legitimate Interest Assessment documenting the purpose (talent intelligence for B2B customers), the necessity (no aggregated alternative meets the use case), and the balancing test (candidates publishing on a public board reasonably expect aggregation by recruiters and intelligence vendors; the balance leans towards processing, but with mitigations).

Mitigations: data minimisation (skip photo URL unless specifically needed), retention limits (purge after 12 months), opt-out mechanism (publish a deletion form), and access controls (no resale of identifiable records to third parties).

Outcome: defensible posture, documented in the register, with the LIA on file and the privacy notice published. A regulator inquiry can be answered in a single email with attachments.
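The mitigations translate into a small filter applied before storage and again on a schedule. This is a sketch under stated assumptions: the field names (`candidate_id`, `collected_at`, `profile_photo_url`) are hypothetical, the 365-day limit stands in for the 12-month purge, and the opt-out list is assumed to come from the deletion form.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=365)         # stands in for "purge after 12 months"
DROPPED_FIELDS = {"profile_photo_url"}  # data minimisation: skip the photo URL


def minimise(record):
    """Drop fields the use case does not need before anything is stored."""
    return {k: v for k, v in record.items() if k not in DROPPED_FIELDS}


def is_expired(record, now):
    """True once a record is older than the retention limit."""
    return now - record["collected_at"] > RETENTION


def apply_mitigations(records, now, opted_out_ids):
    """Keep only minimised records that are in retention and not opted out."""
    return [
        minimise(r)
        for r in records
        if r["candidate_id"] not in opted_out_ids and not is_expired(r, now)
    ]


now = datetime(2026, 6, 1, tzinfo=timezone.utc)
records = [
    {"candidate_id": "c1", "title": "Engineer",
     "profile_photo_url": "https://example.com/p/1.jpg",
     "collected_at": datetime(2026, 1, 1, tzinfo=timezone.utc)},
    {"candidate_id": "c2", "title": "Analyst",      # collected over a year ago
     "collected_at": datetime(2025, 1, 1, tzinfo=timezone.utc)},
    {"candidate_id": "c3", "title": "Designer",     # submitted the deletion form
     "collected_at": datetime(2026, 5, 1, tzinfo=timezone.utc)},
]
kept = apply_mitigations(records, now, opted_out_ids={"c3"})
print([r["candidate_id"] for r in kept])
```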

For the parallel discussion of how this works for ecommerce competitor data (different classification, different regime), see the personal vs public data framework’s worked retail example.

FAQ

Is data on a public website automatically not personal?
No. Public availability does not change whether data is personal. If the data identifies a natural person, it is personal data, full stop.

Does anonymisation move data out of GDPR scope?
True anonymisation does, but the bar is very high. Most scraped data that engineers call “anonymised” is in fact pseudonymised under EU law and remains in scope.

What if I only collect aggregated counts?
Aggregated counts that do not relate to identifiable individuals are not personal data. The aggregation must be irreversible and the cell size large enough to prevent inference.

How does business-to-business contact data classify?
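A name plus a corporate email or job title is personal data under GDPR and most other major regimes; professional context does not stop the data relating to an identifiable person. The main exception to check is Singapore's PDPA, which excludes business contact information from most of its obligations.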

What about AI-generated synthetic data?
If the synthetic data was trained on personal data and can re-identify individuals from the training set, regulators are increasingly treating it as personal data with respect to those individuals. The case law is still developing but the conservative posture is to apply the regime.

Extended definitional analysis

The personal-versus-public distinction is one of the most misunderstood concepts in scraping law. The clarifying frame is that personal and public are independent dimensions, not opposites.

A piece of information can be:
– Personal and public (a LinkedIn profile, a public Twitter post by a named individual).
– Personal and private (a medical record, a salary slip).
– Non-personal and public (weather data, public transit schedules).
– Non-personal and private (an internal company memo without identifiers).

Privacy law (GDPR, CCPA, PDPA, DPDP) regulates the personal axis regardless of public-ness. Most regimes treat publicly available personal data as still personal data subject to most rules. The carve-outs for publicly available information in the CCPA, Singapore's PDPA, and India's DPDP are partial exceptions, and each is narrower than commonly assumed.

This creates the most common scraping mistake. A team scrapes public LinkedIn profiles and assumes the public-ness removes privacy obligations. It does not. The data is still personal data under GDPR Article 4(1) and PDPA Section 2. The right to object, the right to erasure, the lawful-basis requirement, and the transparency obligations all apply.

Implementation patterns for the distinction

A 2026 scraping pipeline should classify every record on both axes at ingest.

  1. Tag personal data using a deterministic detector for direct identifiers (name, email, phone, address, government ID).
  2. Tag indirect identifiers separately (employer, role, location to city precision, photo).
  3. Apply the most restrictive regime applicable to each record tagged as personal.
  4. Apply different retention TTLs to the two classes.
  5. Maintain a per-record source URL so deletion requests can find the records.

Code pattern: dual classification at ingest

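import re

# Deterministic detectors for direct identifiers. They are deliberately
# simple: names are not detected, and any 10-digit run looks like a US phone.
PERSONAL_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone_us": re.compile(r"(?<!\d)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Placeholder lookup so the example runs; the jurisdiction codes and the
# GDPR fallback are assumptions, not legal determinations.
REGIME_BY_JURISDICTION = {"EU": "GDPR", "US-CA": "CCPA", "SG": "PDPA", "IN": "DPDP"}

def pick_regime(is_personal, jurisdiction):
    # Non-personal records carry no privacy regime.
    if not is_personal:
        return None
    return REGIME_BY_JURISDICTION.get(jurisdiction, "GDPR")

def classify_record(record):
    # Classify on both axes independently: personal or not, public or not.
    text = record.get("text") or ""
    is_personal = any(p.search(text) for p in PERSONAL_PATTERNS.values())
    is_public = record.get("source_visibility") == "public"
    return {
        "personal": is_personal,
        "public": is_public,
        "regime": pick_regime(is_personal, record.get("jurisdiction")),
    }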
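The pattern above covers steps 1 and 3. Steps 2 and 5 can be sketched in the same spirit. Everything here is illustrative: `INDIRECT_FIELDS` assumes your schema uses those column names, and `SourceIndex` is a hypothetical in-memory stand-in for whatever store maps source URLs to record IDs.

```python
from collections import defaultdict

# Hypothetical schema names for the indirect identifiers listed in step 2.
INDIRECT_FIELDS = {"employer", "role", "city", "photo_url"}


def tag_indirect(record):
    """Return the indirect-identifier fields that are actually populated."""
    return sorted(f for f in INDIRECT_FIELDS if record.get(f))


class SourceIndex:
    """Map each source URL to its record IDs so deletion requests can find them."""

    def __init__(self):
        self._by_url = defaultdict(set)

    def add(self, record_id, source_url):
        self._by_url[source_url].add(record_id)

    def records_for(self, source_url):
        """Record IDs collected from this URL (empty set if none)."""
        return set(self._by_url.get(source_url, ()))

    def delete(self, source_url):
        """Forget the URL and return the record IDs that must be purged."""
        return self._by_url.pop(source_url, set())


index = SourceIndex()
index.add("r1", "https://example.com/u/1")
index.add("r2", "https://example.com/u/1")
print(tag_indirect({"employer": "Acme", "role": "Engineer", "city": ""}))
print(sorted(index.records_for("https://example.com/u/1")))
```

Keeping the indirect-identifier tags separate from the direct-identifier flag lets the combination test run on exactly the fields that need it.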

Comparison: how regimes treat the four quadrants

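| Quadrant | EU GDPR | California CCPA | Singapore PDPA | India DPDP |
| --- | --- | --- | --- | --- |
| Personal and public | Full scope | Partial carve-out | In scope; consent exception for publicly available data | In scope unless made public by the data principal |
| Personal and private | Full scope | Full scope | Full scope | Full scope |
| Non-personal and public | Out of scope | Out of scope | Out of scope | Out of scope |
| Non-personal and private | Out of scope | Out of scope | Out of scope | Out of scope |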

Additional FAQ

Is a public Twitter post personal data?
Yes, if it identifies or relates to an identifiable individual. Most named accounts qualify.

What about pseudonymous accounts?
Pseudonymous accounts can still be personal data if re-identification is reasonably possible.

Is corporate information personal data?
A company name is not personal data. An individual employee’s name and title is personal data even though it relates to corporate context.

Does aggregation remove the personal classification?
Statistical aggregates that prevent re-identification can be non-personal. Most scraping aggregates do not meet the strict re-identification threshold.

Real cases that turned on the personal-vs-public distinction

Three recent regulatory and court actions illustrate how the distinction operates in practice.

In the CNIL's enforcement against Clearview AI (France), the CNIL fined Clearview AI 20 million euros in October 2022 for scraping publicly accessible facial images and constructing a biometric database, then added a 5.2 million euro penalty in 2023 for failing to comply with the order. Clearview argued the source images were public. The CNIL rejected the argument, holding that public availability does not take the processing of biometric data outside the GDPR. The decision is widely cited in Europe for “public does not mean exempt” in the AI training context.

In the Garante's decision against OpenAI (Italy, December 2024), the Italian DPA fined OpenAI 15 million euros for processing publicly scraped personal data without an adequate lawful basis under GDPR Article 6, and for failing to provide transparency to data subjects under Article 14. OpenAI argued legitimate interest. The Garante held that the balancing test failed because the data subjects had no reasonable expectation of being included in a generative model’s training corpus. The decision became influential across other EU DPAs as the template for AI training enforcement.

In a Hangzhou Internet Court case involving a Douyin scraper (China, 2024), a Chinese commercial scraping operation collected publicly visible Douyin user names, follower counts, and post metrics for a competitive intelligence product. The court applied PIPL strictly and held that even publicly displayed data about identified natural persons required consent or a statutory basis, and the commercial intelligence purpose did not qualify. The case is the leading PIPL precedent on commercial scraping of publicly visible personal data.

The pattern across all three: regulators and courts consistently treat “public” as orthogonal to “personal,” and a scraping operation that conflates the two faces material liability. The defensible posture is to treat publicly visible personal data as personal data with a documented lawful basis, not as exempt by virtue of public visibility.

How regulators apply the personal-versus-public distinction

Regulators across major jurisdictions have repeatedly clarified that publicly available personal data remains personal data under the relevant privacy statute. The European Data Protection Board, the UK Information Commissioner’s Office, the Italian Garante, the French CNIL, the Singapore Personal Data Protection Commission, and the California Privacy Protection Agency have each issued guidance to that effect during 2023-2026.

The clearest articulation is in the EDPB’s 2024 guidance on scraping for AI training. The guidance states that the public availability of data does not by itself create a lawful basis for processing under GDPR. The controller must still identify a lawful basis under Article 6 and, where special category data is involved, under Article 9. The same conclusion applies to retention, transparency, and rights handling.

The CCPA carve-out for publicly available information is a partial exception, but the CPPA has consistently interpreted it narrowly. Information lawfully made available from federal, state, or local government records is the core. Information that a business has a reasonable basis to believe the consumer lawfully made available to the general public is also covered, but commercial platforms with terms of service restricting bulk access do not satisfy the latter prong.

The reasonable expectation of privacy doctrine

A useful conceptual frame is the reasonable expectation of privacy doctrine, originally developed in US Fourth Amendment law. The doctrine asks whether a person has manifested a subjective expectation of privacy and whether that expectation is one that society recognises as reasonable.

For scraping, the doctrine maps as follows. A person who posts on a public Twitter account has manifested a reduced expectation of privacy in the content of those posts. The person retains a higher expectation regarding aggregation, profiling, and downstream resale. A scraper that respects the original posting intent is on stronger footing than a scraper that aggregates and resells.

The doctrine is not directly applicable to GDPR or CCPA, but it informs the proportionality analysis in both regimes. Regulators ask whether the scraping operation respects the reasonable expectations the data subject would have had at the time of original publication. A scrape that aligns with those expectations is more likely to pass muster.

Indirect identifiers and re-identification risk

Direct identifiers (name, email, government ID) are the easy case. Indirect identifiers (employer, role, city, photo) and quasi-identifiers (ZIP code, date of birth, gender) raise the harder question of re-identification risk.

Latanya Sweeney's 2000 study estimated that 87 percent of the US population could be uniquely identified by five-digit ZIP code, date of birth, and gender. Subsequent research extended the analysis to richer attribute sets. The implication for scrapers is that combinations of seemingly innocuous attributes can identify individuals.

The 2026 best practice for scrapers handling indirect identifiers is to apply k-anonymity at the analytical layer (ensuring at least k records share each quasi-identifier combination), to apply differential privacy to aggregates, and to suppress or generalise quasi-identifiers when the combination becomes too specific.

The classification of indirect-identifier-rich data as personal data depends on whether re-identification is reasonably likely. Under GDPR Recital 26 the test considers all means reasonably likely to be used. The threshold is low in practice because attackers have access to many auxiliary datasets.
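The k-anonymity step can be checked mechanically. The sketch below is a minimal illustration: the function names, the `zip` and `gender` columns, and the sample rows are invented, and it covers only generalisation and suppression, not differential privacy.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each quasi-identifier combination."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def k_anonymity(records, quasi_identifiers):
    """Smallest group size in the dataset; 0 for an empty dataset."""
    sizes = group_sizes(records, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def generalise_zip(record, digits=3):
    """Keep only the leading ZIP digits and mask the rest."""
    out = dict(record)
    out["zip"] = record["zip"][:digits] + "*" * (len(record["zip"]) - digits)
    return out


def suppress_small_groups(records, quasi_identifiers, k):
    """Drop records whose quasi-identifier group has fewer than k members."""
    sizes = group_sizes(records, quasi_identifiers)
    return [r for r in records
            if sizes[tuple(r[q] for q in quasi_identifiers)] >= k]


qi = ["zip", "gender"]
records = [
    {"zip": "94107", "gender": "F"},
    {"zip": "94109", "gender": "F"},
    {"zip": "94110", "gender": "F"},
    {"zip": "10001", "gender": "M"},
]
generalised = [generalise_zip(r) for r in records]
released = suppress_small_groups(generalised, qi, k=3)
print(k_anonymity(records, qi), k_anonymity(generalised, qi), k_anonymity(released, qi))
```

In the sample, generalising the ZIP code alone is not enough because one record still sits in a group of one; suppression removes it. A dataset that reaches a target k can still leak attributes when everyone in a group shares the same sensitive value, so treat the check as one input to the assessment, not as proof of anonymisation.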

Next steps

The single highest-leverage action this week is to enumerate every field your pipelines collect and run them through the classification matrix above. Build the register. Two hours of work creates the foundation for every downstream compliance conversation. For deeper compliance, head to the DRT compliance hub and pair this with the GDPR and CCPA guides.

This guide is informational, not legal advice.
