Personal data vs public data in scraping: a 2026 framework
Classifying each target’s data as personal or non-personal is the single most consequential decision a scraping team makes. The classification governs which compliance regime applies, which legal defences are available, what your storage and retention obligations look like, and how you respond to deletion requests. In 2026, the line is more contested than ever: regulators have consistently held that public availability does not exempt data from privacy regulation, while courts in some jurisdictions have held that scraping public pages is broadly permissible under contract and computer-misuse law. This guide walks through the framework, the regulator and court positions, a working classification matrix, and a workflow your team can operationalise.
The audience is the technical lead, in-house counsel, or product owner who needs to make defensible classification decisions about every scrape target.
Why the classification matters operationally
Each target you classify as personal data triggers obligations: lawful basis documentation under GDPR, opt-out mechanisms under CCPA, consent under PDPA, deletion-on-request under DPDP. Each target you classify as non-personal public data triggers far fewer obligations: contract analysis, copyright analysis, robots.txt courtesy.
Misclassification is the most common compliance failure. Teams routinely classify data as “public” because the URL was readable without logging in, when in fact the data identifies natural persons. Others over-classify, treating every dataset as personal and adding overhead that is not legally required.
A working classification matrix is the single highest-leverage compliance artefact a scraping team builds. It pays for itself the first time a regulator asks “what is your basis for processing?”
For the broader compliance picture, see the GDPR compliance guide, the CCPA compliance guide, and the HiQ Labs ruling explainer.
How major regimes define personal data
The definitions converge but the edges diverge. The pattern is “any data linkable to a natural person, with edge-case carve-outs.”
| Regime | Core definition | Key edge case |
|---|---|---|
| GDPR (EU) | Information relating to an identified or identifiable natural person | IP addresses and cookie identifiers count |
| UK GDPR | Same as EU | Same |
| CCPA / CPRA (California) | Information that identifies, relates to, describes, or could reasonably be linked to a consumer or household | Household-level data included |
| PDPA (Singapore) | Data about an individual who can be identified from that data, alone or with other information the organisation is likely to have access to | Broader public-availability carve-out |
| DPDP (India) | Data in digital form about an individual who is identifiable by or in relation to that data | Limited public-availability carve-out |
| LGPD (Brazil) | Information relating to an identified or identifiable natural person | Mirrors GDPR |
| PIPL (China) | Information related to identified or identifiable natural persons | Strict sensitive-data category |
| POPIA (South Africa) | Information relating to an identifiable, living natural person or an identifiable, existing juristic person | Includes some company data |
The two outliers worth flagging: PIPEDA in Canada follows GDPR-style definitions but adds reasonable-purpose tests, and APPI in Japan distinguishes “personal information” from “personal data” with technical processing requirements that catch many scrapers off guard.
The “public” question regulators consistently reject
Across nearly every regime, regulators have taken the position that public availability does not exempt data from the regulation. The clearest statements:
The European Data Protection Board issued a 2024 opinion explicitly stating that scraping publicly available personal data still requires a lawful basis under GDPR. The opinion responded to AI training scrapers who argued that publicly readable data was outside scope.
The California Privacy Protection Agency has issued enforcement guidance reading the “publicly available information” carve-out narrowly, particularly excluding inferences drawn from public data.
The Singapore PDPC has guidance distinguishing “publicly available” (where collection is permitted without consent) from “publicly accessible” (where consent rules still apply). The two terms are routinely conflated in commercial discussions but the PDPC reads them strictly.
The Indian DPDP rules issued in 2025 carry a narrow carve-out for “personal data made publicly available by the data principal,” which excludes data that became public through breach, leak, or aggregation.
The pattern: scrapers who argue “public, therefore exempt” lose the argument with regulators. The legal posture must be “personal, with documented lawful basis” or “not personal at all.”
Classification matrix: what counts as personal
| Data element | Personal under GDPR | Personal under CCPA | Personal under PDPA |
|---|---|---|---|
| Full name | Yes | Yes | Yes |
| Email address | Yes | Yes | Yes |
| Phone number | Yes | Yes | Yes |
| IP address | Yes | Yes | Likely yes |
| Cookie identifier | Yes | Yes | Likely yes |
| Device fingerprint | Yes | Yes | Yes |
| Geolocation (precise) | Yes | Yes (sensitive) | Yes |
| Geolocation (city level) | Likely yes | Yes | Conditional |
| Job title (anonymised) | No | No | No |
| Job title + company name | Likely yes | Yes | Yes |
| Username (re-identifiable) | Yes | Yes | Yes |
| Username (truly anonymous) | No | Conditional | No |
| Profile photo | Yes (biometric inference) | Yes | Yes |
| Forum post (pseudonymous) | Yes | Yes | Yes |
| Forum post (anonymous) | No | Conditional | No |
| Aggregated counts (no individual) | No | No | No |
| Company name only | No | No | No |
| Company financials (public filings) | No | No | No |
| Product price | No | No | No |
| Product review text (with username) | Yes | Yes | Yes |
| Product review text (anonymised) | No | Conditional | No |
Where a cell says “conditional” or “likely yes,” the classification depends on context, combination with other fields, and the realistic re-identification risk. The conservative move is to treat the element as personal and document the assessment.
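One way to make the matrix usable inside a pipeline is to encode it as a lookup table and apply the conservative default in code. The sketch below is illustrative only: the element keys, the `MATRIX` name, and both helper functions are hypothetical, and only a handful of rows from the table are encoded.

```python
# Hypothetical encoding of a few rows of the matrix above.
# Keys and structure are assumptions; extend with the remaining rows.
MATRIX = {
    "full_name": {"gdpr": "yes", "ccpa": "yes", "pdpa": "yes"},
    "ip_address": {"gdpr": "yes", "ccpa": "yes", "pdpa": "likely yes"},
    "geolocation_city": {"gdpr": "likely yes", "ccpa": "yes", "pdpa": "conditional"},
    "username_anonymous": {"gdpr": "no", "ccpa": "conditional", "pdpa": "no"},
    "product_price": {"gdpr": "no", "ccpa": "no", "pdpa": "no"},
}


def verdict(element: str, regime: str) -> str:
    """Look up the matrix cell; anything not encoded comes back as "unknown"."""
    return MATRIX.get(element, {}).get(regime, "unknown")


def treat_as_personal(element: str, regime: str) -> bool:
    """Conservative default: only a clear "no" is treated as non-personal."""
    return verdict(element, regime) != "no"


def needs_documented_assessment(element: str, regime: str) -> bool:
    """Flag cells whose answer depends on context and must be written up."""
    return verdict(element, regime) in {"conditional", "likely yes", "unknown"}
```

Treating unknown elements as personal keeps a newly added field from slipping through unclassified until someone adds it to the table.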
The combination problem
A single field can be non-personal in isolation and personal in combination. This is the hardest part of the classification.
Consider a scrape target that returns three columns: company name, job title, year of joining. None of the three is personal data on its own. Combined, they may identify a single individual at a small company. GDPR Recital 26 requires an assessment of all “the means reasonably likely to be used” to identify a person, and combination with other available datasets counts.
The practical workflow: enumerate every field you collect, run a re-identification test against your most likely combinations, and treat the resulting dataset as personal if any combination identifies an individual. The enumeration is one-time work per pipeline; the test pays off forever.
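The re-identification test in that workflow can be approximated with a uniqueness check: for each combination of fields, see whether any record is the only one with its values. The sketch below is a minimal version under stated assumptions. The function name and sample records are hypothetical, and the check only looks inside the dataset itself, so it cannot see auxiliary data an outsider could join against and will understate the real risk.

```python
from collections import Counter
from itertools import combinations


def singling_out_combinations(records, fields, max_size=3):
    """Return the field combinations for which at least one record is unique."""
    risky = []
    for size in range(1, max_size + 1):
        for combo in combinations(fields, size):
            counts = Counter(tuple(r.get(f) for f in combo) for r in records)
            if any(n == 1 for n in counts.values()):
                risky.append(combo)
    return risky


# Hypothetical sample: no single field singles anyone out,
# but job title plus joining year does.
records = [
    {"company": "Acme", "job_title": "Engineer", "year_joined": 2019},
    {"company": "Acme", "job_title": "Engineer", "year_joined": 2021},
    {"company": "Acme", "job_title": "Analyst", "year_joined": 2019},
    {"company": "Acme", "job_title": "Analyst", "year_joined": 2021},
]

risky = singling_out_combinations(records, ["company", "job_title", "year_joined"])
# risky == [("job_title", "year_joined"), ("company", "job_title", "year_joined")]
```

Any non-empty result is a signal to treat the dataset as personal; an empty result is not proof of the opposite.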
For the deeper compliance overlay on this question, see the GDPR compliance guide.
Decision tree for classifying a new scrape target
Q1: Does the dataset contain any direct identifier (name, email, phone, ID)?
├── Yes -> Personal data. Apply full compliance regime.
└── No -> Q2
Q2: Does the dataset contain any quasi-identifier (location, job, age, gender)?
├── No -> Likely non-personal. Document assessment.
└── Yes -> Q3
Q3: Can a realistic combination of fields identify an individual?
├── Yes -> Personal data. Apply full compliance regime.
└── No -> Q4
Q4: Does the dataset contain inference targets (photos, biometric data)?
├── Yes -> Personal data. Sensitive category likely.
└── No -> Q5
Q5: Is the data linkable to other data you hold or could acquire?
├── Yes -> Personal data. Apply full compliance regime.
└── No -> Non-personal. Document and proceed.
A “yes” at Q1, Q3, Q4, or Q5 escalates to personal-data treatment; a “yes” at Q2 only moves the assessment on to the combination test. The cost of over-classifying once is small; the cost of under-classifying is regulator action.
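The tree translates directly into a small function, which makes the outcome reproducible and easy to log next to the assessment. This is a sketch that mirrors the five questions as written; the `TargetAnswers` class, its field names, and the returned labels are hypothetical, and the answers themselves still come from a human review.

```python
from dataclasses import dataclass


@dataclass
class TargetAnswers:
    has_direct_identifier: bool    # Q1: name, email, phone, ID
    has_quasi_identifier: bool     # Q2: location, job, age, gender
    combination_identifies: bool   # Q3: realistic combination identifies someone
    has_inference_targets: bool    # Q4: photos, biometric data
    linkable_to_other_data: bool   # Q5: linkable to data held or obtainable


def classify_target(a: TargetAnswers) -> str:
    """Walk Q1 to Q5 in order and return the first outcome reached."""
    if a.has_direct_identifier:
        return "personal"
    if not a.has_quasi_identifier:
        return "likely non-personal (document assessment)"
    if a.combination_identifies:
        return "personal"
    if a.has_inference_targets:
        return "personal (sensitive category likely)"
    if a.linkable_to_other_data:
        return "personal"
    return "non-personal (document and proceed)"
```

For example, `classify_target(TargetAnswers(False, True, False, False, True))` reaches Q5 and returns `"personal"`.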
Pseudonymisation and anonymisation as compliance levers
Pseudonymisation (replacing identifiers with codes while retaining the linkage table) does not remove data from GDPR scope but does reduce risk and unlock several allowances. Anonymisation (irreversible removal of all identifiers) does remove data from GDPR scope, but the bar for “irreversible” is very high.
The three tests most often applied come from the Article 29 Working Party’s Opinion 05/2014 on anonymisation techniques, which remains the standard European reference. To count as anonymised, a dataset must resist singling out (no individual can be isolated in the dataset), linkability (records relating to the same individual cannot be linked, within or across datasets), and inference (no attribute can be inferred about an individual with significant probability).
Most scraped datasets fail at least one test. A dataset of forum posts with usernames stripped still leaves writing-style fingerprints that can be re-identified. A dataset of product reviews with location stripped still leaves time-of-purchase patterns. Treat anonymisation claims sceptically and engineer accordingly.
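As an illustration of pseudonymisation, the sketch below replaces an identifier with a keyed hash instead of keeping an explicit linkage table. This is a swapped-in variant of the technique described above: whoever holds the secret key can recompute the mapping, so the output is still pseudonymised personal data, not anonymised data. The function name and the lowercase/strip normalisation are assumptions.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Replace an identifier with a stable keyed code (HMAC-SHA256, hex).

    The same identifier and key always give the same code, so records can
    still be joined; without the key the code cannot be recomputed.
    """
    normalised = identifier.strip().lower()
    return hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()
```

Store the key separately from the dataset (for example in a secrets manager) and rotate it according to your retention policy. A plain unkeyed hash is a weaker choice because common identifiers such as email addresses can be guessed and hashed by anyone.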
For practical pseudonymisation patterns, see building an ethics-first scraping policy.
A working classification register
Each scraping pipeline should maintain a classification register with the following columns:
| Column | Purpose |
|---|---|
| Pipeline ID | Internal identifier |
| Source URL pattern | What you scrape |
| Fields collected | Every column in your storage schema |
| Personal data classification (per regime) | Yes/no/conditional per GDPR, CCPA, PDPA, DPDP |
| Lawful basis (per regime) | LIA reference, consent, contract |
| Retention period | Days/months |
| Deletion mechanism | Process and contact |
| Last reviewed | Date and reviewer |
The register can double as the core of your Article 30 record of processing activities under GDPR, and of the comparable record under most other regimes. Build it once, review it monthly, and you have most of your compliance documentation.
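A register entry maps naturally onto a small typed record, which lets you automate two checks: whether the review is overdue and whether any regime lacks a recorded lawful basis. The sketch below assumes Python 3.9 or later; the class, field names, and the 31-day review window (a stand-in for "monthly") are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class RegisterEntry:
    pipeline_id: str
    source_url_pattern: str
    fields_collected: list[str]
    classification: dict[str, str]   # regime -> "yes" / "no" / "conditional"
    lawful_basis: dict[str, str]     # regime -> LIA reference, consent, contract
    retention_days: int
    deletion_mechanism: str
    last_reviewed: date
    reviewer: str

    def review_overdue(self, today: date, max_age_days: int = 31) -> bool:
        """True when the last review is older than the allowed window."""
        return today - self.last_reviewed > timedelta(days=max_age_days)

    def missing_lawful_basis(self) -> list[str]:
        """Regimes classified as anything other than "no" with no basis recorded."""
        return sorted(
            regime
            for regime, answer in self.classification.items()
            if answer != "no" and not self.lawful_basis.get(regime)
        )
```

Running both checks in CI or a scheduled job turns the register from a static document into something that fails loudly when it goes stale.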
External references
The European Data Protection Board’s opinions on scraping and AI training are at edpb.europa.eu/our-work-tools/our-documents. The CPPA enforcement guidance archive is at cppa.ca.gov. The Singapore PDPC advisory on publicly available data is at pdpc.gov.sg.
Comparison: classification regimes side by side
| Regime | Public-data carve-out | Inferred data treatment | Pseudonymisation effect |
|---|---|---|---|
| GDPR | Narrow; lawful basis still required | Personal | Reduces risk; still personal |
| CCPA | Moderate; requires lawful availability | Personal if linkable | Reduces risk; still personal |
| PDPA Singapore | Broader; permits collection without consent | Conditional | Reduces risk; still personal |
| DPDP India | Limited; requires data principal action | Personal | Reduces risk; still personal |
| PIPL China | Very narrow; consent default | Personal | Reduces risk; still personal |
| LGPD Brazil | Mirrors GDPR | Personal | Reduces risk; still personal |
Build for GDPR and you have most of the surface covered. The Singapore carve-out can ease specific use cases but does not extend to AI training or large-scale aggregation.
A worked example: scraping a public job board
Suppose you scrape a public job board for talent intelligence. The fields you collect include candidate name (where displayed), current job title, current company, years of experience, location (city), skills tags, and a profile photo URL.
Classification: every field, including the optional photo URL, contributes to identification of an individual. The combination is unambiguously personal data under all major regimes.
Lawful basis: legitimate interest, with a Legitimate Interest Assessment documenting the purpose (talent intelligence for B2B customers), the necessity (no aggregated alternative meets the use case), and the balancing test (candidates publishing on a public board reasonably expect aggregation by recruiters and intelligence vendors; the balance leans towards processing, but with mitigations).
Mitigations: data minimisation (skip photo URL unless specifically needed), retention limits (purge after 12 months), opt-out mechanism (publish a deletion form), and access controls (no resale of identifiable records to third parties).
Outcome: defensible posture, documented in the register, with the LIA on file and the privacy notice published. A regulator inquiry can be answered in a single email with attachments.
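Two of the mitigations in this example, data minimisation and the 12-month retention limit, are easy to enforce in code. The sketch below is one possible shape, not a reference implementation: the field names, the `scraped_at` timestamp, and the 365-day approximation of 12 months are all assumptions.

```python
from datetime import datetime, timedelta, timezone

# Fields the pipeline is allowed to store; the photo URL is deliberately absent.
KEEP_FIELDS = {
    "name", "job_title", "company", "years_experience",
    "city", "skills", "source_url", "scraped_at",
}
RETENTION = timedelta(days=365)  # approximation of the 12-month limit


def minimise(record: dict) -> dict:
    """Drop every field that is not on the allow-list."""
    return {k: v for k, v in record.items() if k in KEEP_FIELDS}


def is_expired(record: dict, now: datetime) -> bool:
    """True when the record has been held longer than the retention period."""
    return now - record["scraped_at"] > RETENTION


def purge(records: list[dict], now: datetime) -> list[dict]:
    """Return only the records still inside the retention period."""
    return [r for r in records if not is_expired(r, now)]
```

An allow-list is used instead of a block-list so that a new field added by the source site is dropped by default until someone classifies it.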
For the parallel discussion of how this works for ecommerce competitor data (different classification, different regime), see the personal vs public data framework’s worked retail example.
FAQ
Is data on a public website automatically not personal?
No. Public availability does not change whether data is personal. If the data identifies a natural person, it is personal data.
Does anonymisation move data out of GDPR scope?
True anonymisation does, but the bar is very high. Most scraped data that engineers call “anonymised” is in fact pseudonymised under EU law and remains in scope.
What if I only collect aggregated counts?
Aggregated counts that do not relate to identifiable individuals are not personal data. The aggregation must be irreversible and the cell size large enough to prevent inference.
How does business-to-business contact data classify?
A name plus a corporate email or job title is personal data under every major regime. A professional context does not make the data any less personal.
What about AI-generated synthetic data?
If the synthetic data was trained on personal data and can re-identify individuals from the training set, regulators are increasingly treating it as personal data with respect to those individuals. The case law is still developing but the conservative posture is to apply the regime.
Extended definitional analysis
The personal-versus-public distinction is one of the most misunderstood concepts in scraping law. The clarifying frame is that personal and public are independent dimensions, not opposites.
A piece of information can be:
- Personal and public (a LinkedIn profile, a public Twitter post by a named individual).
- Personal and private (a medical record, a salary slip).
- Non-personal and public (weather data, public transit schedules).
- Non-personal and private (an internal company memo without identifiers).
Privacy law (GDPR, CCPA, PDPA, DPDP) regulates the personal axis regardless of whether the data is public. Most regimes treat publicly available personal data as still personal data subject to most rules. The CCPA carve-out for publicly available information and the PDPA consent exception for publicly available data are the partial exceptions, and both are narrower than commonly assumed.
This creates the most common scraping mistake. A team scrapes public LinkedIn profiles and assumes that public visibility removes privacy obligations. It does not. The data is still personal data under GDPR Article 4(1) and PDPA Section 2. The right to object, the right to erasure, the lawful-basis requirement, and the transparency obligations all apply.
Implementation patterns for the distinction
A 2026 scraping pipeline should classify every record on both axes at ingest.
- Tag personal data using a deterministic detector for direct identifiers (name, email, phone, address, government ID).
- Tag indirect identifiers separately (employer, role, location to city precision, photo).
- Where several regimes apply to a personal record, follow the most restrictive one.
- Apply different retention TTLs to the two classes.
- Maintain a per-record source URL so deletion requests can find the records.
Code pattern: dual classification at ingest
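import re

# Regex detectors for direct identifiers that appear in free text.
PERSONAL_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone_us": re.compile(r"\+?1?[\s.-]?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def pick_regime(is_personal, jurisdiction):
    # Placeholder (assumption): the jurisdiction codes and the GDPR fallback
    # are illustrative; replace with your own regime mapping.
    if not is_personal:
        return None
    regimes = {"EU": "GDPR", "US-CA": "CCPA", "SG": "PDPA", "IN": "DPDP"}
    return regimes.get(jurisdiction, "GDPR")

def classify_record(record):
    # Classify on both axes: personal (pattern match) and public (source visibility).
    text = record.get("text", "")
    is_personal = any(p.search(text) for p in PERSONAL_PATTERNS.values())
    is_public = record.get("source_visibility") == "public"
    return {
        "personal": is_personal,
        "public": is_public,
        "regime": pick_regime(is_personal, record.get("jurisdiction")),
    }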
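The last item in the implementation list, keeping a per-record source URL so deletion requests can find the records, can be sketched as an in-memory index. The `RecordStore` class and its method names are hypothetical; a production pipeline would more likely use a database index on the same column.

```python
from collections import defaultdict


class RecordStore:
    """Toy store that indexes record IDs by the URL they were scraped from."""

    def __init__(self):
        self._records = {}
        self._ids_by_source = defaultdict(set)

    def add(self, record_id, source_url, payload):
        # If the ID is being re-used, drop it from the old URL's index first.
        old = self._records.get(record_id)
        if old is not None:
            self._ids_by_source[old["source_url"]].discard(record_id)
        self._records[record_id] = {**payload, "source_url": source_url}
        self._ids_by_source[source_url].add(record_id)

    def delete_by_source(self, source_url):
        """Delete every record scraped from source_url; return how many were removed."""
        ids = self._ids_by_source.pop(source_url, set())
        for record_id in ids:
            del self._records[record_id]
        return len(ids)

    def __len__(self):
        return len(self._records)
```

A data subject who sends the URL of their profile can then be served with a single `delete_by_source(url)` call, and the returned count can go into the deletion log.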
Comparison: how regimes treat the four quadrants
| Quadrant | EU GDPR | California CCPA | Singapore PDPA | India DPDP |
|---|---|---|---|---|
| Personal and public | Full scope | Partial carve-out | In scope; consent exception for publicly available data | In scope; narrow carve-out for data the principal made public |
| Personal and private | Full scope | Full scope | Full scope | Full scope |
| Non-personal and public | Out of scope | Out of scope | Out of scope | Out of scope |
| Non-personal and private | Out of scope | Out of scope | Out of scope | Out of scope |
Additional FAQ
Is a public Twitter post personal data?
Yes, if it identifies or relates to an identifiable individual. Most named accounts qualify.
What about pseudonymous accounts?
Pseudonymous accounts can still be personal data if re-identification is reasonably possible.
Is corporate information personal data?
A company name is not personal data. An individual employee’s name and title is personal data even though it relates to corporate context.
Does aggregation remove the personal classification?
Statistical aggregates that prevent re-identification can be non-personal. Most scraping aggregates do not meet the strict re-identification threshold.
Real cases that turned on the personal-vs-public distinction
Three regulatory and court actions in 2024-2026 illustrate how the distinction operates in practice.
In Clearview AI v. CNIL (France; fine first imposed in October 2022, 2024 enforcement confirmation), the CNIL imposed a 20 million euro fine against Clearview AI for scraping publicly accessible facial images and constructing a biometric database. Clearview argued the source images were public. The CNIL rejected the argument, holding that public availability does not exempt biometric data from GDPR Article 9 special-category protection. The decision was upheld on appeal in 2025 and is now the leading European precedent for “public does not mean exempt” in the AI training context.
In OpenAI v. Garante (Italy; fine announced in December 2024), the Italian DPA fined OpenAI 15 million euros for processing publicly scraped personal data without an adequate lawful basis under GDPR Article 6, and for failing to provide transparency to data subjects under Article 14. OpenAI argued legitimate interest. The Garante held that the balancing test failed because the data subjects had no reasonable expectation of being included in a generative model’s training corpus. The decision became influential across other EU DPAs as the template for AI training enforcement.
In Hangzhou Internet Court v. Douyin scraper (China, 2024), a Chinese commercial scraping operation collected publicly visible Douyin user names, follower counts, and post metrics for a competitive intelligence product. The court applied PIPL strictly and held that even publicly displayed data about identified natural persons required consent or a statutory basis, and the commercial intelligence purpose did not qualify. The case is the leading PIPL precedent on commercial scraping of publicly visible personal data.
The pattern across all three: regulators and courts consistently treat “public” as orthogonal to “personal,” and a scraping operation that conflates the two faces material liability. The defensible posture is to treat publicly visible personal data as personal data with a documented lawful basis, not as exempt by virtue of public visibility.
How regulators apply the personal-versus-public distinction
Regulators across major jurisdictions have repeatedly clarified that publicly available personal data remains personal data under the relevant privacy statute. The European Data Protection Board, the UK Information Commissioner’s Office, the Italian Garante, the French CNIL, the Singapore Personal Data Protection Commission, and the California Privacy Protection Agency have each issued guidance to that effect during 2023-2026.
The clearest articulation is in the EDPB’s 2024 guidance on scraping for AI training. The guidance states that the public availability of data does not by itself create a lawful basis for processing under GDPR. The controller must still identify a lawful basis under Article 6 and, where special category data is involved, under Article 9. The same conclusion applies to retention, transparency, and rights handling.
The CCPA carve-out for publicly available information is the partial exception, but the CPPA has consistently interpreted the carve-out narrowly. Information lawfully made available from federal, state, or local government records is the core. Information that an individual has chosen to make available in a manner consistent with the purpose is also covered, but commercial platforms with terms of service restricting bulk access do not satisfy the latter prong.
The reasonable expectation of privacy doctrine
A useful conceptual frame is the reasonable expectation of privacy doctrine, originally developed in US Fourth Amendment law. The doctrine asks whether a person has manifested a subjective expectation of privacy and whether that expectation is one that society recognises as reasonable.
For scraping, the doctrine maps as follows. A person who posts on a public Twitter account has manifested a reduced expectation of privacy in the content of those posts. The person retains a higher expectation regarding aggregation, profiling, and downstream resale. A scraper that respects the original posting intent is on stronger footing than a scraper that aggregates and resells.
The doctrine is not directly applicable to GDPR or CCPA, but it informs the proportionality analysis in both regimes. Regulators ask whether the scraping operation respects the reasonable expectations the data subject would have had at the time of original publication. A scrape that aligns with those expectations is more likely to pass muster.
Indirect identifiers and re-identification risk
Direct identifiers (name, email, government ID) are the easy case. Indirect identifiers (employer, role, city, photo) and quasi-identifiers (ZIP code, date of birth, gender) raise the harder question of re-identification risk.
Latanya Sweeney’s 2000 study estimated that 87 percent of the US population could be uniquely identified by the combination of 5-digit ZIP code, date of birth, and gender. Subsequent research extended the analysis to richer attribute sets. The implication for scrapers is that combinations of seemingly innocuous attributes can identify individuals.
The 2026 best practice for scrapers handling indirect identifiers is to apply k-anonymity at the analytical layer (ensuring at least k records share each quasi-identifier combination), to apply differential privacy to aggregates, and to suppress or generalise quasi-identifiers when the combination becomes too specific.
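The k-anonymity part of that practice can be sketched in a few lines: measure the smallest group that shares a quasi-identifier combination, generalise the quasi-identifiers, then suppress whatever groups are still too small. The helper names, the sample records, and the specific generalisation (3-digit ZIP prefix, birth year only) are assumptions for illustration; differential privacy for aggregates is not covered here.

```python
from collections import Counter

QI = ["zip", "birth", "gender"]  # quasi-identifier columns (assumed names)


def group_sizes(records, quasi_identifiers):
    """Count how many records share each quasi-identifier combination."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def k_of(records, quasi_identifiers):
    """The dataset's k: the size of its smallest group (0 if empty)."""
    sizes = group_sizes(records, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def generalise(record, zip_digits=3):
    """Coarsen quasi-identifiers: mask the ZIP tail and keep only the birth year."""
    out = dict(record)
    out["zip"] = record["zip"][:zip_digits] + "*" * (len(record["zip"]) - zip_digits)
    out["birth"] = record["birth"][:4]  # "1990-04-12" -> "1990"
    return out


def suppress_small_groups(records, quasi_identifiers, k):
    """Drop records whose quasi-identifier group has fewer than k members."""
    sizes = group_sizes(records, quasi_identifiers)
    return [r for r in records if sizes[tuple(r[q] for q in quasi_identifiers)] >= k]


raw = [
    {"zip": "94107", "birth": "1990-04-12", "gender": "F"},
    {"zip": "94110", "birth": "1990-09-30", "gender": "F"},
    {"zip": "94103", "birth": "1990-01-05", "gender": "F"},
    {"zip": "10001", "birth": "1985-06-01", "gender": "M"},
]
coarse = [generalise(r) for r in raw]            # three records now share one group
safe = suppress_small_groups(coarse, QI, k=3)    # the lone "100**" record is dropped
```

In this sample every raw record is unique (k = 1); after generalisation and suppression the remaining three records form a single group (k = 3). k-anonymity alone does not protect against attribute inference when everyone in a group shares the same sensitive value, so treat it as one layer, not the whole defence.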
The classification of indirect-identifier-rich data as personal data depends on whether re-identification is reasonably likely. Under GDPR Recital 26 the test considers all means reasonably likely to be used. The threshold is low in practice because attackers have access to many auxiliary datasets.
Next steps
The single highest-leverage action this week is to enumerate every field your pipelines collect and run them through the classification matrix above. Build the register. Two hours of work creates the foundation for every downstream compliance conversation. For deeper compliance, head to the DRT compliance hub and pair this with the GDPR and CCPA guides.
This guide is informational, not legal advice.