Decentralized identity and Web4: scrapers’ implications
Web4 decentralized identity is reshaping the assumptions scraping operators make about web access, authentication, and data trustworthiness. The collection of standards loosely grouped as Web4 (decentralized identifiers, verifiable credentials, agent-to-agent protocols, intent-based access) reached an inflection point in 2025-2026, with major platforms beginning to expose DID-based authentication, browsers shipping wallet integrations, and the IETF moving multiple drafts toward formal standardisation. For scrapers, the implications cut both ways: some doors that were locked behind central authentication open to DID-authenticated agents, while other doors close as anonymous unauthenticated scraping becomes harder. This guide walks through what Web4 actually means in 2026, the standards that matter, how DID and verifiable credentials change the access landscape, the scraping-relevant use cases, and a practical posture for operators.
The audience is the technical lead, product owner, or platform architect who needs to understand where decentralized identity fits in the scraping landscape they will operate over the next 24 months.
What Web4 actually means in 2026
The term “Web4” is contested. Different industry voices use it for different things. In 2026 the dominant usage refers to a converging set of standards and practices that move beyond the platform-mediated identity of Web2 and the wallet-mediated speculation of Web3 toward verifiable, portable, agent-friendly identity.
The constituent technologies:
| Technology | Standard | Status (mid-2026) |
|---|---|---|
| Decentralized Identifiers (DIDs) | W3C Recommendation | Stable since 2022 |
| Verifiable Credentials (VC) | W3C Recommendation | Stable since 2022 |
| OpenID Connect for Identity Assurance | OpenID Foundation | Production |
| OpenID for Verifiable Credentials (OID4VC) | OpenID Foundation | Production |
| DIDComm Messaging | DIF spec | Late draft |
| Trust over IP framework | IETF / ToIP | Active drafts |
| Authority-bound digital wallets | Browser specs | Shipping in major browsers |
The shift in 2025-2026 was that these standards moved from research to production. The EU Digital Identity Wallet (EUDI Wallet) reached general availability in mid-2026 across most member states. The UK Digital Identity Service began commercial issuance. Singapore’s Singpass added VC issuance. India’s DigiLocker integrated VC alongside its existing document store.
For scrapers, the relevant question is: what do these wallets carry, and which sites will require them?
For the broader emerging-tech context, see AI agents as web users and verifiable credentials and scraping.
DIDs explained for scraping operators
A Decentralized Identifier is a globally unique identifier that does not require a central registration authority. The format is did:method:identifier, where the method specifies how the identifier is resolved (did:web, did:key, did:ion, did:plc, and many more).
The point of a DID is that the holder controls the keys associated with it. A DID document, resolved by the method-specific resolution process, contains the public keys that the holder uses to authenticate.
For scraping access, DIDs change the authentication model in two ways. First, a site can require an authenticated visitor without the visitor needing an account on the site (the user’s wallet asserts their DID and signs a challenge). Second, the site can verify properties of the visitor (over 18, EU resident, paid subscriber to a credential issuer) without learning anything about the visitor beyond the asserted property.
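A minimal sketch of that challenge-response step, assuming an Ed25519 key pair whose public half is listed in the visitor’s DID document; the function names are illustrative rather than any specific wallet’s API.

```python
# Minimal DID challenge-response sketch (illustrative; not a specific wallet API).
# Assumes the DID document lists this Ed25519 public key under "authentication".
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)

def sign_challenge(wallet_key: Ed25519PrivateKey, challenge: bytes) -> bytes:
    # The wallet proves control of the DID by signing the site's nonce.
    return wallet_key.sign(challenge)

def verify_challenge(did_key: Ed25519PublicKey, challenge: bytes, signature: bytes) -> bool:
    # The site resolves the DID document, pulls the authentication key,
    # and checks the signature; no site-local account is involved.
    try:
        did_key.verify(signature, challenge)
        return True
    except InvalidSignature:
        return False
```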
A scraping operator who wants to access a DID-authenticated site has two options: obtain a DID and the relevant credentials (probably hard for scraping at scale), or partner with a credential holder who can act on the operator’s behalf (cleaner, but bounded by what the partner is authorised to do).
Verifiable credentials and selective disclosure
Verifiable Credentials are signed assertions issued by an issuer about a subject. A diploma is a credential. A driver’s licence is a credential. A subscription to a publication is a credential.
VCs use cryptographic signatures so that any verifier can confirm the issuer’s signature without contacting the issuer. The holder presents the credential as a Verifiable Presentation, which can include selective disclosure (showing only certain fields) and zero-knowledge proofs (proving a property without revealing the underlying data).
For scraping, VCs reshape the authorisation model. A site that today says “you must have a paid subscription to read this article” can, in a VC world, verify the subscription credential without requiring the user to have an account on the site. The credential travels with the user (or the user’s agent).
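For orientation, a subscription credential in the W3C VC 2.0 shape looks roughly like the following; the issuer DID, the “SubscriptionCredential” type, and the claim names are invented for illustration.

```python
# Illustrative shape of a subscription credential (W3C VC Data Model 2.0).
# The issuer DID, "SubscriptionCredential" type, and claim names are invented.
subscription_vc = {
    "@context": ["https://www.w3.org/ns/credentials/v2"],
    "type": ["VerifiableCredential", "SubscriptionCredential"],
    "issuer": "did:web:publisher.example",
    "validUntil": "2027-01-01T00:00:00Z",
    "credentialSubject": {
        "id": "did:key:z6Mk...",          # the subscriber's DID (elided placeholder)
        "subscriptionTier": "premium",
    },
    # A "proof" member carries the issuer's signature; any verifier can check it
    # offline without contacting the issuer.
}
```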
This has direct scraping implications:
| Scenario | Pre-VC world | VC world |
|---|---|---|
| Paywalled article access | Requires site account, login flow | Requires VC presentation |
| Age-gated content | Account verification | Age VC selective disclosure |
| Geographic restriction | IP geolocation | Residency VC |
| Subscription bundling | Each site separate | Cross-site credential reuse |
Scrapers operating against VC-protected sites face a fundamentally different access landscape. The traditional residential-proxy approach that defeats IP-based geofencing does not defeat credential-based gating.
For the broader credentials-and-scraping discussion, see verifiable credentials and scraping.
How agent-to-agent protocols matter
The DIDComm and Trust over IP frameworks specify how two agents (each with a DID) can establish authenticated, encrypted communication channels. The expected use case is human-to-human or service-to-service, but the protocols are agent-agnostic.
For 2026 scrapers, agent-to-agent protocols matter because they enable a new class of structured data exchange that bypasses the traditional scrape-or-API binary. Instead of scraping a website’s rendered HTML or hitting a vendor’s REST API, a scraping agent can establish a DIDComm channel with the source’s data agent, present a credential proving authorisation, and receive structured data over an encrypted channel.
The 2025-2026 deployments of this pattern are still early. Several supply-chain platforms expose DIDComm endpoints alongside their REST APIs. Several open-banking aggregators expose DIDComm as the preferred channel. The trend is real, the volume is small, but the trajectory points toward more agent-to-agent and less HTML-or-REST.
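A structural sketch of that exchange; `pack_encrypted` and `send_to_service_endpoint` are hypothetical placeholders, not the API of any particular DIDComm library.

```python
# Structural sketch of an agent-to-agent data request over DIDComm v2.
# pack_encrypted() and send_to_service_endpoint() are hypothetical placeholders,
# not any specific library's API.
import uuid

def build_data_request(my_did: str, source_did: str, presentation: str) -> dict:
    # A DIDComm v2 message is JSON with id, type, from, to, and a body.
    return {
        "id": str(uuid.uuid4()),
        "type": "https://example.org/data-exchange/1.0/request",  # hypothetical protocol URI
        "from": my_did,
        "to": [source_did],
        "body": {
            "dataset": "catalogue-feed",    # illustrative dataset name
            "presentation": presentation,   # credential proving authorisation
        },
    }

# Conceptual usage:
#   message = build_data_request(my_did, source_did, presentation)
#   packed = pack_encrypted(message, frm=my_did, to=source_did)
#   response = send_to_service_endpoint(source_did, packed)
```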
Comparison: identity models that scrapers operate within
| Model | Identity authority | Visibility to scraper | Authorization mechanism |
|---|---|---|---|
| Web1 (open web) | None | Full | None |
| Web2 (platform) | Platform | Partial (if logged out) | Account + session |
| Web3 (wallet) | Self via blockchain | Pseudonymous | Wallet signature |
| Web4 (DID + VC) | Self with verified attestations | Selective | Credential presentation |
Each model has different scraping implications. Web4 is the model where scraping needs to think about credentials, not just IPs.
Decision tree: how to access a Web4-authenticated source
Q1: Does the source require any form of authentication?
├── No -> Standard scraping; existing techniques apply.
└── Yes -> Q2
Q2: Is the authentication account-based (Web2 style)?
├── Yes -> Account creation; standard logged-in scraping considerations.
└── No -> Q3
Q3: Is the authentication credential-based (Web4 style)?
├── Yes -> Q4
└── No -> Likely wallet-signature (Web3); evaluate.
Q4: Can your operation legitimately hold the required credential?
├── Yes -> Implement VC presentation; proceed.
└── No -> Q5
Q5: Is there a partnership path with a credential holder?
├── Yes -> Negotiate access via partner.
└── No -> Source is effectively unscrapable for your operation.
The decision tree forces explicit consideration of the credential question. For sources where the answer is “unscrapable”, the alternative is partnership or licensed access.
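The same tree as a small helper, so the credential question is answered explicitly for every source; the attribute names on `source` are illustrative.

```python
# The decision tree as code; the attributes on `source` are illustrative.
def access_path(source) -> str:
    if not source.requires_auth:
        return "standard scraping; existing techniques apply"
    if source.auth_style == "account":         # Web2
        return "account-based scraping; review logged-in considerations"
    if source.auth_style != "credential":      # not Web4, so likely Web3 wallet signature
        return "wallet-signature access; evaluate separately"
    if source.can_hold_credential:             # Q4
        return "implement VC presentation and proceed"
    if source.partner_available:               # Q5
        return "negotiate access via a credential-holding partner"
    return "effectively unscrapable; pursue partnership or licensed access"
```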
Worked example: scraping a VC-gated medical research portal
A 2026 medical research portal hosts open-access papers but gates downloadable supplementary data behind a verifiable credential proving the requester is an affiliated researcher at an accredited institution.
Web2 scraping path: create an account if possible, validate email, request access, scrape what is exposed. Often blocked by manual review.
Web4 access path: hold a Researcher Credential issued by an accredited issuer (university, professional body). Present the credential at the portal. Receive structured data over the credential-authorised channel.
For a scraper operating on behalf of a research institution, the Web4 path is cleaner: the institution is already issuing credentials to its researchers; the scraper acts on behalf of the institution; the credential travels with the request.
For a scraper operating commercially without an institutional relationship, the Web4 path is closed. The operator has to either partner with an institution or rely on the portal’s open-access surface.
Browser wallet integration in 2026
Major browsers shipped wallet integrations in 2025-2026:
| Browser | Wallet integration | Status |
|---|---|---|
| Brave | Native crypto + DID wallet | Production |
| Chrome | Optional via Web5 extensions | Mature extension ecosystem |
| Firefox | Native via Mozilla Account integration | Production |
| Safari | Apple Wallet integration | Apple-controlled |
| Edge | Microsoft Authenticator integration | Enterprise |
Browser-resident wallets bring DID and VC presentation to the user-facing layer. The browser exposes a JavaScript API (the Digital Credentials API) that sites can call to request credentials.
For headless and agentic browsers, the equivalent is wallet plug-ins that expose the same API but with programmatic credential management. Stagehand and Browserbase added wallet support in 2025; browser-use added it in 2026.
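As a shape, programmatic credential management in a headless context might look like the following; this is a hypothetical interface, not the actual API of Stagehand, Browserbase, or browser-use.

```python
# Hypothetical credential-manager interface for a headless/agentic browser.
# This is NOT the API of Stagehand, Browserbase, or browser-use; it only
# illustrates what programmatic credential management looks like.
from dataclasses import dataclass, field

@dataclass
class HeadlessWallet:
    did: str
    credentials: dict = field(default_factory=dict)   # credential type -> serialised VC

    def add_credential(self, credential_type: str, vc_token: str) -> None:
        self.credentials[credential_type] = vc_token

    def present(self, credential_type: str, audience: str) -> str:
        # A real wallet would return a signed Verifiable Presentation bound to
        # the requesting site (audience); this sketch just returns the stored token.
        if credential_type not in self.credentials:
            raise KeyError(f"no credential of type {credential_type}")
        return self.credentials[credential_type]
```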
Privacy implications and selective disclosure
VCs support selective disclosure: a holder can present only the fields needed for a request. A user proving age can present “I am over 18” without revealing the date of birth or any other field.
Zero-knowledge proofs go further: the holder can prove a predicate (over 18, in EU, paid subscriber) without presenting the underlying credential at all. The cryptography is mature; the deployment is patchy.
For scraping operators, the implications are:
- Credential-based authentication discloses only what the credential explicitly carries.
- Selective disclosure reduces the fields available for profiling and cross-site correlation (only the requested fields are revealed).
- Zero-knowledge presentation is functionally indistinguishable from anonymous access for the verifier.
These features generally favour the user, not the scraper. A scraper that wants to extract user identity from a site’s interaction logs has less to work with when users authenticate via ZKP-presented VCs.
For the broader privacy-preserving discussion, see privacy-preserving scraping.
What scraping operators should do in 2026
Three concrete actions.
First, audit your target sources for VC-gating signals. The signal is usually visible in the authentication flow: a “Sign in with EUDI Wallet” button, a “Connect Wallet” prompt, an OID4VC redirect. If your target sources are adopting these, plan your access path now.
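A crude page-level audit for those signals might look like this; the marker list is illustrative and needs tuning per source.

```python
# Crude audit for VC-gating signals in an authentication flow.
# The marker list is illustrative; extend it per source.
import re

VC_GATING_MARKERS = [
    r"sign in with eudi wallet",
    r"connect wallet",
    r"openid4vp|openid4vci|oid4vc",
    r"verifiable credential",
    r"digital credentials",
]

def detect_vc_gating(page_html: str) -> list:
    text = page_html.lower()
    return [m for m in VC_GATING_MARKERS if re.search(m, text)]

# Usage: hits = detect_vc_gating(html); any hits mean the source belongs on the
# access-planning list now rather than later.
```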
Second, evaluate partnership options. For sources where credential holding is impractical, partnerships with credential holders (research institutions, accredited resellers, licensed aggregators) are the access path. The market for these partnerships is forming now.
Third, consider becoming a credential issuer. For some scraping operators, the role flips: instead of scraping data, you become the issuer of credentials about data quality, freshness, or provenance. Several scraping platforms began issuing data-provenance credentials in 2025-2026.
For the related agent-as-user question, see AI agents as web users.
External references
The W3C DID specification is at w3.org/TR/did-core. The W3C Verifiable Credentials data model is at w3.org/TR/vc-data-model-2.0. The OpenID for Verifiable Credentials specification is at openid.net/specs/openid-4-verifiable-credential-issuance-1_0.html. IETF working drafts on Trust over IP are tracked at datatracker.ietf.org.
Comparison: scraping access methods in a Web4-influenced world
| Method | Effectiveness against Web4 | Cost | Risk |
|---|---|---|---|
| Residential proxy | Low (geofence only) | Medium | Detection |
| Account creation | Variable (Web2 only) | Low-medium | TOS breach |
| Browser automation | Moderate (with wallet) | Medium | Detection |
| Credential acquisition | High (where legitimate) | High setup | Legal alignment |
| Partnership / licensing | High | Highest setup | Lowest detection risk |
| Agent-to-agent (DIDComm) | High where supported | Medium | Lowest detection risk |
The pattern is clear: the future favours legitimate access paths. Operators who plan for credentialed access will have more options in 2027 than operators who do not.
A worked compliance overlay
Web4 access has compliance implications that traditional scraping does not. A scraper presenting a research credential is making representations about the credential holder and the issuing institution. False or misleading credential use is fraud, not just a TOS issue.
Three controls a Web4-using scraper should implement (a minimal sketch of the logging and revocation controls follows the list):
- Credential governance: written policy on which credentials the operation holds, who is the legitimate holder, what use is in-scope.
- Audit logging: every credential presentation logged with timestamp, target, and outcome.
- Revocation handling: when a credential is revoked (issuer or holder action), the operation must stop using it within a defined window.
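A minimal sketch of the logging and revocation-window controls, assuming an append-only JSON-lines log; the field names and window length are illustrative choices.

```python
# Minimal sketch of credential-presentation audit logging and a revocation window.
# The JSON-lines log, field names, and window length are illustrative choices.
import json
import time

AUDIT_LOG = "credential_presentations.jsonl"
REVOCATION_WINDOW_SECONDS = 3600   # defined window within which use must stop

def log_presentation(credential_id: str, target: str, outcome: str) -> None:
    entry = {"ts": time.time(), "credential": credential_id,
             "target": target, "outcome": outcome}
    with open(AUDIT_LOG, "a") as f:
        f.write(json.dumps(entry) + "\n")

def past_revocation_window(revoked_at: float) -> bool:
    # True once the defined window after revocation has elapsed; any use after
    # this point is out of policy.
    return time.time() - revoked_at > REVOCATION_WINDOW_SECONDS
```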
For the broader compliance posture, see building an ethics-first scraping policy.
FAQ
Is Web4 actually a thing in 2026?
The term is contested but the underlying standards (DIDs, VCs, OID4VC) are real and shipping. Whether you call it Web4 or just “verifiable digital identity”, it is reshaping access.
Can I scrape a VC-gated site without a credential?
Generally no. The credential is the authorisation. Scraping around it would be the equivalent of bypassing a paywall.
Do I need to deploy DIDs in my scraping infrastructure?
Not yet, for most operators. The technology is mature but most sources still use Web2 authentication. Plan for adoption rather than deploying ahead of need.
What is the relationship between Web3 and Web4?
Web3 focused on decentralized money via blockchain wallets. Web4 focuses on decentralized identity via DIDs and VCs. The technologies overlap (some DID methods use blockchain) but the use cases are distinct.
What is the EU Digital Identity Wallet?
A government-issued, privacy-preserving wallet that EU residents can use to present credentials (driver’s licence, professional qualifications, age) at compatible services. General availability across the EU mid-2026.
Extended decentralized identity analysis
The decentralized identity stack in 2026 consists of four standards. First, decentralized identifiers (DIDs) per W3C DID Core 1.0. Second, verifiable credentials (VCs) per W3C VC Data Model 2.0. Third, presentation exchange per DIF Presentation Exchange 2.0. Fourth, key binding and proof formats (LDP, JWT, SD-JWT, BBS+).
For scrapers DIDs and VCs reshape three things. First, identity-gated content moves from cookie-based session auth to credential-based access. Second, proof of personhood (PoP) credentials become a counter-bot signal. Third, content provenance shifts from platform-attested to creator-attested via signed credentials.
The 2024-2026 wave of EU eIDAS 2.0 and the European Digital Identity Wallet pushed DID and VC adoption from research to production. By 2026 several large platforms accept VC-based proof of age and proof of residence.
Implementation pattern: DID-aware fetcher
```python
# DID-aware fetcher sketch. did_resolver and vc_lib stand in for whatever
# resolver and VC libraries the operation uses; httpx supplies the HTTP client.
import httpx

from did_resolver import resolve_did
from vc_lib import verify_vc, present

async def fetch_with_vc(url, did, vc_token):
    # Resolve the DID Document first (confirms the DID is live; kept for key
    # checks and logging, not used further in this sketch).
    did_doc = await resolve_did(did)
    # Build a presentation bound to the target URL so it cannot be replayed elsewhere.
    presentation = present(vc_token, audience=url)
    headers = {
        "Authorization": f"VC {presentation}",
        "DID": did,
    }
    async with httpx.AsyncClient() as client:
        response = await client.get(url, headers=headers)
    return response

async def verify_inbound_vc(presentation, expected_audience):
    # Check the signature, then check the presentation was bound to us.
    result = verify_vc(presentation)
    if not result.valid:
        return False
    if result.audience != expected_audience:
        return False
    return True
```
SD-JWT pattern for selective disclosure
Selective Disclosure JWT (SD-JWT) lets a holder reveal only a subset of credential claims to a verifier. Scrapers acting as verifiers can request only the claims they need (for example country of residence) without seeing the full credential. This is privacy-preserving and reduces compliance burden.
```python
# Holder-side selective disclosure sketch. parse_sd_jwt() and
# reissue_with_disclosures() are placeholders for an SD-JWT library: the
# issuer-signed JWT is not altered, the second helper simply packages it with
# the chosen disclosures as a presentation.
def select_disclosures(sd_jwt, claims_to_reveal):
    payload = parse_sd_jwt(sd_jwt)   # disclosable claims carried alongside the signed JWT
    revealed = {k: v for k, v in payload.items() if k in claims_to_reveal}
    return reissue_with_disclosures(sd_jwt, revealed)
```
Comparison: identity models for scrapers
| Model | Privacy | Provenance | Replay protection | Scraper effort |
|---|---|---|---|---|
| Cookie session | Low | None | Per-session | Low |
| OAuth bearer | Low | Issuer-attested | Per-token | Moderate |
| API key | Low | Issuer-attested | Per-key | Low |
| DID plus VC | High (with SD-JWT) | Issuer-attested with crypto proof | Per-presentation | High |
| zk-credentials | Highest | Crypto-attested | Per-proof | Highest |
Web4 vocabulary for scrapers
Web4 is an evolving label that overlaps with decentralized identity, content authenticity (C2PA), and AI-agent-native protocols. For scrapers the practical Web4 surface includes four primitives.
- C2PA content credentials embedded in media files for provenance.
- did:web identifiers for site-level identity (a DID hosted at .well-known/did.json).
- AI agent identity DIDs distinguishing automated traffic from human traffic.
- Cross-platform trust frameworks built on top of DIF specifications.
Additional FAQ
Are DIDs replacing OAuth?
Not yet. OAuth remains dominant. DIDs are gaining ground for high-assurance use cases.
Do scrapers need their own DIDs?
For agentic browsers acting on behalf of a user, increasingly yes. The DID is how the scraper identifies itself to the target service.
What about C2PA?
C2PA content credentials are useful for scrapers that need to verify media provenance, particularly for AI training data curation.
How does this interact with bot detection?
A scraper presenting a verified personhood VC may be treated as human-equivalent. A scraper presenting an agent VC is identified as an agent and routed accordingly.
The W3C DID core specification in detail
The W3C DID Core 1.0 specification, recommended in July 2022, defines decentralized identifiers as a new type of identifier that is created and managed without reliance on a centralized registry. A DID resolves to a DID Document, which contains the verification methods, service endpoints, and other metadata associated with the identifier.
DIDs come in many methods. did:web is a method that uses a domain name as the basis. did:key uses a cryptographic key directly. did:ion uses the Sidetree protocol on Bitcoin. did:plc uses the Bluesky-developed Public Ledger of Credentials. Each method has different trade-offs in decentralization, performance, and cost.
For scrapers the most relevant methods are did:web (for site-level identity) and did:key (for ephemeral keys). did:web is essentially a DNS-based approach where a DID resolves via fetching the .well-known/did.json file at the domain. This is operationally simple and integrates with existing web infrastructure.
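A sketch of that resolution step for the simple domain-only case (path-based did:web identifiers and key parsing are left out):

```python
# Sketch of did:web resolution for the domain-only case:
# did:web:example.com resolves to https://example.com/.well-known/did.json.
# Path-based forms (did:web:example.com:dept:data) are omitted for brevity.
import httpx

def did_web_to_url(did: str) -> str:
    if not did.startswith("did:web:"):
        raise ValueError("not a did:web identifier")
    domain = did[len("did:web:"):].replace("%3A", ":")   # ports are percent-encoded
    return f"https://{domain}/.well-known/did.json"

def resolve_did_web(did: str) -> dict:
    response = httpx.get(did_web_to_url(did), timeout=10)
    response.raise_for_status()
    return response.json()   # the DID Document: verification methods, services, metadata
```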
The DID Document specifies one or more verification methods, each of which is a public key and an associated algorithm. Authentication, assertion, key agreement, and capability invocation are different relationships a verification method can have to the DID. A scraper signing a request uses an authentication-relationship key.
Verifiable credential lifecycle
A verifiable credential has three actors: the issuer, the holder, and the verifier. The issuer creates and signs the credential. The holder stores the credential and presents it to verifiers. The verifier checks the credential’s signature, status, and contents.
The lifecycle proceeds in five steps. First, the issuer issues a credential to the holder, typically via OID4VCI. Second, the holder stores the credential in a wallet. Third, a verifier requests a presentation, typically via OID4VP. Fourth, the holder constructs a presentation (which may include selective disclosure) and sends it to the verifier. Fifth, the verifier validates the presentation and acts on it.
For scrapers acting as verifiers the verification step is the operational concern. Verification involves checking the cryptographic signature against the issuer’s verification method, checking the issuer against a trust list, checking the credential’s expiration, and checking the credential’s revocation status.
The trust list is the most operationally complex piece. The verifier must decide which issuers it trusts. Some trust lists are centralised (a government list of accredited issuers). Others are federated (a mutual recognition agreement among issuers). The decision is policy-driven and context-dependent.
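The four checks as a sketch; `verify_signature` and `is_revoked` are placeholders for whatever VC library and status mechanism the operation uses, and `TRUSTED_ISSUERS` stands in for the policy-driven trust list.

```python
# Verification checklist sketch. verify_signature() and is_revoked() are
# placeholders for the operation's VC library and status mechanism;
# TRUSTED_ISSUERS stands in for a policy-driven trust list.
from datetime import datetime, timezone

TRUSTED_ISSUERS = {"did:web:accreditor.example"}   # illustrative

def accept_presentation(presentation: dict) -> bool:
    if presentation["issuer"] not in TRUSTED_ISSUERS:          # trust-list check
        return False
    if not verify_signature(presentation):                      # signature against issuer keys
        return False
    valid_until = datetime.fromisoformat(
        presentation["validUntil"].replace("Z", "+00:00"))      # expiration check
    if valid_until < datetime.now(timezone.utc):
        return False
    if is_revoked(presentation["id"]):                           # revocation status check
        return False
    return True
```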
Selective disclosure and zero-knowledge proofs
Selective disclosure is the ability to reveal a subset of credential claims without revealing the rest. SD-JWT is a 2024-stable format that achieves selective disclosure through hash-based blinding of individual claims.
Zero-knowledge proofs go further. A ZKP-based credential allows a holder to prove a statement about the credential (for example over 18) without revealing any specific claim. The holder constructs a proof that the verifier can check without seeing the underlying data.
For scrapers ZKP is operationally heavier but privacy-preserving. The 2026 pattern is to use SD-JWT for most cases (good privacy, modest compute) and ZKP for high-sensitivity cases (best privacy, higher compute).
The ZKP toolkit in 2026 includes AnonCreds (the Hyperledger flagship), BBS+ signatures (for ZK on standard VCs), and several research-grade systems. Production deployments are growing but remain a minority.
C2PA and content provenance
C2PA (Coalition for Content Provenance and Authenticity) is a parallel standard focused on media provenance rather than identity. A C2PA manifest, embedded in an image or video file, describes the file’s origin and edit history through signed assertions.
For scrapers harvesting media files the C2PA manifest is a useful provenance signal. A scraper feeding AI training data can use C2PA to filter out content that has been flagged by the creator as not for AI training. A scraper feeding a news aggregator can use C2PA to verify the file’s claimed source.
The C2PA ecosystem grew rapidly in 2024-2026. Major camera manufacturers ship C2PA-capable hardware. Major image editors embed C2PA manifests on save. Major social platforms display C2PA badges. The pattern is similar to TLS adoption in the early 2010s.
A 2026 best practice for scrapers is to read and preserve C2PA manifests at ingest. The manifest itself is small. Preservation enables downstream consumers to make their own provenance decisions.
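An ingest-time sketch of that practice, assuming only that the manifest’s JUMBF marker bytes are detectable in the file; this detects and preserves, it does not validate the manifest’s signatures.

```python
# Ingest-time C2PA handling: detect manifest marker bytes and preserve the
# original file. This is a presence heuristic only; it does not parse or
# cryptographically validate the manifest.
from pathlib import Path

C2PA_MARKERS = (b"c2pa", b"jumb")   # labels used by C2PA JUMBF boxes

def has_c2pa_manifest(path: Path) -> bool:
    data = path.read_bytes()
    return any(marker in data for marker in C2PA_MARKERS)

def ingest_media(path: Path, archive_dir: Path) -> dict:
    archived = archive_dir / path.name
    archived.write_bytes(path.read_bytes())   # keep original bytes, manifest included
    return {
        "source": str(path),
        "c2pa_present": has_c2pa_manifest(path),
        "archived": str(archived),
    }
```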
Common pitfalls when adopting DIDs in a scraping pipeline
Three failure modes consistently bite teams that introduce DID-based identity into existing scraping infrastructure.
The first pitfall is choosing the wrong DID method for the use case. Teams default to did:web because it is simple, but did:web inherits all the trust limitations of DNS and TLS, including registrar takeover and CA mis-issuance. For high-assurance use cases like agent identity that crosses regulatory boundaries, did:key or did:ion provide stronger guarantees at the cost of more complex resolution. Map the trust requirement to the method before writing code.
The second pitfall is not rotating verification method keys. A DID is durable, but the keys associated with it are not. Most DID methods support adding new verification methods and retiring old ones via DID Document updates. A scraping operation that uses the same key for years exposes itself to key compromise with no recovery path. Build key rotation into the operational runbook from day one.
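What rotation looks like at the DID Document level, sketched with illustrative values: add the new verification method, repoint the authentication relationship, and retire the old entry after an overlap window.

```python
# Sketch of verification-method rotation in a DID Document (structure per DID Core;
# identifiers and key values are illustrative).
def rotate_authentication_key(did_doc: dict, new_method: dict) -> dict:
    # 1. Add the new verification method alongside the existing ones.
    did_doc.setdefault("verificationMethod", []).append(new_method)
    # 2. Point the authentication relationship at the new method only.
    did_doc["authentication"] = [new_method["id"]]
    # 3. Keep the old method temporarily so past signatures still verify, then
    #    remove it from verificationMethod once the overlap window closes.
    return did_doc

new_method = {
    "id": "did:web:scraper.example#key-2",
    "type": "Ed25519VerificationKey2020",
    "controller": "did:web:scraper.example",
    "publicKeyMultibase": "z6Mk...",   # placeholder key value
}
```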
The third pitfall is conflating proof-of-personhood with proof-of-uniqueness. A personhood credential proves the holder is human; it does not prove the holder is unique to your platform. Sybil resistance requires additional signals like nullifier sets or federated uniqueness checks, which most off-the-shelf personhood credentials do not provide.
Next steps
The fastest first step is to audit your top sources for any wallet/credential signals in the authentication flow. If you find any, the time to plan your access path is now, before VC-gating becomes the default. For broader emerging-tech context, head to the DRT emerging-tech hub and pair this with the verifiable credentials guide.
This guide is informational, not engineering or legal advice.