Robots.txt and modern scraping ethics in 2026

Robots.txt scraping ethics has become one of the most contested topics in 2026, because the file that started as a polite courtesy in 1994 is now treated as a quasi-contract in some jurisdictions and as marketing copy in others. The AI training surge of 2023 to 2025 forced site operators to add new directives and forced scrapers to take a position on whether those directives bind them. This guide walks through what robots.txt actually is, the new AI-specific directives that emerged in 2024 and 2025, how courts treated robots.txt across jurisdictions, and a defensible team policy you can adopt this quarter.

The audience is technical leads and product owners who need a clear position on how their scraping pipeline handles robots.txt, both for defensibility and for downstream customer expectations.

What robots.txt actually is and is not

Robots.txt is a plain-text file at the root of a domain that signals to automated agents which paths the site operator would prefer they not access. The Robots Exclusion Protocol was first proposed in 1994 by Martijn Koster and was formalised as RFC 9309 in 2022 by Google, the IETF, and a coalition of crawler operators. The RFC made several things explicit that had been folklore: the file is advisory, the syntax is well-defined, and compliance is a choice the crawler operator makes.

What robots.txt is: a published preference. A courtesy protocol. A widely-respected convention that lets site operators communicate scope to bots. It is also, in some courts and contracts, evidence of the site operator’s intent regarding access.

What robots.txt is not: a technical access control. A robots.txt directive cannot stop a non-compliant crawler. The file does not block traffic, does not authenticate users, does not change HTTP behaviour. A scraper that ignores robots.txt is doing something visible and verifiable, but not technically prevented.

The distinction matters because legal arguments about scraping increasingly turn on what the site operator did to communicate scope. Robots.txt is the cheapest, broadest, most-respected way to do that.

For the broader compliance picture, see the GDPR scraping compliance guide and the ethics-first scraping policy.

The 2024-2025 AI directive surge

Until 2023, robots.txt was overwhelmingly used to manage indexing crawlers (Googlebot, Bingbot) and a small number of well-known commercial scrapers. The AI training boom changed that. By mid-2024, a long list of new user-agents had to be considered, and site operators had to decide which to allow.

The major AI-specific user agents in 2026:

User agent | Operator | Purpose
--- | --- | ---
GPTBot | OpenAI | Training data collection
ChatGPT-User | OpenAI | User-initiated browsing in ChatGPT
Google-Extended | Google | Bard/Gemini training opt-out
ClaudeBot | Anthropic | Training data collection
anthropic-ai | Anthropic | Older identifier (legacy)
PerplexityBot | Perplexity AI | Search index for Perplexity
Perplexity-User | Perplexity AI | User-initiated browsing
CCBot | Common Crawl | Open archive used by many models
Bytespider | ByteDance | Used for ByteDance LLMs
FacebookBot | Meta | Llama training and indexing
Applebot-Extended | Apple | Apple Intelligence training opt-out
Diffbot | Diffbot | Knowledge graph extraction
Amazonbot | Amazon | Alexa and product crawling

By Q2 2025, a study of the top 10,000 web domains found that more than 35 percent had added at least one AI-specific Disallow directive, up from less than 5 percent in early 2023. The New York Times, Reuters, the BBC, Stack Overflow, Quora, and most large publishers explicitly disallow GPTBot, ClaudeBot, and PerplexityBot. The signal is unambiguous.

A scraper operating in 2026 that wants to argue good-faith respect for site operator preferences must do more than parse a single robots.txt for Googlebot. The file is a multi-agent instruction set, and ignoring AI-specific directives is increasingly seen as bad-faith conduct.
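To make the multi-agent point concrete, here is a minimal sketch that evaluates one robots.txt against several user-agent tokens. The robots.txt body and the example.com URL are invented for illustration, and the sketch uses the standard library parser only so that it runs without extra dependencies.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, written for this example only.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

def allowed_agents(robots_text, url, agents):
    """Return {agent: bool} showing whether each user-agent token may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}

result = allowed_agents(
    SAMPLE_ROBOTS,
    "https://example.com/articles/1",
    ["Googlebot", "GPTBot", "ClaudeBot"],
)
print(result)
```

The same URL is open to the indexing crawler and closed to both AI crawlers, which is why checking only the wildcard or Googlebot section is not enough.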

Court treatment of robots.txt in different jurisdictions

US courts have been consistent that robots.txt is not by itself a legal access control. The hiQ Labs v LinkedIn line of cases (covered separately in the HiQ Labs ruling explainer) confirmed that scraping public data does not automatically violate the CFAA, regardless of robots.txt. However, several lower courts in 2024 and 2025 treated explicit robots.txt directives as relevant evidence of the site operator’s intent in trespass-to-chattels and breach-of-contract claims.

EU courts have leaned more towards treating robots.txt as part of the implied contract of access, especially for AI training use cases. A 2025 Hamburg ruling held that scraping past an explicit AI-bot disallow was relevant in the legitimate interest balancing test under GDPR Article 6(1)(f), tilting the balance away from the scraper.

UK courts have largely followed the US line, emphasising public availability. Singapore courts have not yet ruled directly, but PDPC guidance in 2025 cited robots.txt compliance as evidence of fair processing under the Personal Data Protection Act.

The pattern is clear: robots.txt does not by itself create a legal duty in most jurisdictions, but ignoring it weakens almost every legal defence you might rely on later. Compliance is cheap. Non-compliance is expensive when something goes wrong.

A scraper-side compliance checklist

Control | What it requires | Why it matters
--- | --- | ---
Fetch and parse robots.txt before each domain | RFC 9309 compliant parser | Legal evidence of good faith
Honour your declared user agent | Identify accurately | Trust signal for site operators
Respect Disallow paths | Skip disallowed URLs | Ethical baseline
Honour Crawl-delay | Throttle per directive | Reduces server load
Cache robots.txt for 24 hours max | Re-fetch frequently | Compliance with site changes
Differentiate by purpose | Use different UA for indexing vs training | Allows site to set per-purpose rules
Log compliance decisions | Per-URL allowed/denied audit trail | Defensible posture
Honour Sitemap directives positively | Use sitemap as canonical scope | Reduces wasted requests
Skip pages marked noindex | Combine robots.txt with HTML-level meta | Full coverage
Provide opt-out contact | Public email or web form | Site operators can reach you

The first six rows are the minimum. The last four are the difference between “we comply” and “we are a model citizen.”

Decision tree: should I scrape this URL?

Q1: Does the domain publish robots.txt?
    ├── No  -> Scrape conservatively; default to crawl-delay 5s; go to Q5.
    └── Yes -> Q2
Q2: Does robots.txt list your user agent?
    ├── Yes -> Apply the directives for your UA; go to Q4.
    └── No  -> Q3
Q3: Does robots.txt have a wildcard (*) section?
    ├── Yes -> Apply the wildcard directives; go to Q4.
    └── No  -> Default to allow with conservative crawl-delay; go to Q5.
Q4: Is the URL within a Disallow path of the applicable section?
    ├── Yes -> Skip; log as denied; do not retry.
    └── No  -> Q5
Q5: Is the page tagged with noindex/nofollow at HTML level?
    ├── Yes -> Defer to HTML directive.
    └── No  -> Proceed with respect to crawl-delay.

Each decision is logged. The audit trail is what gives you a defensible posture if a site operator complains.
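The tree can be collapsed into a small pure function that returns both the outcome and a reason string suitable for the audit trail. This is a sketch under assumptions: the Decision type, the reason labels, and the choice to treat an HTML-level noindex as "do not retain" are illustrative, and the inputs are assumed to come from whatever robots.txt parser and HTML check the pipeline already runs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Decision:
    url: str
    allowed: bool
    reason: str  # hypothetical labels, intended for the audit log

def decide(url, has_robots, path_disallowed, html_noindex=False):
    """Apply the decision tree to facts gathered by the caller.

    has_robots:      the domain served a robots.txt
    path_disallowed: the applicable section (own UA, else wildcard) disallows url
    html_noindex:    the page carries a noindex/nofollow directive
    """
    if has_robots and path_disallowed:
        return Decision(url, False, "robots-disallow")
    if html_noindex:
        return Decision(url, False, "html-noindex")
    if not has_robots:
        return Decision(url, True, "no-robots-conservative-delay")
    return Decision(url, True, "allowed")

print(decide("https://example.com/private/a", True, True))
print(decide("https://example.com/blog/a", True, False))
```

Keeping the function free of I/O makes each branch easy to unit test, and the reason string gives the log a stable vocabulary.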

Practical Python implementation

A minimal RFC 9309 compliant fetcher in Python looks like this. The standard library urllib.robotparser has gaps; protego from Scrapy is more compliant.

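import time
from urllib.parse import urlparse

import requests
from protego import Protego

CACHE_TTL_SECONDS = 24 * 60 * 60  # re-fetch robots.txt at least once a day
DEFAULT_DELAY = 1.0               # seconds, used when no Crawl-delay is set
DENY_ALL = "User-agent: *\nDisallow: /"


class RobotsCache:
    def __init__(self, user_agent="DRTScraper/1.0"):
        self.user_agent = user_agent
        self.cache = {}  # origin -> (fetched_at, parser or None)

    def _parser_for(self, url: str):
        parsed = urlparse(url)
        origin = f"{parsed.scheme}://{parsed.netloc}"
        entry = self.cache.get(origin)
        # Load on first use and again once the cached copy is older than the TTL.
        if entry is None or time.time() - entry[0] > CACHE_TTL_SECONDS:
            entry = (time.time(), self._load(origin))
            self.cache[origin] = entry
        return entry[1]

    def can_fetch(self, url: str) -> bool:
        rp = self._parser_for(url)
        # None means robots.txt does not exist (4xx): nothing is disallowed.
        return True if rp is None else rp.can_fetch(url, self.user_agent)

    def crawl_delay(self, url: str) -> float:
        rp = self._parser_for(url)
        delay = rp.crawl_delay(self.user_agent) if rp is not None else None
        return float(delay) if delay else DEFAULT_DELAY

    def _load(self, origin: str):
        try:
            resp = requests.get(
                f"{origin}/robots.txt",
                headers={"User-Agent": self.user_agent},
                timeout=10,
            )
        except requests.RequestException:
            # Unreachable: RFC 9309 says to assume complete disallow.
            return Protego.parse(DENY_ALL)
        if resp.status_code == 200:
            return Protego.parse(resp.text)
        if 400 <= resp.status_code < 500:
            return None  # unavailable: crawling is allowed
        # 5xx and anything unexpected: treat as complete disallow.
        return Protego.parse(DENY_ALL)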

Wire this in front of every request. Log every denial. The cost is one HTTP fetch per domain per session. The benefit is a complete audit trail.

What about Crawl-delay, Request-rate, and Visit-time?

Crawl-delay is supported by most major crawlers but is not part of RFC 9309. It is a de facto standard. Treat it as binding because most site operators expect compliance.

Request-rate and Visit-time are older directives that never reached wide adoption. You can ignore them in 2026 with little risk, but if they are present, the conservative move is to honour them. They cost nothing.
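Honouring both throttling directives can be as simple as taking the slowest pace any of them implies. The sketch below leans on the standard library parser, which exposes crawl_delay() and request_rate(); the 1.0 second fallback and the sample robots.txt are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY = 1.0  # assumed fallback in seconds when the site states nothing

def effective_delay(robots_text, user_agent, default=DEFAULT_DELAY):
    """Return the most conservative per-request delay the file implies."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    candidates = [default]
    crawl_delay = parser.crawl_delay(user_agent)
    if crawl_delay is not None:
        candidates.append(float(crawl_delay))
    rate = parser.request_rate(user_agent)  # named tuple (requests, seconds)
    if rate is not None and rate.requests > 0:
        candidates.append(rate.seconds / rate.requests)
    return max(candidates)

# Hypothetical robots.txt: 2 second crawl delay, at most 1 request per 5 seconds.
SAMPLE = "User-agent: *\nCrawl-delay: 2\nRequest-rate: 1/5\nDisallow: /tmp/\n"
print(effective_delay(SAMPLE, "ExampleBot"))
```

Note that the standard library parser only accepts integer Crawl-delay values, so a fractional delay would be ignored by this sketch and the default would apply.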

The Sitemap directive is positive: it tells you where the site operator wants you to start. Use it. A scraper that follows the sitemap is far less likely to hit edge-case URLs that the site operator did not anticipate exposing.
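One way to use the sitemap as the crawl scope is to read the Sitemap lines out of robots.txt and then list the page URLs each sitemap declares. This is a minimal sketch: it assumes Python 3.8 or later for RobotFileParser.site_maps(), handles only a flat urlset document rather than sitemap index files, and uses invented inputs instead of fetching anything.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(robots_text):
    """Return the Sitemap URLs declared in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.site_maps() or []  # site_maps() returns None when absent

def page_urls(sitemap_xml):
    """Return the <loc> entries of a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]

# Hypothetical inputs for illustration.
ROBOTS = "User-agent: *\nDisallow: /tmp/\nSitemap: https://example.com/sitemap.xml\n"
SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/about</loc></url>"
    "</urlset>"
)
print(sitemap_urls(ROBOTS))
print(page_urls(SITEMAP))
```

URLs taken from a sitemap should still pass the Disallow check before they are fetched.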

The AI training opt-out as a separate signal

Beyond robots.txt, several site operators in 2025 began publishing dedicated AI training opt-out signals. The two main mechanisms in 2026:

  1. The TDM Reservation Protocol, an emerging W3C draft that uses HTTP headers and <meta> tags to signal text and data mining opt-out separately from crawler directives.
  2. The C2PA content credentials with embedded usage policies, which carry rights metadata for both human and machine consumers.

Both are still maturing. A scraper that wants to take the most defensible 2026 posture honours both signals in addition to robots.txt. It is more work but it places you at the front of the compliance curve.
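As a rough illustration of the first mechanism, the sketch below checks a response for a TDM reservation. It assumes the signal names used in the TDMRep draft (a tdm-reservation HTTP header or meta tag with the value 1) and ignores the draft's other channels, so treat it as a starting point and confirm the details against the specification.

```python
from html.parser import HTMLParser

class _MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs from an HTML document."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name:
                self.meta[name] = attr.get("content") or ""

def tdm_reserved(headers, html=""):
    """Return True if the response reserves text-and-data-mining rights."""
    lowered = {key.lower(): value.strip() for key, value in headers.items()}
    if lowered.get("tdm-reservation") == "1":
        return True
    collector = _MetaCollector()
    collector.feed(html)
    return collector.meta.get("tdm-reservation", "").strip() == "1"

# Hypothetical page used only for this example.
PAGE = '<html><head><meta name="tdm-reservation" content="1"></head></html>'
print(tdm_reserved({}, PAGE))
```

A pipeline that ingests data for training would typically run this check after the fetch and drop or quarantine reserved content before it reaches storage.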

A defensible team policy

A working policy has six parts: stated principles, technical implementation, audit logging, vendor management, opt-out handling, and review cadence. The shape of each part:

Stated principles: a one-page document, signed by the engineering lead and product lead, declaring that the team respects robots.txt by default, honours AI-specific directives, and treats compliance as a non-negotiable.

Technical implementation: the protego-based fetcher above, deployed in the request middleware of every scraping pipeline. No exceptions.

Audit logging: every denied URL is logged with timestamp, user agent, and the relevant directive. Logs retained for 12 months minimum.

Vendor management: proxy providers, scraping APIs, and data resellers contractually attest to robots.txt compliance.

Opt-out handling: a public contact email (privacy@yourcompany.com) for site operators to request removal, escalation, or clarification.

Review cadence: quarterly review of the principles, the AI user-agent list, and the audit trail.
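For the audit logging part, one workable shape is a JSON line per decision carrying the fields named above: timestamp, user agent, and the relevant directive. The field names and the helper below are illustrative assumptions, not a required schema.

```python
import json
from datetime import datetime, timezone

def audit_record(url, user_agent, directive, allowed, now=None):
    """Serialise one robots decision as a JSON line for the audit log."""
    now = now or datetime.now(timezone.utc)
    return json.dumps(
        {
            "timestamp": now.isoformat(),
            "url": url,
            "user_agent": user_agent,
            "directive": directive,  # the rule that decided the outcome
            "allowed": allowed,
        },
        sort_keys=True,
    )

line = audit_record(
    "https://example.com/private/report",
    "DRTScraper/1.0",
    "Disallow: /private/",
    False,
)
print(line)
```

Append-only JSON lines are easy to retain for the 12-month window and easy to filter when a site operator asks what was fetched.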

For a longer treatment of how to write the principles document and operationalise the audit, see the ethics-first scraping policy guide.

External references

The RFC 9309 specification is at datatracker.ietf.org/doc/rfc9309. Google’s robots.txt parser (open source) is at github.com/google/robotstxt. The TDM Reservation Protocol draft is at w3c.github.io/tdmrep. The C2PA content credentials specification is at c2pa.org.

Comparison: respecting robots.txt vs ignoring it

Dimension | Respect | Ignore
--- | --- | ---
Legal exposure (US) | Low | Moderate (evidence in trespass claims)
Legal exposure (EU) | Low | High (impacts GDPR balancing)
Customer trust | High | Low (especially enterprise B2B)
Site operator goodwill | High | Negative
Server load impact | Lower | Higher
Block rate from target | Low | High over time
Cost to implement | Negligible | Negligible
Long-term sustainability | High | Low

The asymmetry is striking. Compliance costs almost nothing. Non-compliance costs a lot when it costs anything.

FAQ

Is robots.txt legally binding?
Not directly in most jurisdictions. It is a published preference. But ignoring it is increasingly treated as evidence of bad faith in court and in regulator investigations.

Should I honour Crawl-delay even if it slows my pipeline?
Yes. The cost is negligible compared to the legal and goodwill risk of ignoring it.

Can I scrape if the site has no robots.txt?
Yes, but default to a conservative crawl-delay (5 seconds) and respect HTML-level noindex/nofollow tags.

What about pages behind login?
Robots.txt only governs publicly reachable URLs. Authenticated pages are governed by the terms of service of the platform.

Does GPTBot Disallow apply to me if I am not OpenAI?
The directive is explicitly addressed to GPTBot. It does not apply to your user agent. But the spirit of the directive is anti-AI-training, and a scraper that ingests data for AI training should honour the intent.

Extended legal and operational analysis

The robots exclusion protocol became RFC 9309 in 2022, formally codifying behaviour that had been industry custom since 1994. RFC 9309 does not by itself create a legal obligation. It documents how compliant crawlers behave. The legal force of robots.txt comes from adjacent doctrines, namely contract (terms of service that incorporate robots.txt by reference), trespass to chattels in some United States jurisdictions, and the Computer Fraud and Abuse Act when access is unauthorised.

The 2024-2026 period saw three shifts. First, AI-specific user agents proliferated, including GPTBot, ClaudeBot, PerplexityBot, Google-Extended, CCBot, and anthropic-ai. Second, publishers began stating a site policy on AI training distinct from search indexing, often by adding AI-specific Disallow rules. Third, courts began treating robots.txt compliance as evidence of good faith even where it was not strictly required.

The hiQ v LinkedIn line of cases established that scraping public data does not by itself violate the CFAA, but did not absolve scrapers of contract or tort exposure. Subsequent cases (Meta v Bright Data 2024, X Corp v Bright Data 2024) reinforced the contract pathway. Both ended in the scraper’s favour, but only after years of litigation expense. Robots.txt compliance was cited in both as one factor courts weighed.

Implementation patterns for 2026 robots compliance

A robust scraper in 2026 should implement six behaviours.

  1. Fetch robots.txt before the first request and cache for at most twenty-four hours.
  2. Honour the most-specific User-agent block, falling back to the wildcard.
  3. Respect Crawl-delay where supported, with a minimum default of one second per request when not specified.
  4. Honour Disallow paths exactly, including trailing slash semantics.
  5. Read site-wide AI policy headers including the X-Robots-Tag and any noai or noindex directives.
  6. Log every robots decision per request so audits can prove the behaviour.
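Trailing slash semantics (item 4) are easy to get wrong, so a small check helps. Rules are prefix matches, which means Disallow: /private/ and Disallow: /private cover different sets of URLs. The sketch uses the standard library parser with invented rules and a hypothetical ExampleBot user agent.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(rules, path, user_agent="ExampleBot"):
    """Check one path against a robots.txt body using prefix matching."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, f"https://example.com{path}")

WITH_SLASH = "User-agent: *\nDisallow: /private/\n"
WITHOUT_SLASH = "User-agent: *\nDisallow: /private\n"

for path in ("/private", "/private/report", "/private-notes"):
    print(path, is_allowed(WITH_SLASH, path), is_allowed(WITHOUT_SLASH, path))
```

With the trailing slash, only URLs inside the directory are blocked; without it, every path that merely starts with /private is blocked, including /private-notes.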

Code pattern for a compliant fetcher

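import urllib.robotparser
from urllib.parse import urlparse


class CompliantFetcher:
    def __init__(self, user_agent):
        self.ua = user_agent
        self.parsers = {}  # origin -> RobotFileParser, or None if the fetch failed

    def _parser_for(self, url):
        # Key on scheme and host so http and https origins are kept apart.
        parts = urlparse(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        if origin not in self.parsers:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"{origin}/robots.txt")
            try:
                rp.read()
            except Exception:
                rp = None  # unreachable: remember the failure and deny
            self.parsers[origin] = rp
        return self.parsers[origin]

    def can_fetch(self, url):
        rp = self._parser_for(url)
        return rp is not None and rp.can_fetch(self.ua, url)

    def crawl_delay(self, url):
        # Loads robots.txt if needed, so this works before can_fetch is called.
        rp = self._parser_for(url)
        delay = rp.crawl_delay(self.ua) if rp is not None else None
        return float(delay) if delay else 1.0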
Comparison: AI crawler policies on top sites in 2026

Site | GPTBot | ClaudeBot | Google-Extended | CCBot
--- | --- | --- | --- | ---
nytimes.com | Disallow | Disallow | Disallow | Disallow
reddit.com | Disallow | Disallow | Allow (paid) | Disallow
stackoverflow.com | Allow | Allow | Allow | Allow
github.com | Allow | Allow | Allow | Allow
medium.com | Disallow | Disallow | Allow | Disallow
wikipedia.org | Allow | Allow | Allow | Allow

The pattern is that publishers with content-licensing revenue tend to disallow AI crawlers, while platforms with developer or community content tend to allow them.

Additional FAQ

Is ignoring robots.txt illegal?
Not by itself in most jurisdictions, but it weakens defences in contract, tort, and statutory disputes. It is also evidence of bad faith in regulator inquiries.

What if there is no robots.txt?
Treat absence as no specific policy. Apply default ethical behaviour including conservative rate limits and identification of the user agent.

Should AI training crawlers honour robots.txt differently from search crawlers?
Yes. The AI-specific user agents exist precisely so publishers can express different policies. A compliant AI crawler reads the AI-specific block first, then the wildcard, then defaults.

Does honouring robots.txt remove all legal risk?
No. Honouring robots.txt is one factor. Terms of service, copyright, privacy law, and trade secret doctrine still apply.

Real cases where robots.txt mattered in court

Two recent decisions illustrate how courts treat robots.txt in 2024-2026.

In Thomson Reuters v. Ross Intelligence (D. Del., February 2025 summary judgment), the court found that Ross’s training of a competing legal research AI on Westlaw headnotes was not protected fair use. While the case turned primarily on copyright and the commercial-substitution analysis, the trial record included extensive evidence about how Ross obtained the headnotes through a third-party intermediary that ignored Westlaw’s terms and crawl restrictions. Judge Bibas referenced the access pattern in the bad-faith analysis. The decision is now the most-cited US precedent for the proposition that disregarding access controls weakens an AI training defence.

In The New York Times v. Microsoft and OpenAI (S.D.N.Y., 2024 ongoing), the Times’ complaint specifically pleads that OpenAI’s GPTBot ignored or post-dated the Times’ robots.txt Disallow directive for the AI-specific user agent. The pleading frames robots.txt compliance as a baseline good-faith expectation in the publishing industry. While the case has not yet reached merits judgment, the pleading strategy reflects how plaintiffs now use robots.txt non-compliance as a narrative anchor for bad-faith allegations.

Both cases reinforce the operational lesson: robots.txt is not legally binding on its own, but ignoring it is now treated as a meaningful evidentiary fact in almost every commercial scraping dispute. The cost of compliance is trivial; the cost of non-compliance compounds across litigation, regulator inquiries, and platform agreements. A scraper that honours robots.txt by default and logs every decision has a defence narrative ready before any dispute arises.

The history and standardisation of robots.txt

Robots.txt was proposed by Martijn Koster in 1994 as a voluntary protocol for crawlers to declare and discover crawl preferences. It remained an informal de-facto standard for nearly three decades. RFC 9309, published in September 2022, formally specified the protocol after Google led a working group to align implementations.

RFC 9309 nailed down several previously ambiguous behaviours. The matching rules for User-agent strings, the handling of multiple matching groups, the precedence of Allow and Disallow rules, the canonicalisation of paths, and the parsing limit (a crawler must be able to process at least the first 500 KiB of the file) are now specified. The RFC does not specify rate limiting, the meaning of Crawl-delay, or AI-specific user agents. Those remain extensions on top of the base protocol.
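The Allow and Disallow precedence rule is the one most often implemented differently: RFC 9309 says the longest matching pattern wins, and on a tie Allow should be preferred. A minimal sketch of that rule follows, assuming the rules for one user-agent group have already been extracted into (kind, pattern) pairs; percent-encoding normalisation is left out for brevity.

```python
import re

def _to_regex(pattern):
    """Translate robots.txt wildcards: '*' matches any run, a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))

def is_allowed(rules, path):
    """rules: list of ("allow" | "disallow", pattern) for one user-agent group."""
    best_len, best_allow = -1, True  # no matching rule means allowed
    for kind, pattern in rules:
        if not pattern:  # an empty pattern matches nothing
            continue
        if _to_regex(pattern).match(path):
            allow = kind == "allow"
            # Longest pattern wins; on equal length, Allow wins.
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, best_allow = len(pattern), allow
    return best_allow

# Hypothetical rule group for illustration.
RULES = [("disallow", "/shop/"), ("allow", "/shop/public/"), ("disallow", "/*.pdf$")]
print(is_allowed(RULES, "/shop/public/item"))
print(is_allowed(RULES, "/shop/cart"))
```

Python's standard library parser applies rules in file order rather than by longest match, which is one of the gaps that makes a dedicated RFC 9309 parser worth using.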

The standardisation matters for scrapers because compliant behaviour is now testable. A scraper can be checked against RFC 9309 test vectors, and gaps can be identified and fixed. Pre-RFC implementations often differed in edge cases. Post-RFC the expectation is that compliant crawlers behave identically.

Beyond robots.txt: meta robots, x-robots-tag, and llms.txt

Robots.txt is the front door but not the only signal. Meta robots tags in HTML, the X-Robots-Tag HTTP response header, and the proposed llms.txt convention all carry crawler instructions.

Meta robots tags appear in HTML head and apply per-page. They support directives including index, noindex, follow, nofollow, noarchive, nosnippet, and AI-specific directives like noai and noimageai (proposed 2024). A scraper should parse these per page.

X-Robots-Tag is the response header equivalent, useful for non-HTML resources (PDFs, images, JSON APIs). The directive vocabulary mirrors meta robots. Scrapers fetching non-HTML content should check the header.
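A header check for non-HTML resources can be sketched in a few lines. The set of directives treated as blocking is a policy assumption for this example, and the parser handles only the plain comma-separated form; values scoped to a named crawler or carrying dates are not covered.

```python
# Directives this sketch treats as "do not keep this resource" (an assumed policy).
BLOCKING = {"noindex", "none", "noai", "noimageai"}

def parse_x_robots_tag(value):
    """Split an X-Robots-Tag value into a set of lower-cased directives."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}

def should_skip(headers):
    """Return True if the response headers carry a blocking directive."""
    value = next(
        (v for k, v in headers.items() if k.lower() == "x-robots-tag"), ""
    )
    return bool(parse_x_robots_tag(value) & BLOCKING)

# Hypothetical response headers for a PDF.
print(should_skip({"Content-Type": "application/pdf", "X-Robots-Tag": "noindex, nofollow"}))
```

Because the header arrives with the response, this check runs after the fetch and governs whether the content is retained, not whether the request is made.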

The llms.txt convention proposed in 2024 by Jeremy Howard provides a structured site map specifically for LLM consumers. It complements rather than replaces robots.txt. Some publishers ship both, with robots.txt declaring access policy and llms.txt declaring content structure for AI clients.

The ethical dimension beyond compliance

Compliance with robots.txt is the floor, not the ceiling. Ethical scraping in 2026 considers four additional factors that robots.txt does not capture.

First, server load. A scraper that respects robots.txt but hammers the server with concurrent requests still imposes externalities. Conservative concurrency and adaptive backoff are part of ethical operation.

Second, content type. Some content (personal social media posts, sensitive forum threads) deserves additional restraint regardless of what robots.txt says. The scraper should apply context-sensitive judgement.

Third, downstream use. A scrape that respects robots.txt but feeds the data into a system that the publisher would object to (for example training a competing AI on a paywalled publisher’s free pages) is technically compliant but ethically thin.

Fourth, transparency. A scraper identified by a unique User-Agent string, with operator contact information in the User-Agent or in a public crawler page, makes itself accountable. Anonymous crawlers are correlated with abuse and are increasingly blocked at the platform level.
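The server-load factor is the one that translates most directly into code. A minimal adaptive backoff sketch follows; the floor, ceiling, and doubling and halving factors are assumed values, and Retry-After is assumed to arrive as a number of seconds rather than an HTTP date.

```python
def next_delay(current, status_code, retry_after=None, floor=1.0, ceiling=300.0):
    """Return the delay in seconds to wait before the next request to a host."""
    overloaded = status_code == 429 or status_code >= 500
    if overloaded:
        if retry_after is not None:
            # The server named a wait time: use it, within our own bounds.
            return min(max(float(retry_after), floor), ceiling)
        # No hint from the server: double the delay, up to the ceiling.
        return min(max(current, floor) * 2, ceiling)
    # Healthy response: ease back toward the floor.
    return max(current * 0.5, floor)

delay = 1.0
for status in (200, 429, 429, 200):
    delay = next_delay(delay, status)
    print(status, delay)
```

Tracking one delay per host keeps a struggling site from being penalised by, or penalising, the rest of the crawl.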

Next steps

The fastest improvement is to drop a Protego-based middleware into your scraper this week, log every denial for 30 days, and review the log for surprises. If you find your scraper has been hitting Disallow paths, fix it before a site operator notices. For the broader policy and team rollout, head to the DRT compliance hub and start with the ethics-first policy guide.

This guide is informational, not legal advice.
