Fair use and copyright for AI training data in 2026
Fair use for AI training data is the most contested question in the AI copyright debate, and 2026 is the year the doctrine finally has live precedent. From late 2023 through 2025, a series of cases (Authors Guild v OpenAI, NYT v OpenAI, Getty Images v Stability AI, Doe v GitHub, Universal v Anthropic) gave courts the chance to articulate how the four-factor fair use test applies to scraping copyrighted content for model training. The answers are nuanced. Some have changed the cost structure of building a frontier model. Some have changed what scrapers can defensibly collect. This guide walks through the doctrine, the cases, the practical implications, and a 2026 compliance posture for teams scraping for AI training.
The audience is the data engineer, ML practitioner, and product owner whose pipeline includes scraping copyrighted material as input to a model.
What the four-factor fair use test actually asks
US fair use is codified at 17 U.S.C. Section 107. The statute identifies four factors that courts weigh:
- The purpose and character of the use, including whether it is commercial or transformative.
- The nature of the copyrighted work.
- The amount and substantiality of the portion used in relation to the whole.
- The effect of the use upon the potential market for or value of the copyrighted work.
For AI training, the first and fourth factors do most of the work. Factor one asks whether the model’s use of the data is transformative (typically yes, because training learns statistical patterns rather than reproducing the work) and commercial (typically yes for commercial models). Factor four asks whether the trained model substitutes for or otherwise harms the market for the original work. This is the factor that has dominated the 2024-2025 case law.
For the broader compliance picture across regimes, see the GDPR compliance guide and the personal vs public data scraping framework.
The pre-AI fair use precedents that still matter
Three pre-AI cases anchor the doctrine. Authors Guild v Google (2015, 2nd Cir) held that Google’s mass scanning of in-copyright books to build a search index was fair use. The use was highly transformative (search snippets are not substitutes for the books), the amount was technically large but functionally limited (snippets were truncated), and the market effect was minimal or positive (snippets drove book discovery and sales).
Authors Guild v HathiTrust (2014, 2nd Cir) reached a similar result for academic library digitisation: transformative purpose, no market substitute.
Field v Google (2006, D Nev) held that Google’s caching of web pages for search was fair use, with explicit weight given to the fact that website owners can use robots.txt to opt out (the implied licence theory).
Each of these cases carries forward into the AI training debate but with critical differences: AI models are themselves potential market substitutes for the original content in ways that search snippets are not.
The 2024-2025 AI training case law
NYT v OpenAI (filed 2023, motion to dismiss largely denied 2025) is the most-watched case in the space. The Times alleged that OpenAI scraped its archive, that GPT models can be prompted to reproduce Times articles verbatim, and that ChatGPT as a product directly substitutes for Times search and reading. The 2025 ruling denied most of OpenAI’s motion to dismiss, so the core infringement claims, including the market-substitution theory behind factor four, go forward. The case continues into 2026, and discovery has produced significant disclosures about training data composition.
Authors Guild v OpenAI (consolidated 2023-2024) addressed the same scraping question for in-copyright books. The court has not yet issued a fair use ruling but allowed the case to proceed on factor four.
Getty Images v Stability AI (filed 2023, UK and US) addressed image training. Getty alleged that Stability AI scraped its image collection (Getty’s watermarks were visible in some generated outputs) and claimed copyright and trademark infringement. In the UK, Getty dropped its primary copyright claims before judgment, and the High Court’s 2025 ruling rejected the remaining secondary infringement claim while finding limited trademark infringement tied to watermarks in outputs. The UK court therefore never tested the fair use analogue (fair dealing), and the US action has not yet produced a fair use ruling.
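Doe v Github (filed 2022, ND Cal) addressed code scraping for Copilot training. By mid-2024 the court had dismissed most claims, including the DMCA claims over stripped attribution, but allowed the claims for breach of open-source licence terms to proceed. The plaintiffs did not plead direct copyright infringement, so the court has not ruled on fair use. The case still shows that a model reproducing code with attribution stripped invites litigation on grounds a fair use defence does not reach.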
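Universal v Anthropic (captioned Concord Music Group v Anthropic, filed 2023) addressed scraping of song lyrics. A 2025 stipulation resolved part of the dispute, with Anthropic agreeing to maintain guardrails against reproducing lyrics in outputs, while the underlying infringement claims continued. Agreeing to guardrails rather than litigating the point suggests the defendant did not regard the four-factor test as a clean win.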
The collective signal from these cases: fair use for AI training is not dead, but it is much narrower than the most aggressive 2022-2023 readings suggested. Factor four is doing the work, and where the model can substitute for the original work in the market, fair use is likely to lose.
EU TDM exceptions: a different framework
The EU has a different doctrine. The Directive on Copyright in the Digital Single Market (DSM Directive, 2019) created two text and data mining (TDM) exceptions:
Article 3 TDM: a mandatory exception for research organisations and cultural heritage institutions. Lawful access required. No opt-out by rightsholders.
Article 4 TDM: a broader exception for any TDM purpose, including commercial AI training. Lawful access required. Rightsholders can reserve their rights through machine-readable means (the “opt-out” mechanism, often via the TDM Reservation Protocol or robots.txt-style directives).
The Article 4 opt-out is what reshaped European AI training in 2024-2025. Major publishers began publishing TDM-Reservation headers en masse, signalling that their content was off-limits to commercial training. Scrapers operating in the EU now must check for the opt-out and respect it; ignoring it strips the Article 4 defence.
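The opt-out check described above can be sketched in a few lines. This is a minimal illustration, assuming the reservation is expressed as a `tdm-reservation` HTTP header or an HTML meta tag of the same name with the value `1`, which is my reading of the TDM Reservation Protocol draft; verify the field names against the current draft before relying on it. The function name `tdm_rights_reserved` is hypothetical.

```python
from html.parser import HTMLParser


class _TdmMetaParser(HTMLParser):
    """Collects the content of <meta name="tdm-reservation" ...> if present."""

    def __init__(self):
        super().__init__()
        self.reservation = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "tdm-reservation":
            self.reservation = attr.get("content", "").strip()


def tdm_rights_reserved(headers, html=""):
    """Return True if the response signals a TDM rights reservation.

    headers: mapping of HTTP response header names to values.
    html: optional page body, scanned for the meta-tag form.
    Assumption: the value "1" means rights are reserved.
    """
    normalised = {name.lower(): value.strip() for name, value in headers.items()}
    if normalised.get("tdm-reservation") == "1":
        return True
    parser = _TdmMetaParser()
    parser.feed(html)
    return parser.reservation == "1"
```

The sketch treats a reservation in either place as binding, which is the conservative choice when the header and the page disagree.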
The EU AI Act (2024) layers transparency obligations on top: providers of general-purpose AI models placed on the EU market must publish a “sufficiently detailed summary” of the content used for training. That forced disclosure changes the legal posture, because rightsholders can see which sources were used and check whether their reservations were honoured.
For the broader robots.txt and AI directive landscape, see robots.txt and modern scraping ethics.
Practical compliance posture for 2026 AI training
A scraper operating an AI training pipeline in 2026 should adopt seven practices:
Maintain a training data manifest. For every dataset, record the source URL set, the date of collection, the user agent used, the robots.txt state at collection time, the TDM-Reservation state at collection time, and any rights metadata.
Honour AI-specific user agent directives. If you scrape under an AI-bot identity (GPTBot equivalent), make that identity public and respect the directives addressed to it.
Honour TDM-Reservation signals. Implement a parser for both robots.txt-style and HTTP header TDM-R signals; skip content where rights are reserved.
Filter training corpora for opt-out reaffirmation. If a content owner publishes a TDM-Reservation after the initial scrape, re-honour it on the next training cycle.
Implement training-data deduplication and memorisation reduction. Models that memorise verbatim are factor-three losers (using the substantial whole) and factor-four losers (substituting for the original).
Build output filters for known copyrighted material. If your model can reproduce a chunk of a copyrighted work with high fidelity on prompt, the factor-four market-harm argument becomes plausible at output time even if it was a stretch at training time.
Maintain a publish-ready training data summary. The EU AI Act requires it; US litigation discovery effectively requires it.
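One way to picture the manifest and the opt-out re-check from the practices above is a small record type. This is a sketch under assumptions: the field names, `ManifestEntry`, `to_jsonl`, and `excluded_for_next_cycle` are all hypothetical, and a real manifest would likely carry more rights metadata.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ManifestEntry:
    dataset: str
    source_urls: list
    collected_on: str          # ISO date of collection
    user_agent: str            # identity the scraper presented
    robots_allowed: bool       # robots.txt state at collection time
    tdm_reserved: bool         # TDM-Reservation state at collection time
    rights_metadata: dict = field(default_factory=dict)


def to_jsonl(entries):
    """Serialise entries as one JSON object per line."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in entries)


def excluded_for_next_cycle(entries, newly_reserved_datasets):
    """Names of datasets to drop before the next training cycle.

    A dataset is dropped if it was reserved or disallowed at collection
    time, or if its owner has published a reservation since then.
    """
    reserved = set(newly_reserved_datasets)
    return [
        e.dataset
        for e in entries
        if e.tdm_reserved or not e.robots_allowed or e.dataset in reserved
    ]
```

Keeping the collection-time state in the record is what lets a later cycle tell "reserved when scraped" apart from "reserved afterwards".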
Decision tree: is this scrape defensible for AI training?
Q1: Is the source URL publicly accessible without bypassing controls?
├── No -> Skip; fair use does not rescue unauthorised access.
└── Yes -> Q2
Q2: Does the source publish AI-specific opt-out directives (robots.txt, TDM-R)?
├── Yes -> Honour the opt-out; skip.
└── No -> Q3
Q3: Is the content in-copyright (i.e., not public domain or open licence)?
├── No -> Proceed; verify licence terms.
└── Yes -> Q4
Q4: Is the model commercial?
├── Yes -> Factor 1 leans against; continue to Q5 and proceed only with strong factor 4 story.
└── No -> Factor 1 leans for; continue to Q5 with documentation.
Q5: Will the model plausibly substitute for the source content in the market?
├── Yes -> Factor 4 likely lost; reconsider inclusion or filter outputs.
└── No -> Strongest defensive posture.
Each branch produces a documented decision in the manifest. That manifest is what your discovery response and EU AI Act summary will cite.
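The tree can be encoded so that every decision lands in the manifest with its reasons. The sketch below assumes that Q4 always feeds into Q5, which is how I read the tree; `ScrapeCandidate` and `assess` are hypothetical names, and the outcome labels are illustrative rather than legal conclusions.

```python
from dataclasses import dataclass


@dataclass
class ScrapeCandidate:
    publicly_accessible: bool     # Q1
    opted_out: bool               # Q2
    in_copyright: bool            # Q3
    commercial_model: bool        # Q4
    substitutes_for_source: bool  # Q5


def assess(candidate):
    """Return (decision, notes) for one candidate source."""
    if not candidate.publicly_accessible:
        return "skip", ["Q1: access would bypass controls"]
    if candidate.opted_out:
        return "skip", ["Q2: opt-out directive present"]
    if not candidate.in_copyright:
        return "proceed", ["Q3: public domain or open licence; verify terms"]
    notes = []
    if candidate.commercial_model:
        notes.append("Q4: commercial; factor 1 leans against")
    else:
        notes.append("Q4: non-commercial; factor 1 leans for")
    if candidate.substitutes_for_source:
        notes.append("Q5: market substitution plausible; factor 4 likely lost")
        return "reconsider", notes
    notes.append("Q5: no plausible substitution")
    return "proceed", notes
```

Returning the notes alongside the decision means the manifest records why a source was included, not just that it was.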
A working training-data filter checklist
| Control | What it requires | Why it matters |
|---|---|---|
| robots.txt parser | RFC 9309 compliant | Factor 1 good faith |
| TDM-Reservation parser | HTTP and meta tag | EU Article 4 defence |
| AI user agent identity | Public, attributable | Allows site to set per-purpose rules |
| Training data manifest | Per-dataset record | Discovery and EU AI Act |
| Deduplication | Across training corpus | Reduce memorisation |
| Output similarity filter | At inference time | Factor 4 mitigation |
| Re-honour opt-outs | Pre-each training cycle | Ongoing good faith |
| Licence metadata capture | Where present | Public domain and open licence proof |
| Sensitive content filter | Personal data, medical, legal | GDPR and special categories |
| Provenance tracking | Source-to-token traceable | Audit response |
A team that ticks every row above can credibly defend a fair use posture in the US and a TDM exception posture in the EU. A team that ticks fewer than half is defending in court.
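For the robots.txt row, a minimal per-agent check can use Python’s standard-library parser. This is a sketch: `allowed_for_agent` and the `ExampleAIBot` identity are hypothetical, and whether `urllib.robotparser` matches RFC 9309 in every edge case (wildcards in particular) is something to verify before treating it as compliant.

```python
from urllib.robotparser import RobotFileParser


def allowed_for_agent(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url.

    robots_txt is the already-downloaded file body, so the check runs
    offline and can be replayed against the copy stored in the manifest.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example: a site that blocks one AI crawler but allows everyone else.
ROBOTS = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

print(allowed_for_agent(ROBOTS, "ExampleAIBot", "https://example.com/article"))
print(allowed_for_agent(ROBOTS, "OtherBot", "https://example.com/article"))
```

Parsing the stored file body rather than re-fetching it keeps the decision tied to the robots.txt state at collection time.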
What about derivative works and outputs?
Fair use applies to your training inputs. It does not necessarily apply to your model’s outputs. If your model produces output that is substantially similar to a training input, that output is itself potentially infringing, regardless of the input’s fair use status.
This is the practical lesson from Doe v Github and Getty v Stability AI: training-data fair use does not buy you output-side immunity. Build the output filter. Test for memorisation. Watermark or filter outputs that score above a similarity threshold against known copyrighted works.
The cost of an output filter is real (latency, false positives) but small compared to the litigation cost of a model that reproduces copyrighted material on demand.
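A similarity threshold of the kind described above can be approximated with word n-gram overlap. The sketch below is one simple option, not a production filter: `word_ngrams`, `overlap_score`, and `should_block`, the 8-word window, and the 0.5 threshold are all illustrative assumptions that would need tuning against real false-positive rates.

```python
def word_ngrams(text, n=8):
    """Set of lowercase word n-grams in text (empty if text is too short)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_score(output, protected, n=8):
    """Fraction of the output's n-grams that also appear in the protected work."""
    out_grams = word_ngrams(output, n)
    if not out_grams:
        return 0.0
    return len(out_grams & word_ngrams(protected, n)) / len(out_grams)


def should_block(output, protected_works, threshold=0.5, n=8):
    """True if the output overlaps any protected work at or above threshold."""
    return any(
        overlap_score(output, work, n) >= threshold for work in protected_works
    )
```

Measuring overlap relative to the output, rather than to the protected work, flags a short verbatim excerpt even when it is a small part of a long original.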
External references
The US Copyright Office published a multi-part report on AI and copyright in 2024-2025; the relevant volume is at copyright.gov/policy/artificial-intelligence. The EU DSM Directive (2019/790) is at eur-lex.europa.eu/eli/dir/2019/790/oj. The TDM Reservation Protocol draft is at w3c.github.io/tdmrep.
Comparison: fair use vs Article 4 TDM exception
| Dimension | US Fair Use | EU Article 4 TDM |
|---|---|---|
| Default posture | Defensive (factor analysis) | Permissive (subject to opt-out) |
| Lawful access required | Implicit | Explicit |
| Opt-out by rightsholder | None (informal robots.txt) | Available; must be machine-readable for online content |
| Commercial use | Allowed if factors balance | Allowed unless opted out |
| Transparency obligation | None statutory | EU AI Act mandates summary |
| Cure for opt-out after scraping | None (factor 4 may bite) | Re-honour on next cycle |
| Predictability for training operators | Low (case-by-case) | Higher (clearer rules) |
The EU framework is more predictable but more expensive (you must build the opt-out parser). The US framework is less predictable but offers more flexibility for early-stage research.
Special cases: code, images, and journalism
Code scraping (think GitHub) carries copyright but also broad open-source licensing. The challenge is that licences attach to specific files, and a training corpus aggregates millions. In Doe v Github the claims for breach of open-source licence terms survived dismissal, which suggests that GPL-style attribution requirements may survive aggregation: a model trained on GPL code can produce output that strips the attribution the licence requires. Build the filter.
Image scraping (think Getty) carries copyright plus database rights in the EU. Watermarks and metadata provenance matter. A model that reproduces a Getty watermark on output is in the worst possible factor-four position.
Journalism scraping (think NYT) carries copyright and, increasingly, press-publisher and news-bargaining statutes (Germany, Australia, Canada). The factor-four story is hardest here because models that summarise news directly substitute for the publisher.
For a forward-looking discussion of how RAG over scraped journalism corpora handles these risks, see RAG over scraped data.
FAQ
Is scraping for AI training fair use?
The 2026 answer is “sometimes.” The four-factor test applies, factor four is doing most of the work, and market substitution by the model is the question to focus on.
Can I rely on robots.txt compliance to support a fair use defence?
The Field v Google implied-licence theory still has weight, but only as one signal among many. Robots.txt compliance helps factor one (good faith) and may bear on factor four.
Does the EU framework apply if I am US-based?
If you place the model on the EU market, yes. The EU AI Act has explicit extraterritorial reach.
What about open-licence content like Creative Commons?
The licence governs. CC-BY content can be used with attribution. CC-NC content cannot be used commercially. CC0 content has no restrictions. Always read the specific licence.
What is the safest training data posture in 2026?
Combine permissive open data, licensed datasets where available, scraped data with explicit opt-out compliance, and an output filter for memorisation. Document everything in a training manifest.
Extended case law analysis 2024-2026
The fair use doctrine for AI training data sharpened considerably between 2024 and 2026. Three cases shape the current landscape.
The Authors Guild v OpenAI consolidated litigation (Southern District of New York) reached a summary judgment phase in late 2025 on the question of whether training a large language model on copyrighted books constitutes fair use. The court applied the Warhol Foundation v Goldsmith framework and focused on transformativeness in the first factor and market harm in the fourth. The training-as-fair-use defence narrowed where the model could output substantially similar text to the underlying work.
Thomson Reuters v Ross Intelligence (District of Delaware, February 2025) was the first published opinion directly rejecting a fair use defence for training a competing AI on the plaintiff’s headnotes. The court emphasised commercial use, low transformativeness, and direct market harm.
Andersen v Stability AI (Northern District of California, ongoing) is testing the same framework for diffusion-model image training. The artists’ core copyright claims largely survived the 2024 motion to dismiss, signalling that courts will not readily dispose of these cases at the pleading stage.
The pattern across these cases is that fair use for AI training is more vulnerable when the trained model can reproduce protected expression, when the training market overlaps the licensed market, and when the training is commercial.
Implementation patterns for fair-use-defensible scraping
A 2026 AI training data pipeline that wants the strongest fair use posture should implement seven controls.
- Maintain a documented purpose statement that emphasises transformative use and limits commercial application.
- Apply de-duplication at the document level to reduce verbatim memorisation.
- Apply a memorisation eval that probes the model for verbatim outputs of training data and removes high-risk samples.
- Honour AI-crawler robots.txt directives because publisher signals weigh in the fourth factor analysis.
- Avoid scraping commercial datasets for which a licence is available but has not been obtained.
- Check opt-out registries (the Spawning project, the IETF AI Preferences working group output) before ingesting a source.
- Document the chain of custody from source URL to embedding to model weight.
Code pattern: memorisation probe
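    # Assumes model.generate(prompt, max_tokens=...) returns a string and
    # rouge_l(candidate, reference) returns a similarity score in [0, 1].
    def probe_memorisation(model, training_samples, threshold=0.8):
        """Return the training samples whose continuation the model reproduces."""
        risky = []
        for sample in training_samples:
            prompt = sample[:128]        # first 128 characters as the prompt
            reference = sample[128:384]  # the true continuation
            if not reference:
                continue                 # sample too short to probe
            completion = model.generate(prompt, max_tokens=256)
            # High ROUGE-L overlap indicates near-verbatim recall.
            if rouge_l(completion, reference) > threshold:
                risky.append(sample)
        return risky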
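The probe above calls a `rouge_l` helper that the guide does not define. A minimal sketch of one common formulation follows: the F1 score over the longest common subsequence of whitespace tokens. This is an assumption about what the helper does; an established ROUGE library may tokenise and weight differently, so scores and the 0.8 threshold are not directly comparable across implementations.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 between two strings, using whitespace tokens."""
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(cand_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Because the completion can run up to 256 tokens while the reference slice is 256 characters, precision will often be low even for memorised text; truncating the completion to the reference length before scoring is one way to compensate.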
Comparison: fair use posture by training data category
| Category | Transformativeness | Market harm | Fair use posture |
|---|---|---|---|
| Public-domain text | High | Low | Strong |
| Open-licensed code | Variable | Variable | Mixed |
| News articles | Moderate | High | Weak |
| Books | Moderate | High | Weak |
| Social media public posts | Moderate | Low | Mixed |
| Commercial images | Low | High | Weak |
Additional FAQ
Is fair use the same in the EU?
No. The EU does not have a general fair use doctrine. The closest equivalents are the text-and-data-mining exceptions in Article 3 (research) and Article 4 (general, with opt-out) of the 2019 Copyright Directive.
Does the four-factor test apply to all media?
Yes. The 17 USC 107 framework applies regardless of medium, but the application differs.
How does opt-out interact with fair use?
Opt-out is not a fair use requirement under US law. It is required under Article 4 of the EU Copyright Directive. As a practical matter, honouring opt-out reduces the fourth factor harm and improves the defence.
Is non-commercial training automatically fair use?
No. Non-commercial weighs in the first factor but does not by itself decide the case.
The four factors applied to AI training
The fair use analysis under 17 USC 107 considers four factors. Factor one is the purpose and character of the use, including whether the use is of a commercial nature or is for non-profit educational purposes. Factor two is the nature of the copyrighted work. Factor three is the amount and substantiality of the portion used. Factor four is the effect of the use on the potential market for or value of the copyrighted work.
For AI training each factor presents distinct questions. Factor one turns on whether the training is transformative. The Supreme Court’s 2023 Warhol Foundation v Goldsmith decision narrowed the transformative use analysis, focusing on whether the secondary use has a purpose distinct from the original. Training a general-purpose language model is plausibly transformative. Training a model intended to compete in the same market as the source work is not.
Factor two distinguishes published from unpublished work and creative from factual work. Creative works lean against fair use, and unpublished works more so. Factual or functional work leans for fair use. AI training corpora typically contain both, and the factor cuts both ways depending on the sample.
Factor three considers the amount used. AI training typically copies the entire work into the training pipeline, although the trained model retains only statistical patterns. Courts have split on how to weigh this factor for training. Some treat the full-work ingestion as weighing against fair use. Others focus on what the model retains and find it weighs neutrally or for fair use.
Factor four is the most contested in AI training cases. Courts ask whether the trained model substitutes for the source work in the relevant market. If the model can output substantially similar content, the substitution effect is direct. If the output is qualitatively different, the substitution is indirect or absent.
Memorisation and the verbatim output problem
A central technical question in AI training fair use is memorisation. A trained model that emits verbatim copies of training data has, in effect, retained the training data in its weights. That retention undermines the transformative use argument under factor one and increases market harm under factor four.
The 2024 and 2025 research literature documented memorisation rates in large language models. Models trained on duplicated content memorise at higher rates. Models trained on rare content memorise at higher rates. Larger models memorise at higher rates. These findings are operational guidance for training pipelines that want to minimise memorisation.
Mitigation techniques include de-duplication at the document level, de-duplication at the chunk level, training data filtering for high-risk content, and post-training memorisation evaluation with output filtering. Each technique reduces but does not eliminate memorisation. A scraper feeding a training pipeline should apply at least the document-level de-duplication.
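Document-level and chunk-level de-duplication can both be sketched with hashed fingerprints. This is a minimal exact-match version under stated assumptions: `dedupe_documents` and `dedupe_chunks` are hypothetical names, normalisation is just lowercasing and whitespace collapsing, and the fixed 50-word chunk size is arbitrary. Near-duplicate detection (for example MinHash) would catch more but is out of scope here.

```python
import hashlib


def _fingerprint(text):
    """SHA-256 of the text after lowercasing and collapsing whitespace."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def dedupe_documents(documents):
    """Keep the first occurrence of each document, comparing fingerprints."""
    seen, unique = set(), []
    for doc in documents:
        fp = _fingerprint(doc)
        if fp not in seen:
            seen.add(fp)
            unique.append(doc)
    return unique


def dedupe_chunks(documents, chunk_words=50):
    """Drop fixed-size word chunks already seen anywhere in the corpus."""
    seen, result = set(), []
    for doc in documents:
        words = doc.split()
        kept = []
        for i in range(0, len(words), chunk_words):
            chunk = " ".join(words[i:i + chunk_words])
            fp = _fingerprint(chunk)
            if fp not in seen:
                seen.add(fp)
                kept.append(chunk)
        result.append(" ".join(kept))
    return result
```

Fixed chunk boundaries miss repeated passages that start at different offsets, so the chunk pass here is a lower bound on what a sliding-window approach would remove.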
The relationship between fair use and licensing
Fair use is a defence, not an entitlement. A scraper that has a licence does not need to argue fair use. A scraper without a licence relies on fair use only if the rights holder objects.
The 2024-2026 trend is toward more licensing. Major publishers and platforms struck deals with major AI labs (News Corp with OpenAI in May 2024, Reddit with Google in February 2024, Stack Overflow with OpenAI in May 2024). Those deals reduce reliance on fair use for the licensed corpora.
For scrapers the implication is that fair use is becoming the fallback, not the default. The strongest position is to license where possible and to fall back to fair use only for content where licensing is impractical.
Next steps
If your team trains models on scraped data, the highest-leverage improvement this quarter is to build the training data manifest. It costs little, it underpins both US discovery and EU AI Act response, and it makes every downstream compliance question easier. For broader policy, head to the DRT compliance hub and pair this with the robots.txt ethics guide.
This guide is informational, not legal advice.