Courts and Legislators Nudge for a New Data Market
Clean Data “Debate” in Generative AI
Judge Chhabria (Kadrey v. Meta): “Llama is not capable of generating enough text from the plaintiffs’ books to matter, and the plaintiffs are not entitled to the market for licensing their works as AI training data.” [1]
Judge Alsup (Bartz v. Anthropic): “A market could develop … Even so, such a market … is not one the Copyright Act entitles Authors to exploit.” [2]
EU Parliament study: Current EU law “leaves creators without any enforceable mechanism to authorise, deny, or license the use of their works for AI training under negotiated terms.” [3]
June 2025 Northern District of California Decisions
Bartz v. Anthropic (Alsup J., 23 June 2025)
Three authors alleged that Anthropic copied millions of books, some scanned from print and others taken from pirate sites, and used them to train Claude. Alsup found that the training copies were fair use because the model transforms whole books into “statistical abstractions” that never substitute for the originals.
On market harm (Factor 4) he accepted, for argument’s sake, that a licensing market could emerge, but held it is “not one the Copyright Act entitles Authors to exploit.” The only infringement he left standing is Anthropic’s internal “pirated library,” which will go to trial on damages. In short: training is safe (for now), while hoarding pirated PDFs is not. [4]
Kadrey v. Meta (Chhabria J., 25 June 2025)
Thirteen authors sued Meta for scraping “shadow libraries” to build Llama. The court said Llama’s weights are “highly transformative”: they store patterns, not expressive chunks. Meta’s expert ran “adversarial prompting” experiments and could not coax any Llama model to emit more than 50 consecutive tokens (roughly 50 words and punctuation marks) from any plaintiff’s book. The plaintiffs’ own expert agreed that Llama could not reproduce “any significant percentage” of the texts.
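To make the 50-token finding concrete, here is a minimal sketch of how a verbatim-overlap check could work. It is not the method Meta’s expert used, which the opinion summary above does not detail. The function names, the regex tokenizer (a crude stand-in for a real model tokenizer), and the way the threshold is applied are all assumptions for illustration.

```python
import re

# Assumption: the 50-token figure from the case, used here as a flagging threshold.
THRESHOLD = 50


def tokenize(text: str) -> list[str]:
    """Crude stand-in for a model tokenizer: lowercase words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def longest_shared_run(output: str, source: str) -> int:
    """Length, in tokens, of the longest contiguous run found in both texts."""
    out, src = tokenize(output), tokenize(source)
    best = 0
    prev = [0] * (len(src) + 1)
    for i in range(1, len(out) + 1):
        curr = [0] * (len(src) + 1)
        for j in range(1, len(src) + 1):
            if out[i - 1] == src[j - 1]:
                # Extend the run that ended at the previous token pair.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def exceeds_threshold(output: str, source: str, threshold: int = THRESHOLD) -> bool:
    """True if the model output copies more than `threshold` consecutive tokens."""
    return longest_shared_run(output, source) > threshold
```

A real audit would run many adversarial prompts per book and use the model’s own tokenizer; the quadratic comparison here is only practical for short passages.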
Meta also submitted testimony that none of the 13 authors has ever licensed, or even been asked to license, a book for AI-training purposes. The argument runs as follows: if no market exists, Meta’s unlicensed use cannot depress it, and there is nothing for the copyright holder to lose.
On Factor 4, Chhabria called the licensing-market theory a “clear loser,” writing that the authors “are not entitled to the market for licensing their works as AI training data.” He granted Meta summary judgment on training copies but flagged that better evidence of market dilution could swing future cases.
Both judges imposed an empirical burden on plaintiffs: show a functioning or nascent licensing market. Until such a market exists (with prices, contracts, and measurable revenue), AI developers enjoy a sizeable fair-use advantage.
EU’s Generative AI and Copyright: Training, Creation, Regulation

The study, prepared for the European Parliament’s Legal Affairs (JURI) Committee and published on 30 June 2025, suggested that authors whose works are used for AI training should receive some form of remuneration. However, the study, like the US courts, recognized that such economic rights cannot be effectively enforced today, because authors have no practical way to license their work for AI training.
In both the US and the EU, the issue comes down to whether a licensing market for the use of authors’ works as AI training data actually exists. The US courts are waiting for more evidence from future plaintiffs. In contrast, the EU study proposes proactive steps to establish such a market and build licensing channels at the legislative level.
If Europe succeeds in building a paid licensing channel, US plaintiffs could potentially point to that market to demonstrate cognizable harm, removing the current defense advantage enjoyed by Meta and Anthropic.
What Happens when the “Licensing Void” Fills
Models that can prove “clean data” will cost more
Adobe’s Firefly promotes itself as commercially safe because its first model was trained on licensed Adobe Stock images and public-domain content. On the supply side, Reuters reported that Photobucket discussed pricing ranging from 5 cents to $1 per photo, and more than $1 per video, to license parts of its archive for AI training.
Public-domain material or content under permissive open licenses will remain cheaper. But any option that comes with a clear paper trail will carry a premium because it protects both users and developers from potential lawsuits.
A two-tier data economy will settle in
Tier I will consist of premium, traceable datasets (newspapers, professional photo libraries, industry research, and specialized archives) licensed through collective deals or pay-per-asset APIs.
Tier II will consist of public-domain text, Wikipedia, government data, and other lower-cost sources that remain widely usable, especially in the U.S. under fair-use rules.
Reddit has already shown what that market can look like: its content deal with Google was widely reported at about $60 million per year.
Over time, prices in the premium lane will settle into reference rates (think “$x per photo” or “$y per song”) for high-value sectors such as images, music, and specialist text.
Data marketplaces will feel like app stores
Cloud vendors already host one-click catalogues of ready-to-license datasets. AWS Data Exchange now advertises more than 3,500 third-party datasets spanning everything from news feeds to scientific and medical data.
As those deals scale, AI model builders will increasingly bolt on leakage-testing dashboards and provenance logs to reassure customers, investors, and regulators.
That trend also fits with the EU AI Act’s transparency push, including the requirement for general-purpose AI providers to publish a sufficiently detailed summary of the content used for training.
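A provenance log of the kind mentioned above could be as simple as one record per training asset, plus a roll-up by license type. The sketch below is an assumption-laden illustration: the record fields, the function names, and the summary format are hypothetical and do not follow any official EU AI Act template.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """One training asset and where it came from (hypothetical schema)."""
    asset_id: str
    source: str        # e.g. the name of the licensor or archive
    license_type: str  # e.g. "licensed", "public-domain"
    sha256: str        # content fingerprint, so the exact asset can be re-identified


def make_record(asset_id: str, source: str, license_type: str,
                content: bytes) -> ProvenanceRecord:
    """Fingerprint the asset's bytes and attach its licensing metadata."""
    digest = hashlib.sha256(content).hexdigest()
    return ProvenanceRecord(asset_id, source, license_type, digest)


def summarize(records: list[ProvenanceRecord]) -> dict[str, int]:
    """Count assets per license type, as raw material for a public summary."""
    counts: dict[str, int] = {}
    for record in records:
        counts[record.license_type] = counts.get(record.license_type, 0) + 1
    return counts
```

Hashing the content rather than trusting a filename means a later audit can confirm that the asset trained on is the asset that was licensed.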
Practical take-aways for stakeholders
Data is becoming a tradeable commodity. Until recently, tech companies scraped whatever text or images they could find on the open web for free. Now they are being pushed to pay for clean, permission-based datasets.
The EU is signaling that silence will no longer equal consent. Successful AI products in the next decade will be those built on traceable, fairly acquired data streams.
The era of free-for-all scraping is giving way to a regulated data-supply chain, where proof of origin and proof of non-harm will decide who can train, and at what price.
Specialized data brokers are springing up. Think of them as “Spotify for training data”: platforms where companies can subscribe to text, image, audio, or medical datasets. That will make data easier to buy and give creators a clearer licensing channel, but it will also raise prices for premium-quality inputs.
One recent market estimate put the AI training dataset market at USD 2.82 billion in 2024, rising to USD 9.58 billion by 2029.
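For a sense of scale, the growth rate implied by those two figures can be worked out with the standard compound annual growth rate formula. This is a quick arithmetic sketch; the function name is illustrative, and the five-year span assumes the estimate runs from 2024 to 2029 inclusive of five growth periods.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures from the estimate above, in USD billions.
growth = implied_cagr(2.82, 9.58, 2029 - 2024)
print(f"Implied annual growth: {growth:.1%}")
```

That works out to roughly 28% a year, meaning the market would more than triple over the period.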
Everyday AI tools, such as search assistants, email copilots, and simple image generators, will likely remain low-cost or free. But niche models, especially those built for medicine, law, finance, or high-end creative work, will carry a higher price tag and will increasingly say why: built on licensed data, with compliance, provenance, and indemnification baked in.
The June 2025 U.S. decisions did not green-light endless scraping. They simply pointed out that authors could not yet show market harm because they could not yet show a real market. The EU study, by contrast, points toward building that market and putting a price on data.
Even if courts and legislators move slowly, the scale of the AI boom, and its dependency on high-quality inputs, creates strong incentives for private licensing platforms to emerge. If that happens, the best data may gradually move out of the open web and into pre-licensed, paywalled ecosystems.
