TL;DR

Thorsten Meyer AI’s latest Control Series installment frames data as the AI industry’s hardening chokepoint, citing public text exhaustion projections, Anthropic’s $1.5 billion author settlement and rising demand for expert and sovereign data. Confirmed developments show public web scraping giving way to paid licenses, court fights and tighter control over proprietary corpora; the long-term legal and technical rules are still unsettled.

AI’s largest labs are moving into a more restricted data market as public web text nears its projected training limit, according to Thorsten Meyer AI’s June 2026 Control Series report. That shift makes proprietary corpora, licensed archives and sovereign datasets a central source of power in the next phase of model competition.

The report cites Epoch AI’s estimate that the public internet contains roughly 300 trillion tokens of high-quality text, with frontier training datasets already approaching that stock. Epoch AI projects the public supply could be fully used between 2026 and 2032, with a median around 2028; those estimates remain projections, not settled measures of all possible future training data.

The legal and commercial shift is already visible. Anthropic agreed to a $1.5 billion settlement with authors over alleged use of pirated books, described in the source material as the largest recovery in U.S. copyright law. The court drew a line between training on legally acquired books, which it called transformative fair use, and downloading pirated files from shadow libraries; the settlement covered past piracy claims and required destruction of the files, but did not settle future training rights or model-output disputes.

The report also points to synthetic data as a partial workaround, citing Nvidia’s $320 million purchase of Gretel and Microsoft’s use of hundreds of billions of synthetic tokens. But it says synthetic data raises a separate risk: when machine-generated material is reused in areas where accuracy is hard to verify, errors can compound, increasing the value of fresh human-made data.

AI Dispatch · The Control Series · Part 3
Chokepoint 03 — Data

Data: The One Thing You Can’t Rent

The free part of “all human knowledge” is running out. As compute and models commoditize, the corpus you can’t replicate becomes the moat — so data is being fenced, priced, and, in places, treated as a national asset.

Scarcity and value rise from the bottom tier to the top:

Sovereign / real-world: Avengers combat data, FSD, ISR (can’t be bought)
Expert-authored: PhDs, lawyers, surgeons define “good” (the new gold)
Licensed content: paywalled, deal-only, now priced (fenced)
Public web text: scraped for free, exhausting ~2028 (commoditizing)

Key figures:

~300T: public text tokens, used up 2026–2032
$1.5B: Anthropic authors settlement, scraping era ends
$14.3B: Meta for 49% of Scale, triggered an exodus
“Keep the model”: Ukraine’s condition, data as sovereign asset

The take

Data was supposed to be the abundant input. It’s the scarce one. It’s also the chokepoint you can actually own — so guard your proprietary data, and don’t hand it to a provider who can become your competitor (the lesson everyone fled Scale to learn). Nations: license it like Ukraine — keep the model, keep the leverage.

Sources: Epoch AI; PBS; Intl AI Safety Report 2026; NPR; Authors Guild; Wolters Kluwer; TechCrunch; TIME; CNBC; Ukraine MoD (2024–Jun 2026). Token estimates are projections; valuations as reported.

Data Scarcity Reshapes AI Competition

The practical effect is that data is becoming harder to copy than chips or model architectures. The report says H100 rental rates are down 60-75% from their peak, which makes compute more accessible than it was during the first stage of the AI boom. By contrast, a private medical corpus, enterprise workflow history, military sensor feed or expert-annotated legal dataset cannot be rented from a public cloud if the owner refuses to share it.

That matters for companies, governments and creators. For companies, proprietary data can become leverage, but only if contracts stop model providers from using customer information to build rival products. For publishers and authors, licensing can create payment channels, while high settlement costs may favor labs with enough cash to absorb them. For governments, battlefield and intelligence data can be treated less like a vendor input and more like a national asset.

From Open Web To Paid Archives

The Control Series frames this as the third AI chokepoint after compute and power. Its argument is that the free phase of training on broadly available web text is ending, while the remaining high-value sources are harder to collect and easier for owners to fence.

Publishing groups have already moved in that direction. The New York Times’ case against OpenAI remains in discovery, according to the source material, while News Corp and other publishers have shifted some activity from litigation to licensing. The report presents those deals and disputes as evidence that training data is moving from a low-cost scraped input to a priced market.

The next layer is expertise. The report says reinforcement learning and reasoning models have increased demand for lawyers, physicists, doctors, engineers and other specialists who can define high-quality answers, check model reasoning and create evaluation data. That expert work is costlier than older labeling tasks, which often relied on large contractor pools paid per simple judgment.

“Data was supposed to be the abundant input. It’s the scarce one.”

— Thorsten Meyer AI, The Control Series

Training Rights Remain Legally Unsettled

Several parts of the data market are still unresolved. It is not yet clear how courts will draw the boundary between lawful training, infringement tied to copied source files and claims over model outputs. The Anthropic settlement kept one high-risk case over past pirated-book claims from going to trial, but it did not answer every copyright question facing model developers.

The technical outlook is also uncertain. Epoch AI’s token ceiling is a projection, and labs may find ways to improve data efficiency, use multimodal inputs or produce higher-quality synthetic data. The report’s central claim is narrower: the easy public text supply is no longer enough on its own for frontier competition.
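
As a rough illustration of why the exhaustion window is sensitive to assumptions, the sketch below computes the year in which a training dataset growing by a fixed factor each year would reach the roughly 300 trillion token stock cited from Epoch AI. The starting dataset size, starting year and growth factors are hypothetical values chosen for illustration. They do not come from the report or from Epoch AI, whose own model is more detailed.

```python
import math


def exhaustion_year(stock_tokens, dataset_tokens, annual_growth, start_year):
    """Return the first year in which a dataset that grows by a fixed
    factor each year reaches or exceeds the available token stock."""
    if annual_growth <= 1:
        raise ValueError("annual_growth must be greater than 1")
    if dataset_tokens >= stock_tokens:
        return start_year
    # Solve dataset_tokens * annual_growth**n >= stock_tokens for n.
    years = math.log(stock_tokens / dataset_tokens) / math.log(annual_growth)
    return start_year + math.ceil(years)


# ~300 trillion tokens: the Epoch AI estimate cited in the report.
STOCK = 300e12

# Hypothetical inputs for illustration only (not from the report).
START_YEAR = 2024
START_DATASET = 15e12

for growth in (1.5, 2.0, 3.0):
    year = exhaustion_year(STOCK, START_DATASET, growth, START_YEAR)
    print(f"growth x{growth}/year -> stock reached in {year}")
```

Under these assumed inputs, slower growth pushes the crossover several years later than faster growth does, which is one reason a projection of this kind is stated as a range rather than a single date.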

Court Rulings And Data Deals

The next milestones are likely to come from litigation, licensing contracts and corporate data strategy. Ongoing copyright cases, including The New York Times’ case against OpenAI, could clarify the legal boundaries for training datasets, while publisher deals may set market prices for access to archives.

Businesses and governments will also face pressure to set stricter rules for internal data. The report’s recommendation is direct: keep control over proprietary datasets, limit reuse rights in vendor contracts and treat hard-to-replace corpora as strategic assets rather than raw material handed to model providers.

Key Questions

What is the main development in the report?

The report says AI competition is moving from broadly scraped web data toward controlled sources such as licensed archives, enterprise data, expert evaluation sets and sovereign real-world data.

Does this mean AI companies are out of data?

No. The claim is narrower: high-quality public text may be nearing its practical training ceiling, according to Epoch AI projections cited by the report. Labs can still use private data, licensed content, multimodal sources and synthetic data.

Why does the Anthropic settlement matter?

It signals that training data access is becoming a priced legal market. The $1.5 billion settlement addressed past alleged use of pirated books, but it did not settle all future questions about training rights or model outputs.

Why is expert data more valuable now?

Reasoning models need people who can judge whether complex answers are correct. The report says lawyers, doctors, physicists and other specialists now help define quality in ways simple web text cannot.

Are the cited figures investment guidance?

No. The H100 rental-rate change, acquisition prices and settlement amount are reported or historical figures cited in the source material. They are not financial, tax or legal advice, and they are not guarantees of future market behavior.

Source: Thorsten Meyer AI

This content is for general information only and is not financial, tax or legal advice. Consult a qualified professional for decisions about your money.