TL;DR
Thorsten Meyer AI’s latest Control Series installment frames data as the AI industry’s hardening chokepoint, citing public text exhaustion projections, Anthropic’s $1.5 billion author settlement and rising demand for expert and sovereign data. Confirmed developments show public web scraping giving way to paid licenses, court fights and tighter control over proprietary corpora; the long-term legal and technical rules are still unsettled.
AI’s largest labs are moving into a more restricted data market as public web text nears its projected training limit, according to Thorsten Meyer AI’s June 2026 Control Series report. The report argues that this shift makes proprietary corpora, licensed archives and sovereign datasets a central source of power in the next phase of model competition.
The report cites Epoch AI’s estimate that the public internet contains roughly 300 trillion tokens of high-quality text, with frontier training datasets already approaching that stock. Epoch AI projects the public supply could be fully used between 2026 and 2032, with a median around 2028; those estimates remain projections, not settled measures of all possible future training data.
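To see why a projection like this carries such a wide range, a back-of-envelope sketch can help. The one below is not Epoch AI's method. It takes the 300 trillion token stock cited in the report and assumes, purely for illustration, a starting dataset size, a starting year and a constant yearly growth factor. All three of those inputs are hypothetical.

```python
import math


def years_until_exhaustion(stock_tokens, dataset_tokens, annual_growth):
    """Years until a dataset growing by `annual_growth`x per year reaches the stock.

    Assumes smooth exponential growth, which is a simplification.
    """
    if dataset_tokens >= stock_tokens:
        return 0.0
    if annual_growth <= 1.0:
        # A dataset that does not grow never reaches the stock.
        return math.inf
    return math.log(stock_tokens / dataset_tokens) / math.log(annual_growth)


STOCK_TOKENS = 300e12   # public high-quality text stock cited in the report
START_YEAR = 2024       # hypothetical starting point
START_TOKENS = 15e12    # hypothetical frontier dataset size at START_YEAR

# Hypothetical growth factors: the answer moves by years as this input changes.
for growth in (1.5, 2.0, 3.0):
    year = START_YEAR + years_until_exhaustion(STOCK_TOKENS, START_TOKENS, growth)
    print(f"{growth:.1f}x per year: stock reached around {year:.1f}")
```

The point of the sketch is sensitivity: small changes in the assumed growth rate shift the crossover date by several years, which is one reason such estimates are best read as ranges rather than deadlines.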
The legal and commercial shift is already visible. Anthropic agreed to a $1.5 billion settlement with authors over alleged use of pirated books, described in the source material as the largest recovery in U.S. copyright history. The court drew a line between training on legally acquired books, which it called transformative fair use, and downloading pirated files from shadow libraries. The settlement covered past piracy claims and required destruction of the files, but it did not settle future training rights or model-output disputes.
The report also points to synthetic data as a partial workaround, citing Nvidia’s $320 million purchase of Gretel and Microsoft’s use of hundreds of billions of synthetic tokens. But it says synthetic data raises a separate risk: when machine-generated material is reused in areas where accuracy is hard to verify, errors can compound, increasing the value of fresh human-made data.
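The compounding risk can be illustrated with a toy model. The sketch below is not drawn from the report and does not describe any lab's pipeline. It assumes each round of synthetic data inherits the previous round's errors unless verification catches them, and that each round also introduces a fixed share of new errors. The function name and every rate used here are hypothetical.

```python
def error_rate_over_generations(initial_error, new_error_per_generation,
                                verified_fraction, generations):
    """Track the share of erroneous items across rounds of synthetic reuse.

    verified_fraction is the share of inherited errors caught and fixed
    before the next round (0.0 means accuracy cannot be checked at all).
    """
    rates = [initial_error]
    error = initial_error
    for _ in range(generations):
        # Errors carried over from the previous round, minus those caught.
        inherited = error * (1.0 - verified_fraction)
        # Previously correct material picks up fresh errors at a fixed rate.
        error = inherited + (1.0 - inherited) * new_error_per_generation
        rates.append(error)
    return rates


# Hypothetical rates: 1% starting errors, 2% new errors per round.
unverified = error_rate_over_generations(0.01, 0.02, 0.0, 5)
verified = error_rate_over_generations(0.01, 0.02, 0.9, 5)

print("no verification:  ", [round(r, 4) for r in unverified])
print("90% verification: ", [round(r, 4) for r in verified])
```

Under these assumptions the unverified error rate rises every round, while the verified one levels off. That is the sense in which hard-to-verify domains raise the value of fresh human-made data.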
Data: The One Thing You Can’t Rent
The free part of “all human knowledge” is running out. As compute and models commoditize, the corpus you can’t replicate becomes the moat — so data is being fenced, priced, and, in places, treated as a national asset.
Data was supposed to be the abundant input. It’s the scarce one. It’s also the chokepoint you can actually own, so guard your proprietary data and don’t hand it to a provider who can become your competitor (the lesson customers learned when they fled Scale). Nations: license it like Ukraine. Keep the model, keep the leverage.
Data Scarcity Reshapes AI Competition
The practical effect is that data is becoming harder to copy than chips or model architectures. The report says H100 rental rates are down 60-75% from their peak, which makes compute more accessible than it was during the first stage of the AI boom. By contrast, a private medical corpus, enterprise workflow history, military sensor feed or expert-annotated legal dataset cannot be rented from a public cloud if the owner refuses to share it.
That matters for companies, governments and creators. For companies, proprietary data can become leverage, but only if contracts stop model providers from using customer information to build rival products. For publishers and authors, licensing can create payment channels, while high settlement costs may favor labs with enough cash to absorb them. For governments, battlefield and intelligence data can be treated less like a vendor input and more like a national asset.
As an affiliate, we earn on qualifying purchases.
From Open Web To Paid Archives
The Control Series frames this as the third AI chokepoint after compute and power. Its argument is that the free phase of training on broadly available web text is ending, while the remaining high-value sources are harder to collect and easier for owners to fence.
Publishing groups have already moved in that direction. The New York Times’ case against OpenAI remains in discovery, according to the source material, while News Corp and other publishers have shifted some activity from litigation to licensing. The report presents those deals and disputes as evidence that training data is moving from a low-cost scraped input to a priced market.
The next layer is expertise. The report says reinforcement learning and reasoning models have increased demand for lawyers, physicists, doctors, engineers and other specialists who can define high-quality answers, check model reasoning and create evaluation data. That expert work is costlier than older labeling tasks, which often relied on large contractor pools paid per simple judgment.
“Data was supposed to be the abundant input. It’s the scarce one.”
— Thorsten Meyer AI, The Control Series

Training Rights Remain Legally Unsettled
Several parts of the data market are still unresolved. It is not yet clear how courts will draw the boundary between lawful training, infringement tied to copied source files and claims over model outputs. The Anthropic settlement kept one high-risk case over past pirated-book claims from going to trial, but it did not answer every copyright question facing model developers.
The technical outlook is also uncertain. Epoch AI’s token ceiling is a projection, and labs may find ways to improve data efficiency, use multimodal inputs or produce higher-quality synthetic data. The report’s central claim is narrower: the easy public text supply is no longer enough on its own for frontier competition.
Court Rulings And Data Deals
The next milestones are likely to come from litigation, licensing contracts and corporate data strategy. Ongoing copyright cases, including The New York Times’ case against OpenAI, could clarify the legal boundaries for training datasets, while publisher deals may set market prices for access to archives.
Businesses and governments will also face pressure to set stricter rules for internal data. The report’s recommendation is direct: keep control over proprietary datasets, limit reuse rights in vendor contracts and treat hard-to-replace corpora as strategic assets rather than raw material handed to model providers.
Key Questions
What is the main development in the report?
The report says AI competition is moving from broadly scraped web data toward controlled sources such as licensed archives, enterprise data, expert evaluation sets and sovereign real-world data.
Does this mean AI companies are out of data?
No. The claim is narrower: high-quality public text may be nearing its practical training ceiling, according to Epoch AI projections cited by the report. Labs can still use private data, licensed content, multimodal sources and synthetic data.
Why does the Anthropic settlement matter?
It signals that training data access is becoming a priced legal market. The $1.5 billion settlement addressed past alleged use of pirated books, but it did not settle all future questions about training rights or model outputs.
Why is expert data more valuable now?
Reasoning models need people who can judge whether complex answers are correct. The report says lawyers, doctors, physicists and other specialists now help define quality in ways simple web text cannot.
Are the cited figures investment guidance?
No. The H100 rental-rate change, acquisition price and settlement amount are reported or historical figures cited in the source material. They are not financial, tax or legal advice, and they are not guarantees of future market behavior.
Source: Thorsten Meyer AI