
For decades, the web content model was simple: publishers provided content, search engines indexed it, and users were expected to visit the original site to reach the content they were looking for. Exceptions to this rule – where publishers felt that other companies were exploiting their content wholesale – often led to lawsuits. Perfect 10, an adult magazine, sued Google for including thumbnails from its site in Google’s image search, a case the Ninth Circuit decided in 2007; and hiQ’s scraping of LinkedIn profiles to build (and sell) so-called ‘people analytics’ products set off a court battle with LinkedIn that reached the Ninth Circuit in 2019.

AI changes everything

AI supercharges this conflict. At every level, from training to AI ‘search’ products, it adopts a “consume everything, return nothing” approach: it takes content from publishers while discouraging users from visiting their sites. Search engines’ “AI Overviews” synthesize what their crawlers collect and appear at the top of the results page, attempting to answer user queries without sending those users to the sites that provided the answers. AI-first products like ChatGPT, Claude, and Perplexity go a step further, reducing the original sources to mere reference links or footnotes. Google’s answer box cut the share of mobile queries that resulted in a visit to the source site by 75%, and Cloudflare Radar estimates that it is thousands of times harder for individual sites to get traffic from ChatGPT and Claude users than from old-style search engines.
 
Yet the original content is still irreplaceable. Synthetic data (produced by generative AI) has become central to supervised fine-tuning and reinforcement learning, which take up an ever-increasing share of AI training budgets. That training improves model performance on math, coding, and reasoning tasks, but it doesn’t give models knowledge. Original content remains crucial not just for pretraining, but for answering user queries with grounded, up-to-date, and relevant information.
 
Unfortunately, content creators have not found a particularly receptive audience for their complaints in the courts. In June, US District Judge William Alsup ruled in Bartz v. Anthropic that “[content] used to train specific LLMs did not and will not displace demand for copies of Authors’ works, or not in the way that counts under the Copyright Act”. He held that pretraining on copyrighted works – even without the permission of the authors – constituted fair use. Though Alsup wasn’t asked to consider the use of original content in retrieval-augmented generation (like “AI Overviews” in search engines), the history of similar cases – like Perfect 10 – doesn’t offer much hope of a different outcome.

So are content owners wholly out of luck – condemned to the sidelines, slaving away to produce content to be regurgitated wholesale? 

Well, not quite. Alsup held that training was fair use – but that the acquisition of the underlying content matters. Before it began buying books to use for training data, Anthropic downloaded more than seven million pirated books, an effort that Alsup ruled was not justified by fair use and exposed Anthropic to liability. Though the precise contours of what kind of acquisition might expose an AI company to liability are still fuzzy, the message to content producers is clear: protect it or lose it.

Cloudflare’s new approach

Enter Cloudflare. 

Cloudflare’s global network sits in front of about 20% of the internet and about 35% of the Fortune 500, making it one of the most significant internet infrastructure providers. On July 1, Cloudflare announced a change of policy: it would block AI crawlers by default. For individual content providers, this shift is crucial. A lone site that blocks AI crawlers keeps its content from being hoovered up, but runs the risk of being ignored entirely. Cloudflare is simply too big to ignore – even for OpenAI, Google, or Anthropic.
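
What does “blocking by default” look like in practice? At its simplest, it is a user-agent check: requests that identify themselves as known AI crawlers are refused unless the site owner opts in. The sketch below is purely illustrative – Cloudflare’s real enforcement uses bot detection that goes well beyond user-agent strings – but the crawler names (GPTBot, ClaudeBot, CCBot, PerplexityBot) are the real published identifiers.

```python
# Illustrative sketch of a default-deny policy for AI crawlers.
# Cloudflare's actual detection is far more sophisticated (fingerprinting,
# behavioral signals); this only shows the basic shape of the policy.

# Published user-agent tokens of well-known AI crawlers (non-exhaustive).
AI_CRAWLER_AGENTS = ("GPTBot", "ClaudeBot", "CCBot", "PerplexityBot")

def is_ai_crawler(user_agent: str) -> bool:
    """True if the request's User-Agent matches a known AI crawler."""
    ua = user_agent.lower()
    return any(bot.lower() in ua for bot in AI_CRAWLER_AGENTS)

def handle_request(user_agent: str, owner_opted_in: bool = False) -> int:
    """Block-by-default: AI crawlers get 403 unless the site owner opts in."""
    if is_ai_crawler(user_agent) and not owner_opted_in:
        return 403  # Forbidden: crawler is blocked by default
    return 200      # ordinary visitors (and opted-in crawlers) pass through

assert handle_request("Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)") == 403
assert handle_request("Mozilla/5.0 (Windows NT 10.0; Win64; x64)") == 200
```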

Instead, Cloudflare plans to roll out “pay-per-crawl”: a system whereby site owners can set a price to have their sites crawled by AI bots and manage billing and access directly through Cloudflare, blocking bots that won’t pay the price asked by the publisher.
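
Mechanically, pay-per-crawl leans on a long-dormant corner of HTTP: the 402 Payment Required status code. The sketch below shows the shape of the exchange. The header names (crawler-price, crawler-max-price, crawler-charged) follow those described in Cloudflare’s announcement, but treat them – and the logic – as illustrative rather than a spec; the real system also handles crawler authentication and billing.

```python
# A minimal sketch of a pay-per-crawl exchange built on HTTP 402 Payment
# Required. Header names are drawn from Cloudflare's announcement but the
# implementation here is a simplified illustration, not Cloudflare's code.

from dataclasses import dataclass, field

SITE_PRICE_USD = 0.01  # flat per-request price set by the publisher

@dataclass
class Response:
    status: int
    headers: dict = field(default_factory=dict)
    body: str = ""

def serve_crawler(request_headers: dict) -> Response:
    """Publisher-side logic: charge crawlers that accept the price, 402 otherwise."""
    max_price = float(request_headers.get("crawler-max-price", 0))
    if max_price >= SITE_PRICE_USD:
        # Crawler agreed to pay: return the content and record the charge.
        return Response(200, {"crawler-charged": f"{SITE_PRICE_USD}"}, "<html>...</html>")
    # Crawler hasn't offered enough: refuse, with the asking price attached.
    return Response(402, {"crawler-price": f"{SITE_PRICE_USD}"})

# A first attempt with no payment offer is refused with the price quoted...
first = serve_crawler({})
assert first.status == 402 and first.headers["crawler-price"] == "0.01"
# ...and a retry that accepts the quoted price gets the content.
retry = serve_crawler({"crawler-max-price": first.headers["crawler-price"]})
assert retry.status == 200
```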

Cloudflare’s pay-per-crawl won’t be the end of the conversation. The Cloudflare team has emphasized that what’s been announced is just their initial foray into the space. But the first iteration leaves several questions unanswered.  

For one thing, this incarnation of pay-per-crawl is based on a flat, per-request price across the entire site. It doesn’t allow publishers to offer different pieces of content at different prices. It also doesn’t provide any kind of licensing information about that content: can it be used to return search results, but only with attribution? Can it be used for model pretraining? These are very different use cases, and publishers will want a more flexible – but still standardized – model for communicating these things to AI crawlers and agents. Google, for example, uses the “Google-Extended” robots.txt token to govern whether crawled content may be used as Gemini training data, while content fetched by “Googlebot” feeds features like AI Overviews – and it isn’t clear how, or if, Cloudflare plans to distinguish between these two use cases in its pay-per-crawl feature. Cloudflare has anchored its pay-per-crawl solution in HTTP response codes; anchoring licensing in HTTP response headers, such as a new X-Content-License-Type header, could make this relatively straightforward for both crawlers and publishers.
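
To make that idea concrete, here is a minimal sketch of per-path licensing declared in a response header. Everything in it is hypothetical: X-Content-License-Type is the header suggested above, and the license vocabulary (“search-with-attribution”, “pretraining”) is invented for illustration – no such standard exists today.

```python
# Hypothetical sketch: per-path licensing declared in a response header.
# The X-Content-License-Type header and the license terms are illustrative
# inventions, not part of any existing standard or Cloudflare product.

# Publisher side: license terms by path prefix (most specific prefix wins).
LICENSES = {
    "/news/":    "search-with-attribution",               # AI answers OK, with a link back
    "/archive/": "search-with-attribution, pretraining",  # also usable as training data
    "/premium/": "none",                                  # no AI use licensed
}

def license_for(path: str) -> str:
    for prefix in sorted(LICENSES, key=len, reverse=True):
        if path.startswith(prefix):
            return LICENSES[prefix]
    return "none"  # default: nothing is licensed

def response_headers(path: str) -> dict:
    return {"X-Content-License-Type": license_for(path)}

# Crawler side: a well-behaved crawler skips anything not licensed for its purpose.
def may_ingest(headers: dict, purpose: str) -> bool:
    terms = {t.strip() for t in headers.get("X-Content-License-Type", "none").split(",")}
    return purpose in terms

assert may_ingest(response_headers("/archive/2019/story"), "pretraining")  # licensed
assert not may_ingest(response_headers("/news/today"), "pretraining")      # search only
assert may_ingest(response_headers("/news/today"), "search-with-attribution")
```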

Nor is it clear who else plans to sign on. Micropayments aren’t a new idea in web monetization, and historically the problem has been the network effect: signing up enough users, on both the consumer and publisher sides, to reach critical mass. Several frameworks have tried; none have succeeded. Yet Cloudflare’s proposal has two major advantages. Cloudflare’s scale – and its ability to act as a merchant of record – provides ready-made infrastructure that previous technical frameworks couldn’t match, and this time the paying consumers are sophisticated and relatively few. It will be interesting to see whether Cloudflare can sign up both the other hyperscale cloud providers and the major AI companies – and whether the cash starts flowing in quantities that matter.

Whether or not the pay-per-crawl model succeeds, though, what Cloudflare’s “Content Independence Day” shows us is this: if your organization doesn’t have an AI strategy, it’s time to get one. Even if you don’t plan to incorporate AI into a product or service, AI is changing how what you produce is valued, how customers find you, and how the world sees and interacts with you. Arctiq can help organizations of any size understand AI and create a plan to navigate this rapidly changing space.

Post by Alex Vulovic
July 29, 2025
Alex has considerable experience designing both business and technical solutions to complex problems and leading teams implementing them.