Is There a Middle Ground in the Tug of War Between News Publishers and AI Firms? Part 1: Framing the Problem

The tug of war between online news publishers and AI developers on the role of copyrighted content in AI model training may lead to a more closed internet for everyone.

In this two-part blog series, we describe the situation as it’s unfolding and propose policy solutions worth exploring to preserve incentives for publishers to keep creating the timely content that AI developers need and democracy requires. View the second post, “Is There a Middle Ground in the Tug of War Between News Publishers and AI Firms? Part 2: Framing Solutions” to continue the series.

In July, the Senate Judiciary Subcommittee on Crime and Counterterrorism hosted a hearing with the provocative title, “Too Big to Prosecute? Examining the AI Industry’s Mass Ingestion of Copyrighted Works for AI Training.” Happily, some of the conversation with the four witnesses was actually about the crime of online piracy – that is, how artificial intelligence (AI) firms have downloaded content from shadow libraries (unofficial pirate repositories of digital books and articles) to train their large language models (LLMs). However, the majority of the hearing served only to conflate “piracy of copyrighted works” with “training generative AI models on copyrighted works.” The former may be illegal, but we believe the latter is generally a protected fair use under the legal doctrine that allows limited use of copyrighted works without permission.

The hearing was a reflection of the tremendous public attention on the intersection of generative artificial intelligence, copyright, and online publishing. Although the hearing largely focused on book publishers, many of the issues discussed also pertain to another type of publisher Public Knowledge has written about quite a bit: online news publishers. They, too, often use the language of “stealing,” “taking,” or “hoovering” to describe how digital platforms use their copyrighted content, and believe it to be a type of theft. We find that the argument that digital platforms “steal” news content misconstrues copyright law and conflates two very different ideas: piracy and AI model training.

In our view, web crawling for the purpose of AI training implicates the freedom to learn, freely use information, and freely express creativity. We also acknowledge the genuine economic risks generative AI poses to creators, including news publishers. And we know that some aspects of AI model design may infringe on content owners’ rights to control the reproduction, distribution, and public display of their work. For example, overfitting, including memorization, by models may result in infringing outputs. However, when we apply existing copyright law to our best understanding of generative AI systems, we find that their core elements are consistent with the law.

In Part 1 of this two-part blog series, we discuss why news publishers fear the impact of AI on their business models and whether these concerns are warranted. In Part 2, we describe strategies publishers are already using to mitigate the impact of AI on their business models, list additional solutions that are emerging to empower them against the threat generative AI represents, and identify policy solutions worthy of additional exploration that preserve the benefits of fair use while maintaining incentives for publishers to keep creating the content AI developers and the public need.

AI Developers and News Publishers Are… Frenemies?

Whether they acknowledge it or not, AI developers and news publishers are increasingly codependent. Online news publishers play a crucial role in AI training by providing large datasets of high-quality, human-generated content. Whether through model training or retrieval augmented generation (more on these later), journalism grounds AI models in reality and in the now. The currency and relevance of AI-generated outputs depend on access to timely sources of content. Journalism provides factual reporting, context, and in-depth analysis of real-world events and issues happening in the moment. By training AI on diverse journalistic sources, models learn to recognize and mitigate biases present in their training data. And using journalism for model training can help improve fact-checking and combat propaganda and false information. That may be why studies show the training sets underlying LLMs “significantly overweight publisher content” compared to the generic collection of content scraped by Common Crawl. The result: If their outputs undermine all the viable business incentives and models that sustain online news publishing, AI developers will eventually be in a world of hurt.

Conversely, journalism organizations need to understand and, where appropriate, leverage AI systems to adapt to the realities of a digital media landscape. With the rise of the internet, then search and social media, and now AI-mediated search and information distribution, news publishers have been forced to grapple with rapidly changing technology that disrupts their business models. Additionally, publishers are always looking for opportunities to reduce costs, streamline news gathering, facilitate translation, and engage customers, and are therefore aggressively seeking ways to leverage the substantial benefits of AI in their own operations. (This is not without controversy, including whether AI tools must be subject to the same journalistic editorial standards that apply to human journalists. There is also evidence that the vast majority of businesses – 95% – have yet to see real efficiencies from AI materialize.) Lastly, journalists will need to leverage AI to conduct forensics and ensure the legitimacy of images and videos that may have been created with AI tools.

Finally, AI developers and publishers both have self-interested roles to play in maintaining the open and free nature of the internet. Certainly, government-funded research, commercial innovation, public policy, and infrastructure have been critical to creating the internet as we know it today. But a lot of content and services (like search) are accessible and free to internet users today because of appealing publisher content and the advertising that supports it. Ensuring the viability of online publishing – whether it takes the form of newspapers, personal blogs, Substacks, or other models – allows a wider range of content and services to remain accessible and free. At the same time, news publishers have relied on users (and platform algorithms) freely sharing and promoting their content to drive clicks, views, and advertising revenues. Allowing AI models to “read” and learn from online content is an essential aspect of an Open Internet, and it may create substantial economic and societal benefits. All of this is built on the premise of a free and open web.

Journalism also shares an interest in permissive copyright rules and strong fair use protections. Journalists are themselves highly dependent on the legal doctrine of fair use – for criticism and commentary, news gathering and reporting, republishing source material, illustration, historical reference, and documenting claims. Hollowing out fair use or dramatically expanding intellectual property rights could whip around and harm journalism itself.

Given these relationships, in our view policy solutions must be developed to ensure that the benefits of generative AI are shared by the body politic writ large without undercutting the journalism necessary for democracy’s survival.

Why Publishers Fear Generative AI 

News publishers have long believed that dominant digital platforms unfairly – and in some cases, illegally – exploit their work. The platforms’ aim, this theory goes, is to garner most or all of the joint value created through their longstanding exchange with publishers: user engagement from news content on search and social media platforms in exchange for referral traffic provided to publishers through links. The current challenges in the news industry predate the internet, but there is no debate that digital disintermediation has dramatically impacted the structure and economics of news delivery. This has led publishers, in some cases, to pursue solutions (like link taxes) that are incompatible with copyright law as well as the principle of an Open Internet.

Now AI, especially generative AI and its embedding in search products, chatbots, and agents, has exacerbated news publishers’ concerns about the devastation platforms have caused to their business models. For example, generative AI’s ability to provide complete narrative answers to some of the most complex user search queries right on the search engine results page undermines the need to click through to online publishers for more information. (When Google rolled out AI Overviews, now AI Mode, in May of 2024, the company explicitly promised users that AI overviews are the perfect solution when “you don’t have time to piece together all the information you need.” In other words, a zero-click search is the product benefit of AI Overviews.) Or, chatbots and agents trained on news publishers’ copyrighted content answer user queries about current events instead of search engines. That means the flow of traffic, ad dollars, and profit could continue to shift toward the dominant AI companies and away from publishers, steepening a trend line that has been in place for decades.

This challenge isn’t just about model training. AI models are now often complemented by grounding processes, by which AI models are connected to real-time information to improve the accuracy and relevance of their outputs. One example of a grounding process is retrieval augmented generation (RAG), a technique that accesses web pages as part of a query to improve the accuracy and currency of the AI model’s outputs. These kinds of enhancements require access to current information – like today’s news – to validate or update responses that would otherwise be based on prior generations of training data. While some of these responses include citations or links to the source, publishers believe these technologies will result in (even) less traffic, (even) fewer ad dollars, and (even) fewer subscription conversions. Publishers also highlight the risk of brand erosion due to AI slop, hallucinations, and misattribution (or lack of attribution) to the right news sources. 
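To make the grounding mechanism concrete, here is a deliberately simplified sketch of how RAG works. Everything here is an illustrative stand-in: the keyword-overlap retriever, the sample documents, and the prompt format are toy substitutes for the vector search and hosted LLM a production system would actually use. The point is only to show why RAG creates ongoing demand for fresh publisher content: the retrieval step fetches current material at query time and stuffs it into the prompt.

```python
def retrieve(query, documents, k=1):
    """Toy retriever: rank documents by shared words with the query, return top k.
    Real systems use vector embeddings and a search index instead."""
    query_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]


def build_grounded_prompt(query, documents):
    """Prepend freshly retrieved context to the user's question.
    In a real pipeline, this prompt would then be sent to an LLM."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical "today's news" that a RAG bot might have just crawled
# from a publisher's site.
todays_news = [
    "City council votes to approve the new transit budget today.",
    "Local bakery wins national pastry award.",
]

prompt = build_grounded_prompt("What did the city council vote on?", todays_news)
print(prompt)
```

Because the retrieval step runs on every relevant query, each user question can translate into a fresh crawl of a publisher’s pages – which is exactly why publishers report rising bot traffic even when referral clicks stay flat.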

This line of thinking doesn’t even account for the likely adoption of an advertising-based business model for AI products. That would mean even more ad dollars migrating from publishers to AI firms. Google is already selling ads within AI Overviews. Other AI firms are likely to adopt ad-based business models, as well, for two reasons: it’s the business model the dominant platforms already rely on, and newer AI companies have had little success in attracting paid users. (Publishers’ concerns also do not factor in Google’s brand-new “Preferred Sources” feature, which lets users “select their favorite sources” to be placed most prominently in search results “when those sources have published fresh and relevant content for your search.” Preferred Sources may serve to further marginalize small and diverse news sources, as users are generally more familiar with major, national media brands.)

Lastly, news publishers, like many others, are concerned about false information from chatbots and the impact that “pre-digested verdicts” to important queries, shaped by opaque algorithms and advertising- and engagement-based financial incentives, will have on our overall information environment. 

Early Impact of Generative AI on News Publishers

Publishers aren’t crazy – they’re already reporting the damage generative AI and its offspring are causing to their cost structure and revenue. (Yes, it’s fair to say these trends are getting media coverage in part because publishers are trying to make the case for protective legislation. But without data from AI firms to refute it, this is the story legislators are hearing and acting on. More on that later.)

Alarmed by Declines in Traffic

The emergence of new AI tools (like OpenAI’s ChatGPT, Google’s Gemini, Anthropic’s Claude, and Perplexity) has resulted in more search referrals to many publishers. However, the increase in referrals is not compensating for a higher rate of zero-click searches derived from Google’s AI-powered search overviews. (Search engine optimization, or SEO, agencies working on behalf of advertisers are obsessing over precisely which keywords, industries, and geographies are more likely to trigger an AI overview. The consensus seems to be that they currently show up in ~20% of searches.) As predicted, click-throughs from AI-powered search overviews have plummeted relative to traditional search queries (which have themselves become unpredictable).

Consumer search behavior also seems to be changing: New consumer research shows that Google users who encounter an AI summary are 50% less likely to click on links to other websites than users who see a traditional search result. Why click on blue links if everything you need to know appears upon your query? Google users who encountered an AI summary also rarely – 1% of the time – click on a link in the summary itself. And Google users are more likely to end their browsing session entirely after visiting a search page with an AI summary than on pages without one.

New data from Digital Content Next’s membership of 19 digital publishers shows median year-over-year referral traffic from Google Search down 10% for the most recent eight-week period. News brands, which may still be able to cover breaking news in ways AI cannot, fell 7%. (Early in August, Google maintained that “total organic click volume” is stable, but went on to emphasize other measures such as “click quality,” the presence of more links on the search engine results page, and the shift in traffic to different kinds of content. The company’s post generated spirited feedback from publishers and SEO experts in the U.S. and U.K.)

Overwhelmed by New Traffic from Crawlers and Bots

Well upstream from traffic and ad revenue, publishers are taking on new costs as AI training data crawlers and bots overwhelm their systems. 

For example: TollBit was one of the first platforms to enable websites to monetize their content by charging AI companies and bots for access, so it has some of the most extensive data on AI crawler behavior. In its most recent quarterly “State of the Bots” report, TollBit reported that total AI user agent traffic among the TollBit customer network grew 87% from the last quarter of 2024 to the first quarter of 2025, likely due to higher rates of adoption of these tools among users. Within this total, for the first time, traffic from retrieval augmented generation bots exceeded traffic from training bots, growing at nearly 2.5x the rate of training bot traffic. If this trend continues, supporting AI user agent traffic will be an ongoing and increasing cost for publishers as adoption grows. And as noted above, TollBit found that referral traffic from AI bots was still minuscule – just 0.04% of all external referrals to network sites in Q1 2025 – and nowhere near enough to offset the broader decline in traffic from traditional search sites.

Frustrated by Lack of Control Over Access

Publishers, faced with heavy scraping loads from AI firms but seeing little return in monetizable traffic, are increasingly pushing to assert greater control over how their content is used for training and real-time AI queries. This can be technically complex. For example, some of the largest AI products, like Google AI Overviews, Microsoft Copilot (Bingbot), and Apple’s AI tools (Applebot), do not separate their AI user agents from their search ranking crawlers. Publishers risk losing all their visibility with platform users if they try to manage or block these firms from accessing their content. Publishers see the need to control how their content is used for AI training as a way to counter these technology companies’ monopolistic power. But blocking search ranking crawlers can be business suicide.

Other AI firms simply ignore the Robots Exclusion Protocol – the robots.txt file publishers use to notify technology platforms that they do not wish to have their content crawled. TollBit’s network data, for example, suggests that disallowing real-time scraping by retrieval augmentation bots via robots.txt has zero impact on the referrals the AI apps deliver – they’re still crawling. AI firms may also be using third-party scrapers, stealth scrapers, or masked user agents that continue to scrape sites despite the exclusion protocol. They may also pull cached content from search engines or scrape it from the Internet Archive. This has led some online publishers to block the Internet Archive so that their content cannot be scraped from the Wayback Machine. As a result, publishers, AI firms, and internet users in general all lose important pieces of digital history.
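For readers unfamiliar with it, robots.txt is just a plain-text file at a site’s root listing per-crawler directives. A minimal sketch is below; the user-agent tokens shown (GPTBot, Google-Extended, Googlebot) are the publicly documented names for those crawlers, but – as the data above suggests – whether any directive is honored is entirely up to the crawler:

```text
# robots.txt -- compliance is voluntary; well-behaved crawlers honor
# these directives, others simply ignore them.

# Block OpenAI's training crawler from the whole site
User-agent: GPTBot
Disallow: /

# Opt out of Google's AI training without affecting Search indexing
User-agent: Google-Extended
Disallow: /

# Leave ordinary search indexing alone
User-agent: Googlebot
Allow: /
```

Note the asymmetry this illustrates: because Google exposes a separate Google-Extended token, a publisher can refuse AI training while keeping search visibility, but where a single user agent serves both functions, no such fine-grained choice exists.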

In Part 2, we describe strategies publishers are using to respond; inventory solutions that are emerging to empower them against the threat generative AI represents for their business models; and identify some promising policy solutions that preserve the benefits of the fair use doctrine while maintaining incentives for publishers to keep creating the timely content that AI developers need for their business model – and citizens need to stay informed.