Is There a Middle Ground in the Tug of War Between News Publishers and AI Firms? Part 2: Framing Solutions

The tug of war between online news publishers and AI developers over the role of copyrighted content in AI model training may lead to a more closed internet for everyone. In this two-part blog series, we describe the situation as it’s unfolding and propose policy solutions worth exploring to preserve incentives for publishers to keep creating the timely content that AI developers need and democracy requires. See the first post, “Is There a Middle Ground in the Tug of War Between News Publishers and AI Firms? Part 1: Framing the Problem,” for a discussion of why news publishers fear the impact of AI on their business models and whether those concerns are warranted.

Here in Part 2, we describe strategies publishers are using to respond; inventory solutions that are emerging to empower them against the threat generative AI represents for their business models; and identify some promising policy solutions that preserve the benefits of the fair use doctrine while maintaining incentives for publishers to keep creating the timely content that AI developers and the public need.

Defensive Strategies from News Publishers 

Publishers are using a variety of strategies to try to mitigate the cost and impact of generative AI on their infrastructure and business model. These include the predictable triumvirate of licensing, lawsuits, and legislation. But they also include an increasing array of technical measures and market solutions arising in this fast-moving space. 

Licensing 

The largest news publishers – and an increasing number of small ones – are striking licensing contracts with AI developers. In fact, news organizations are at the forefront of AI content licensing. Most of these are direct voluntary licensing agreements. However, it’s not always clear exactly what is being licensed: publishers may be providing AI companies access to their content (often including paywalled content) to train their models, permission to use their content to train models, or permission to use their content for generating outputs. In exchange, in addition to money, publishers may receive brand attribution or preferential placement, and/or access to technology so they can build their own AI-powered products.

These contracts are not odd exceptions: As of February, there were over 100 confirmed deals involving platforms and publishers, including every major AI developer and over 700 news brands. And the deals are not small: OpenAI’s deal with News Corp is worth $250 million over five years – 2.5 times News Corp’s net income for the past five years. Google’s deal with Reddit is worth $60 million a year. Publishers hope the direct voluntary licensing market reaches as much as $30 billion within the next decade. That compares to an estimated $7 trillion in AI infrastructure costs over the same period.

As we describe below, a court has already determined that there is no legal entitlement to a licensing market. That is, these direct voluntary licensing agreements do not mean publishers are entitled to demand payment from AI firms for something that is a legally acceptable fair use. But the ubiquity and scale of content licensing contracts – and the $3 billion in commitments AI firms have made to news publishers so far – suggest that, for Big Tech and their well-funded partners like OpenAI and Anthropic, there is significant value in striking deals for access to content. The value may come from mitigating the risk of expensive copyright litigation, negotiating access to material that would not be available otherwise, developing new products that wouldn’t be possible without direct cooperation, or even obtaining exclusivity to gain a competitive edge over rivals – and likely some combination of all of the above. Whatever the motivation(s) for AI firms, the existence of so many and such large agreements undermines claims that such markets hold little to no value for creators, or that these deals are too cumbersome or complex to administer.

But the AI ecosystem is already tipping dangerously toward privatization and concentration, rather than being open source, public, or highly competitive, and licensing costs and complexities could become further barriers to entry for new and smaller players. The trick, then, is to extend the mutual benefits of licensing more broadly – to smaller AI developers, a wider pool of publishers, and journalists themselves, including independent journalists – while preventing licensing costs or copyright restrictions from becoming yet another barrier to entry in the AI ecosystem.

Extending these benefits will require technical infrastructure. Recently, a group of technologists – including a creator of the Really Simple Syndication (RSS) standard – and web publishers launched a solution designed to allow AI licensing at scale. Technically, this solution, called Real Simple Licensing (RSL), defines specific licensing terms a publisher can set for their content, so AI firms know what data falls under what terms of access. Real Simple Licensing also entails a collective licensing organization that is empowered to negotiate terms and collect royalties. It provides publishers and AI firms “a single point of contact” for paying royalties and gives rightsholders a way to set terms with dozens of potential licensees at once. Hopefully solutions like this can remove another barrier to entry for new and smaller players. We talk more about collective licensing later in this post.
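
To make the idea concrete, here is a minimal sketch of what machine-readable licensing terms of this kind could look like. It is illustrative only: the field names, values, and the rslcollective.example endpoint are hypothetical stand-ins, not the actual RSL schema.

```python
# Illustrative sketch only: field names, values, and the collective's
# endpoint below are hypothetical, not the actual RSL schema.
from dataclasses import dataclass
from fnmatch import fnmatch

@dataclass
class LicenseTerms:
    path_pattern: str   # which content the terms cover, e.g. "/archive/*"
    ai_training: str    # "licensed", "prohibited", or "free"
    royalty_usd: float  # hypothetical per-use royalty
    collective: str     # where AI firms go to pay royalties

# A publisher declares different terms for different tiers of content.
TERMS = [
    LicenseTerms("/archive/*", "licensed", 0.005, "https://rslcollective.example/api"),
    LicenseTerms("/premium/*", "prohibited", 0.0, "https://rslcollective.example/api"),
]

def terms_for(url_path: str) -> LicenseTerms | None:
    """Return the first declared terms whose pattern covers url_path."""
    for t in TERMS:
        if fnmatch(url_path, t.path_pattern):
            return t
    return None

# An AI crawler can now check, per URL, what access it has and on what terms.
t = terms_for("/archive/2024/election-coverage")
print(t.ai_training if t else "no terms declared")  # -> "licensed"
```

The important design point is the single point of contact: every declaration routes royalties through one collective endpoint, so an AI firm negotiates once rather than separately with each of thousands of publishers.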

One AI firm, Perplexity, has launched a revenue-sharing program designed to pay publishers in a very different way: for content access or usage through its nascent browser, Comet. Revenue from user subscriptions to Perplexity is pooled and divided so that Perplexity keeps 20% and participants in Perplexity’s publisher program share 80%. Revenue is allocated based on direct visits to publishers’ sites by people using Comet; on citations of publisher content in answers to search queries on Comet; and on uses of publisher content by Comet’s AI assistant to complete tasks.
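
The allocation arithmetic is straightforward pro-rata division. Here is a minimal sketch under stated assumptions: the 80/20 split comes from the program description above, while the dollar amounts, outlet names, and usage counts are hypothetical placeholders (the program may also weight the three event types differently).

```python
# Sketch of the pooled revenue-share arithmetic. The 80/20 split is from
# the program description; everything else here is a hypothetical placeholder.
subscription_pool = 100_000.00             # monthly subscription revenue (hypothetical)
publisher_pool = subscription_pool * 0.80  # 80% shared among participating publishers
platform_share = subscription_pool * 0.20  # 20% retained by Perplexity

# Hypothetical per-publisher counts of the three compensated events.
usage = {
    "outlet_a": {"visits": 4_000, "citations": 1_500, "agent_tasks": 500},
    "outlet_b": {"visits": 1_000, "citations": 3_500, "agent_tasks": 500},
}

def events(u: dict) -> int:
    return u["visits"] + u["citations"] + u["agent_tasks"]

total_events = sum(events(u) for u in usage.values())

# Pro-rata allocation of the publisher pool by share of compensated events.
for outlet, u in usage.items():
    payout = publisher_pool * events(u) / total_events
    print(f"{outlet}: ${payout:,.2f}")  # outlet_a: $43,636.36; outlet_b: $36,363.64
```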

Lawsuits

About a dozen of the 48 (and counting) cases brought by content owners against AI companies in the U.S. are from news publishers, either individually or collectively. So far, they are lodged against OpenAI; OpenAI with Microsoft; Perplexity; Cohere; and Stability AI. Each has a distinct set of claims and fact patterns. Most have copyright violations at their core, but some include claims that content has been scraped for training even from behind a paywall or when a website has blocked bots from scraping. Some also claim “verbatim regurgitations and substitutional summaries” of publishers’ content and warn of the potential brand damage from “damaging hallucinations” and misattributions. And the most recent, from Penske, a leading publisher of arts and entertainment news publications, alleges that Google is illegally using Penske’s content to create output in its AI Overviews. The suit, which cites declines in traffic and Google’s “anticompetitive practices [that] will destroy the business model that supports independent journalism,” may come before the same judge who handled the Google search case.

The first two court decisions in cases of this kind have affirmed that AI model training represents a transformative fair use of copyrighted content. The key themes from the two judges’ decisions in Bartz v. Anthropic and Kadrey v. Meta were:

  • AI training is highly transformative and a fair use of copyrighted content in the situations analyzed in the cases.
  • Generative AI outputs are not inherently infringing or derivative of training data, but outputs that are substantially similar to training data might change the fair use analysis. 
  • There is no legal entitlement for publishers to a licensing market, even if one naturally develops. That means content owners cannot sue because they have “lost” licensing revenue. 
  • Content obtained through illegal means may give rise to legal redress (but its use in AI training is still fair use).

On the last point: Anthropic moved to settle the class action copyright lawsuit brought by authors whose books were allegedly pirated for use in its training data, apparently to avoid the mammoth financial exposure a jury decision on damages could carry. The settlement totaled $1.5 billion, amounting to $3,000 per work in the class. However, the judge has sent the settlement proposal back for reworking.

Although both these cases, like the U.S. Senate hearing described in Part 1, pertained to books rather than news (and every case’s legal outcome depends on its facts), publishers may believe these themes extend to news publishers and paywalls. If so, they may foreshadow how publishers will frame future suits to gain redress where warranted. For example, they may focus on outputs (not model training) and try to show that outputs are infringing. Or they may claim that outputs are market substitutes that “flood the market” or create demonstrable financial harm. They may also question the legitimacy of how their content was sourced (e.g., were subscription requirements or paywalls violated, was the content from pirated sources). This approach would appropriately direct the focus away from model training, a highly transformative fair use of copyrighted content. However, it means even successful suits will not address the volume demands or other challenges related to crawling and training on copyrighted content.

Separately, publishers (and others, including Public Knowledge) were disappointed by the judge’s decision on remedies in the U.S. Department of Justice’s lawsuit against Google for its monopolistic practices in search and search text advertising. Among other things, Judge Amit Mehta declined to require Google to provide an “easily useable [sic] mechanism” for publishers to opt out of having their content used in training or fine-tuning any of Google’s generative AI models without affecting their presence in search results. He also declined to prohibit agreements that give Google exclusive access to publisher content.

Legislation

Publishers have advocated for legislative insulation from competition presented by new technologies since decades before the launch of ChatGPT – dating back to the advent of radio and television. More recently, legacy publishers advocated for the “Journalism Competition & Preservation Act” (JCPA), federal legislation to create an antitrust exemption allowing collective bargaining for compensation from digital platforms for (contrary to copyright law) linking to, displaying, or snippeting publishers’ content for search and social media. (For our perspective on and opposition to the JCPA, see our resources page.) There is still active advocacy for the JCPA, but its purported scope – without any changes in its language – is now described as encompassing the use of copyrighted content in the training of large language models. It therefore threatens the fair use right to read, train, and learn under current copyright law.

Publishers’ concerns and their push for legislative solutions for AI have again been met with some sympathetic ears in Congress. In 2024, for example, the Senate Judiciary Subcommittee on Privacy, Technology, and the Law hosted a hearing, “Oversight of A.I.: The Future of Journalism.” The witnesses included representatives from Condé Nast, the National Association of Broadcasters, and News/Media Alliance. One witness noted, in representative comments, that copyrighted content is “being taken [emphasis added] from behind paywalls without authorization for use in AI training, modeling and display.” These models then produce outputs that “contain summaries, excerpts and full verbatim copies” that “compete in the same market, and serve the same audience and the same purpose” as the original work. This witness also asserted, “Because these uses go far beyond the guard rails established by the courts, this is not considered fair use under current copyright law.” Collectively, the witnesses asked for licensing (induced by new law if necessary), along with requirements for transparency, record-keeping (like a searchable database of training content), accountability, and competition.

All the witnesses said they believed current law is on their side. But… if court cases don’t bear that out, Congress should act to “clarify” the law. Per one witness: “If Congress could clarify that the use of [publisher] content for training and output of AI models is not fair use, then the free market will take care of the rest, just as it has for music, film and television, and sports rights. It will enable private negotiations. Collective rights organizations will arise to simplify the licensing of content.”

This stance on the part of publishers – that Congress should step in – stiffens when they are confronted with assertions that AI training is fair use and that this view will prevail in the courts. Meaning: If training is going to be consistently found to be fair use under current law, then publishers believe the law needs to change. Congress – particularly the Senate – is, sadly, entertaining these arguments. There are at least three bills making their way through the relevant committees, all of which have negative implications for access to information. The “Content Origin Protection and Integrity from Edited and Deepfaked Media Act” (COPIED Act) is theoretically focused on helping people understand when they are looking at content that has been created or altered using artificial intelligence. But it also declares that attaching content provenance information to copyrighted content prohibits anyone from using that content to create new works with AI or to train AI models without permission. In our view, giving content creators such control over how others use copyrighted work as the basis of new creative expression inhibits free expression and overrides generations of copyright law.

Meanwhile, the “Transparency and Responsibility for Artificial Intelligence Networks Act” (TRAIN Act) is focused on increasing transparency in how AI models are trained using copyrighted materials. But by creating an administrative subpoena process for copyright owners, it may encourage thousands of nuisance lawsuits against AI developers for what is actually a fair use. Lastly, the “AI Accountability and Personal Data Protection Act” ostensibly focuses on protecting creators and consumers from the misuse of personal data and copyrighted content by AI companies. But it actually creates an unworkable and sweeping opt-in mechanism for copyrighted content that would make the U.S. the most restrictive jurisdiction in the world for developing AI technology. (This kind of opt-in mechanism is even more onerous than the EU’s mandatory opt-out system, which is itself beset with implementation problems.) In addition to these legislative proposals, Senator Josh Hawley (R-Mo.), a member of the Senate Judiciary Committee, has argued there should be property rights assigned to certain types of data, and legal liability for companies that use such data to train their models (cue the Section 230 threats). At least four states have also drafted legislation establishing ground rules for AI and copyright.

Of course, publishers are not the only ones hedging their bets on the courts. In addition to cutting billions of dollars’ worth of licensing contracts, AI developers are campaigning for executive and Congressional action. For example, several asked that AI training be specifically authorized by law in their public comments on the new White House AI Action Plan. (Public Knowledge, in our own comments, noted that the existing copyright framework is sufficiently robust to handle the new dynamics introduced by generative AI.) President Trump expressed a sympathetic position in his speech accompanying the Action Plan, but the official position of the Trump administration is that the courts should decide case by case.

Financial and Technical Barriers

Adopting a defensive crouch against the affronts described above, and not waiting for relief from democratic institutions, some publishers are developing or buying their own solutions. Some are sequestering their content through financial means, such as paywalls and subscription models; legal means, such as changes to their terms of service; and technical means, such as robots.txt and other signaling mechanisms. In fact, as of January, over 88% of top-ranked news outlets in the U.S. were blocking AI web crawlers. But, as noted above, AI developers don’t always respect these signals – and using these methods to block access to content is at odds with the desire and need for open access to information online. This “cat-and-mouse game” between online publishers and AI crawlers makes the internet more closed for everyone.
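
As an illustration of how lightweight – and how purely advisory – the robots.txt signal is, here is a short runnable example using Python’s standard library. The GPTBot and CCBot user-agent tokens are real (OpenAI’s and Common Crawl’s crawlers, respectively); the URL is hypothetical. Nothing in the protocol technically prevents a crawler from ignoring these rules.

```python
# robots.txt is a plain-text request, parsed here with the standard library.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A compliant crawler checks before fetching; a non-compliant one simply doesn't.
for bot in ("GPTBot", "CCBot", "SomeOtherBot"):
    allowed = rp.can_fetch(bot, "https://news.example/articles/today")
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```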

Publishers are also forcing takedowns of some of the technical means developers may be using to access copyrighted content. For example, a couple of months ago the trade association News/Media Alliance, through undisclosed means, secured the takedown of a leading “paywall bypasser,” the website 12ft.io (“12ft Ladder,” designed to scale “10ft paywalls”). The site offered circumvention technology that let users access otherwise restricted copyright-protected content without paying the required fee. Then, a group of academic publishers obtained a subpoena under the “Digital Millennium Copyright Act” (DMCA) to require Cloudflare to turn over identifying user data for several shadow libraries so the publishers could initiate potential legal action against them.

Digital infrastructure companies, spotting opportunities, are entering the space. Cloudflare (a network fronting 20% of the web) will now, by default, block AI web crawlers from accessing its clients’ content without permission or compensation – what it describes as a “permission-based approach” to web crawling. This enforceable opt-in product has been glowingly received by journalism advocates, who see it as securing the consent they believe their statutory rights require. Others point to how it undermines the reciprocal nature of the open web and believe that “the cure may be worse than the disease” in terms of access to information.

Publishers are also turning to technology providers to detect and block automated bots when robots.txt, HTTP headers, and technical barriers don’t work. One example is DataDome, a “professional bot management solution” that detects, analyzes, and blocks unwanted crawlers. It remains to be seen how well these solutions will work.

Tolling and Monetization Mechanisms

If you can’t beat ‘em… make money off ‘em. To further the licensing market (or in anticipation of a statutory one), intermediaries are offering various products to toll and monetize AI crawler traffic. For example, Cloudflare has launched a private beta of “Pay Per Crawl,” a new marketplace where publishers can request compensation from AI companies each time one of their pages is crawled. (In an interview, OpenAI CEO Sam Altman signaled he is open to the idea.) TollBit lets publishers “control access, analyze traffic, and prepare for monetization as the agent economy grows.” ProRata integrates advertising and attribution technologies to enable advertising revenue sharing based on the outputs of LLMs (though so far, only within its own “ethical” search product, Gist). Human Native “bring[s] together suppliers of high quality, premium data with reputable AI developers” to “exchange premium content through a secure and frictionless data licensing platform.” Circling back to the Senate hearing described above, Created by Humans offers a platform with similar features for book authors. And the CEO of GoDigital Media Group, inspired by experience in the music industry, proposes an ambitious “ecosystem” consisting of an ai.txt file that informs AI firms about the copyright status of online content; a public database of provenance information; industry collaboration to resolve questions of consent, credit and compensation; application programming interfaces to connect AI firms to copyright offices; a statutory licensing scheme; and a content licensing collective management organization to facilitate payments.
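
Cloudflare has said the marketplace is built around HTTP’s long-dormant status code 402 (“Payment Required”). The sketch below shows the general shape of such a toll at the HTTP layer; the header names, price negotiation, and amounts are our hypothetical simplification, not Cloudflare’s actual API.

```python
# A hypothetical pay-per-crawl toll: quote a price with HTTP 402, serve
# the page only when the crawler's offer meets it. Header names are made up.
PRICE_PER_CRAWL_USD = 0.01  # price set by the publisher (hypothetical)

def handle_crawler_request(headers: dict) -> tuple[int, dict, str]:
    """Return (status, response_headers, body) for an AI crawler's request."""
    offered = float(headers.get("crawler-max-price", 0))
    if offered >= PRICE_PER_CRAWL_USD:
        # The crawler accepted the toll: serve content and record the charge.
        return 200, {"crawler-charged": str(PRICE_PER_CRAWL_USD)}, "<article>...</article>"
    # Otherwise withhold the content and quote the price.
    return 402, {"crawler-price": str(PRICE_PER_CRAWL_USD)}, ""

# First request discovers the price; a retry that accepts it gets the page.
print(handle_crawler_request({})[:2])                             # (402, {'crawler-price': '0.01'})
print(handle_crawler_request({"crawler-max-price": "0.02"})[:2])  # (200, {'crawler-charged': '0.01'})
```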

Publishers are also working collectively to develop monetization models. For example, the Interactive Advertising Bureau (IAB) Tech Lab has held a series of workshops with publishers, platforms (including Google and Meta), and tech firms focused on AI Content Monetization Protocols (CoMP), a technical framework under which AI firms would compensate publishers based on how often their content shows up in LLM queries. (The IAB Tech Lab prefers a per-user-query model to Cloudflare’s per-crawl model on the premise that it will scale better; the toy comparison below illustrates the difference.) Participation is voluntary for AI firms.
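
All figures in the comparison below are hypothetical; the point is only that a per-crawl toll charges once per fetch no matter how many answers that content later feeds, while a per-user-query model meters actual downstream usage, so the two scale very differently.

```python
# Hypothetical monthly figures for one publisher under the two metering models.
crawls = 50_000            # times AI crawlers fetched the publisher's pages
cited_queries = 2_000_000  # user queries whose answers surfaced that content

price_per_crawl = 0.01     # hypothetical per-fetch toll
price_per_query = 0.0005   # hypothetical per-cited-query rate

print(f"per-crawl revenue: ${crawls * price_per_crawl:,.2f}")         # $500.00
print(f"per-query revenue: ${cited_queries * price_per_query:,.2f}")  # $1,000.00
```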

Options for a Middle Ground in the Tug of War Between News Publishers and AI Firms 

Given their increasing codependence and shared, mutually beneficial role in ensuring open access to information, it’s critically important to find middle-ground solutions that support the business models of both online publishers and AI developers. Too many of the legal, legislative, and commercial solutions being pursued have negative implications for creative expression, access to information, and an Open Internet for everyone. What follows is an inventory of directions Public Knowledge believes may rise to the occasion. Each requires more assessment, technical examination, and policy analysis, but we see these as promising directions that policy can support or require.

Mutual and Voluntary Signaling Mechanisms

One path that leverages the traditional norms and reciprocal benefits of an Open Internet is strengthening signaling preferences for publishers and AI firms. For example, the Internet Engineering Task Force (IETF) – the organization that sets technical standards for the internet’s protocols – has convened a working group to “standardize building blocks that allow for the expression of preferences about how content is collected and processed.” Creative Commons also recently announced the CC signals project, a new preference signals framework “designed to increase reciprocity and sustain a creative commons in the age of AI.” CC signals will allow dataset holders to signal their preferences for how their content can be reused by machines, based on a set of limited but meaningful options shaped in the public interest. CC signals are both a technical and legal tool, but they are based on a social proposition: “a call for a new pact between those who share data and those who use it to train AI models.” Importantly, CC signals are intended for use cases where the publisher wants to allow reuse, but with terms attached.

Another example is Spawning AI’s Do Not Train Tool Suite, which allows publishers to signal data use preferences and consents to AI developers in the form of a Do Not Train registry that AI firms can commit to honor (Stability and Hugging Face already have). It is backed up by Have I Been Trained, a website that lets creators find out if their content has been used to train particular AI models. These tools, too, are designed to update and refresh the “pact” that mutually benefits content owners and digital platforms. Of course, any of these solutions must overcome the objection that scrapers do not consistently honor these signals. Perhaps policy could be used to advance solutions like these, which give greater agency to publishers to decide how their content is used to train AI models while preserving the reciprocal nature of the open web.

Data Cooperatives and Collectives

Another path worth exploring is that of data cooperatives, collectives, commons, and trusts, all of which are premised on the idea of data as a collective resource conducive to governance models designed for shared benefit. Project Liberty has articulated a vision for cooperatives under shared ownership that provide shared benefits to those who contribute quality, trustworthy data for use in ethical AI development. It also raises the question of how policy can be used to support more cooperative models and shape fairer markets. Data collectives may represent a promising solution for small publishers who lack the archives, scale, resources, or expertise to negotiate with – or even be found by – AI firms.

RadicalXChange, a nonprofit dedicated to “democratic innovation and institutional design,” advocates for “data dignity” and the ability of individuals to “exert democratic collective bargaining power over their data.” Its model, supported by OpenMined’s remote data science software, may provide a framework under which publishers can negotiate for joint decisions controlling the use of their data and appropriate compensation.

All of the mechanisms we’ve discussed so far may also provide AI firms with a source of competitive differentiation: the ability to market their AI products as being lower risk and higher integrity to users because they obtain informed consent from content creators or owners before using their works to train AI models. This may be backed by certification: for example, Fairly Trained seeks to develop “…a badge that consumers and companies who care about creators’ rights can use to help decide which generative AI models to work with.”  

Self-Identification Standards for Bots and Crawlers

Some of the challenges publishers are wrestling with – new traffic from bots and crawlers, and the loss of control over access to their own content, as described above – could be addressed by statutorily enforcing a set of self-identification standards built on a unique identifier for each bot crawling content. The identifier would play a friction-creating gatekeeper and census-taker role for publishers. In our view, this is highly preferable to commercial solutions that actually block crawlers, which are less compatible with an Open Internet. It also removes the risk of intermediaries with monetization power extracting value and consolidating power. For publishers, it would negate the need to get a court subpoena (as the TRAIN Act would require) to understand who has accessed their content for AI training or grounding.
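
Here is a minimal sketch of how such a standard could work, assuming a public registry of bot identifiers. The registry contents, the Bot-ID header, and the flow below are hypothetical, and real proposals (such as IETF work on authenticating web bots) rely on cryptographic signatures rather than a bare identifier that could be spoofed.

```python
# Hypothetical self-identification flow: gatekeep on a Bot-ID header,
# keep a census of who crawled what. All identifiers here are made up.
KNOWN_BOTS = {
    # identifier -> (operator, declared purpose)
    "bot-id-7f3a": ("ExampleAI Corp", "model training"),
    "bot-id-91c2": ("SearchCo", "search indexing"),
}

crawl_log: list[tuple[str, str, str]] = []  # the publisher's "census"

def admit_crawler(headers: dict, path: str) -> int:
    """Refuse unidentified bots; log admitted crawls for the publisher."""
    bot_id = headers.get("Bot-ID")
    if bot_id not in KNOWN_BOTS:
        return 403  # friction: no identifier, no access
    operator, purpose = KNOWN_BOTS[bot_id]
    crawl_log.append((operator, purpose, path))
    return 200

print(admit_crawler({"Bot-ID": "bot-id-7f3a"}, "/news/today"))  # 200
print(admit_crawler({}, "/news/today"))                         # 403
print(crawl_log)  # who accessed what, and why – no subpoena required
```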

Extended Collective Licensing

In the third, “interim” installment of its series of reports on AI training and copyright, the U.S. Copyright Office described extended collective licensing (ECL), a system under which a collective management organization (CMO) is authorized to license the works of its members, as well as those of non-members, under certain conditions. Unfortunately, the discussion of extended collective licensing appeared to be rooted in an assertion that whether or not a source work was initially accessed with the copyright owner’s permission has a bearing on whether the use is fair. We believe this assertion (along with a few others in the report) is highly erroneous. That said, if the direct voluntary licensing market stalls, ECL has the potential to extend the benefits large publishers are securing to smaller, local, or diverse news organizations that don’t have the resources or the expertise to engage in negotiations – or the scale to attract licensing contracts. But it has benefits for AI firms, too. As the Copyright Office noted, “A CMO’s centralized infrastructure can also provide for streamlined transactions and efficient ongoing licensing administration, reducing overall costs for owners and users alike.” Even though unlicensed training is likely a fair use, a voluntary ECL regime may be structured to provide other benefits – such as safe harbor assurances – for AI firms. In other words, AI firms that work through the collective management organization to access and compensate publishers for their content would be insulated from the risk of litigation by participating publishers.

Extended collective licensing typically involves a CMO being authorized to license all of the copyrighted works within a particular class for specific uses. There is a lot of wiggle room in designing a CMO, but in the context of an AI-oriented CMO, all copyright owners in that class would be bound to its terms unless they opt out and choose to negotiate separately. Unlike compulsory licenses, with rates and terms set by the government, the licenses issued under an ECL system are negotiated with users in the free market. In this way, an ECL system functions like voluntary collective licensing, but with the government regulating the overall system and exercising some degree of oversight. Just as the witnesses at the Senate hearing described above pointed to precedents from other industries, extended collective licensing may provide a model based on precedents from the music industry. That means it also has some significant hazards, which we will explore further in additional research.

Statutory Safeguards for Public Interest Uses

In many of the policy debates about AI developers versus content rightsholders, it is Big Tech companies and commercial AI companies that feature in the narrative. But there is a large and growing community of model developers and AI researchers outside of that sphere: academic researchers at universities and national labs; nonprofit AI auditors and developers; true open-source developers; cultural heritage institutions like libraries, archives, and museums; and many more. These developers work toward the public interest, and in many of the legal and legislative paths outlined above, these public interest uses could be unfairly disadvantaged. Raising the cost of accessing or using copyrighted information for AI training could lead to a situation in which only well-resourced, for-profit tech firms can access critical training data – and they will likely choose to partner only with the largest news publishers with the most national reach and the richest archives. That would be a disastrous loss for research and the common good, and there should be specific statutory safeguards, supplementing and supporting fair use arguments, that give clear protection to these public interest uses.

There is a right way and a wrong way to go about this. In the EU, there are explicit text and data mining (TDM) exceptions to copyright law which likely apply to AI data collection and training for both scientific research and general (including commercial) purposes. Wildly oversimplified, that means AI firms can conduct text and data mining on lawfully accessible copyrighted works, but rightsholders can opt out of the general exception through a reservation of rights. Those legally enforceable opt-outs enable many of the licensing, tolling, or other restrictive mechanisms discussed above. Therefore, in an attempt to solve the public interest use challenge, the EU does not require research organizations or cultural heritage institutions to abide by the opt-outs, allowing them to freely conduct TDM for research purposes, including model training. Unfortunately, in practice, many public interest organizations and projects are having difficulty ensuring that their uses and projects will be covered – and in the meantime, the legal risk of copyright liability is too great a burden to take on. This is instructive as the U.S. charts its own course: we need broadly inclusive, legally certain, strongly defensible public interest protections to be included in any regime that aims to increase the cost of or restrict the ability to use copyrighted content.

Conclusion 

We have outlined a number of potential solutions that may provide a middle ground between AI firms and publishers in the public interest. But all of them have potential risks and downsides as well as benefits. Our next steps are to consult with stakeholders, survey existing and developing variations on these solutions from other sectors, examine the relevant legal frameworks, learn from legal regimes abroad, and develop a policy position that addresses publishers’ challenges while allowing AI innovation, healthy information systems, and an Open Internet.