Last week saw two landmark fair use decisions on artificial intelligence from the Northern District of California: Judge William Alsup’s decision in Bartz v. Anthropic and Judge Vince Chhabria’s ruling in Kadrey v. Meta. These cases represent the first substantive legal decisions addressing copyright issues involving training large language models (LLMs) on copyrighted texts. This set of issues is being actively litigated in a number of cases besides these two, so litigants, policymakers, and the public have been waiting expectantly for results to start rolling in.
The headline is simple: both hold that AI training on copyrighted works is a protected fair use.
But nothing is simple when it comes to AI. To call the public debate around AI and copyright “contentious” would undersell the disruptive potential and intensity of feeling at play. Yet somewhere between reflexive revulsion and uncritical utopianism lies the perspective that using copyrighted works to train AI systems is a significant, protected, non-copyright-infringing activity. Not because big, well-financed AI companies need a hand-out, but because the alternative would mean threatening our rights to read, learn from, and build upon the works of others – and massively concentrating power over AI into the hands of vast media conglomerates and Big Tech.
For those of us who champion our rights to read, learn, and share knowledge, these decisions support essential elements of the arguments we have been presenting from the start. Both acknowledge that AI training is highly transformative, and that under the facts of their cases, it is a protected fair use of the underlying copyrighted works in the dataset. Both follow established precedents when evaluating model outputs for infringement. Both reject circular arguments related to theoretical licensing markets for AI training. And, finally, both decisions reject the conflation of how a given work is obtained with how it is used, thereby protecting innovative and transformative uses while preserving liability and accountability for irresponsible data practices that fall within widely understood copyright infringement territory.
But nothing is simple when it comes to the law, either. Before analyzing these four points of similarity further, there are three important limitations to flag.
First, fair use decisions are highly fact-specific, and the quality of judicial analysis can vary widely from judge to judge. That makes lower court decisions like these challenging to generalize from. Yet these rulings, as the first on these issues, provide an important legal baseline for future AI and copyright disputes, especially as they share some key conclusions. In such an uncertain and new area, any decision provides useful guidance about how to stay on the right side of copyright law.
Second, it is also significant that these two decisions came on summary judgment motions, not after a full trial. On summary judgment, all disputed facts and inferences are construed against the moving party (in both cases, the AI company), so these decisions take everything in the light most favorable to the authors. In some sense, that bolsters the strength of the legal conclusions, but it also means that these decisions are not the best source for learning about how courts understand the technology or the factual landscape around AI – and how they will apply these legal principles going forward.
Third, Judges Alsup and Chhabria approach the fair use analysis somewhat differently, and they also seem to hold different attitudes towards AI as a technology. That makes it difficult to compare their reasoning one-to-one. But this diversity may strengthen our extrapolations: despite the judges’ diverging views on copyright and technology, these decisions share significant common ground. Understanding the similarities between the two decisions is especially important in light of their differences and the different factual contours of each case. Extracting the consistent conclusions from these cases points the way towards future legal outcomes and shows where key areas of uncertainty or confusion remain.
Similarity 1: AI training is transformative.
First and foremost, both decisions overwhelmingly conclude that LLM training is a highly transformative use of the underlying works. This is the first factor of the four-factor fair use test, and both courts weighed it decisively in favor of AI training. Judge Alsup calls Anthropic’s use of books “spectacularly transformative,” noting that “[l]ike any reader aspiring to be a writer, Anthropic’s LLMs trained upon works not to race ahead and replicate or supplant them – but to turn a hard corner and create something different.” Similarly, Judge Chhabria – although ultimately giving the most weight to factor four, the effect on the market for the original work – agrees that “[t]here is no serious question that Meta’s use of the plaintiffs’ books had a ‘further purpose’ and ‘different character’ than the books – that it was highly transformative.”
The Supreme Court tells us that the “central purpose” of factor one is to see whether the secondary user is superseding the original or adding something new, with a further purpose or different character. When an LLM converts books into multidimensional vector mathematics so it can compose new, coherent text on a range of topics, it is doing the latter – just as the Google Books corpus transformed scanned pages into a searchable index. In First Amendment terms, fair use exists to keep copyright from chilling learning, research, and criticism; both courts emphasize that letting machines learn from texts is an extension of humans’ age‑old right to read, think, and synthesize.
Therefore, even though transformative use is only one factor among four, it is often the one that goes to the core of how fair use fulfills copyright’s First Amendment-sensitive goal: to promote the progress of knowledge and public access to information – not to lock up expression indefinitely.
Similarity 2: Outputs matter.
The plaintiffs tried to depict LLMs as massive plagiarism engines that memorize entire books and output statistical collages of the training data. The record did not back that up. Because the decisions were made on motions for summary judgment, both judges accepted the premise that the models memorized the plaintiffs’ works. But Judge Alsup concluded that even if an LLM holds a compressed representation of the books on which it was trained, what matters is whether the service delivers infringing text to the public – and the plaintiffs in the Anthropic case did not contend that this had happened. In the case against Meta, after extensive adversarial testing, Judge Chhabria found that Meta’s LLM did not regurgitate more than 50 tokens from any of the plaintiffs’ works – and could do even that only 60% of the time under deliberately coaxing prompts.
Both decisions therefore clearly connect the permissibility of memorization and learning with the question of model behavior and outputs. A small amount of verbatim overlap may occur (just as a student might quote a line in an essay), but unless the system is designed or permitted to regurgitate large sections of content (such that it becomes a tool for accessing the training data in a meaningful way), there is no harm. Neither minimal reproduction of texts nor stylistic mimicry raised serious copyright concerns.
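To make the 50-token figure concrete, here is a minimal, purely illustrative sketch – not drawn from either opinion or from any party’s actual testing methodology – of how one might measure the longest verbatim token run shared between a model’s output and a source text. The whitespace tokenizer and the sample texts are assumptions for illustration; real evaluations would use the model’s own tokenizer and far longer documents.

```python
# Illustrative sketch only: measuring verbatim token overlap between a
# source text and a model's output, in the spirit of the "50 tokens"
# threshold discussed in Kadrey. Whitespace tokenization is a stand-in
# for a real model tokenizer.

def longest_verbatim_run(source: str, output: str) -> int:
    """Length (in tokens) of the longest contiguous token sequence
    appearing in both `source` and `output`."""
    src = source.split()
    out = output.split()
    # Classic dynamic-programming longest-common-substring, over tokens.
    best = 0
    prev = [0] * (len(out) + 1)
    for i in range(1, len(src) + 1):
        curr = [0] * (len(out) + 1)
        for j in range(1, len(out) + 1):
            if src[i - 1] == out[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

if __name__ == "__main__":
    book = "it was the best of times it was the worst of times"
    reply = "the model said it was the best of times and then moved on"
    run = longest_verbatim_run(book, reply)
    print(f"longest verbatim run: {run} tokens")  # 6 tokens in this toy example
    print("exceeds 50-token threshold" if run > 50 else "below 50-token threshold")
```

A quoted line in an essay would register as a short run like the one above; a model that emits pages of a book verbatim would blow far past any such threshold, which is the behavioral distinction both courts cared about.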
These decisions clearly indicate that creating generalized models that avoid regurgitating copyrighted material is both possible and an important path to avoiding copyright liability. That path does carry a concerning risk, however: commercial model developers may adopt maximalist output filtering to minimize liability. But these decisions stand just as easily for the conclusion that outputs sufficiently limited in verbatim content, interspersed with commentary, or otherwise obviously transformative should also be permissible. Strategic litigation avoidance might chill users’ access to tools capable of the full scope of First Amendment-protected, transformative fair use.
Similarity 3: No entitlement to a licensing market.
A centerpiece of the plaintiffs’ strategy in both cases was to claim that permitting AI training as a fair use would harm the nascent, potentially very lucrative, market for licensing books as AI training data. Significantly, both courts soundly rejected the notion that copyright holders have an inherent right to license their works specifically for AI training. As Judge Chhabria explains:
“In every fair use case, the ‘plaintiff suffers a loss of a potential market if that potential [market] is defined as the theoretical market for licensing’ the use at issue in the case… Therefore, to prevent the fourth factor analysis from becoming circular and favoring the rightsholder in every case, harm from the loss of fees paid to license a work for a transformative purpose is not cognizable.”
Judge Alsup similarly found no obligation to protect the plaintiffs’ interest in such a licensing market, stating definitively that “such a market for that use is not one the Copyright Act entitles Authors to exploit.”
As both a component of the fair use analysis and as a broader comment on the steady drumbeat from media industries about their entitlement to such a market, this is a resounding rebuke. It connects back to transformative use: requiring licensing for transformative uses would violate the very core of the fair use doctrine. Fair use inherently contemplates use without permission because it is designed to ensure that a copyright holder cannot exercise control over uses “sufficiently orthogonal” to their own, thereby freeing up the work to “promote the Progress of Science and useful Arts,” as the Constitution requires.
Similarity 4: Shadow libraries complicate, but do not poison, fair use.
A big part of both cases, warranting its own separate analysis, concerns copyright questions about how these companies acquired the books they used for training and how they used them. In both cases, the companies downloaded and made use of shadow libraries (unofficial pirate repositories of digital books and articles) to train their models.
In Kadrey, plaintiffs argued that the court should not condone training an LLM on copies of books from a shadow library because, in part, it would support or incentivize other model developers to use shadow libraries as well. In Bartz, plaintiffs similarly argued that the use of pirated works precludes a finding of fair use for the subsequent training.
Yet, in both cases, the judges held that using shadow libraries as a data source did not poison the fairness of the training process – though that does not get companies off the hook for run-of-the-mill book piracy. Judge Alsup is careful and clear in separating out his discussion of specific uses, with a parallel analysis of each factor for each use, while Judge Chhabria weaves everything into one fair use analysis; both, however, clearly uphold fair use for training even where pirated books were involved. As Chhabria explains, “plaintiffs again beg the question – whether LLM developers should have to pay for the books they use as training data is the issue addressed in this opinion.”
The judges were not persuaded that the eventual highly transformative fair use of the shadow library works for AI training excused potential copyright liability for the very act of acquiring, keeping, or sharing (even inadvertently) those works when alternatives for purchase or lawful access existed. Judge Alsup expressed doubt that “any accused infringer could ever meet its burden of explaining why downloading source copies from pirate sites that it could have purchased or otherwise accessed lawfully was itself reasonably necessary to any subsequent fair use… Such piracy of otherwise available copies is inherently, irredeemably infringing even if the pirated copies are immediately used for the transformative use and immediately discarded.”
Note, however, that Judge Alsup is not asserting that the plaintiffs are always entitled to direct payment or compensation. Anthropic argued that, in theory, it could have found a reference library willing to loan copies for free. Anthropic also ultimately solved its book supply problem by spending millions of dollars buying and digitizing millions of used print books, and it argued it could have done the same for the pirated books. Judge Alsup does not dismiss these ideas, but neither does he excuse the piracy, noting pithily: “But Anthropic did not do those things – instead it stole the works for its central library by downloading them from pirated libraries.”
Ultimately, both cases require additional fact-finding about the potential piracy before liability can be decided, but for now each company was found to have made some use of the shadow library works that was not protected by fair use. What we can extract clearly is that how you access training data, and what else you do with it, is critically important.
One Key Difference: Market dilution theory.
The biggest point of departure between these two cases is on the fourth factor: the effect of the use upon the market for the original work. Judge Chhabria places disproportionate weight on this factor, calling it “the single most important element of fair use,” in contrast to Judge Alsup’s more balanced weighing of all four factors.
Judge Chhabria’s emphasis on this factor manifests in sharp dismissals of the two theories the plaintiffs actually advanced, followed by a lengthy, speculative, nonbinding discussion of his favored theory for factor four: market dilution – the idea that AI-generated content might flood markets and harm demand for original human-authored works. Although Judge Chhabria acknowledges that this theory lacks evidentiary support or legal precedent, and that the parties did not even advance it, his discussion is notable because of the U.S. Copyright Office’s tentative, but similarly ungrounded, endorsement of the dilution concept in its recent policy report on AI training.
Judge Alsup also addresses this theory, but soundly rejects it – and rightly so. He writes:
“Authors’ complaint is no different than it would be if they complained that training schoolchildren to write well would result in an explosion of competing works. This is not the kind of competitive or creative displacement that concerns the Copyright Act. The Act seeks to advance original works of authorship, not to protect authors against competition.”

This account of the “market dilution” argument maps closely with how copyright has always functioned, and is the understanding supported in the caselaw. As the Supreme Court recently affirmed in Warhol, “the fourth factor focuses on actual or potential market substitution.”
The distinction between market substitution and market dilution is key: substitution is the long-standing test for harm under the fourth factor, and it looks to whether the secondary work replaces the original in the market. In Campbell v. Acuff-Rose Music, the Supreme Court specifically noted that “creative works can compete with other creative works for the same market, even if their appeal is overlapping.”
More could be said about this nascent market dilution doctrine, but for now it remains a fringe and unsubstantiated theory. Still, the fact that Judge Chhabria chose to include it, unprompted, signals that it bears watching. We can expect plaintiffs (not only in AI cases, but in copyright cases generally) to seize on it in the hopes of normalizing it.
Looking Ahead on Policy: Public Datasets and Expanding Access.
These cases are instructive about how future litigation is likely to develop, and indicate clear areas of strength in the legal arguments that AI training is protected by fair use.
The decisions’ rebuke of these companies over data acquisition and handling, along with their relatively detailed analysis of fair use as applied, points the way to a clear policy need that is not contingent on the courts: developing robust, lawfully assembled, publicly accessible datasets to expand access to data for AI training. To truly meet the promise of promoting the “Progress of Science and useful Arts,” we need a diversity of AI developers and researchers, but not everyone will have millions to spend buying their way into a digital library as Anthropic did. These cases are potentially huge wins for big companies like Anthropic and Meta, but just as fair use is meant to liberate ideas for use by all, AI training should not be left solely in the hands of those with the money to pay for access to knowledge. Shared, public datasets offer a practical, efficient, and legally sustainable way to bolster innovation and democratize AI development.
Conclusion
This is a rough moment for the rule of law in many areas, but shockingly, our flawed but flexible copyright system seems to be working in this instance. That’s not to say that AI companies should expect wins all along the way: Ironing out the permissible boundaries in this new area may result in some adverse decisions for companies engaged in AI research, especially when they become reckless or take shortcuts. But that is the risk of innovation Silicon Valley companies love to congratulate themselves for taking on. There will be more decisions, but regardless of how they come out, these early judicial efforts illustrate the value of careful, fact-grounded, and fair analysis. As society navigates this transformative technological moment, there is genuine reason to be hopeful about maintaining a healthy balance between copyright protection and the forward march of progress.