As tools like ChatGPT, image generators, and other AI systems rapidly enter the mainstream, they’ve also ignited heated debates about copyright, fair use, and the future of creativity. Central to the conversation is a question that keeps resurfacing: is training AI on copyrighted works the same thing as piracy?
Some policymakers want you to believe the answer is yes. To them, when an AI system ingests text, images, music, or code from copyrighted sources, it is no different from downloading a pirated movie from an illegal torrent site. But legally and practically, that comparison doesn’t hold up. While both involve copyrighted works in some way, they fall into entirely separate categories of use under copyright law. Understanding the difference matters, not just for copyright lawyers, but for anyone who cares about creativity, innovation, and how we set rules for emerging technologies.
Piracy: Clear-Cut Violations
First, let’s define piracy in the copyright context. Piracy is not a legal term; it simply refers to obvious instances of copyright infringement: the unauthorized reproduction, distribution, or public performance of copyrighted works without permission from the rights holder. That could mean selling bootleg DVDs on a street corner, running an illegal streaming website (merely watching one raises separate questions), or downloading the latest hit album from a peer-to-peer service without paying for it.
Colloquially, the key features of piracy are pretty straightforward:
- Wholesale unauthorized copying or distribution – The pirate makes complete, exact or near-exact copies of a copyrighted work without permission.
- Market substitution – The pirate’s actions provide consumers with the copyrighted work in a way that directly competes with legitimate sales or licenses.
- Commercial or personal use – Piracy can happen whether someone sells bootleg copies for profit or downloads them for free; the common thread is obtaining the work in order to enjoy or use it.
- No obvious fair use defense – The use isn’t transformative, involves the entire work, and serves no clear public benefit such as education.
Piracy is illegal because it infringes the exclusive rights of copyright holders to reproduce, distribute, or publicly perform their work. It is especially unambiguous in instances of commercial piracy, or where individuals violate copyright to avoid paying for commercially available works for personal enjoyment. If you burn 100 copies of a movie and sell them on the street, that’s piracy, plain and simple. If you torrent the latest Marvel movie to avoid buying a Disney+ subscription, that’s piracy, plain and simple.
AI Training: A Transformative Use
Now compare piracy (which, again, always involves making unauthorized exact or near-exact copies) to how AI systems are trained. Training a large language model (LLM) or image generator involves feeding vast amounts of data such as text, images, audio, or video into a machine learning system. The system processes these works to detect patterns, relationships, and statistical structures in the data.
The critical point: AI models do not store or distribute the copyrighted works they are trained on in the way pirates do.
Instead, the training process works more like this:
- The system ingests the training data and converts it into mathematical representations (vectors, weights, and parameters).
- The original works are not preserved in the final model. Instead, the system captures statistical information about how words follow one another, how images are composed, or how sounds combine into melodies.
- Once training is complete, the model generates new outputs based on probabilities, not by reproducing specific works or parts of them.
Training almost always requires making temporary copies of copyrighted works along the way. But the end product is not a copy of those works; it is a statistical model capable of generating new, original content.
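To make this concrete, here is a toy sketch of the principle. Real LLMs use neural networks, not the simple bigram counting below, so this is illustrative only, but it shows the key property: after training, the model holds word-transition statistics, not the training sentences themselves, and it generates output by sampling from probabilities.

```python
import random
from collections import Counter, defaultdict

def train(corpus):
    """Ingest text and keep only word-transition counts.

    The sentences themselves are discarded; only statistical
    structure about which word tends to follow which survives.
    """
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, start, length=5):
    """Produce a new sequence by sampling from learned probabilities."""
    out = [start]
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:
            break
        words, weights = zip(*followers.items())
        out.append(random.choices(words, weights=weights)[0])
    return " ".join(out)

# Two tiny "works" stand in for a training corpus.
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
model = train(corpus)
print(generate(model, "the"))  # e.g. a novel mix like "the cat sat on the rug"
```

The model object contains entries like `model["sat"]["on"] == 2`; nowhere does it store either source sentence verbatim, which is the distinction the bullet points above describe.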
This distinction matters enormously under copyright law. Copying for training is transformative: it uses the works for a fundamentally different purpose from the original, much like indexing websites for search engines or scanning books for text analysis.
Fair Use
The legal doctrine that governs this distinction is fair use, a cornerstone of U.S. copyright law. Fair use allows certain unlicensed uses of copyrighted works when those uses are socially beneficial, transformative, and do not undermine the market for the original.
Courts weigh four factors when assessing fair use:
- Purpose and character of the use – Is the use transformative, and does it add new meaning or purpose?
- Nature of the copyrighted work – Is the work factual or creative?
- Amount and substantiality used – How much of the work is used, and is it reasonable for the purpose?
- Market effect – Does the use substitute for or harm the market for the original work?
AI training clearly checks many of these boxes in favor of fair use. The purpose of an AI developer making copies of copyrighted works during training is not to enjoy the expressive value of the work: it is to extract information about the composition of language, images, or sound, or about the relationships between ideas. The works are transformed into statistical data, not consumed as creative expressions. And training does not typically compete with the market for the original works, since people don’t use an AI system as a substitute for buying the latest novel or film.
This makes AI training much more like recognized fair uses in past cases:
- Authors Guild v. Google (2015) – Google scanned millions of books to make them searchable. Courts ruled this was fair use because it was transformative and did not replace the market for the scanned books.
- Kelly v. Arriba (2003) – A search engine’s use of copyrighted images as thumbnails was found to be fair use because it served a different function from the original artworks. Again, the use was transformative: to make content findable, not to substitute for it.
- Sony v. Connectix (2000) – Intermediate copying of protected code during reverse-engineering to build a non-infringing emulator was fair use. Here, making a copy of copyrighted material just to learn from a work’s unprotectable aspects can be transformative.
Two recent court decisions, Bartz v. Anthropic and Kadrey v. Meta, affirmed that AI training is a transformative fair use of copyrighted content. They also illustrate how AI training differs from piracy: in both cases, the AI companies had downloaded pirated books to train their systems. Both decisions found that using those materials for training did not negate fair use. However, the companies remained potentially liable for copyright infringement for the separate act of acquiring and keeping pirated books that they could have procured through legal means.
Addressing Common Counterarguments
“AI Outputs Can Resemble the Training Data.”
Critics worry that AI can sometimes reproduce near-verbatim excerpts of training material. While this can happen, typically only with substantial effort, traditional copyright law is well equipped to address it. An infringing output is just that – infringing. Courts don’t need to get into the guts of training or make evaluations about models in their entirety in order to apply basic infringement analysis to specific instances of infringement.
“Creators Deserve Compensation.”
Some argue that even if AI training is fair use, it feels unfair for creators not to be paid when their works are used as data. This is a legitimate (and ongoing) policy discussion, but it’s distinct from the legal question of piracy. The law allows fair use even when it doesn’t involve licensing or payment, because copyright has always been balanced against the public interest in innovation and free expression. Copyright is not an absolute right to control every use of a work, and it is important to preserve the distinction between piracy, an obvious violation of existing rights, and the genuinely complex questions raised by uses like AI training.
“They Should Just License the Works.”
Some argue that AI companies should simply license copyrighted works for training, but this overlooks how copyright law has long acknowledged (and even protected) unlicensed uses that serve the public good. Just as search engines don’t need permission to index the web and researchers can mine data without licensing every journal, AI training is transformative, non-substitutive, and provides broad benefits. A voluntary direct licensing market is already developing between AI firms and publishers. But forcing all developers to license potentially billions of works across creative sectors would be unworkable as the only pathway to AI development, and could give a few Big Tech companies outsized gatekeeping power, stifling new and transformative technology. Fair use exists to ensure copyright doesn’t become an absolute veto over socially valuable uses of information.
Conclusion
Words matter. Calling AI training “piracy” or “stealing” or “theft” may be rhetorically powerful, but it is legally inaccurate and dangerously misleading. Piracy is an obvious kind of copyright infringement: clear unauthorized copying and distribution that substitutes for original works, with no higher purpose than making a buck off the unaltered work of others. AI training, by contrast, is a transformative process that extracts statistical information from works without substituting for them, yielding a completely new piece of technology as the end product.
Copyright law, through fair use, has long recognized the importance of allowing transformative uses that enrich the shared commons of knowledge, creativity, and technology. Search engines, digital libraries, scientific research, fan fiction, and even YouTube reaction videos all rely on the principle that not every use requires permission, and our society has benefited immensely from it. Whether AI training’s social value ends up closer to that of a mountain of reaction videos or of a dazzling scientific breakthrough remains to be seen, but it certainly belongs in the tradition of permissible transformative uses.
The real debates about AI and copyright – how to ensure transparency, accountability, and fair compensation for artists – are worth having. But we can’t have them productively if we start with a flawed premise. Training AI is not the same thing as piracy. Understanding that distinction is the first step toward building a copyright system that both protects creators and enables innovation.