Progress in artificial intelligence seems to happen, to paraphrase Ernest Hemingway, in two ways: gradually, then suddenly. A slow trickle of developments, punctuated by surprising bursts of astonishing new capabilities, has characterized the AI news cycle for as long as it has existed.
The latest burst of sudden developments relates to generative AI, or GAI — machine learning systems that are capable of outputting new material. GAI is a broad, descriptive term for what a system does, not a particular technology or approach to AI system design. GAI is being used to produce writing, conversation, art, music, video, and more. Systems like these have existed for years, developing gradually, but suddenly systems like ChatGPT, Stable Diffusion, and Midjourney are producing remarkably high-quality content that rivals human work.
We now need to develop policies to ensure that the benefits of these new technologies are broadly and equitably shared, while protecting the existing openness and vitality of the internet and our creative communities.
Unlike existing discriminative AI systems (those that analyze, sort, and process data), GAI raises unique concerns, especially about how human work, including the work used in training datasets, is recognized and valued. For photographers, visual artists, and musicians — especially freelance and independent creators who already struggle with economic precarity — the prospect of a flood of AI-generated content that rivals their work represents real economic risk. The people on the front lines of this disruption right now are artists and musicians, but thanks to large language models like ChatGPT, Sydney, and Bard, this wave of change could soon extend to many kinds of knowledge workers — especially if LLMs’ ability to go beyond simple text generation towards acting as independent agents continues to develop.
Concern regarding this disruption has people casting about for solutions to slow, mitigate, or prevent the potential harms of AI systems. One approach that is being considered — and litigated in two high-profile cases (see Andersen v. Stability AI and Getty Images v. Stability AI) — is how to apply the United States’ existing copyright system to limit GAI systems.
This could be a disastrous mistake. Applying our existing copyright laws to our best understanding of GAI systems leads to the conclusion that while this technology is new, and there may be flaws in design and execution that create instances of copyright infringement, its core elements and essential ideas do not run afoul of copyright law. Most importantly, reinterpreting or expanding copyright specifically to use it as a bludgeon against GAI will instead threaten the openness and accessibility of the internet, throw up barriers to creativity, and primarily benefit the biggest publishers, distributors, and content creators while leaving small and independent creators to struggle with economic precarity and disruption.
Many artists, creators, and consumers have already adopted stances on GAI ranging from skepticism to outrage. It is thus not enough to merely state the above conclusions; they must be backed up with informed and accessible public analysis that breaks down the copyright law, technology, and implications of proposed solutions.
Let’s start with copyright. Copyright is not a magic spell that can be invoked anytime someone interacts with a creative work; it is actually a bundle of specific, enumerated (and limited) rights. These rights vary slightly based on the nature of the work, but generally include the rights to control the reproduction, distribution, and public display of the work, as well as to authorize derivative works. Copyright only grants the holder the legal ability to defend those specific rights, and is further limited by other legal doctrines like “de minimis” usage (meaning so little of a work was used that there is no infringement) or fair use.
Turning to the technology, it is useful to analyze the copyright issues related to GAI systems in terms of inputs and outputs. GAI systems have three kinds of inputs: the training datasets; the training process that creates the machine learning model; and user inputs, including text prompts, uploaded images, or other user-uploaded or linked reference material. From all of those inputs, the GAI system generates an output: text, images, audio, video, and so on.
GAI systems are built on huge datasets. These vast swathes of data are used for training the machine learning models because more data results in more useful, well-rounded models; the more data the system can learn from, the better it gets. Most GAI systems use training datasets assembled through the crawling and scraping of the publicly accessible internet.
This isn’t a new practice or novel legal issue. Instances of commercial web crawling have been challenged — and consistently found permissible — under the Computer Fraud and Abuse Act. That’s because commercial web crawling is largely geared toward the collection of data and statistics, not the full copyrighted content of a website, so commercial web scrapers tend to avoid infringing the reproduction right of the copyright holders. In other instances, building large digital datasets of copyrighted material is unavoidable for the functionality of a project. A good example of this is the Google Books project and internet search more broadly. Google Books specifically came under fire from authors, publishers, artists, and photographers over the digitization and display of copyrighted content. The project was ultimately found to be a fair use that promotes copyright law’s very purpose: to expand public knowledge and understanding.
The datasets on which GAI systems like ChatGPT and Stable Diffusion rely are more like Google Books than commercial web crawlers that gather data; GAI systems need writing and art in their complete form to train from. And the comparison to Google Books doesn’t stop there. Many GAI systems are built using third-party public datasets like Common Crawl and LAION that are distributed under fair use principles and provide a real public good: archiving and making accessible the aggregated content of the internet for academics, researchers, and anyone else who may want it. These are free, non-commercial datasets collected by nonprofit organizations for use by researchers and the public. Web crawling and scraping also underlie the operation of search engines and archiving projects like the Internet Archive’s popular Wayback Machine.
In other words, the same practices that go into collecting data for GAI training are currently understood to be non-infringing or protected by fair use. Considering how vital these practices are for an open and accessible internet, we should ensure that they stay that way.
As a threshold matter, it is critical to understand that accessing, linking to, or interacting with digital information does not infringe any copyright. Reading a book, looking at a photograph, admiring a painting, or listening to music is not, and never should be, copyright infringement. This is not a “fair use” issue; the ability to use, access, or interact with a creative work is outside a copyright owner’s scope of control. Based on the best explanations of how GAI systems work, training a GAI system is generally analogous to these kinds of uses.
A GAI model does not contain an archive or directory of the works it trains on that the system queries, searches, or references. Rather, during training the system processes each work, deconstructing it mathematically in order to learn from it. To vastly oversimplify, this learning takes the form of building and adjusting the strength of connections within the model, resulting in a collection of mathematical weights that define it. The work being analyzed is not reproduced or stored in the model; instead, the model uses each work to improve its overall understanding of whatever it is training on (art, photographs, writing, and so on).
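To make the idea of "a collection of mathematical weights" concrete, consider a deliberately minimal sketch of a training loop: a single-weight linear model fit to made-up example data, nothing like an actual GAI architecture. Each training example nudges the weight, and when training is done only the weight survives; the examples themselves are discarded.

```python
# Toy training loop (illustrative only): each (x, y) example nudges a single
# weight toward a better fit. The trained "model" is just that one number;
# the training examples are not stored anywhere in it.
examples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]  # made-up data

weight = 0.0
learning_rate = 0.01
for _ in range(1000):
    for x, y in examples:
        error = weight * x - y               # how far off the prediction is
        weight -= learning_rate * error * x  # gradient step on squared error

# After training, the model has distilled all four examples into one weight
# (here, roughly 2.0), and none of the examples can be recovered from it.
```

The point of the sketch is the last line: however much data goes in, what comes out of training is a set of numbers encoding patterns, not copies of the works.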
In Andersen v. Stability AI, the class action lawsuit against Midjourney and Stable Diffusion (two of the most widely used GAI art systems), the plaintiffs conceptualize the training process differently. They characterize the training process as a mechanism for storing the training images in a compressed state. Ultimately, this interpretation will be tested in litigation, but the generalization process that training images go through does not result in any conventional kind of storage.
There are two facts that make this apparent. First, in an ideal model, none of the content used in training can be reproduced by the system with any reliable fidelity. Second, the sheer quantity of information these systems are trained on, compared to the size of the models themselves, should make that obvious. One can download the fully trained Stable Diffusion model weights at a size of about 4 GB, while the LAION-2B dataset it is trained on — the smallest version — contains around 80,000 GB of images; no amount of compression would allow the model to contain all that information. Simply put, the model does not contain its training data within it.
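That claim is easy to check with back-of-the-envelope arithmetic using the figures above, plus one assumption: LAION-2B contains roughly 2.3 billion images (the exact count varies by release).

```python
# Rough figures from the discussion above; the image count is approximate.
model_size_bytes = 4e9         # ~4 GB of Stable Diffusion model weights
dataset_size_bytes = 80_000e9  # ~80,000 GB of LAION-2B training images
num_images = 2.3e9             # ~2.3 billion images (assumed count)

# If the model somehow "stored" its training set, each image would get:
bytes_per_image = model_size_bytes / num_images  # under 2 bytes per image
# Equivalently, the dataset would need to shrink by a factor of:
compression_ratio = dataset_size_bytes / model_size_bytes  # 20,000x
```

Less than two bytes per image is not enough to store even one pixel's color, let alone a compressed copy of the whole image.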
The best understanding of the technology therefore points to the model (in its own way) really learning from training images and not merely reproducing and storing them in compressed or abstract mathematical form. Of course, this evaluation relies on information about a highly technical, insular, and rapidly developing field of study. It might be the case that the architecture of the existing or future systems is different from what is described here, but either way, the essence of the training process itself does not involve infringement of copyrighted work.
From a copyright liability perspective for GAI systems, user inputs are not a real concern. Naturally, users can misuse tools, even existing ones like image editing software, to infringe copyright or get up to other mischief, but that does not make the tool infringing. And the user inputs themselves, divorced from the outputs they may produce, do not violate any of the bundle of rights that copyright holders can claim. Even if a user were to feed copyrighted content into a GAI system as a prompt or priming material, this is merely use, not reproduction or distribution. In addition, while the outputs of GAI systems are not protected by Section 230, a user’s inputs to a GAI system certainly are.
GAI systems are defined by their outputs. The writing, images, and other content that they produce raise a collection of copyright-related concerns. These concerns can be distilled into a pair of questions: Do the outputs of GAI systems pass the “substantial similarity” test, such that they do not qualify as reproductions? And if they do pass, do they nonetheless incorporate enough major copyrightable elements from existing works that the output could be characterized as an unauthorized derivative work?
Image-generating diffusion models in particular suffer from “overfitting” and “memorization.” When a specific work appears in similar form many times throughout a training dataset, that image’s combination of characteristics can become overly well-connected in the model. The result is sometimes called “memorization” because the model is able to output a fairly good copy of that work if prompted to. This is an example of “overfitting” in training — where a model cleaves too closely to its training data instead of being able to generalize from its training. These overfit outputs, depending on their quality, can and should be considered infringing reproductions. While this may seem fatal, it is ultimately a red herring. Generalization is the ultimate design goal of these systems, and instances of memorization and overfitting — which raise copyright alarm bells — are essentially bugs, not features. After all, finding a copy of something that already exists online is a simple matter of search; the value of a GAI system is that it makes something new.
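Memorization versus generalization is not unique to GAI; it appears in even the simplest machine learning models. A toy curve-fitting sketch (a polynomial fit to made-up data, not a diffusion model) makes the distinction concrete:

```python
import numpy as np

# Eight noisy samples of a sine wave stand in for "training data."
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 8)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.1, size=8)

# A degree-7 polynomial has enough parameters to pass through all 8 points:
# it memorizes the training data, noise and all (overfitting).
memorizer = np.polynomial.Polynomial.fit(x_train, y_train, deg=7)
# A degree-3 polynomial cannot hit every point, so it must generalize,
# capturing the overall shape rather than the individual examples.
generalizer = np.polynomial.Polynomial.fit(x_train, y_train, deg=3)

# The memorizer reproduces its training data almost exactly; the
# generalizer does not, by design.
memorizer_err = np.max(np.abs(memorizer(x_train) - y_train))
generalizer_err = np.max(np.abs(generalizer(x_train) - y_train))
```

A GAI system behaving like the first fit, reproducing its training examples verbatim, is a bug in exactly the sense described above; the second fit is the design goal.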
Relatedly, some might argue that, since GAI systems are trained on copyrighted material and lack human agency and creativity, all of their outputs are derivative works. A derivative work is a new work that is based on an original work, and shares major copyrightable elements with the original, but which has its own copyright protection because of the new elements in the work. But this is where the ability of these systems to genuinely generalize is the critical factor. When the model is training on, for example, a portrait, it is learning general compositional information about portraits, faces, and so forth. So when a user prompts it to produce a specific kind of portrait, it isn’t pulling from any work or works specifically, but from its whole corpus of training, so none of the works in the training data are original works for the purpose of a derivative works analysis. Put another way, in a well-generalized model, the influence of any one work in the training set should be regarded under the de minimis doctrine as insignificant in its overall contribution to the new work.
Another narrative echoing across creative communities is concern about the ability of a GAI system to produce new works in an artist’s style. Generally, copyright law doesn’t protect style, ideas, or concepts. For a work to be an infringing reproduction, there must be substantial similarity between two specific works in terms of expressive content. That means that a work merely having a similar vibe, style, characteristics, or subject matter is not infringement. Another angle from which to approach the issue of style is that copyright holders have the right to control derivative works. The copyrightability of work created using GAI is a separate issue; the only element of the doctrine that is relevant here is whether some works produced using GAI systems pass the “substantial similarity” test to not qualify as reproductions, but still incorporate enough major copyrightable elements from an existing work such that the output could be characterized as an unauthorized derivative work. So, again, it cannot be a matter of similar style, subject matter, or general features — there must be specifically identifiable elements from particular works — and that still doesn’t describe the situation artists are worried about with GAI producing new works in existing styles.
In light of the threat of competition from these machines, and the value we place on art, there may be a temptation to deviate from the established copyright canon and expand the notion of what can be protected. But this is a dangerous idea.
All creative work is the product of a process of copying, transforming, and combining our experiences, inspirations, and influences. Every form of art, culture, and invention is rife with examples of celebrated works that reimagine, riff on, or downright knock off existing work. No one can, or should be able to, enforce a copyright on the basis of style, concept, or feeling alone. It is the free exchange of ideas and inspirations, and the ability to reimagine and remix existing work, that allows artists to build on existing ideas and push creative boundaries. Trying to enforce copyright protection on styles or ideas would short-circuit the creative process and stifle artistic development.
And while some small and independent creators are on the frontlines of the fight against GAI right now, it is important to remember that there are, and always have been, significant inequalities in what work is valued, whose work is respected, and who will have the power and resources to aggressively defend their work. For instance, hip-hop music was initially criticized for its use of sampling from other artists’ work and faced repeated copyright challenges which restricted the right of artists to use samples without paying for licenses. Anti-sampling cases resulting in mandatory licensing transformed a counter-cultural, democratized method of making art into yet another tool that only established artists and record labels could leverage. Today, underground hip-hop artists using uncleared samples and mashup artists still find themselves locked out of key revenue streams like streaming platforms because of these restrictive copyright rules.
There are obvious parallels between sampling and artists using GAI tools today, but even artists working traditionally ought to consider how existing industry power dynamics would play out if there were vastly expanded “stylistic” copyright rules. Record labels, music studios, and established artists would have the resources and power to go after any creator that encroached on their “style,” potentially locking up whole genres and aesthetics behind licensing structures. Reducing every shred of human originality to copyrightable units that must be bought and sold is a far greater threat to creativity than new automated tools will ever be.
Expanding Copyright Law Won’t Protect Artists from GAI’s Disruption
Copyright is enshrined in the Constitution with the phrase: “To promote the progress of science and useful arts…” Taking that purpose seriously, it should be clear that copyright and GAI should not be set at odds with one another. Like the internet — another disruptive technology that forced us to grapple with copyright concerns — GAI is rich with possibilities for redrawing the lines of creativity, productivity, and prosperity.
This is not to say that there aren’t challenges: Overfitted, infringing reproductions of existing works and concerns about how to navigate consent, credit, and compensation for creators are uncannily reminiscent of early concerns about file sharing and web search indexing. Indeed, to some extent, these concerns about the internet have always lingered and are now crossing over and combining with concerns about GAI. But, as the Copyright Office already determined when it recommended against creating special ancillary copyrights for news content, even when there are serious concerns about the future of important human activities, expansions of copyright aren’t necessarily the effective solution — and would necessarily mean limiting fair use and free expression.
This impulse to expand or redefine copyright law to allow others to control the ability to use, access, interact with, and learn from creative works cannot be the solution to these fears. If we lose the right to freely use and enjoy creative works, we open the way for a new world of extreme commercialization where artists must pay for the privilege to create, and people find the costs and barriers to information and culture rising ever upwards. Generative AI presents the possibility of disruption — let’s embrace this opportunity for change by finding solutions that enhance the ability to create freely, transform traditional economic structures to ensure prosperity, and lower the barriers to enjoying the fruits of our diverse and vibrant shared culture.