A few days ago, the New York Times filed a copyright infringement lawsuit against OpenAI (creators of ChatGPT) and Microsoft (a major investor in OpenAI). The full complaint can be read here.
This is an extremely complex case, addressing arguably novel issues in copyright law. Meaning, depending on how you look at it, the court may have to consider legal issues that have never been previously addressed.
Note that the U.S. Copyright Act was written in 1976, before OpenAI, ChatGPT, or the World Wide Web existed. Ideally, Congress would step in and pass sensible legislation addressing copyright issues as they relate to AI and large language models (LLMs). But we don’t live in an ideal world, so it’s up to the courts.
There have been many articles discussing the case. Here is a very detailed summary by Zvi Mowshowitz of some of the various articles, posts, and such.
In this post, I want to narrow the focus to the following: does training an AI model on copyrighted materials, without the permission of the copyright owner, constitute copyright infringement? (We can also ask, “and if so, is there a fair use defense?”)
As I understand what happened here – and please note that I am not a coder, nor an AI expert; I was told there would be no math – in the process of creating ChatGPT, OpenAI exposed the program to essentially everything on the internet, including, relevant to this case, the NYT archive. ChatGPT isn’t like Wikipedia, or a database that holds all of the information it’s exposed to and just spits it out upon request. It operates more like an intelligent being (we are well into the realm of non-technical metaphor here).
Meaning, if I, David Lizerbram, read an article in the Times, and then you later ask me to say what it says, I can give you a summary, but I don’t have a photographic memory and I don’t retain the exact text of the article in my head. ChatGPT, I’m told, does more or less the same. Having been exposed to all of the NYT’s archives, it can write text in a style approximating the NYT’s house style, and it retains certain (mostly) factual information, so if you ask it to write an article in the style of the NYT as if it were the day after the election of Barack Obama, its output would be similar to those articles that actually appeared the day after the election.
(Technical experts: please feel free to correct anything I’m writing here.)
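Since I've invited corrections, let me make my non-expert analogy slightly more concrete with a toy example. The snippet below is my own illustration, not how GPT models actually work: it “trains” by simply counting adjacent word pairs, so what it retains is statistical patterns drawn from the text, not the text itself.

```python
from collections import Counter

def train_bigrams(text: str) -> Counter:
    """Toy 'model': counts of adjacent word pairs, not the text itself."""
    words = text.split()
    return Counter(zip(words, words[1:]))

# The original sentence cannot be recovered exactly from these counts,
# but the counts could be used to generate text "in the style of" the source.
model = train_bigrams("the senator won the race the senator said")
```

A real language model is vastly more sophisticated, but the rough idea is the same: the finished product holds distilled patterns, which is why its output can approximate a house style without storing the archive verbatim.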
Part of the NYT’s complaint argues that the output of such a query will, in some cases, be nearly identical to an actual NYT article, but I’m not addressing that issue here – there are lively debates about whether or not that’s actually how ChatGPT works and whether the types of examples cited in the filing can be substantially reproduced in court.
Getting back to the key points. Nobody is arguing that my reading the NYT is a copyright infringement. But is there a copyright infringement when an AI is “exposed” to the NYT in the process of training the program?
I’d say the answer is yes. In order to expose/share/train an AI on content (whatever term you want to use, understanding that we are in the realm of metaphor), digital copies need to be made in the process. My eyes and brain don’t make digital copies, but there is no other way for a computer to be exposed to content. One of the “basket of rights” embodied in copyright is the right to grant or withhold permission to make copies of the work. A digital copy of the work is, without question, a “copy”, and even if the AI program deletes the article after it’s “trained”, the copy was still made without the authorization of the copyright holder (the NYT, in this case).
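To picture why I say copies are unavoidable, here is a toy sketch of what an ingestion step might look like. The function name and the whitespace “tokenization” are my own illustrative assumptions, not OpenAI’s actual pipeline; the only point is that a verbatim digital copy of the work sits in the machine’s memory before any training happens.

```python
# Hypothetical sketch of a training-data ingestion step (not real OpenAI code).

def ingest_article(raw_text: str) -> list[str]:
    # Encoding the article produces a byte-for-byte digital copy in memory,
    # regardless of whether it is deleted after training.
    digital_copy = raw_text.encode("utf-8")
    # Crude stand-in for tokenization; real pipelines use subword tokenizers.
    tokens = digital_copy.decode("utf-8").split()
    return tokens

tokens = ingest_article("OpenAI trained its model on the Times archive.")
```

Whether that transient copy is legally significant is exactly what the court will have to decide, but as a factual matter, the copying step is there.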
Of course, digital copies of copyrighted works are made all day long across the internet without the permission of the rights holders, but that doesn’t mean each of those acts isn’t a copyright infringement, even if the vast majority never result in enforcement of the copyright holder’s rights.
Another way to consider this issue is the question of scale. In the article cited above, Zvi writes:
Scale matters. Scale changes things. What is fine at small scale might not be fine at large scale. Both as a matter of practicality, and as a matter of law and its enforcement.
So even if you can understand an AI “reading” a NYT article as being similar, on some level, to a human performing the same action, the results are very different. Even if I had a photographic memory, I could only personally reproduce so many copies of the original work; a digital application like ChatGPT can reproduce effectively infinite numbers of copies. That has to be taken into consideration, and I’m not sure the 1976 U.S. Copyright Act or the 1998 Digital Millennium Copyright Act really suffices to grapple with the issue.
Let’s assume the act of training ChatGPT on the collected works of the NYT is, in fact, a series of copyright infringements. Is the case closed?
Of course not, as there are fair use issues to consider. Fair use is a defense to a claim of copyright infringement. Without an infringement happening first, there is no question of “fair use”.
It’s a worthwhile exercise to apply a fair use analysis to the facts in NYT v. OpenAI, considering the four factors enumerated in Section 107 of the Copyright Act, but that will have to wait for another post.
So where I am landing right now is as follows: to the best of my understanding, the act of training ChatGPT on the corpus of the NYT without the permission of the Times included a series of copyright infringements. If this conclusion is based on a misunderstanding of how the AI training process works, please let me know. We’re all just trying to learn here, whether the reader is an artificial intelligence or the old-fashioned kind.