The New York Times lawsuit against OpenAI could have major implications for the development of artificial intelligence


In 1954, the Guardian’s science correspondent reported on “electronic brains”, which had a form of memory that allowed them to recall information, such as airplane seat allocation, in a matter of seconds.

Today, the idea of computers storing information is so common that we don’t even think about what words like “memory” really mean. However, in the 1950s, this language was new to most people and the idea of an “electronic brain” was full of possibilities.

In 2024, your microwave has more computing power than anything called a “brain” in the 1950s, but the world of artificial intelligence is posing new challenges for language… and lawyers. Last month, the New York Times filed a lawsuit against OpenAI and Microsoft, the owners of the popular AI-based text generation tool ChatGPT, over their alleged use of Times articles in the data they use to train (improve) and test their systems.

The newspaper claims that OpenAI infringed its copyright by using its journalism as part of the process of creating ChatGPT. In doing so, the lawsuit argues, OpenAI has created a competing product that threatens the Times’s business. OpenAI’s response so far has been very cautious, but a key principle outlined in a statement released by the company is that its use of online data falls under the principle known as “fair use.” This is because, OpenAI maintains, it transforms the work into something new in the process: the text generated by ChatGPT.

At the heart of this issue is the question of data usage. What data do companies like OpenAI have the right to use, and what do concepts like “transform” really mean in these contexts? Questions like this, related to the data with which we train AI systems or models, such as ChatGPT, remain a fierce academic battleground. The law often lags behind industry behavior.

If you’ve used AI to answer emails or summarize your work, you might see ChatGPT as an end justifying the means. However, perhaps we should be concerned if the only way to achieve this is to exempt specific corporate entities from the laws that apply to everyone else.

Not only could this change the nature of the debate surrounding copyright lawsuits like this, but it also has the potential to change the way societies structure their legal system.

Read more: ChatGPT: What the law says about who owns the copyright to AI-generated content

Fundamental questions

Cases like this can raise thorny questions about the future of legal systems, but they can also call into question the future of AI models themselves. The New York Times believes ChatGPT threatens the newspaper’s long-term existence. On this point, OpenAI says in its statement that it is collaborating with news organisations to provide new opportunities in journalism. It says the company’s goals are to “support a healthy news ecosystem” and “be a good partner.”

Even if we believe that AI systems are a necessary part of our society’s future, it seems like a bad idea to destroy the data sources they were originally trained on. This is a concern shared by creative outlets such as the New York Times, authors such as George R.R. Martin, and also the online encyclopedia Wikipedia.

Proponents of large-scale data collection, such as that used to power large language models (LLMs), the technology underlying AI chatbots like ChatGPT, argue that AI systems “transform” the data they are trained on by “learning” from their data sets and then creating something new.


Effectively, what this means is that researchers feed these systems text written by people and ask them to guess the next word in a sentence, just as they would when answering a real question from a user. By hiding the correct next word and then revealing it, researchers can score each guess as a binary “yes” or “no”, feedback that steers AI systems toward accurate predictions. It is for this reason that LLMs require vast amounts of written text.
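The predict-the-next-word idea described above can be illustrated with a deliberately tiny sketch. This is not how an LLM is actually built (real systems learn statistical weights over billions of tokens, not simple counts); it is just a toy bigram counter, with all names and the example corpus invented for illustration, that shows the hide-then-reveal training loop in miniature:

```python
from collections import Counter, defaultdict

def train(corpus):
    """'Train' a toy model by walking each sentence and recording,
    for every word, which word actually came next. The 'hidden'
    next word is revealed only to update the counts."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for i in range(1, len(words)):
            counts[words[i - 1]][words[i]] += 1
    return counts

def predict(model, word):
    """Guess the most frequently observed next word, or None if
    the word was never seen as context during training."""
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]

# Invented two-sentence 'training data' for the sketch.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
]
model = train(corpus)
print(predict(model, "the"))  # → cat ("the cat" was seen twice)
```

A real LLM replaces the counts with a neural network and the exact-match scoring with a probability-based loss, but the shape of the loop — guess the hidden word, check, adjust — is the same, which is why so much written text is needed.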

If we copied articles from the New York Times website and charged people for access, most people would agree that this would be “systematic theft on a massive scale” (as the newspaper’s lawsuit puts it). But improving the accuracy of an AI by using data to guide it, as shown above, is more complicated than this.

Companies like OpenAI do not store their training data and therefore argue that the New York Times articles fed into the dataset are not actually being reused. However, a counterargument to this defense of AI is that there is evidence that systems like ChatGPT can “leak” textual excerpts from their training data. OpenAI says this is a “rare bug.”

However, it suggests that these systems (inadvertently) store and memorise some of the data they are trained on, and can regurgitate it verbatim when specifically prompted. This would circumvent any paywalls that a for-profit publication might implement to protect its intellectual property.

The use of language

But what is likely to have a long-term impact on how we approach legislation in cases like these is our use of language. Most AI researchers will tell you that the word “learning” is a loaded and inaccurate way to describe what AI is actually doing.

It is worth asking whether the law in its current form is sufficient to protect and support people as society undergoes a massive shift into the age of AI. Use that builds on an existing copyrighted work in a manner different from the original is called “transformative use”, and it is a defence invoked by OpenAI.

However, these laws were designed to encourage people to remix, recombine, and experiment with works already released to the outside world. The same laws were not designed to protect multibillion-dollar technology products that operate at a speed and scale many orders of magnitude greater than any human writer could aspire to.

The problem with many of the defenses of large-scale data collection and use is that they rely on strange uses of the English language. We say that AI “learns”, that it “understands”, that it can “think”. However, these are analogies, not precise technical language.

Just as in 1954, when people looked at the modern equivalent of a broken calculator and called it a “brain”, we are using old language to grapple with entirely new concepts. Whatever we call them, systems like ChatGPT don’t work like our brains, and AI systems don’t play the same role in society that people do.

Just as we had to develop new words and a new common understanding of technology to make sense of computers in the 1950s, we may need to develop new language and new laws to help protect our society in the 2020s.

This article is republished from The Conversation under a Creative Commons license. Read the original article.


Mike Cook does not work for, consult with, own shares in, or receive funding from any company or organization that would benefit from this article, and has disclosed no relevant affiliations beyond his academic appointment.
