
October 14, 2024

Copyright and AI training: “An almost paradoxical situation”


Training generative AI models is not text and data mining. That is the conclusion of the new open-access study “Copyright and Training of Generative AI Models – Technological and Legal Basis”, commissioned by the Copyright Initiative. The question matters because AI companies invoke the TDM exception, among other things, to avoid having to pay creators.


Tim W. Dornis and Sebastian Stober: Professors in Hanover and Magdeburg.

(Image: private)

“We went into our investigation entirely without preconceptions, aware that the topic had not yet been comprehensively examined in an interdisciplinary manner,” say the study’s authors, Tim W. Dornis and Sebastian Stober. Dornis is a legal scholar and professor at the Faculty of Law of the University of Hanover and completed his JSM at Stanford; Stober is Professor of Artificial Intelligence at Otto von Guericke University Magdeburg.

In addition to rejecting the applicability of the so-called TDM exception, the researchers conclude that training data – and thus copyrighted works – are reproduced within the AI models. “This is important for legal action against infringements in AI training and use,” the study states. In an interview with heise online, Dornis and Stober explain what that could mean in concrete terms.

heise online: Mr. Stober, Mr. Dornis, in connection with the question of whether AI training in its current form violates copyright law, there is constant talk of so-called text and data mining (TDM), under which AI training supposedly falls. What is TDM and what are its possible uses?

Sebastian Stober: Data mining is the automated extraction of new information, patterns and insights from large data collections; text mining does the same for text collections. The possible uses are very diverse. The insights gained can form the basis for business models – for example, when markets and customer behavior are analyzed. In politics, opinion analyses are regularly used, with text mining running in the background on social media. Data mining is also an important tool in science for practically all data-driven questions.

However, as with any technology, not every use benefits society. For example, the information obtained can also be used to manipulate people in a targeted way, as the Cambridge Analytica scandal showed. Here society is called upon to set clear limits on which uses are undesirable.

Many AI companies now take the view that their AI training falls under the so-called TDM exception, because they apparently derive this from the AI Act. In other words, they believe the matter is settled by statute. But the roots of the AI Act reach back years before generative AI systems existed.

Tim W. Dornis: I don’t see it that way either. Correctly interpreted, neither the AI Act nor its legislative materials support the conclusion that training generative AI falls under the TDM exception. The wording alone raises doubts. Above all, a deeper examination of the principles and background would have been necessary – with a particular focus on copyright.

Such an examination was still missing when the AI Act was passed. In other words: as our study shows, a thorough analysis of the AI technology itself is required, and this has simply been neglected in the legal opinions formed so far. The debate should not stop at this point merely for “reasons of convenience”, or on the argument that anything else would endanger “European AI innovation”. Yet that seems to be the trend in the current legal debate.

Why is AI training so much more than TDM? After all, it’s just computers scouring the internet and drawing conclusions from it.

Stober: On the one hand, we have to distinguish between data collection and training, which are often carried out by different actors. Once a data collection has been created, it can be used by a wide variety of actors to train a wide variety of AI models.

On the other hand, the term AI training needs to be differentiated more clearly. In our study, we took great care to emphasize that what is at issue is the training of generative AI models. There are regulations for text and data mining that enable the collection of data for training AI models. In the report, however, we come to the conclusion that the training of generative AI models does not itself fall within the scope of text and data mining – among other things because no new insights are gained. The trained models can only generate additional data that is similar to the training data. It is a completely different purpose. So the exception doesn’t apply here, and that’s the problem.

Within the framework of the legal provisions, publications can carry a reservation against TDM. Wouldn’t that be a simple solution?

Dornis: On paper it looks like a simple solution. In practice, however, it is anything but effective. Just consider how to deal with works that have already been published (books, for example): should copies all over the world be retrofitted with inserts?

Even with digital publications, we have to assume that once works are “online”, they can hardly be provided with reservations that cover all content and, above all, are machine-readable for crawlers and the like. Finally, the question remains (as always) whether the AI developers – and their crawlers – actually comply.
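To illustrate the machine-readability problem Dornis describes: in practice, rights reservations are often expressed as crawler directives, for instance in a site’s robots.txt file. A minimal sketch (GPTBot and CCBot are real crawler names, but whether any given crawler honors such a file is exactly the open compliance question raised above):

```text
# robots.txt – asks AI crawlers not to fetch the site's content.
# Compliance is voluntary, and the file operates per site,
# not per work – it cannot express reservations for individual
# copies of a publication circulating elsewhere.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```

This also shows the structural limit the interview points to: a reservation attached to one online location says nothing about copies of the same work hosted, mirrored or already collected elsewhere.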

Can you explain how the AI industry came to assume that training in its current form constitutes a form of fair use – which does not exist in Europe? Do it first, apologize later, as Mark Zuckerberg once recommended?

Dornis: From a legal perspective, it’s easy to explain: the mindset in Silicon Valley has always been “Don’t ask for permission, ask for forgiveness later.” It is less about a belief in the legality of one’s own actions and more about the conviction that, in the interest of innovation as a “good thing”, even short-term disruption – including the associated legal violations – should be possible.

In addition, Silicon Valley could apparently also rely on the legal analysis already described. In Germany, at least, the hypothesis “training generative AI models = TDM” was put forward shortly after the capabilities and workings of generative AI became known. Little by little, more and more publications followed suit.


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.