
topicnews · September 17, 2024

Copyright challenges in training generative AI models > Artificial Intelligence | Attorney Ferner

The use of generative AI models such as ChatGPT, DALL-E or Stable Diffusion has increased enormously in recent years. These models can generate creative content such as texts, images or pieces of music from user instructions. This capability rests on the AI models having “learned” from large amounts of data how to create such content. A significant part of these data sets is protected by copyright, which leads to considerable legal challenges.

Technological fundamentals

Generative AI models are based on machine learning, in particular deep artificial neural networks (ANNs), which are trained to recognize complex patterns in large amounts of data.

These models use learning processes such as supervised, unsupervised and reinforcement learning to improve their capabilities. An important aspect is the pre-training and fine-tuning of the models: The base model is first trained on a general dataset (pre-training) and then adapted to more specific tasks or styles (fine-tuning). This allows the models to be used flexibly for different applications.
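The two-stage process can be illustrated with a deliberately tiny sketch. It assumes a one-dimensional linear model and two made-up data sets (`general_data` and `specific_data` are hypothetical stand-ins, not real training corpora); real generative models have billions of parameters, but the sequence of steps is the same in principle: train on broad data first, then continue training the same parameters on narrower data.

```python
import random


def train(weights, data, learning_rate, epochs):
    """Run plain gradient descent on a 1-D linear model y = w*x + b."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            error = (w * x + b) - y
            w -= learning_rate * error * x
            b -= learning_rate * error
    return (w, b)


def mean_squared_error(weights, data):
    """Average squared prediction error of the model on a data set."""
    w, b = weights
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)


random.seed(0)

# Hypothetical "general" data set: many examples of the broad pattern y = 2x.
general_data = [(x / 10, 2 * (x / 10)) for x in range(-20, 21)]
# Hypothetical "specific" data set: a few examples of a shifted pattern y = 2x + 1.
specific_data = [(x / 10, 2 * (x / 10) + 1) for x in range(-5, 6)]

# Pre-training: start from random parameters and learn the general pattern.
initial = (random.uniform(-1, 1), random.uniform(-1, 1))
pretrained = train(initial, general_data, learning_rate=0.05, epochs=200)

# Fine-tuning: continue from the pre-trained parameters on the specific data.
finetuned = train(pretrained, specific_data, learning_rate=0.05, epochs=200)

print("pre-trained error on specific task:", mean_squared_error(pretrained, specific_data))
print("fine-tuned error on specific task: ", mean_squared_error(finetuned, specific_data))
```

The point of the sketch is that fine-tuning does not start over: it adjusts parameters that already encode what was learned from the first data set.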

Copyright aspects

According to a recent report by Dornis and Stober, numerous copyright-relevant actions arise during the training of generative AI models. These include:

  1. Collection, preparation and storage of training data: Copyrighted works are reproduced when the corpora that serve as the basis for AI training are created.
  2. Training generative AI models: During the training process, especially during pre-training and fine-tuning, reproductions of the works occur “inside” the model. Even if the data is not explicitly stored as such, it can still be memorized by the model, which is considered reproduction in the sense of copyright law.
  3. Using generative AI models: Users of generative AI systems generate new content through the models, which in turn can be based on the protected training data. This constitutes a use of the copyrighted works.
  4. Public accessibility: When generative AI models are made available for use, either through user applications or as a download, the works used for training and reproduced in the model are thereby made publicly available.
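The idea that a work can be retained “inside” a model without being stored as a file can be made tangible with a toy example. The following sketch is only an analogy and not how neural networks work: it assumes a simple word-level n-gram model (the names `train_ngram_model` and `generate` and the sample sentence are hypothetical illustrations). The model keeps only counts of which word follows which context, yet a short prompt is enough to make it emit its training text again.

```python
from collections import Counter, defaultdict


def train_ngram_model(corpus_words, context_size=2):
    """Count which word follows each context of `context_size` words."""
    model = defaultdict(Counter)
    for i in range(len(corpus_words) - context_size):
        context = tuple(corpus_words[i:i + context_size])
        model[context][corpus_words[i + context_size]] += 1
    return model


def generate(model, prompt_words, max_new_words, context_size=2):
    """Greedily append the most frequent next word for the current context."""
    output = list(prompt_words)
    for _ in range(max_new_words):
        context = tuple(output[-context_size:])
        if context not in model:
            break  # the model has never seen this context
        next_word, _count = model[context].most_common(1)[0]
        output.append(next_word)
    return output


# Hypothetical stand-in for a protected work in the training data.
training_text = "the quick brown fox jumps over the lazy dog near the quiet river bank"
words = training_text.split()

# "Training": the text itself is not kept, only the transition counts.
model = train_ngram_model(words)

# "Use": prompting with the first two words reproduces the rest of the text.
completion = generate(model, words[:2], max_new_words=len(words))
regurgitated = " ".join(completion)
print(regurgitated == training_text)
```

Whether and to what extent large neural models behave this way for a given work is an empirical question; the sketch only shows why parameters alone can suffice to reproduce training content.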

Legal barriers and challenges

The conventional limitations of copyright law cover the acts of use involved in training generative AI models only in a few cases, most of which are of little practical relevance. Most importantly, according to the report, the limitation for text and data mining (TDM) does not apply. Generative AI models use the training data more comprehensively than TDM does: they draw not only on semantic but also on syntactic information and represent it in a vector space. According to the report's analysis, this amounts to a comprehensive duplication of content that goes beyond what TDM covers.

DSM Directive

The DSM Directive, which forms the legal basis for TDM, was not designed with the technological development of creative-productive AI systems in mind and therefore, according to the report, does not cover them. Likewise, the AI Act (AI Regulation) did not take these specific differences into account, which leads to legal grey areas.

Relevant copyright limitations and their application

In German copyright law, there are various limitations that permit the use of copyrighted works under certain conditions. In the context of training generative AI models, the following limitations are particularly relevant:

  1. § 44a UrhG – Temporary acts of reproduction: This limitation allows temporary copies that are fleeting or incidental and constitute an integral and essential part of a technical process, provided that they have no independent economic value. However, according to the report by Dornis and Stober, this limitation is of limited use when training AI models, as the copies are not merely fleeting but are often of a long-term nature and go beyond what is technically necessary.
  2. § 60d UrhG – Text and Data Mining (TDM) for scientific research: Section 60d of the Copyright Act allows reproduction of works for the purpose of text and data mining for non-commercial scientific research. However, this limitation is hardly relevant for generative AI models, since the commercial use of such models is not covered by Section 60d. The report also emphasizes that generative models not only extract semantic information, but also use syntactic structures, which goes beyond the TDM limitation.
  3. §§ 60a to 60c UrhG – Uses for teaching, science and institutions: These limitations allow some use of copyrighted works for educational and scientific purposes. However, they are limited to non-commercial contexts and usually do not apply to the training of generative AI models, as most models are also used commercially.
  4. § 44b UrhG – Text and data mining: Section 44b of the Copyright Act permits reproductions of lawfully accessible works for text and data mining, including for commercial purposes, provided that the copies are deleted once they are no longer needed and the rights holder has not reserved such use. The report assessed this limitation as particularly relevant, but only partially applicable to generative AI models. The main reason is that the reproductions made during AI training are often stored in the model permanently rather than temporarily and thus go beyond the scope of Section 44b. The models often memorize the structure and contents of the training data, which amounts to long-term use rather than a transient technical step.

This would mean that many of the copies would be in a legal grey area – or even clearly outside the legal limits, which would lead to considerable legal uncertainty.

Applicable law and international jurisdiction

The report emphasizes that making AI models publicly available for use by German users, e.g. via OpenAI's ChatGPT website, can result in the application of German copyright law and the jurisdiction of German courts. Since the training data is protected by copyright and is reproduced “inside” the models, this represents a relevant exploitation within the meaning of copyright law.

Conclusion and outlook

The use of generative AI models brings with it considerable legal uncertainties, particularly with regard to copyright infringements during the training and application of these models. The report shows that the current legal framework is not equipped to meet the challenges posed by rapid technological development.

The issue of copyright when training AI with third-party data is well known and accounts for the majority of the inquiries I receive. This issue is expected to become more acute in the coming years, which is why clear legal rules are urgently needed to both protect the rights of authors and promote innovation in the field of AI.

In particular, the report's authors conclude that the traditional copyright limitations, above all Section 44b of the Copyright Act, are not sufficient to justify the extensive reproduction and use of copyrighted works by generative AI models. While some limitations such as Section 44a and Section 60d of the Copyright Act allow short-term and specific uses, they do not cover the specific requirements of the models or the long-term storage of works within them.

Attorney Jens Ferner (specialist in IT and criminal law)