First Significant EU Decision Concerning Data Mining and Dataset Creation to Train Artificial Intelligence


5 minute read | October 21, 2024

A court in Hamburg, Germany, has decided a copyright infringement case in a way that sheds light on how European courts may apply the text and data mining (TDM) exemption to AI model developers.

The exemption is contained in Directive (EU) 2019/790 on copyright and related rights in the Digital Single Market (the DSM Directive).

The TDM exemptions are set out in Articles 3 and 4 of the DSM Directive. Germany implemented them in Sections 44b and 60d of the Urhebergesetz (German Copyright Act).

Key Takeaway

The Regional Court of Hamburg held that the TDM exemption in German copyright law applied in a case involving a not-for-profit organization that copied a photo while creating a dataset to train AI models.

Facts of the Case

  • The defendant is a not-for-profit association. It creates datasets it makes freely available to the public to train AI models.
  • The dataset at issue consists of a spreadsheet with hyperlinks to images and image files.
    • The files are publicly available online, along with associated data, including descriptions of the images (5.85 billion image-text pairs).
    • The dataset was built from an existing dataset containing URLs and text descriptions of the images.
    • The defendant extracted the image URLs and downloaded the images. Some images were filtered out. The remaining images and the associated metadata were extracted and added to the new dataset.
    • During this process, the image at issue was captured, downloaded, analyzed and included in the dataset with its metadata. The image contained a photo agency’s watermark. It was posted on the agency’s website, downloaded and thereby reproduced.
    • The photo agency’s website says: “You may not…. use automated programs, applets, bots or the like to access the …website or any content thereon for any purpose, including, by way of example only, downloading content, indexing, scraping or caching any content on the website.”

The wording has been on the site since at least January 13, 2021. The dataset was created in the second half of 2021.

Plaintiff’s Arguments

The plaintiff alleged a violation of copyright in the form of an impermissible reproduction. The plaintiff also alleged that:

  • The TDM exemptions in German copyright law did not cover the reproduction.
  • Gathering data to train AI does not constitute text or data mining within the meaning of the law, and legislators did not contemplate that use when they introduced the exemption.
  • The mass incorporation of copyrighted works to train generative AI models impairs the normal exploitation of these works. As a result, the exemptions should not apply.
  • The reproduction was unauthorized because of the restriction on the agency website, and the restriction was machine-readable.
  • The defendant is not a research organization and therefore not entitled to rely on the unqualified exemption applicable to TDM activities undertaken for research.

Defendant’s Arguments

The defendant maintained the TDM exemption covered the download and reproduction. The defendant also said:

  • Analyzing the image files and extracting metadata to train AI is a main application of the TDM exemption.
  • The defendant did not create parallel digital archives since the downloaded images were not stored and only hyperlinks were included in the dataset.
  • The photo agency – not the rights holder – declared the reservation regarding restrictions on using the photo on the agency’s website.
    • The restriction was worded generally with no specific reference to text and data mining.
    • The wording was not machine-readable.
  • The defendant is a non-profit association committed to research. The fact that some board and association members may also work for tech companies does not change the association’s non-commercial status.

The Court’s Ruling

On September 27, 2024, the court ruled that the defendant interfered with the plaintiff’s rights of exploitation by reproducing the photograph at issue, but that the TDM exemption for research organizations applied.

The court noted that the download was made for text and data mining within the meaning of the law.

Other noteworthy findings:

  • The defendant’s reproduction of the image was neither transient nor incidental.
  • The defendant probably could not have relied on the German copyright law equivalent of Article 4 of the DSM Directive because of the valid reservation of rights on the website. The court, however, took the view that Section 44b of the German Copyright Act, which implements Article 4 of the DSM Directive, generally applies to creating training data.
    • The court did not decide whether training an AI model is subject to the TDM exemptions. The court does, however, seem to consider that training could be subject to the exemption. The court noted that possible future applications of a rapidly evolving technology such as AI cannot be foreseen at the time a dataset is created.
    • As a result, a general intention to create AI-generated content using a given dataset cannot be established with legal certainty. This possibility therefore cannot be used to assess the legality of creating the dataset in the first place.
  • The argument that legislators did not have generative AI in mind when drafting the TDM exemption is not a valid reason to narrowly interpret the exemption.
  • Additionally, the EU AI Act indicates that creating datasets to train AI models is subject to the TDM exemption. This is because providers of such models must have policies to comply with reservations of rights asserted under Article 4(3) of the DSM Directive.
  • The plaintiff photographer could rely on the reservation of rights on the photo agency’s website to protect his own rights. The reservation of rights was also sufficiently clear. The natural-language reservation on the photo agency’s website satisfied the machine-readability requirement for a valid reservation of rights.

LEARN MORE

TDM exemptions from the DSM

Urhebergesetz (German Copyright Act)