Text and Data Mining Reservation

Since June 2021, German copyright law has established that copyrighted works may be reproduced for text and data mining purposes, even outside of scientific research. What exactly constitutes “text and data mining,” and whether every type of AI model training falls under this category, is actively debated elsewhere (see, for example, Käde, CR 9/2024, 598).

However, it is clear that § 44b of the German Copyright Act (UrhG), which declares such reproductions permissible even without a license, also provides that rights holders can declare a reservation against them (§ 44b para. 3 UrhG). For works available online, this must be done in “machine-readable form.” There is, as yet, no consensus on what “machine-readable” means and how the reservation should be implemented in practice. In the legislative reasoning, the German legislature assumes that such a reservation can also be declared in terms of use or in a website’s imprint, “if it is machine-readable there” (cf. BT-Drs. 19/27426, p. 89).

Without discussing the terminology further here, we are collecting options and information on how such a reservation can be designed, since numerous practical implementation approaches are already emerging.

On September 27, 2024, the Hamburg Regional Court issued a decision in a case in which the effectiveness of a reservation ultimately did not matter, because the defendant could rely on § 60d UrhG (text and data mining for scientific research – no reservation can be declared against reproductions for this purpose). In an obiter dictum, the court nevertheless addressed the question of machine readability and took the view that a reservation formulated in natural language also meets the requirements of § 44b para. 3 UrhG. While this is favorable for rights holders, it could mean considerable practical effort for those who want to rely on the limitation. The options for declaring a reservation presented below are therefore intended above all as suggestions to rights holders on how to declare a reservation in a form that makes it easier for interested parties to take it into account.

We also point out that we naturally cannot say with certainty that the information collected here meets the possible court requirements for machine readability – the current information situation is still too sparse for that.

0. Why Machine Readability is Important

To train an AI model, large amounts of training data are typically required. For example, to train an AI that can analyze images, large quantities of suitable images must be collected. If an existing collection that is already available online cannot or should not be used, one way to obtain such a collection is to search the web for images. Since doing this manually would take far too long, automated programs are developed that systematically search websites for images or links to images and download them. For these programs to respect the wishes of rights holders who do not want their works – here, images – to be downloaded for text and data mining purposes, they must implement a function that checks each visited website (or, ideally, each requested image) for the presence of such a reservation.

1. Robots.txt

In the context of machine readability, the robots.txt file is regularly mentioned. This file format has been used for many years for communication between website operators and search engine bots. In a standardized text format (the Robots Exclusion Protocol), the file specifies which areas of a site may be crawled by (search engine) bots and which should not be visited (example: jbb.de/robots.txt). However, the file cannot enforce these rules.

Since bots are sometimes used for the automated collection of AI training data, this file can also be used to declare a corresponding reservation. And because robots.txt is a plain text file, basically any other text can be included as well, for example a human-readable explanation of the reservation.
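For illustration, a robots.txt that excludes two well-known AI-related crawlers (OpenAI’s GPTBot and Common Crawl’s CCBot) while leaving the rest of the site open to all other bots could look like this; the comment lines are free text:

```text
# Reservation under § 44b para. 3 UrhG: no text and data mining.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# All other bots may continue to crawl the entire site.
User-agent: *
Disallow:
```

An empty Disallow line means that nothing is excluded for the bots it applies to.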

Disadvantages:

  • To effectively exclude bots, the bot’s name must be known. This can be determined for major known players, but as soon as new bots with new names appear, they cannot be directly addressed. A blanket exclusion of bots means that search engines will also no longer crawl the site.
  • When using human-readable texts, crawlers face the challenge of recognizing and analyzing them. This could lead to considerable delays in the crawling process.

Advantages:

  • The use of this file in automated systems is already an established standard.
  • Handling is simple because it’s sufficient to place this file in the website’s root directory.
  • Specific folder paths that should not be crawled can be specified.
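On the crawler side, such rules can be evaluated with standard tooling; the following sketch uses Python’s built-in robots.txt parser on hypothetical rules (the bot names and URLs are examples, not part of any standard):

```python
import urllib.robotparser

# Hypothetical robots.txt content of a site to be crawled.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# GPTBot is excluded from the whole site ...
print(parser.can_fetch("GPTBot", "https://example.org/images/photo.jpg"))        # False
# ... while other bots are only excluded from /private/.
print(parser.can_fetch("SomeOtherBot", "https://example.org/images/photo.jpg"))  # True
```

A crawler that wants to honor such rules would run this check before downloading each file – exactly the kind of automated consideration described in section 0.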


2. X-Robots-Tag (HTTP Headers / HTML Meta Tags)

Another possibility is setting the “X-Robots-Tag” in the server configuration (HTTP headers) or as robots meta tags in HTML. More detailed information with examples can be found on the Google Developer websites; a simplified explanation with instructions can also be found in Käde, CR 9/2024, 598, 602. In principle, any text values can be specified for these tags (even freely invented ones such as “noTDMinGermany”), but values like “noai”, “noimageai” or “noml” appear to be establishing themselves (see e.g. “A Survey of Web Content Control for Generative AI” and the opt-out in “img2dataset”).

Using these tags may require adjusting server settings. This way of declaring a reservation is therefore not necessarily intuitive and not easy for every rights holder to manage – but it allows the presence of a reservation to be checked when each individual file is accessed.
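As an illustration of the server-side variant: on an Apache server with the headers module (mod_headers) enabled, the header could be set for all image files roughly as follows (the file extensions and the values “noai”/“noimageai” are examples of the emerging, not yet standardized values mentioned above):

```apache
# Send an X-Robots-Tag header with every delivered image file.
<FilesMatch "\.(png|jpe?g|gif|webp)$">
    Header set X-Robots-Tag "noai, noimageai"
</FilesMatch>
```

For individual HTML pages, the same information can instead be declared with a meta tag of the form <meta name="robots" content="noai, noimageai"> in the page’s head section.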

Disadvantages:

  • To use these tags, the website’s HTML or the server settings may need to be adjusted.
  • There is currently no generally accepted list of valid values that crawlers could use for orientation.
  • This method, too, offers no guarantee that crawlers will respect the specifications.

Advantages:

  • Verification is particularly easy for crawlers/scrapers because the information is available in a structured format and can be checked directly when an individual file – for example an image – is downloaded.
  • A reservation can be declared in a granular way; the tag “noimageai”, for example, specifically addresses crawlers that collect images.
  • Initial plugins for major content management systems are already available to facilitate the use of the X-Robots-Tag.
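From a crawler’s perspective, checking such a header amounts to a simple token comparison. A minimal sketch in Python (the set of reservation values and the function name are our own illustration, not a standard):

```python
# Emerging, not yet standardized reservation values (see above).
RESERVATION_VALUES = {"noai", "noimageai", "noml"}

def has_tdm_reservation(header_value: str) -> bool:
    """Check whether an X-Robots-Tag value contains a known reservation token."""
    tokens = {token.strip().lower() for token in header_value.split(",")}
    return not RESERVATION_VALUES.isdisjoint(tokens)

print(has_tdm_reservation("noindex, noai"))      # True
print(has_tdm_reservation("noindex, nofollow"))  # False
```

Real header values may additionally carry a user-agent prefix (e.g. “somebot: noai”), which a production crawler would have to handle as well.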


3. TDM Reservation Protocol

The World Wide Web Consortium (W3C), which is responsible for standardization in many other areas, now hosts a group addressing the topic of text and data mining reservations. It is currently working on a uniform way to declare such a reservation (see also the group’s GitHub repository).

Central to this is a value “tdm-reservation”, which is set either to “1” (reservation declared) or “0” (no reservation declared). The current version of the TDM Reservation Protocol suggests various locations where this value can be placed: in a structured file (JSON format), in HTTP headers, in HTML files, or even in e-book formats such as EPUB.
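For the JSON variant, the current draft provides for a file at the well-known path /.well-known/tdmrep.json that maps path patterns to reservation values. A minimal example (the paths are invented, and details may still change while the protocol is a draft):

```json
[
  {
    "location": "/images/*",
    "tdm-reservation": 1
  },
  {
    "location": "/",
    "tdm-reservation": 0
  }
]
```

Here a reservation is declared for everything under /images/, while the rest of the site remains without a reservation.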

Disadvantages:

  • See X-Robots-Tag – potentially not easily accessible; however, if the standard becomes established, website configuration interfaces and common content management systems can be expected to follow soon.
  • No guarantee that crawlers will respect the specifications.
  • No options for granular differentiation.

Advantages:

  • The W3C is a recognized standardization body with corresponding reach.
  • The protocol documents specify precisely how the declaration is to be made – transparent both for rights holders and for those interested in crawling.
  • The focus on a simple yes/no (1/0) value should make verification efficient for crawlers.
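A crawler-side check of the HTTP-header variant of the protocol then reduces to reading a single value. A sketch (the plain dictionary stands in for the headers of a real HTTP response; per the current draft, the header name is “tdm-reservation”):

```python
def tdm_reserved(headers: dict) -> bool:
    """Return True if response headers declare a TDM reservation per the draft protocol."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return normalized.get("tdm-reservation") == "1"

print(tdm_reserved({"Content-Type": "image/jpeg", "tdm-reservation": "1"}))  # True
print(tdm_reserved({"Content-Type": "image/jpeg"}))                          # False
```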

This page is continuously updated. Last update: September 30, 2024.