Text and Data Mining Reservation

Since June 2021, German copyright law has established that copyrighted works may be reproduced for text and data mining purposes, even outside of scientific research. What exactly constitutes “text and data mining,” and whether every type of AI model training falls under this category, is actively debated elsewhere (see, for example, Käde, CR 9/2024, 598).

However, it is clear that § 44b of the German Copyright Act (UrhG), which declares such reproductions permissible even without a license, also provides that rights holders can declare a reservation against such reproductions (§ 44b para. 3 UrhG). For works available online, this must be done in “machine-readable form.” Yet there is still no consensus on what “machine-readable” means or how such a reservation should be implemented in practice. In the legislative reasoning, the German legislature assumes that a reservation can also be declared in terms of use or in a website’s imprint, “if it is machine-readable there” (cf. BT-Drs. 19/27426, p. 89).

Without going further into the terminology here, we are collecting options and information on how such a reservation can be designed, since numerous practical implementation approaches are already emerging.

On September 27, 2024, the Hamburg Regional Court issued a decision in a case in which the effectiveness of a reservation was ultimately irrelevant, because the defendant could rely on § 60d UrhG (text and data mining for scientific research, against which no reservation can be declared). In an obiter dictum, the court nevertheless addressed the question of machine readability and assumed that a reservation formulated in natural language also meets the requirements of § 44b para. 3 UrhG. While this is favorable for rights holders, it could mean considerable practical effort for those who want to rely on the limitation. The options for declaring a reservation presented below are therefore intended above all as suggestions to rights holders on how to declare a reservation in a form that makes it easier for interested parties to take it into account.

We also note that we naturally cannot say with certainty that the approaches collected here meet whatever requirements courts may set for machine readability – too little reliable information is available for that at present.

0. Why Machine Readability is Important

To train an AI model, large amounts of training data are typically required. For example, to train an AI that can analyze images, large quantities of suitable images must be collected. If an existing collection available online cannot or should not be used, one way to obtain such a collection is to search the web for images. Since doing this manually would take far too long, automated programs (crawlers or scrapers) are developed that systematically search websites for images or links to images and download them. For these programs to respect the wishes of rights holders who do not want their works – here, images – to be downloaded for text and data mining purposes, they must implement a function that checks each visited website (or, ideally, each requested image) for the presence of such a reservation.

1. Robots.txt

In discussions of machine readability, the robots.txt file is invariably mentioned. This file format has been used for many years for communication between website operators and search engine bots. In a standardized text format (the Robots Exclusion Protocol, now specified in RFC 9309), the file specifies which areas of a site may be searched by (search engine) bots and which should not be indexed (example: jbb.de/robots.txt). However, the file provides no means of enforcing these rules.

Since bots are sometimes used for the automated collection of AI training data, this file can also be used to declare a corresponding reservation. And because robots.txt is a plain text file, basically any other text can be included as well, such as human-readable text explaining the reservation.

Disadvantages:

  • To effectively exclude bots, the bot’s name must be known. This can be determined for major known players, but as soon as new bots with new names appear, they cannot be directly addressed. A blanket exclusion of bots means that search engines will also no longer crawl the site.
  • When using human-readable texts, crawlers face the challenge of recognizing and analyzing them. This could lead to considerable delays in the crawling process.

Advantages:

  • The use of this file in automated systems is already an established standard.
  • Handling is simple because it’s sufficient to place this file in the website’s root directory.
  • Specific folder paths that should not be crawled can be specified.

 

2. X-Robots-Tags (HTTP Headers / HTML meta-tags)

Another possibility is defining “X-Robots-Tags” in the server configuration (HTTP headers) or in HTML (meta tags). More detailed information with examples can be found in Google’s developer documentation. A simplified explanation with instructions can also be found in Käde, CR 9/2024, 598, 602. In principle, arbitrary text values can be specified for these tags (including freely invented ones such as “noTDMinGermany”), but values like “noai”, “noimageai” and “noml” appear to be establishing themselves (see, e.g., “A Survey of Web Content Control for Generative AI” and the opt-out in “img2dataset”).

Using these tags may require adjusting server settings. Therefore, this way of declaring a reservation is not necessarily intuitive and not easy for all rights holders to manage – but it allows checking for the presence of a reservation when accessing each individual file.

Disadvantages:

  • To use these tags, the website’s HTML or the server settings may need to be adjusted.
  • There is currently no known general list of valid values that crawlers can use for orientation.
  • This method also offers no guarantee that crawlers will follow the specifications.

Advantages:

  • Verification is particularly easy for crawlers/scrapers because the information is available in a structured format and can be checked directly when a file is downloaded to see whether a reservation exists.
  • A reservation can be declared in a granular way; for example, the value “noimageai” can specifically address crawlers that collect images.
  • Initial plugins for major content management systems are already being offered to facilitate the use of X-Robots-Tags.
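As a sketch of how such a check could look on the crawler side (the function name and the set of recognized values are our own assumptions, not part of any standard; the user-agent-prefixed header form is ignored here for simplicity):

```python
# Values that appear to be establishing themselves in practice;
# there is no authoritative list (see the disadvantage above).
RESERVATION_VALUES = {"noai", "noimageai", "noml"}

def has_tdm_reservation(headers: dict[str, str]) -> bool:
    """Return True if the X-Robots-Tag header of an HTTP response
    contains a value commonly used to declare a TDM reservation.
    Header names are matched case-insensitively; values are split
    on commas, as in 'noindex, noai'."""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            tokens = {t.strip().lower() for t in value.split(",")}
            if tokens & RESERVATION_VALUES:
                return True
    return False

# Example with hypothetical response headers:
print(has_tdm_reservation({"X-Robots-Tag": "noindex, noai"}))  # True
print(has_tdm_reservation({"X-Robots-Tag": "noindex"}))        # False
```

A crawler could call such a function on every response before saving the file, which is precisely the per-file granularity this method offers.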

 

3. TDM Reservation Protocol

The World Wide Web Consortium (W3C), which is also responsible for standardization in many other areas, now hosts a group addressing the topic of text and data mining reservations. It is currently working on a unified way to declare such a reservation (see also its GitHub repository).

The protocol offers two approaches to declaring a reservation: tdm-reservation and tdm-policy. With tdm-reservation, a value is set to either “1” (reservation declared) or “0” (no reservation declared). The current version of the TDM Reservation Protocol then suggests various locations where this value can be placed, for example in a structured file (in JSON format), in HTTP headers, in HTML files, or even in e-book formats such as EPUB. In addition, a more detailed and differentiated reservation can be declared with tdm-policy; the protocol notes that, for a machine-readable declaration, the policy must be provided in the format application/json or application/ld+json.
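As a sketch of the JSON-file variant: the draft suggests a file at /.well-known/tdmrep.json listing location patterns with a tdm-reservation value. The field names follow our reading of the draft and may change; the prefix matching below is a deliberate simplification of the protocol’s own matching rules:

```python
import json

# Hypothetical content of a /.well-known/tdmrep.json file,
# following our reading of the draft TDM Reservation Protocol.
sample = json.loads("""
[
  {"location": "/images/*", "tdm-reservation": 1,
   "tdm-policy": "https://example.com/tdm-policy.json"},
  {"location": "/*", "tdm-reservation": 0}
]
""")

def tdm_reserved(rules: list[dict], path: str) -> bool:
    """Return True if the first matching rule declares a reservation.
    Matching here is a simplified prefix match on the pattern before
    '*'; the real protocol defines its own matching semantics."""
    for rule in rules:
        prefix = rule["location"].rstrip("*")
        if path.startswith(prefix):
            return bool(rule["tdm-reservation"])
    return False

print(tdm_reserved(sample, "/images/cat.jpg"))      # True
print(tdm_reserved(sample, "/texts/article.html"))  # False
```

The simple 1/0 value is what makes the check cheap for a crawler: one file fetch per site, then a plain lookup per requested path.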

Disadvantages:

  • See X-Robots-Tags – potentially not easy to set up, although website configuration interfaces and common content management systems can be expected to add support soon if this standard becomes established.
  • There is no guarantee that crawlers will follow the specifications.

Advantages:

  • The W3C is a recognized standardization body with corresponding reach.
  • The protocol documents specify precisely how the declaration is to be made – transparent for rights holders and for those interested in crawling.
  • The focus on a simple yes/no (1/0) value should make verification more efficient for crawlers.

 

4. Further Options

In addition to the options already mentioned, further possibilities for declaring a machine-readable reservation are occasionally being discussed:

A similar approach to robots.txt is taken by the spawning.ai project. Through its website, an ai.txt file can be generated that allows or prohibits the use of content for AI models. According to the project, this does not affect discoverability in search engines (“Will ai.txt file impact my website’s SEO? No, the ai.txt file is specifically designed for AI miners and does not impact traditional search engine crawlers or your website’s SEO”).

So far, limited attention has been given to the TDM·AI protocol, introduced in May 2024, which links machine-readable opt-out declarations for AI training data to digital media files. TDM·AI specifically addresses text and data mining for AI training. By leveraging the International Standard Content Code (ISCC, ISO 24138:2024), a new ISO standard for identifying digital media content, together with Creator Credentials, it aims to ensure that rights holders’ preferences are expressed in verifiable, machine-readable statements with proper attribution.

Moreover, the Internet Engineering Task Force (IETF), a leading organization for the development of internet standards, formed a working group in February 2025 (AI Preferences Working Group – AIPREF) that will focus on standardizing mechanisms for expressing preferences about how content is collected and processed for the development, deployment, and use of AI models.

 

This page is continuously updated. Last update: March 11, 2025.