Ensuring the lawfulness of the data processing - Defining a legal basis

07 June 2024

An organisation that wishes to build a training dataset containing personal data and then use it to develop an AI system must ensure that the processing is lawful. The CNIL helps you determine your obligations based on your responsibility and the means of collecting or reusing the data.

This content is a courtesy translation of the original publication in French. In the event of any inconsistencies between the French version and this English translation, please note that the French version shall prevail.

The controller must in all cases define a legal basis and carry out, depending on the method of collection or re-use of the data, certain additional verifications.

There are several ways to build a training dataset, which can be used cumulatively:

data is collected directly from individuals;
data is collected from open sources on the Internet for this purpose;
data was initially collected for another purpose by the controller itself (e.g. in the context of providing a service to its users) or by another controller. This involves taking additional precautions.

Define a legal basis

The legal basis for consent

To be valid, the consent of the data subjects must meet four cumulative criteria: it must be freely given, specific, informed and unambiguous. The controller must be able to demonstrate the validity of the use of this legal basis by ensuring that each of these conditions, specifically defined by the GDPR, is met.

Example: an organisation wishes to film or photograph volunteers to create a dataset of images to train a system to detect certain specific gestures. It may base the processing on their consent.

When creating a training dataset, an organisation must ensure the validity of the consent collected.

Beyond the obligations of transparency, a certain amount of information must be provided to the data subjects before they consent, in order to enable them to make informed decisions and to allow them to withdraw their consent.

Consent must relate to a specific purpose (see how-to sheet 2 on the definition of the purpose).

The freedom of consent implies, in principle, the possibility for data subjects to give their consent in a granular way, where there are different purposes.

Example: the consent of individuals to the use of their image, collected at a company event for communication purposes, does not mean that they consent to a re-use of the data for building a training dataset or improving an AI system. In this case, two separate consents must be collected (e.g. via two check boxes).
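By way of illustration only, the sketch below shows one possible way of recording consent separately for each purpose, so that consent given for one purpose is never read as consent for another. The class name, the field names and the purpose labels ("event_communication", "ai_training") are hypothetical and are not prescribed by the CNIL or the GDPR.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ConsentRecord:
    """Consent of one person, stored per purpose (hypothetical structure)."""
    subject_id: str
    # Maps a purpose label to the time consent was given.
    # A purpose that is absent from the mapping has no consent.
    granted: dict[str, datetime] = field(default_factory=dict)

    def give(self, purpose: str) -> None:
        """Record consent for one specific purpose (e.g. one ticked check box)."""
        self.granted[purpose] = datetime.now(timezone.utc)

    def withdraw(self, purpose: str) -> None:
        """Remove consent for one purpose without affecting the others."""
        self.granted.pop(purpose, None)

    def allows(self, purpose: str) -> bool:
        """Return True only if consent was given for this exact purpose."""
        return purpose in self.granted


# The person ticked only the "communication" check box at the event.
record = ConsentRecord(subject_id="participant-001")
record.give("event_communication")
```

In this sketch, record.allows("ai_training") stays false until the person separately agrees to that purpose, and withdrawing one consent leaves the other untouched.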

The freedom of consent may also be impacted in the case of an imbalance of power in the relationship between the data subject and the controller, especially if the controller is a public authority or an employer.

Example: a company wants to use the data of its employees to develop an AI system. Their consent can only be validly collected in exceptional situations, where they are able to refuse to give their consent without fear of incurring negative consequences. As controller, the company must ensure, in any event, that the communications intended to present the scheme to employees are neither incentivising nor coercive. It must inform the volunteers that they may stop participating in the collection of their data at any time, without any consequence.

In some cases, it does not seem possible to obtain valid consent. This is often the case when the controller collects data accessible online or reuses a dataset available online, especially given the lack of contact with the data subjects and the difficulty in identifying them. In these cases, the controller must rely on a more appropriate legal basis.

There may also be difficulties related to the right to withdraw consent, for example due to technical obstacles to the identification of data subjects. If it is not possible for the controller to guarantee the possibility of exercising this right, it is recommended to rely on another legal basis.

The legal basis for the legitimate interest

The controller may rely on its legitimate interest provided that it complies with the following conditions:

the legitimacy of the interest pursued by the controller. For example, an organisation may have an interest in developing a model in order to market an AI system, or in contributing to the improvement of scientific knowledge, in particular by publishing the tools developed (code, model, experimentation protocol, etc.) and the research results.
the necessity of the data processing. For example, processing for the purpose of creating a training dataset containing images of people may be considered necessary for the interests of an organisation wishing to develop a pose estimation system, where anonymous or synthetic data are not sufficient.
the absence of a disproportionate impact on data subjects’ interests and rights and freedoms, taking into account their reasonable expectations. Balancing of the rights and interests at hand depends on the specific characteristics of the processing and in particular on the safeguards implemented to ensure the best possible balance between those interests and to limit the impact of the processing on the data subjects.

More often than not, creating a training dataset whose use is lawful can be considered legitimate. However, an analysis is necessary to determine whether the use of personal data for this purpose disproportionately infringes the privacy of the data subjects, even when the data does not include names. To guarantee that its processing is proportionate, the controller may implement measures such as: pseudonymisation of the data, ensuring the absence of sensitive data, defining selection criteria to limit the collection to the relevant and necessary data, etc.
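As an illustration of the pseudonymisation measure mentioned above, the following minimal sketch replaces a direct identifier with a keyed hash (HMAC-SHA256) before a record enters the training dataset. The function names, the field name "username" and the key are assumptions made for the example. Keyed hashing is only one possible technique, and pseudonymised data remains personal data under the GDPR.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymise_record(record: dict, id_field: str, secret_key: bytes) -> dict:
    """Return a copy of the record with the identifier field pseudonymised."""
    cleaned = dict(record)  # shallow copy: the original record is left unchanged
    cleaned[id_field] = pseudonymise(str(record[id_field]), secret_key)
    return cleaned


# Hypothetical key, for illustration only. In practice it would be generated
# randomly and stored separately from the dataset.
KEY = b"example-key-kept-outside-the-dataset"

sample = {"username": "alice_92", "comment": "I loved this painting."}
result = pseudonymise_record(sample, "username", KEY)
```

The same identifier always maps to the same pseudonym under a given key, which keeps records linkable within the dataset. Whoever holds the key can recompute the mapping, which is why the sketch assumes it is kept apart from the data.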

Examples: A company wants to develop an AI system that can predict a person’s psychological profile from online data that may relate to them. Its commercial interest in developing such a system is likely to be insufficient in the light of the interests, rights and freedoms of data subjects: another legal basis will have to be sought or the project abandoned.

An organisation creates a training dataset by collecting comments made public and freely accessible by online users on forums, blogs and websites. The purpose of this processing is to design an AI system to assess and predict the appreciation of works of art by the general public. In this case, its interest in developing and possibly marketing an AI system may be considered legitimate. The collection of feedback on the works may be considered necessary for the development of the model, especially given the amount of training data required. It should be noted that the legal basis of legitimate interest gives data subjects the right to object to the processing of their data (for reasons relating to their particular situation).

The legal basis of the task carried out in the public interest

The legal basis of the contract

The legal basis of the contract could be used for the creation of a training dataset for an AI system provided that a valid contract is concluded between the controller and the data subject and that the processing is objectively necessary for its performance.

Contracts concluded for this purpose must comply with other applicable rules, such as labour law or intellectual property.

Examples: A company that publishes text editing software offers an automated and personalised mail generation service, to which the user contractually subscribes, and for which the company collects the data of the users of this service. The data processing for this personalisation service may be considered, subject to its specific characteristics, necessary for the performance of the contract.

Conversely, the operator of an online social network has stated in its general terms and conditions of use that it intends to reuse the data of its users (provided by them, observed or inferred by the operator) to develop and improve new products, services and functionalities useful for its users. It cannot base the processing on the legal basis of the contract since such processing is not objectively necessary in order to offer them its online social network service (ECJ, 4 July 2023, Meta Platforms Inc. and Others v Bundeskartellamt, C-252/21).

Sensitive data: prohibited processing, with exceptions

Sensitive data is a particular category of personal data defined in Article 9 of the GDPR. Sensitive data includes, for instance, data revealing the alleged racial or ethnic origin of the data subjects, or biometric data for the purpose of uniquely identifying a natural person, such as a facial template.

The GDPR prohibits the processing of such data, except in the cases listed in Article 9(2) of the GDPR. These exceptions include in particular:

processing operations for which data subjects gave their explicit consent (active, explicit and preferably written, freely given, specific and informed);
processing of personal data which is manifestly made public by the data subject;

In its Guidelines on targeting users of social networks, the EDPB provides a list of factors to be taken into account in determining whether the data is manifestly made public: the default setting of the social media platform, the nature of the platform, the accessibility of the page concerned, the visibility of the information about its public nature, and whether the data subject published the data themselves or whether it was published by a third party or inferred.

It is important to check whether the data subject wished, explicitly and by a clear positive act, on the basis of an informed setting, to make his or her personal data accessible to the general public or, on the contrary, to a more or less limited number of selected persons (ECJ, 4 July 2023, Meta Platforms, C-252/21).
processing necessary for reasons of substantial public interest, on the basis of EU or Member State law;
processing operations necessary for the purpose of scientific research on the basis of European Union or Member State law.

Particular attention should be paid to the collection of sensitive data when using web scraping tools that involve the processing of large volumes of data. The controller has to implement measures to automatically exclude the collection of irrelevant sensitive data, in particular by applying filters to exclude the collection of certain categories of data or to exclude certain sites that by nature gather sensitive data. If, despite the measures taken, the organisation incidentally and residually processes sensitive data that it had not sought to collect, this is not considered unlawful. In particular, the Court of Justice of the European Union held that the prohibition applies to the operator of a search engine “in the context of its responsibilities, powers and possibilities” (ECJ, Grand Chamber, 24 September 2019, GC and Others, C-136/17). On the other hand, if the organisation becomes aware that it is processing sensitive data, it has to proceed, as far as possible, to its immediate and automated deletion.
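The two kinds of filter described above (excluding certain sites and excluding certain categories of data) could be sketched as follows. This is a deliberately simplified illustration: the domain names, the list of terms and the function names are all hypothetical, and a plain keyword match is a crude stand-in for whatever detection method a controller would actually choose.

```python
from urllib.parse import urlparse

# Hypothetical blocklist of sites that by nature gather sensitive data.
EXCLUDED_DOMAINS = {"health-forum.example", "dating-site.example"}

# Hypothetical and deliberately crude markers of sensitive content.
SENSITIVE_TERMS = {"diagnosis", "religion", "trade union"}


def is_excluded_site(url: str) -> bool:
    """Return True if the URL belongs to a blocklisted domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in EXCLUDED_DOMAINS)


def looks_sensitive(text: str) -> bool:
    """Return True if the text contains one of the listed terms (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in SENSITIVE_TERMS)


def filter_documents(documents):
    """Keep only the (url, text) pairs that pass both filters."""
    return [
        (url, text)
        for url, text in documents
        if not is_excluded_site(url) and not looks_sensitive(text)
    ]


docs = [
    ("https://art-blog.example/post/1", "The colours in this exhibition are stunning."),
    ("https://www.health-forum.example/thread/9", "Great museum near the clinic."),
    ("https://art-blog.example/post/2", "After my diagnosis I started painting."),
]
kept = filter_documents(docs)
```

Here only the first document is kept: the second comes from a blocklisted site and the third contains a listed term. The same check could be run again over data already stored, in order to delete sensitive items that are discovered after collection.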

Please note:

A how-to sheet on bias management will be published at a later date. It will clarify the possibility of processing sensitive data for the purpose of detecting and correcting bias in the training dataset.
The CNIL is currently conducting work on the issue of AI in health, which will be published later.