Ensuring the lawfulness of the data processing
An organisation that wishes to build a training dataset containing personal data, and then use that dataset to build an AI system, must ensure that the processing is permitted by law. The CNIL helps you determine your obligations based on your responsibility and the means of collecting or reusing the data.
The controller must in all cases define a legal basis and carry out, depending on the means of collection or re-use of the data, certain additional checks.
There are several ways to build a dataset for training purposes:
- the data are collected directly from individuals;
- the data are indirectly collected from open sources on the Internet for this purpose;
- the data were initially collected for another purpose by the controller itself (e.g. in the context of providing a service to its users) or by another controller. This means taking additional precautions.
Define a legal basis
The principle
Like any personal data processing, the creation and use of a dataset containing personal data can only be implemented if it corresponds to one of the “legal bases” provided for in the GDPR.
Specifically, the legal basis is what gives an organisation the right to process personal data. The choice of this legal basis is therefore an essential first step to ensure compliance of the processing. Depending on which one is selected, the obligations of the organisation and the rights of individuals may vary.
The most relevant legal bases for designing an AI system are detailed below.
In practice
The determination of the legal basis must be carried out in a manner appropriate to the situation and the type of processing. In order to create a dataset for the training of an AI system, the following legal bases may be envisaged, depending on the characteristics of the processing.
The legal basis of consent
To be valid, the consent of the data subjects must meet four cumulative criteria: it must be freely given, specific, informed and unambiguous. The controller must be able to demonstrate the validity of the use of this legal basis by ensuring that each of these conditions, precisely defined by the GDPR, is met.
Example: an organisation wishes to film or photograph volunteers to create a dataset of images to train a system to detect certain specific gestures. It may base the processing on their consent.
When building a dataset for training an AI model, an organisation must notably ensure that the consent is freely given.
In principle, this means ensuring the possibility for data subjects to give their consent on a case-by-case basis (granularly) where the intended purposes are distinct.
Example: the consent of individuals to the use of their image, collected at a company event for communication purposes, does not mean that they consent to a re-use of the data for building a training dataset or improving an AI system. In this case, two separate consents must be collected (e.g. via two check boxes).
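The idea of one consent per distinct purpose can be illustrated with a minimal sketch. It assumes a hypothetical record structure (the class, field and purpose names below are illustrative and do not come from the CNIL or the GDPR) in which each purpose is stored and checked separately, so that consent to one purpose is never read as consent to another.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ConsentRecord:
    """Consent of one person, stored separately per purpose (hypothetical schema)."""
    subject_id: str
    # Maps a purpose name to the UTC time at which consent was given.
    granted: dict[str, datetime] = field(default_factory=dict)

    def give(self, purpose: str) -> None:
        self.granted[purpose] = datetime.now(timezone.utc)

    def withdraw(self, purpose: str) -> None:
        # Withdrawing consent that was never given is a no-op.
        self.granted.pop(purpose, None)

    def allows(self, purpose: str) -> bool:
        return purpose in self.granted


def usable_for(records: list[ConsentRecord], purpose: str) -> list[str]:
    """Return the ids of the people whose consent covers this exact purpose."""
    return [r.subject_id for r in records if r.allows(purpose)]


alice = ConsentRecord("alice")
alice.give("event_communication")   # ticked the first check box only

bob = ConsentRecord("bob")
bob.give("event_communication")     # ticked both check boxes
bob.give("ai_training_dataset")

print(usable_for([alice, bob], "ai_training_dataset"))  # prints ['bob']
```

In this sketch, only the images of people who ticked the second box would be eligible for the training dataset. A record of this kind does not by itself show that the consent was free, specific, informed and unambiguous.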
The freedom of consent must also be subject to some vigilance in the event of an imbalance of power between the data subject and the controller, in particular if the controller is a public authority or an employer.
Example: to develop an AI system, a company wants to use the data of its employees. Their consent can only be validly collected in exceptional situations, where they are able to refuse to give it without fear of incurring negative consequences. As the controller, the company must ensure, in any event, that the communications presenting the scheme to employees neither induce nor pressure them to take part. It must inform the volunteers that they may stop participating in the collection of their data at any time, without any consequence.
It does not appear possible to obtain valid consent in some cases. This is often the case when the controller collects publicly accessible data online or reuses an open dataset, especially given the lack of contact with the data subjects and the difficulty in identifying them. In these cases, where the conditions for obtaining valid consent are not met, the controller must rely on another, more appropriate, legal basis.
The legal basis of the legitimate interest
The legitimate interest of the controller may only be considered under the following conditions:
- the legitimacy of the interest pursued by the controller. For example, the interest of an organisation in developing a model for the commercialisation of an AI system or contributing to the improvement of scientific knowledge may be considered legitimate, for example by publishing the tools developed (code, model, experimental protocol, etc.) and research results.
- the need for data processing to meet this legitimate interest. For example, processing for the purpose of creating a training dataset containing images of people may be considered necessary for the interests of an organisation wishing to develop a pose estimation system, where anonymous or synthetic data may not be sufficient.
- the absence of disproportionate interference with the interests and rights of data subjects, taking into account their reasonable expectations of such processing. The balancing of the rights and interests in question depends on the specific characteristics of the envisioned processing and in particular on the safeguards implemented to ensure the best possible equilibrium between those interests and limit the impact of the processing on the data subjects.
More often than not, creating a training dataset whose use is itself lawful can be considered legitimate. However, a case-by-case analysis is necessary to determine whether the very use of personal data for this purpose does not disproportionately infringe the privacy of the data subjects, even when the data are not personally identifiable in a direct manner. To ensure that its processing is proportionate, the controller may rely on measures such as anonymisation or pseudonymisation of the data, ensure the absence of sensitive data, define selection criteria to limit the collection to the data relevant and necessary for processing, etc.
Examples:
- A company wants to develop an AI system that can predict a person’s psychological profile from online data that may be linked to them. Its commercial interest in developing such a system is likely to be insufficient in the light of the interests, rights and freedoms of data subjects: another legal basis will have to be sought or the project abandoned.
- An organisation builds a training dataset by collecting comments made public and freely accessible by online users on forums, blogs and websites. The purpose of this processing is to design an AI system to evaluate and predict the appreciation of works of art by the general public. In this case, the interest of the organisation in developing and possibly marketing an AI system may be considered legitimate. The collection of feedback on the works may be considered necessary for the development of the model, especially given the amount of data required for training. It should be noted that the legal basis of legitimate interest gives data subjects the right to object to the processing of their data (on grounds relating to their particular situation).
The legal basis of the task carried out in the public interest
The possibility to rely on that legal basis presupposes:
- that the task is provided for in a normative text applicable to the controller;
- that the use of the data makes it possible to carry out this task specifically (this will not be the case if it pursues an objective which is unrelated to it or is too far removed from its particularities), in a relevant and appropriate manner.
Examples:
- Researchers from a public research laboratory working on the French language wish to analyse the evolution of the use of the language online. For this, they constitute a dataset based on comments published freely online on various social networks (anonymised at short notice) in order to train a model that automatically detects and analyses the occurrence of certain expressions or spelling forms.
To the extent that the controller is a public laboratory, in this case the researchers may base the data processing on the task carried out in the public interest. This legal basis can be used, in general, for data processing carried out by public or private research laboratories entrusted with a task of public interest, the processing of which is necessary for their research activity.
- The Pôle d’Expertise de la Régulation Numérique (PEReN) is authorised to reuse, under certain conditions, publicly accessible data from certain platforms in order to carry out experiments aimed in particular at designing technical tools for the regulation of online platform operators, in accordance with Article 36 of Law No 2021-1382 of 25 October 2021 and Decree No 2022-603 of 21 April 2022.
For more information:
- Use case sheet No 4 of the Guide on the re-use of publicly accessible data (open data)
- What legal basis for research processing?
The legal basis of the contract
The legal basis of the contract could be used for the creation of a training dataset provided that a valid contract is concluded between the controller and the data subject and that the processing is objectively necessary for its performance.
Examples:
An organisation can call on professional actors to perform certain stagings and collect specific images for training an AI system.
If the purpose of the contract concluded is precisely to collect images with a view to constituting a training dataset, then it is possible to consider, subject to the specific characteristics of the data processing, that it is necessary for the performance of the contract. The controller must also ensure that there is a valid contract between the organisation and the actors, who must be parties to the contract. Contracts concluded for this purpose must comply with other applicable rules, such as labour law or intellectual property.
Conversely, the operator of an online social network states in its general terms of use that it intends to reuse the data of its users (provided by them, or observed or inferred by the operator) to develop and improve new products, services and functionalities useful for its users. It cannot base the processing on the legal basis of the contract, since such processing is not objectively necessary in order to offer them its online social network service (CJEU, 4 July 2023, Meta Platforms Inc. and Others v Bundeskartellamt, C-252/21).
Sensitive data: prohibited processing, with exceptions
Sensitive data is a particular category of personal data defined in Article 9 of the GDPR. For example, sensitive data include data revealing the alleged racial or ethnic origin of data subjects, and biometric data processed for the purpose of uniquely identifying a natural person, such as a facial template.
The GDPR prohibits the processing of such data, except in the cases listed in Article 9.2. These exceptions include in particular:
- the processing operations for which the data subject has given his or her explicit consent (active, explicit and preferably written, which must be free, specific and informed);
- the processing of personal data which are manifestly made public by the data subject;
- the processing necessary for reasons of substantial public interest, on the basis of EU or Member State law;
- the processing operations necessary for the purpose of scientific research on the basis of European Union or Member State law.
In case of re-use of data, carry out the necessary additional tests and checks
The principle
In some cases, depending on the means of collection and the source of the data, the controller that wants to create a training dataset is required to carry out certain checks to ensure that the processing of data is authorised by law. These checks must be conducted in addition to the identification of the legal basis for the data processing.
In practice
The provider reuses the data it originally collected for another purpose
A data controller may wish to reuse the data it has collected for an initial purpose (e.g. in the context of providing a service to individuals) in order to create a dataset for the purpose of training an AI system.
In that case, the controller must determine whether that further processing is compatible with the purpose for which the data were originally collected, if the processing is not based on the data subject’s consent or on Union or Member State law.
The obligation to carry out this “compatibility test” applies to subsequent processing of data, i.e.:
- which have not been foreseen or brought to the attention of data subjects when collecting the data;
- which are carried out by the same controller who decides to reuse data for a purpose distinct from the purpose for which it was collected, including when it comes to publishing it on the Internet or sharing it with third parties for re-use for another purpose.
Please note: no compatibility test is required for purposes that were intended and brought to the attention of the data subjects when the data were collected, in accordance with the principle of transparency, including where some of those purposes may appear secondary or accessory. For example, the sharing of data by a controller with its processor for the improvement of the performance of its algorithm does not require a compatibility test, if this purpose was intended and brought to the attention of the data subject (subject to compliance with the conditions of lawfulness for this purpose of improving the algorithm).
In order to carry out this ‘compatibility test’, the controller must take into account in particular:
- any link between the initial purpose and the purpose of the intended further processing;
- the context in which the personal data have been collected, in particular regarding the reasonable expectations of the data subjects, depending on the relationship between the data subjects and the controller;
- the type and nature of the data, in particular according to its sensitivity (biometric data, geolocation data, data relating to children, etc.);
- the possible consequences of the intended further processing for the data subjects;
- the existence of appropriate safeguards (such as encryption or pseudonymisation).
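The conditions described so far can be sketched as a small documentation aid. This is a minimal sketch under stated assumptions: the class, field and function names are hypothetical, the first function only encodes the two exemptions mentioned above (processing based on consent or on Union or Member State law, and purposes announced at collection), and the second only checks that a written assessment exists for each of the five factors. The assessment itself is a case-by-case judgement that code cannot make.

```python
from dataclasses import dataclass


@dataclass
class Reuse:
    """Facts about a planned re-use by the same controller (hypothetical fields)."""
    based_on_consent: bool                 # further processing relies on the data subject's consent
    based_on_law: bool                     # or on Union or Member State law
    purpose_announced_at_collection: bool  # purpose was intended and notified at collection


def compatibility_test_required(reuse: Reuse) -> bool:
    """True when, on the conditions above, a compatibility test is called for."""
    if reuse.based_on_consent or reuse.based_on_law:
        return False
    if reuse.purpose_announced_at_collection:
        return False
    return True


# The five factors listed above; each needs a written, case-specific assessment.
FACTORS = (
    "link between initial and new purpose",
    "context of collection and reasonable expectations",
    "type and nature of the data",
    "possible consequences for data subjects",
    "appropriate safeguards",
)


def missing_assessments(notes: dict[str, str]) -> list[str]:
    """List the factors for which no assessment has been written yet."""
    return [f for f in FACTORS if not notes.get(f, "").strip()]


planned = Reuse(based_on_consent=False, based_on_law=False,
                purpose_announced_at_collection=False)
notes = {"appropriate safeguards": "pseudonymisation before training"}

print(compatibility_test_required(planned))
print(missing_assessments(notes))
```

Keeping the factors as an explicit list makes it harder to document a test that silently skips one of them.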
Examples:
- The provider of a text editor launches a generative AI feature to complete certain sentences or paragraphs. Sometime after deploying this functionality, it intends to re-use the manual corrections made by users to the content of the texts generated in this way, in order to offer each user a personalised version of its recommendation service (for example, to better understand and anticipate their writing style) based on their respective data.
- A consumer video streaming platform is now considering reusing the history and playlists it has saved as part of the provision of the service to offer each user a personalised version of its recommendation service (e.g. to better anticipate and understand their preferences) based on their respective data.
In both cases, the new purpose may be considered compatible with the original purpose of the provision of the service, provided that the guarantees implemented are sufficient (e.g. through the possibility of objecting to such re-use without having to provide any reason).
Where the re-use of the data pursues statistical or scientific research purposes, the processing is presumed to be compatible with the original purpose if it complies with the GDPR and if it is not used to make decisions regarding the data subjects. The ‘compatibility test’ is therefore not necessary.
It should be noted that, in order to pursue a statistical purpose within the meaning of the GDPR, the processing must aim only at the production of aggregated data for their own sake: the sole purpose of the processing must be the calculation of the data, their display or publication, or their possible sharing or communication (and not the taking of subsequent decisions, individual or collective). The statistical results obtained must constitute aggregated and anonymous data within the meaning of the data protection regulations.
The notion of “scientific research” is broadly understood in the GDPR. In summary, the aim of the research is to produce new knowledge in all areas in which the scientific method is applicable. Any processing of data for scientific research purposes must be subject to appropriate safeguards for the rights and freedoms of the data subject, such as anonymisation or pseudonymisation (article 89 GDPR).
Learn more:
- Scientific research (excluding health)
- The re-use of publicly available data for (non-health) scientific research, extract from the guide for public consultation
Please note: even when the further processing is compatible, a valid legal basis must always be identified.
The provider reuses a publicly accessible dataset
In the field of AI in particular, datasets containing personal data may be made freely available on the Internet outside the French or European legal framework for open data. Most often, they are data that were already publicly accessible and that constitute a dataset or corpus disseminated on the website of a university or a platform dedicated to sharing datasets, to facilitate their reuse.
Checking the lawfulness of putting a dataset online falls first and foremost to the controller who makes such a posting. However, in order to be able to rely on a legal basis under the GDPR, the controller who reuses the data must ensure that he or she is not reusing a dataset whose creation was manifestly unlawful (e.g. from a data leak).
In addition, the person who downloads or reuses a manifestly illegal dataset may be guilty of the offence of concealment (Article 321-1 of the French Criminal Code).
Although the possibility of reusing a dataset made freely available on the Internet is not subject to in-depth checks on compliance with all GDPR rules or other applicable legal rules (copyright, data covered by business secrecy, etc.), checks which are primarily the responsibility of the body that uploads the data, the CNIL recommends that re-users ensure that:
- the description of the dataset mentions their source.
Example: a dataset the description of which would explain that it was made from publications on a professional social network identified by name.
On the contrary, if a dataset containing video surveillance images does not specify the source, any re-user of such a dataset should then refrain from reusing it before obtaining further details enabling it to remove its doubts as to the conformity of its constitution and dissemination;
- the creation or publication of the dataset is not manifestly the result of a crime or an offence or has been the subject of a public conviction or sanction by a competent authority which involved the deletion or prohibition of subsequent use of the data;
Examples: a company wishes to build a dataset for the development of a recommendation AI system that it intends to use with its consumers. If it acquires for this purpose a dataset on the dark web from, for example, an infringement of an automated processing system punishable by law (within the meaning of Article 323-1 of the French Criminal Code), it cannot ignore its criminal origin. In this case, the illegality of the dataset would then be obvious.
The same applies to a company wishing to reuse a dataset for which a court decision has found an infringement of an intellectual property right such as that of dataset producers (within the meaning of Article L. 342-1 of the French Intellectual Property Code);
- there is no clear doubt that the dataset is lawful (in particular that the source processing is not manifestly lacking a legal basis when the data are so intrusive that they cannot be processed without the consent of the individuals), ensuring in particular that the conditions for collecting the data are sufficiently documented;
Examples:
- On a hosting platform for ML practitioners, a company identifies a dataset of home-to-work journeys of thousands of people. Its description explains that it consists of accurate, non-anonymous geolocation data, without detailing the source. In that case, the company cannot ignore that there is a serious doubt as to the lawfulness of the dissemination of such a dataset without the consent of the persons.
- On the contrary, it would be possible to reuse a dataset whose description leaves no clear doubt as to its lawfulness. For example, a pseudonymised dataset of data initially made public by data subjects on an identified website, and which does not contain sensitive data.
- The same applies to the re-use of an aggregated dataset that the publisher would present as anonymous. For example, an organisation that wishes to build a dataset to train an AI system to predict the socio-economic impact of population ageing could reuse anonymous aggregated datasets containing demographic information (number of active persons, age of persons, fertility rate or elderly dependency rate).
- the dataset does not contain sensitive data (e.g. health data or political opinions) or data relating to criminal convictions and offences (as defined in Articles 9 and 10 GDPR). If it does contain such data, it is recommended that additional checks be carried out to ensure that the processing is lawful: for sensitive data, mainly that the data subjects have given explicit consent or have manifestly made the data public, as specified below; for data relating to offences, that such use is permitted by the French data protection law (loi “Informatique et Libertés”).
Example: on an online forum, a researcher discovers a non-anonymous dataset that would contain, according to its description, the care pathways of a hundred patients with a particular pathology and who would come from French hospitals. In this case, the researcher should seriously doubt whether the dissemination of this dataset is lawful in view of the supervision of health data provided for by the GDPR and the French data protection law.
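The recommended checks can be kept as a simple pre-download checklist. The sketch below is illustrative only: the class and field names are hypothetical, each flag stands for a human judgement made after reading the dataset's description, and an empty result does not establish that re-use is lawful.

```python
from dataclasses import dataclass


@dataclass
class DatasetDescription:
    """What a re-user knows about a public dataset (hypothetical fields)."""
    source_stated: bool
    manifestly_from_offence: bool            # e.g. data leak, or sanctioned by an authority
    collection_documented: bool
    contains_sensitive_or_offence_data: bool
    extra_checks_done: bool = False          # e.g. explicit consent verified


def open_concerns(d: DatasetDescription) -> list[str]:
    """Return the recommended checks that the dataset does not yet pass."""
    concerns = []
    if not d.source_stated:
        concerns.append("source not stated")
    if d.manifestly_from_offence:
        concerns.append("creation or publication manifestly results from an offence")
    if not d.collection_documented:
        concerns.append("collection conditions insufficiently documented")
    if d.contains_sensitive_or_offence_data and not d.extra_checks_done:
        concerns.append("sensitive or offence-related data without additional checks")
    return concerns


# Loosely modelled on the geolocation example: precise, non-anonymous, no source given.
journeys = DatasetDescription(
    source_stated=False,
    manifestly_from_offence=False,
    collection_documented=False,
    contains_sensitive_or_offence_data=False,
)

for concern in open_concerns(journeys):
    print(concern)
```

Any remaining concern would be a reason to refrain from re-use until the publisher of the dataset provides further details.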
Certain failures committed by the controller to set up and disseminate a dataset do not systematically and irreparably affect the lawfulness of the processing carried out by the re-user. Thus, a re-user may use a dataset whose illegalities are minor, provided that the reuse meets the requirements of the GDPR.
Example: the provision of incomplete information when creating or publishing the dataset, or a lack of adequate documentation of the compliance of these processing operations (which must be verified with the distributor or publisher of the dataset).
Learn more:
This list of additional checks to be carried out before re-using a publicly accessible dataset is reproduced in the Practical Guide on Opening and Reusing Publicly Accessible Data (page 65). Readers wishing to contribute on this subject are called upon to do so in the context of the public consultation of this guide.
The provider reuses a dataset acquired from a third party (data brokers, etc.)
Some providers wish to create a training dataset based on resources owned by third parties.
For the third party who shares personal data, this means ensuring the lawfulness of this transmission
- Case 1: the data was precisely collected for sharing with the goal of creating a dataset for AI system training
The third party will have to ensure that the processing of data transmission complies with the GDPR (definition of an explicit and legitimate purpose, requirement of a legal basis, information to individuals and management of the exercise of their rights, etc.) for which he or she assumes responsibility.
- Case 2: the third party did not initially collect the data for this purpose
Where the third party initially collected the data for other purposes (e.g. in the context of the provision of a service to data subjects), it is also for the third party to ensure that the transmission of such data is for a purpose compatible with the purpose(s) which justified its collection. It will therefore have to carry out a ‘compatibility test’ (see above).
Note that the original owner of a dataset sometimes authorises its use under a license agreement that provides for its terms and conditions (in particular under intellectual property law). This license agreement can, for example, regulate this compatibility by limiting possible re-use.
For the re-user, this usually involves a series of checks of the initial controller’s processing.
Indeed, as in the case of re-use of publicly accessible datasets, the controller who reuses the data must ensure that he is not reusing a dataset whose constitution or sharing was manifestly unlawful (for example, in the absence of an indication as to its source, in the case of blatant doubt as to its lawfulness, in particular in the case of the processing of sensitive data, etc.). This follows from the general principle of lawfulness of processing in Article 5.1(a) GDPR, in addition to the risk of being guilty of the offence of concealment (Article 321-1 of the French Criminal Code).
The re-user of a dataset transmitted by a third party is all the less able to claim ignorance that the dataset was constituted or shared in breach of the GDPR or of more general rules (such as those prohibiting breaches of the security of information systems or infringements of intellectual property rights), since its relationship with that third party allows it to remove any doubts that it may have.
The conclusion of an agreement between the original data holder and the re-user is thus recommended in order to enable the latter to ensure the lawfulness of its own processing, even if it is not explicitly required by the GDPR.
In this regard, the CNIL recommends providing a number of indications in the contract such as:
- the source, the context of the data collection, the legal basis for the processing and the data protection impact assessment (see in particular how-to sheet 5 on the DPIA) if necessary, in order to avoid the risks of having an unlawful dataset;
- the information brought to the attention of the persons (in particular as regards the purpose and the recipients);
- any guarantees as to the lawfulness of this data sharing by the original data holder (e.g.: the compatibility of the purpose, the lawfulness of sharing, etc.).
The CNIL provides a documentation model that can usefully be used for this purpose.
Please note: if the re-user wishes to base his or her processing of personal data on a consent collected by a third party, he or she must be able to prove that valid consent has indeed been collected from the data subjects. The obligation to provide proof of consent cannot be fulfilled by the mere presence of a contractual clause requiring one of the parties to obtain valid consent on behalf of the other party. Such a clause does not allow the organisation to guarantee, in all circumstances, the existence of valid consent (see the deliberation of CNIL no. SAN-2023-009 of 15 June 2023). The contract may, on the other hand, be used to frame:
- the mechanisms put in place to demonstrate the collection of valid consent;
- the provision of evidence for the benefit of the body wishing to base its processing on such consent;
- where applicable, the conditions under which such evidence must be retained, in particular in order to maintain its probative value.
Example: the provider of an image generative AI system approaches a data broker to build a dataset for training purposes that includes photographs.
They enter into a contract that ensures the lawfulness of the data shared with the provider and governs the provision of indications that are crucial for the compliance of its processing (e.g. evidence of the context of the data collection in order to assess its legitimate interest, guarantees in relation to other regulations such as those governing the assignment of intellectual property rights, etc.).
In addition to these prior checks, and regardless of the method of collection used, re-users must fully analyse the conformity of their own processing operations, including when they reuse datasets whose constitution and sharing are outside the scope of French or European law (unlike their re-use by an entity established in French or European territory, which is subject to the GDPR). In particular, the re-user must ensure compliance with the requirements regarding the persons whose data are present in the dataset thus obtained: the re-user must inform them of the processing that it wishes to implement, and allow them to exercise their rights.
Please note: a how-to sheet on information and management of people’s rights will be published at a later date.