Relying on the legal basis of legitimate interests to develop an AI system
05 January 2026
Controllers will most commonly rely on their legitimate interests for the development of AI systems. However, this legal basis may only be used if its conditions are fulfilled and sufficient safeguards are in place.
Legitimate interest is one of the six legal bases provided for in Article 6 of the GDPR.
It is often well suited to the development of AI systems by private bodies, especially when the dataset used is not based on the consent of individuals (consent is often difficult to obtain at large scale or when personal data are collected indirectly).
As regards public bodies, a public authority wishing to develop an AI system may rely on its legitimate interests only when the activities concerned are not strictly necessary for the performance of its specific tasks but relate to other activities lawfully carried out (such as, for example, human resources management).
For more information on the use of legitimate interest by a public body, see in particular the use case illustrated in “How to choose the legal basis for processing? Practical cases with certain processing operations implemented by the CNIL” (French version)
Reliance on legitimate interests is, however, subject to three conditions:
- The interest pursued by the body must be “legitimate”;
- The processing must fulfill the condition of “necessity”;
- The processing must not disproportionately affect the rights and interests of the data subjects, taking into account their reasonable expectations. It is therefore necessary to “balance” the rights and interests at stake in the light of the specific conditions for its implementation.
The controller is bound to examine the compliance of its processing with these three conditions. To this end, it is recommended, as a good practice, to document this assessment. In any event, where a DPIA is necessary, the safeguards provided to limit the possible impacts on the rights of individuals must be described by the controller (see the how-to sheet “Carrying out a data protection impact assessment when necessary”).
Other legal bases may also be considered for the development of AI systems (see the how-to sheet “Ensuring the lawfulness of the data processing - Defining a legal basis”).
First condition: the interests pursued must be “legitimate”
The interest pursued, although closely linked to the purpose of the processing, should not be confused with it. The purpose is the specific reason for which the data are processed, whereas the interest refers to the broader benefit that the data controller or a third party may seek to obtain.
For more information on defining the purpose during the development phase, see how-to sheet 2.
Legitimate interests can be broadly defined. There is no exhaustive list of interests considered legitimate, but the interests pursued may be presumed as legitimate if they are cumulatively:
- manifestly lawful under the law,
- determined in a sufficiently clear and precise manner,
- and real and present (i.e., not hypothetical or speculative) for the organization concerned.
Thus, in the case of AI system development, the following interests could be considered a priori legitimate:
- carrying out scientific research (especially for organizations which cannot rely on a task carried out in the public interest);
- facilitating public access to certain information;
- developing new systems and features for users;
- offering a chatbot service to assist users;
- improving a product or service to increase its performance;
- developing an AI system to detect fraudulent content or behavior.
A commercial interest constitutes a legitimate interest, provided that it is not contrary to the law and that the processing is necessary and proportionate (CJEU, 4 October 2024, Tennisbond, C-621/22). By contrast, certain interests cannot be considered legitimate, particularly when the envisaged AI system has no connection to the organization’s mission and activities, or if it cannot be legally deployed.
The interest pursued must be sufficiently precise and brought to the attention of individuals as part of the data controller's transparency obligations. Thus, with regard to the development and improvement of a general-purpose AI system, even when the specific use of the model is not known, it is recommended to refer to the objective pursued by the development of the model (indicating in particular whether it is commercial, public, scientific research, and whether it is internal or external to the organization).
In some cases, the consent of the individual may be required under other regulations. This may be the case, for example, where the controller is also a gatekeeper within the meaning of the Digital Markets Act (DMA) and the processing for the creation of the training database involves implementing one of the practices listed in Article 5.2 of the DMA (for example, cross-use of personal data from the core platform service in the context of other services provided by the gatekeeper).
Second condition: the processing must be "necessary"
The condition of necessity means that the controller must ensure that the intended processing is capable of achieving the interest pursued and that there is no other less intrusive way of achieving that objective than to carry out the intended processing.
As such, if the development of the AI system requires the use of personal data, the controller must ensure, with regard to the information at its disposal, that the development of that system is indeed necessary to achieve its objective, whether it is a research objective, a commercial objective, an objective of fraud prevention, etc. If the development of the system does not require the processing of personal data, the GDPR is not applicable to that development and the question does not arise.
This condition relating to the necessity of the processing is also to be examined in connection with the principle of data minimization (see how-to sheet 6 « Taking into account data protection when designing the system »). This means, in particular, that the data controller must assess whether it is necessary to process personal data at all, or to store them in a form that allows direct or indirect identification of individuals, as well as whether it is necessary to use, where appropriate, a technical solution that involves processing a large volume of personal data. In this regard, technological developments should be taken into account, as they may enable the development of models that require less personal data to be processed. Data controllers are encouraged to participate in the development of such technologies.
Third condition: ensure that the objective pursued does not threaten the rights and freedoms of individuals
It is necessary to ensure that the legitimate interests pursued do not disproportionately affect the interests, rights, and freedoms of the individuals concerned.
The controller must therefore balance their legitimate interests against the data subject’s interests, rights, and freedoms. To do this, the controller must measure the benefits of its processing (anticipated benefits, including those presented below) but also the impacts on individuals. If necessary, additional measures must be put in place to limit these risks and protect the rights and freedoms of individuals.
This analysis must be carried out on a case-by-case basis, taking into account the specific circumstances of the processing.
The benefits provided by the AI system help justify the processing of personal data
The greater the anticipated benefits of the processing, the more likely the legitimate interest of the controller is to prevail over the rights and freedoms of individuals.
The following factors make it possible to measure the positive impact of the interests pursued:
- The extent and nature of the expected benefits of the processing, for the controller but also for third parties, such as the end-users of the AI system or the interest of the public or society. The diversity of applications implementing AI systems shows that there can be many benefits, such as improved healthcare, better accessibility of certain essential services, facilitation of the exercise of fundamental rights such as access to information, freedom of expression, access to education, etc.
Example: a voice recognition system that allows users to automatically transcribe their words and, for example, helps with filling in administrative forms, can have significant benefits in making certain services accessible to people with disabilities. The importance of those benefits may be taken into account in the balancing of interests when developing such a system.
In general, the fact that a data controller acts not only in their own interest but also in the interest of the wider community may give more « weight » to that interest.
Example: a private company wants to develop an AI system to fight against online real estate fraud. The commercial interest it pursues is reinforced by the convergence with the interest of users and the interest of the community to reduce fraudulent activities.
- The usefulness of the processing carried out in order to comply with other legislation.
Example: the provider of a very large online platform or search engine that develops an AI system to better meet the provisions of Article 35.1 of the Digital Services Act (DSA) on the adaptation of online content moderation processes, may take this objective into account when assessing its interest.
- The development of open-source models, which, provided sufficient safeguards are in place (see the dedicated article in French), may have significant benefits for the scientific community, research, education and the adoption of these tools by the public. It may also present benefits in terms of transparency of the AI system provider, reduction of bias, or peer review. This may reflect the controller’s objective of sharing the benefits of its processing to participate in the development of scientific research.
- The specification of the interests pursued: the more precisely an interest is defined, the more weight it will have in the balancing exercise, since this makes it possible to concretely assess the reality of the benefits to be taken into account. Conversely, an interest defined too broadly (e.g. “providing new services to its users”) is less likely to prevail over the interests of individuals.
Negative impacts on people need to be identified
Those benefits must be balanced against the impact of the processing on data subjects. Specifically, the controller must identify and assess the consequences of all kinds, potential or actual, that the development of the AI system and its subsequent use could have on the persons concerned: their privacy, data protection and other fundamental rights (freedom of expression, freedom of information, freedom of conscience, etc.) as well as other concrete impacts of the processing on their situation.
The actual impacts of the processing on people, as listed below, are to be assessed according to the likelihood that the risks materialise and the severity of the consequences, which depend on the particular conditions of the processing, as well as the AI system developed.
To this end, it is necessary to take into account the nature of the data (sensitive data, highly personal data), the status of the data subjects (vulnerable persons, minors, etc.), the status of the company or administration developing and/or deploying the AI system (the effects may be multiplied in the event of very wide use of AI), the way in which the data are processed (data combination, etc.) or the nature of the AI system and the intended operational use. In some cases, the impact on individuals will therefore be limited, either because the risks are low or because the consequences present little severity with regard to the data used, the processing carried out and the interest pursued (for example, the development of an AI system used for the personalisation of an auto-completion feature of a text editor software presents little risk for data subjects).
The following impacts on people should therefore be considered and the level of associated risks should be assessed in the case at hand. Three types of risks can be distinguished:
Impacts related to the development of the AI model
- The risks related to the collection of publicly available online data, especially through scraping tools, which may infringe on individuals’ privacy and the rights guaranteed under the GDPR. Such practices may also impact other rights, including intellectual property rights, certain types of trade or professional secrecy, or freedom of expression — due to the surveillance-like effect that widespread and systematic data scraping can produce (for more information, see the focus sheet on web scraping).
- Risks of loss of confidentiality of the data contained in the dataset or in the model: the risks related to the security of training datasets are likely to increase the risks for data subjects linked to misuse, in particular in the event of a data breach, or the risks related to attacks specific to AI systems (attack by poisoning, backdoor insertion or model inversion).
For more information: see the article « Small taxonomy of attacks on AI systems » (in French).
- Risks related to the difficulty of ensuring the effectiveness of the data subject rights, in particular due to technical obstacles to the identification of data subjects or difficulties in transmitting requests for the exercise of rights when the dataset or model is shared or available in open source. It is also complex, if not technically impossible, to guarantee data subject rights on certain objects such as trained models.
- Risks associated with the difficulty of ensuring transparency towards data subjects: these risks may result from the technical complexity inherent in these topics, rapid technological developments, and the structural opacity of the development of certain AI systems (e.g. deep learning). This complicates the provision of intelligible and accessible information on the processing.
Impacts on persons related to the use of the AI system
Certain risks, which may arise when using the AI system, must be taken into account at the training stage because of their systemic nature. It is indeed necessary to anticipate, from the design phase onwards, the safeguards necessary to effectively limit these risks. These risks depend on the uses of the AI system. As a general matter, the following risks can be mentioned in particular:
- There are risks of memorization, extraction, or regurgitation of personal data (particularly in the case of generative AI systems) during the use of certain AI models, potentially infringing on privacy. In some instances, personal data contained in training datasets can be inferred, whether accidentally or through attacks (e.g. membership inference, data extraction, or model inversion), from the use of AI systems (see notably the LINC article « Small Taxonomy of Attacks on AI Systems »). This creates privacy risks for individuals whose data may resurface through the system’s use (e.g., reputational damage, security risks depending on the nature of the retained data, etc.).
- Risks of reputational damage, spread of false information or identity theft, where the AI system (particularly generative AI) produces content on an identified or identifiable natural person (e.g. a generative image AI system may be used to generate false pornographic pictures of real persons whose images are contained in the dataset). Note that this risk can also occur with AI systems that have not been trained with personal data.
Example: a news article generated by an AI system may present defamatory information about a real person, although the dataset does not contain information about that person, in particular where the text was generated at the request of a user who specifies the identity of the data subject in the prompt.
- Risks of infringement of certain rights or secrets provided for by law (e.g. intellectual property rights, such as copyright, business or medical secrecy) in the event of memorisation or regurgitation of protected data.
Example: a text-generative AI system trained on copyright-protected literary works may generate content that constitutes infringement, in particular where that content results from the regurgitation of the content that would have been memorised by the AI system.
- Serious ethical risks, which may impact certain general legal principles or the proper functioning of society as a whole, related to the development of certain AI systems. These risks must be taken into account in the assessment (e.g. discrimination, the safety of people in case of malicious use, incitement to hatred or violence, disinformation, which may undermine the rights and freedoms of individuals or democracy and the rule of law). The development of AI systems can thus harm certain fundamental rights and freedoms during the deployment phase if safeguards are not anticipated by design (e.g. amplification of discriminatory biases in the training database, lack of transparency or explainability, lack of robustness or automation biases, etc.).
Taking into account the AI Act:
When the data controller is a provider of high-risk AI systems within the meaning of Article 6 of the AI Act, they may usefully take into account the identified risks as part of the risk management system they are required to implement under Article 9 of the AI Act. Likewise, when acting as a provider of a general-purpose AI model presenting systemic risks within the meaning of Article 51 of the AI Act, they may consider the identified risks in fulfilling their obligations under Article 55 of the AI Act.
Reasonable expectations of individuals are a key factor in assessing the legitimacy of the processing.
The controller must take into account the reasonable expectations of the data subjects when assessing the impact of the processing on individuals. Relying on legitimate interests requires that individuals are not surprised by either the modalities or the consequences of the processing.
Reasonable expectations are a contextual aspect that the controller must consider when balancing the rights and interests at stake. To this end, information to individuals may be taken into account to assess whether data subjects can reasonably expect their data to be processed. However, it will only be an indicator.
During the development of an AI system, some data processing operations may exceed the reasonable expectations of data subjects. The controller must conduct this assessment, taking into account in particular the following indicators:
- For data collected directly from data subjects:
- The relationship between the controller and the data subject.
- The privacy settings of the data shared by the data subject.
Examples: A platform that provides an online coaching service wants to use the exchanges between individuals and their interlocutors to fine-tune a generative AI model in order to develop a chatbot capable of answering users' questions. In this case, people who communicate with their online interlocutors expect a certain level of confidentiality, particularly given the sensitivity of the information that may be shared, and cannot reasonably expect the data to be used for training purposes. It will therefore be necessary to obtain their consent.
- The use of private exchanges between two people using an online virtual meeting service for the development or improvement of an AI model for summarizing meetings does not fall within the reasonable expectations of individuals.
- The context and nature of the service where the data was collected (e.g., whether the service was provided through an AI system);
- Whether the processing of user data only affects the service provided to the user in question or whether it is used to improve the service as marketed (e.g., if a company collects data from its customers in order to develop a tool for its sole use but which is not otherwise marketed).
- When reusing online accessible data: in light of recent technological developments (big data, new AI technologies, etc.), data subjects may be aware that the information they share online might be accessed, collected, and reused by third parties. However, they cannot expect such processing to take place in all situations and for all types of publicly accessible data about them. Several factors must be taken into account, including:
- the publicly accessible nature of the data;
- the context and nature of the source websites (e.g., social networks, online forums, dataset repositories);
- the restrictions set by these websites, such as in their terms of use or through technical safeguards like exclusion files (e.g., robots.txt) or blocking mechanisms like CAPTCHAs. In this regard, the CNIL considers that processing cannot fall within the reasonable expectations of data subjects if the controller does not exclude from collection websites that explicitly object to scraping via robots.txt or CAPTCHAs (a minimal robots.txt check is sketched at the end of this list of indicators).
Example: Where an individual uploads their data to a content-sharing platform that explicitly prohibits scraping via robots.txt files and clearly indicates that user data will not be used for the development of AI models, it is not reasonable to expect that third parties would collect such data for that purpose.
- the type of publication (for example, an article published on a freely accessible blog is not private, whereas a post on a social network published with access restrictions may remain private, since the internet user is less aware of the risk of it being collected and reused by third parties).
- the nature of the relationship between the data subject and the controller.
- It may be difficult to fully anticipate the wide range of potential uses of a dataset or model, particularly when it is shared or disseminated.
Yet, some of these uses may fall outside the reasonable expectations of data subjects, especially in the case of unlawful reuse, since a person cannot reasonably expect their data to be used to develop AI systems that are later reused for other purposes.
Example: data subjects could not expect their data to be used to develop an open-source image classification model, which would then be used to classify people based on their sexual orientation.
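To illustrate the exclusion of websites that object to scraping, mentioned among the indicators above, here is a minimal sketch based on Python's standard urllib.robotparser module. The user-agent string and the candidate URLs are hypothetical, and a real collection tool would also need to honour CAPTCHAs, terms of use and other objection signals.

```python
from urllib import robotparser
from urllib.parse import urlsplit

# Hypothetical identifier of the collection tool, so that site operators can target it in robots.txt.
USER_AGENT = "example-ai-training-crawler"

def may_collect(url: str) -> bool:
    """Return True only if the site's robots.txt does not object to this crawler fetching the URL."""
    parts = urlsplit(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    parser = robotparser.RobotFileParser()
    parser.set_url(robots_url)
    try:
        parser.read()
    except OSError:
        # If robots.txt cannot be retrieved, err on the side of caution and skip the page.
        return False
    return parser.can_fetch(USER_AGENT, url)

# Usage: filter a candidate list of pages before any scraping takes place.
candidate_urls = ["https://example.org/public-post", "https://example.org/private/profile"]
allowed_urls = [u for u in candidate_urls if may_collect(u)]
```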
Additional measures to limit the impact of processing
The organization may put in place compensatory or additional measures in order to limit the impact of processing on data subjects. These measures will often be necessary to reach a balance between the rights and interests at stake and will allow the controller to rely on this legal basis.
These measures are in addition to those necessary to comply with other obligations laid down by the GDPR, and should not be confused with them: compliance with these provisions is imperative, regardless of the legal basis for the processing (data minimization, data protection by design and by default, data security, etc., see the dedicated practical how-to sheets). Compensatory measures consist of additional safeguards to the other requirements of the GDPR.
The measures may be technical, organizational, or legal and must be able to limit the risk of harm to the interests, rights, and freedoms previously identified.
The following measures have been identified as relevant to limit the impact on data subjects’ rights and freedoms. They must be adapted to the risks posed by the different processing operations of the development phase.
For more details on the measures to be taken in the event of web scraping, see the dedicated focus sheet.
Measures to limit the collection or storage of personal data
- Provide for the anonymization of collected data within a short period of time or, failing that, the pseudonymization of collected data. In some cases, anonymization of data will be necessary when anonymous data is sufficient to achieve the objectives defined by the controller.
Example: if a company wants to build a dataset from comments accessible online to develop an AI system to assess the satisfaction of customers who have purchased its products, pseudonymisation of the collected data shortly after collection may be an additional measure to limit the risks associated with data collection that may reveal a lot of information about the person making the comments (a minimal pseudonymisation sketch is provided at the end of this series of measures).
- Where it does not adversely affect the performance of the model developed, synthetic data should be used. This may also have a number of advantages, such as making certain data available or accessible and modelling certain specific situations, avoiding the use of real data, particularly sensitive data, increasing the volume of data for training or minimising the risks associated with data confidentiality, etc. It should be borne in mind that synthetic data are not systematically anonymous.
Example: if a provider wishes to develop an image classification system that automatically detects the carrying or use of a weapon, the use of computer-generated images makes it possible, for example, to avoid collecting data likely to suggest the commission of an offence, to help vary the possible configurations, or to improve the representativeness of the dataset, in particular because of the possibility of specifying the characteristics of the synthetic image of the person (size, weight, skin colour, etc.) and of the weapon to be detected (shape, colour, etc.).
- Put in place measures to limit the risks of memorization, extraction, and regurgitation in generative AI, or attacks on AI models or systems. Without prejudice to technological developments that may lead to other measures, the CNIL recommends implementing the following measures (deduplication and output filtering are illustrated in the sketch following this list):
- Measures to limit the risks of memorization:
- The deletion of rare or outlier data;
- The deduplication of training data;
- The reduction of the ratio of the number of model parameters to the volume of training data;
- The explicit regularisation of the cost function;
- Using training algorithms that provide formal guarantees of confidentiality (for example, in terms of differential privacy);
- Any measure aimed at limiting overfitting;
- Measures limiting the risks of extraction or regurgitation, in the context of generative AI, or of attacks:
- Measures that limit the likelihood:
- Restrictions of access to the model;
- Modifications to the model’s outputs (such as filters, for example, or limitations on output accuracy);
- Security measures aimed at preventing or detecting attack attempts (which may, however, already be required under other GDPR obligations);
- Measures that limit the severity:
- Provide for legal or technical recourse in the event of extraction, regurgitation in the context of generative AI, or a successful attack, such as opening a help desk with the provider where individuals can report regurgitation.
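As an illustration of two of the measures listed above, the sketch below shows exact deduplication of training examples and a simple filter applied to generated outputs. It is a minimal, assumed example: the regular expressions, the masking strings and the use of SHA-256 digests are illustrative choices, not an exhaustive anti-regurgitation mechanism.

```python
import hashlib
import re

def deduplicate(examples: list[str]) -> list[str]:
    """Exact deduplication of training examples; repeated sequences are a known driver of memorization."""
    seen: set[str] = set()
    unique: list[str] = []
    for text in examples:
        digest = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

# Deliberately simplified patterns; a real deployment would rely on a dedicated PII-detection component.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b(?:\+?\d[\s.-]?){9,14}\d\b")

def filter_output(generated_text: str) -> str:
    """Mask e-mail addresses and phone numbers in model outputs to reduce the impact of regurgitation."""
    masked = EMAIL_RE.sub("[email removed]", generated_text)
    masked = PHONE_RE.sub("[phone removed]", masked)
    return masked
```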
In certain cases, the implementation of these measures may lead the controller to conclude that personal data can neither be extracted from the model nor regurgitated by it, in the context of generative AI, and therefore that the model or system developed is anonymous (see the AI how-to sheet on the status of an AI model with regard to the GDPR). The anonymity of the model or system will constitute a particularly strong guarantee to limit the harm to individuals whose data is processed for training the AI model.
If the data controller is unable to conclude that the model or system developed is anonymous, these measures will still constitute additional guarantees.
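To illustrate the pseudonymisation of collected data mentioned at the beginning of this series of measures (the customer-comments example), here is a minimal sketch in which direct identifiers are replaced by a keyed hash shortly after collection. The record structure and the key handling shown here are assumptions; in practice the key must be stored separately from the dataset, and free-text fields may still require further processing.

```python
import hmac
import hashlib

# The secret key must be stored separately from the pseudonymised dataset; without it,
# the operation amounts to plain hashing, which is easier to reverse.
SECRET_KEY = b"replace-with-a-key-kept-outside-the-dataset"

def pseudonym(identifier: str) -> str:
    """Deterministic pseudonym, so that the same author keeps the same pseudonym across comments."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def pseudonymise_comment(comment: dict) -> dict:
    """Replace the direct identifiers collected with a customer comment shortly after collection."""
    return {
        "author_id": pseudonym(comment["author_name"]),
        "text": comment["text"],        # free text may still contain identifying details
        "product": comment["product"],
    }

# Usage with a hypothetical record:
record = {"author_name": "Jane Doe", "text": "Great product!", "product": "SKU-42"}
print(pseudonymise_comment(record))
```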
Measures enabling individuals to retain control over their data
- Implement technical, legal and organisational measures in addition to the obligations laid down in the GDPR in order to facilitate the exercise of rights:
- Provide for a discretionary and prior right of objection in order to strengthen data subjects’ control over their data.
- Exercising this right could be made easier by informing the data subject on the controller's website, and providing for a simple checkbox that's easy to find. For example, users of an online service whose data is used for the development or improvement of an AI system should be able to quickly access the page allowing them to object to the collection of their data for this purpose. The controller must ensure that the individual can object to this processing without their use of the service being affected;
- In the case of online data collection, the CNIL encourages the development of technical solutions that facilitate compliance with the right to object prior to data collection. In addition to the opt-out mechanisms put in place for intellectual property (see the focus sheet on web scraping), « rejection list » mechanisms could, for example, be implemented when appropriate for the processing. This would enable the data controller to respect individuals' objections by refraining from collecting their data (a minimal sketch of such a mechanism is given at the end of this series of measures);
- Include a discretionary data deletion right for information stored in the database;
- Implement measures to enable or facilitate the identification of data subjects: technical and organizational measures should be considered to retain certain metadata or other information about the source of data collection, in order to facilitate the retrieval of a given person's data or of a specific piece of data within the database. This will be particularly relevant when the information is publicly accessible and its retention does not pose additional risks to the individuals concerned.
Example: In the case of an image dataset created through web scraping of freely accessible online data from a limited number of websites, retaining the display name and URL source of each collected image would help facilitate the identification of individuals. Indeed, individuals could directly provide the relevant URLs by locating the data concerning them via a general search engine, or through a specific website or web archive library.
- Implement measures to ensure and facilitate the exercise of data subjects’ rights when the model is subject to the GDPR (see how-to sheet on the status of AI models and AI systems) such as observing a reasonable delay between the dissemination or collection of a training dataset and its use (particularly when exercising rights on the model is difficult), and/or planning for periodic model retraining to allow for the effective consideration of data subjects’ rights when the data controller still holds the training data.
For more details on the steps to take to exercise your rights, see how-to sheet 10 "Respecting and facilitating the exercise of individuals’ rights".
- When the model is open source, identify and implement measures to ensure that rights are exercised throughout the chain of actors, in particular by including in the terms and conditions the obligation to extend the effects of the exercise of the rights of objection, rectification, or erasure to systems developed subsequently.
- Facilitate the notification of rights: for example, where possible, the CNIL recommends the use of application programming interfaces (APIs) (particularly in the most high-risk cases), or at the very least techniques for managing download logs.
- Communicate more widely about dataset or model updates, for example in the dataset documentation or on the providers' website, to let data subjects know to what extent their requests have been met. This also involves encouraging recipients of previous versions to delete them or replace them with the latest version.
- Ensure increased transparency regarding the processing activities carried out for the development or improvement of the AI system, in addition to the obligations set out in Articles 12 to 14 of the GDPR, by implementing the following measures:
- Provide information on the risks related to data extraction or regurgitation, in the context of generative AI, when the AI model or system is subject to the GDPR (see the how-to sheet on the status of models):
- the nature of the risk associated with data extraction from the model or system, such as the risk of data regurgitation in the case of generative AI;
- the measures taken to mitigate these risks, and the available recourse mechanisms should these risks arise, such as the possibility to report an instance of regurgitation or extraction to the organization.
- Provide for the publication of the Data Protection Impact Assessment (DPIA), where applicable (this publication may be partial when certain sections are subject to protected secrets, such as trade secrets);
- Provide for the publication of any documentation related to the dataset (for example, based on the model proposed by the CNIL), the development process, or the AI system and its functioning;
- Provide for the publication of information that supports greater public understanding of how these technologies work: the CNIL considers that public acceptance of AI technologies cannot be achieved without such efforts. It therefore encourages stakeholders, both developers and users, to make an effort to be transparent and to popularise their practices, as well as to explain how AI works and its associated risks. This may be achieved by implementing recommended transparency practices in the field, such as:
- embracing open-source development approaches (e.g., publishing model weights, source code).
- promoting transparency around non-data-protection related aspects, such as:
- fundamental concepts of machine learning, such as training, inference, memorization, and different types of attacks on AI systems;
- measures taken to mitigate malicious or harmful uses of the system;
- Carry out media campaigns to ensure the widest possible dissemination of information to individuals, particularly when the development involves large-scale data collection, such as with large language models (LLMs), and implement multiple alternative methods of informing data subjects.
- Implement measures and procedures to ensure the transparent development of the AI system, in particular to enable its auditability during the deployment phase. This includes a comprehensive documentation of the development process, activity logging, management and monitoring of the different versions of the model, recording of parameters used, and performance and documentation of evaluations and tests. Such measures may also be essential to prevent automation or confirmation bias during the deployment.
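As an illustration of the « rejection list » mechanism mentioned above among the measures facilitating the exercise of rights, here is a minimal sketch of a registry consulted before any collection for AI training. The storage of hashed e-mail addresses and the record structure are assumptions; an actual mechanism would need to handle several types of identifiers and be kept up to date across collections.

```python
import hashlib

def _fingerprint(value: str) -> str:
    # Store only a hash so that the rejection list does not itself become a new directory of personal data.
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()

class RejectionList:
    """Registry of individuals who objected, consulted before any collection for AI training."""

    def __init__(self) -> None:
        self._fingerprints: set[str] = set()

    def register_objection(self, email: str) -> None:
        self._fingerprints.add(_fingerprint(email))

    def allows(self, email: str) -> bool:
        return _fingerprint(email) not in self._fingerprints

# Usage: records of objecting users are dropped before they ever enter the training dataset.
rejections = RejectionList()
rejections.register_objection("user@example.org")
records = [
    {"email": "user@example.org", "post": "..."},
    {"email": "other@example.org", "post": "..."},
]
training_records = [r for r in records if rejections.allows(r["email"])]
```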
Measures to mitigate risks during the operational phase
- In the case of general-purpose AI systems, limit the risk of illicit reuse of the AI system by implementing technical measures (e.g. digital watermarking of outputs to prevent deceptive uses, or restriction of functionalities by design to exclude those that could lead to illicit uses) and/or legal measures (e.g. contractual prohibition of certain illicit or unethical uses of the database or AI system, which data subjects could not reasonably expect).
- Implement licenses restricting uses aimed at re-identifying individuals.
- Implement measures to address certain serious ethical risks.
For example, ensure the quality of the training dataset to limit the risk of discriminatory biases during the operational phase, notably by ensuring data representativeness and by verifying and correcting the presence of biases in the dataset or resulting from the annotations performed (see how-to sheet 11 "Annotating Data"); a minimal representativeness check is sketched below.
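By way of illustration, the sketch below compares the share of each group in an annotated dataset against a reference distribution and flags under-represented groups. The attribute name, the reference shares and the 20% tolerance threshold are arbitrary assumptions; a real bias assessment would involve domain-specific metrics and fairness testing.

```python
from collections import Counter

def group_shares(records: list[dict], attribute: str) -> dict[str, float]:
    """Share of each group (e.g. age bracket, region) among the training examples."""
    counts = Counter(r[attribute] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

def flag_under_represented(records: list[dict], attribute: str,
                           reference: dict[str, float], tolerance: float = 0.2) -> list[str]:
    """Return the groups whose share in the dataset falls more than `tolerance` below the reference share."""
    shares = group_shares(records, attribute)
    return [
        group
        for group, expected in reference.items()
        if shares.get(group, 0.0) < expected * (1 - tolerance)
    ]

# Usage with hypothetical attribute values and a hypothetical reference distribution:
dataset = [{"age_bracket": "18-25"}, {"age_bracket": "26-40"}, {"age_bracket": "26-40"}]
reference = {"18-25": 0.3, "26-40": 0.4, "41-65": 0.3}
print(flag_under_represented(dataset, "age_bracket", reference))  # ['41-65']
```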
Other measures
- In view of the severity and likelihood of the identified risks, appoint an ethics committee or, depending on the size and resources of the organization, an ethics officer. This ensures that ethical considerations and the protection of individuals’ rights and freedoms are taken into account from the outset and throughout the development of such systems (for more information see how-to sheet 7 "Taking data protection into account in data collection and management").
Examples
Example where legitimate interest cannot be relied upon:
A controller wishes to develop a generative AI system for images. It builds the training dataset by indiscriminately collecting images online from numerous websites, without ensuring the exclusion of certain categories of sites, such as those containing sensitive data, like pornographic websites, and without implementing safeguards to limit the risks of data memorization or extraction by the model. The purpose, as defined in the controller’s privacy policy, is vaguely stated as « the provision of new services ». In such cases, it is unlikely that the balancing test could be considered satisfied.
Examples where legitimate interest may be relied upon:
A social network offering an online forum, whose core purpose is to make exchanges between users freely accessible, wishes to develop a conversational agent to facilitate searches across user posts, notably by synthesizing search results to directly answer users' queries. To train the model, the platform collects only data that users have made freely and manifestly public, explicitly excluding private user data (such as private conversations or account information). Strong safeguards are implemented, including a discretionary and advance opt-out mechanism and enhanced transparency (e.g., a notice on the homepage linking directly to the opt-out option). In such a case, the balancing test will generally be considered as met.
Note: If this processing purpose was not initially communicated to users, a compatibility assessment under Article 6(4) of the GDPR must be conducted (see how-to sheet n° 4, part 2/2 for more details).
An organization aims to develop a generative AI system for text. It exclusively uses data from freely and publicly accessible online sources, where the data subjects have manifestly made the content public. It also excludes any content protected by copyright (i.e., using only content in the public domain or for which rights holders have not objected to text and data mining as permitted under Directive 2019/790 on copyright and related rights in the Digital Single Market). In addition, it implements a range of safeguards to limit data memorization and regurgitation, restricts problematic content generation through technical or contractual measures, facilitates the exercise of data subjects’ rights when reidentification is possible, and clearly indicates data sources in a publicly available privacy policy. In such a case, the balancing test may generally be considered as met.
A retail company using self-checkouts equipped with an algorithmic video surveillance system that automatically detects customer errors at checkout wishes to reuse the collected data to improve the AI system. To this end, it retains only data in a form that limits reidentification, ensures that customers are informed of the system’s operation, and allows them to object freely and easily. In this case, the balancing test may generally be considered as met.