Taking data protection into account in the system design choices
To ensure the development of a privacy-friendly AI system, it is necessary to give careful thought to the design of the system. This sheet details the steps involved.
When considering the design choices for an AI system, data protection principles, and in particular the principle of minimisation, must be respected. This approach involves five steps. A data controller must therefore consider:
- the goal of the system to be developed;
- the system’s technical architecture, which will influence the characteristics of the dataset;
- the data sources to be used (see the how-to sheet on legal compliance; open sources, third parties, etc.);
- the selection, from these sources, of the strictly necessary data, having regard to their usefulness and to the potential impact of their collection on the rights and freedoms of the persons concerned;
- the validity of the choices previously made. Such validation may take different (non-exclusive) forms, such as a pilot study or the opinion of an ethical committee.
Specification of the objective pursued
The aim of this stage is to design a system based on the identified purpose (see how-to sheet 2), in compliance with a set of specifications, while limiting the potential consequences for the people concerned.
In specifying the use of the AI system in the deployment phase (whether deployed directly by the provider or by a third party), the system provider must determine:
- the type of result/output expected;
- acceptable performance indicators for the solution, whether quantitative (e.g. F1-score, mean squared error, computational load and time; see the sketch after this list) or qualitative (e.g. from human feedback);
- the context in which the system will be used, in order to identify the information essential to its operational use;
- excluded contexts of use and information not relevant to the envisaged main use case(s) of the system.
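To make the quantitative indicators concrete, they can be fixed in the specifications and checked automatically against held-out data. The sketch below is a minimal illustration, assuming scikit-learn and purely illustrative thresholds; real acceptance criteria would come from the specification work described above.

```python
# Minimal sketch: checking a candidate model against pre-agreed performance
# indicators. The thresholds are illustrative assumptions, not recommendations.
import time
from sklearn.metrics import f1_score

F1_MIN = 0.80          # assumed acceptable F1-score
LATENCY_MAX_S = 0.05   # assumed per-sample inference budget (seconds)

def meets_specification(model, X_val, y_val):
    """Return True if the fitted model satisfies the agreed indicators."""
    start = time.perf_counter()
    y_pred = model.predict(X_val)
    latency = (time.perf_counter() - start) / len(X_val)
    return f1_score(y_val, y_pred) >= F1_MIN and latency <= LATENCY_MAX_S
```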
Some AI techniques can perform complex tasks that go beyond the provider's initial objectives. Precisely defining the expected functionalities avoids the risk of over-collection.
Example: to train an AI system to count standing passengers in a tram or metro from video surveillance camera images, the following systems are technically feasible:
- a neural network that detects the presence of people in a carriage, without posture analysis, combined with an algorithm that counts standing people (the number of standing people can be inferred from the detected total and the number of seating positions);
- a neural network that analyses the posture of people in a carriage, combined with an algorithm counting standing passengers.
The first network provides less information while still yielding the standing count. If the estimate it provides is sufficient for the intended use case, in particular for calculating occupancy statistics, it should be preferred: it requires a smaller amount of data to be trained while still meeting the intended objective, whereas the second requires the collection and annotation of specific and more extensive data. The principle of minimisation then converges with a reduction in system design costs, without prejudice to the accuracy of the system.
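As an illustration of the first, minimising approach, the standing count can be derived with simple arithmetic from a plain person count and the carriage's known number of seats. The sketch below is hypothetical: `detect_people` stands in for any off-the-shelf person detector, and the seat count is an illustrative constant.

```python
# Minimal sketch of the minimising approach: no posture analysis, only a
# person count combined with the known number of seats in the carriage.
SEATS_PER_CARRIAGE = 40  # illustrative constant, known from the rolling stock

def estimate_standing(frame, detect_people):
    """Estimate standing passengers from a plain detection count."""
    people = detect_people(frame)  # hypothetical detector: one box per person
    # Rough assumption: seats are occupied before anyone stands. Often
    # sufficient for aggregate occupancy statistics.
    return max(0, len(people) - SEATS_PER_CARRIAGE)
```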
Definition of the technical architecture of the system
Quite often, the same task can be performed using different AI model architectures. However, they are not all equivalent: they may not achieve the same level of performance, may present different challenges in terms of explainability, may be subject to different operational constraints (such as computational cost), or may not require the same amount of data for their development.
Examples:
- Semantic analysis of a text could be carried out by a neural network trained on annotated textual data, by ensemble methods such as random forests, or by an unsupervised algorithm such as clustering (see the sketch after these examples).
- A plant recognition system can be developed using a supervised learning algorithm trained on a large dataset, or using a similarity prediction algorithm trained on a very small amount of data (few-shot learning).
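To illustrate the first example, the sketch below groups texts by similarity with an unsupervised clustering algorithm, i.e. without any annotated training data. It assumes scikit-learn; the corpus and the number of clusters are placeholders.

```python
# Minimal sketch: unsupervised semantic grouping of texts, with no annotation.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

texts = [  # placeholder corpus
    "the tram was late again this morning",
    "long delay on the metro today",
    "very friendly staff at the station",
    "great service on line 3",
]
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    KMeans(n_clusters=2, random_state=0, n_init=10),  # illustrative cluster count
)
labels = pipeline.fit_predict(texts)  # one cluster label per text
```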
While considering the specifications defined in the previous step, the system provider must therefore choose the architecture that best respects the rights and freedoms of individuals, in order to comply with the principle of data minimisation in relation to the intended purpose. In other words, if the same result can be obtained with less personal data, that architecture should be preferred.
At the model training stage, it is also necessary to consider any uncertainty about the performance of a given architecture: compliance with the principle of minimisation is assessed on the basis of available scientific knowledge.
Depending on advances in the field concerned, this reflection must weigh several factors for each of the architectures under consideration. This technical analysis can draw on:
- a state of the art, for example by means of:
- a study of scientific literature (study of academic or private publications, specialised conferences, etc.);
- a survey of the practices followed by professionals in the field: the move by some actors in the sector to open up their source code (including by placing it under a free license) facilitates the comparison of techniques;
- exchanges with the specialised community (online competitions, online forums, conferences and dedicated meetings, etc.);
- a comparison of the results obtained after implementing several architectures as proofs of concept (a sketch follows this list);
- a comparison of the results obtained by using an existing, pre-trained model (which may require adaptation, or fine-tuning) with those of a model developed entirely by the provider.
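Such a proof-of-concept comparison can be as simple as evaluating the candidate architectures under an identical validation protocol. A minimal sketch, assuming scikit-learn, with a public dataset standing in for the project's pilot data:

```python
# Minimal sketch: comparing candidate architectures under the same protocol.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)  # stand-in for the project's pilot data
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```

All other things being equal, the architecture that reaches the required performance with the least (and least intrusive) data should be preferred.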
While the choice of AI models and algorithms may limit data collection, other design choices should also be taken into account, notably with a view to data protection by design. In particular, the choice of learning protocol may make it possible to limit access to the data to authorised persons only, or to give access only to encrypted data. Two techniques, applicable in certain situations, are particularly interesting:
- Decentralised learning protocols, such as federated learning. These techniques make it possible to train an AI model from several datasets, and thus for each party in the chain to retain control over its data (a minimal sketch of the underlying idea follows this list). However, this approach carries certain risks, concerning the security of the decentralised datasets as well as trust between the parties, among whom a malicious actor could, for example, carry out a poisoning attack.
- The resources offered by cryptography. Recent scientific advances in cryptography can provide strong safeguards for data protection. Depending on the use case, it may, for example, be relevant to explore the possibilities offered by secure multiparty computation (SMPC) or fully homomorphic encryption (FHE). The techniques in this field make it possible to train an AI model on data that remains encrypted throughout the learning process. However, they remain limited in that they cannot be applied to all types of models, and because of the additional computation time they induce. In addition, some of them, such as fully homomorphic encryption for training neural networks, are still being studied. As technical developments are frequent in this area, it is advisable to keep a close watch on this subject.
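The sketch below illustrates the idea behind federated learning (federated averaging): each party computes updates locally and only model parameters are exchanged, never raw data. It uses a toy linear model in NumPy; a real deployment would rely on a dedicated framework and add safeguards against the poisoning risk mentioned above.

```python
# Minimal sketch of federated averaging (FedAvg) on a toy linear model:
# each party trains locally; only weights, never raw data, are shared.
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One party's local gradient steps on a least-squares objective."""
    w = weights.copy()
    for _ in range(epochs):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

def federated_round(weights, parties):
    """Average locally updated weights; datasets stay with each party."""
    return np.mean([local_update(weights, X, y) for X, y in parties], axis=0)

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
parties = []
for _ in range(3):  # three parties, each holding its own local dataset
    X = rng.normal(size=(50, 2))
    parties.append((X, X @ true_w + rng.normal(scale=0.1, size=50)))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, parties)  # w converges towards true_w
```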
This list of measures is not exhaustive: others, such as the use of a trusted execution environment (TEE), differential privacy applied during the learning phase, or machine unlearning, should also be considered. More generally, given the rapid evolution of the technology, it is recommended to maintain a technological watch on the privacy practices applicable to the development of AI systems.
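As an illustration of one of these measures, the sketch below shows the core mechanism of differential privacy applied during learning, in the spirit of DP-SGD: per-example gradients are clipped, then calibrated noise is added before the update. The clip norm and noise scale are illustrative assumptions; in practice they are calibrated to a target privacy budget (epsilon, delta).

```python
# Minimal sketch of a differentially private gradient step (DP-SGD spirit)
# on a linear least-squares model. Parameters are illustrative assumptions.
import numpy as np

CLIP_NORM = 1.0   # maximum per-example gradient norm
NOISE_STD = 0.5   # noise multiplier; calibrated to (epsilon, delta) in practice

def dp_gradient_step(w, X, y, lr=0.05, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    per_example = 2 * X * (X @ w - y)[:, None]           # per-example gradients
    norms = np.linalg.norm(per_example, axis=1, keepdims=True)
    clipped = per_example / np.maximum(1.0, norms / CLIP_NORM)
    noise = rng.normal(scale=NOISE_STD * CLIP_NORM / len(X), size=w.shape)
    return w - lr * (clipped.mean(axis=0) + noise)
```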
Identification of the necessary data
The principle
The principle of minimisation provides that personal data must be adequate, relevant and limited to what is necessary for the purposes for which they are processed. Particular attention must be paid to the nature of the data and this principle must be applied in a particularly rigorous manner when the data processed is sensitive (i.e. special categories of data within the meaning of Article 9 GDPR).
In practice
The principle of minimisation does not mean that it is forbidden to train an algorithm on very large volumes of data: it requires reflection before training, so as not to use personal data that are not useful for the development of the system. In order to identify the personal data necessary for the development of an AI system, four dimensions should be taken into account:
- Volume: number of persons concerned, historical depth, accuracy of data, distribution of the data across situations and populations, coverage, etc. The volume retained may be justified, for example, by the limited computing capacity of the servers used for learning, the needs in terms of representativeness of the dataset, the practices commonly accepted by the scientific community, a comparison of the results obtained by varying the volume of data (a sketch follows this list), a statistical analysis demonstrating that a minimum amount of data is necessary to achieve meaningful results, etc.;
- Categories: age, gender, face image, social network activity, etc. The presence of sensitive data or highly personal data should be examined and justified (see Sheet 3). This analysis may be based on the need to train the model on counterfactual data (likely to give rise to false positives in practice), a study of the usefulness of the data categories concerned (see box below), etc.;
- Typology: real, synthetic, augmented or simulation data, anonymised or pseudonymised data, etc.;
- Sources: as explained in sheet 3, identification of the data sources envisaged, whether an initial collection or a re-use (data available in open source, previously collected by the provider, or obtained from data providers).
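One common way to document the volume dimension is a learning curve: measuring how model quality evolves as the amount of training data grows. If the score plateaus well before the full volume, the extra records bring no benefit and need not be collected. A minimal sketch, assuming scikit-learn and a public dataset as a stand-in:

```python
# Minimal sketch: a learning curve to justify (or cap) the data volume.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the real dataset
sizes, _, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="f1",
)
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n} training records -> mean F1 = {score:.3f}")
```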
Although data selection is generally a necessary phase in designing an AI system based on quality data, in some cases and as an alternative it may be possible to process a set of data indiscriminately. In such cases, the need to do so must be justified. In addition to these technical dimensions, particular attention must be paid to the nature of the data within the meaning of the GDPR, in particular in the case of sensitive or highly personal data.
Issues relating to data distribution and representativeness should also be addressed at this stage. They are essential in order to minimise the risk of discriminatory biases.
The question arises, in particular, of including "true negative" data in the learning dataset (notably for testing and validation, in order to verify the absence of certain edge effects or learning effects).
As these questions are particularly important, a dedicated sheet will soon be published.
Validation of design choices
At the end of the previous three stages, the design choices have been validated in theory and data collection can begin. In order to validate these choices quantitatively and qualitatively, several measures are recommended as good practice.
Conducting a pilot study
The objective of the pilot study is to ensure that the technical choices and those relating to the types of data identified are relevant. To this end, a small-scale experiment is carried out. Fictitious, synthetic or anonymised data, or personal data collected in accordance with the GDPR, may be used.
Examples:
The use of data from social networks on the personal pages of persons who have consented to the collection of their data.
This type of experimentation does not always offer a representative view of the activity encountered on social networks, but it can be suited to certain use cases, such as the identification of hate content or the study of advertising targeting on these networks. This practice is beneficial because it offers a much higher level of transparency than certain harvesting practices (web scraping).
The design of a film recommendation system
An organisation may collect, from voluntary users, the list of films they viewed over a week and those viewed in the following days, either as declarative data or by collecting their viewing history on dedicated sites. It can then conduct its pilot study on the data thus collected, after anonymising each user's identifier.
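In practice, this can be done by replacing each raw identifier with a keyed hash before the pilot analysis, so that the study never handles the identifiers directly. The sketch below is illustrative; strictly speaking this is pseudonymisation, and moving towards anonymisation would also require destroying the key and verifying that no re-identification remains possible.

```python
# Minimal sketch: keyed hashing of user identifiers before a pilot study.
import hashlib
import hmac
import secrets

SECRET_KEY = secrets.token_bytes(32)  # illustrative; store apart, destroy after use

def pseudonymise(user_id: str) -> str:
    """Replace a raw identifier with a stable keyed hash."""
    return hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()

viewing_history = [("alice@example.com", "film_42"), ("bob@example.com", "film_7")]
pilot_data = [(pseudonymise(uid), film) for uid, film in viewing_history]
```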
Consulting an ethics committee
Involving an ethics committee in the development of AI systems is good practice to ensure that ethical issues and the protection of human rights and freedoms are taken into account upstream.
The ethics committee may have several tasks, such as:
- the formulation of opinions on all or some of the organisation's projects, tools, products, etc. that may raise ethical issues;
- the facilitation of reflection and the development of an internal doctrine on the ethical aspects of the organisation's development of AI systems (e.g. under what conditions subcontracting may be used);
- the uncovering of collective and individual attitudes and the recommendation of certain principles, behaviours or practices.
The composition and role of this committee may vary depending on the situation, but several good practices are recommended. The Committee should:
- be multidisciplinary: the profiles of the committee's members – employees of the organisation and/or external persons – must be diversified. Staff members contribute to the committee's missions and can bring to light issues that the development teams had not considered. A good practice is to rotate certain committee seats among the organisation's employees. In addition, diversity among the committee's members in terms of gender, age and ethnic and cultural origin is strongly encouraged;
- be independent: the opinions delivered by the committee may have important implications, for example for the commercial management of a company, and may thus favour or disadvantage some of its projects. The persons sitting on the committee must therefore not stand to gain (financially or otherwise) from the decisions rendered. Similarly, when employees sit on the committee, the decisions rendered must not have consequences for them;
- have a clearly defined role: in order to ensure that the committee is systematically involved, a procedure must be established determining the conditions under which it meets and must be consulted. Depending on the situation, the committee may be merely advisory or may adopt binding opinions: both approaches have advantages and disadvantages. If the committee delivers binding opinions, its place in corporate governance must be particularly well defined in the body's statutes, in order to avoid its instrumentalisation. If the committee is advisory, its impact must be guaranteed, in particular by ensuring mandatory referral to it according to precise criteria and broad transparency of its opinions (at least within the organisation), and possibly by other measures such as an obligation for the project owner to reply in writing to the committee's comments;
- be informed: the committee is encouraged to keep itself informed, document its opinions and share its knowledge. The risks associated with the use of AI evolve with technical developments and new uses in this field, and it is necessary to stay informed, in particular through the academic literature and the publications of competent entities (such as the Defender of Rights or the National Pilot Committee on Digital Ethics). The dissemination of the knowledge acquired will support the committee's opinions and spread good practices.
In the case of the development of an AI system, the opinion of the ethics committee could be sought on several issues:
- Do the data used for development meet the ethical criteria of the organisation?
- Could the intended operational uses for the AI system have serious individual or societal consequences? Can these consequences be avoided? Can these operational uses be excluded?
- Could the potential misuse of the AI system (whether voluntary or accidental, in particular for open source models) have serious consequences for people or society? What measures would prevent them?
- Are the technical choices sufficiently mastered by the organisation (in the case of radically new approaches)?
- Are the transparency measures sufficient to allow persons to exercise their rights, or to seek a remedy where necessary?
- Have the discriminations that may result from the use of the system been identified, and have the necessary means been put in place to prevent them?
- Is the organisation structured in such a way as to prevent risks by design (whether in terms of discrimination, data protection, copyright protection, computer security, etc.)?