Taking data protection into account in the system design choices
To ensure the development of a privacy-friendly AI system, it is necessary to give careful thought to the design of the system. This sheet details the steps involved.
When considering the design choices for an AI system, data protection principles, and in particular the principle of minimisation, must be respected. This approach involves five steps. A data controller must therefore consider:
- the goal of the system to be developed;
- the system’s technical architecture, which will influence the characteristics of the dataset;
- the data sources to be used (see the how-to sheet on legal compliance; open sources, third parties, etc.);
- the selection, from these sources, of the strictly necessary data, having regard to their usefulness and to the potential impact of their collection on the rights and freedoms of the persons concerned;
- the validity of the choices previously made. Such validation may take different (non-exclusive) forms, such as a pilot study or the opinion of an ethical committee.
Specification of the objective pursued
The aim of this stage is to design a system based on the identified purpose (see how-to sheet 2), in compliance with a set of specifications, while limiting the potential consequences for the people concerned.
In specifying the use of the AI system in the deployment phase (whether deployed directly by the provider or by a third party), the system provider must determine:
- the type of result/output expected;
- acceptable performance indicators for the solution, whether quantitative (e.g. F1-score, mean squared error, computational load and time; see the sketch after this list) or qualitative (e.g. from human feedback);
- the context in which the system will be used, in order to identify the information essential to its operational use;
- excluded contexts of use and information not relevant to the envisaged main use case(s) of the system.
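To make the quantitative indicators concrete, they can be fixed in the specifications and checked automatically against held-out data. The sketch below is a minimal illustration, assuming scikit-learn and purely illustrative thresholds; real acceptance criteria would come from the specification work described above.

```python
# Minimal sketch: checking a candidate model against pre-agreed performance
# indicators. The thresholds are illustrative assumptions, not recommendations.
import time
from sklearn.metrics import f1_score

F1_MIN = 0.80          # assumed acceptable F1-score
LATENCY_MAX_S = 0.05   # assumed per-sample inference budget (seconds)

def meets_specification(model, X_val, y_val):
    """Return True if the fitted model satisfies the agreed indicators."""
    start = time.perf_counter()
    y_pred = model.predict(X_val)
    latency = (time.perf_counter() - start) / len(X_val)
    return f1_score(y_val, y_pred) >= F1_MIN and latency <= LATENCY_MAX_S
```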
Some AI techniques can perform complex tasks that go beyond the provider's initial objectives. Precisely defining the expected functionalities avoids the risk of over-collection.
Example: to train an AI system to count standing passengers in a tram or metro from video surveillance camera images, the following systems are technically feasible:
- a neural network that detects the presence of people in a carriage, without posture analysis, combined with an algorithm that counts standing people (the number of standing people can be inferred from the detected total and the number of seating positions);
- a neural network that analyses the posture of people in a carriage, combined with an algorithm counting standing passengers.
The first network provides less information while still yielding the standing count. If the estimate it provides is sufficient for the intended use case, in particular for calculating occupancy statistics, it should be preferred: it requires a smaller amount of data to be trained while still meeting the intended objective, whereas the second requires the collection and annotation of specific and more extensive data. The principle of minimisation then converges with a reduction in system design costs, without prejudice to the accuracy of the system.
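As an illustration of the first, minimising approach, the standing count can be derived with simple arithmetic from a plain person count and the carriage's known number of seats. The sketch below is hypothetical: `detect_people` stands in for any off-the-shelf person detector, and the seat count is an illustrative constant.

```python
# Minimal sketch of the minimising approach: no posture analysis, only a
# person count combined with the known number of seats in the carriage.
SEATS_PER_CARRIAGE = 40  # illustrative constant, known from the rolling stock

def estimate_standing(frame, detect_people):
    """Estimate standing passengers from a plain detection count."""
    people = detect_people(frame)  # hypothetical detector: one box per person
    # Rough assumption: seats are occupied before anyone stands. Often
    # sufficient for aggregate occupancy statistics.
    return max(0, len(people) - SEATS_PER_CARRIAGE)
```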
Definition of the technical architecture of the system
Quite often, the same task can be performed using different AI model architectures. However, they are not all equivalent: they may not achieve the same level of performance, may present different challenges in terms of explainability, may be subject to different operational constraints (such as computational cost), or may not require the same amount of data for their development.
Examples:
- Semantic analysis of a text could be carried out by a neural network trained on annotated textual data, by ensemble methods such as random forests, or by an unsupervised algorithm such as clustering (see the sketch after these examples).
- A plant recognition system can be developed using a supervised learning algorithm trained on a large dataset, or using a similarity prediction algorithm trained on a very small amount of data (few-shot learning).
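To illustrate the first example, the sketch below groups texts by similarity with an unsupervised clustering algorithm, i.e. without any annotated training data. It assumes scikit-learn; the corpus and the number of clusters are placeholders.

```python
# Minimal sketch: unsupervised semantic grouping of texts, with no annotation.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

texts = [  # placeholder corpus
    "the tram was late again this morning",
    "long delay on the metro today",
    "very friendly staff at the station",
    "great service on line 3",
]
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    KMeans(n_clusters=2, random_state=0, n_init=10),  # illustrative cluster count
)
labels = pipeline.fit_predict(texts)  # one cluster label per text
```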
While considering the specifications defined in the previous step, the system provider must therefore choose the architecture that best respects the rights and freedoms of individuals, in order to comply with the principle of data minimisation in relation to the intended purpose. In other words, if the same result can be obtained with less personal data, that architecture should be preferred.
At the model training stage, it is also necessary to consider any uncertainty about the performance of a given architecture: compliance with the principle of minimisation is assessed on the basis of available scientific knowledge.
Depending on advances in the field concerned, this reflection must weigh several factors for each of the architectures under consideration. This technical analysis can draw on:
- a state of the art, for example by means of:
- a study of scientific literature (study of academic or private publications, specialised conferences, etc.);
- a survey of the practices followed by professionals in the field: the move by some actors in the sector to open up their source code (including by placing it under a free license) facilitates the comparison of techniques;
- exchanges with the specialised community (online competitions, online forums, conferences and dedicated meetings, etc.);
- a comparison of the results obtained after implementing several architectures as proofs of concept (a sketch follows this list);
- a comparison of the results obtained by using an existing, pre-trained model (which may require adaptation, or fine-tuning) with those of a model developed entirely by the provider.
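Such a proof-of-concept comparison can be as simple as evaluating the candidate architectures under an identical validation protocol. A minimal sketch, assuming scikit-learn, with a public dataset standing in for the project's pilot data:

```python
# Minimal sketch: comparing candidate architectures under the same protocol.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)  # stand-in for the project's pilot data
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```

All other things being equal, the architecture that reaches the required performance with the least (and least intrusive) data should be preferred.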
While the choice of AI models and algorithms may limit data collection, other design choices should also be taken into account, notably with a view to data protection by design. In particular, the choice of learning protocol may make it possible to limit access to the data to authorised persons only, or to give access only to encrypted data. Two techniques, applicable in certain situations, are particularly interesting:
- Decentralised learning protocols, such as federated learning. These techniques make it possible to train an AI model from several datasets, and thus for each party in the chain to retain control over its data (a minimal sketch of the underlying idea follows this list). However, this approach carries certain risks, concerning the security of the decentralised datasets as well as trust between the parties, among whom a malicious actor could, for example, carry out a poisoning attack.
- The resources offered by cryptography. Recent scientific advances in cryptography can provide strong safeguards for data protection. Depending on the use case, it may, for example, be relevant to explore the possibilities offered by secure multiparty computation (SMPC) or fully homomorphic encryption (FHE). The techniques in this field make it possible to train an AI model on data that remains encrypted throughout the learning process. However, they remain limited in that they cannot be applied to all types of models, and because of the additional computation time they induce. In addition, some of them, such as fully homomorphic encryption for training neural networks, are still being studied. As technical developments are frequent in this area, it is advisable to keep a close watch on this subject.
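The sketch below illustrates the idea behind federated learning (federated averaging): each party computes updates locally and only model parameters are exchanged, never raw data. It uses a toy linear model in NumPy; a real deployment would rely on a dedicated framework and add safeguards against the poisoning risk mentioned above.

```python
# Minimal sketch of federated averaging (FedAvg) on a toy linear model:
# each party trains locally; only weights, never raw data, are shared.
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One party's local gradient steps on a least-squares objective."""
    w = weights.copy()
    for _ in range(epochs):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

def federated_round(weights, parties):
    """Average locally updated weights; datasets stay with each party."""
    return np.mean([local_update(weights, X, y) for X, y in parties], axis=0)

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
parties = []
for _ in range(3):  # three parties, each holding its own local dataset
    X = rng.normal(size=(50, 2))
    parties.append((X, X @ true_w + rng.normal(scale=0.1, size=50)))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, parties)  # w converges towards true_w
```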
This list of measures is not exhaustive: others, such as the use of a trusted execution environment (TEE), differential privacy applied during the learning phase, or machine unlearning, should also be considered. More generally, given the rapid evolution of the technology, it is recommended to maintain a technological watch on the privacy practices applicable to the development of AI systems.
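As an illustration of one of these measures, the sketch below shows the core mechanism of differential privacy applied during learning, in the spirit of DP-SGD: per-example gradients are clipped, then calibrated noise is added before the update. The clip norm and noise scale are illustrative assumptions; in practice they are calibrated to a target privacy budget (epsilon, delta).

```python
# Minimal sketch of a differentially private gradient step (DP-SGD spirit)
# on a linear least-squares model. Parameters are illustrative assumptions.
import numpy as np

CLIP_NORM = 1.0   # maximum per-example gradient norm
NOISE_STD = 0.5   # noise multiplier; calibrated to (epsilon, delta) in practice

def dp_gradient_step(w, X, y, lr=0.05, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    per_example = 2 * X * (X @ w - y)[:, None]           # per-example gradients
    norms = np.linalg.norm(per_example, axis=1, keepdims=True)
    clipped = per_example / np.maximum(1.0, norms / CLIP_NORM)
    noise = rng.normal(scale=NOISE_STD * CLIP_NORM / len(X), size=w.shape)
    return w - lr * (clipped.mean(axis=0) + noise)
```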
Identification of the necessary data
The principle
The principle of minimisation provides that personal data must be adequate, relevant and limited to what is necessary for the purposes for which they are processed. Particular attention must be paid to the nature of the data and this principle must be applied in a particularly rigorous manner when the data processed is sensitive (i.e. special categories of data within the meaning of Article 9 GDPR).
In practice
The principle of minimisation does not mean that it is forbidden to train an algorithm on very large volumes of data: it requires reflection before training, so as not to use personal data that are not useful for the development of the system. In order to identify the personal data necessary for the development of an AI system, four dimensions should be taken into account:
- Volume: number of persons concerned, historical depth, accuracy of data, distribution of the data across situations and populations, coverage, etc. The volume retained may be justified, for example, by the limited computing capacity of the servers used for learning, the needs in terms of representativeness of the dataset, the practices commonly accepted by the scientific community, a comparison of the results obtained by varying the volume of data (a sketch follows this list), a statistical analysis demonstrating that a minimum amount of data is necessary to achieve meaningful results, etc.;
- Categories: age, gender, face image, social network activity, etc. The presence of sensitive data or highly personal data should be examined and justified (see Sheet 3). This analysis may be based on the need to train the model on counterfactual data (likely to give rise to false positives in practice), a study of the usefulness of the data categories concerned (see box below), etc.;
- Typology: real, synthetic, augmented or simulation data, anonymised or pseudonymised data, etc.;
- Sources: as explained in sheet 3, identification of the data sources envisaged, whether an initial collection or a re-use (data available in open source, previously collected by the provider, or obtained from data providers).
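One common way to document the volume dimension is a learning curve: measuring how model quality evolves as the amount of training data grows. If the score plateaus well before the full volume, the extra records bring no benefit and need not be collected. A minimal sketch, assuming scikit-learn and a public dataset as a stand-in:

```python
# Minimal sketch: a learning curve to justify (or cap) the data volume.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the real dataset
sizes, _, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="f1",
)
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n} training records -> mean F1 = {score:.3f}")
```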
Although data selection is generally a necessary phase in designing an AI system based on quality data, in some cases and as an alternative it may be possible to process a set of data indiscriminately. In such cases, the need to do so must be justified. In addition to these technical dimensions, particular attention must be paid to the nature of the data within the meaning of the GDPR, in particular in the case of sensitive or highly personal data.
Issues relating to data distribution and representativeness should also be addressed at this stage. They are essential in order to minimise the risk of discriminatory biases.
The question arises, in particular, of including "true negative" data in the learning dataset (notably for testing and validation, in order to verify the absence of certain edge effects or learning effects).
As these questions are particularly important, a dedicated sheet will soon be published.
Validation of design choices
At the end of the previous three stages, the design choices have been validated in theory and data collection can begin. In order to validate these choices quantitatively and qualitatively, several measures are recommended as good practice.
Conducting a pilot study
The objective of the pilot study is to ensure that the technical choices and those relating to the types of data identified are relevant. To this end, a small-scale experiment is carried out. Fictitious, synthetic or anonymised data, or personal data collected in accordance with the GDPR, may be used.
Examples:
The use of data from social networks on the personal pages of persons who have consented to the collection of their data.
This type of experimentation does not always offer a representative view of the activity encountered on social networks, but it can be suited to certain use cases, such as the identification of hate content or the study of advertising targeting on these networks. This practice is beneficial because it offers a much higher level of transparency than certain harvesting practices (web scraping).
The design of a film recommendation system
An organisation may collect, from voluntary users, the list of films they viewed over a week and those viewed in the following days, either as declarative data or by collecting their viewing history on dedicated sites. It can then conduct its pilot study on the data thus collected, after anonymising each user's identifier.
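In practice, this can be done by replacing each raw identifier with a keyed hash before the pilot analysis, so that the study never handles the identifiers directly. The sketch below is illustrative; strictly speaking this is pseudonymisation, and moving towards anonymisation would also require destroying the key and verifying that no re-identification remains possible.

```python
# Minimal sketch: keyed hashing of user identifiers before a pilot study.
import hashlib
import hmac
import secrets

SECRET_KEY = secrets.token_bytes(32)  # illustrative; store apart, destroy after use

def pseudonymise(user_id: str) -> str:
    """Replace a raw identifier with a stable keyed hash."""
    return hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()

viewing_history = [("alice@example.com", "film_42"), ("bob@example.com", "film_7")]
pilot_data = [(pseudonymise(uid), film) for uid, film in viewing_history]
```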
Consulting an ethics committee
Involving an ethics committee in the development of AI systems is good practice to ensure that ethical issues and the protection of human rights and freedoms are taken into account upstream.
The ethics committee may have several tasks, such as:
- the formulation of opinions on all or some of the organisation's projects, tools, products, etc. that may raise ethical issues;
- the facilitation of reflection and the development of an internal doctrine on the ethical aspects of the organisation's development of AI systems (e.g. under what conditions subcontracting may be used);
- the uncovering of collective and individual attitudes and the recommendation of certain principles, behaviours or practices.
The composition and role of this committee may vary depending on the situation, but several good practices are recommended. The Committee should:
- be multidisciplinary: the profiles of the committee's members – employees of the organisation and/or external persons – must be diversified. Staff members contribute to the committee's missions and can bring to light issues that the development teams had not considered. A good practice is to rotate certain committee seats among the organisation's employees. In addition, diversity among the committee's members in terms of gender, age and ethnic and cultural origin is strongly encouraged;
- be independent: the opinions delivered by the committee may have important implications, for example for the commercial management of a company, and may thus favour or disadvantage some of its projects. The persons sitting on the committee must therefore not stand to gain (financially or otherwise) from the decisions rendered. Similarly, when employees sit on the committee, the decisions rendered must not have consequences for them;
- have a clearly defined role: in order to ensure that the committee is systematically involved, a procedure must be established determining the conditions under which it meets and must be consulted. Depending on the situation, the committee may be merely advisory or may adopt binding opinions: both approaches have advantages and disadvantages. If the committee delivers binding opinions, its place in corporate governance must be particularly well defined in the body's statutes, in order to avoid its instrumentalisation. If the committee is advisory, its impact must be guaranteed, in particular by ensuring mandatory referral to it according to precise criteria and broad transparency of its opinions (at least within the organisation), and possibly by other measures such as an obligation for the project owner to reply in writing to the committee's comments;
- be informed: the committee is encouraged to keep itself informed, document its opinions and share its knowledge. The risks associated with the use of AI evolve with technical developments and new uses in this field, and it is necessary to stay informed, in particular through the academic literature and the publications of competent entities (such as the Defender of Rights or the National Pilot Committee on Digital Ethics). The dissemination of the knowledge acquired will support the committee's opinions and spread good practices.
In the case of the development of an AI system, the opinion of the ethics committee could be sought on several issues:
- Do the data used for development meet the ethical criteria of the organisation?
- Could the intended operational uses for the AI system have serious individual or societal consequences? Can these consequences be avoided? Can these operational uses be excluded?
- Could the potential misuse of the AI system (whether voluntary or accidental, in particular for open source models) have serious consequences for people or society? What measures would prevent them?
- Are the technical choices sufficiently mastered by the organisation (in the case of radically new approaches)?
- Are the transparency measures sufficient to allow persons to exercise their rights, or to seek a remedy where necessary?
- Have the discriminations that may result from the use of the system been identified, and have the necessary means been put in place to prevent them?
- Is the organisation structured in such a way as to prevent risks by design (whether in terms of discrimination, data protection, copyright protection, computer security, etc.)?