The legal basis of legitimate interests: Focus sheet on open source models

02 July 2024

In view of their potential benefits, open source practices should be considered when assessing legitimate interests of an AI system provider. However, it is necessary to adopt safeguards to limit the harm they can cause to individuals.

This content is a courtesy translation of the original publication in French. In the event of any inconsistencies between the French version and this English translation, please note that the French version shall prevail.

Open source in AI

Given the absence of a commonly accepted definition of open source models, the CNIL observes that, in this field, the term covers a variety of practices. While the publication of model parameters is a minimum condition for speaking of open source, other practices can also be beneficial in many cases. These practices can be categorized as follows:

Transparency in model development, including the publication of:
- the documentation of the procedure followed to develop the model (including the data collection phase for training), possibly in the form of a scientific publication,
- the code used to train the model,
- the training data.
Transparency of the model obtained, including the publication of:
- the model documentation, detailing for example its architecture, performance, and limitations, possibly in the form of a descriptive sheet (often called a model card);
- the model weights.
Access to the model, including publication of:
- a library allowing its use,
- an API for its use,
- the code to use the model,
- the model under a licence allowing its use, modification, or redistribution.

Some of these practices, such as the publication of the dataset used for the training, may nevertheless entail risks for data subjects, and therefore cannot be recommended in all cases. Open source practices may have benefits, even if not all of the elements listed above are disclosed. The publication of certain elements may, however, be necessary to ensure significant gains in terms of transparency or peer review. In these cases, additional measures are recommended to limit the impact on individuals’ rights.

The benefits of open source in AI

Open source in AI can have significant benefits for the controller, allowing them to leverage community contributions or to make their models more attractive by facilitating their adoption by other stakeholders.

It also brings numerous advantages to research activities and scientific innovation. For instance, it fosters knowledge sharing among developers in the open source community, improves accessibility for students, encourages the design and publication of related tools using open source models, and promotes the harmonisation of practices and interoperability of models and systems.

Open source can also have benefits for individuals whose data is used in the development phase, for model users, or for those whose data is used in the deployment phase. This could:

- increase the transparency of the model and its functioning, thereby facilitating individuals’ exercise of their rights;
- enable verification of the model’s capabilities and limitations (such as its theoretical performance on training data or other datasets, as well as its behavior in edge cases or under specific conditions);
- facilitate the verification and detection of biases in order to reduce or correct them. However, the level of control varies depending on which elements are published: for instance, opening just the model allows studying biases during its use, whereas opening weights and training data, along with documenting the database creation process, can enable the identification of biases introduced during data collection, annotation, or preprocessing;
- facilitate the detection of vulnerabilities in the model to improve its security (for example, checking whether personal data is memorised during training, whether the model can generate illegal content, such as hateful content or content inciting dangerous behavior, or analyzing the performance of the model), although this may sometimes require access to or knowledge of the data used for training.

Whether these benefits can be taken into account, and to what extent, will depend on which of the elements listed above are made open source.

The benefits of open source models may be considered in the assessment of the legitimate interest of the controller in the development phase. Open source publication may:

- strengthen the legitimate interest of data controllers, in view of the benefits for scientific research and for ensuring the quality of the model against risks of unlawfulness in the use phase, in particular discrimination where the model contains biases;
- provide an additional guarantee for processing operations in the development phase, in particular where it has benefits in terms of transparency, accountability or peer review.

The risks associated with open source in AI

Additional guarantees to be provided

As a result, open source publication can be taken into account in the assessment of legitimate interests or as an additional safeguard only if certain appropriate safeguards are implemented. The following measures are thus recommended:

Ensure that the published elements allow for a sufficient level of transparency, effective peer review and a real contribution to the open source community or scientific research, at minimum by opening the following elements:
- the model parameters (including weights corresponding to reinforcement learning when their publication does not jeopardize the developer’s business model, for which these weights generally hold significant value);
- the code required to use the model;
- a model description sheet, including information on its architecture, performance and limitations;
- a descriptive sheet of the data used for training, fine-tuning or improving the model on the basis of the one proposed in the annex to the sheet “Taking data protection into account in data collection and management”;
- the publication of the training dataset would allow for increased peer review, in particular for detecting and correcting possible biases. However, this is only possible if it does not disproportionately impact the rights and freedoms of individuals, especially by implementing the necessary measures (data security, such as anonymisation/pseudonymisation of data, increased information of individuals, measures to ensure the exercise of rights along the chain of stakeholders, etc.);
- distributing the model under a license allowing contribution to the open source community, and authorising in particular its download, modification, and reuse in compliance with the conditions listed below;
  
  These recommendations are in line with the European AI Act, which provides for certain derogations for AI systems and models published under a sufficiently open license, underlining the importance of opening up parameters (including weights), as well as information on the architecture and use of the models in question.
Implement legal measures (e.g. restrictive licenses) to limit model reuse and technical measures (e.g. digital watermarking) to trace and control certain reuses. For example, the use of BLOOM models is governed by a reuse license which restricts potential uses;
Implement technical data security measures such as anonymisation or pseudonymisation of data or carrying out analyses to measure the risks of regurgitation or data leakage;
Implement measures to ensure that data subjects are informed and can exercise their rights, such as increased information or technical measures to ensure that the exercise of rights is passed on along the chain. In the latter case, this may involve:
- specifying in the terms and conditions the obligation to pass on the effects of the exercise of rights of opposition, rectification or erasure to systems developed subsequently;
- ensuring the traceability of downloads in high-risk cases (e.g. by keeping contact information of individuals or organizations downloading them), which may make it easier to implement the rights of individuals, although this does not strictly adhere to open source principles;
- disabling downloads of previous versions of an AI model, for example if it has been modified following a request to exercise opposition, rectification, or erasure rights. Again, traceability of downloads makes it possible to inform reusers of the existence of this new version and thus to encourage them to use it.
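
As an illustration only, the last two measures (traceability of downloads and withdrawal of previous versions) could be sketched as follows. This is a minimal sketch and not a method prescribed by the CNIL: the `ModelRegistry` class, its method names, the version labels and the contact addresses are all hypothetical, and a real distribution platform would need persistent storage and its own safeguards for the contact data it keeps.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRegistry:
    """Hypothetical registry of published model versions and their downloaders."""

    available: set[str] = field(default_factory=set)
    # Maps each version to the contact details of those who downloaded it.
    downloads: dict[str, set[str]] = field(default_factory=dict)

    def publish(self, version: str) -> None:
        """Make a version available for download."""
        self.available.add(version)
        self.downloads.setdefault(version, set())

    def download(self, version: str, contact: str) -> None:
        """Record who downloads a version; refuse withdrawn or unknown versions."""
        if version not in self.available:
            raise LookupError(f"version {version!r} is not available for download")
        self.downloads[version].add(contact)

    def supersede(self, old: str, new: str) -> list[str]:
        """Withdraw `old`, publish `new`, and return the contacts to inform."""
        self.available.discard(old)
        self.publish(new)
        return sorted(self.downloads.get(old, set()))


# Illustrative scenario: a model is modified after an erasure request.
registry = ModelRegistry()
registry.publish("1.0")
registry.download("1.0", "lab@example.org")
registry.download("1.0", "startup@example.com")

# Version 1.0 is withdrawn; its reusers can be told that 1.1 exists.
to_notify = registry.supersede("1.0", "1.1")
```

In this sketch, the list returned by `supersede` is what would let the provider inform reusers of the new version, while any later attempt to download the withdrawn version raises an error.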

Find out more:

For more information on the benefits and risks of open source, see our in-depth analysis on “Open source practices in artificial intelligence”;
For more information on the legal basis for open data publication, see fact sheet 2 of the Practical Guide on Openness and Reuse of Publicly Available Data (in French).
