A much-anticipated Opinion from the European Data Protection Board (EDPB) on AI models and data protection has not delivered the clear or definitive guidance that businesses operating in the EU had hoped for. The Opinion emphasises the need for case-by-case assessments to determine GDPR applicability, highlights the importance of accountability and record-keeping, and flags ‘legitimate interests’ as an appropriate legal basis under specific conditions. In rejecting the so-called ‘Hamburg thesis’, the EDPB has stated that AI models trained on personal data should be considered anonymous only if personal data cannot be extracted from them or regurgitated in their outputs.
Introduction
On 17 December 2024, the EDPB published a much-anticipated Opinion on AI models and data protection. The Opinion includes the EDPB’s view on the following key questions: does the development and use of an AI model involve the processing of personal data; and if so, what is the correct legal basis for that processing?
As is sometimes the case with EDPB Opinions, which necessarily represent the consensus view of the supervisory authorities of 27 different Member States, the Opinion does not provide many clear or definitive answers. Instead, the EDPB offers indicative guidance and criteria, calling for case-by-case assessments of AI models to understand whether, and how, they are impacted by the GDPR. In this context, the Opinion repeatedly highlights the importance of accountability and record-keeping by businesses developing or using AI, so that the applicability of data protection laws, and the business's compliance with those laws, can be properly assessed.
Whilst the equivocation of the Opinion might be viewed as unhelpful by European businesses looking for regulatory certainty, it is also a reflection of the complexities inherent in this intersection of law and technology.
In summary, the answers given by the EDPB to the four questions in the Opinion are as follows:
- Can an AI model, which has been trained using personal data, be considered anonymous? Yes, but only in some cases. The likelihood of obtaining personal data from the model – using all means reasonably likely to be used – must be insignificant, whether through attacks that aim to extract the original training data from the model itself, or through interactions with the AI model (i.e., personal data provided in responses to prompts/queries).
- Is ‘legitimate interests’ an appropriate legal basis for the training and development of an AI model? In principle yes, but only where the processing of personal data is necessary to develop the AI model, and where the ‘balancing test’ can be resolved in favour of the controller. In particular, the issue of data minimisation, and the related issue of web-scraping / indiscriminate capture of data, will be relevant here.
- Is ‘legitimate interests’ an appropriate legal basis for the deployment of an AI model? In principle yes, but only where the processing of personal data is necessary to deploy the AI model, and where the ‘balancing test’ can be resolved in favour of the controller. Here, the impact on the data subject of the use of the AI model is of predominant importance.
- If an AI model has been found to have been created, updated or developed using unlawfully processed personal data, how does this impact the subsequent use of that AI model? This depends in part on whether the AI model was first anonymised before being disclosed to the deployer of that model (see Question 1). Otherwise, the deployer of the model may need to assess the lawfulness of the development of the model as part of its accountability obligations.
Background
The Opinion was issued by the EDPB under Article 64 of the GDPR, in response to a request from the Irish Data Protection Commission. Article 64 allows a supervisory authority to request an EDPB opinion on matters of ‘general application’ or which ‘produce effects in more than one Member State’.
In this case, the Irish DPC asked the EDPB to provide an opinion on the above-mentioned questions – a request that is not surprising given both the general importance of AI models to businesses across the EU and the large number of technology companies developing those models that have established their European operations in Ireland.
In order to understand the Opinion, it helps to be familiar with certain concepts and terminology relating to AI.
First, the Opinion distinguishes between an ‘AI system’ and an ‘AI model’. For the former, the EDPB relies on the definition given in the EU AI Act: in short, a machine-based system operating with some degree of autonomy that infers, from inputs, how to produce outputs such as predictions, content, recommendations, or decisions. An AI model, meanwhile, is a component part of an AI system. Colloquially, it is the ‘brain’ of the AI system – an algorithm, or series of algorithms (such as a neural network), that recognises patterns in data. AI models require the addition of further components, such as a user interface, to become AI systems. To take a common example, the generative AI system known as ChatGPT is a software application comprising an AI model (the GPT Large Language Model) connected to a chatbot-style user interface that allows the user to submit queries (or ‘prompts’) to the model in the form of natural language questions. Whilst the Opinion is notionally concerned only with AI models, at times it appears to blur the distinction between the model and the system, in particular when discussing the significance of model outputs that are only rendered comprehensible to the user through an interface sitting outside of the model.
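By way of illustration only, the following minimal sketch (with entirely hypothetical class and method names, not any vendor’s actual API) makes the model/system distinction concrete: the ‘model’ maps numeric inputs to numeric outputs, while the ‘system’ adds the surrounding components – here, a toy text interface – that make those outputs usable by a person.

```python
# Minimal, hypothetical sketch of the Opinion's model/system distinction.
# Nothing here reflects a real product's architecture or API.

class AIModel:
    """The 'brain': maps numeric inputs to numeric outputs (e.g. token IDs)."""

    def generate_tokens(self, input_token_ids: list[int]) -> list[int]:
        # A real model would run a trained neural network; we just echo in reverse.
        return list(reversed(input_token_ids))


class AISystem:
    """The model plus further components -- here, a toy text interface."""

    def __init__(self, model: AIModel):
        self.model = model

    def ask(self, prompt: str) -> str:
        token_ids = [ord(ch) for ch in prompt]        # toy 'tokeniser'
        output_ids = self.model.generate_tokens(token_ids)
        return "".join(chr(i) for i in output_ids)    # toy 'detokeniser'


print(AISystem(AIModel()).ask("hello"))
```

The point of the sketch is simply that a user only ever interacts with the system layer, which is one reason the model/system boundary can blur when assessing what an output ‘contains’.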
Second, the Opinion relies on an understanding of a typical ‘AI lifecycle’, in which an AI model is first developed by training it on large volumes of data. This training may happen in a number of increasingly refined phases (the later phases are often referred to as ‘fine-tuning’). Only after an AI model has been developed can it be used, or ‘deployed’, in a live setting as part of an AI system. Often, the developer of an AI model will not be the same person as the deployer. This is relevant because the Opinion addresses both the development and the deployment phases.
The significance of the ‘Hamburg thesis’
With respect to the key question of whether AI models can be considered anonymous, the Opinion follows in the wake of a much-discussed paper published in July 2024 by the data protection authority for the German state of Hamburg. The paper took the position that AI models (specifically, Large Language Models) are, in isolation, anonymous – they do not involve the processing of personal data.
In order to reach that conclusion, the paper decoupled the model itself from: (i) the prior training of the model (which may involve the collection and further processing of personal data as part of the training dataset); and (ii) the subsequent use of the model, whereby a prompt/input may contain personal data, and an output may be used in a way that means it constitutes personal data.
Looking only at the AI model itself, the paper concluded that the tokens and values which make up the ‘inner workings’ of a typical AI model do not, in any meaningful way, relate to or correspond with information about identifiable individuals. Consequently, the model itself was found to be anonymous, even if the development and use of the model involve the processing of personal data.
The Hamburg thesis was welcomed for several reasons, not least because it resolved difficult questions such as how data subject rights could be understood in relation to an AI model (if someone asks for their personal data to be deleted, then what can this mean in the context of an AI model?), and the question of the lawful basis for ‘storing’ personal data in an AI model (as distinct from the lawful basis for collecting and preparing data to train the model).
However, as we go on to explain, the EDPB Opinion does not follow the relatively simple and certain framework presented by the Hamburg thesis. Instead, it introduces uncertainty by asserting that there are, in fact, scenarios where an AI model contains personal data, but that this must be determined on a case-by-case basis.
Are AI models anonymous?
First, the Opinion is only concerned with AI models that have been trained using personal data. Therefore, AI models trained using solely non-personal data (such as statistical data, or financial data relating to businesses) can, for the avoidance of doubt, be considered anonymous. However, in this context the broad scope of ‘personal data’ under the GDPR must be remembered, and the Opinion does not suggest any de minimis level of personal data that needs to be involved in the training of the AI model for the question of GDPR applicability to arise.
Where personal data is used in the training phase, the next question is whether the model is specifically designed to provide personal data regarding individuals whose personal data were used to train the model. If so, the AI model will not be anonymous. Examples include an AI model trained to provide a user, on request, with biographical information and contact details for directors of public companies, or a generative AI model trained on the voice recordings of famous singers so that it can, in turn, mimic the voices of those singers. In each case, the model is trained on personal data of specific individuals in order to be able to produce other personal data about those individuals as an output.
Finally, there is the intermediate case of AI models that are trained on personal data but that are not designed to provide personal data related to the training data as an output. It is this use case that the Opinion focuses on. The conclusion is that AI models in this category may be anonymous, but only if the developer of the model can demonstrate that information about individuals whose personal data was used to train the model cannot be ‘obtained from’ the model, using all means reasonably likely to be used. Notwithstanding that personal data used for training the model no longer exists within the model in its original form (but rather is “represented through mathematical objects”), that information is, in the eyes of the EDPB, still capable of constituting personal data.
The following question then arises: how does someone ‘obtain’ personal data from an AI model? In short, the Opinion posits two possibilities. The first is that training data is ‘extracted’ via deliberate attacks. The Opinion refers to an evolving field of research in this area and makes reference to techniques such as ‘model inversion’, ‘reconstruction attacks’, and ‘attribute and membership inference’. These are techniques that can be deployed to trick the model into revealing training data, or otherwise reconstruct that training data, in some cases relying on privileged access to the model itself. The second is the risk of accidental or inadvertent ‘regurgitation’ of personal data as part of an AI model’s outputs.
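To make the notion of ‘membership inference’ more concrete, the sketch below shows (in deliberately simplified form, with assumed names and an arbitrary threshold – not a method described in the Opinion) the classic confidence-threshold variant: an overfitted model tends to be unusually confident about records it was trained on, so unusually high confidence on a candidate record is treated as evidence of membership.

```python
# Simplified confidence-threshold membership inference test. The confidences
# below are made up; in practice they would come from querying the model under test.

def looks_like_training_member(model_confidence: float, threshold: float = 0.9) -> bool:
    """Guess that a record was in the training set if the model is
    unusually confident about it (overfitted models leak this way)."""
    return model_confidence >= threshold


candidate_records = {
    "jane.doe@example.com, DOB 1981": 0.97,   # hypothetical candidate record
    "a record the model never saw": 0.41,
}
for record, confidence in candidate_records.items():
    verdict = ("likely in training data" if looks_like_training_member(confidence)
               else "likely not in training data")
    print(f"{record} -> {verdict}")
```

Real attacks are considerably more sophisticated, which is why the Opinion treats resistance to them as something a developer must demonstrate rather than assume.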
Consequently, a developer must be able to demonstrate that its AI model is resistant both to attacks that extract personal data directly from the model and to the risk of regurgitation of personal data in response to queries: “In sum, the EDPB considers that, for an AI model to be considered anonymous, using reasonable means, both (i) the likelihood of direct (including probabilistic) extraction of personal data regarding individuals whose personal data were used to train the model; as well as (ii) the likelihood of obtaining, intentionally or not, such personal data from queries, should be insignificant for any data subject”.
Which criteria should be used to evaluate whether an AI model is anonymous?
Recognising the uncertainty in its conclusion that AI models ‘may or may not’ be anonymous, the EDPB provides a list of criteria that can be used to assess the likelihood of a model being found to contain personal data. These include:
- Steps taken to avoid or limit the collection of personal data during the training phase.
- Data minimisation or masking measures (e.g., pseudonymisation) applied to reduce the volume and sensitivity of personal data used during the training phase.
- The use of methodologies during model development that reduce privacy risks (e.g., regularisation methods to improve model generalisation and reduce overfitting, and appropriate and effective privacy-preserving techniques, such as differential privacy).
- Measures that reduce the likelihood of obtaining personal data from queries (e.g., ensuring the AI system blocks the presentation to the user of outputs that may contain personal data – a minimal sketch of such a filter follows this list).
- Document-based audits (internal or external) undertaken by the model developer that include an evaluation of the chosen measures and of their effectiveness in limiting the likelihood of identification.
- Testing of the model to demonstrate its resilience to different forms of data extraction attacks.
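As an illustration of the fourth criterion above (blocking outputs that may contain personal data), the following sketch shows a crude output filter that redacts anything resembling an email address or phone number before a response reaches the user. The patterns and the `generate` stub are placeholders; a real filter would be far broader and would complement, not replace, the training-stage measures listed above.

```python
# Hypothetical output filter illustrating the 'block personal data in outputs'
# criterion. The regex patterns and generate() stub are placeholders only.

import re

PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),   # things that look like email addresses
    re.compile(r"\+?\d[\d\s().-]{7,}\d"),      # rough phone-number shapes
]


def generate(prompt: str) -> str:
    # Stand-in for the real model/system inference call.
    return "You can reach Jane at jane.doe@example.com or +44 20 7946 0000."


def filtered_response(prompt: str) -> str:
    """Redact likely personal data before the output is shown to the user."""
    text = generate(prompt)
    for pattern in PII_PATTERNS:
        text = pattern.sub("[redacted]", text)
    return text


print(filtered_response("How do I contact Jane Doe?"))
```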
What is the correct legal basis for AI models?
When using personal data to train an AI model, the preferred legal basis is normally the ‘legitimate interests’ of the controller, under Article 6(1)(f) GDPR. This is for practical reasons. Whilst, in some circumstances, it may be possible to obtain GDPR-compliant consent from individuals authorising the use of their data for AI training purposes, in most cases this will not be feasible.
Helpfully, the Opinion accepts that legitimate interests is, in principle, a viable legal basis for processing personal data to train an AI model. Further, the Opinion also suggests that it should be straightforward for businesses to identify a lawful legitimate interest. For example, the Opinion cites “developing an AI system to detect fraudulent content or behaviour” as a sufficiently precise and real interest.
However, where businesses may have more difficulty is in showing that the processing of personal data is necessary to realise their legitimate interest, and that their legitimate interest is not outweighed by any impact on the rights and freedoms of data subjects (the ‘balancing test’). Whilst this is fundamentally just a restatement of existing legal principles, the following sentence should nevertheless cause some concern for businesses developing AI models, in particular Large Language Models: “If the pursuit of the purpose is also possible through an AI model that does not entail processing of personal data, then processing personal data should be considered as not necessary”. Technically speaking, it may often be the case that personal data is not essential for the training of an AI model – however, this does not mean that it is straightforward to systematically remove all personal data from a training dataset, or otherwise replace all identifying elements with ‘dummy’ values.
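To illustrate why that caveat matters, the sketch below shows the kind of ‘dummy value’ substitution the necessity analysis contemplates, applied to a structured record (the field names, secret and keyed-hash approach are assumptions made for the sketch). It works only because the identifying fields are already known and neatly labelled; free-text, web-scale training corpora offer no such structure, which is precisely where the practical difficulty lies.

```python
# Illustrative pseudonymisation of a structured training record. Field names,
# the secret and the hashing scheme are assumptions for this sketch only.

import hashlib


def pseudonymise(value: str, secret: str = "keep-this-secret-elsewhere") -> str:
    """Replace an identifier with a stable, hard-to-reverse dummy token."""
    digest = hashlib.sha256((secret + value).encode("utf-8")).hexdigest()[:10]
    return f"person_{digest}"


record = {"name": "Jane Doe", "email": "jane.doe@example.com", "purchase": "bicycle"}
training_record = {
    field: pseudonymise(value) if field in {"name", "email"} else value
    for field, value in record.items()
}
print(training_record)
```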
With respect to the balancing test, the EDPB asks businesses to consider a data subject’s interest in self-determination and in maintaining control over their own data when considering whether it is lawful to collect personal data for model training purposes. In particular, it may be more difficult to satisfy the balancing test if a developer is scraping large volumes of personal data (especially including any sensitive data categories) against data subjects’ wishes, without their knowledge, or otherwise in contexts that they would not reasonably expect.
When it comes to the separate purpose of deploying an AI model, the EDPB asks businesses to consider the impact on the data subject’s fundamental rights that arises from the purpose for which the AI model is used. For example, AI models that are used to block content publication may adversely affect a data subject’s fundamental right to freedom of expression. Conversely, the EDPB recognises that the deployment of AI models may have a positive impact on a data subject’s rights and freedoms – for example, an AI model that is used to improve the accessibility of certain services for people with disabilities. In line with Recital 47 GDPR, the EDPB reminds controllers to consider the ‘reasonable expectations’ of data subjects in relation to both training and deployment uses of personal data.
Finally, the Opinion discusses a range of ‘mitigating measures’ that may be used to reduce risks to data subjects and therefore tip the balancing test in favour of the controller. These include:
- Technical measures to reduce the volume or sensitivity of the personal data used (e.g., pseudonymisation, masking).
- Measures to facilitate the exercise of data subject rights (e.g., providing an unconditional right for data subjects to opt-out of the use of their personal data for training or deploying the model; allowing a reasonable period of time to elapse between collection of training data and its use).
- Transparency measures (e.g., public communications about the controller’s practices in connection with the use of personal data for AI model development).
- Measures specific to web-scraping (e.g., excluding publications that present particular risks; excluding certain data categories or sources; excluding websites that clearly object to web scraping – see the sketch after this list).
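As a small, concrete example of the last measure in that list, the sketch below checks a site’s robots.txt (using Python’s standard urllib.robotparser) before anything is collected from it. The URL and user-agent string are placeholders, and a real scraping pipeline would layer further source, category and risk-based exclusions on top.

```python
# Sketch: honour robots.txt before scraping a page for training data.
# The URL and user-agent below are placeholders.

from urllib import robotparser
from urllib.parse import urlparse


def may_scrape(url: str, user_agent: str = "example-training-crawler") -> bool:
    """Return True only if the site's robots.txt permits fetching this URL."""
    parts = urlparse(url)
    parser = robotparser.RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()  # fetches and parses the site's robots.txt
    return parser.can_fetch(user_agent, url)


target = "https://example.com/profiles/jane-doe"
if may_scrape(target):
    print("robots.txt permits collection of", target)
else:
    print("site objects to scraping -- skipping", target)
```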
Notably, the EDPB observes that, to be effective, these mitigating measures must go beyond mere compliance with GDPR obligations (for example, providing a GDPR-compliant privacy notice, which a controller would in any case be required to do, would not be an effective transparency measure for these purposes).
When are companies liable for non-compliant AI models?
In its final question, the DPC sought clarification from the EDPB on how a deployer of an AI model might be impacted by any unlawful processing of personal data in the development phase of the AI model.
According to the EDPB, such ‘upstream’ unlawful processing may impact a subsequent deployer of an AI model in the following ways:
- Corrective measures taken against the developer may have a knock-on effect on the deployer – for example, if the developer is ordered to delete personal data unlawfully collected for training purposes, the deployer would not be permitted to continue processing that data. However, this raises an important practical question about how such data could be identified in, and deleted from, the AI model, given that the model does not retain training data in its original form.
- Unlawful processing in the development phase may impact the legal basis for the deployment of the model – in particular, if the deployer of the AI model is relying on ‘legitimate interests’, it will be more difficult to satisfy the balancing test in light of the deficiencies associated with the collection and use of the training data.
In light of these risks, the EDPB recommends that deployers take reasonable steps to assess the developer’s compliance with data protection laws during the training phase. For example, can the developer explain the sources of data used, the steps taken to comply with the minimisation principle, and any legitimate interest assessments conducted for the training phase? For certain AI models, the transparency obligations imposed in relation to AI systems under the AI Act should assist a deployer in obtaining this information from a third-party AI model developer.
While the Opinion provides a useful framework for assessing GDPR issues with AI systems, businesses operating in the EU may be frustrated by the lack of certainty or definitive guidance on many key questions relating to this new era of technological innovation.