Key takeaways
- Legitimate interest can serve as a lawful basis for web scraping to train AI, but it demands robust safeguards. It is not a blanket permission.
- The distinction between legacy data (collected before the AI era) and future data shapes both user expectations and legal compliance.
- Data minimisation and transparency are not optional extras. They are essential for balancing innovation with privacy rights under GDPR.
Every major AI company scrapes the web to build its training datasets. But is web scraping for AI training actually legal under GDPR? The answer matters enormously, both for organisations developing AI systems and for the billions of people whose data feeds them.
To dig into this question, I spoke with Tainá Baylão, Global Privacy Counsel at Roche, whose Master’s thesis at Maastricht University’s European Center on Privacy and Cybersecurity examines exactly this: can legitimate interest provide a workable lawful basis for AI training under GDPR?
Watch the full interview.
Legitimate interest can work for AI training, but it is not a free pass
The landscape shifted in 2023 when Italy’s Garante temporarily blocked ChatGPT and rejected OpenAI’s reliance on performance of contract as the lawful basis for training data. The Garante pointed to two alternatives: consent and legitimate interest. Since then, the EDPB’s December 2024 opinion on AI models and its guidelines on legitimate interest have developed this position further.
The critical point from our conversation: legitimate interest is not a magic formula. It requires demonstrable, practical measures across four areas: transparency, data minimisation, handling data subject requests, and accountability. Skip any of these, and the legal basis collapses.
The Meta case: why legacy data and future data need different treatment
Meta’s approach to AI training offers a useful case study. The company announced its plans through newspaper advertisements, triggering public discussion about its opt-out mechanism. But the real question lies in how legacy data should be treated differently from new data.
The distinction matters. Content posted before AI training was widely understood was shared under different expectations. Users did not anticipate their posts would train language models. For this legacy data, an opt-in approach may be more appropriate. For future data, collected after users have been informed about AI uses, legitimate interest combined with a clear opt-out mechanism becomes more defensible, because users can now make a conscious choice about their participation.
This dual framework acknowledges something regulators are increasingly emphasising: context and user expectations evolve, and lawful processing must evolve with them.
Data minimisation: defining no-go zones before you start scraping
One of the most practical discussions in the interview concerned data minimisation. Several DPA guidelines, notably the CNIL’s, outline concrete strategies. Even for general-purpose AI models, organisations can and should draw clear boundaries around what they collect. The principle is straightforward: define your exclusions before your inclusions.
Practical no-go zones include:
- Financial information and bank details
- Geolocation data
- Health forums and patient discussions
- Adult content websites
- Any site that blocks crawlers through robots.txt or the emerging ai.txt protocol
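To make this concrete, here is a minimal sketch (not from the interview) of a pre-scrape gate that enforces exclusions before collection starts. The `NO_GO_DOMAINS` list and the `may_scrape` helper are hypothetical; a real deployment would maintain curated domain lists per exclusion category, and ai.txt is omitted because it has no standard library parser yet.

```python
from urllib import robotparser
from urllib.parse import urlparse

# Hypothetical exclusion list, defined before collection starts.
# Real deployments would maintain curated lists per no-go category
# (financial sites, health forums, adult content, etc.).
NO_GO_DOMAINS = {"examplebank.com", "example-health-forum.org"}

def may_scrape(url: str, user_agent: str = "my-ai-crawler") -> bool:
    """Return True only if the URL passes both the no-go list
    and the site's robots.txt rules."""
    host = urlparse(url).netloc.lower()
    # 1. Check the exclusion list first: exclusions before inclusions.
    if any(host == d or host.endswith("." + d) for d in NO_GO_DOMAINS):
        return False
    # 2. Respect robots.txt using the standard library parser.
    rp = robotparser.RobotFileParser()
    rp.set_url(f"https://{host}/robots.txt")
    try:
        rp.read()
    except OSError:
        return False  # fail closed if robots.txt is unreachable
    return rp.can_fetch(user_agent, url)
```

The ordering matters: the exclusion check runs before any network request, so no traffic ever reaches a no-go site.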
A clear purpose definition makes these boundaries enforceable, and it strengthens the balancing test that legitimate interest assessments require.
Transparency: publish your reasoning, not just your policy
Here is a practical idea that came up in the interview: publish your Legitimate Interest Assessment. Redact the trade secrets, but show the analytical work. This kind of openness builds trust with both regulators and users, and it signals that your compliance effort goes beyond a checkbox exercise.
The AI Act’s Article 53(1)(d) template from the AI Office offers a workable framework for this. Rather than listing every website scraped, it suggests highlighting major data sources, giving users a reasonable picture of whether their data might be included.
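As a purely hypothetical illustration of that approach (the field names below are invented, not the AI Office’s actual template), a category-level data source summary might look like this, grouping sources rather than enumerating every URL:

```python
# Hypothetical data-source summary; all field names and figures are
# illustrative, not taken from the AI Office template.
training_data_summary = {
    "model": "example-lm-v1",
    "source_categories": [
        {"category": "public web crawl", "share_pct": 55,
         "examples": ["pages permitting crawling via robots.txt"]},
        {"category": "news and media", "share_pct": 30,
         "examples": ["large news publishers (licensed)"]},
        {"category": "reference works", "share_pct": 15,
         "examples": ["open encyclopedias"]},
    ],
    "excluded": ["health forums", "adult content", "financial data"],
}

# Sanity check: category shares should account for the full dataset.
assert sum(c["share_pct"] for c in
           training_data_summary["source_categories"]) == 100
```

Even a coarse summary like this gives users a reasonable picture of whether their data might be included, without disclosing the full crawl list.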
Model unlearning: the technical wall that GDPR has not solved
Perhaps the most uncomfortable truth in this space: current technology cannot effectively “untrain” specific data points from a model. Research shows that even manual corrections can be circumvented when queries are rephrased. This creates a real compliance gap for deletion requests under Article 17, because retraining an entire model for each individual erasure request would be both environmentally and economically unfeasible.
This is not a reason to abandon GDPR compliance, but it does mean the field urgently needs better privacy-preserving training techniques and genuine model unlearning capabilities.
The regulatory balancing act: pragmatism over prohibition
Supervisory authorities face a trilemma. They can declare current web scraping practices broadly non-compliant, which risks disrupting innovation. They can push for legislative reform, which takes years. Or they can develop pragmatic interpretations of existing law. So far, the trend favours the third approach.
This pragmatism is not about giving AI companies a pass. It is about finding compliance pathways that protect fundamental rights while acknowledging the technical and economic realities of how large language models are built.
Special category data: when scraping collects more than you intended
When AI systems scrape the web indiscriminately, they inevitably collect sensitive data protected under Article 9 GDPR: health information, political opinions, religious beliefs. The scholarly debate centres on how to handle this. Some argue that all collected data should be treated as sensitive when it cannot be separated from ordinary personal data. Others worry that this approach dilutes the protections meant for truly sensitive information.
As Tainá put it memorably during our conversation: “If everything is special, then nothing is.” The tension remains unresolved, and it is one of the harder questions regulators will need to address.
What organisations should do now
Based on the research and regulatory guidance discussed, organisations that scrape the web to train AI models should take these steps:
- Build a comprehensive legitimate interest assessment with meaningful safeguards, not a templated formality.
- Define data minimisation boundaries upfront. Decide what you will not collect before deciding what you will.
- Invest in transparency. Consider publishing redacted versions of your assessments and data source summaries.
- Distinguish between legacy and future data. Different collection eras may warrant different legal approaches.
- Fund technical research into privacy-preserving training and model unlearning. Compliance cannot wait for the technology to catch up on its own.
The bottom line
Web scraping for AI training can be GDPR-compliant. Legitimate interest provides a viable path forward, and regulators appear willing to work within existing legal frameworks rather than shut the door entirely. But that path demands substance: real safeguards, genuine transparency, and a willingness to treat data protection as more than a compliance exercise.
The question is no longer whether AI and GDPR can coexist. It is whether organisations are willing to do the work required to make that happen.
Tainá Baylão’s thesis was published in late 2024.
Frequently asked questions
Is web scraping legal under GDPR?
Web scraping is not inherently illegal under GDPR, but it must have a valid lawful basis. Legitimate interest under Article 6(1)(f) is the most commonly discussed option for AI training. The EDPB’s 2024 guidelines confirm this can work, provided the organisation implements robust safeguards around transparency, data minimisation, and data subject rights.
Can you use legitimate interest as a lawful basis for AI training?
Yes, but with conditions. Legitimate interest requires a documented balancing test that weighs the organisation’s interest in training the model against the rights and freedoms of the data subjects whose data is scraped. Practical measures such as data minimisation strategies, opt-out mechanisms, and transparency about data sources are essential for the assessment to hold up.
What is the difference between legacy data and future data in AI training?
Legacy data refers to content published before AI training was widely understood by the public. Users who posted this content had no reason to expect it would train AI models. Future data is content shared after users have been informed about AI uses. The legal treatment may differ: legacy data may call for opt-in consent, while future data collected with proper notice can more readily rely on legitimate interest with opt-out options.
Can personal data be deleted from a trained AI model?
Not reliably, with current technology. Research indicates that “model unlearning” techniques cannot yet effectively remove specific data points from trained models. This creates a compliance gap for erasure requests under Article 17 GDPR, and it is an area where further technical research is needed.