
Can You Scrape the Web to Train AI Under GDPR?

Legitimate Interest Can Work (But It's Complicated)

Key Takeaways:

  • Legitimate interest can work for AI training, but requires robust safeguards and careful implementation – it’s not a free pass
  • The distinction between legacy data (pre-AI era) and future data matters for user expectations and legal compliance
  • Data minimization and transparency aren’t just nice-to-haves – they’re essential for balancing innovation with privacy rights

The question keeping privacy professionals awake at night is deceptively simple: Can you legally vacuum the internet to train AI models? I recently interviewed Tainá Baylão, Global Privacy Counsel at Roche, about her Master’s thesis on this exact topic. Her research at Maastricht University’s European Center on Privacy and Cybersecurity examines legitimate interest as a lawful basis for AI training.

Watch the full interview.

The Conditional Yes: Legitimate Interest Can Work (But It’s Complicated)

The conversation around legitimate interest for AI training shifted significantly after the Garante’s ChatGPT decision in 2023. When Italy’s DPA temporarily blocked ChatGPT, it rejected OpenAI’s reliance on performance of contract as a basis for processing training data, pointing instead to consent and legitimate interest as potential alternatives. The EDPB’s December 2024 opinion on AI models (Opinion 28/2024) has further developed this position.

The critical point that emerged from our discussion: legitimate interest isn’t a magic phrase that makes compliance problems disappear. The assessment demands practical measures that make a real difference – robust approaches to transparency, data minimization, data subject requests, and accountability. Without these safeguards, the legal basis simply doesn’t hold up.

The Meta Case: Different Approaches to Different Data

Meta’s approach to AI training provides an instructive case study. While Meta published newspaper advertisements and triggered widespread public discussion about its opt-out mechanism, the treatment of legacy data remains a complex issue.

One approach discussed in the interview distinguishes between two categories:

Legacy data (pre-AI era): Content posted before AI training was widely understood might warrant different treatment, possibly through opt-in mechanisms, given users’ reasonable expectations at the time.

Future data (post-notification): Once users are informed about AI uses, legitimate interest with opt-out options becomes more viable as users can make conscious platform participation choices.

This dual approach acknowledges that user context and expectations evolve over time.
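To make the dual approach concrete, here is a minimal Python sketch of how a training pipeline might gate records by era; the notification date, record fields, and flag names are hypothetical illustrations, not any platform’s actual mechanism:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical date on which users were notified about AI training uses.
AI_NOTIFICATION_DATE = datetime(2024, 6, 1, tzinfo=timezone.utc)

@dataclass
class Record:
    user_id: str
    posted_at: datetime
    opted_in: bool = False   # explicit consent, relevant for legacy data
    opted_out: bool = False  # objection, relevant for future data

def eligible_for_training(record: Record) -> bool:
    """Dual approach: opt-in for legacy data, opt-out for future data."""
    if record.posted_at < AI_NOTIFICATION_DATE:
        # Legacy data (pre-notification): include only with explicit opt-in.
        return record.opted_in
    # Future data (post-notification): include unless the user objected.
    return not record.opted_out
```

The design choice worth noting is that the default flips at the notification date: before it, exclusion is the default; after it, inclusion is, precisely because informed users can make conscious participation choices.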

Data Minimization: Drawing Practical Boundaries

The discussion highlighted practical data minimization strategies referenced in various DPA guidelines, particularly the CNIL’s guidance. Even developers of general-purpose AI models can draw clear boundaries around what they collect:

  • Financial information and bank details
  • Geolocation data
  • Health forums and patient discussions
  • Adult content websites
  • Sites that opt out via robots.txt or emerging ai.txt conventions (see the sketch below)

Having a clear purpose definition helps establish these no-go zones, and documenting them is essential for the balancing test in legitimate interest assessments.
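As a rough illustration of how such no-go zones could be enforced in a scraping pipeline, here is a minimal Python sketch using the standard library’s robots.txt parser; the blocked domains, crawler name, and conservative failure handling are assumptions for illustration, not guidance endorsed by any DPA:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical category-level exclusions (health forums, adult content, etc.).
BLOCKED_DOMAINS = {"examplehealthforum.org", "example-adult-site.com"}
USER_AGENT = "ExampleAITrainingBot"  # hypothetical crawler identifier

def allowed_to_scrape(url: str) -> bool:
    """Check a URL against the domain blocklist, then against robots.txt."""
    host = urlparse(url).netloc
    if host in BLOCKED_DOMAINS:
        return False  # excluded category: never fetch
    robots = RobotFileParser()
    robots.set_url(f"https://{host}/robots.txt")
    try:
        robots.read()  # fetch and parse the site's robots.txt
    except OSError:
        return False  # conservative default: skip if robots.txt is unreachable
    return robots.can_fetch(USER_AGENT, url)
```

An ai.txt check would follow the same pattern once that convention stabilizes; the point is that exclusions defined in the legitimate interest assessment become executable rules rather than policy statements.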

The Transparency Challenge: Building Trust Through Openness

An interesting concept discussed was the potential for organizations to publish redacted versions of their Legitimate Interest Assessments. By removing trade secrets but showing the analytical work, companies could build trust with both regulators and users.

The AI Act’s Article 53(1)(d) template from the AI Office offers a practical framework. Rather than listing every website scraped, it suggests highlighting major data sources – giving users reasonable expectations about whether their data might be included.

The Technical Challenge: Model Unlearning

One of the most challenging topics discussed was the current technical limits on data subject rights, particularly erasure. Research indicates that current technology can’t effectively “untrain” specific data points from a model. Even manual corrections can be circumvented simply by rephrasing a query.

This creates compliance challenges for deletion requests: retraining an entire model for each individual request would be environmentally and economically infeasible. The field needs continued research on privacy-preserving training techniques and model unlearning.
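A back-of-envelope calculation, with purely hypothetical figures, shows why per-request retraining is off the table:

```python
# Purely hypothetical figures, for illustration only.
training_run_cost_usd = 50_000_000   # cost of one full training run
erasure_requests_per_year = 10_000   # deletion requests received annually

# If every erasure request triggered a full retrain:
annual_cost = training_run_cost_usd * erasure_requests_per_year
print(f"${annual_cost:,} per year")  # $500,000,000,000 per year
```

Even under far more modest assumptions, the cost (and the associated compute and energy footprint) dwarfs any proportionate response to a single request, which is why unlearning research matters.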

The Regulatory Balancing Act

Supervisory authorities face a trilemma: declare current practices broadly non-compliant (potentially disrupting innovation), push for legislative reform, or develop pragmatic interpretations of existing law. The trend appears to favor the third approach – finding workable solutions within the current framework.

This isn’t about exempting AI from regulation, but rather about developing practical compliance pathways that protect fundamental rights while acknowledging technical and economic realities.

The Special Category Data Question

The interview touched on an ongoing scholarly debate about special category data under Article 9 GDPR. When AI systems scrape data indiscriminately, mixing ordinary personal data with sensitive information, how should Article 9’s heightened protections apply?

Different scholarly positions exist – some argue for treating all collected data as sensitive when it cannot be separated, while others worry this approach might dilute protections meant for truly sensitive information. The memorable phrase from our discussion:

“If everything’s special, then nothing is.”

Practical Steps Forward

Based on the research discussed, organizations training AI models should consider:

  1. Developing comprehensive legitimate interest assessments with meaningful safeguards
  2. Implementing clear data minimization strategies – defining exclusions before inclusions
  3. Enhancing transparency measures – considering publication of assessment methodologies
  4. Recognizing temporal context – different data may warrant different treatment based on collection era
  5. Investing in technical solutions for privacy-preserving training and model governance

Looking Ahead

The conversation around AI and GDPR compliance continues to evolve. What’s clear from current research and regulatory guidance is that legitimate interest can provide a path forward, but only with serious commitment to implementing meaningful safeguards and respecting data protection principles.

The challenge isn’t whether AI can be GDPR compliant, but whether organizations are willing to do the substantive work required to achieve that compliance.

Tainá Baylão’s thesis will be published in late September/early October 2025.

