Web scraping is the automated extraction of data from websites. It is used, among other things, for market analysis (e.g. tracking product prices) and for preparing data to train AI models. Web scraping involves reproducing and retrieving data, which may include personal data or copyrighted works. In addition, copying a database may infringe the rights of the database maker. The key legal issues are therefore the protection of personal data (GDPR), copyright and database rights.

Copyright and database protection in the context of web scraping

However, there are exceptions to this:

Do these exceptions allow machine learning models to be trained on scraped data that is protected by copyright or other rights? There is no consensus, and opinions vary.

The exceptions for text and data mining (TDM) are set out in Articles 3 and 4 of the Directive on Copyright in the Digital Single Market (DSM Directive). The DSM Directive defines text and data mining as any automated analytical technique aimed at analysing text and data in digital form in order to generate information, including patterns, trends and correlations.

Data protection vs scraping

It is important to keep in mind that personal data is any information that makes it possible to identify a person, even if it is publicly available (e.g. forename, surname, email address). The mere collection of data through web scraping already constitutes data processing.

On what legal basis is personal data collected?

Any collection of personal data must rely on one of the legal bases set out in the GDPR (e.g. consent of the data subject, or necessity for the performance of a contract). In practice, the only available basis for web scraping is legitimate interest. But doubts arise here as well:

Given the above, what prerequisites must be fulfilled to legally process personal data when scraping?

1 Establishment of the existence of a legitimate interest

The interest must be specific, real and legitimate. For example, the French data protection authority (CNIL) accepts commercial interests (e.g. development of AI-based services), provided that they do not infringe the rights of data subjects. The Dutch data protection authority limits this to legally protected interests only (e.g. prevention of fraud).

2 Necessity of the processing

Data must be adequate, relevant and limited to what is necessary. The EDPB recommends, among other things, to:

3 Balancing of interests (balancing test)

A proportionality analysis should be carried out, taking into account the expectations of the data subjects. The CNIL stresses that the processing of data from public forums may be permissible if it is limited to pseudonyms and the content of comments.

Additionally, the data protection authorities recommend:

excluding the collection of data from pre-defined sites containing sensitive information, such as pornographic sites, health forums, social networks used mainly by minors, genealogy sites or sites with extensive personal data;

avoiding obtaining data from sites that explicitly prohibit automated crawling via robots.txt or ai.txt files,

putting together a blacklist for those who object to the collection of data from certain websites, even before data collection begins,

ensuring that people can exercise their right to object to data collection (where objecting in advance is not feasible, you can provide a mechanism on your site that lets a person check whether you hold their data and have it removed from the collection, provided you have the tools to do so),

limiting data collection to only publicly available information and explicitly public user data, thus preventing the loss of control over private information, for example, excluding private posts on social networks,

using data anonymisation or pseudonymisation immediately after data collection to increase data security,

publishing on your website information about the scraped sites and your data collection practices, for example through web scraping alerts,

preventing the linking of personal data with other identifiers, unless necessary for the development of artificial intelligence systems,

registering contact details with the data protection authority (e.g. French CNIL) to inform individuals and enable them to exercise their rights under the GDPR towards the data controller.
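Some of the recommendations above, namely excluding pre-defined sensitive sites and respecting robots.txt, can be checked in code before any page is fetched. The following is a minimal Python sketch under stated assumptions: the domain names, the crawler name and the function names are hypothetical, and the excluded-domain list would in practice come from your own legal review. It uses only the standard library's `urllib.robotparser` and does not handle ai.txt, for which no standard-library parser exists.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical examples only; a real list should result from your own review.
EXCLUDED_DOMAINS = frozenset({"health-forum.example", "genealogy.example"})
USER_AGENT = "example-research-bot"  # hypothetical crawler name


def is_excluded_domain(url: str, excluded: frozenset = EXCLUDED_DOMAINS) -> bool:
    """Return True if the URL's host is an excluded domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in excluded)


def robots_allows(url: str, robots_txt: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt content permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def may_collect(url: str, robots_txt: str) -> bool:
    """Combine both checks: the site is not excluded and robots.txt allows the fetch."""
    return not is_excluded_domain(url) and robots_allows(url, robots_txt)
```

In this sketch the robots.txt content is passed in as text, so the check itself makes no network request; a real crawler would first download the file from the target site. Passing such a check does not by itself establish that collection is lawful.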

Practical recommendations

  1. Use APIs

Accessing data through an API is much easier and safer, so this method should be used whenever possible.

  1. Respect opt-outs and terms and conditions
  1. Limit scraping of personal data

Use technical methods to exclude personal data during web scraping, such as:
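One way such technical filtering might look is sketched below in Python. This is an illustration under stated assumptions rather than a method prescribed by any authority: the regular expressions are rough heuristics that will miss some email addresses and phone numbers and may flag other number sequences, and the function names and the `user_` prefix are hypothetical. The second function shows keyed pseudonymisation of user identifiers with the standard library's `hmac` module.

```python
import hashlib
import hmac
import re

# Rough heuristics for illustration; they are not exhaustive detectors.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace things that look like email addresses or phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Map an identifier to a stable pseudonym using a keyed hash (HMAC-SHA256).

    The same identifier and key always give the same pseudonym, so records can
    still be grouped, but the key must be stored separately and kept secret.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return "user_" + digest[:16]


if __name__ == "__main__":
    print(redact("Contact jan.kowalski@example.com or +48 123 456 789."))
```

Pseudonymised data remains personal data under the GDPR as long as re-identification is possible, so a step like this reduces risk but does not take the data outside the scope of the regulation.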

  1. Document the scraping process

Record the data sources, the date of scraping and the safeguards used. Such documentation may come in handy in case of a dispute or an inspection by the President of the Personal Data Protection Office (UODO).
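Such a record could be kept as one structured log entry per scraped source. The Python sketch below is only an illustration: the `ScrapeRecord` class, its field names and the example values are assumptions made for this example, not a format required by any authority.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ScrapeRecord:
    """One documentation entry: what was scraped, when, and under which safeguards."""
    source_url: str
    scraped_at: str  # ISO 8601 timestamp in UTC
    safeguards: list = field(default_factory=list)


def make_record(source_url: str, safeguards: list, now: Optional[datetime] = None) -> ScrapeRecord:
    """Create a record; 'now' can be injected to make the timestamp reproducible."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return ScrapeRecord(source_url, timestamp, list(safeguards))


def to_json_line(record: ScrapeRecord) -> str:
    """Serialise a record as a single JSON line, suitable for an append-only log file."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    record = make_record(
        "https://shop.example/products",  # hypothetical source
        ["robots.txt checked", "emails redacted"],
    )
    print(to_json_line(record))
```

Writing one JSON line per source keeps the log easy to append to and to search later, for example when answering a data subject's request.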

  1. Draw up appropriate legal documentation

Useful documents include a register of processing activities, a record of the balancing test, a notice on the processing of scraped data and, if you make available a model or an application based on it, a licence or terms and conditions for that application.

Summary

Web scraping is a powerful tool, but there are a number of legal challenges to its use. Understanding the legal framework and applying appropriate technical safeguards and practices can help you use web scraping effectively without breaking the law.


Contact

Michał Pietrzyk – radca prawny (Attorney-at-law) | Senior Associate in the Transactional Team, the German Desk Team, the IP/IT Team and the Competition and Consumer Protection Team.