Web scraping is a technique for the automated extraction of web data. It is used, among other things, for market analysis (e.g. tracking product prices) and to prepare data for training AI models. Web scraping involves the reproduction and retrieval of data, which may include personal data or copyrighted works. On top of that, copying databases may infringe the rights of the database producer. The key legal issues to address are therefore protection of personal data (GDPR), copyright and database rights.
Copyright and database protection in the context of web scraping
In principle, scraping infringes rights to the scraped elements of websites (e.g. graphics, texts, page layouts). In addition, database rights are infringed by extracting from a database, even a publicly available one, a substantial part – in terms of quality or quantity – of its content.
However, there are exceptions to this:
- scraping publicly available data and further analysing its results for the purpose of scientific research by research and cultural heritage organisations;
- commercial scraping for any purpose if the content owner does not object to it – but the scraped data can only be kept for as long as necessary for text and data mining purposes.
Both these situations can occur in the case of lawful access to data.
Do these exceptions give us the right to train machine learning models on scraped data protected by copyright or other laws? There is no consensus, and opinions vary.
The described exceptions regarding text and data mining (TDM) are provided for in Articles 3 and 4 of the Digital Single Market (DSM) Directive. According to the definition in the DSM, text and data mining means an automated analytical technique for analysing digital texts and data in order to generate information including, inter alia, patterns, trends and correlations.
Data protection vs scraping
It is important to keep in mind that personal data is any information that allows for identifying a person, even if it is publicly available (e.g. forename, surname, email address). The mere collection of data through web scraping is already considered data processing.
Web scraping must therefore comply with the GDPR. If the scraper qualifies as a controller of personal data, a number of obligations arise for that entity.
On what legal basis is personal data collected?
Any collection of data must rely on one of the legal bases set out in the GDPR (e.g. consent of the data subject, or processing necessary to perform a contract). In practice, the only available basis for web scraping is legitimate interest. But doubts arise here too:
- a balancing test has to be carried out to assess whether the interests of the data subjects override our legitimate interest;
- this basis is not suitable for processing (collecting) sensitive data;
- it is often not feasible to comply with the information obligation, i.e. to inform each person whose data have been collected (though the exception provided for in Article 14(5)(b) GDPR may apply here, eliminating this obligation).
Given the above, what prerequisites must be fulfilled to legally process personal data when scraping?
1 Establishment of the existence of a legitimate interest
The interest must be specific, real and legitimate. For example, the French data protection authority (CNIL) allows commercial interests (e.g. development of AI-based services), provided that they do not infringe the rights of data subjects. The Dutch data protection authority limits this to legally protected interests only (e.g. prevention of fraud).
2 Necessity of the processing
Data must be adequate, relevant and kept to a minimum. The EDPB recommends, among other things, to:
- define precise criteria for data collection (e.g. excluding geolocation or sensitive data),
- use technical filters to remove unnecessary information after identification.
3 Balancing of interests (balancing test)
A proportionality analysis should be carried out, taking into account the expectations of the data subjects. The CNIL stresses that the processing of data from public forums may be permissible if it is limited to pseudonyms and the content of comments.
Additionally, the data protection authorities recommend:
- excluding the collection of data from pre-defined sites containing sensitive information, such as pornographic sites, health forums, social networks used mainly by minors, genealogy sites or sites with extensive personal data;
- avoiding obtaining data from sites that explicitly prohibit scraping via robots.txt or ai.txt files;
- compiling, even before data collection begins, a blacklist of websites whose owners object to the collection of their data;
- ensuring that people have the right to object to data collection (as this may not be feasible, you can provide a mechanism on your site to check whether you hold a person's data and remove it from the collection, provided you have the tools to do so);
- limiting data collection to publicly available information and explicitly public user data, thus preventing the loss of control over private information, for example by excluding private posts on social networks;
- anonymising or pseudonymising data immediately after collection to increase data security;
- informing on the website about scraped sites and data collection practices through web scraping alerts;
- preventing the linking of personal data with other identifiers, unless necessary for the development of artificial intelligence systems;
- registering contact details with the data protection authority (e.g. the French CNIL) so that individuals can be informed and exercise their rights under the GDPR towards the data controller.
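The pseudonymisation step recommended above can be done, for example, with a keyed hash applied immediately after collection. The sketch below is a minimal illustration, not a prescribed method; the function name, record layout and key handling are assumptions.

```python
import hashlib
import hmac

def pseudonymise(value: str, secret_key: bytes) -> str:
    # Keyed hash (HMAC-SHA256): the mapping cannot be reversed without
    # the key, which must be stored separately from the data set.
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Pseudonymise a direct identifier immediately after collection
record = {"email": "jan.kowalski@example.com", "comment": "Great product!"}
record["email"] = pseudonymise(record["email"], secret_key=b"keep-this-key-elsewhere")
```

With the same key, the same input always yields the same token, so records can still be linked for analysis without exposing the underlying identifier.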
Practical recommendations
To summarise the theoretical aspects of scraping:
- make sure you fall within one of the TDM exceptions to avoid infringing copyrights,
- to legally collect personal data, identify your legitimate interest in data processing, limit the processing to a minimum, balance your interest with the interests of the persons whose data you are processing, and preferably take a number of further steps to safeguard the above.
- Use APIs
Using data via an API is much easier and safer, therefore this method should be used whenever possible.
- Respect opt-outs and terms and conditions:
- Check the robots.txt file and page metadata before scraping,
- Avoid sites with an explicit TDM prohibition (e.g. tdm-reservation: opt-out),
- Put in place mechanisms that ensure the above.
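A robots.txt check can be automated with Python's standard library. The sketch below is a minimal illustration (the helper name is an assumption; real code would also fetch and cache the file per host, and robotparser does not cover TDM opt-out metadata, which needs a separate check):

```python
from urllib import robotparser

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    # Parse an already-fetched robots.txt body and evaluate one URL
    # for the given user agent.
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

rules = "User-agent: *\nDisallow: /private/\n"
allowed(rules, "MyScraperBot", "https://example.com/private/page")  # False
allowed(rules, "MyScraperBot", "https://example.com/public/page")   # True
```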
- Limit scraping of personal data
Use technical methods to exclude personal data during web scraping, such as:
- filtering data patterns and keywords – identify and exclude information that may contain personal data,
- using anonymisation and pseudonymisation – transforming personal data so that it cannot be linked to a specific person,
- respecting opt-out tags and files – if a site provides such tools, use them,
- limiting the scope of web scraping – collect only the data necessary to achieve the purpose, leaving out e.g. PESEL numbers or email addresses.
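Pattern-based filtering of the kind listed above can be sketched with regular expressions. The patterns below are deliberately simple illustrations (real deployments need broader coverage and validation, e.g. a PESEL checksum):

```python
import re

# Assumed, simplified patterns for two common identifiers
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PESEL_RE = re.compile(r"\b\d{11}\b")  # Polish national ID: 11 digits

def strip_personal_data(text: str) -> str:
    # Replace matched identifiers with neutral placeholders
    text = EMAIL_RE.sub("[email removed]", text)
    text = PESEL_RE.sub("[id removed]", text)
    return text

strip_personal_data("Contact jan@example.com, PESEL 85010112345")
# → 'Contact [email removed], PESEL [id removed]'
```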
- Document the scraping process
Record data sources, date of scraping and safeguards used. Such documentation may come in handy in case of a dispute or an inspection by the President of the Personal Data Protection Office.
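Such documentation can be kept, for instance, as an append-only log with one JSON record per scraped source. This is only one possible format, and the function name, field names and file path are assumptions:

```python
import json
from datetime import datetime, timezone

def log_scrape(source_url: str, safeguards: list, path: str = "scrape_log.jsonl") -> dict:
    # One JSON line per source: what was scraped, when (UTC), and
    # which safeguards were applied.
    entry = {
        "source": source_url,
        "scraped_at": datetime.now(timezone.utc).isoformat(),
        "safeguards": safeguards,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

log_scrape("https://example.com/products",
           ["robots.txt checked", "personal data filtered", "pseudonymised on ingest"])
```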
- Draw up appropriate legal documentation
A register of processing activities, a document confirming the balancing test, information about the processing of scraped data and, if you provide a model or an application based on it, a licence or terms and conditions for the application will be useful.
Summary
Web scraping is a powerful tool, but there are a number of legal challenges to its use. Understanding the legal framework and applying appropriate technical safeguards and practices can help you use web scraping effectively without breaking the law.
Contact
Michał Pietrzyk – radca prawny (Attorney-at-law) | Senior Associate in the Transactional Team, the German Desk Team, the IP/IT Team and the Competition and Consumer Protection Team.