We have all copied and pasted pieces of information or images at some point, and it never occurred to us that we were doing anything wrong or breaching copyright law. But copying on a larger scale, which we have already identified as web scraping, takes things to the next level and makes it possible to acquire virtually any information out there. As long as content is online, it is reachable and scrapeable, and depending on who ends up with the information, that can become a problem both for the content owner and for the core purpose of their work.
It is important to mention that scraping in general, if done responsibly, is not evil at all: before we go ahead and point fingers, let’s note that Google is the largest scraper out there, and hardly anyone complains about it indexing their content. As with many things in life, the practice becomes a problem when it is done in excessive amounts and by the wrong people.
The real problem is that web scrapers take, for free, what many companies spent enormous amounts of man-hours and money to accumulate. This gives rise to problems such as eroded customer confidence in a brand, loss of the uniqueness of online content, and the spread of sensitive information. In fact, almost any behavior a browser performs can be replicated by a knowledgeable web scraper with enough intent: that is why many content creators and site owners get understandably anxious at the thought of a web harvester copying all the data from their website.
If the question here is “how do I completely stop web scraping in order to protect my data?” – I’m afraid the answer is “you cannot”. However, there are still a number of things you can do to make it difficult for crawlers to harvest your data.
In this blog post we will expand on the issue of data protection by giving useful tips on how to minimize data theft along the way.
Note: Protecting data through isolation is no longer an option the way it was in the past, and it is impossible to eliminate the risk of scraping simply by adding more security tools: the tools used by professional scrapers have also become more sophisticated, and their methods more anonymous and stealthy.
However, here are some things that you can do:
1. Protect the content by requiring a Login
Because HTTP is stateless, no information is preserved from one request to the next, so a scraper has no need to identify itself while accessing a page or website, which makes web scrapers very hard to trace. That’s why it makes sense to force the scraper to send identifying information along with each request in order to gain access to the content, i.e. to protect the page with a login. This way you can track who accesses your content and block abusive accounts later. Note that this action will not prevent scraping by itself: it will simply give you information on who gets access to your content in bulk and maybe help you find out the reason behind it.
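To make the idea concrete, here is a minimal sketch of a login gate using Flask. The route names, the in-memory user store, and the secret key are placeholders, not a production setup; the point is simply that every content request becomes tied to an account you can monitor.

```python
# Minimal sketch: gate content behind a session login (Flask).
# Routes, credentials, and the secret key are illustrative placeholders.
from functools import wraps
from flask import Flask, session, request, abort, jsonify

app = Flask(__name__)
app.secret_key = "change-me"  # required for signed session cookies

USERS = {"alice": "s3cret"}  # placeholder credential store

def login_required(view):
    @wraps(view)
    def wrapped(*args, **kwargs):
        if "user" not in session:
            abort(401)  # unauthenticated clients never see the content
        return view(*args, **kwargs)
    return wrapped

@app.route("/login", methods=["POST"])
def login():
    data = request.get_json(force=True)
    if USERS.get(data.get("username")) == data.get("password"):
        session["user"] = data["username"]
        return jsonify(status="ok")
    abort(403)

@app.route("/articles")
@login_required
def articles():
    # Every request is now tied to an account, so unusually heavy access
    # patterns can be spotted and rate-limited or blocked per user.
    app.logger.info("articles fetched by %s", session["user"])
    return jsonify(items=["..."])
```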
2. CAPTCHAs and “Honey Pot” pages
We are all familiar with the idea of a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), which poses problems that are generally easy for humans to solve but difficult for computers. It serves as a tool to prevent automated access to your content. The only problem with CAPTCHAs is that they are quite annoying: many real visitors get frustrated when they pop up too frequently, so they should be used sparingly.
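If you go the CAPTCHA route, the check has to happen on the server, not just in the browser. Below is a rough sketch of server-side verification against Google reCAPTCHA’s documented siteverify endpoint; the secret key is a placeholder, and the token is whatever your form widget submitted.

```python
# Sketch: verify a reCAPTCHA response token server-side before serving the protected action.
# RECAPTCHA_SECRET is a placeholder; the token comes from the widget rendered on your form.
import requests

RECAPTCHA_SECRET = "your-secret-key"  # placeholder
VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"

def captcha_passed(token: str, client_ip: str | None = None) -> bool:
    """Return True only if the verification service confirms the token is valid."""
    payload = {"secret": RECAPTCHA_SECRET, "response": token}
    if client_ip:
        payload["remoteip"] = client_ip
    result = requests.post(VERIFY_URL, data=payload, timeout=5).json()
    return bool(result.get("success"))
```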
There is another, more sophisticated way to detect automated use of your website: you can create so-called honey pots, pages that human visitors will hardly ever click on but that a computer following every link will almost certainly open. The concept targets web harvesters specifically: bots that don’t know all of the URLs they are going to visit ahead of time and must simply follow every link on a site to traverse its content. Once you detect a client visiting a honey pot page, you can be fairly sure you are dealing with a bot rather than a human visitor and block all of its further requests, as sketched below.
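Here is one way such a trap could be wired up, again sketched in Flask. The hidden link, the route names, and the in-memory blocklist are illustrative assumptions, not a recommended production design.

```python
# Sketch of a honey pot trap: the link below is hidden from humans via CSS,
# so only clients that blindly follow every URL will ever request /trap.
# Route names, the blocklist, and the inline markup are illustrative.
from flask import Flask, request, abort

app = Flask(__name__)
BLOCKED_IPS = set()  # in production this would live in a shared store

@app.before_request
def reject_known_bots():
    if request.remote_addr in BLOCKED_IPS:
        abort(403)

@app.route("/")
def index():
    # The trap link is invisible to people but present in the HTML for crawlers.
    return '''<a href="/trap" style="display:none" rel="nofollow">do not follow</a>
              <p>Normal page content...</p>'''

@app.route("/trap")
def trap():
    # Any client that requests this page is treated as an automated crawler.
    BLOCKED_IPS.add(request.remote_addr)
    abort(403)
```

In practice you would also disallow the trap URL in robots.txt, so that well-behaved crawlers such as Googlebot never fall into it and only misbehaving bots get blocked.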
3. Change your website’s HTML regularly
Think about it: scrapers generally access your data by finding patterns in your website’s HTML markup and using those patterns to locate the right data.
If your website’s markup is inconsistent (you change it every once in a while), they will not be able to keep scraping for long. This is not to say that you need to redesign your website from scratch: simply changing the class and id attributes in your HTML (and in the corresponding CSS files) should be enough to break most scrapers.
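To illustrate why this works, here is a hypothetical scraper that depends on a single stable class name (the HTML snippets and class names are made up). Renaming the class in your templates is enough to make its selector come back empty.

```python
# Hypothetical scraper relying on a stable class name; the markup is invented for illustration.
from bs4 import BeautifulSoup

html = '<div class="product-price">19.99</div>'
soup = BeautifulSoup(html, "html.parser")

price = soup.find("div", class_="product-price")
print(price.text if price else "selector broke")  # prints: 19.99

# After the site renames the class (e.g. to "pp-7f3a"), the same selector finds nothing,
# and the scraper silently stops collecting data until its author notices and adapts.
renamed = BeautifulSoup('<div class="pp-7f3a">19.99</div>', "html.parser")
print(renamed.find("div", class_="product-price"))  # prints: None
```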
4. Include information in media objects
Web scrapers usually extract information from a page as text pulled out of the HTML, right? So it is only logical that if you embed content on your website inside an image or a PDF file (or any other non-text format), it will be much harder for a harvester to get at it.
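As a rough illustration, here is how a sensitive snippet (say, a contact address) could be rendered into a PNG with Pillow instead of being shipped as text; the string and filename are placeholders.

```python
# Rough sketch: render a piece of text as an image so it never appears in the HTML as text.
# The rendered string and output filename are placeholders.
from PIL import Image, ImageDraw

text = "contact@example.com"
img = Image.new("RGB", (260, 30), color="white")
draw = ImageDraw.Draw(img)
draw.text((5, 8), text, fill="black")  # uses Pillow's default bitmap font
img.save("contact.png")
# The page then embeds <img src="contact.png"> instead of the raw address.
```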
There are, however, drawbacks to this as well. Firstly, it makes your website slower to load and less accessible to disabled users (such as users with visual impairments). Secondly, updating the website becomes a lot more hassle when you are dealing with non-text formats.
And there we are: just a few simple tricks to protect your data from being harvested, regardless of how sophisticated your website is. While the steps discussed here can prove very useful, they can also harm the experience of the average visitor, so we advise you to take extra care when choosing which prevention methods to deploy.
Do you have any other tips on how to successfully secure your data from scraping? We will be happy to read them in the comments section below!