For those of you who are new to this, I’ll start with the basics: web scraping is the process of using bots to systematically lift content from a website. Put simply, it takes the unstructured information on web pages and turns it into structured data that can be used in a later stage of analysis.
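To make that unstructured-to-structured step concrete, here is a minimal sketch in Python. The URL, the markup it expects and the requests/BeautifulSoup approach are purely illustrative assumptions, not a recipe for any particular site.

```python
# A minimal scraping sketch: fetch a page and pull out structured records.
# The URL and the selectors below are hypothetical examples.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/news"              # hypothetical page
html = requests.get(url, timeout=10).text

soup = BeautifulSoup(html, "html.parser")

# Turn loose HTML into a structured list of records.
articles = []
for item in soup.select("article"):           # assumed markup
    title = item.find("h2")
    link = item.find("a")
    articles.append({
        "title": title.get_text(strip=True) if title else None,
        "url": link["href"] if link and link.has_attr("href") else None,
    })

print(articles)
```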
Some call it theft, others call it legitimately gathering business intelligence, and everyone is doing it. Small companies love it because it costs very little and is a powerful way to get data without entering into unnecessary partnerships. Large companies are more than happy to use web scraping for competitive intelligence while trying hard to stop others from doing the same to them. You got it: it’s a messy and confusing online world, and it can be hard at times to determine where legally gathering content ends and dodgy practice begins, or when sharing RSS content becomes plagiarism.
So far, this has all been about scraping specific content for a target audience. How about taking it a step further and scraping entire RSS feeds to get even more content? What advantages can that have? Before we go any further, it’s worth talking a little about RSS feeds and what they actually are.
First, let’s break down the abbreviation. RSS stands for Really Simple Syndication. Essentially, it is a web feed format that structures parts of a website’s content so it is easy to share elsewhere (on social media, in search engines, etc.). RSS feeds are also commonly pulled into emails and other mass communication systems to make distributing the content easier.
So, naturally, many web content creators use RSS feeds to syndicate their content. Doing so gives the data a much broader potential audience than it would have had if it were published only on their own website.
Unlike an HTML document, which doesn’t indicate exactly where the content sits on the page, RSS states clearly where the headline, body and other important elements of the content are, making it incredibly easy for web harvesters to grab the content and display it anywhere they want (minus the surrounding formatting and HTML code, of course).
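To see why that makes a harvester’s job so easy, here is a small sketch that parses a tiny, made-up RSS snippet with Python’s standard library. The feed content is invented purely to show how explicitly the format labels each field.

```python
# Parse a (made-up) RSS snippet: every field is explicitly labelled,
# so there is no guessing about page layout.
import xml.etree.ElementTree as ET

sample_feed = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first-post</link>
      <description>The body of the first post.</description>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(sample_feed)
for item in root.iter("item"):
    print("headline:", item.findtext("title"))
    print("url:     ", item.findtext("link"))
    print("body:    ", item.findtext("description"))
```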
Interestingly, this doesn’t discourage content authors. Most of those who put their content into RSS feeds know this, appreciate it, and even enjoy seeing their work in RSS feed readers. Why? Simply because they know that RSS makes their audience considerably bigger.
But if no one objects and authors themselves like seeing their content in RSS feed readers, it’s only natural to wonder what the problem is that so many people talk about.
One of the biggest problems is that some website owners use RSS feeds to syndicate other people’s content onto their own sites: the feeds give them constantly changing content without having to write and post their own articles, and this in turn raises legal and copyright issues (covered further below).
Why Do People Scrape RSS Feeds?
RSS feeds let people from all around the world access content of every type, and that is also the main reason people grab that content and reuse it on their own websites. For instance, spammers seek high rankings in search engines so they can display their ads more effectively. To do this they need content, but creating content by hand is time-consuming and difficult, especially when much of it will make no difference in the search engines anyway.
RSS feeds quickly come to the rescue here, as they give spammers and website owners a fairly easy way to fill their sites with content acquired entirely from other websites.
As with any web scraping topic, RSS feed scraping comes with the hot topic of legality attached to it.
Is RSS Scraping Illegal?
Let’s start by saying that there is no definite answer to this question. The best we can do is look at the topic from both sides and try to find a happy medium.
Scrapers like to argue that if content writers and website owners distribute their data via an RSS feed, that automatically creates a kind of “license” for others to use it.
The problem with this argument is that even though RSS allows syndication, it can’t be assumed to be an open license to do so, just as an unlocked door doesn’t give you the right to enter the room.
Another favorite argument among scrapers is bringing up fair use to defend RSS feed scraping. I find it hard to imagine any copyright lawyer attempting a fair use defense to protect a scraper, because, as contradictory as it may sound, fair use is not a license to do whatever one wants with another’s work, even if it isn’t for profit.
That being said, RSS scraping still sits in a legal grey area. The courts have not yet given a clear answer on whether republishing an RSS feed actually constitutes copyright infringement.
In conclusion, it has become incredibly convenient and easy to scrape RSS feeds these days, but a number of legal caveats still come with it: as with scraping in general, it carries a share of responsibility and issues that cannot be neglected. Whether or not you decide to interpret “syndication” as a free license to use someone else’s work, doing it sustainably and giving the author credit is important. Performed intelligently, the practice can be a valuable tool for your business.
Is there anything else you need to know about RSS feed scraping that we didn’t cover? Please comment below and let us know!