Scraping websites can give you valuable data, but it is often not easy. You will most likely run into several challenges. The first is creating requests: you need to learn how to code and use a library to make the HTTP requests that browsers make behind the scenes. The second is setting the correct request headers: if you don’t set headers such as the language and encoding, a server may return a 403 error instead of the HTML that you want. The third is throttling: a website may allow only a certain number of requests in a given amount of time so that you don’t bog down its server. The fourth is getting your IP banned: a website may try to prevent you from crawling by banning your IP address so that you can’t make requests at all. We are going to show you how DataHen handles these difficult parts of scraping and makes it easy to get the data you want.
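To make the header and throttling challenges concrete, here is a minimal sketch of handling them by hand with Ruby’s standard library, which is the kind of work DataHen is meant to take off your plate. The header values, the helper names build_request and each_throttled, and the one-second delay are all assumptions chosen for illustration, not values any particular site requires.

```ruby
require 'net/http'
require 'uri'

# Hypothetical header values; a real site may expect different ones.
DEFAULT_HEADERS = {
  'Accept-Language' => 'en-US,en;q=0.9',
  'Accept-Encoding' => 'identity',
  'User-Agent'      => 'Mozilla/5.0 (compatible; example-scraper)'
}.freeze

# Build a GET request that carries the headers above.
# Nothing is sent over the network here.
def build_request(url, headers = DEFAULT_HEADERS)
  Net::HTTP::Get.new(URI(url), headers)
end

# Yield each URL in turn, pausing between calls so that requests
# are spaced out instead of being fired all at once.
def each_throttled(urls, delay_seconds: 1.0)
  urls.each_with_index do |url, index|
    sleep(delay_seconds) if index.positive?
    yield url
  end
end
```

To actually send a request you would pass it to Net::HTTP, for example inside Net::HTTP.start(host, port, use_ssl: true) { |http| http.request(request) }. Note that this sketch does nothing about IP bans, which usually require rotating through proxies.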
If you prefer to skip this tutorial, you can clone this script directly here.
In this tutorial we will use DataHen to scrape information about televisions from two categories on Amazon.com: “LED & LCD TVs” and “OLED TVs.” Specifically, we will scrape the following data for each television (also highlighted below): name, price, ASIN, seller, category, rating, number of reviews, product availability, and description.
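It can help to picture the record we are aiming for before writing any scraper code. The sketch below models one television as a Ruby Struct. The struct name Product, the field names, and the sample values are placeholders made up for illustration; they are not DataHen’s output schema or real Amazon data.

```ruby
# Hypothetical record shape for one scraped television.
# Field names are illustrative only.
Product = Struct.new(
  :name, :price, :asin, :seller, :category,
  :rating, :reviews_count, :availability, :description,
  keyword_init: true
)

# Placeholder values, not real product data.
example = Product.new(
  name: 'Example 55-inch TV',
  price: 499.99,
  asin: 'B000000000',
  seller: 'Example Seller',
  category: 'OLED TVs',
  rating: 4.5,
  reviews_count: 120,
  availability: true,
  description: 'Placeholder description.'
)

puts example.to_h
```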
We are going to assume you have Ruby 2.5.3 and the Nokogiri gem installed. If not, follow this link here for instructions on how to install Ruby. Once Ruby is installed, make sure RubyGems is also installed, and then run the following to install Nokogiri:
$ gem install nokogiri
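If you want to confirm the prerequisites before going further, a short script along these lines can check them. It is only a sketch: the helper names are made up, and treating 2.5.3 as a minimum version (rather than the exact version required) is an assumption.

```ruby
require 'rubygems'

# Assumption: 2.5.3 is treated as the minimum Ruby version.
MINIMUM_RUBY = Gem::Version.new('2.5.3')

# True when the running Ruby is at least the given version.
def ruby_recent_enough?(minimum = MINIMUM_RUBY)
  Gem::Version.new(RUBY_VERSION) >= minimum
end

# True when at least one version of the named gem is installed.
def gem_installed?(name)
  !Gem::Specification.find_all_by_name(name).empty?
end

puts "Ruby #{RUBY_VERSION} recent enough: #{ruby_recent_enough?}"
puts "Nokogiri installed: #{gem_installed?('nokogiri')}"
```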
First let’s set up a new DataHen scraper. Install the DataHen Ruby gem with the following command:
$ gem install datahen --source https://[email protected]/datahen/
You should see something similar to the following output after running this command:
Successfully installed datahen-0.2.3
Parsing documentation for datahen-0.2.3
Done installing documentation for datahen after 0 seconds
1 gem installed
Now that we have the DataHen gem installed, we need to store our DataHen token in an environment variable so that it is sent with every DataHen request. Run the following command:
$ export DATAHEN_TOKEN=<your_token_here>
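Because the export only applies to the current shell session, it is worth checking that the variable is really visible to Ruby. The helper below is a hypothetical convenience for that check, not part of the DataHen gem; it reads the variable and fails loudly if it is missing or blank.

```ruby
# Hypothetical helper: return the token from the environment,
# or raise if DATAHEN_TOKEN is missing or blank.
def datahen_token(env = ENV)
  token = env['DATAHEN_TOKEN'].to_s.strip
  raise 'DATAHEN_TOKEN is not set; run the export command first' if token.empty?
  token
end
```

Calling datahen_token in the same terminal where you ran the export should return your token. In a new terminal it will raise until you export the variable again or add the export line to your shell profile.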
We are now ready to create a scraper. Let’s create an empty directory first, and name it ‘amazon-tvs’:
$ mkdir amazon-tvs
Next let’s go into the directory and initialize it as a Git repository:
$ cd amazon-tvs
$ git init .
Now that our setup is finished, let's move on to creating the seeders in Part II.