Scraping websites can get you valuable data quickly, but it is often not straightforward. You will most likely run into challenges such as creating requests (you need to write code and use a library to make the HTTP requests that browsers make behind the scenes), throttling (a website may allow only a certain number of requests in a given amount of time so that you don’t bog down its server), and getting your IP banned (a website may try to prevent crawling by banning your IP address so that you can’t make requests). We are going to show you how DataHen handles these hard parts of scraping and makes it simple to get the data you need.
If you prefer to skip this tutorial, you can clone this script directly here.
For this tutorial, we are going to show you how to use DataHen to scrape products from Amazon using their ASINs. ASIN stands for Amazon Standard Identification Number. It is a 10-character alphanumeric unique identifier used for product identification within the Amazon.com organization. We’ll show you how to take a list of ASINs as input and retrieve information about each corresponding product, such as its price and seller.
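Since the scraper takes a list of ASINs as input, it can help to check their shape before using them. The following is a minimal sketch in plain Ruby, not part of DataHen: the helper names `valid_asin_format?` and `clean_asins` are hypothetical, the sample ASIN is a made-up placeholder, and the check covers only the "10 alphanumeric characters" rule described above, not whether the product actually exists.

```ruby
# Hypothetical helpers (not part of DataHen): check only the *shape* of an ASIN,
# i.e. exactly 10 alphanumeric characters, as described in this tutorial.
ASIN_PATTERN = /\A[A-Z0-9]{10}\z/i

# Returns true when the value looks like an ASIN after trimming whitespace.
def valid_asin_format?(value)
  ASIN_PATTERN.match?(value.to_s.strip)
end

# Normalizes a raw input list: trims, upcases, drops malformed entries
# and removes duplicates.
def clean_asins(raw_list)
  raw_list
    .map { |asin| asin.to_s.strip.upcase }
    .select { |asin| valid_asin_format?(asin) }
    .uniq
end

# "B012345678" is a placeholder value, not a real product.
asins = clean_asins([" b012345678 ", "too-short", "B012345678"])
```

Cleaning the list up front means malformed input is dropped before any requests are made.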
Specifically, we are going to extract the following Amazon product data (also highlighted below): URL, seller/author, number of reviews, rating, price, availability, product description, and image URL.
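To make the target data concrete, here is one possible shape for a single product record, sketched as a plain Ruby `Struct`. The name `ProductRecord`, the field names, and the sample values are all assumptions made for illustration. This is not a format that DataHen requires.

```ruby
# Hypothetical record shape for one scraped product. The field names mirror
# the list above but are illustrative only, not a DataHen-defined schema.
ProductRecord = Struct.new(
  :url,
  :seller,        # seller or author
  :reviews_count, # number of reviews
  :rating,
  :price,
  :availability,
  :description,
  :image_url,
  keyword_init: true
)

# Placeholder values only. Fields that are not supplied default to nil.
record = ProductRecord.new(
  url: "https://example.com/product",
  seller: "Example Seller",
  price: "0.00"
)

fields = record.to_h.keys
```

Listing the fields up front gives you a checklist to compare against once the parser is written.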
We are going to assume you have Ruby 2.5.3 and the Nokogiri gem installed. If not, follow this link here for instructions on how to install Ruby. Once Ruby is installed, make sure RubyGems is also installed, and then run the following to install Nokogiri:
$ gem install nokogiri
First, let’s set up a new DataHen scraper. Install the DataHen Ruby gem with the following command:
$ gem install datahen
You should see something similar to the following output after running this command:
Successfully installed datahen-0.2.3
Parsing documentation for datahen-0.2.3
Done installing documentation for datahen after 0 seconds
1 gem installed
Now that the DataHen gem is installed, we need to set our DataHen token as an environment variable so that it is sent with every DataHen request. Run the following command:
$ export DATAHEN_TOKEN=<your_token_Here>
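If you want to confirm that the variable is visible to Ruby processes started from this shell, a small check like the one below can help. It is only a sketch: the `datahen_token` helper is hypothetical and not part of the DataHen gem, and it checks only that the variable is non-empty, not that the token is valid.

```ruby
# Hypothetical helper (not part of the DataHen gem): reads DATAHEN_TOKEN
# from the environment and fails early if it is missing or blank.
# The env parameter defaults to ENV and exists only to make testing easy.
def datahen_token(env = ENV)
  token = env["DATAHEN_TOKEN"].to_s.strip
  raise "DATAHEN_TOKEN is not set; export it before continuing" if token.empty?
  token
end
```

Calling `datahen_token` in a throwaway script or an `irb` session started from the same shell either returns the token or raises with a clear message.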
We are now ready to create a scraper. Let’s create an empty directory first, and name it ‘amazon-asins’:
$ mkdir amazon-asins
Next, let’s go into the directory and initialize it as a Git repository:
$ cd amazon-asins
$ git init .
Now that you've done the setup, let's move on to creating the seeders in Part II.