How to Easily Scrape Amazon with Ruby and Nokogiri - Part 3: Parsers

The next step is to create a script that will find and enqueue all the Amazon television pages, which we will then parse for product details. Create a folder called “parsers” in our project root directory:

$ mkdir parsers

Next, create a file called “listings.rb” inside this “parsers” folder. Since we set the “page_type” to “listings” in our seeder, the seeded pages and any other pages with this “page_type” will be run against this “listings” parser. This is where we will write the code to extract and enqueue the links to the televisions, as well as links to more listing pages from the pagination, which are the numbered links at the bottom of a page that lead to more pages (Previous, 1, 2, 3, …). First, add the following line to the top of this “listings.rb” file:

nokogiri = Nokogiri.HTML(content)

The “content” variable is a reserved word that contains the html content of the actual page. With this line we are loading the html into Nokogiri so that we can search it easily. Next we are going to extract the television links from this page. Copy and paste the following below the line you just created:

# find every list element that contains a link to a television product
products = nokogiri.css('#mainResults li', '#resultsCol li')
products.each do |product|
  a_element = product.at_css('a.s-access-detail-page')
  if a_element
    # extract the product url and strip the unneeded "qid" tracking parameter
    url = a_element['href'].gsub(/&qid=[0-9]*/, '')
    # only enqueue absolute http(s) urls
    if url =~ /\Ahttps?:\/\//i
      pages << {
        url: url,
        page_type: 'products',
        vars: {
          category: page['vars']['category'],
          url: url
        }
      }
    end
  end
end

Let’s take a look at this code. In the first line we are telling Nokogiri to find all “li” elements inside the two divs with “ids” equal to “mainResults” and “resultsCol.” This will give us list elements that each contain a link to a television product. Next we loop through each of these list elements and look for a link inside with the class “s-access-detail-page.” This is the link to the Amazon product page, which has all the detailed info that we want. We check that the link element exists and, if it does, we extract the “href,” which is the url. The “gsub” call removes the “qid” tracking parameter from the url as it is not necessary, and the final check makes sure we only enqueue absolute urls that start with “http” or “https.”
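
To see what the “gsub” does, here is a quick example (using a shortened sample url for illustration):

url = "https://www.amazon.com/dp/B01MTGM5I9?ie=UTF8&qid=1547749731"
url.gsub(/&qid=[0-9]*/, '')
# => "https://www.amazon.com/dp/B01MTGM5I9?ie=UTF8"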

Once we have the full url for each television product we enqueue it by pushing a hash onto the “pages” array. We set the “page_type” to “products” and will soon create a “products” parser to extract the television details. We also pass the “category” and “url” to the product parser by utilizing the “vars” hash. Any values we set here will be passed to the corresponding parser. In this case, we are scraping two different television categories, so we want to make sure we send that category (which we originally set in the seeder) so it can be saved for each product, as the snippet below illustrates.
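
For reference, this is how those “vars” surface again downstream. Inside the “products” parser that we will write later in this section, the values become available through the “page” hash:

# inside the "products" parser, the vars enqueued above can be read back:
page['vars']['category']  # => "LED & LCD TVs" or "OLED TVs"
page['vars']['url']       # => the product page url we extracted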

Now that we have enqueued television urls from the first seeded listing page, we need to add more listing pages from the pagination at the bottom so we can get every television url available. Paste the following code below the code you just added.

# find every link inside the pagination section
pagination_links = nokogiri.css('#pagn a')
pagination_links.each do |link|
  page_num = link.text.strip
  # skip "Previous" and "Next"; only numbered links are enqueued
  if page_num =~ /\A[0-9]+\z/
    url = "https://www.amazon.com/s/ref=sr_pg_3?rh=n%3A172282%2Cn%3A%21493964%2Cn%3A1266092011%2Cn%3A172659%2Cn%3A6459737011&page=#{page_num}&ie=UTF8"
    pages << {
      url: url,
      page_type: 'listings',
      vars: {
        category: page['vars']['category']
      }
    }
  end
end

Here we are enqueuing all pagination links by getting every link (“a” element) inside the “div” with “id” “pagn.” We then iterate through each “a” element and get the inner text from each. We want to ignore the links that say “Previous” and “Next” because they have a different format, so we use a regular expression to make sure that the text inside the link is a number. Then we use this number, which we call “page_num,” to create a link to the corresponding page. We then enqueue this link, set the “page_type” to “listings” and set the “category” to the “category” that was passed in through the page variables.
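
A quick check in irb shows how the regular expression filters the link text (sample values for illustration):

"2" =~ /\A[0-9]+\z/     # => 0 (match found, so the link is enqueued)
"Next" =~ /\A[0-9]+\z/  # => nil (no match, so the link is skipped)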

Let’s try out our “listings” parser to see if there are any errors. First, let’s see what the seeded pages look like by running the following command:

$ datahen scraper page list amazon-tvs

The output should look something like the following:

[
 {
  "gid": "www.amazon.com-68d081564e3be1c4dd047d67e6556b09", # Global ID
  "job_id": 70,
  "page_type": "listings",
  "method": "GET",
  "url": "https://www.amazon.com/s/ref=lp_172659_nr_n_0?fst=as%3Aoff&rh=n%3A172282%2Cn%3A%21493964%2Cn%3A1266092011%2Cn%3A172659%2Cn%3A6459737011&bbn=172659&ie=UTF8&qid=1547749731&rnid=172659",
  "effective_url": "https://www.amazon.com/s/ref=lp_172659_nr_n_0?fst=as%3Aoff&rh=n%3A172282%2Cn%3A%21493964%2Cn%3A1266092011%2Cn%3A172659%2Cn%3A6459737011&bbn=172659&ie=UTF8&qid=1547749731&rnid=172659",
  "headers": "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
  ...
 }, 
...
]

We want to take the “gid” (Global ID) and try our parser against this specific page to check the output. Copy the “gid” and replace it in the following command:

$ datahen parser try amazon-tvs parsers/listings.rb www.amazon.com-68d081564e3be1c4dd047d67e6556b09  

You should see output similar to the following:

Trying parser script
getting Job Page
=========== Parsing Executed ===========
----------- New Pages to Enqueue: -----------
[
  {
    "url": "https://www.amazon.com/TCL-55S405-55-Inch-Ultra-Smart/dp/B01MTGM5I9",
    "page_type": "products",
    "vars": {
      "category": "LED & LCD TVs",
      "url": "https://www.amazon.com/TCL-55S405-55-Inch-Ultra-Smart/dp/B01MTGM5I9"
    }
  },
  {
    "url": "https://www.amazon.com/TCL-40S325-Inch-1080p-Smart/dp/B07GB61TQR",
    "page_type": "products",
    "vars": {
      "category": "LED & LCD TVs",
      "url": "https://www.amazon.com/TCL-40S325-Inch-1080p-Smart/dp/B07GB61TQR"
    }
  },
 …
]

Next, add the parsers section to the config.yaml file so that it looks like the following:

seeder:
 file: ./seeder/seeder.rb
parsers:
 - page_type: listings
   file: ./parsers/listings.rb

Commit this to Git, and push it to your remote Git repository.

$ git add .
$ git commit -m 'add listings parser to config'                                                                                                                             
$ git push origin master   

Now that your remote Git repository is up to date, we can deploy the scraper again:

$ datahen scraper deploy amazon-tvs

Since we have already started a scrape job, DataHen will automatically download and execute your new listings parser. Wait a few minutes and then run the following “stats” command to see the progress:

$ datahen scraper stats amazon-tvs   

Now that our listings parser is working we can move on to extracting television data by creating a “products” parser. Create a file called “products.rb” inside the “parsers” folder with the following code:

nokogiri = Nokogiri.HTML(content)

# initialize an empty hash
product = {}

#save the url
product['url'] = page['vars']['url']

#save the category
product['category'] = page['vars']['category']

#extract the asin
canonical_link = nokogiri.css('link').find{|link| link['rel'].to_s.strip == 'canonical' }
product['asin'] = canonical_link['href'].split("/").last

#extract title
product['title'] = nokogiri.at_css('#productTitle').text.strip

#extract seller/author
seller_node = nokogiri.at_css('a#bylineInfo')
if seller_node
  product['seller'] = seller_node.text.strip
else
  product['author'] = nokogiri.css('a.contributorNameID').text.strip
end

#extract number of reviews
reviews_node = nokogiri.at_css('span#acrCustomerReviewText')
reviews_count = reviews_node ? reviews_node.text.strip.split(' ').first.gsub(',','') : nil
product['reviews_count'] = reviews_count =~ /^[0-9]*$/ ? reviews_count.to_i : 0

#extract rating
rating_node = nokogiri.at_css('#averageCustomerReviews span.a-icon-alt')
stars_num = rating_node ? rating_node.text.strip.split(' ').first : nil
product['rating'] = stars_num =~ /^[0-9.]*$/ ? stars_num.to_f : nil

#extract price
price_node = nokogiri.at_css('#price_inside_buybox', '#priceblock_ourprice', '#priceblock_dealprice', '.offer-price')
if price_node
  product['price'] = price_node.text.strip.gsub(/[\$,]/,'').to_f
end

#extract availability
availability_node = nokogiri.at_css('#availability')
if availability_node
  product['available'] = availability_node.text.strip == 'In Stock.'
else
  product['available'] = nil
end

#extract product description
description = ''
nokogiri.css('#feature-bullets li').each do |li|
  unless li['id'] || (li['class'] && li['class'] != 'showHiddenFeatureBullets')
    description += li.text.strip + ' '
  end
end
product['description'] = description.strip

# specify the collection where this record will be stored
product['_collection'] = "products"

# save the product to the job’s outputs
outputs << product

Let’s go through this code line by line:

nokogiri = Nokogiri.HTML(content)

We are taking the html string of the page that is inside the “content” variable and parsing it with Nokogiri so that we can search it.

product = {}

We then initialize an empty hash. This is where we will store the data that we extract.

product['url'] = page['vars']['url']

Here we are saving the url that we passed in through our “listings” parser.

product['category'] = page['vars']['category']

Similar to the “url,” we are saving the television “category” which we also passed in to the page “vars” in the “listings” parser (originally set in our “seeder”). Since we are only doing two categories, this value will either be “LED & LCD TVs” or “OLED TVs.”

canonical_link = nokogiri.css('link').find{|link| link['rel'].to_s.strip == 'canonical' }
product['asin'] = canonical_link['href'].split("/").last

Next we are looking for the ASIN, which stands for Amazon Standard Identification Number. In order to do this we find the canonical link element, which is a “standard” link for the page to prevent duplicates for search engine crawlers. This canonical link is clean and has a “standard” format. The following is an example: https://www.amazon.com/Samsung-65NU7300-Curved-Smart-2018/dp/B079NRD7HQ. The last segment of this example is the ASIN. In order to extract it, we split the link string by the forward slash (“/”), which will create an array using the forward slash as the delimiter. Then, since every canonical link has the same format, we know that the ASIN is the last element of this array. Note the “to_s” call, which guards against “link” elements that have no “rel” attribute and would otherwise raise an error.
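
Here is that split in action against the example canonical link above:

url = "https://www.amazon.com/Samsung-65NU7300-Curved-Smart-2018/dp/B079NRD7HQ"
url.split("/")
# => ["https:", "", "www.amazon.com", "Samsung-65NU7300-Curved-Smart-2018", "dp", "B079NRD7HQ"]
url.split("/").last
# => "B079NRD7HQ"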

product['title'] = nokogiri.at_css('#productTitle').text.strip

To get the product title we look for a “div” with id “productTitle” and get the text inside.
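
One caveat: “at_css” returns nil when no matching element is found, so this line will raise an error on a page that happens to be missing the title element. If you want to be defensive, a guarded variant (a sketch, not part of the original parser) might look like this:

title_node = nokogiri.at_css('#productTitle')
product['title'] = title_node ? title_node.text.strip : nil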

seller_node = nokogiri.at_css('a#bylineInfo')
if seller_node
  product['seller'] = seller_node.text.strip
else
  product['author'] = nokogiri.css('a.contributorNameID').text.strip
end

Next we are extracting either the seller or the author. If a product is a book, it will have an author instead of a seller. We check for a link element with the id “bylineInfo.” If that element exists, we extract the inner text and save it as the seller. Otherwise, we look for links with the class “contributorNameID” and save their text as the author.

reviews_node = nokogiri.at_css('span#acrCustomerReviewText')
reviews_count = reviews_node ? reviews_node.text.strip.split(' ').first.gsub(',','') : nil
product['reviews_count'] = reviews_count =~ /^[0-9]*$/ ? reviews_count.to_i : 0

For this part we are trying to save the number of reviews that the product has. We check for a “span” with an “id” of “acrCustomerReviewText.” If this element exists, we grab the inner text, which should look something like “92 customer reviews.” To get just the number, we split the string into an array using whitespace as the divider and take the first element, which should be a string containing a number (“92” in this example). The “gsub” removes any thousands-separator commas. We then check that the string is a number using a regular expression and, if so, convert it to an integer. Otherwise, if the string is not a number or it doesn’t exist, we set the review count to 0.
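
Step by step, with sample review strings (values for illustration):

"92 customer reviews".split(' ').first                   # => "92"
"1,024 customer reviews".split(' ').first.gsub(',','')   # => "1024"
"92".to_i                                                # => 92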

rating_node = nokogiri.at_css('#averageCustomerReviews span.a-icon-alt')
stars_num = rating_node ? rating_node.text.strip.split(' ').first : nil
product['rating'] = stars_num =~ /^[0-9.]*$/ ? stars_num.to_f : nil

Next we are extracting the star rating. To find this rating, we look for a “span” element with class “a-icon-alt” that is inside a “div” with an “id” named “averageCustomerReviews.” If this element exists, we grab the inner text, which should look something like “4.4 out of 5 stars,” and call split on the text, which will turn it into an array using a space as the dividing character. This way the first element of the array will be the rating in string form. We do a check using a regular expression to make sure the rating is indeed a number, allowing for a decimal, and convert the string into a floating-point number. If the rating element doesn’t exist, or if the rating string is not a number, we set the “rating” to nil.
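
Again with the sample text from above:

"4.4 out of 5 stars".split(' ')        # => ["4.4", "out", "of", "5", "stars"]
"4.4 out of 5 stars".split(' ').first  # => "4.4"
"4.4".to_f                             # => 4.4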

price_node = nokogiri.at_css('#price_inside_buybox', '#priceblock_ourprice', '#priceblock_dealprice', '.offer-price')
if price_node
  product['price'] = price_node.text.strip.gsub(/[\$,]/,'').to_f
end

Next we are extracting the price. The html element in which the price appears varies depending on a number of factors such as time and product type. As a result, we need to look for a number of different elements to find the price. Specifically, we check for “divs” with the following “ids”: “price_inside_buybox,” “priceblock_ourprice,” “priceblock_dealprice,” and also a “div” with the “class” “offer-price.” If an element exists, we grab the inner text, remove the dollar sign and any commas, and convert it to a floating-point number to allow for decimal values.
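
For example, with a sample price string (value for illustration only):

"$1,297.99".gsub(/[\$,]/,'')       # => "1297.99"
"$1,297.99".gsub(/[\$,]/,'').to_f  # => 1297.99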

availability_node = nokogiri.at_css('#availability')
if availability_node
  product['available'] = availability_node.text.strip == 'In Stock.'
else
  product['available'] = nil
end

Next we are looking at the product availability. To check the availability we look for a “div” with the following id: “availability.” If this element is present, we check to make sure that the text inside is equal to “In Stock.” If so, that means the product is available and we set “available” to true, otherwise we set it to false.
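
Note that the comparison is strict, so any other availability text is treated as unavailable (sample strings for illustration):

"In Stock.".strip == 'In Stock.'              # => true
"Only 5 left in stock.".strip == 'In Stock.'  # => false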

description = ''
nokogiri.css('#feature-bullets li').each do |li|
  unless li['id'] || (li['class'] && li['class'] != 'showHiddenFeatureBullets')
    description += li.text.strip + ' '
  end
end
product['description'] = description.strip

With this code we are extracting the product’s description. We look for “li” elements inside a “div” with the “id” “feature-bullets” and iterate through each of these “li” elements. We add the text inside the element to a string but exclude “li” elements that have an “id” or have a “showHiddenFeatureBullets” “class.”
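
Because the “unless” condition can be hard to read, here is how it evaluates for a few representative “li” elements (class names other than “showHiddenFeatureBullets” are hypothetical examples):

# li with no id and no class                        => included
# li with class "showHiddenFeatureBullets"          => included
# li with any other class (e.g. "replacementParts") => excluded
# li with an id                                     => excluded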

product['_collection'] = 'products'

This line sets the “collection” name to “products.” Job outputs are stored in collections and specifying the collection will allow us to query and export the data later.

outputs << product

Finally, we save the Amazon television product hash to the “outputs” variable which is an array for saving job output.

Now we can update our config.yaml file by specifying our products parser. The config.yaml file should look like the following:

seeder:
 file: ./seeder/seeder.rb
parsers:
 - page_type: listings
   file: ./parsers/listings.rb
 - page_type: products
   file: ./parsers/products.rb

Commit this to Git, and push it to your remote Git repository.

$ git add .
$ git commit -m 'add products parser to config'                               
$ git push origin master  

Now that we have updated our Git repository, we can deploy the scraper again:

$ datahen scraper deploy amazon-tvs

DataHen will automatically download this new parser and start to parse all the pages with “page_type” set to “products.” You can keep running the “stats” command from earlier to check on the progress. This scraper will take some time to finish as there are a lot of television products in each of these categories. One thing we can do to speed it up is to add more workers, which will allow us to process multiple pages in parallel. We can increase the number of workers to 5 with the following commands:

$ datahen scraper job cancel amazon-tvs
$ datahen scraper job update amazon-tvs --workers 5
$ datahen scraper job resume amazon-tvs

You can keep running the stats command to check on how the scraper is performing over time. Once the scraper has finished parsing all the television product pages we can export and download the data. We will take a look at exporting in Part IV.