How to Easily Scrape Amazon with Ruby and Nokogiri

To create an exporter, first create an “exporters” directory in your project’s root folder. Inside this “exporters” folder, create a file called “products_json.yaml” with the following content:

exporter_name: products_json # Must be unique
exporter_type: json
collection: products
write_mode: pretty_array # can be `line`, `pretty`, `pretty_array`, or `array`
offset: 0 # offset to where the exported record will start from
order: desc # can be ascending `asc` or descending `desc`
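
Before deploying, it can help to sanity-check this file locally. The following is a minimal sketch, not part of DataHen itself: `exporter_config_errors` is a hypothetical helper, and it assumes that every field shown above is required, which DataHen may not enforce. The allowed values for `write_mode` and `order` are taken from the comments in the file above.

```ruby
require "yaml"

# Hypothetical helper (not a DataHen API): returns a list of problems
# found in the text of an exporter YAML file. An empty list means the
# file passed these basic checks.
def exporter_config_errors(yaml_text)
  # Allowed values come from the comments in products_json.yaml.
  write_modes = %w[line pretty pretty_array array]
  orders = %w[asc desc]

  config = YAML.safe_load(yaml_text)
  config = {} unless config.is_a?(Hash)

  errors = []
  # Assumption: this sketch treats all of these keys as required.
  %w[exporter_name exporter_type collection].each do |key|
    errors << "missing #{key}" if config[key].to_s.empty?
  end
  unless write_modes.include?(config["write_mode"])
    errors << "unknown write_mode: #{config['write_mode']}"
  end
  unless orders.include?(config["order"])
    errors << "unknown order: #{config['order']}"
  end
  unless config["offset"].is_a?(Integer) && config["offset"] >= 0
    errors << "offset must be a non-negative integer"
  end
  errors
end

# Usage, run from the project root once the file exists:
#   errors = exporter_config_errors(File.read("./exporters/products_json.yaml"))
#   puts errors.empty? ? "exporter config looks OK" : errors
```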

Update your config.yaml file with this exporter location so that config.yaml looks like the following:

seeder:
 file: ./seeder/seeder.rb
parsers:
 - page_type: listings
   file: ./parsers/listings.rb
 - page_type: products
   file: ./parsers/products.rb
exporters:
 - file: ./exporters/products_json.yaml
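
A missing or misspelled path in config.yaml is easy to overlook until after a deploy. As a rough local check, the sketch below collects every `file:` entry and reports the ones that do not exist on disk. Both helper names are hypothetical, and the sketch assumes the top-level keys are the lowercase `seeder`, `parsers`, and `exporters`; it does not reflect how DataHen itself reads the file.

```ruby
require "yaml"

# Hypothetical helper: collects every `file:` path referenced in the
# text of a config.yaml, in seeder / parsers / exporters order.
def referenced_files(config_text)
  config = YAML.safe_load(config_text)
  config = {} unless config.is_a?(Hash)

  paths = [config.dig("seeder", "file")]
  (config["parsers"] || []).each { |parser| paths << parser["file"] }
  (config["exporters"] || []).each { |exporter| paths << exporter["file"] }
  paths.compact
end

# Hypothetical helper: returns the referenced paths that do not exist
# relative to project_root.
def missing_files(config_text, project_root = ".")
  referenced_files(config_text).reject do |path|
    File.exist?(File.expand_path(path, project_root))
  end
end

# Usage, run from the project root:
#   missing = missing_files(File.read("config.yaml"))
#   puts missing.empty? ? "all referenced files exist" : missing
```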

Commit this update to Git, push it to your remote Git repository and deploy once again. Check if the exporter is present with the following command:

$ datahen scraper exporter list amazon-tvs

After that, let’s start the exporter.

$ datahen scraper exporter start amazon-tvs products_json

This will return a hash with info about our exporter that should look like the following:

{
 "id": "c700cb749f4e45eeb53609927e45da89", # Export ID here
 "job_id": 852,
 "scraper_id": 21,
 "exporter_name": "products_json",
 "exporter_type": "json",
 "config": {
  "collection": "products",
  "exporter_name": "products_json",
  "exporter_type": "json",
  "offset": 0,
  "order": "desc",
  "write_mode": "pretty_array"
 },
 "status": "enqueued", # the status of the export
 "created_at": "2019-02-13T03:40:56.815979Z"
}

The “id” field at the top is the ID of this export. We can use this ID to check the status of the export and then download the result once it has finished.

$ datahen scraper export show c700cb749f4e45eeb53609927e45da89

And then to download:

$ datahen scraper export download c700cb749f4e45eeb53609927e45da89

This will automatically download a compressed file with your JSON data inside. You should now have a working scraper for getting television information from Amazon by category. Feel free to try the scraper with your own categories. Just make sure to keep testing your parsers locally, since subtle differences between categories can cause errors.
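
Once you have extracted the downloaded file, a quick way to confirm the export worked is to load it in Ruby and look at what it contains. The sketch below assumes that the `pretty_array` write mode produces a single JSON array of product objects; `summarize_export` is a hypothetical helper, and the file name in the usage comment is a placeholder for whatever the archive actually contains.

```ruby
require "json"

# Hypothetical helper: given the text of an exported JSON file, returns
# the number of records and the sorted list of field names that appear
# in any record.
def summarize_export(json_text)
  records = JSON.parse(json_text)
  # Assumption: pretty_array exports are one top-level JSON array.
  raise ArgumentError, "expected a JSON array" unless records.is_a?(Array)

  # Ignore anything that is not a JSON object when collecting keys.
  keys = records.grep(Hash).flat_map(&:keys).uniq.sort
  { count: records.size, keys: keys }
end

# Usage, with a placeholder file name:
#   summary = summarize_export(File.read("products.json"))
#   puts "#{summary[:count]} records with fields: #{summary[:keys].join(', ')}"
```

If the record count is lower than expected, or a field is missing from the key list, that usually points back to a parser that needs another round of local testing.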

We have now completed the tutorial. It should give you a good idea of what DataHen is capable of. DataHen lets you focus on getting the data you need, in this case Amazon product data, instead of worrying about the intricacies of scraping a site as unfriendly to scrapers as Amazon.com. We only covered two Amazon categories, but you can easily scale this up to as many as you want with DataHen. This tutorial should also have shown you how to use Ruby and Nokogiri to extract this specific data. You should now be able to create your own scraper for whatever site you want, using this tutorial as a reference.