To create an exporter, first create an “exporters” directory in your project’s root folder. Inside this “exporters” folder, create a file called “products_json.yaml” with the following content:
exporter_name: products_json # Must be unique
exporter_type: json
collection: products
write_mode: pretty_array # can be `line`, `pretty`, `pretty_array`, or `array`
offset: 0 # number of records to skip before the export begins
order: desc # can be ascending `asc` or descending `desc`
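To make the write modes concrete, here is roughly how the same two records could be laid out under `line` versus `pretty_array`. This is only an illustrative sketch; the fields shown ("_id" and "title") are placeholders rather than the actual fields your parser saves:

# write_mode: line writes one compact JSON object per line
{"_id":"B000000000","title":"Example Product"}
{"_id":"B000000001","title":"Another Product"}

# write_mode: pretty_array writes a single pretty-printed JSON array
[
  {
    "_id": "B000000000",
    "title": "Example Product"
  },
  {
    "_id": "B000000001",
    "title": "Another Product"
  }
]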
Update your config.yaml file to point to this exporter, so that config.yaml looks like the following:
seeder:
  file: ./seeder/seeder.rb
parsers:
  - page_type: products
    file: ./parsers/products.rb
exporters:
  - file: ./exporters/products_json.yaml
Commit this change, push it to your remote Git repository, and deploy once again. Check that the exporter is present with the following command:
$ datahen scraper exporter list amazon-asin
After that, let’s start the exporter:
$ datahen scraper exporter start amazon-asin products_json
This will return a hash with information about the export that should look like the following:
{
  "id": "c700cb749f4e45eeb53609927e21da56", # Export ID here
  "job_id": 852,
  "scraper_id": 20,
  "exporter_name": "products_json",
  "exporter_type": "json",
  "config": {
    "collection": "products",
    "exporter_name": "products_json",
    "exporter_type": "json",
    "offset": 0,
    "order": "desc",
    "write_mode": "pretty_array"
  },
  "status": "enqueued", # the status of the export
  "created_at": "2019-02-05T06:19:56.815979Z"
}
Using the value of “id”, we can check the status of the export and then download the file once it is done.
$ datahen scraper export show c700cb749f4e45eeb53609927e21da56
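If the status is still “enqueued”, wait a bit and run the command again. As a convenience, you could poll from the shell with a simple loop. This is only a sketch: it assumes the show command prints the same hash shown above, including a status field whose value changes (for example to something like “done”) when the export has finished:

$ while true; do datahen scraper export show c700cb749f4e45eeb53609927e21da56 | grep status; sleep 10; done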
And then to download:
$ datahen scraper export download c700cb749f4e45eeb53609927e21da56
This will automatically download a compressed file with your JSON data inside. You should now have a working scraper for getting Amazon product info from a list of ASINs. Feel free to add your own product ASINs; just make sure to keep testing your parsers locally to catch any errors, since there can be subtle differences between categories.
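Once you have decompressed the download, you can sanity-check the exported data with a few lines of Ruby. This is just a sketch: the file name below is a placeholder for whatever the archive actually contains, and “title” stands in for whichever fields your parser saved to the “products” collection:

require 'json'

# Load the exported pretty_array file: a single JSON array of product records.
# NOTE: 'products_json.json' is a placeholder file name; use the actual file
# extracted from the downloaded archive.
records = JSON.parse(File.read('products_json.json'))

puts "Exported #{records.length} products"

# Print a sample field from the first few records. 'title' is assumed here;
# substitute the fields your parser actually outputs.
records.first(5).each do |product|
  puts product['title']
end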
Now that we are at the end of this tutorial, you should have a good idea of what DataHen is capable of. We were able to take a list of ASINs and fetch the corresponding data from Amazon.com for each one. Amazon.com is not straightforward to scrape, as it has many protections in place to block requests that appear to come from a scraper or bot, but as you have seen, DataHen takes care of this part for you. We also didn’t need to worry about running the scraper on a server or where to save the data. The hard parts of scraping have been solved for you, which lets you focus on getting the exact data you need. Hopefully this tutorial has been a helpful guide to using Ruby and Nokogiri to extract that data. You should now be able to create your own scraper for whatever site you want, using this tutorial as a reference.