Every DataHen scraper requires a seeder script, which tells the scraper which pages to start scraping. A seeder script is a Ruby file that loads URLs into a reserved variable called “pages.” First, create a directory for our seeder script:
$ mkdir seeder
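To get a feel for the shape of a seeder script before we write the real one, here is the smallest possible sketch: it appends a single hard-coded page to “pages” (this is only an illustration; our actual seeder below builds its pages from a CSV of ASINs and sets a few more options):
# Minimal seeder sketch: enqueue one hard-coded product page.
pages << {
  url: "https://www.amazon.com/o/ASIN/1476753830"
}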
We will also place our CSV file of Amazon ASINs in this “seeder” directory. Create a file called asins.csv and add the following content:
ASIN
1476753830
B07F35VT29
B07JVJP46R
B01EBAOUZ0
B001QU38IY
B01LTHYW9W
B078BCB9WW
B07BWNPS9G
B00MDRTV8A
B074XMH3W2
B07DPRT1FJ
B01983OFK0
B01I0IGFKC
B06XW75KZW
B075MGQFF6
Next, create a file called “seeder.rb” inside this same seeder directory with the following code:
require 'csv' # load Ruby's standard CSV library (it may already be available in the DataHen environment, but requiring it is harmless)

# Enqueue one product page per ASIN in the CSV file.
CSV.foreach("./seeder/asins.csv", headers: true) do |row|
  url = "https://www.amazon.com/o/ASIN/#{row['ASIN']}"
  pages << {
    page_type: "products",
    method: "GET",
    headers: {"User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"},
    url: url,
    vars: {
      url: url,
      asin: row["ASIN"]
    }
  }
end
In the Ruby script above, we iterate through each ASIN in our CSV file, build the URL of the corresponding product page on Amazon, and pass that URL to DataHen by appending it to the pages array. Please note that “pages” is a reserved variable: it is an array representing the pages you want to seed. Let’s go through the other values in detail.
The “page_type” is a setting that determines which parser script to use. Later we will create a Ruby parser script called “products” that will extract specific info from each product page.
The “method” is the type of HTTP request we want to make. In this example, we are doing a simple “GET” request, which is what your browser would make if you were viewing this URL.
For the “headers” setting we are setting a “User-Agent,” which is a string that identifies a browser. Whenever you access a website, your browser includes a “User-Agent” header so the website knows how to render the page you request. By including a “User-Agent” string, we avoid having the Amazon website think we are a scraping bot and block our requests. You can also leave the “headers” setting out completely and DataHen will supply a “User-Agent” for you by randomly submitting one with each page request. The “User-Agents” that are randomly selected are all valid and come from the main browsers (Chrome, Firefox, and Internet Explorer), so there is no need to worry if you leave this out.
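For example, if you would rather let DataHen pick the “User-Agent” for you, the same page could be seeded without the “headers” key at all. This is only a sketch, reusing the url and vars from the loop above:
# Same page as above, but with no "headers":
# DataHen will submit a randomly chosen User-Agent with each request.
pages << {
  page_type: "products",
  method: "GET",
  url: url,
  vars: {
    url: url,
    asin: row["ASIN"]
  }
}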
For the “url” value we are passing the URL that we built from the ASIN at the top of the loop. This is the URL of the corresponding Amazon product.
The “vars” are user-defined variables. You can set them to whatever you want, and they will be passed to the “products” parser for the corresponding URL. In this case, we pass the “asin” and “url” so that we can save these values along with the info we extract for each product.
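As a small preview of Part III, the “products” parser will be able to read these values back through DataHen’s reserved “page” variable. The sketch below is only an assumption about the parser side (it presumes vars are accessed via page['vars'], which is the convention we will use when we write the parser):
# Inside the future products parser (sketch only):
# the vars we seeded travel along with the page.
asin = page['vars']['asin']
url  = page['vars']['url']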
Now that we have created a seeder script we can try it out to see if there are any syntax errors. Run the following command from the root of your project directory:
$ datahen seeder try seeder/seeder.rb
You should see output similar to the following (only the first enqueued page is shown here):
Trying seeder script
=========== Seeding Script Executed ===========
----------- New Pages to Enqueue: -----------
[
  {
    "page_type": "products",
    "method": "GET",
    "headers": {
      "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
    },
    "url": "https://www.amazon.com/o/ASIN/1476753830",
    "vars": {
      "url": "https://www.amazon.com/o/ASIN/1476753830",
      "asin": "1476753830"
    }
  }
]
Now we can commit this seeder to our git repository with the following commands:
$ git add .
$ git commit -m 'created a seeder file'
DataHen scrapers live in git repositories so we will first need to create one. Bitbucket offers free git repositories. Create a Bitbucket account and then a new repository here: https://bitbucket.org/repo/create. Use the git repo address from Bitbucket and push your scraper with the following commands (replace <username> with your Bitbucket username):
$ git remote add origin git@bitbucket.org:<username>/amazon-asins.git
$ git push -u origin master
We will need a config file to tell DataHen where to find our files. Create a config.yaml file in the root project directory with the following content:
seeder:
  file: ./seeder/seeder.rb
  disabled: false # Optional. Set it to true if you want to disable execution of this file
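Once we write the “products” parser in Part III, this config will also gain a “parsers” section that maps each page_type to its parser file. Here is a hedged preview of what that will look like (the parsers entry and the ./parsers/products.rb path are placeholders for now, so don’t add them yet):
seeder:
  file: ./seeder/seeder.rb
  disabled: false
parsers:
  - page_type: products
    file: ./parsers/products.rb
    disabled: false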
Commit this config file to git and push it to Bitbucket:
$ git add .
$ git commit -m 'add config.yaml file'
$ git push origin master
We can now create a scraper and run it on DataHen. Substitute your own git repo address (it should end in .git) in the following command, which will create a scraper called “amazon-asins”:
$ datahen scraper create amazon-asins git@bitbucket.org:<username>/amazon-asins.git --workers 1
Next, we need to deploy the scraper from your remote Git repository onto DataHen:
$ datahen scraper deploy amazon-asins
After deploying we can start the scraper with the following command:
$ datahen scraper start amazon-asins
Starting a new scraper will create a new scrape job and run it. Wait a minute or two and then check the status of this job with the following command:
$ datahen scraper stats amazon-asins
You should see something similar to the following:
{
  "job_id": 100, # Job ID
  "pages": 15, # How many pages in the scrape job
  "fetched_pages": 1, # Number of fetched pages
  "to_fetch": 0, # Pages that need to be fetched
  "fetching_failed": 0, # Pages that failed fetching
  "fetched_from_web": 1, # Pages that were fetched from Web
  "fetched_from_cache": 0, # Pages that were fetched from the shared Cache
  "parsed_pages": 0, # Pages that have been parsed by parsing script
  "to_parse": 1, # Pages that need to be parsed
  "parsing_failed": 0, # Pages that failed parsing
  "outputs": 0, # Outputs of the scrape
  "output_collections": 0, # Output collections
  "workers": 1, # How many workers are used in this scrape job
  "time_stamp": "2019-02-01T22:09:57.956158Z"
}
The “pages” value of 15 indicates that our seeder has successfully enqueued a page for each of the 15 ASINs in our CSV file. We will now create our parsers in Part III.