Every DataHen scraper requires a seeder script, which tells the scraper which pages to start with. A seeder script is a Ruby file that loads URLs into a variable called “pages”. First, create a directory for our seeder script:
$ mkdir seeder
Next, create a file called “seeder.rb” inside this seeder directory with the following code:
pages << {
  page_type: "listings",
  method: "GET",
  headers: {"User-Agent" => "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"},
  url: "https://www.amazon.com/s/ref=lp_172659_nr_n_0?fst=as%3Aoff&rh=n%3A172282%2Cn%3A%21493964%2Cn%3A1266092011%2Cn%3A172659%2Cn%3A6459737011&bbn=172659&ie=UTF8&qid=1547749731&rnid=172659",
  vars: {
    category: "LED & LCD TVs"
  }
}
pages << {
  page_type: "listings",
  method: "GET",
  headers: {"User-Agent" => "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"},
  url: "https://www.amazon.com/s/ref=lp_172659_nr_n_1?fst=as%3Aoff&rh=n%3A172282%2Cn%3A%21493964%2Cn%3A1266092011%2Cn%3A172659%2Cn%3A6463520011&bbn=172659&ie=UTF8&qid=1547749731&rnid=172659",
  vars: {
    category: "OLED TVs"
  }
}
In the Ruby script above, we are seeding two links to two different television categories on Amazon (LED & LCD TVs and OLED TVs). Please note that “pages” is a reserved variable. It is an array that holds the pages you want to seed. Let’s go through the other values in detail.
The “page_type” setting determines which parser script to use. Later we will create a Ruby parser script called “listings”, which we will use to gather television links.
The “method” is the type of HTTP request we want to make. In this example, we are doing a simple “GET” request, which is what your browser would make if you were viewing this URL.
For the “headers” setting we set a “User-Agent”, which is a string that identifies a browser. Whenever you access a website, your browser includes a “User-Agent” so the website knows how to render the page that you request. Including a “User-Agent” string makes it less likely that the Amazon website treats our requests as coming from a scraping bot and blocks them. You can also leave this “headers” setting out completely, and DataHen will submit a randomly selected “User-Agent” with each page request. The randomly selected “User-Agents” are all valid and come from the main browsers (Chrome, Firefox, and Internet Explorer), so there is no need to worry if you leave this out.
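To make the “method” and “headers” settings more concrete, here is a minimal sketch of how those two values map onto an ordinary HTTP request built with Ruby's standard Net::HTTP library. This is not how DataHen fetches pages internally; the `PAGE` constant, the `build_request` helper, and the example.com URL are all made up for illustration, and the request is only built, never sent. The header value here holds just the browser string, since the header name is supplied separately as the hash key.

```ruby
require "net/http"
require "uri"

# Placeholder page hash using the same keys as a seeded page.
# The URL is a stand-in, not one of the Amazon links from the seeder.
PAGE = {
  method: "GET",
  headers: { "User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36" },
  url: "https://www.example.com/listings?page=1"
}.freeze

# Hypothetical helper (not part of DataHen): build, but do not send, the request.
def build_request(page)
  raise ArgumentError, "only GET is sketched here" unless page[:method] == "GET"

  # The second argument sets the initial request headers.
  Net::HTTP::Get.new(URI(page[:url]), page[:headers])
end

REQUEST = build_request(PAGE)
puts REQUEST.method          # the HTTP verb taken from :method
puts REQUEST.path            # path and query string taken from :url
puts REQUEST["User-Agent"]   # the header taken from :headers
```

If the `headers` key is omitted in this sketch, Net::HTTP falls back to its own default “User-Agent”, which is loosely analogous to DataHen choosing one for you.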
Now that we have created a seeder script we can try it out to see if there are any syntax errors. Run the following command from the root of your project directory:
$ datahen seeder try seeder/seeder.rb
You should see the following output:
Trying seeder script
=========== Seeding Script Executed ===========
----------- New Pages to Enqueue: -----------
[
  {
    "page_type": "listings",
    "method": "GET",
    "headers": {
      "User-Agent": "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
    },
    "url": "https://www.amazon.com/s/ref=lp_172659_nr_n_0?fst=as%3Aoff&rh=n%3A172282%2Cn%3A%21493964%2Cn%3A1266092011%2Cn%3A172659%2Cn%3A6459737011&bbn=172659&ie=UTF8&qid=1547749731&rnid=172659",
    "vars": {
      "category": "LED & LCD TVs"
    }
  },
  {
    "page_type": "listings",
    "method": "GET",
    "headers": {
      "User-Agent": "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
    },
    "url": "https://www.amazon.com/s/ref=lp_172659_nr_n_1?fst=as%3Aoff&rh=n%3A172282%2Cn%3A%21493964%2Cn%3A1266092011%2Cn%3A172659%2Cn%3A6463520011&bbn=172659&ie=UTF8&qid=1547749731&rnid=172659",
    "vars": {
      "category": "OLED TVs"
    }
  }
]
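The two enqueued pages differ only in their URL and category, so the same hashes could be generated from a small lookup table instead of being written out twice. The following is one possible sketch; the names `USER_AGENT`, `CATEGORIES`, `listing_page`, and `SEED_PAGES` are hypothetical and do not come from DataHen. It runs as plain Ruby, which is why it collects the hashes in its own constant rather than in the reserved “pages” array.

```ruby
# Hypothetical names throughout; only the hash keys mirror the seeder above.
USER_AGENT = "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"

# Category name => listing URL, copied from the seeder script.
CATEGORIES = {
  "LED & LCD TVs" => "https://www.amazon.com/s/ref=lp_172659_nr_n_0?fst=as%3Aoff&rh=n%3A172282%2Cn%3A%21493964%2Cn%3A1266092011%2Cn%3A172659%2Cn%3A6459737011&bbn=172659&ie=UTF8&qid=1547749731&rnid=172659",
  "OLED TVs" => "https://www.amazon.com/s/ref=lp_172659_nr_n_1?fst=as%3Aoff&rh=n%3A172282%2Cn%3A%21493964%2Cn%3A1266092011%2Cn%3A172659%2Cn%3A6463520011&bbn=172659&ie=UTF8&qid=1547749731&rnid=172659"
}.freeze

# Build one page hash with the same keys the seeder uses.
def listing_page(category, url)
  {
    page_type: "listings",
    method: "GET",
    headers: { "User-Agent" => USER_AGENT },
    url: url,
    vars: { category: category }
  }
end

SEED_PAGES = CATEGORIES.map { |category, url| listing_page(category, url) }

# Inside seeder/seeder.rb, the hashes would instead be appended to the
# reserved `pages` array:
#   SEED_PAGES.each { |page| pages << page }
SEED_PAGES.each { |page| puts "#{page[:vars][:category]}: #{page[:url]}" }
```

Adding a third category then only requires one more entry in the lookup table.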
Now we can commit this seeder to our git repository with the following commands:
$ git add .
$ git commit -m 'created a seeder file'
DataHen scrapers live in git repositories, so we need a remote repository to push ours to. Bitbucket offers free git repositories. Create a Bitbucket account and then a new repository here: https://bitbucket.org/repo/create. Use the git repo address from Bitbucket and push your scraper with the following commands (replace <username> with your Bitbucket username):
$ git remote add origin git@bitbucket.org:<username>/amazon-tvs.git
$ git push -u origin master
We will need a config file to tell DataHen where to find our files. Create a config.yaml file in the root project directory with the following content:
seeder:
  file: ./seeder/seeder.rb
  disabled: false # Optional. Set it to true if you want to disable execution of this file
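Because the nesting in config.yaml matters, it can help to check the file locally before committing it. The sketch below is only a sanity check written with Ruby's standard YAML library; it is not a DataHen command. The config text is embedded in the script so that it runs on its own, and the names `CONFIG_TEXT`, `CONFIG`, and `SEEDER` are hypothetical.

```ruby
require "yaml"

# The same content as config.yaml, embedded here so the sketch is self-contained.
# In a project you could read the real file with File.read("config.yaml") instead.
CONFIG_TEXT = <<~TEXT
  seeder:
    file: ./seeder/seeder.rb
    disabled: false # Optional. Set it to true if you want to disable execution of this file
TEXT

CONFIG = YAML.safe_load(CONFIG_TEXT)

# fetch raises KeyError if the top-level "seeder" key is missing.
SEEDER = CONFIG.fetch("seeder")

puts "seeder file:    #{SEEDER["file"]}"
puts "seeder enabled: #{!SEEDER["disabled"]}"
puts "file exists:    #{File.exist?(SEEDER["file"])}"
```

If “file” and “disabled” are not indented under “seeder”, the parsed “seeder” value is empty and the script fails when it looks up the file path, which is a quick way to catch an indentation mistake.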
Commit this config file to git and push it to Bitbucket:
$ git add .
$ git commit -m 'add config.yaml file'
$ git push origin master
We can now create a scraper and run it on DataHen. Substitute your git repo address (it should end in .git) in the following command, which will create a scraper called “amazon-tvs”:
$ datahen scraper create amazon-tvs git@bitbucket.org:<username>/amazon-tvs.git --workers 1
Next, we need to deploy from your remote Git repository onto DataHen:
$ datahen scraper deploy amazon-tvs
After deploying we can start the scraper with the following command:
$ datahen scraper start amazon-tvs
Starting a new scraper will create a new scrape job and run it. Wait a minute and then check the status of this job with the following command:
$ datahen scraper stats amazon-tvs
You should see something similar to the following:
{
  "job_id": 70, # Job ID
  "pages": 0, # How many pages in the scrape job
  "fetched_pages": 2, # Number of fetched pages
  "to_fetch": 0, # Pages that need to be fetched
  "fetching_failed": 0, # Pages that failed fetching
  "fetched_from_web": 2, # Pages that were fetched from Web
  "fetched_from_cache": 0, # Pages that were fetched from the shared Cache
  "parsed_pages": 0, # Pages that have been parsed by parsing script
  "to_parse": 1, # Pages that need to be parsed
  "parsing_failed": 0, # Pages that failed parsing
  "outputs": 0, # Outputs of the scrape
  "output_collections": 0, # Output collections
  "workers": 1, # How many workers are used in this scrape job
  "time_stamp": "2019-02-01T22:09:57.956158Z"
}
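If you want to read these numbers from a script rather than by eye, the counters can be summarized with Ruby's standard JSON library. The sketch below assumes the stats are available as plain JSON; the sample text is the output above with the explanatory comments removed, and the names `STATS_TEXT`, `STATS`, `PENDING`, and `FAILED` are hypothetical.

```ruby
require "json"

# Sample stats copied from the tutorial output, without the "#" annotations,
# which are not valid JSON.
STATS_TEXT = <<~TEXT
  {
    "job_id": 70,
    "pages": 0,
    "fetched_pages": 2,
    "to_fetch": 0,
    "fetching_failed": 0,
    "fetched_from_web": 2,
    "fetched_from_cache": 0,
    "parsed_pages": 0,
    "to_parse": 1,
    "parsing_failed": 0,
    "outputs": 0,
    "output_collections": 0,
    "workers": 1,
    "time_stamp": "2019-02-01T22:09:57.956158Z"
  }
TEXT

STATS = JSON.parse(STATS_TEXT)

# Work still waiting in the queue, and work that has failed so far.
PENDING = STATS["to_fetch"] + STATS["to_parse"]
FAILED = STATS["fetching_failed"] + STATS["parsing_failed"]

puts "fetched: #{STATS["fetched_pages"]}, pending: #{PENDING}, failed: #{FAILED}"
```

A non-zero failed count would be the first thing to investigate before moving on to writing parsers.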
The “fetched_pages” value indicates that our scraper has successfully fetched the two pages we added in the seeder. Next, we will look at creating parsers to add more links and extract data in Part III.