Every DataHen scraper requires a seeder script, which tells the scraper which pages to start scraping. A seeder script is a Ruby file that loads URLs into a reserved variable called “pages.” First, create a directory for our seeder script:
$ mkdir seeder
Next, create a file called “seeder.rb” inside this seeder directory with the following code:
pages << {
  page_type: "listings",
  method: "GET",
  headers: {"User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"},
  url: "https://www.walmart.com/browse/movies-tv-shows/4096?facet=new_releases:Last+90+Days",
  fetch_type: "browser"
}
In the Ruby script above, we are seeding a link to Walmart’s most recent movie releases from the last 90 days. Please note that “pages” is a reserved variable: it is an array representing the pages you want to seed. Let’s go through the other values in detail.
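Because “pages” is an array, a seeder can enqueue any number of pages by appending more hashes. Here is a minimal sketch, assuming we wanted to seed several category listing pages at once (the category IDs and URLs below are placeholders for illustration, not part of this tutorial):

# Hypothetical sketch: seed multiple category listing pages in a loop.
# The category IDs and resulting URLs are placeholders only.
category_ids = [4096, 1234]

category_ids.each do |id|
  pages << {
    page_type: "listings",
    method: "GET",
    url: "https://www.walmart.com/browse/movies-tv-shows/#{id}",
    fetch_type: "browser"
  }
end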
The “page_type” setting determines which parser script will process the page. Later we will create a Ruby parser script called “listings.”
The “method” is the type of HTTP request we want to make. In this example, we are making a simple “GET” request, which is what your browser would make if you were viewing this URL.
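“GET” is not the only option; pages can use other HTTP methods as well. As a hedged sketch (the URL, body, and page type below are hypothetical, and this assumes the page hash accepts a “body” field for the request payload), a seeded “POST” request might look like this:

# Hypothetical sketch of seeding a POST request; the url, body, and
# page_type are placeholders, and the "body" field is an assumption.
pages << {
  page_type: "search_results",
  method: "POST",
  headers: {"Content-Type" => "application/x-www-form-urlencoded"},
  body: "query=movies&page=1",
  url: "https://example.com/search"
}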
For the “headers” setting, we are setting a “User-Agent,” which is a string that identifies a browser. Whenever you access a website, your browser includes a “User-Agent” header so the website knows how to render the page you request. By including a “User-Agent” string, we avoid having the Walmart website think we are a scraping bot and block our requests. You can also leave the “headers” setting out completely, and DataHen will submit a randomly selected “User-Agent” with each page request. These randomly selected “User-Agents” are all valid strings from the major browsers (Chrome, Firefox, and Internet Explorer), so there is no need to worry if you leave this out.
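The headers hash is not limited to “User-Agent”; you can pass any standard HTTP header the site expects. A minimal sketch adding an “Accept-Language” header alongside the “User-Agent” (the extra header and its value are illustrative):

# Sketch: the headers hash can carry any standard HTTP headers.
# The Accept-Language header here is an illustrative addition.
pages << {
  page_type: "listings",
  method: "GET",
  headers: {
    "User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    "Accept-Language" => "en-US,en;q=0.9"
  },
  url: "https://www.walmart.com/browse/movies-tv-shows/4096?facet=new_releases:Last+90+Days",
  fetch_type: "browser"
}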
The “fetch_type” value is set to “browser,” which makes DataHen use a headless browser for the request. A headless browser simulates a real browser and will even execute the JavaScript on the page. Walmart uses JavaScript to render its pages, so if we don’t use the “browser” fetch type we won’t be able to retrieve the data we want.
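For sites that serve plain server-rendered HTML, you can omit “fetch_type” and DataHen will perform a standard (non-browser) HTTP fetch instead. A hedged sketch (the URL is a placeholder, not part of this scraper):

# Sketch: omitting fetch_type falls back to a standard, non-browser fetch.
# The URL below is a placeholder for a server-rendered page.
pages << {
  page_type: "listings",
  method: "GET",
  url: "https://example.com/server-rendered-page"
}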
Now that we have created a seeder script, we can try it out to check for syntax errors. Run the following command from the root of your project directory:
$ datahen seeder try seeder/seeder.rb
You should see the following output:
Trying seeder script
=========== Seeding Script Executed ===========
----------- New Pages to Enqueue: -----------
[
  {
    "page_type": "listings",
    "method": "GET",
    "headers": {
      "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
    },
    "url": "https://www.walmart.com/browse/movies-tv-shows/4096?facet=new_releases:Last+90+Days",
    "fetch_type": "browser",
    "force_fetch": true
  }
]
Notice that the output also includes a “force_fetch” field, which tells DataHen to fetch the page even if it already exists in the shared cache. Now we can commit this seeder to our git repository with the following commands:
$ git add .
$ git commit -m 'created a seeder file'
DataHen scrapers live in Git repositories, so we will first need to create one. Bitbucket offers free Git repositories. Create a Bitbucket account and then a new repository here: https://bitbucket.org/repo/create. Use the Git repo address from Bitbucket and push your scraper with the following commands (replace <username> with your Bitbucket username):
$ git remote add origin [email protected]:<username>/walmart-movies.git
$ git push -u origin master
We will need a config file to tell DataHen where to find our files. Create a config.yaml file in the root project directory with the following content:
seeder:
  file: ./seeder/seeder.rb
  disabled: false # Optional. Set it to true if you want to disable execution of this file
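As a preview of Part III, parsers are registered in this same config file by mapping a page type to a parser script. Here is a hedged sketch of what we will add later (the parser path is an assumption until we create the file; for now, only the seeder section is needed):

seeder:
  file: ./seeder/seeder.rb
  disabled: false
parsers: # Sketch for Part III; the parser file does not exist yet
  - page_type: listings
    file: ./parsers/listings.rb
    disabled: false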
Commit this config file to git, and push it to Bitbucket:
$ git add .
$ git commit -m 'add config.yaml file'
$ git push origin master
We can now create a scraper and run it on DataHen. Substitute your Git repo address (it should end in .git) in the following command, which creates a scraper called “walmart-movies” that uses one worker:
$ datahen scraper create walmart-movies [email protected]:<username>/walmart-movies.git --workers 1
Next, we need to deploy the scraper from the remote Git repository onto DataHen:
$ datahen scraper deploy walmart-movies
After deploying we can start the scraper with the following command:
$ datahen scraper start walmart-movies
Starting a new scraper will create a new scrape job and run it. Wait a minute and then check the status of this job with the following command:
$ datahen scraper stats walmart-movies
You should see something similar to the following:
{
  "job_id": 70,            # Job ID
  "pages": 1,              # How many pages are in the scrape job
  "fetched_pages": 1,      # Number of fetched pages
  "to_fetch": 0,           # Pages that need to be fetched
  "fetching_failed": 0,    # Pages that failed fetching
  "fetched_from_web": 1,   # Pages that were fetched from the Web
  "fetched_from_cache": 0, # Pages that were fetched from the shared cache
  "parsed_pages": 0,       # Pages that have been parsed by a parsing script
  "to_parse": 1,           # Pages that need to be parsed
  "parsing_failed": 0,     # Pages that failed parsing
  "outputs": 0,            # Outputs of the scrape
  "output_collections": 0, # Output collections
  "workers": 1,            # How many workers are used in this scrape job
  "time_stamp": "2019-02-01T22:09:57.956158Z"
}
The “fetched_pages” value of 1 indicates that the page we seeded has been fetched successfully. Next, we will look at creating parsers to extract data in Part III.