Every DataHen scraper requires a seeder script, which tells the scraper which pages to start scraping. A seeder script is a Ruby file that loads URLs into a reserved array called “pages.” First, create a directory for our seeder script:
$ mkdir seeder
Next, create a file called “seeder.rb” inside this seeder directory with the following code:
pages << {
  page_type: "listings",
  method: "GET",
  headers: {"User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"},
  url: "https://www.aliexpress.com/category/100003109/women-clothing.html",
  vars: {
    category: "Women's clothing"
  }
}
In the Ruby script above, we are seeding a link to the “Women’s Clothing” category on AliExpress.com. Please note that “pages” is a reserved variable: it is an array representing the pages you want to seed. Let’s go through the other values in detail.
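Because “pages” is just an array, a seeder can push as many page hashes onto it as you like. For example, a seeder that also enqueues a second category might look like the sketch below (the second category URL and ID are purely illustrative and not part of this tutorial):

pages << {
  page_type: "listings",
  method: "GET",
  url: "https://www.aliexpress.com/category/100003109/women-clothing.html",
  vars: { category: "Women's clothing" }
}

# Hypothetical second category, shown only to illustrate seeding multiple pages
pages << {
  page_type: "listings",
  method: "GET",
  url: "https://www.aliexpress.com/category/100003084/men-clothing.html",
  vars: { category: "Men's clothing" }
}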
The “page_type” setting determines which parser script will process the page. Later we will create a Ruby parser script called “listings.”
The “method” is the type of HTTP request we want to make. In this example, we are doing a simple “GET” request, which is what your browser would make if you were viewing this URL.
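A plain “GET” is all this scraper needs, but other HTTP methods can be seeded as well. As a rough, hypothetical sketch (the endpoint, form data, and “body” field below are assumptions for illustration only, not part of this tutorial):

# Hypothetical POST example; the URL, page_type, and body are placeholders
pages << {
  page_type: "search_results",
  method: "POST",
  url: "https://www.example.com/search",
  body: "query=dresses&page=1"
}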
For the “headers” setting we are setting a “User-Agent,” which is a string that identifies a browser. Whenever you access a website, your browser includes a “User-Agent” header so the website knows how to render the page you request. By including a “User-Agent” string, we avoid having the AliExpress website think we are a scraping bot and block our requests. You can also leave the “headers” setting out completely, and DataHen will randomly select a “User-Agent” for you and submit it with each page request. The “User-Agents” that are randomly selected are all valid and come from the major browsers (Chrome, Firefox, and Internet Explorer), so there is no need to worry if you leave this out.
The “vars” parameter lets you pass user-defined variables along with a page; here we are passing the “Women's clothing” category. We will be able to access and save this value in the “listings” parser designated by the “page_type” value. You can pass whatever information you want through “vars.”
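As a preview of Part III, the “listings” parser will be able to read this value back out. Assuming DataHen exposes the current page to parser scripts through a reserved “page” variable and collects results through a reserved “outputs” array (both are assumptions here; the real parser code comes later), that could look roughly like:

# parsers/listings.rb (sketch only; assumes reserved "page" and "outputs" variables)
category = page['vars']['category']   # => "Women's clothing"

outputs << {
  _collection: "listings",
  category: category
}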
Now that we have created a seeder script we can try it out to see if there are any syntax errors. Run the following command from the root of your project directory:
$ datahen seeder try seeder/seeder.rb
You should see the following output:
Trying seeder script
=========== Seeding Script Executed ===========
----------- New Pages to Enqueue: -----------
[
  {
    "page_type": "listings",
    "method": "GET",
    "headers": {
      "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
    },
    "url": "https://www.aliexpress.com/category/100003109/women-clothing.html",
    "vars": {
      "category": "Women's clothing"
    }
  }
]
Now we can commit this seeder to our git repository with the following commands:
$ git add .
$ git commit -m 'created a seeder file'
DataHen scrapers live in git repositories, so we will need a remote repository to push to. Bitbucket offers free git repositories. Create a Bitbucket account and then a new repository here: https://bitbucket.org/repo/create. Use the git repo address from Bitbucket and push your scraper with the following commands (replace <username> with your Bitbucket username):
$ git remote add origin git@bitbucket.org:<username>/ali-express.git
$ git push -u origin master
We will need a config file to tell DataHen where to find our files. Create a config.yaml file in the root project directory with the following content:
seeder:
  file: ./seeder/seeder.rb
  disabled: false # Optional. Set it to true if you want to disable execution of this file
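For reference, once we add the “listings” parser in Part III, this same config file will also need to tell DataHen where that parser script lives. A sketch of what the expanded config might look like (the parser file path is an assumption until we actually create the parser):

seeder:
  file: ./seeder/seeder.rb
parsers:
  - page_type: listings
    file: ./parsers/listings.rb

For now, the seeder-only config above is all we need.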
Commit this config file to git and push it to Bitbucket:
$ git add .
$ git commit -m 'add config.yaml file'
$ git push origin master
We can now create a scraper and run it on DataHen. Substitute your own git repo address (it should end in .git) in the following command, which creates a scraper called “ali-express”:
$ datahen scraper create ali-express git@bitbucket.org:<username>/ali-express.git --workers 1
Next, we need to deploy the code from our remote git repository onto DataHen:
$ datahen scraper deploy ali-express
After deploying we can start the scraper with the following command:
$ datahen scraper start ali-express
Starting a new scraper will create a new scrape job and run it. Wait a minute and then check the status of this job with the following command:
$ datahen scraper stats ali-express
You should see something similar to the following:
{
  "job_id": 83,              # Job ID
  "pages": 0,                # How many pages are in the scrape job
  "fetched_pages": 1,        # Number of fetched pages
  "to_fetch": 0,             # Pages that need to be fetched
  "fetching_failed": 0,      # Pages that failed fetching
  "fetched_from_web": 1,     # Pages that were fetched from the web
  "fetched_from_cache": 0,   # Pages that were fetched from the shared cache
  "parsed_pages": 0,         # Pages that have been parsed by a parser script
  "to_parse": 0,             # Pages that need to be parsed
  "parsing_failed": 0,       # Pages that failed parsing
  "outputs": 0,              # Outputs of the scrape
  "output_collections": 0,   # Output collections
  "workers": 1,              # How many workers are used in this scrape job
  "time_stamp": "2019-02-23T22:09:57.956158Z"
}
The “fetched_pages” value of 1 indicates that the page we seeded has been successfully fetched. In Part III we will create parsers that extract product data from these pages and enqueue more pages.