How to Web Scrape Walmart with Ruby and Nokogiri Part 2: Seeders
Every DataHen scraper requires a seeder script, which tells the scraper which pages to start scraping. A seeder script is a Ruby file that loads URLs into a variable called “pages.” First, create a directory for our seeder script:
$ mkdir seeder
Next, create a file called “seeder.rb” inside this seeder directory with the following code:
pages << {
  page_type: "listings",
  method: "GET",
  headers: {"User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"},
  url: "https://www.walmart.com/browse/movies-tv-shows/4096?facet=new_releases:Last+90+Days",
  fetch_type: "browser"
}
In the Ruby script above, we are seeding a link to Walmart’s most recent movie releases from the last 90 days. Please note that “pages” is a reserved variable: it is an array representing the pages you want to seed. Let’s go through the other values in detail.
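Because “pages” is an ordinary Ruby array, you can push as many entries onto it as you like. Here is a minimal sketch of seeding several category pages at once; the second URL is hypothetical and only there for illustration:
category_urls = [
  "https://www.walmart.com/browse/movies-tv-shows/4096?facet=new_releases:Last+90+Days",
  "https://www.walmart.com/browse/movies-tv-shows/new-releases/4096_595607" # hypothetical second category
]
category_urls.each do |url|
  pages << {
    page_type: "listings",
    method: "GET",
    url: url,
    fetch_type: "browser"
  }
end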
The “page_type” setting determines which parser script will handle the page. Later we will create a Ruby parser script called “listings.”
The “method” is the type of HTTP request we want to make. In this example, we are doing a simple “GET” request, which is what your browser would make if you were viewing this URL.
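GET is all we need for this scraper, but a seed entry can describe other request types as well. The sketch below is purely illustrative: the endpoint, the “search” page type, and the assumption that a “body” field is forwarded as the request payload are not part of this tutorial:
# Hypothetical POST seed; the endpoint, page_type, and "body" field
# are assumptions made for illustration only.
pages << {
  page_type: "search",      # would need its own parser script
  method: "POST",
  headers: {"Content-Type" => "application/x-www-form-urlencoded"},
  url: "https://example.com/search",
  body: "query=new+releases",
  fetch_type: "browser"
}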
For the “headers” setting we are including a “User-Agent,” which is a string that identifies a browser. Whenever you access a website, your browser sends a “User-Agent” so the website knows how to render the page you request. By including a “User-Agent” string, we avoid having the Walmart website decide we are a scraping bot and block our requests. You can also leave the “headers” setting out completely, and DataHen will submit a random “User-Agent” with each page request. The randomly selected “User-Agents” are all valid strings from the main browsers (Chrome, Firefox, and Internet Explorer), so there is no need to worry if you leave this out.
The “fetch_type” value is set to “browser,” which tells DataHen to use a headless browser for the request. A headless browser simulates a real browser and even executes the JavaScript on the page. Walmart uses JavaScript to render its pages, so if we don’t use the “browser” fetch type we won’t be able to retrieve the data we want.
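Putting the last two points together, here is a sketch of the same seed with the “headers” key omitted, leaving the User-Agent to DataHen’s random selection:
pages << {
  page_type: "listings",
  method: "GET",
  # no headers key: DataHen submits a random, valid User-Agent for us
  url: "https://www.walmart.com/browse/movies-tv-shows/4096?facet=new_releases:Last+90+Days",
  fetch_type: "browser" # still required, since Walmart renders its pages with JavaScript
}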
Now that we have created a seeder script, we can try it out to see if there are any syntax errors. Run the following command from the root of your project directory:
$ datahen seeder try seeder/seeder.rb
You should see the following output:
Trying seeder script
=========== Seeding Script Executed ===========
----------- New Pages to Enqueue: -----------
[
  {
    "page_type": "listings",
    "method": "GET",
    "headers": {
      "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
    },
    "url": "https://www.walmart.com/browse/movies-tv-shows/4096?facet=new_releases:Last+90+Days",
    "fetch_type": "browser",
    "force_fetch": true
  }
]
Now we can commit this seeder to our git repository with the following commands:
$ git add .
$ git commit -m 'created a seeder file'
DataHen scrapers live in Git repositories, so we need a remote repository to push to. Bitbucket offers free Git repositories. Create a Bitbucket account and then a new repository here: https://bitbucket.org/repo/create. Use the Git repo address from Bitbucket and push your scraper with the following commands (replace <username> with your Bitbucket username):
$ git remote add origin git@bitbucket.org:<username>/walmart-movies.git
$ git push -u origin master
We will need a config file to tell DataHen where to find our files. Create a config.yaml file in the root project directory with the following content:
seeder:
  file: ./seeder/seeder.rb
  disabled: false # Optional. Set it to true if you want to disable execution of this file
Commit this config file on git, and push it to Bitbucket:
$ git add .
$ git commit -m 'add config.yaml file'
$ git push origin master
We can now create a scraper and run it on DataHen. In the following command, which creates a scraper called “walmart-movies,” replace the Git repo address with your own (it should end in .git):
$ datahen scraper create walmart-movies git@bitbucket.org:<username>/walmart-movies.git --workers 1
Next, we deploy the code from the remote Git repository onto DataHen:
$ datahen scraper deploy walmart-movies
After deploying, we can start the scraper with the following command:
$ datahen scraper start walmart-movies
Starting a new scraper will create a new scrape job and run it. Wait a minute and then check the status of this job with the following command:
$ datahen scraper stats walmart-movies
You should see something similar to the following:
{
  "job_id": 70,             # Job ID
  "pages": 1,               # How many pages are in the scrape job
  "fetched_pages": 1,       # Number of fetched pages
  "to_fetch": 0,            # Pages that need to be fetched
  "fetching_failed": 0,     # Pages that failed fetching
  "fetched_from_web": 1,    # Pages that were fetched from the web
  "fetched_from_cache": 0,  # Pages that were fetched from the shared cache
  "parsed_pages": 0,        # Pages that have been parsed by a parsing script
  "to_parse": 1,            # Pages that need to be parsed
  "parsing_failed": 0,      # Pages that failed parsing
  "outputs": 0,             # Outputs of the scrape
  "output_collections": 0,  # Output collections
  "workers": 1,             # How many workers are used in this scrape job
  "time_stamp": "2019-02-01T22:09:57.956158Z"
}
The “fetched_pages” value of 1 indicates that our scraper has successfully fetched the page we seeded, and the “to_parse” value of 1 shows that page is now waiting for a parser. Next, we will look at creating parsers to extract data in Part 3.