Now that we have our seeder working, the next step is to create a script that finds and enqueues all the Walmart movie pages, which we will then parse for details such as movie titles, prices, and publishers. Create a folder called “parsers” in our project root directory:
$ mkdir parsers
Next, create a file called “listings.rb” inside this “parsers” folder. Since we set the “page_type” to “listings” in our seeder, the seeded page and any other pages with this “page_type” will be run against this “listings” parser. This is where we will write the code to extract and enqueue the links to the movies, as well as links to more listing pages from the pagination, that is, the numbered links (Previous, 1, 2, 3, …) at the bottom of a page that lead to more results. First, add the following line to the top of this “listings.rb” file:
nokogiri = Nokogiri.HTML(content)
The “content” variable is a reserved word that contains the HTML content of the actual page. With this line we load the HTML into Nokogiri so that we can search it easily. Next we are going to extract the movie links from this page. Copy and paste the following below the line you just created:
products = nokogiri.css('#searchProductResult li')
products.each do |product|
  href = product.at_css('a.product-title-link')['href']
  url = URI.join('https://www.walmart.com', href).to_s
  pages << {
    url: url,
    page_type: 'products',
    fetch_type: 'browser',
    vars: {}
  }
end
Let’s dive into this code a bit. In the first line we tell Nokogiri to find all “li” elements inside the element with “id” equal to “searchProductResult.” Then we loop through these “li” elements and extract the “href” from each one by finding the link inside it with the “product-title-link” class. Each “href” is a relative link, meaning it is relative to the root domain and does not include the domain in the url. To create a full url that DataHen can scrape, we prepend “https://www.walmart.com” to each relative href using “URI.join.” Once we have the full url for each movie, we enqueue it by passing it to the “pages” variable. Note that we also set the “fetch_type” to “browser” here, as these movie pages require JavaScript to render as well. We set the “page_type” to “products” and will soon create a “products” parser to extract our desired movie info.
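If you are curious how “URI.join” resolves a relative href, here is a quick sanity check you can run in irb (the href value below is hypothetical):
require 'uri'

# join a relative href onto the root domain to get an absolute url
URI.join('https://www.walmart.com', '/ip/Some-Movie/12345').to_s
# => "https://www.walmart.com/ip/Some-Movie/12345"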
Now that we have enqueued movie urls from the first seeded listing page, we need to add more listing pages from the pagination at the bottom so we can get every movie url available. Paste the following code below the code you just added.
pagination_links = nokogiri.css('ul.paginator-list li a')
pagination_links.each do |link|
  url = URI.join('https://www.walmart.com', link['href']).to_s
  pages << {
    url: url,
    page_type: 'listings',
    fetch_type: 'browser',
    vars: {}
  }
end
Basically this code finds all links inside the “li” elements of the “ul” element with the “paginator-list” class. We use “URI.join” to create the full urls from the “href” values, which are also relative links. We pass these urls to the “pages” variable and set the “page_type” to “listings,” so each url we enqueue will be parsed by this same parser that we are creating now. Only unique urls are saved, which lets us follow all the pagination links and collect all the movie links on each page without enqueuing duplicates.
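If you want to guard against the occasional matched element that is missing an “href” attribute, a slightly more defensive version of the loop might look like the following. This is just a sketch; the version above works as-is on the current markup:
pagination_links = nokogiri.css('ul.paginator-list li a')
pagination_links.each do |link|
  # skip any matched element without a usable href (a defensive assumption)
  next if link['href'].nil? || link['href'].empty?
  url = URI.join('https://www.walmart.com', link['href']).to_s
  pages << {
    url: url,
    page_type: 'listings',
    fetch_type: 'browser',
    vars: {}
  }
end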
Let’s try out our “listings” parser to see if there are any errors. First, let’s see what the seeded page looks like by running the following command:
$ datahen scraper page list walmart-movies
The output should look something like the following:
[
  {
    "gid": "www.walmart.com-4aa9b6bd1f2717409c22d58c4870471e", # Global ID
    "job_id": 70,
    "page_type": "listings",
    "method": "GET",
    "url": "https://www.walmart.com/browse/movies-tv-shows/4096?facet=new_releases:Last+90+Days",
    "effective_url": "https://www.walmart.com/browse/movies-tv-shows/4096?facet=new_releases:Last+90+Days",
    "headers": "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    ...
  }
]
We want to take the “gid” (Global ID) and try our parser against this specific page to check the output. Copy the “gid” and replace it in the following command:
$ datahen parser try walmart-movies parsers/listings.rb www.walmart.com-4aa9b6bd1f2717409c22d58c4870471e
You should see output similar to the following:
Trying parser script
getting Job Page
=========== Parsing Executed ===========
----------- New Pages to Enqueue: -----------
[
  {
    "url": "https://www.walmart.com/ip/Incredibles-2-Walmart-Exclusive-Blu-ray-DVD/511079861",
    "page_type": "products",
    "fetch_type": "browser",
    "vars": {}
  },
  {
    "url": "https://www.walmart.com/ip/The-Spy-Who-Dumped-Me-Blu-ray-DVD-Digital-Copy/501684830",
    "page_type": "products",
    "fetch_type": "browser",
    "vars": {}
  },
  …
]
Next, add the parsers section to the config.yaml file so that it looks like the following:
seeder:
  file: ./seeder/seeder.rb
parsers:
  - page_type: listings
    file: ./parsers/listings.rb
Commit this to Git, and push it to your remote Git repository.
$ git add .
$ git commit -m 'add listings parser to config'
$ git push origin master
Now that your remote Git repository is up to date, we can deploy the scraper again:
$ datahen scraper deploy walmart-movies
Since we have already started a scrape job, DataHen will automatically download and execute your new listings parser. Wait a few minutes and then run the following “stats” command to see the progress:
$ datahen scraper stats walmart-movies
Now that our listings parser is working we can move on to extracting movie data by creating a “products” parser. Create a file called “products.rb” inside the “parsers” folder with the following code:
nokogiri = Nokogiri.HTML(content)
# initialize an empty hash
product = {}
# extract title
product['title'] = nokogiri.at_css('.ProductTitle').text.strip
# extract current price
product['current_price'] = nokogiri.at_css('span.price-characteristic').attr('content').to_f
# extract original price
original_price_div = nokogiri.at_css('.price-old')
original_price = original_price_div ? original_price_div.text.strip.gsub('$','').to_f : nil
product['original_price'] = original_price == 0.0 ? nil : original_price
# extract rating
rating = nokogiri.at_css('.hiddenStarLabel .seo-avg-rating').text.strip.to_f
product['rating'] = rating == 0 ? nil : rating
# extract number of reviews
review_text = nokogiri.at_css('.stars-reviews-count-node').text.strip
product['reviews_count'] = review_text =~ /reviews/ ? review_text.split(' ').first.to_i : 0
# extract publisher
product['publisher'] = nokogiri.at_css('a.prod-brandName').text.strip
# extract walmart item number
product['walmart_number'] = nokogiri.at_css('.prod-productsecondaryinformation .wm-item-number').text.split('#').last.strip
# extract product image
product['img_url'] = nokogiri.at_css('.prod-hero-image img')['src'].split('?').first
# extract product categories
product['categories'] = nokogiri.css('.breadcrumb-list li').collect { |li| li.text.strip.gsub('/','') }
# specify the collection where this record will be stored
product['_collection'] = 'products'
# save the product to the job's outputs
outputs << product
Let’s go through this code line by line:
nokogiri = Nokogiri.HTML(content)
We take the HTML of the page, which is stored in the “content” variable, and parse it with Nokogiri so that we can search it.
product = {}
We then initialize an empty hash. This is where we will store the data that we extract.
product['title'] = nokogiri.at_css('.ProductTitle').text.strip
First we extract the title. This line finds the html element with the class “ProductTitle,” extracts just the text inside it, and strips any leading and trailing whitespace.
product['current_price'] = nokogiri.at_css('span.price-characteristic').attr('content').to_f
Next we extract the current price of the dvd. The current price is inside a “span” element with the class “price-characteristic.” We grab this element, get the value of its attribute named “content,” and convert that value to a floating-point number, which allows us to save decimal values.
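If you want to be extra careful, a nil-safe variant might look like the following. This is only a sketch, under the assumption that some product pages could omit this span; the selector itself is unchanged from above:
# guard against pages where the price span is missing (an assumption)
price_node = nokogiri.at_css('span.price-characteristic')
product['current_price'] = price_node ? price_node.attr('content').to_f : nil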
original_price_div = nokogiri.at_css('.price-old')
original_price = original_price_div ? original_price_div.text.strip.gsub('$','').to_f : nil
product['original_price'] = original_price == 0.0 ? nil : original_price
The original price is a little trickier only because sometimes it does not exist. If a dvd has not been discounted, it has only one price and the original price is not present. To handle this scenario we grab the element with the class “price-old” and check whether it exists. If it does, we get its text, remove any dollar signs, and convert it to a floating-point number. We also check whether the original price is 0, which sometimes occurs, and if so we do not save it.
rating = nokogiri.at_css('.hiddenStarLabel .seo-avg-rating').text.strip.to_f
product['rating'] = rating == 0 ? nil : rating
To save the rating we look for the element with class “seo-avg-rating” inside another element with class “hiddenStarLabel.” Multiple elements with the same class often appear in html, so it is a good habit to combine classes to narrow the match. We extract the text and convert it to a floating-point number. A rating of 0 means the dvd has not been rated yet, so we do not save it in that case.
review_text = nokogiri.at_css('.stars-reviews-count-node').text.strip
product['reviews_count'] = review_text =~ /reviews/ ? review_text.split(' ').first.to_i : 0
Next is extracting the number of reviews. We grab the element with the class “stars-reviews-count-node” and the text inside it. The review count appears as a number followed by the word “reviews,” such as “4 reviews.” We check for the word “reviews” using a regular expression, and if it appears we use “split” to break the string on spaces into an array. For our example, this would look like the following: [“4”, “reviews”]. The number we want is the first value of the array, so we take that value and convert it to an integer.
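Here is the same chain of calls run step by step on a hypothetical input string:
text = '4 reviews'
text =~ /reviews/           # => 2 (index of the match, which is truthy)
text.split(' ')             # => ["4", "reviews"]
text.split(' ').first.to_i  # => 4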
product['publisher'] = nokogiri.at_css('a.prod-brandName').text.strip
Getting the publisher is pretty straightforward. We grab the link element with class “prod-brandName” and extract the text inside.
product['walmart_number'] = nokogiri.at_css('.prod-productsecondaryinformation .wm-item-number').text.split('#').last.strip
To extract the Walmart number we get the “div” with the class name “wm-item-number” inside the div with class “prod-productsecondaryinformation,” extract the text inside, split by the “#” sign and take the last element of the array.
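To see what the split does, here is a worked example; the exact label text is our assumption of how the page formats the item number:
label = 'Walmart # 567391814'     # hypothetical label text
label.split('#')                  # => ["Walmart ", " 567391814"]
label.split('#').last.strip       # => "567391814"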
product['img_url'] = nokogiri.at_css('.prod-hero-image img')['src'].split('?').first
Next is the dvd product image. We grab the “img” inside the “div” with class “prod-hero-image,” and take its “src” value. This url has some extra parameters after the question mark that we don’t need, so we split using the question mark and use the first element in the array to return a clean url.
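For example, with a made-up image url:
src = 'https://example.com/images/movie-cover.jpeg?odnHeight=450&odnWidth=450'
src.split('?').first  # => "https://example.com/images/movie-cover.jpeg"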
product['categories'] = nokogiri.css('.breadcrumb-list li').collect{|li| li.text.strip.gsub('/','') }
The dvd categories are next. First we grab the “li” elements inside the element with class “breadcrumb-list.” Then we use “collect” to iterate through each “li” element, grabbing just the text of each one and removing any forward slashes.
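As a quick illustration of the “collect” call, with made-up breadcrumb texts:
# gsub removes slashes from each breadcrumb text (note the leftover spaces)
['Movies & TV Shows', 'Movies', 'Drama / Romance'].collect { |t| t.gsub('/', '') }
# => ["Movies & TV Shows", "Movies", "Drama  Romance"]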
product['_collection'] = 'products'
This line sets the collection name to “products” using the reserved “_collection” field. Job outputs are stored in collections, and specifying the collection will allow us to query and export the data later.
outputs << product
Finally, we save the dvd product info to the “outputs” variable which is an array for saving job output.
Now we can update our config.yaml file by specifying our products parser. The config.yaml file should look like the following:
seeder:
  file: ./seeder/seeder.rb
parsers:
  - page_type: listings
    file: ./parsers/listings.rb
  - page_type: products
    file: ./parsers/products.rb
Commit this to Git, and push it to your remote Git repository.
$ git add .
$ git commit -m 'add products parser to config'
$ git push origin master
Now that we have pushed our parser to our Git repository, we can deploy the scraper again:
$ datahen scraper deploy walmart-movies
DataHen will automatically download this new parser and start parsing all the pages with “page_type” set to “products.” There are a lot of movie pages to parse, so this will take some time. One thing we can do to speed this up is to increase the number of browser workers, which allows us to process multiple pages in parallel. We can increase the browser workers with the following commands:
$ datahen scraper job cancel walmart-movies
$ datahen scraper job update walmart-movies --browsers 5
$ datahen scraper job resume walmart-movies
You can keep running the “stats” command from earlier to monitor progress. Once the scraper has finished parsing all pages, we can export and download the data. We will learn more about exporting in Part IV.