The next step is to create a script that will find and enqueue all the individual Ali Express product pages within the “Women’s Clothing” category, which we will then use to parse out the product details. Create a folder called “parsers” in our project root directory:
$ mkdir parsers
Next create a file called “listings.rb” inside this “parsers” folder. Because we set the “page_type” to “listings” in our seeder, the seeded page and any other page with this “listings” “page_type” will be run against this “listings” parser. This is where we will write the code to extract and enqueue the links to the women’s clothing products, as well as links to more listing pages from the pagination: the numbered links at the bottom of a page that lead to more pages (Previous, 1, 2, 3, …). First add the following line to the top of this “listings.rb” file:
nokogiri = Nokogiri.HTML(content)
The “content” variable is a reserved word that contains the html content of the actual page. With this line we are loading the html into Nokogiri so that we can search it easily. Next we are going to extract the product links from this page. Copy and paste the following below the line you just created:
#load products
products = nokogiri.css('#list-items li')
products.each do |product|
  a_element = product.at_css('a.product')
  if a_element
    url = URI.join('https:', a_element['href']).to_s.split('?').first
    if url =~ /\Ahttps?:\/\//i
      pages << {
        url: url,
        page_type: 'products',
        fetch_type: 'browser',
        force_fetch: true,
        vars: {
          category: page['vars']['category'],
          url: url
        }
      }
    end
  end
end
Let’s take a look at this code. In the first line we are telling Nokogiri to find all “li” elements inside the element with an “id” equal to “list-items.” This gives us list elements that each contain a link to an Ali Express product. Next we loop through each of these list elements and look for a link inside with the class “product.” This is the link to the Ali Express product page, which has all the product info that we want. We check that the link element exists and, if it does, we extract the “href.” In this case the “href” does not include the “https:” scheme, so we use URI.join to join “https:” with the “href” value, and call “to_s” to convert the result to a string. This url string contains a question mark followed by many link parameters. We don’t need these parameters to visit the product page, so we split the string on the question mark, which creates an array with everything before the question mark as the first value and everything after it as the second value. We then set the “url” variable to the first value, giving us a unique product url. Finally, we check that the url starts with “http://” or “https://” before enqueuing it.
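To make the url cleanup concrete, here is a minimal standalone sketch that uses only Ruby’s standard library. The “href” value is made up for illustration, and the sketch assumes the listing links are protocol-relative (they start with “//”), which is what the missing “https” suggests.

```ruby
require 'uri'

# Hypothetical protocol-relative href, similar in shape to a listing link.
href = '//www.aliexpress.com/item/Example-Blouse/1234567890.html?spm=abc&ws_ab_test=xyz'

# Resolve the href against the bare "https:" scheme to get an absolute url.
absolute = URI.join('https:', href).to_s

# Keep only the part before the first question mark.
clean_url = absolute.split('?').first

puts absolute
puts clean_url
```

Running this outside of DataHen is a quick way to check how a particular “href” would be normalized before it is enqueued.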
Once we have the individual url for each Ali Express product we enqueue it by passing it to the “pages” variable. We set the “page_type” to “products” and will soon create a “products” parser to extract the clothing product details. We also pass the “category” and “url” to the product parser by utilizing the “vars” hash. Any values we set here will be passed to the corresponding parser. In this case, we are scraping just the “Women’s Clothing” category, so we want to make sure we send that category (which we originally set in the seeder) so it can be saved for each product.
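You may notice that we enqueue pages with symbol keys (“vars:”, “category:”) but read them back with string keys (page['vars']['category']). The sketch below shows one way this can come about. It assumes the enqueued page is serialized to JSON between the two parsers, which is an assumption about the platform for illustration, not something this tutorial confirms.

```ruby
require 'json'

# A page hash as a listings parser might enqueue it (symbol keys).
enqueued = {
  url: 'https://www.aliexpress.com/item/Example-Blouse/1234567890.html',
  page_type: 'products',
  vars: { category: "Women's clothing" }
}

# Assumption: the page is stored as JSON before the next parser sees it.
page = JSON.parse(JSON.generate(enqueued))

# After the round trip the keys are strings, matching page['vars']['category'].
category = page['vars']['category']
puts category
```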
One thing to note here is that we set the “fetch_type” to “browser.” This means these product pages will be fetched using a headless browser, which runs JavaScript, instead of a regular worker. We need a headless browser in this case because some elements on the page, such as the shipping info, are rendered using JavaScript and will not show otherwise.
Now that we have enqueued women’s clothing product urls from the first seeded listing page, we need to add more listing pages from the pagination at the bottom so we can get every women’s clothing url available. Paste the following code below the code you just added.
#load paginated links
pagination_links = nokogiri.css('#pagination-bottom a')
pagination_links.each do |link|
  l_val = link.text.strip
  if l_val !~ /next|previous/i && l_val.to_i < 11 #limit pagination to 10 pages
    url = URI.join('https:', link['href']).to_s.split('?').first
    pages << {
      url: url,
      page_type: 'listings',
      vars: {
        category: page['vars']['category']
      }
    }
  end
end
Here we are enqueuing all “paginated” links by getting every link (“a” element) inside the element with “id” “pagination-bottom.” We then iterate through each “a” element and get its inner text. We want to ignore the links that say “Previous” and “Next” because they have a different format, so we use a regular expression to skip any link whose text contains either word, which leaves only the numbered links. After that we remove any unwanted parameters after the question mark by splitting into an array and using the first element of the array. We also limit the number of pages by only keeping pagination links numbered below 11. If you want to scrape more products you can remove the “&& l_val.to_i < 11” from the if statement. We then enqueue this link, set the “page_type” to “listings” and set the “category” to the “category” that was passed in through the page variables.
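The filter condition can be tried on its own, without Nokogiri. The following sketch applies the same test to a made-up list of link texts. It also shows why the regular expression is needed: “Previous”.to_i is 0, so the numeric check alone would let that link through.

```ruby
# Hypothetical link texts, as they might appear in a pagination bar.
labels = ['Previous', '1', '2', '10', '11', '12', 'Next']

# Same condition as the parser: skip Previous/Next and keep pages 1 to 10.
kept = labels.select do |label|
  text = label.strip
  text !~ /next|previous/i && text.to_i < 11
end

# Non-numeric text converts to 0, which would pass the "< 11" check by itself.
previous_as_int = 'Previous'.to_i

puts kept.inspect
puts previous_as_int
```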
Let’s try out our “listings” parser to see if there are any errors. First, let’s see what the seeded pages look like by running the following command:
$ datahen scraper page list ali-express
The output should look something like the following:
[
  {
    "gid": "www.aliexpress.com-4dc6c3d39d4f47339c9af33e4834c0b9", # Global ID
    "job_id": 1287,
    "page_type": "products",
    "method": "GET",
    "url": "https://www.aliexpress.com/item/Womens-Casual-Short-Sleeve-Tops-Summer-Vogue-Slogan-Printed-Tee-shirt-femme-fashion-harajuku-tumblr-Blouse/32888976324.html",
    ...
  },
  ...
]
We want to take the “gid” (Global ID) and try our parser against this specific page to check the output. Copy the “gid” and replace it in the following command:
$ datahen parser try ali-express parsers/listings.rb www.aliexpress.com-ad87e7ec9081623f1bb4ce74815435ea
You should see output similar to the following:
Trying parser script
getting Job Page
=========== Parsing Executed ===========
----------- New Pages to Enqueue: -----------
[
  {
    "url": "https://www.aliexpress.com/item/CUHAKCI-Summer-Women-Sunflower-Bird-Chiffon-Blouse-Stripe-Plaid-Shirt-Cross-Love-Blouse-Short-Sleeve-Blue/32818498674.html",
    "page_type": "products",
    "fetch_type": "browser",
    "force_fetch": true,
    "vars": {
      "category": "Women's clothing",
      "url": "https://www.aliexpress.com/item/CUHAKCI-Summer-Women-Sunflower-Bird-Chiffon-Blouse-Stripe-Plaid-Shirt-Cross-Love-Blouse-Short-Sleeve-Blue/32818498674.html"
    }
  },
  {
    "url": "https://www.aliexpress.com/item/Women-Ladies-Summer-Vest-Top-Sleeveless-V-neck-women-Shirt-Blouse-Casual-Tank-Shirt-Tops/32968090644.html",
    "page_type": "products",
    "fetch_type": "browser",
    "force_fetch": true,
    "vars": {
      "category": "Women's clothing",
      "url": "https://www.aliexpress.com/item/Women-Ladies-Summer-Vest-Top-Sleeveless-V-neck-women-Shirt-Blouse-Casual-Tank-Shirt-Tops/32968090644.html"
    }
  },
  …
]
Next, add the parsers section to the config.yaml file so that it looks like the following:
seeder:
  file: ./seeder/seeder.rb
parsers:
  - page_type: listings
    file: ./parsers/listings.rb
Commit this to Git, and push it to your remote Git repository.
$ git add .
$ git commit -m 'add listings parser to config'
$ git push origin master
Now that your remote git repository is up to date, we can deploy the scraper again:
$ datahen scraper deploy ali-express
Since we have already started a scrape job, DataHen will automatically download and execute your new listings parser. Wait a few minutes and then run the following “stats” command to see the progress:
$ datahen scraper stats ali-express
Now that our listings parser is working we can move on to extracting individual product data by creating a “products” parser. Create a file called “products.rb” inside the “parsers” folder with the following code:
# load the html into Nokogiri
nokogiri = Nokogiri.HTML(content)
# initialize an empty hash
product = {}
#save the url
product['url'] = page['vars']['url']
#save the category
product['category'] = page['vars']['category']
#extract title
product['title'] = nokogiri.at_css('.product-name').text.strip
#extract product image
product['image_url'] = nokogiri.at_css('#magnifier img')['src']
#extract discount price
discount_element = nokogiri.at_css('span#j-sku-discount-price')
if discount_element
  discount_low_price = discount_element.css('span').find{|span| span['itemprop'] == 'lowPrice' }
  if discount_low_price
    product['discount_low_price'] = discount_element.css('span').find{|span| span['itemprop'] == 'lowPrice' }.text.to_f
    product['discount_high_price'] = discount_element.css('span').find{|span| span['itemprop'] == 'highPrice' }.text.to_f
  else
    product['discount_price'] = discount_element.text.to_f
  end
end
#extract original price
price_element = nokogiri.at_css('#j-sku-price')
if price_element
  price_array = price_element.text.strip.split('-')
  if price_array.size > 1
    product['original_low_price'], product['original_high_price'] = price_array.map{|price| price.strip.to_f }
  else
    product['original_price'] = price_array.first.strip.to_f
  end
end
#extract categories
breadcrumb_categories = nokogiri.at_css('.ui-breadcrumb').text.strip
categories = breadcrumb_categories.split('>').map{|category| category.strip }
categories.delete('Home')
product['categories'] = categories
#extract SKUs
skus_element = nokogiri.css('ul.sku-attr-list').find{|ul| ul['data-sku-prop-id'] == '14' }
if skus_element
  skus = skus_element.css('a').collect{|a| a['title'] }
  product['skus'] = skus
end
#extract sizes
sizes_element = nokogiri.css('ul.sku-attr-list').find{|ul| ul['data-sku-prop-id'] == '5' }
if sizes_element
  sizes = sizes_element.css('a').collect{|a| a.text.strip }
  product['sizes'] = sizes
end
#extract rating and reviews
rating_element = nokogiri.at_css('span.ui-rating-star')
if rating_element
  rating_value = rating_element.css('span').find{|span| span['itemprop'] == 'ratingValue' }
  product['rating'] = rating_value.text.strip.to_f if rating_value
  review_count = rating_element.css('span').find{|span| span['itemprop'] == 'reviewCount' }
  product['reviews_count'] = review_count.text.strip.to_i if review_count
end
#extract orders count
order_count_element = nokogiri.at_css('#j-order-num')
if order_count_element
  product['orders_count'] = order_count_element.text.strip.split(' ').first.to_i
end
#extract shipping info
shipping_element = nokogiri.at_css('dl#j-product-shipping')
if shipping_element
  product['shipping_info'] = shipping_element.text.strip.gsub(/\s\s+/, ' ')
end
#extract return policy
return_element = nokogiri.at_css('#j-seller-promise-list')
if return_element
  product['return_policy'] = return_element.at_css('.s-serve').text.strip
end
#extract guarantee
guarantee_element = nokogiri.at_css('#serve-guarantees-detail')
if guarantee_element
  product['guarantee'] = guarantee_element.text.strip.gsub(/\s\s+/, ' ')
end
# specify the collection where this record will be stored
product['_collection'] = "products"
# save the product to the job’s outputs
outputs << product
Let’s go through this code line by line:
nokogiri = Nokogiri.HTML(content)
We are taking the html string of the page that is inside the “content” variable and parsing it with Nokogiri so that we can search it.
product = {}
We then initialize an empty hash. This is where we will store the data that we extract.
product['url'] = page['vars']['url']
Here we are saving the url that we passed in through our “listings” parser.
product['category'] = page['vars']['category']
Similar to the “url” we are saving the Ali Express “category” which we also passed in to the page “vars” in the “listings” parser (originally set in our “seeder”). For this example, this value will be equal to “Women’s clothing.”
product['title'] = nokogiri.at_css('.product-name').text.strip
To get the product title we look for a “div” with class “product-name” and get the text inside.
product['image_url'] = nokogiri.at_css('#magnifier img')['src']
After the title we are extracting the product image url by finding the image element inside the element with the id “magnifier.” We then get the “src” value from this image element.
discount_element = nokogiri.at_css('span#j-sku-discount-price')
if discount_element
  discount_low_price = discount_element.css('span').find{|span| span['itemprop'] == 'lowPrice' }
  if discount_low_price
    product['discount_low_price'] = discount_element.css('span').find{|span| span['itemprop'] == 'lowPrice' }.text.to_f
    product['discount_high_price'] = discount_element.css('span').find{|span| span['itemprop'] == 'highPrice' }.text.to_f
  else
    product['discount_price'] = discount_element.text.to_f
  end
end
Next we are extracting discount price info. We first look for a “span” element with the “id” “j-sku-discount-price,” and if it exists we check for discounted low and high pricing. We do this by seeing if a span exists inside this “span” element with an “itemprop” set to “lowPrice.” If the “lowPrice” “span” exists, we assume a “highPrice” element exists as well. We take the text inside both the “lowPrice” and “highPrice” elements and convert them to floats to preserve any decimal values. If the low and high price elements do not exist, then there is just one price, which we get from the inner text of the “span” element.
price_element = nokogiri.at_css('#j-sku-price')
if price_element
  price_array = price_element.text.strip.split('-')
  if price_array.size > 1
    product['original_low_price'], product['original_high_price'] = price_array.map{|price| price.strip.to_f }
  else
    product['original_price'] = price_array.first.strip.to_f
  end
end
Next we are looking at getting the original price. First we get the “div” element with id “j-sku-price.” We split the price into an array using the dash character. If there are multiple prices the array will have multiple elements which are both the original low price and original high price. If there are multiple elements, then we use map to convert each price to a float object. If there are not multiple elements in the array, then there is just one original price, which we also convert to a float object.
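The split-and-convert logic is easy to test in isolation. Here is a minimal sketch that wraps it in a method. The method name and the sample strings are made up for this example and are not part of the parser. The last case is a reminder that “to_f” returns 0.0 when a string does not start with a number, which would matter if the element’s text ever included a currency prefix.

```ruby
# Turn a price string into a hash, mirroring the branch in the parser.
# The method name and the sample inputs are hypothetical.
def parse_original_price(text)
  price_array = text.strip.split('-')
  if price_array.size > 1
    low, high = price_array.map { |price| price.strip.to_f }
    { 'original_low_price' => low, 'original_high_price' => high }
  else
    { 'original_price' => price_array.first.strip.to_f }
  end
end

range = parse_original_price(' 12.50 - 18.99 ')
single = parse_original_price('9.99')
# to_f gives 0.0 here because the string starts with letters, not digits.
with_prefix = parse_original_price('US $9.99')

puts range.inspect
puts single.inspect
puts with_prefix.inspect
```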
breadcrumb_categories = nokogiri.at_css('.ui-breadcrumb').text.strip
categories = breadcrumb_categories.split('>').map{|category| category.strip }
categories.delete('Home')
product['categories'] = categories
Here we are extracting the product categories. We get the element with the class “ui-breadcrumb” and get its inner text. This gives us a string with each category separated by a “>” character. We split the string on this character, which gives us an array with each category as a value. We also use map to go through each element in the array and “strip” away any whitespace. After that we delete “Home” from the array and save the array to the product hash.
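As a quick illustration, the same steps can be run on a plain string. The breadcrumb text below is invented for the example; the real text depends on the product page.

```ruby
# Hypothetical breadcrumb text, including the surrounding whitespace
# that typically comes with an element's inner text.
breadcrumb_text = "\n  Home > All Categories > Women's Clothing > Blouses & Shirts  \n"

# Split on ">" and trim each piece, as the parser does.
categories = breadcrumb_text.strip.split('>').map { |category| category.strip }

# Drop the "Home" entry, which is not a real category.
categories.delete('Home')

puts categories.inspect
```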
skus_element = nokogiri.css('ul.sku-attr-list').find{|ul| ul['data-sku-prop-id'] == '14' }
if skus_element
  skus = skus_element.css('a').collect{|a| a['title'] }
  product['skus'] = skus
end
To get the SKUs (SKU stands for Stock Keeping Unit) we get the “ul” (unordered list) elements with class “sku-attr-list” and find the one that has the “data-sku-prop-id” equal to 14. If it exists we grab all the “a” elements (which are links) inside, and collect only the title from each one. We then save this array to the product hash.
sizes_element = nokogiri.css('ul.sku-attr-list').find{|ul| ul['data-sku-prop-id'] == '5' }
if sizes_element
  sizes = sizes_element.css('a').collect{|a| a.text.strip }
  product['sizes'] = sizes
end
Next we get the sizes, which is very similar to how we got the SKUs. We get the “ul” element with class “sku-attr-list” and the “data-sku-prop-id” value equal to 5. If this element exists, we collect the inner text of each link inside this element and save the resulting array.
rating_element = nokogiri.at_css('span.ui-rating-star')
if rating_element
  rating_value = rating_element.css('span').find{|span| span['itemprop'] == 'ratingValue' }
  product['rating'] = rating_value.text.strip.to_f if rating_value
  review_count = rating_element.css('span').find{|span| span['itemprop'] == 'reviewCount' }
  product['reviews_count'] = review_count.text.strip.to_i if review_count
end
After sizes we are extracting the rating and number of reviews. We first find the “span” element with the class “ui-rating-star.” If this element exists then we look for a span inside this “ui-rating-star” element with “itemprop” equal to “ratingValue.” We get the text inside, convert to a float, and save. For the number of reviews, we look for the “span” element with “itemprop” set to “reviewCount,” convert the text inside to an integer and save.
order_count_element = nokogiri.at_css('#j-order-num')
if order_count_element
  product['orders_count'] = order_count_element.text.strip.split(' ').first.to_i
end
The order count is next and is pretty straightforward. We look for the element with the “id” “j-order-num” and, if it exists, grab the text inside, which looks something like “7100 orders.” We just want the count, so we split the text on whitespace, which creates an array that looks like [“7100”, “orders”]. Then we take the first value of the array and convert it to an integer.
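Here is a small sketch of that conversion on sample strings. The strings are made up. The second case is only a caveat worth knowing about: “to_i” stops at the first non-digit character, so if the page ever showed a thousands separator the count would be truncated unless the separator is removed first.

```ruby
# Hypothetical inner text of the order count element.
order_text = "  7100 orders \n"
orders_count = order_text.strip.split(' ').first.to_i

# to_i stops reading at the comma, so "7,100" becomes 7.
with_comma = '7,100 orders'.split(' ').first.to_i

# Removing the separator first gives the full number.
cleaned = '7,100 orders'.split(' ').first.delete(',').to_i

puts orders_count
puts with_comma
puts cleaned
```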
shipping_element = nokogiri.at_css('dl#j-product-shipping')
if shipping_element
  product['shipping_info'] = shipping_element.text.strip.gsub(/\s\s+/, ' ')
end
For the shipping info we look for the “dl” element with “id” “j-product-shipping.” If this element exists, we get the inner text, use a regular expression to replace every run of two or more whitespace characters (such as a new line followed by indentation) with a single space, and save the resulting text.
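The following sketch shows what this regular expression does to a made-up block of text. Note that it only matches runs of two or more whitespace characters, so a lone new line between two words is left in place. The second expression is shown purely for comparison and is not what the parser uses.

```ruby
# Hypothetical inner text with a new line plus indentation, extra spaces,
# and a single new line directly between two words.
raw = "Shipping:\n      Free Shipping   to United States\nvia ePacket"

# The parser's expression: collapse runs of 2+ whitespace characters.
collapsed = raw.strip.gsub(/\s\s+/, ' ')

# For comparison only: collapse every run of whitespace, including single ones.
collapsed_all = raw.strip.gsub(/\s+/, ' ')

puts collapsed.inspect
puts collapsed_all.inspect
```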
return_element = nokogiri.at_css('#j-seller-promise-list')
if return_element
  product['return_policy'] = return_element.at_css('.s-serve').text.strip
end
For the return policy, we look for an element with “id” “j-seller-promise-list” and, if it exists, we look for an element with “class” “s-serve” inside, extract its inner text and save.
guarantee_element = nokogiri.at_css('#serve-guarantees-detail')
if guarantee_element
  product['guarantee'] = guarantee_element.text.strip.gsub(/\s\s+/, ' ')
end
Next is the guarantee, which is similar to the shipping info. To get the guarantee, we look for an element with “id” “serve-guarantees-detail.” If the element exists, we get the inner text, use the same regular expression to collapse repeated whitespace, and save.
product['_collection'] = 'products'
This line sets the “collection” name to “products.” Job outputs are stored in collections and specifying the collection will allow us to query and export the data later.
outputs << product
Finally, we save the Ali Express product hash to the “outputs” variable which is an array for saving job output.
Now we can update our config.yaml file by specifying our products parser. The config.yaml file should look like the following:
seeder:
  file: ./seeder/seeder.rb
parsers:
  - page_type: listings
    file: ./parsers/listings.rb
  - page_type: products
    file: ./parsers/products.rb
Commit this to Git, and push it to your remote Git repository.
$ git add .
$ git commit -m 'add products parser to config'
$ git push origin master
Now that we have updated our git repository, we can deploy the scraper again:
$ datahen scraper deploy ali-express
DataHen will automatically download this new parser and start to parse all the pages with “page_type” set to “products.” You can keep running the “stats” command from earlier to see how the progress is going. This scraper will take some time to finish as there are a lot of products in this Women’s Clothing category. One thing we can do to speed up the scraper is to add more workers which will allow us to process multiple pages at a time in parallel. We can increase the number of browser workers to 5 with the following commands:
$ datahen scraper job cancel ali-express
$ datahen scraper job update ali-express --browsers 5
$ datahen scraper job resume ali-express
You can keep running the stats command to check on how the scraper is performing over time. Once the scraper has finished parsing all the women’s clothing product pages we can export and download the data, which we will do in Part 4.