Scrape Amazon Reviews using Python
Web scraping is one of the best ways to automate the collection of large amounts of data. The program used to scrape a website is called a web crawler.
The origin of scraping goes back to the time when the internet was a collection of File Transfer Protocol (FTP) sites. Searching for information on these sites was daunting: users had to navigate each site manually to find specific shared files. As a solution to this problem, automated programs called web crawlers or bots were created.
However, scraping technology has improved a lot since then. There are many web crawlers built with different functionalities targeting specific tasks. While some of them are simple programs, others are very complicated.
In this article, I will show you how to scrape review data from Amazon using Scrapy, a free and open-source framework written in Python and built specifically for web scraping.
Installation and Setup
Make sure that you have Python installed, as it is required for Scrapy. There are two ways of installing Scrapy:
- We can install Scrapy using pip, the package management tool for Python.
$ pip install Scrapy
- In case you are using Anaconda, you can install it using conda.
conda install -c conda-forge scrapy
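Either way, once the installation finishes, you can confirm that Scrapy is available by printing its version from the command line:
$ scrapy version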
Create a Project for Scraping
First, create a folder in which you are going to create your application. Inside this folder, run the command below to create a Scrapy project.
scrapy startproject Scrape_AmazonReviews
Once you create the project, open the folder as a workspace in your favorite editor. You will have a folder structure as described below.
The folder structure of the crawling project
Next, we’ll create a Spider, which is the program that actually does the scraping. It crawls a given URL and parses the data described by XPath or CSS selectors.
In this example, we’ll create a Spider to extract review data from the Amazon web page into a spreadsheet-friendly format (CSV).
To create a Spider, we need to provide the URL to be crawled.
scrapy genspider spiderName your-amazon-link-here
In this example, we are extracting the reviews for a particular product called "World Tech Toys Elite Mini Orion Spy Drone". The review URL for this product is “https://www.amazon.com/product-reviews/B01IO1VPYG/ref=cm_cr_arp_d_viewopt_sr?pageNumber=”.
scrapy genspider AmazonReviews https://www.amazon.com/product-reviews/B01IO1VPYG/ref=cm_cr_arp_d_viewopt_sr?pageNumber=
After we create the Spider, let's take a look at the folder structure and supporting files.
├── scrapy.cfg # deploy configuration file
└── Scrape_AmazonReviews # project's Python module, we just created
├── __init__.py
├── items.py # project items definition file
├── middlewares.py # project middlewares file
├── pipelines.py # project pipeline file
├── settings.py # project settings file
└── spiders # a directory where spiders are located
├── __init__.py
├── __init__.py
        └── AmazonReviews.py # spider we just created
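The items.py file can optionally hold a definition of the fields we plan to scrape. Below is a minimal sketch for our review fields; the item and field names are our own choice and are not generated by Scrapy (in this tutorial we will simply yield plain dictionaries from the Spider, so this step is optional).
# items.py - optional definition of the review fields we plan to scrape
import scrapy

class AmazonReviewItem(scrapy.Item):
    name = scrapy.Field()     # reviewer's display name
    title = scrapy.Field()    # review title
    rating = scrapy.Field()   # star rating text, e.g. "5.0 out of 5 stars"
    comment = scrapy.Field()  # review body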
Once that’s done, we need to identify the patterns to be extracted from the web page before coding the Spider.
Identifying the Patterns from the Web Page
We’ll inspect the review page and identify the HTML elements and CSS classes that hold the reviewer names, titles, ratings, and comments.
Identifying the items and patterns to be scraped
Inspecting the elements and their class
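Before writing the Spider, it can help to try the selectors interactively in Scrapy's shell. A quick sketch is shown below; the class names are the ones visible in the screenshots above and may change whenever Amazon updates its markup.
scrapy shell "https://www.amazon.com/product-reviews/B01IO1VPYG/ref=cm_cr_arp_d_viewopt_sr?pageNumber=1"
>>> response.css('.a-profile-name::text').getall()   # reviewer names
>>> response.css('.review-title ::text').getall()    # review titles
>>> response.css('.review-rating ::text').getall()   # star ratings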
Skeleton of Spider
Once you create a Spider using a URL, you will have a basic skeleton of the Spider created in the folder path “Scrape_AmazonReviews\Scrape_AmazonReviews\spiders”.
# -*- coding: utf-8 -*-
import scrapy


class AmazonReviewsSpider(scrapy.Spider):
    name = "amazon_reviews"
    allowed_domains = ["https://www.amazon.com/product-reviews/B01IO1VPYG/ref=cm_cr_arp_d_viewpnt_lft?pageNumber="]
    start_urls = (
        'https://www.amazon.com/product-reviews/B01IO1VPYG/ref=cm_cr_arp_d_viewpnt_lft?pageNumber=/',
    )

    def parse(self, response):
        pass
Now, we’ll start extracting the data using the classes that display the review details, and we’ll make sure to paginate through the review pages to extract all the reviews.
First, we’ll set the base URL and add the number of pages to be crawled. Next, let’s extract the data using the classes we identified earlier.
Below is the code to extract the highlighted data.
# -*- coding: utf-8 -*-
# Importing the Scrapy library
import scrapy


# Creating a new class to implement the Spider
class AmazonReviewsSpider(scrapy.Spider):

    # Spider name
    name = 'amazon_reviews'

    # Domain names to scrape
    allowed_domains = ['amazon.com']

    # Base URL for the World Tech Toys Elite Mini Orion Spy Drone
    myBaseUrl = "https://www.amazon.com/product-reviews/B01IO1VPYG/ref=cm_cr_arp_d_viewopt_sr?pageNumber="
    start_urls = []

    # Creating the list of URLs to be scraped by appending the page number at the end of the base URL
    for i in range(1, 5):
        start_urls.append(myBaseUrl + str(i))

    # Defining a Scrapy parser
    def parse(self, response):
        # Get the review list
        data = response.css('#cm_cr-review_list')

        # Get the reviewer names
        name = data.css('.a-profile-name')

        # Get the review titles
        title = data.css('.review-title')

        # Get the ratings
        star_rating = data.css('.review-rating')

        # Get the users' comments
        comments = data.css('.review-text')

        count = 0

        # Combining the results
        for review in star_rating:
            yield {'Name': ''.join(name[count].xpath(".//text()").extract()),
                   'Title': ''.join(title[count].xpath(".//text()").extract()),
                   'Rating': ''.join(review.xpath('.//text()').extract()),
                   'Comment': ''.join(comments[count].xpath(".//text()").extract())
                   }
            count = count + 1
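Note that Amazon is quick to block requests that use Scrapy's default user agent, so before running the Spider you may need to tweak a few options in settings.py. The values below are only an illustrative sketch; adjust them as needed and respect the site's terms of use.
# settings.py - illustrative adjustments, not the defaults generated by Scrapy
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # browser-like user agent
ROBOTSTXT_OBEY = False  # the project template sets this to True by default
DOWNLOAD_DELAY = 2      # wait a couple of seconds between requests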
Extracting the Data into a File
Once you build the Spider successfully, you can save the extracted output using the runspider command. It runs the Spider and stores its output in a file. The runspider command supports output formats like CSV, XML, and JSON.
To use a specific format, you can pass ‘-t’ to set the output format, like below (here the output is written to standard output and redirected to a file).
scrapy runspider spiders/filename.py -t json -o - > amazonreviews.json
Here, we’ll extract the output into a `.csv` file. Open the Anaconda prompt and run the command below from the folder Scrape_AmazonReviews\Scrape_AmazonReviews.
scrapy runspider spiders/AmazonReviews.py -o output.csv
You will get the output in the folder Scrape_AmazonReviews\Scrape_AmazonReviews.
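To quickly sanity-check the results in Python, you can load the CSV with pandas, for example (this assumes pandas is installed; any CSV reader works just as well):
# Quick check of the scraped reviews
import pandas as pd

reviews = pd.read_csv('output.csv')
print(reviews.head())            # first few reviews
print(len(reviews), 'reviews')   # total number of reviews scraped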
Amazon Reviews for World Tech Toys Elite Mini Orion Spy Drone
Summary
That being said, Scrapy is an excellent tool for extracting selected data and storing it in the required format, and it lets us customize what we extract. Scrapy uses the Twisted asynchronous networking framework to connect to the given URLs: it issues GET requests, extracts the matching elements from the responses, and writes the extracted data out in the chosen output format. Using Scrapy will never be a disappointment.