Introduction to Web Scraping with Python and Scrapy
2020-12-03 18:02:02
Tested On
- Linux Ubuntu 20.04
- Windows 10
- macOS Catalina
Scrapy is a versatile, open-source Python framework for web scraping. Although best known for its web crawling features, Scrapy also provides APIs for extracting large amounts of data. Scrapy is easy to get running quickly, but because it is a 'batteries included' framework, its full feature set takes some time to learn.
This means a lot of out-of-the-box functionality ships with the framework. Scrapy generates a working project template for you and promotes DRY (Don't Repeat Yourself). Before we create our Scrapy project, we'll touch on what web scraping is and why Scrapy is a popular solution among web scrapers.
What is Web Scraping?
Web scraping, also called web crawling or web spidering, is a programming technique for extracting or collecting data from web pages. The collected data is then used for analysis or to inform business decisions.
Initially a skill set associated mainly with data scientists, web scraping has become quite popular among developers, and anyone with a solid programming background can learn how to extract data from websites. Python offers a variety of libraries for web scraping: Selenium, Beautiful Soup, MechanicalSoup, and Requests are just a few of them.
What Can You Web Scrape?
We can write a Scrapy program to extract any visible information on a web page. With more advanced programs, we can scrape data from behind login pages, handle auto-scrolling pages, and even try to bypass reCAPTCHA services.
Is Web Scraping Legal?
Yes, web scraping is legal, but many sites document their own scraping rules, instructions, and requirements in a robots.txt file. To read a website's robots.txt file, append "robots.txt" to its home URL, like this: https://codeboxsystems.com/robots.txt. IMDb, for example, has an explicit DISALLOW covering all forms of scraping, and its data can only be collected with express written consent. Wikipedia's robots.txt file outlines specific conditions under which a scraper may operate.
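If you want to check a robots.txt file programmatically rather than reading it by hand, Python's standard library includes a parser for exactly this. Below is a minimal sketch; the user agent string is a made-up placeholder, and the URLs just reuse the example above.
from urllib import robotparser

# Point the parser at the site's robots.txt file
rp = robotparser.RobotFileParser()
rp.set_url('https://codeboxsystems.com/robots.txt')
rp.read()

# Ask whether a given user agent may fetch a given URL
user_agent = 'my-scraper'  # placeholder user agent string
print(rp.can_fetch(user_agent, 'https://codeboxsystems.com/'))  # prints True or False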
Websites without robots.txt files will sometimes spell out their web scraping policy in their Terms and Conditions, and violating those terms can expose you to legal consequences. Some sites recognize the importance of web scraping and readily provide easy access to their data through APIs.
Why Scrapy?
What makes Scrapy such a popular library for web scraping?
- It is fast
- Can handle asynchronous requests (multiple requests, in parallel)
- Doesn't require much memory
- Comes with built-in components and functionality that make coding easier for the developer
- Cross-platform framework
- Scrapy with Splash (a JavaScript rendering service) can extract data from dynamic websites that rely on JavaScript
- Scraped data can be exported to JSON, CSV, or XML, or fed into databases through item pipelines
- The Selector functionality lets us scrape data with either XPath or CSS selectors (see the short sketch after this list)
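To give a feel for that last point, here is a small, self-contained sketch of Scrapy's Selector class applied to an HTML string, showing that the same elements can be reached with either CSS or XPath expressions. The HTML document is made up purely for illustration.
from scrapy.selector import Selector

# A tiny, made-up HTML document standing in for a downloaded page
html = """
<html>
  <body>
    <h1>Latest Prices</h1>
    <p class="price">19.99</p>
  </body>
</html>
"""

sel = Selector(text=html)

# Both expressions select the same <h1> text
print(sel.css('h1::text').get())       # 'Latest Prices'
print(sel.xpath('//h1/text()').get())  # 'Latest Prices'

# A CSS class selector and its XPath equivalent
print(sel.css('p.price::text').get())                  # '19.99'
print(sel.xpath('//p[@class="price"]/text()').get())   # '19.99'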
Getting Started With Scrapy
Requirements
Before we can work with Scrapy, we need to have a basic understanding of HTML and inspecting the DOM.
The importance of DOM knowledge cannot be overemphasized. Every web page has its own structure and its own set of tags for displaying information, and if we don't understand how that structure works, we will never be able to get the data we need.
How to Set Up a Project Skeleton
The official Scrapy documentation highly recommends installing Scrapy within a virtual environment so that it doesn't cause conflicts with Python packages already installed system-wide.
How to Create Python Project Files with Windows 10 PowerShell 2.0+
cd ~
New-Item -ItemType "directory" -Path ".\scrapy-projects"
cd scrapy-projects
virtualenv venv
.\venv\Scripts\activate
To verify that the virtual environment is active, make sure (venv) is in the PowerShell command prompt. For example, (venv) PS C:\Users\username\scrapy-projects>
How to Create Python Project Files with Linux Ubuntu 14.04+ or macOS
cd ~
mkdir scrapy-projects
cd scrapy-projects
virtualenv -p python3 venv
source venv/bin/activate
To verify that the virtual environment is active, make sure (venv) is in the terminal command prompt.
This will create the following files and folders, and activate the virtual environment.
▾ scrapy-projects/
  ▸ venv/
How to Install Scrapy
Note: We will be using Scrapy 2.4.1 for this tutorial, which requires Python 3.6 or above (Python 2.7 is supported only by the older Scrapy 1.x series).
To install Scrapy, make sure you're still inside the virtual environment, and run the following command with pip.
pip install scrapy==2.4.1
If you have Anaconda installed, you can install Scrapy with conda from the conda-forge channel instead. For Windows users, the official Scrapy documentation recommends using conda for installation to avoid most installation issues.
conda install -c conda-forge scrapy
Or simply,
conda install scrapy==2.4.1
To install Scrapy on Ubuntu (or Ubuntu-based) systems, we need to install these dependencies:
sudo apt-get install python3 python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
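Whichever installation route you took, you can confirm that Scrapy installed correctly by printing its version from Python. This is just a quick sanity check; the version string should match whatever you installed.
# Run inside the activated virtual environment
import scrapy

print(scrapy.__version__)  # e.g. 2.4.1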
How to Create a Scrapy Project
Now that we have Scrapy installed, double-check that the virtual environment is still activated, and run the following command from inside the scrapy-projects folder.
scrapy startproject sitescrape
After executing this command, you will notice Scrapy has created a directory called sitescrape inside scrapy-projects with the following files and folders. These are all autogenerated by Scrapy and are required to run the program.
▾ scrapy-projects/
  ▾ sitescrape/
    ▾ sitescrape/
      ▾ spiders/
        __init__.py
      __init__.py
      items.py
      middlewares.py
      pipelines.py
      settings.py
    scrapy.cfg
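One of the generated files worth opening right away is settings.py, which holds project-wide configuration. The excerpt below shows a few of the defaults a freshly generated project starts with (most of the generated settings are omitted here), including the ROBOTSTXT_OBEY option that ties back to the robots.txt discussion above.
# sitescrape/sitescrape/settings.py (excerpt)

BOT_NAME = 'sitescrape'

SPIDER_MODULES = ['sitescrape.spiders']
NEWSPIDER_MODULE = 'sitescrape.spiders'

# Respect each site's robots.txt rules by default
ROBOTSTXT_OBEY = True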
When creating Scrapy projects with the above command, feel free to replace sitescrape with whatever you prefer. We've chosen this name to indicate that we'll be scraping websites with this program. You should also notice the following prompt, after running this command.
You can start your first spider with:
cd sitescrape
scrapy genspider example example.com
If you decide to follow this prompt, Scrapy will add a scrapy-projects/sitescrape/sitescrape/spiders/example.py file to your project, with the following code:
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass
This sets up some code that we can use to get a scraper running against a URL. We'll explain what all of this code means in the next section. For now, just be aware that the boilerplate command scrapy genspider example example.com is available to you, and you can swap out the example spider name and example.com URL with the domain of the site you want to scrape. Also, notice where example.py is located in the following file tree. You are also free to manually add your own files and code if you want to skip the boilerplate command.
▾ scrapy-projects/
  ▾ sitescrape/
    ▾ sitescrape/
      ▸ __pycache__/
      ▾ spiders/
        ▸ __pycache__/
        __init__.py
        example.py
      __init__.py
      items.py
      middlewares.py
      pipelines.py
      settings.py
    scrapy.cfg
  ▸ venv/
For the purposes of this tutorial, we're going to continue with the boilerplate-generated example.py file and update it to parse the h1 tags on the page. Replace the contents of your example.py file with the following code:
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        for h1_text in response.xpath('//h1/text()'):
            yield {
                'text': h1_text.extract()
            }
Explanation of the Code
Line 1: Imports the scrapy module.
Line 4: Declares an ExampleSpider class as a subclass of scrapy.Spider. This allows ExampleSpider to inherit the properties and methods of the scrapy.Spider parent class.
Line 5: Declares the name of the spider, which Scrapy requires in order to identify it (this is the name you pass to scrapy crawl).
Line 6: According to the Scrapy docs, allowed_domains is "An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list (or their subdomains) won’t be followed if OffsiteMiddleware is enabled. Let’s say your target url is https://www.example.com/1.html, then add 'example.com' to the list."
Line 7: start_urls is the list of URLs for the Scrapy spider to begin crawling from.
Line 9: parse() is the default callback that fires after a request has completed. The downloaded page is processed and passed in as the response parameter, where further action can be taken.
Lines 10-13: In our example, we target the h1 tags with the response.xpath() selector, loop through the matching elements, and yield each heading's text as an item, which Scrapy writes to the log.
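As a side note, the same result can be achieved with a CSS selector and the newer .get()/.getall() methods, which the Scrapy documentation now favors over .extract(). This variation is functionally equivalent to the XPath version above:
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Same extraction as above, using a CSS selector and .getall()
        for h1_text in response.css('h1::text').getall():
            yield {'text': h1_text}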
How to Run Scrapy
Run your crawler from inside the sitescrape project directory (the one containing scrapy.cfg) with one of the following commands: scrapy crawl example or scrapy runspider sitescrape/spiders/example.py. Somewhere in the output log, you should see {'text': 'Example Domain'}, indicating that the h1 title tag was found and its text was returned.
2020-12-05 12:44:57 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: sitescrape)
...
2020-12-05 12:44:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com/> (referer: None)
2020-12-05 12:44:58 [scrapy.core.scraper] DEBUG: Scraped from <200 http://example.com/>
{'text': 'Example Domain'}
2020-12-05 12:44:58 [scrapy.core.engine] INFO: Closing spider (finished)
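If you'd rather save the scraped items to a file instead of just reading them in the log, Scrapy's feed exports can write the results directly from the command line. For example, the following command (the output filename is arbitrary) stores each yielded item in a JSON file in the project directory.
scrapy crawl example -o h1_titles.json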
Conclusion
In this tutorial, we took a brief look at what web scraping is and how Scrapy is used for it. At this point, we have everything we need to install Scrapy and start a project. We will dive deeper into creating more complex spiders in the next tutorial, so subscribe to get notified when it becomes available.
If you're interested in programs that carry out your computer tasks for you, take our Automation the Easy Way with Python course. This course teaches CSV and Excel file generation, API requests, website scraping, email delivery, task scheduling, and browser click, mouse, and keyboard automation. Automate your daily tasks, free up time, and get ahead, today.