Introduction to Web Scraping with Python and Scrapy

2020-12-03 18:02:02 | #programming #python #automation

Tested On

  • Linux Ubuntu 20.04
  • Windows 10
  • macOS Catalina

Scrapy is a versatile, open-source Python framework for web scraping. Although best known for its web crawling features, Scrapy also provides APIs for extracting large amounts of data. Scrapy is easy to get running quickly, but as a 'batteries included' framework, it comes with a steep learning curve.

This means a lot of out-of-the-box functionality is included in the framework. Scrapy ships with a working project template and promotes DRY (Don't Repeat Yourself) principles. Before we create our Scrapy project, we'll touch on what web scraping is and why Scrapy is a popular solution among web scrapers.

What is Web Scraping?

Also called web crawling or web spidering, web scraping is a programming technique for extracting or collecting data from web pages. The collected data can then be used for analysis or to inform business decisions.

Initially a skill set of data scientists, web scraping has become quite popular among developers in general. Anyone with a solid programming background can learn how to extract data from websites. Python offers a variety of libraries for web scraping: Selenium, Beautiful Soup, MechanicalSoup, and Requests are just a few of them.

What Can You Web Scrape?

We can write a Scrapy program to extract any visible information on a web page. With more advanced programs, we can scrape data from behind login pages, auto-scroll pages, and even attempt to bypass reCAPTCHA services.

Is Web Scraping Legal?

Yes. Web scraping is legal, but some sites document their own scraping rules, instructions, and requirements in a robots.txt file. To read a website's robots.txt file, append "/robots.txt" to its home URL, like this: https://codeboxsystems.com/robots.txt. IMDb's robots.txt, for example, explicitly disallows all forms of scraping, and its data can only be collected with express written consent. Wikipedia's robots.txt file outlines specific conditions under which a scraper may operate.
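
To give a sense of the format, here is a made-up robots.txt (the paths and bot name are purely illustrative; always read the actual file for the site you intend to scrape):

User-agent: *
Disallow: /admin/
Crawl-delay: 10

User-agent: BadBot
Disallow: /

Each User-agent block addresses a particular crawler ('*' means all of them), Disallow lists paths that may not be crawled, and Crawl-delay asks crawlers to wait between requests.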

Websites without robots.txt files will sometimes state their web scraping policy in their Terms and Conditions, and violating these T&Cs can carry legal consequences. Some sites recognize the importance of web scraping and readily provide easy access to their data through APIs.

Why Scrapy?

What makes Scrapy such a popular library for web scraping?

  • It is fast
  • Can handle asynchronous requests (multiple requests, in parallel)
  • Doesn't require much memory
  • Comes with built-in components and functionality that make coding easier for the developer
  • Cross-platform framework
  • Scrapy with Splash (a JavaScript rendering service) can extract data from dynamic websites that use JavaScript
  • Data scraped with Scrapy can be exported as JSON, CSV, or XML, or loaded into databases
  • The Selectors functionality allows us to extract data with either XPath or CSS selectors, as shown in the sketch after this list
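
Here's a quick taste of the selector API as a minimal, standalone sketch (the HTML fragment is made up for illustration; Selector is the same class Scrapy applies to responses):

from scrapy.selector import Selector

# A made-up HTML fragment for illustration
html = '<html><body><h1>Hello</h1><p class="intro">Welcome to Scrapy</p></body></html>'
sel = Selector(text=html)

# The same data extracted two ways: with XPath...
print(sel.xpath('//p[@class="intro"]/text()').get())  # Welcome to Scrapy
# ...and with an equivalent CSS selector
print(sel.css('p.intro::text').get())                 # Welcome to Scrapy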

Getting Started With Scrapy

Requirements

Before we can work with Scrapy, we need a basic understanding of HTML and how to inspect the DOM.

The importance of DOM knowledge cannot be overemphasized. Every web page has its own structure and uses particular tags to display information. If we don't understand how a page is structured, we will never be able to target the data we need.
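
For instance, in a simplified (made-up) page like the one below, a scraper that wants the title has to know it lives inside the h1 tag, while the body text lives in p tags within a div:

<html>
  <body>
    <h1>Article Title</h1>
    <div class="content">
      <p>First paragraph of the article...</p>
    </div>
  </body>
</html>

Your browser's developer tools (usually opened by right-clicking an element and choosing "Inspect") show exactly this structure for any live page.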

How to Set Up a Project Skeleton

The official Scrapy documentation highly recommends installing Scrapy within a virtual environment so it doesn't cause conflicts with already-installed Python system packages.
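
The steps below assume the virtualenv package is available. If it isn't, install it with pip first:

pip install virtualenv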

How to Create Python Project Files with Windows 10 PowerShell 2.0+

cd ~
New-Item -ItemType "directory" -Path ".\scrapy-projects"
cd scrapy-projects
virtualenv venv
.\venv\Scripts\activate

To verify that the virtual environment is active, make sure (venv) is in the PowerShell command prompt. For example, (venv) PS C:\Users\username\scrapy-projects>
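
Note: if PowerShell refuses to run the activation script because of its execution policy, you can relax the policy for the current user (a common workaround, not part of the official Scrapy setup):

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser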

How to Create Python Project Files with Linux Ubuntu 14.04+ or macOS

cd ~
mkdir scrapy-projects
cd scrapy-projects
virtualenv -p python3 venv
source venv/bin/activate

To verify that the virtual environment is active, make sure (venv) is in the terminal command prompt.

This will create the following files and folders, and activate the virtual environment.

▾ scrapy-projects/
  ▸ venv/

How to Install Scrapy

Note: Scrapy 2.x requires Python 3 (Scrapy 2.4 supports Python 3.6 and above; the Scrapy 1.x series was the last to support Python 2.7). We will be using Scrapy 2.4.1 for this tutorial.

To install Scrapy, make sure you're still inside the virtual environment, and run the following command with pip.

pip install scrapy==2.4.1

If you have Anaconda installed, you can install Scrapy with conda from the conda-forge channel. For Windows users, the official Scrapy documentation recommends conda for installation to avoid most installation issues.

conda install -c conda-forge scrapy

Or, to pin the version used in this tutorial:

conda install scrapy==2.4.1

To install Scrapy on Ubuntu (or Ubuntu-based) systems, we first need to install these system dependencies:

sudo apt-get install python3 python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
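
Whichever installation method you use, you can confirm that Scrapy is available by asking it for its version from inside the virtual environment:

scrapy version

This should print Scrapy 2.4.1 if the installation succeeded.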

How to Create a Scrapy Project

Now that we have Scrapy installed, double-check that the virtual environment is still activated, then run the following command from inside the scrapy-projects folder.

scrapy startproject sitescrape

After executing this command, you will notice that Scrapy has created a directory called sitescrape inside scrapy-projects, with the following files and folders. These are all autogenerated by Scrapy and are required to run the program.

▾ scrapy-projects/
  ▾ sitescrape/
    ▾ sitescrape/
      ▾ spiders/
          __init__.py
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
      scrapy.cfg

When creating Scrapy projects with the above command, feel free to replace sitescrape with whatever name you prefer. We've chosen this name to indicate that we'll be scraping websites with this program. You should also notice the following prompt after running this command.

You can start your first spider with:
    cd sitescrape
    scrapy genspider example example.com

If you decide to follow this prompt, Scrapy will add a scrapy-projects/sitescrape/sitescrape/spiders/example.py file to your project, with the following code:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass

This sets up some code that we can use to get a scraper running against a URL. We'll explain what all this code means in the next section. For now, just be aware that the boilerplate command scrapy genspider example example.com is available to you, and you can swap out the example spider name and example.com URL with the domain of the site you want to scrape. Also, notice where example.py is located in the following file tree. You are also free to manually add your own files/code if you want to skip the boilerplate command.

▾ scrapy-projects/
  ▾ sitescrape/
    ▾ sitescrape/
      ▸ __pycache__/
      ▾ spiders/
        ▸ __pycache__/
          __init__.py
          example.py
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
      scrapy.cfg
  ▸ venv/

For the purposes of this tutorial, we're going to continue with the boilerplate-generated example.py file and update it to parse h1 tags on the page. Update your example.py file to match the following code:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        for h1_text in response.xpath('//h1/text()'):
            yield {
                'text': h1_text.extract()
            }

Explanation of the Code

Line 1: Imports the scrapy module.

Line 4: Declares an ExampleSpider class as a subclass of scrapy.Spider. This allows ExampleSpider to inherit the properties and methods of the scrapy.Spider parent class.

Line 5: Declares the spider's name, which Scrapy requires in order to identify and run the spider.

Line 6: According to the Scrapy docs, allowed_domains is "An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list (or their subdomains) won’t be followed if OffsiteMiddleware is enabled. Let’s say your target url is https://www.example.com/1.html, then add 'example.com' to the list."

Line 7: start_urls is the list of URLs for the Scrapy spider to begin crawling from.

Line 9: parse() is the default callback that fires after a request has executed. The downloaded page is processed and passed in through the response parameter, where further action can be taken.

Lines 10-13: In our example, we target h1 tags with the response.xpath() selector, loop through each match, and yield its text as an item, which Scrapy writes to the console output.
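
As an aside, the same extraction could be written with a CSS selector instead of XPath; here's a minimal equivalent sketch of the parse method:

    def parse(self, response):
        # Same result as the XPath version: select the text of every h1 tag
        for h1_text in response.css('h1::text'):
            yield {
                'text': h1_text.get()  # .get() returns the matched text as a string
            }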

How to Run Scrapy

Run your crawler with one of the following commands: scrapy crawl example or scrapy runspider sitescrape/spiders/example.py. Somewhere in the output log, you should see {'text': 'Example Domain'}, indicating that the h1 tag was found and its text was returned.

2020-12-05 12:44:57 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: sitescrape)
...
2020-12-05 12:44:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com/> (referer: None)
2020-12-05 12:44:58 [scrapy.core.scraper] DEBUG: Scraped from <200 http://example.com/>
{'text': 'Example Domain'}
2020-12-05 12:44:58 [scrapy.core.engine] INFO: Closing spider (finished)
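
If you'd rather collect the items in a file than read them out of the log, Scrapy's feed exports can write them directly; for example, the following (using the -O overwrite flag available in Scrapy 2.1+) saves the results as JSON:

scrapy crawl example -O h1.json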

Conclusion

In this tutorial, we took a brief look at what web scraping is and how to use Scrapy for it. At this point, we have everything we need to install Scrapy and start a project. We will dive deeper into creating more complex spiders in the next tutorial, so subscribe to get notified when it becomes available.
