Introduction to web scraping

Web scraping is the process of extracting data from websites. Some data that is available on the web is presented in a format that makes it easier to collect and use it, for example in the form of downloadable comma-separated values (CSV) datasets that can then be imported in a spreadsheet or loaded into a data analysis script. Often however, even though it is publicly available, data is not readily available for reuse. For example it can be contained in a PDF, or a table on a website, or spread across multiple web pages.

There are a variety of ways to scrape a website to extract information for reuse. In its simplest form, this can be achieved by copying and pasting snippets from a web page, but this can be unpractical if there is a large amount of data to be extracted, or if it spread over a large number of pages. Instead, specialized tools and techniques can be used to automate this process, by defining what sites to visit, what information to look for, and whether data extraction should stop once the end of a page has been reached, or whether to follow hyperlinks and repeat the process recursively. Automating web scraping also allows to define whether the process should be run at regular intervals and capture changes in the data.

Prerequisites

As webscraping is a technique to extract data from web pages, it requires some understanding of the technologies that are used to display information on the web. This lesson therefore assumes that learners will have some familiarity with HTML and the Document Object Model (DOM).

The first part of this lesson will use browser extensions to introduce the concepts of web scraping as well as introduce the XPath syntax for selecting elements on a web page and requires no further specific knowledge. The second part will introduce the use of specialized libraries to scrape websites by writing custom computer programs and will require some familiarity with the Python programming language and object-oriented programming.

Software requirements

Refer to the Setup section to install the required software to follow along this lesson.

Under development

Please note that the contents of this lesson are still being actively developed. Any feedback is appreciated, please do not hesitate to contact the author or contribute to the lesson by forking it on GitHub.

Schedule

00:00 Introduction: What is web scraping? What is web scraping and why is it useful?
What are typical use cases for web scraping?
00:10 Selecting content on a web page with XPath How can I select a specific element on web page?
What is XPath and how can I use it?
00:55 Manually scrape data using browser extensions How can I get started scraping data off the web?
How can I use XPath to more accurately select what data to scrape?
02:00 Web scraping using Python and Scrapy How can scraping a web site be automated?
How can I setup a scraping project using the Scrapy framework for Python?
How do I tell Scrapy what elements to scrape from a webpage?
How do I tell Scrapy to follow URLs and scrape their contents?
What to do with the data extracted with Scrapy?
04:00 Conclusion When is web scraping OK and when is it not?
Is web scraping legal? Can I get into trouble?
How can I make sure I’m doing the right thing?
What can I do with the data that I’ve scraped?
04:30 Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.