Python is a versatile programming language that can be used to write programs of varied applications. The number of available libraries in Python makes it one of the most useful programming languages that can be used to perform numerous tasks. Be it writing a simple Python script to automate basic shell command operations in an Operating System, or a program to perform data analysis or Machine learning, Python excels them in all, thanks to the available Python Library packages.
In this article, we will explore and learn about using Python programming language to perform one of the most common application in the world of web, HTML scraping or web scraping using Python.
All the websites we view in our favorite web browser is written using mainly 3 important web front-end programming languages – HTML, CSS and Javascript. Each of these 3 programming languages have a specific role to play in the creation of a web page. They are:
HTML – HTML is a simple Markup language used to create various HTML elements that make up a web page. The elements including Headings, Paragraphs, Lists, Images, tables, headers and footers, links etc that we see in a web page are all different HTML elements. So in other words, HTML Markup language is used to create these HTML elements that we see as part of a web page. HTML here stands for Hyper Text Markup Language.
CSS – CSS is a design style programming language that is mainly responsible for implementing the look and feel of the above mentioned HTML web page elements. You might have seen that same contents of a table are displayed in two different styles in two different websites. This is because, even though both use the same HTML Table element to create this content, the HTML Table is styled in different formats by each of these websites. This is achieved using the CSS programming language. CSS here stands for Cascading Style Sheets.
Javascript – Javascript is another programming language that was mainly developed for use in web browsers, but nowadays has made its way into all parts of web development – be it in the front end (browser side) or at the back end (server side). Javascript programming language on the front end side is used to provide interactive functionalities to the HTML elements of a web page. For example, In most of the web pages that we see these days, we might have seen the infinite scrolling feature where in only first few content elements are loaded in a web page and the rest are loaded dynamically as we scroll to the bottom of the web page. Twitter home page is a good example of this. This sort of interactive functionalities are added using Javascript language in a web page. Almost all interactivity of a web page is achieved using the help of Javascript these days.
When a web page is rendered in a browser on the user’s computer, the webpage includes all these HTML elements with all the texts and image content of the web page all embedded within themselves. So, we can actually retrieve these text and image contents from a web page using a programming language such as Python. Such a process is actually called “Web Scraping” in the web development world.
Scraping A Web Page Using Python
In order to learn how to scrape a web page using Python, we will try to scrape a table that lists mountains across the world ordered by their elevation, as seen in the the official Wikipedia website:
https://en.wikipedia.org/wiki/List_of_mountains_by_elevation
In this Wikipedia web page, we notice the presence of several tables. The first table mainly displays list of mountains having elevation of 8000 meters or above. It is this web page’s table that we would like to scrape using Python.
Introduction to BeautifulSoup library in Python
As mentioned in the beginning of this article, Python comes with myriad of useful libraries that one can use to perform complex tasks with ease by using these libraries’ APIs. One such library is called the “BeautifulSoup” library and is one of the most interesting library that one can use in Python to perform web scraping.
BeautifulSoup Python library’s functionalities
One of the most important functionality of Python’s BeautifulSoup library is its ability to parse and interpret HTML tags. All html elements are represented using what are called the HTML tags. Some examples of such tags are <h1> for main heading, <p> for paragraphs and <table> for tables. Python’s BeautifulSoup library understands these tags and can extract information present in a web page within these tags. BeautifulSoup library exposes these APIs to us to use these functionalities in our own Python programs, which we will make use of in our Python web scraper program that we are about to write.
BeautifulSoup library is available in Python libraries repository under the name of ‘bs4’ and can be installed into your computer system for developing the web scraper using the command:
pip install bs4
BeaultifulSoup library example
In order to understand how a BeautifulSoup library works, let us download a Wikipedia web page into our local system. For this example, let us download the following Wikipedia web page:
https://en.wikipedia.org/wiki/List_of_mountains_by_elevation
Let us save the web page from above link as mountains.html in our local home directory (~/).
We can then read the content of this web page using Python’s BeautifulSoup library using the following commands:
from bs4 import BeautifulSoup
input = open('~/mountains.html', 'r')
soup = BeautifulSoup(input.read(),'html.parser')
tables = soup.find_all('table')
print tables
Well, thats a mouthful of code you just read there. Let us try to understand it in a step by step manner to simplify it and understand what we are doing here:
The first line:
from bs4 import BeautifulSoup
Simply imports the BeautifulSoup library form the Python’s bs4 library we just installed. The next line:
input = open('~/mountains.html', 'r')
is simply using Python’s file operation function open( ) to open the previously downloaded mountain.html web page. In the next line:
soup = BeautifulSoup(input.read(),'html.parser')
we call the BeautifulSoup function and pass it as one of the argument, content of our mountain.html webpage using the Python’s standard file operation function read( ). Another argument that we pass along is ‘html.parser’. This tells the BeautifulSoup function to interpret the content of the passed input content as HTML data and use HTML parser to parse it. The resulting parsed HTML data is assigned to the variable ‘soup’ for later usage. In the next line we do this:
tables = soup.find_all('table')
What the above line shows is that we are now searching for all the available HTML tables in the ‘soup’ variable and assign it to a new variable tables. So, by now we should have all the HTML tables present in mountain.html file assigned to the Python list variable ‘tables’.
Finally, we print the content of this tables variable that should print all the tables found in our mountains.html web page!
While this is good and all, we did a manual download of the Wikipedia web page, saved it as mountain.html and only then used Python’s BeautifulSoup library to process it. However, wouldn’t it be great if we could eliminate this manual step and do even this programmatically? As a next step, we would do exactly this using a new Python library – urllib introduced next.
Introduction to Python Urllib library
Another important Python library that we are going to use to create our web scraper program is called the urllib library. Let us see what functionalities Python’s urllib library brings to us.
Python’s Urllib library is used to fetch contents of web page url. It provides us with APIs such as open(), read() etc to open a web page and read its contents back. Url here stands for Uniform Resource Locators. They are the static web addresses that one can use to locate a web page and read/fetch its contents back.
How to install Python Urllib library?
We can install the Python Urllib library using the following pip command:
pip install urllib
Python Urllib Example
Here is a simple example of urllib library that is used to fetch the content of a Wikipedia web page.
First we will import the urllib library into our Python program environment using Python’s import command:
import urllib
The Urllib library exposes several useful APIs for other programs to make use of. One such API is the request API that one can use to open a web page and read its content. The request API in turn exposes two more functions called the urlopen( ) function and the read( ) function. An example of a Python program using this API is given below, where we are trying to read the contents of a Wikipedia web page:
import urllib.request
content = urllib.request.urlopen('https://en.wikipedia.org/wiki/List_of_mountains_by_elevation')
read_content = content.read()
We can actually combine the above two function calls of the Urllib’s request API – urlopen( ) and read( ) functions into a single line as shown below:
source = urllib.request.urlopen('https://en.wikipedia.org/wiki/List_of_mountains_by_elevation').read()
Python Web Scraper using Urllib and BeautifulSoup libraries
Finally, combining the APIs provided by both BeautifulSoup and Urllib libraries, we can write our web scraper program that reads a Wikipedia page’s contents, extracts its tables, and print the content of a particular table as shown below:
from bs4 import BeautifulSoup
import urllib.request
source = urllib.request.urlopen('https://en.wikipedia.org/wiki/List_of_mountains_by_elevation').read()
soup = BeautifulSoup(source,'html.parser')
tables = soup.find_all('table')
table_rows = tables[0].find_all('tr')
for tr in table_rows:
print (tr)
The above program is our intended Python web scraper program that can go fetch a Wikipedia page using urllib library. We can then extract all the contents of the web page and find a way to access each of these HTML elements using the Python BeautifulSoup library.
Here we are simply printing the first “table” element of the Wikipedia page, however BeautifulSoup can be used to perform many more complex scraping operations than what has been shown here.
I will explain more such operations one can perform using BeautifulSoup Python library in future articles, but this should serve as an entry point for someone who is just getting started with Python programming language for web scraping.