In this article, we are going to learn how to extract data from a website using Python. Extracting data from a website is called “web scraping” or “data scraping”. We can write programs in languages such as Python to perform web scraping automatically.
In order to write a web scraper using Python, we first need to understand the basic structure of a website. We have already written an article about it here on our website; take a quick look at it before proceeding to get a sense of it.
The way to scrape a web page is to find specific HTML elements and extract their contents. So, to write a website scraper, you need a good understanding of HTML elements and their syntax.
Assuming you have a good understanding of these prerequisites, we will now proceed to learn how to extract data from a website using Python.
How To Fetch A Web Page Using Python
The first step in writing a web scraper using Python is to fetch the web page from the web server to our local computer. We can achieve this by making use of urllib, a package that ships with Python itself.
urllib is part of Python's standard library, so there is nothing to install: it is available out of the box in every Python 3 installation.
With urllib available, we can start using it to fetch the web page whose data we want to scrape.
For the sake of this tutorial, we are going to extract data from the Wikipedia article on comets, found here: https://en.wikipedia.org/wiki/Comet
This Wikipedia article contains a variety of HTML elements such as text, images, tables, and headings. We can extract each of these elements separately using Python.
How To Fetch A Web Page Using The urllib Python Package
Let us now fetch this web page using the Python library urllib by issuing the following commands:
import urllib.request
content = urllib.request.urlopen('https://en.wikipedia.org/wiki/Comet')
read_content = content.read()
The first line:
import urllib.request
will import the urllib package's request module into our Python program. We will make use of this module's urlopen function to send an HTTP GET request to the Wikipedia server, asking it to return the web page. The URL of the web page is passed as the parameter to this function.
content = urllib.request.urlopen('https://en.wikipedia.org/wiki/Comet')
As a result of this, the Wikipedia server will respond with the HTTP response for this web page. It is this response that is stored in the Python program's “content” variable.
The content variable now holds a response object, not the HTML itself. To get the actual HTML content sent back by the Wikipedia server, we need to read the body of the response. We achieve this in the next line of the program by calling the response object's read() function.
read_content = content.read()
The above line of Python code gives us the complete HTML of the page, as a bytes object, stored in the read_content variable.
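Note that read() returns bytes rather than a string. If you want to work with the HTML as text yourself, you can decode it first. A minimal sketch, using a stand-in bytes literal so it runs without a network connection (the variable name read_content mirrors the one above):

```python
# Stand-in for the bytes returned by content.read(); a real run would
# fetch this from the server instead.
read_content = b'<html><head><title>Comet</title></head><body></body></html>'

# Wikipedia serves UTF-8; other sites may use a different encoding.
html_text = read_content.decode('utf-8')

print(type(read_content).__name__)  # bytes
print(type(html_text).__name__)     # str
```

Beautiful Soup (introduced below) accepts either bytes or a decoded string, so this step is optional when the content goes straight into the parser.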
At this point in our program we have fetched all the HTML content that we would be interested in. It is now time to extract individual data elements of the web page.
How To Extract Data From Individual HTML Elements Of The Web Page
In order to extract individual HTML elements from our read_content variable, we need to make use of another Python library called Beautiful Soup. Beautiful Soup is a Python package that can parse HTML syntax and elements. Using this library, we will be able to extract the exact HTML element we are interested in.
We can install the Beautiful Soup package on our local development system by issuing the command:
pip install beautifulsoup4
Once the Beautiful Soup package is installed, we can start using it to extract HTML elements from our web content. Recall that we earlier stored our web content in the Python variable “read_content”. We are now going to pass this variable, along with the parser name 'html.parser', to BeautifulSoup to parse the HTML elements as shown below:
from bs4 import BeautifulSoup
soup = BeautifulSoup(read_content, 'html.parser')
From this point onwards, our “soup” Python variable holds all the parsed HTML elements of the web page, so we can start accessing each of them by using the find and find_all built-in functions.
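Before working on the live page, it helps to see what find and find_all return. Here is a small self-contained sketch using a made-up inline HTML string, so it runs without a network connection:

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML document for illustration.
sample_html = """
<html><body>
  <h1>Comet</h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# find returns the first matching element (or None if nothing matches).
print(soup.find('h1').text)       # Comet

# find_all returns a list of every matching element.
paragraphs = soup.find_all('p')
print(len(paragraphs))            # 2
print(paragraphs[0].text)         # First paragraph.
```

The same two functions, applied to the real soup variable above, are all we need for the rest of this tutorial.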
How To Extract All The Paragraphs Of A Web Page
For example, if we want to extract the paragraphs of the Wikipedia comet article, we can do so using the code:
pAll = soup.find_all('p')
The above code extracts all the paragraph elements present in the article and assigns them to the variable pAll. Since pAll now contains a list of all the paragraphs, each individual paragraph can be accessed through indexing. So, in order to access the first paragraph, we issue the command:
pAll[0].text
The output we obtain is:
'\n'
So the first paragraph only contained a newline. What if we try the next index?
pAll[1].text
'\n'
We again get a newline! Now what about the third index?
pAll[2].text
'A comet is an icy, small Solar System body that...'
And now we get the text of the first non-empty paragraph of the article! If we continue further with indexing, we can access each of the remaining HTML <p> elements of the article. In a similar way, we can extract other HTML elements too, as shown in the next section.
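Since some <p> elements can contain only whitespace (as the first two indexes showed), it is often convenient to skip the empty ones when looping. A minimal sketch, again using made-up inline HTML so it runs offline:

```python
from bs4 import BeautifulSoup

# Hand-written sample with an empty paragraph, mimicking what we saw above.
sample_html = "<html><body><p>\n</p><p>A comet is an icy body.</p></body></html>"
soup = BeautifulSoup(sample_html, 'html.parser')

# Keep only the paragraphs whose text is not just whitespace.
non_empty = [p.text.strip() for p in soup.find_all('p') if p.text.strip()]
print(non_empty)  # ['A comet is an icy body.']
```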
How To Extract All The H2 Elements Of A Web Page
Extracting the H2 elements of a web page can be achieved in a similar way to the paragraphs earlier. By simply issuing the following command:
h2All = soup.find_all('h2')
we can filter and store all the H2 elements in our h2All variable.
With this, we can now access each of the h2 elements by indexing the h2All variable:
>>> h2All[0].text
'Contents'
>>> h2All[1].text
'Physical characteristics'
So there you have it. This is how we extract data from a website using Python, by making use of two important libraries: urllib and Beautiful Soup.
We first pull the web page content from the web server using urllib, and then we run Beautiful Soup over that content. Beautiful Soup then provides us with many useful functions (find_all, text, etc.) to extract individual HTML elements of the web page. By making use of these functions, we can address individual elements of the web page.
So far we have seen how to extract paragraphs and h2 elements from our web page. But we do not stop there. We can extract any type of HTML element using a similar approach, be it images, links, tables, etc. If you want to verify this, check out this other article where we have taken a similar approach to extract table elements from another Wikipedia article.
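As one more illustration of the same approach, here is a sketch that pulls out links and images by their attributes; the markup is made up for illustration, so the snippet runs offline:

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration.
sample_html = (
    "<html><body>"
    "<a href='/wiki/Comet'>Comet</a>"
    "<a href='/wiki/Meteoroid'>Meteoroid</a>"
    "<img src='comet.jpg' alt='A comet'>"
    "</body></html>"
)
soup = BeautifulSoup(sample_html, 'html.parser')

# Links: the href attribute of every <a> element.
links = [a.get('href') for a in soup.find_all('a')]
print(links)   # ['/wiki/Comet', '/wiki/Meteoroid']

# Images: the src attribute of every <img> element.
images = [img.get('src') for img in soup.find_all('img')]
print(images)  # ['comet.jpg']
```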