Categories
DATA MINING DATA SCIENCE HTML JAVASCRIPT PROGRAMMING PYTHON STATIC WEBSITES TUTORIALS WEB DEVELOPMENT WEB SCRAPING

How To Extract Data From A Website Using Python

In this article, we are going to learn how to extract data from a website using Python. The term used for extracting data from a website is called “Web scraping” or “Data scraping”. We can write programs using languages such as Python to perform web scraping automatically.

In order to understand how to write a web scraper using Python, we first need to understand the basic structure of a website. We have already written an article about it here on our website. Take a quick look at it once before proceeding here to get a sense of it.

The way to scrape a webpage is to find specific HTML elements and extract its contents. So, to write a website scraper, you need to have good understanding of HTML elements and its syntax.

Assuming you have good understanding on these per-requisites, we will now proceed to learn how to extract data from website using Python.

Python logo on extracting data from a web page using Python
Python Web Scraper Development

How To Fetch A Web Page Using Python

The first step in writing a web scraper using Python is to fetch the web page from web server to our local computer. One can achieve this by making use of a readily available Python package called urllib.

We can install the Python package urllib using Python package manager pip. We just need to issue the following command to install urllib on our computer:

pip install urllib

Once we have urllib Python package installed, we can start using it to fetch the web page to scrape its data.

For the sake of this tutorial, we are going to extract data from a web page from Wikipedia on comet found here:

https://en.wikipedia.org/wiki/Comet

This wikipedia article contains a variety of HTML elements such as texts, images, tables, headings etc. We can extract each of these elements separately using Python.

How To Fetch A Web Page Using Urllib Python package.

Let us now fetch this web page using Python library urllib by issuing the following command:

import urllib.request
content = urllib.request.urlopen('https://en.wikipedia.org/wiki/Comet')

read_content = content.read()

The first line:

import urllib.request

will import the urllib package’s request function into our Python program. We will make use of this request function send an HTML GET request to Wikipedia server to render us the webpage. The URL of this web page is passed as the parameter to this request.

content = urllib.request.urlopen('https://en.wikipedia.org/wiki/Comet')

As a result of this, the wikipedia server will respond back with the HTML content of this web page. It is this content that is stored in the Python program’s “content” variable.

The content variable will hold all the HTML content sent back by the Wikipedia server. This also includes certain HTML meta tags that are used as directives to web browser such as <meta> tags. However, as a web scraper we are mostly interested only in human readable content and not so much on meta content. Hence, we need extract only non meta HTML content from the “content” variable. We achieve this in the next line of the program by calling the read() function of urllib package.

read_content = content.read()

The above line of Python code will give us only those HTML elements which contain human readable contents.

At this point in our program we have extracted all the relevant HTML elements that we would be interested in. It is now time to extract individual data elements of the web page.

How To Extract Data From Individual HTML Elements Of The Web Page

In order to extract individual HTML elements from our read_content variable, we need to make use of another Python library called Beautifulsoup. Beautifulsoup is a Python package that can understand HTML syntax and elements. Using this library, we will be able to extract out the exact HTML element we are interested in.

We can install Python Beautifulsoup package into our local development system by issuing the command:

pip install bs4

Once Beautifulsoup Python package is installed, we can start using it to extract HTML elements from our web content. Hope you remember that we had earlier stored our web content in the Python variable “read_content“. We are now going to pass this variable along with the flag ‘html.parser’ to Beautifulsoup to extract html elements as shown below:

from bs4 import BeautifulSoup
soup = BeautifulSoup(read_content,'html.parser')

From this point on wards, our “soup” Python variable holds all the HTML elements of the webpage. So we can start accessing each of these HTML elements by using the find and find_all built-in functions.

How To Extract All The Paragraphs Of A Web Page

For example, if we want to extract the first paragraph of the wikipedia comet article, we can do so using the code:

pAll = soup.find_all('p')

Above code will extract all the paragraphs present in the article and assign it to the variable pAll. Now pAll contains a list of all paragraphs, so each individual paragraphs can be accessed through indexing. So in order to access the first paragraph, we issue the command:

pAll[0].text

The output we obtain is:

\n

So the first paragraph only contained a new line. What if we try the next index?

pAll[1].text
'\n'

We again get a newline! Now what about the third index?

pAll[2].text
"A comet is an icy, small Solar System body that..."

And now we get the text of the first paragraph of the article! If we continue further with indexing, we can see that we continue to get access to every other HTML <p> element of the article. In a similar way, we can extract other HTML elements too as shown in the next section.

How To Extract All The H2 Elements Of A Web Page

Extracting H2 elements of a web page can also be achieved in a similar way as how we did for the paragraphs earlier. By simply issuing the following command:

h2All = soup.find_all('h2')

we can filter and store all H2 elements into our h2All variable.

So with this we can now access each of the h2 element by indexing the h2All variable:

>>> h2All[0].text
'Contents'
>>> h2All[2].text
'Physical characteristics[edit]'

Conclusion

So there you have it. This is how we extract data from website using Python. By making use of the two important libraries – urllib and Beautifulsoup.

We first pull the web page content from the web server using urllib and then we use Beautifulsoup over the content. Beautifulsoup will then provides us with many useful functions (find_all, text etc) to extract individual HTML elements of the web page. By making use of these functions, we can address individual elements of the web page.

So far we have seen how we could extract paragraphs and h2 elements from our web page. But we do not stop there. We can extract any type of HTML elements using similar approach – be it images, links, tables etc. If you want to verify this, checkout this other article where we have taken similar approach to extract table elements from another wikipedia article.

How to scrape HTML tables using Python

Categories
100DaysOfCode EMBEDDED PROGRAMMING PROGRAMMING PYTHON RASPBERRY PI TUTORIALS

Python Program To Add Two Numbers

In this article, we will look at how to write a Python program to add two numbers. This is a very simple Python program that is suitable for even a beginner in Python programming to work on it for getting hands on practice with Python.

In order to add two numbers in Python program, we need to first break down the problem into the following steps:

Breakdown of the problem of Adding two numbers

  1. Receive the first input number from the user and store it in a Python variable.
  2. Receive the second input number from the user and store it in another Python variable.
  3. Add the two numbers by adding the two Python variables using a Python statement (Learn about Python statement here)
  4. Store the final result in another Python variable called “result”
  5. Print the value of the “result” variable.

In the above breakdown of the problem, you will notice that the Python program we will be written in such a way that the two numbers that need to be added will not be hard coded directly into the program itself but instead is written in a generic way such that we prompt the user to enter these two input values every time the Python program to add two numbers is run. This type of programming approach is often called general programming as the program is generic enough to receive any two different values each time it is run.

Pseudo-code To Add Two Numbers Using Python Programming Language

num1 = Receive First Input Number From The User
num1 = Receive Second Input Number From The User
result = (num1) Added to (num2)
Print the (result) on the screen

We can see from the above psuedo code that this is a simple program that receives two numbers from the user, adds them and print their results back on the screen. The above pseudo code gives us a nice little framework on how to write our program. The same pseudo-code can now be used to write a program to add two numbers using any programming language and not just Python!

Now that we have our problem broken down and pseudo code written, it is time for us to replace the pseudo code with actual Python programming code instructions.

Python Program To Add Two Numbers And Print Its Result

Fire up your Python IDLE interpreter and start typing in the code along with me for you to be able to understand this program better:

Python 3.5.2 (default, Oct  8 2019, 13:06:37) 
 [GCC 5.4.0 20160609] on linux
 Type "copyright", "credits" or "license()" for more information.
>>>

Once in our python interpreter, let us start typing in our Python program commands. The first thing we need to do according to the pseudo code written above is the receive the first input number from our user. In Python program, the instruction code to be used to receive a value from its user is by calling the input() function. Input function will accept a string as its parameter that will be displayed to the program user when the program is run. So, with this knowledge, our first line of the Python program to add two numbers will be:

>>> num1 = input('Enter the first number\n')
Enter the first number
10
>>> 

As you can see from the above code block, we have used the input() function to prompt our program user with the string “Enter the first number“. We are also saving the value entered by the user to a Python variable called num1.

The input function will then prompt with the above string and wait until the user enters a number. In the above code snippet, I had entered a value of 10 which is now stored in the variable num1.

In a similar way, we will write the next line of code which will prompt the user to enter a second input number that is to be added to the first number. This is achieved with the following piece of code:

>>> num2 = input('Enter the second number\n')
Enter the second number
20
>>> 

Again over here, I have entered my second number as 20 when prompted.

Now that we have the two numbers in our Python variables num1 and num2, it is time add them to get the final result that we are going to store in our third Python variable called result. This is achieved using the following Python statement:

>>> result = int(num1) + int(num2)
>>>

So from the above python code, we have now added the numbers num1 and num2 using the Python arithmetic operator “+” and the typecast operator called “int()“. We had to use the typecast operator int because by default all inputs from the user into a python program will be interpreted and stored as a string. We can check this by issuing the following command in the interpreter:

>>> print(num1)
10
>>> type(num1)
<class 'str'>
>>> 

So, by calling the typecast operator int, we are converting this string value to an integer value.

>>> type(int(num1))
<class 'int'>
>>> 

finally stored the end result in a new variable called result. Since we have not issued a Python command to print the value of result, nothing gets printed on the IDLE prompt yet. So, the final step is exactly that. To print the value of the result variable onto the screen. This is achieved using another Python function called the print function.

>>> print (result)
30
>>>

Here is the full program that we can store in a file called add2num.py and run it using the command python3 add2num.py everytime we want to add any two numbers!

num1 = input('Enter the first number\n')
num2 = input('Enter the second number\n')
result = int(num1) + int(num2)
print (result)

This concludes our Python program to add two numbers.

Additional Side Notes On Python Programming Language:

Python is an interpreted programming language that gets interpreted and executed on the fly and hence the program written in Python do not need to be compiled like in the case of a C or Java programming language.

Categories
PYTHON TUTORIALS

How to scrape HTML tables using Python

Python is a versatile programming language that can be used to write programs of varied applications. The number of available libraries in Python makes it one of the most useful programming languages that can be used to perform numerous tasks. Be it writing a simple Python script to automate basic shell command operations in an Operating System, or a program to perform data analysis or Machine learning, Python excels them in all, thanks to the available Python Library packages.

In this article, we will explore and learn about using Python programming language to perform one of the most common application in the world of web, HTML scraping or web scraping using Python.

Web scraper Illustrative picture

All the websites we view in our favorite web browser is written using mainly 3 important web front-end programming languages – HTML, CSS and Javascript. Each of these 3 programming languages have a specific role to play in the creation of a web page. They are:

HTML – HTML is a simple Markup language used to create various HTML elements that make up a web page. The elements including Headings, Paragraphs, Lists, Images, tables, headers and footers, links etc that we see in a web page are all different HTML elements. So in other words, HTML Markup language is used to create these HTML elements that we see as part of a web page. HTML here stands for Hyper Text Markup Language.

CSS – CSS is a design style programming language that is mainly responsible for implementing the look and feel of the above mentioned HTML web page elements. You might have seen that same contents of a table are displayed in two different styles in two different websites. This is because, even though both use the same HTML Table element to create this content, the HTML Table is styled in different formats by each of these websites. This is achieved using the CSS programming language. CSS here stands for Cascading Style Sheets.

Javascript – Javascript is another programming language that was mainly developed for use in web browsers, but nowadays has made its way into all parts of web development – be it in the front end (browser side) or at the back end (server side).  Javascript programming language on the front end side is used to provide interactive functionalities to the HTML elements of a web page. For example, In most of the web pages that we see these days, we might have seen the infinite scrolling feature where in only first few content elements are loaded in a web page and the rest are loaded dynamically as we scroll to the bottom of the web page. Twitter home page is a good example of this. This sort of interactive functionalities are added using Javascript language in a web page. Almost all interactivity of a web page is achieved using the help of Javascript these days.

When a web page is rendered in a browser on the user’s computer, the webpage includes all these HTML elements with all the texts and image content of the web page all embedded within themselves. So, we can actually retrieve these text and image contents from a web page using a programming language such as Python. Such a process is actually called “Web Scraping” in the web development world.

Scraping A Web Page Using Python

In order to learn how to scrape a web page using Python, we will try to scrape a table that lists mountains across the world ordered by their elevation, as seen in the the official Wikipedia website:

https://en.wikipedia.org/wiki/List_of_mountains_by_elevation

In this Wikipedia web page, we notice the presence of several tables. The first table mainly displays list of mountains having elevation of 8000 meters or above. It is this web page’s table that we would like to scrape using Python.

Introduction to BeautifulSoup library in Python

As mentioned in the beginning of this article, Python comes with myriad of useful libraries that one can use to perform complex tasks with ease by using these libraries’ APIs. One such library is called the “BeautifulSoup” library and is one of the most interesting library that one can use in Python to perform web scraping.

BeautifulSoup Python library’s functionalities

One of the most important functionality of Python’s BeautifulSoup library is its ability to parse and interpret HTML tags. All html elements are represented using what are called the HTML tags. Some examples of such tags are <h1> for main heading, <p> for paragraphs and <table> for tables. Python’s BeautifulSoup library understands these tags and can extract information present in a web page within these tags. BeautifulSoup library exposes these APIs to us to use these functionalities in our own Python programs, which we will make use of in our Python web scraper program that we are about to write.

BeautifulSoup library is available in Python libraries repository under the name of ‘bs4’ and can be installed into your computer system for developing the web scraper using the command:

pip install bs4

BeaultifulSoup library example

In order to understand how a BeautifulSoup library works, let us download a Wikipedia web page into our local system. For this example, let us download the following Wikipedia web page:

https://en.wikipedia.org/wiki/List_of_mountains_by_elevation

Let us save the web page from above link as mountains.html in our local home directory (~/).

We can then read the content of this web page using Python’s BeautifulSoup library using the following commands:

from bs4 import BeautifulSoup

input = open('~/mountains.html', 'r')

soup = BeautifulSoup(input.read(),'html.parser')

tables = soup.find_all('table')

print tables

Well, thats a mouthful of code you just read there. Let us try to understand it in a step by step manner to simplify it and understand what we are doing here:
The first line:

from bs4 import BeautifulSoup

Simply imports the BeautifulSoup library form the Python’s bs4 library we just installed. The next line:

input = open('~/mountains.html', 'r')

is simply using Python’s file operation function open( ) to open the previously downloaded mountain.html web page. In the next line:

soup = BeautifulSoup(input.read(),'html.parser') 

we call the BeautifulSoup function and pass it as one of the argument, content of our mountain.html webpage using the Python’s standard file operation function read( ). Another argument that we pass along is ‘html.parser’. This tells the BeautifulSoup function to interpret the content of the passed input content as HTML data and use HTML parser to parse it. The resulting parsed HTML data is assigned to the variable ‘soup’ for later usage. In the next line we do this:

tables = soup.find_all('table')

What the above line shows is that we are now searching for all the available HTML tables in the ‘soup’ variable and assign it to a new variable tables. So, by now we should have all the HTML tables present in mountain.html file assigned to the Python list variable ‘tables’.

Finally, we print the content of this tables variable that should print all the tables found in our mountains.html web page!

While this is good and all, we did a manual download of the Wikipedia web page, saved it as mountain.html and only then used Python’s BeautifulSoup library to process it. However, wouldn’t it be great if we could eliminate this manual step and do even this programmatically? As a next step, we would do exactly this using a new Python library – urllib introduced next.

Introduction to Python Urllib library

Another important Python library that we are going to use to create our web scraper program is called the urllib library. Let us see what functionalities Python’s urllib library brings to us.

Python’s Urllib library is used to fetch contents of web page url. It provides us with APIs such as open(), read() etc to open a web page and read its contents back. Url here stands for Uniform Resource Locators. They are the static web addresses that one can use to locate a web page and read/fetch its contents back.

How to install Python Urllib library?

We can install the Python Urllib library using the following pip command:

pip install urllib

Python Urllib Example

Here is a simple example of urllib library that is used to fetch the content of a Wikipedia web page.

First we will import the urllib library into our Python program environment using Python’s import command:

import urllib

The Urllib library exposes several useful APIs for other programs to make use of. One such API is the request API that one can use to open a web page and read its content. The request API in turn exposes two more functions called the urlopen( ) function and the read( ) function. An example of a Python program using this API is given below, where we are trying to read the contents of a Wikipedia web page:

import urllib.request

content = urllib.request.urlopen('https://en.wikipedia.org/wiki/List_of_mountains_by_elevation')

read_content = content.read()

We can actually combine the above two function calls of the Urllib’s request API – urlopen( ) and read( ) functions into a single line as shown below:

source = urllib.request.urlopen('https://en.wikipedia.org/wiki/List_of_mountains_by_elevation').read()

Python Web Scraper using Urllib and BeautifulSoup libraries

Finally, combining the APIs provided by both BeautifulSoup and Urllib libraries, we can write our web scraper program that reads a Wikipedia page’s contents, extracts its tables, and print the content of a particular table as shown below:

from bs4 import BeautifulSoup
import urllib.request

source = urllib.request.urlopen('https://en.wikipedia.org/wiki/List_of_mountains_by_elevation').read()
soup = BeautifulSoup(source,'html.parser')
tables = soup.find_all('table')
table_rows = tables[0].find_all('tr')
for tr in table_rows:
print (tr)

The above program is our intended Python web scraper program that can go fetch a Wikipedia page using urllib library. We can then extract all the contents of the web page and find a way to access each of these HTML elements using the Python BeautifulSoup library.

Here we are simply printing the first “table” element of the Wikipedia page, however BeautifulSoup can be used to perform many more complex scraping operations than what has been shown here.

I will explain more such operations one can perform using BeautifulSoup Python library in future articles, but this should serve as an entry point for someone who is just getting started with Python programming language for web scraping.