How to scrape HTML tables using Python

Python is a versatile programming language that can be used to write programs for a wide variety of applications. Its wealth of available libraries makes it one of the most useful languages for performing all kinds of tasks. Be it a simple script to automate shell operations in an operating system, or a program for data analysis or machine learning, Python excels at them all, thanks to its library packages.

In this article, we will learn how to use the Python programming language to perform one of the most common tasks on the web: HTML scraping, also known as web scraping.

All the websites we view in our favorite web browsers are written mainly in three important front-end languages: HTML, CSS and JavaScript. Each of these three languages has a specific role to play in the creation of a web page:

HTML – HTML is a simple markup language used to create the various HTML elements that make up a web page. The headings, paragraphs, lists, images, tables, headers and footers, links etc. that we see in a web page are all different HTML elements. In other words, the HTML markup language is used to create the elements we see as parts of a web page. HTML stands for Hyper Text Markup Language.

CSS – CSS is a style sheet language that is mainly responsible for the look and feel of the HTML elements mentioned above. You might have seen the same table contents displayed in two different styles on two different websites. This is because, even though both use the same HTML table element, each website styles that table differently. This is achieved using CSS, which stands for Cascading Style Sheets.

JavaScript – JavaScript is a programming language that was originally developed for use in web browsers but has since made its way into all parts of web development, be it the front end (browser side) or the back end (server side). On the front end, JavaScript is used to add interactive functionality to the HTML elements of a web page. For example, many web pages these days use infinite scrolling, wherein only the first few content elements are loaded initially and the rest are loaded dynamically as we scroll towards the bottom of the page. The Twitter home page is a good example of this. Such interactive functionality is added to a web page using JavaScript; in fact, almost all web page interactivity is achieved with JavaScript these days.

When a web page is rendered in a browser on the user’s computer, it includes all these HTML elements, with the page’s text and image content embedded within them. This means we can retrieve that text and image content from a web page using a programming language such as Python. This process is called “web scraping” in the web development world.

Scraping A Web Page Using Python

In order to learn how to scrape a web page using Python, we will try to scrape a table that lists mountains around the world ordered by elevation, as seen on the official Wikipedia website:

https://en.wikipedia.org/wiki/List_of_mountains_by_elevation

On this Wikipedia page, we notice several tables. The first table lists mountains with an elevation of 8,000 meters or above. It is this table that we would like to scrape using Python.

Introduction to BeautifulSoup library in Python

As mentioned at the beginning of this article, Python comes with a myriad of useful libraries whose APIs let us perform complex tasks with ease. One such library, and one of the most interesting ones for web scraping in Python, is called BeautifulSoup.

BeautifulSoup Python library’s functionalities

One of the most important features of Python’s BeautifulSoup library is its ability to parse and interpret HTML tags. All HTML elements are represented using what are called HTML tags. Some examples of such tags are <h1> for a main heading, <p> for paragraphs and <table> for tables. BeautifulSoup understands these tags and can extract the information a web page holds within them. The library exposes APIs that let us use these capabilities in our own Python programs, which we will do in the web scraper we are about to write.

The BeautifulSoup library is published in the Python package repository (PyPI) under the name ‘beautifulsoup4’ (in code it is imported as ‘bs4’) and can be installed on your system using the command:

pip install beautifulsoup4
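
Before working with a full web page, here is a minimal sketch of what BeautifulSoup does; the HTML string below is a made-up stand-in for a real page:

from bs4 import BeautifulSoup

# A tiny, hypothetical HTML string standing in for a real web page
html = '<h1>Mountains</h1><p>A list of peaks.</p><table><tr><td>Everest</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

# Tags can be accessed by name, and their text content extracted
print(soup.h1.text)           # Mountains
print(soup.p.text)            # A list of peaks.
print(soup.find('td').text)   # Everest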

BeautifulSoup library example

In order to understand how the BeautifulSoup library works, let us download a Wikipedia web page to our local system. For this example, let us download the following Wikipedia web page:

https://en.wikipedia.org/wiki/List_of_mountains_by_elevation

Let us save the web page from the above link as mountains.html in our local home directory (~/).

We can then read the content of this web page with Python’s BeautifulSoup library using the following code:

from bs4 import BeautifulSoup
import os

infile = open(os.path.expanduser('~/mountains.html'), 'r')

soup = BeautifulSoup(infile.read(), 'html.parser')

tables = soup.find_all('table')

print(tables)

Well, that’s a mouthful of code you just read there. Let us go through it step by step to understand what we are doing here.
The first two lines:

from bs4 import BeautifulSoup
import os

simply import the BeautifulSoup class from the bs4 package we just installed, along with Python’s built-in os module, which we need for handling the file path. The next line:

infile = open(os.path.expanduser('~/mountains.html'), 'r')

uses Python’s built-in open() function to open the previously downloaded mountains.html web page. Note that open() does not expand the ‘~’ shorthand for the home directory by itself, which is why we pass the path through os.path.expanduser() first. In the next line:

soup = BeautifulSoup(infile.read(), 'html.parser')

we call the BeautifulSoup constructor and pass it, as one of its arguments, the content of our mountains.html web page, read with the standard file method read(). The other argument we pass is ‘html.parser’. This tells BeautifulSoup to interpret the input as HTML data and to parse it with Python’s built-in HTML parser. The resulting parsed HTML document is assigned to the variable ‘soup’ for later use. In the next line we do this:

tables = soup.find_all('table')

This line searches for all the HTML tables present in the parsed document held by ‘soup’ and assigns the result to a new variable, tables. By now, every HTML table present in the mountains.html file should be stored in the list-like variable ‘tables’.

Finally, we print the contents of the tables variable, which should print all the tables found in our mountains.html web page!
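
As a quick sanity check, you can also count the tables and inspect just the first one, since the usual list operations work on the result of find_all():

print(len(tables))   # how many <table> elements were found
print(tables[0])     # inspect just the first table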

While this is all good, we downloaded the Wikipedia web page manually, saved it as mountains.html, and only then used Python’s BeautifulSoup library to process it. Wouldn’t it be great if we could eliminate this manual step and do even the download programmatically? As a next step, we will do exactly that using another Python library, urllib, introduced next.

Introduction to Python Urllib library

Another important Python library that we are going to use in our web scraper program is called urllib. Let us see what functionality Python’s urllib library brings us.

Python’s urllib library is used to fetch the contents of a web page given its URL. It provides APIs such as urlopen() to open a web page and read() to read its contents back. URL stands for Uniform Resource Locator: the web address one can use to locate a web page and fetch its contents.

Do we need to install the Python urllib library?

No. Unlike BeautifulSoup, urllib is part of Python’s standard library, so there is nothing extra to install; it is available as soon as Python itself is installed.

Python Urllib Example

Here is a simple example that uses the urllib library to fetch the contents of a Wikipedia web page.

First, we import the urllib library into our Python program environment. In Python 3, urllib is a package split into several modules; the one that fetches web pages is urllib.request, which we import using Python’s import command:

import urllib.request

The urllib.request module exposes several useful APIs. The one we need is the urlopen() function, which opens a web page and returns a response object; that response object’s read() method returns the page contents. An example of a Python program using this API is given below, where we read the contents of a Wikipedia web page:

import urllib.request

content = urllib.request.urlopen('https://en.wikipedia.org/wiki/List_of_mountains_by_elevation')

read_content = content.read()
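
One detail worth noting: read() returns the page as raw bytes, not as a string. BeautifulSoup accepts bytes directly, but if you want to work with the markup as text, you can decode it yourself. A small sketch, assuming the page is UTF-8 encoded (which is what Wikipedia serves):

# read() gives us raw bytes; decode to a str to work with the markup as text
html_text = read_content.decode('utf-8')
print(html_text[:200])   # the first 200 characters of the page source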

We can also combine the urlopen() and read() calls into a single line, as shown below:

source = urllib.request.urlopen('https://en.wikipedia.org/wiki/List_of_mountains_by_elevation').read()

Python Web Scraper using Urllib and BeautifulSoup libraries

Finally, combining the APIs provided by the BeautifulSoup and urllib libraries, we can write our web scraper program, which reads a Wikipedia page’s contents, extracts its tables, and prints the contents of a particular table, as shown below:

from bs4 import BeautifulSoup
import urllib.request

source = urllib.request.urlopen('https://en.wikipedia.org/wiki/List_of_mountains_by_elevation').read()
soup = BeautifulSoup(source, 'html.parser')
tables = soup.find_all('table')
table_rows = tables[0].find_all('tr')
for tr in table_rows:
    print(tr)

The above program is our intended Python web scraper: it fetches a Wikipedia page using the urllib library, then uses the BeautifulSoup library to extract the page’s contents and give us access to each of its HTML elements.

Here we simply print every row of the first “table” element of the Wikipedia page; however, BeautifulSoup can be used to perform far more complex scraping operations than what has been shown here.
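
As a small taste of that, here is a sketch that pulls the text out of each cell of the first table instead of printing raw HTML. What each column holds is left for you to verify against the live page:

for tr in tables[0].find_all('tr'):
    # a row's cells can be header cells (th) or data cells (td)
    cells = tr.find_all(['th', 'td'])
    # get_text(strip=True) drops the markup and surrounding whitespace
    values = [cell.get_text(strip=True) for cell in cells]
    if values:
        print(values)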

I will explain more operations one can perform with the BeautifulSoup Python library in future articles, but this should serve as an entry point for anyone just getting started with web scraping in Python.


Getting started with Pelican: One click installer to install Pelican

Pelican is a static website generator written in Python. Using Pelican, one can create static websites that can later be deployed to a simple file web server in the cloud; cloud providers include Amazon Web Services, DigitalOcean, Vultr etc. One can also host these static websites on a static hosting provider such as Netlify. But first, let’s understand more about static websites and how to use Python’s Pelican to create your own.

What is a static website and why should you use one?

The internet today is made up of both dynamic and static websites. A dynamic website usually has a database, and its server creates HTML web pages on the fly, often specific to the user who requested them. A static website, on the contrary, is made up of content that is just that: static, served identically to every requesting user.

So which one should you use for your use case, a static website or a dynamic website? To answer this question, first take a look at the article on this blog that discusses the advantages and disadvantages of static vs dynamic websites.

With that introduction out of the way, it’s time to move on to the technical aspects of Pelican. First, let us discuss installation.

How To Install Pelican

Jump to the end of this article if you just want a one-click installer to install and try Pelican.

In order to install Pelican, you need both pip and Python installed on your system. If you don’t have them, you can install them using the following commands:

For Ubuntu

sudo apt-get install python3 python3-pip

For Fedora Linux

sudo dnf install python3 python3-pip

In this installation process, we are using Python 3. Note, however, that Pelican works on both Python 2.7 and the latest versions of Python 3, so which one to use is left to your discretion.

Once Python and Pip are installed, we can proceed with installing Pelican onto our computers. To do so, we issue the following command:

pip3 install pelican markdown

Note that we are installing two Python packages here: one is the Pelican static site generator itself and the other is the Markdown package. If you are unfamiliar with Markdown, it is a lightweight markup language used to write content in plain text with simple formatting conventions that can later be converted to HTML. You can read more about Markdown on Wikipedia.
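
For a quick, illustrative taste of the syntax, here is a small, made-up snippet of Markdown:

# A Heading

Some *emphasized* text and a [link](https://example.com).

- list item one
- list item two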

Once they are installed, we can create a new directory on the command line to store our project files, and move into it. In this case, we create a directory called Pelican_Demo:

mkdir Pelican_Demo
cd Pelican_Demo

Once inside the newly created directory, we can start creating our Pelican website. To do so, we call a Python executable script called pelican-quickstart, which was installed into /usr/local/bin along with Pelican. Since that directory is on the PATH, we can run the script simply by calling it as follows:

pelican-quickstart

This kick-starts the Pelican static website generator, which proceeds with a series of questions that you need to answer to create your static website.

What this set of questions actually does to your Python based Pelican static website will be a topic for another post. But for now, you should be good to go with your website.
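
To get an early feel for the workflow from here, the following is a rough sketch; the file name and metadata values are illustrative. Pelican looks for articles inside the content/ directory, written in Markdown with a small metadata header, for example saved as content/my-first-post.md:

Title: My First Post
Date: 2019-01-01
Category: Blog

Hello from my new Pelican site!

You can then build the site and serve the generated output directory locally, after which it should be viewable at http://localhost:8000:

pelican content
cd output
python3 -m http.server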

If you just want to get your hands dirty and get Pelican up and running without digging deeper into how it works, you can use the following script to get going:

https://github.com/digitallyamar/Python3-Pelican-Installer

This script installs all the required packages and answers all the pelican-quickstart questions automatically, so you can simply run it and jump straight to viewing the newly created Pelican static website. Follow the instructions in the Python3-Pelican-Installer GitHub project to get it up and running in no time and get a taste of what a Pelican static website looks and feels like.

Good luck!

Advantages And Disadvantages Of Static Vs Dynamic Websites

Static websites are gaining popularity these days. A static website can be built using static website generators such as Jekyll (Ruby), Next.js (JavaScript), Hugo (Go), Pelican (Python) etc. But few people understand the benefits and drawbacks of using one. This article explains them in a way that should make it easy for anyone deciding between a static and a dynamic website for their purposes.

What is a static website?

Most of the websites we use these days are dynamic websites. These websites have databases from which the content of a web page is generated on the server dynamically and then sent to the user’s browser. The advantage of this is that each user gets customized content, different from what is delivered to other users. One example is a user’s Facebook home page, which shows posts from that user’s friends and network. A Google search results page is another example of a dynamic page: it varies from person to person for the same query, based on each user’s browsing history.

Contrary to this, a static website is usually made up of static content (mostly just HTML and CSS) stored as complete files on the server. Every user who requests a particular web page from this server therefore receives exactly the same content. These web pages are pre-built and stored on a file server, and each file is sent back whole to the user’s browser when requested.

Advantages Of A Static Website

  • Fast: As these websites serve pre-built HTML web pages, they are extremely fast.
  • Secure: As these websites have no database, just a set of files served from a simple web-based file server, they avoid the security threats that come with running a database.
  • Cheap: Hosting a static website costs pennies compared to a dynamic one, as it only needs a simple web-enabled file server.

Disadvantages Of A Static Website

  • As the contents are static and created in advance, no dynamic content can be added to a web page.
  • User interactivity is limited due to the static nature of the website.
  • Static websites usually lack components such as comments, user login, recommendation engines and real-time notifications. However, these can still be added through third-party external services.
  • Programming knowledge is required to work with static websites. Since static website generator tools are quite technical in nature, users who wish to run static websites should be technically capable.
  • Content Management Systems (CMS) are usually missing from static websites. However, third-party CMS services such as Contentful can overcome this issue.
  • Each time a new article is added, the static website generator rebuilds the entire website, which must then be redeployed to the web server. This can be time consuming and prone to unforeseen technical errors.
  • Not suitable for large websites with thousands of articles, as updating such a static website can be extremely slow.

Conclusion

Static and dynamic websites each bring their own set of advantages and challenges. So the decision as to which one is better for you boils down to how comfortable you are with the programming needed to work with static website generators, your website’s content types, and its requirements.

If you are just looking for a simple blog-type website run at low cost, you can definitely opt for a static website. On the other hand, if you are looking to create a website with thousands of web pages, or content customized for each user, then a dynamic website is the way to go!

Difference between expression and statement in Python

A Python expression can be defined as any element of our program that evaluates to some value. What does this mean? To understand it better, let us fire up the Python interpreter and take a deep dive into Python expressions with a few examples.

Once in our Python interpreter, let us type the following:

Python 3.5.2 (default, Nov 12 2018, 13:43:14) 
[GCC 5.4.0 20160609] on linux
Type "copyright", "credits" or "license()" for more information.
>>> 4
4
>>> 

We can see that by simply entering the number 4 into our Python interpreter, it was accepted and evaluated to the integer value 4. Hence, we can say that the input ‘4’ is a kind of expression.

Similarly, if we give the input ‘4 + 1’ to the Python interpreter:

>>> 4 + 1
5
>>>

Our interpreter goes ahead, computes the sum, and prints the resulting value of 5. Here too, the input ‘4 + 1’ can be called an expression, as it evaluated to the value 5.

Similarly, if we enter this code into the Python interpreter:

>>> "Hello" + "World"
'HelloWorld'
>>>

This shows that irrespective of the data type used (strings in this case, as opposed to integers in the earlier examples), a Python expression evaluates its data (“Hello” and “World”) to a final value (‘HelloWorld’). Thus “Hello” + “World” is also a Python expression.

On the other hand, take a look at this example:

>>> result = "Hello" + "World"
>>> result
'HelloWorld'
>>>

Here we assign the final evaluated expression value to a variable, ‘result’. This kind of command, where a value is assigned to a variable, is called a Python statement.

So, in other words, we can see that a Python statement typically contains one or more Python expressions.

Expression Vs Statement

  • Expression
    • An expression always evaluates to a value
    • Function calls are also expressions. Even a function without a return statement still returns the value None, so a call to it is an expression (illustrated in the sketch after this list).
    • The resulting value can be printed
    • Examples of Python expressions: “Hello” + “World”, 4 + 5 etc.
  • Statement
    • A statement does not evaluate to a value
    • There is no result value to print
    • Examples of Python statements: assignment statements, conditional branching, loops, class definitions, import, def, try, except, pass, del etc.
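
Here is a small sketch you can paste into the interpreter or a script to see the difference in action; the variable names are illustrative:

# An expression evaluates to a value, so it can be printed or assigned:
print(4 + 5)                # 9
print("Hello" + "World")    # HelloWorld

# A function call is an expression too; a function with no return
# statement still yields None:
result = print("hi")        # print() itself returns None
print(result)               # None

# An assignment is a statement: it performs an action but does not
# evaluate to a value, so it cannot be used where a value is expected:
# total = (count = 3)       # SyntaxError if uncommented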

Summary

In simpler terms, anything that evaluates to a value is a Python expression, while anything that does something is a Python statement. Curious to learn more? Follow the other articles on this blog!