Categories
DATA MINING DATA SCIENCE HTML JAVASCRIPT PROGRAMMING PYTHON STATIC WEBSITES TUTORIALS WEB DEVELOPMENT WEB SCRAPING

How To Extract Data From A Website Using Python

In this article, we are going to learn how to extract data from a website using Python. The term used for extracting data from a website is called “Web scraping” or “Data scraping”. We can write programs using languages such as Python to perform web scraping automatically.

In order to understand how to write a web scraper using Python, we first need to understand the basic structure of a website. We have already written an article about it here on our website. Take a quick look at it once before proceeding here to get a sense of it.

The way to scrape a webpage is to find specific HTML elements and extract its contents. So, to write a website scraper, you need to have good understanding of HTML elements and its syntax.

Assuming you have good understanding on these per-requisites, we will now proceed to learn how to extract data from website using Python.

Python logo on extracting data from a web page using Python
Python Web Scraper Development

How To Fetch A Web Page Using Python

The first step in writing a web scraper using Python is to fetch the web page from web server to our local computer. One can achieve this by making use of a readily available Python package called urllib.

We can install the Python package urllib using Python package manager pip. We just need to issue the following command to install urllib on our computer:

pip install urllib

Once we have urllib Python package installed, we can start using it to fetch the web page to scrape its data.

For the sake of this tutorial, we are going to extract data from a web page from Wikipedia on comet found here:

https://en.wikipedia.org/wiki/Comet

This wikipedia article contains a variety of HTML elements such as texts, images, tables, headings etc. We can extract each of these elements separately using Python.

How To Fetch A Web Page Using Urllib Python package.

Let us now fetch this web page using Python library urllib by issuing the following command:

import urllib.request
content = urllib.request.urlopen('https://en.wikipedia.org/wiki/Comet')

read_content = content.read()

The first line:

import urllib.request

will import the urllib package’s request function into our Python program. We will make use of this request function send an HTML GET request to Wikipedia server to render us the webpage. The URL of this web page is passed as the parameter to this request.

content = urllib.request.urlopen('https://en.wikipedia.org/wiki/Comet')

As a result of this, the wikipedia server will respond back with the HTML content of this web page. It is this content that is stored in the Python program’s “content” variable.

The content variable will hold all the HTML content sent back by the Wikipedia server. This also includes certain HTML meta tags that are used as directives to web browser such as <meta> tags. However, as a web scraper we are mostly interested only in human readable content and not so much on meta content. Hence, we need extract only non meta HTML content from the “content” variable. We achieve this in the next line of the program by calling the read() function of urllib package.

read_content = content.read()

The above line of Python code will give us only those HTML elements which contain human readable contents.

At this point in our program we have extracted all the relevant HTML elements that we would be interested in. It is now time to extract individual data elements of the web page.

How To Extract Data From Individual HTML Elements Of The Web Page

In order to extract individual HTML elements from our read_content variable, we need to make use of another Python library called Beautifulsoup. Beautifulsoup is a Python package that can understand HTML syntax and elements. Using this library, we will be able to extract out the exact HTML element we are interested in.

We can install Python Beautifulsoup package into our local development system by issuing the command:

pip install bs4

Once Beautifulsoup Python package is installed, we can start using it to extract HTML elements from our web content. Hope you remember that we had earlier stored our web content in the Python variable “read_content“. We are now going to pass this variable along with the flag ‘html.parser’ to Beautifulsoup to extract html elements as shown below:

from bs4 import BeautifulSoup
soup = BeautifulSoup(read_content,'html.parser')

From this point on wards, our “soup” Python variable holds all the HTML elements of the webpage. So we can start accessing each of these HTML elements by using the find and find_all built-in functions.

How To Extract All The Paragraphs Of A Web Page

For example, if we want to extract the first paragraph of the wikipedia comet article, we can do so using the code:

pAll = soup.find_all('p')

Above code will extract all the paragraphs present in the article and assign it to the variable pAll. Now pAll contains a list of all paragraphs, so each individual paragraphs can be accessed through indexing. So in order to access the first paragraph, we issue the command:

pAll[0].text

The output we obtain is:

\n

So the first paragraph only contained a new line. What if we try the next index?

pAll[1].text
'\n'

We again get a newline! Now what about the third index?

pAll[2].text
"A comet is an icy, small Solar System body that..."

And now we get the text of the first paragraph of the article! If we continue further with indexing, we can see that we continue to get access to every other HTML <p> element of the article. In a similar way, we can extract other HTML elements too as shown in the next section.

How To Extract All The H2 Elements Of A Web Page

Extracting H2 elements of a web page can also be achieved in a similar way as how we did for the paragraphs earlier. By simply issuing the following command:

h2All = soup.find_all('h2')

we can filter and store all H2 elements into our h2All variable.

So with this we can now access each of the h2 element by indexing the h2All variable:

>>> h2All[0].text
'Contents'
>>> h2All[2].text
'Physical characteristics[edit]'

Conclusion

So there you have it. This is how we extract data from website using Python. By making use of the two important libraries – urllib and Beautifulsoup.

We first pull the web page content from the web server using urllib and then we use Beautifulsoup over the content. Beautifulsoup will then provides us with many useful functions (find_all, text etc) to extract individual HTML elements of the web page. By making use of these functions, we can address individual elements of the web page.

So far we have seen how we could extract paragraphs and h2 elements from our web page. But we do not stop there. We can extract any type of HTML elements using similar approach – be it images, links, tables etc. If you want to verify this, checkout this other article where we have taken similar approach to extract table elements from another wikipedia article.

How to scrape HTML tables using Python

Categories
HTML JAVASCRIPT LAMP PHP STATIC WEBSITES TUTORIALS WEB DEVELOPMENT WEB SERVER

Beginners Tutorial – What Is A Website?

We have all heard about different websites present on the internet such as Google, Youtube, Facebook, Twitter etc. But what exactly is a website? What is it made up of?

Imagine you just found a 100 year old book in your attic that contains a wealth of information about the world wars. It also has a collection of photographs that revealed some secrets that are lesser known to the public. You wanted to share this information with the world, but how?

One way is to go to the newspapers and get it published, but it can still not reach all the people across the world. What would be an easy and best possible way to share this information to people around the world? Publish it on the internet!

When you publish it on the internet, anyone from around the world having a computer or a smartphone with an access to the internet will be able to consume your content.

You can publish this information on social networking websites such as Facebook, Twitter or Youtube or create your own website to publish it.

A website example

So, a website, simply put, is a collection of information present in different formats such as texts, images, videos, graphs that is published on the internet to helps its users consume them.

But just like you, there are millions of people who are sharing information on the internet, so how do you make your contents accessible from other people’s information? This problem is very similar to having thousands of building in a city and needing to find a particular building. How do we do that? We will discuss about this in the future articles

Categories
JAVASCRIPT STATIC WEBSITES TUTORIALS WEB BROWSER WEB DEVELOPMENT WEB SERVER

This one value in Javascript is not equal to itself!

We know that Javascript supports all kinds of values such as strings, numbers, constants etc. All these values are deterministic values in that their weights always remains the same. For example, an integer value of 25 is always equal to 25 in Javascript no matter what. Similarly a string value of “Hello” is also always equal to another string “Hello”.

In other words, these values in Javascript can always be compared with another value to determine if they are the same or different. To understand this better with an example, let us open up our browser console. In my case I am using Google Chrome browser console where we will create 3 variables with these values:

We can note from the above Javascript demo video that when the variable a and c are compared, since both their values are same holding a value of 2, they return true when compared with each other. On the other hand when variable a was compared with b, since their values were different, the comparison resulted in a return value of false.

This is true for all type of values present in Javascript – be it string, integers, floats, booleans anything you can think of.

However, there exists one special value in Javascript that is never equal to another variable having the same value. In other words its value is never equal to itself. This value is the NaN value!

NaN in Javascript stands for “Not a Number” and it is that one special value in Javascript which does not return true if it is compared with itself.

Why does NaN not equal to itself in Javascript?

Now you might be wondering why a NaN value does not equal to itself? The answer for this lies in the way Javascript language has been designed.

NaN or Not a Number is a special value in Javascript which is used to represent a nonsensical value – that is it is the value returned whenever a non sensical operation is performed. Now this is where it gets interesting. Why does a non sensical value not be the same all the time or at all the place? In other words, why is this happening here:

Why is it returning false?

The answer is that NaN as mentioned earlier is a value that is used to represent a non sensical values. So if the result of an operation performed is something that cannot be represented by ordinary or normal values, Javascript returns a value of NaN.

Now, if two operations results in non sensical values, they are not necessarily equal. Each of these operations can be returning two non sensical values of different weights. However, they both need to be represented by the value of NaN. Hence, the Javascript treats two NaNs as two different values and never equal to each other.

I learnt about this and many other similar anomalies in the book Eloquent Javascript. This is a very good book to learn and understand such interesting things about Javascript so I will definitely recommend this to anyone interesting in learning Javascript in depth.

If you are also aware of any other similar interesting things about Javascript do let me know in the comments below. Until then, happy coding! 🙂

Categories
HTML STATIC WEBSITES TUTORIALS WEB DEVELOPMENT WORDPRESS

Difference Between Link & Anchor Tag In HTML

HTML provides us with 2 different tags called the Anchor tag (represented by <a>) and the Link tag (represented by <link>). But what is the difference between the two? Can they be interchanged with one another? We will explore these queries in this article.

What is an HTML Anchor tag?

An anchor tag is used in an HTML file to help us link our HTML file’s texts to certain other web pages. In other words, it helps in the creation of hyper texts (<a>these are the hyper texts</a>) in HTML documents.

These other web pages that is linked into by the anchor tags can be present within the same website, or can be residing in a different website on a different web server altogether. But what is important is that the Anchor tags <a></a> are responsible for linking different web documents together through hyper texts and help user navigate from one web page to other in a simple mouse click!

It must be noted that all anchor tags are present only within the <body></body> section of an HTML page. They are never used in the <head></head> section of the HTML page. This is because the primary function of the Anchor tags <a></a> is to let web page users to move from one page to the other page. As the web page users can never see the content of <head> section, anchor tags are never used there!

So, in other words, we can say that Anchor tags <a> are used in the user visible part of an HTML web page so that the web page user can click on the Hyper Texts created by these Anchor tags and move from one web page to another.

What is an HTML Link tag?

On the other hand, our web page is not just made up of HTML web pages, but consists of one or more of Cascading Style Sheets (CSS) that are used to style the web page. It also consists of one or more Javascript files to provide interactive functionalities to the web page. These CSS files and the Javascript files are thus part and parcel of the web page and hence need to be linked to the web page. This is achieved using the <link> HTML tag.

The <link> HTML tag is hence used to link up all external resources such as CSS and Javascript files associated with a web page. This linking does not need to be seen by the final user as he only needs to get an integrated HTML web page rendered. As a result, the linking of these files, done through the use of <link> tag is done in the <head> meta section of the web page.

So these were the difference between Link tag Vs Anchor tag in HTML. Each of these tags have very unique functionalities to fulfill in their own roles and cannot be used interchangeably as we suspected in the beginning of the article.

With this, hope it became clear on the differences between Link and Anchor tags’ purpose in the creation of an HTML document. If you have any more queries or doubts between Link and anchor tags, let me know in the comments below and I will try to clarify your doubts.

Until next time, happy coding! 🙂

Categories
HTML STATIC WEBSITES TUTORIALS WEB SERVER WORDPRESS

What does Hyper Text Markup Language (HTML) even mean?

HTML is the language used to develop web pages for the world wide web. But what does HTML even mean? We know that HTML stands for Hyper Text Markup Language, but what does each of these terms even signify? This post will try to discuss upon each of these words and their significance in the world of the web.

What is Hyper Text?

Hyper text refers to the ability of web pages to link with each other, within the same website or between websites on the internet.

What is a Markup Language?

HTML is a markup language. But what does markup language even mean? It means that all the building blocks that make up a HTML web page are made up of HTML elements defined by HTML markups. In other words, HTML markups are special tags used to define all the HTML elements that make up a web page. These HTML markups includes tags like <head>, <title>, <body>, <footer> etc.

Why use semantic HTML markups for your web page?

With the recent HTML 5 specification, there has been an increase in the amount of these HTML markups (or tags) defined by the specification so that there are more semantically meaningful HTML markup tags available to define a web page. These new semantically appropriate HTML markup tags help improve the readability and understanding of the sections of these web pages by machines (aka algorithms) such as search engines.

Categories
CMS HTML STATIC WEBSITES TUTORIALS WEB SERVER WORDPRESS

How HTML Anchor Tag Could Be Used To Perform DDOS Attacks

Chinese attackers have been using HTML Anchor tags to perform DDOS attacks across the world these days. This is one such instance where a seemingly benign feature addition done to the HTML technical specification has inadvertently opened a Pandora box of its misuse/abuse by hackers and attackers.

As mentioned in my introductory article to HTML Anchor Tags, Anchor tags are used to link documents present on the word wide web so that users of a web page can easily navigate to a new web page seamlessly from their web browsers.

An example of HTML code using anchor tags looks something like this:

<a href="https://muddoo.com" title="Muddoo Home">Muddoo</a>

While the above code is a standard way of using HTML anchor tags, there are also additional anchor tag attributes one can use to add new features to the anchor tag’s overall functionality. In our previous article we looked at the noopener attribute that ensures that when the respective anchor links are opened in a new window, they are opened in a separate thread all together and have no relationship to the parent web page in anyways. This ensured that Cross Site Script (XSS) attacks could not be made from child web page to the parent page.

Just like the noopener attribute, we have another attribute associated with the anchor tags that some hackers are misusing to perform DDOS attacks on other websites. This attribute is the “ping” attribute of the anchor tags!

What is HTML Ping Attribute?

Ping is a new attribute of an Anchor tag that was introduced in HTML5 specification. Ping attribute would list a set of one or more URLs that are pinged back whenever a user of a web page follows a hyperlink from that anchor tag.

The idea of introducing Ping attribute to anchor tags was to enable web administrators track clicks on that hyperlink. An example of how this attribute looks like is shown below:

<a href="https://google.com" ping="https://muddoo.com/tracker">Go to Google</a>

So in the above example, whenever a user clicks on “Go to Google” hyperlink, he will be taken to the Google home page, but at the same time, a ping POST message is sent back to the https://muddoo.com/tracker webpage for muddoo.com website to keep track of number of users going to Google through that hyperlink.

But the problem occurred when some of the Chinese hackers started using this innocuous feature to perform DDOS attacks on many websites. They simply created web page with links to standard websites such as Alibaba or Tabao, while using ping back links to their target websites. They specifically targeted people using QQBrowser (from Chinese giant Tencent) to use their web pages to reach standard websites. This resulted in millions of Ping request going back to targeted websites thus acting as a DDOS attack on these websites.

How to prevent Anchor Tag Ping attacks from your web pages?

With good understanding of how the attack is being performed, you must be wondering how you can prevent such DDOS attacks originating from your websites or getting attacked by one. But unfortunately, there are no clear solutions in place as the support for Ping requests are part of HTML 5 specifications so all browsers will be supporting it (well, more or less), so your only best possibilities will be to keep monitoring such activities on your web server and take appropriate action at the right moment.

Hope this article gave good introduction to the possible Ping DDOS attacks happening due to the presence of Ping attribute in the HTML Anchor tags. This article has been part of series of articles that I have been writing about HTML tags with this being the third article on HTML Anchor Tags.

If you would like to take a look at other two articles, you can follow these links:

Introduction To HTML Anchor Tag

What is noopener vulnerability found in anchor tags of HTML?

Until next time, happy coding! 🙂

Categories
HTML STATIC WEBSITES TUTORIALS WEB SERVER WORDPRESS

What is noopener vulnerability found in anchor tags of HTML?

HTML anchor tags are used to link to different web pages available on the internet. We also frequently use “target” attribute with the anchor tags so that the linked web page is opened in a separate new window. This is achieved by using the anchor tag like this:

<a href="https://muddoo.com" target="_blank"title="Muddoo Home">Home</a>

Note that in the above code we set the “target” value to be _blank, which would result in the linked web page (https://muddoo.com in this case) to be opened in a new window.

However, it has been found that this can leave a possible vulnerability where in the remotely linked web page can take over control of your web page.

Why does this vulnerability happen?

This vulnerability of remotely linked web page taking over your web page (that is having the anchor tag) is because of the following reasons:

  1. In normal scenario, whenever you open a new web page in your browser in a new window, the web page is running in its own separate thread.
  2. Now when we open a link present in that web page, the new linked web page gets opened in a new window due to the presence of “target” attribute of the anchor tag. However, in this scenario, the newly opened web page is also running under its parent’s thread itself instead of its own thread.
  3. As a result, the newly opened external web page has controls over its parent’s thread. There by creating a vulnerable situation!

How to overcome anchor tag’s “target” vulnerability?

We can overcome this “target” thread control vulnerability simply by introducing a new attribute to your anchor tags called the rel=”noopener” attribute.

Thus, the new fixed anchor tag would look something like this:

<a href="https://muddoo.com" target="_blank"title="Muddoo Home" rel="noopener">Home</a>

With this simple change, we can ensure that the newly opened web page runs in it’s own thread there by having no link to it’s parent thread in any way!

Hope you are now aware of this possible vulnerability and ensure you start using the rel=”noopener” attributes to all your web pages’ external links!

Happy coding! 🙂

Note: This article is continuation of my previous article Introduction To HTML Anchor Tag

Categories
CMS HTML STATIC WEBSITES TUTORIALS WEB SERVER WORDPRESS

Introduction To HTML Anchor Tag

Anchor tag is an HTML tag that is used to mark the beginning and end of a hyperlink text in the HTML document.

A website is made up of one or more HTML documents that contains all the information parts of the website. But word wide web as a whole mainly works because of the ability of these HTML web documents to link (or refer) to each other. This inter-linking of web pages is achieved by using the HTML anchor tags.

A typical structure of an anchor tag looks like this:

<a href="https://muddoo.com" title="Muddoo Home">Muddoo</a>
Brief structure of an HTML Anchor tag
Typical HTML Anchor tag usage in web documents

From the above, we note that an anchor tag starts and ends between notations like <a> and </a>. In other words, HTML anchor tags have both opening and closing tags. Text between this opening and closing tags is called the anchor text and is responsible for taking the user to a new document upon being clicked. In the above example, “Muddoo” is the anchor text.

But where does the user go on clicking the Anchor text? This is determined by the href attribute of the HTML anchor tag. The url in the href attribute of anchor tag is the destination web page’s address where the user will be taken to.

In addition to href attribute, the HTML anchor tag also has another attribute called “title”. The title attribute of the HTML anchor tag holds a piece of text that the user will see upon hovered over by the mouse. It is also helpful as an accessibility feature for people using screen readers as it gets read out by the screen readers.

Finally, there are also a few other attributes such as “target” attribute which provides additional functions such as determining if the destination web page is to be opened in the same window or a new window. These type of additional attributes can be looked upon in the official w3c html specification document.

But all in all, the Anchor tags are the fundamental elements of the world wide web that weaves the inter-connected paths between various web documents that helps the web users to seamlessly navigate between various websites and documents without any hassles.

Hope this gave a brief introduction to the HTML anchor tags. HTM Anchor tags are tags that are going to be used regularly while creating a HTML web page so having a clear understanding of its structure and how it works becomes essential. In the same line, I will continue to document more about other HTML tags in the future that are bare essential for web development.

Until then, happy coding!

Categories
HTML STATIC WEBSITES TUTORIALS WEB SERVER WORDPRESS

What is DOCTYPE in HTML? Why do we use it?

When we sit down to write an html document, one of the first line of code we write is <!DOCTYPE html>. But what does this line do? What is significance of this line to a web browser? What happens if we miss including the DOCTYPE tag? Is it even a tag in the first place? We will answer these questions in this article.

What is DOCTYPE in HTML?

DOCTYPE is a type of directive that tells our web browser what type of document it is dealing with. As there are multiple versions of HTML documents that a web browser need to deal with, each following a different version of HTML definition standards or non HTML documents such as XML files, mentioning the “type” of this particular document helps a web browser to decide and adjust itself to render the specified document appropriately.

What happens if the DOCTYPE is not mentioned in an HTML document?

Specifying the DOCTYPE of a document will help a web browser to make appropriate decisions in rendering that file to its user successfully. In the event a web document does not have appropriate DOCTYPE specified, the web browser will try to make a best guess and try to render the document accordingly. However, a result of this could be that the rendering might not be happening in the most optimal way and as a result, some of the documents might not get rendered properly.

Is DOCTYPE even a HTML tag?

Actually, No! 😮

DOCTYPE according to HTML specification is not a HTML tag, but a declaration for the browser to make use of.

Does DOCTYPE have an end tag?

The answer is NO. DOCTYPE is not a HTML tag and does not have an explicit end tag to itself.

In the earlier days, web documents used DOCTYPE declaration effectively to let the browser know what type of HTML standard specification the document was following. As a result, the first line of a web page that declared the DOCTYPE had a very lengthy string to it, something like this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

But these days, with most of the websites using HTML 5 specification, the actual usage of DOCTYPE has become more or less redundant and hence we can get away by just declaring the DOCTYPE as:

<!DOCTYPE html>

and browser will still be able to handle it perfectly.

Hope this gave a bit of clarity on some of the doubts you had around DOCTYPE declaration in HTML. If you have an inputs or queries regarding it, don’t hesitate to ask about it in the comment section below!

Until then, Happy Coding!

Categories
STATIC WEBSITES TUTORIALS

Getting started with Pelican: One click installer to install Pelican

Pelican is a Python based static website generator written in Python. Using Pelican, one can start creating static websites that can later be deployed to a simple file web server on the cloud. Some of the cloud web service providers include Amazon Web Services, Digital Ocean, Vultr etc. One can also host these static websites on a static host providers such as Netlify, Contentful etc. But first, lets understand more about static websites and how to use Python’s Pelican to create your static website.

What is a static website and why should you use one?

The internet today is made up of both dynamic websites as well as static websites. A dynamic website is one which usually consists of a database and the server creates dynamic html web pages on the fly, usually specific to the user who requested it. On the contrary, a static website is made up of contents that are just that – static and is served as the same to all its requesting users.

So which one should you be using for your usecase? A static website or a dynamic website? To answer this question, you should first look into this article that discusses the advantages and disadvantages of a static vs dynamic website.

So with the above introduction, its time to move into the technical aspects of Pelican. First let us discuss about the installation aspect of Pelican.

How To Install Pelican

Jump to the end of this article if you just want a one click installer to install and try Pelican

In order to install Pelican, you need to have both pip and Python installed on your system. If you dont have them installed, you can do so using the following commands:

For Ubuntu

sudo apt-get install python3 python3-pip

For Fedora Linux

sudo yum install python3 python3-pip

In this installation process, we are using Python 3 version. However note that Pelican works on both Python 2.7 as well as latest version of Python 3, so which one to use is solely left to your discretion.

Once Python and Pip are installed, we can proceed with installing Pelican onto our computers. To do so, we issue the following command:

pip3 install pelican markdown

We can note here that we are installing two Python packages from pip, one is the Pelican static site generator and the other is a markdown package. If you are unfamiliar with markdown, it is a set of standard markup language used to write contents in a way that can later be processed to format the content it surrounds. You can read more about Markdown on Wikipedia.

Once they are installed, we can create a new directory using command line to store our project files. In this case, we are creating a directory called Pelican_Demo and then moving to it.

mkdir Pelican_Demo
cd Pelican_Demo

Once inside the newly created directory, we start creating our Pelican website. To do so, we call a Python executable script called pelican-quickstart that was installed to us in our /usr/local/bin directory. So we can run this script simply by calling it as follows:

pelican-quickstart

This would kick start our Pelican static website generator which then proceeds with a series of questions that you need to answer to finally create your static website.

What these set of questions actually does to your Python based Pelican static website will be a topic for another post. But for now, you should be good to go using your website.

If you want to just get your hands dirty and try to get Pelican up and running without wanting to dig deeper into investigating how it works, then you can use the following script to get going.

https://github.com/digitallyamar/Python3-Pelican-Installer

This script will install all the required packages and answer all the questions of pelican-quickstart automatically for you so that you can simply run it and jump to view the newly created Pelican static website. Follow the instructions given in that Python3-Pelican-Installer github project to get it up and running in no time to get a taste of what Pelican static website looks and feels like.

Good luck!