
How To Extract Data From A Website Using Python

In this article, we are going to learn how to extract data from a website using Python. Extracting data from a website is called “web scraping” or “data scraping”. We can write programs in languages such as Python to perform web scraping automatically.

In order to understand how to write a web scraper using Python, we first need to understand the basic structure of a website. We have already written an article about it on our website. Take a quick look at it before proceeding to get a sense of it.

The way to scrape a webpage is to find specific HTML elements and extract their contents. So, to write a website scraper, you need a good understanding of HTML elements and their syntax.

Assuming you have a good understanding of these prerequisites, we will now proceed to learn how to extract data from a website using Python.

Python Web Scraper Development

How To Fetch A Web Page Using Python

The first step in writing a web scraper using Python is to fetch the web page from the web server to our local computer. We can do this with urllib, a package that ships with Python as part of its standard library.

Because urllib is bundled with every standard Python 3 installation, there is nothing to install with pip. We can import it directly and start using it to fetch the web page we want to scrape.

For the sake of this tutorial, we are going to extract data from the Wikipedia article on comets, found here:

https://en.wikipedia.org/wiki/Comet

This Wikipedia article contains a variety of HTML elements such as text, images, tables, headings etc. We can extract each of these elements separately using Python.

How To Fetch A Web Page Using The Urllib Python Package

Let us now fetch this web page using the Python library urllib by running the following code:

import urllib.request
content = urllib.request.urlopen('https://en.wikipedia.org/wiki/Comet')

read_content = content.read()

The first line:

import urllib.request

will import the request module of the urllib package into our Python program. We will use this module’s urlopen function to send an HTTP GET request to the Wikipedia server, asking it to return the web page. The URL of the web page is passed as the parameter to urlopen:

content = urllib.request.urlopen('https://en.wikipedia.org/wiki/Comet')

As a result of this, the Wikipedia server responds with the HTML content of the web page. The urlopen call returns a response object, and it is this object that is stored in the Python program’s “content” variable.

The content variable does not hold the HTML itself. It holds a response object that wraps the server’s reply: the HTTP status code and headers along with the body. To get at the HTML, we need to read the body out of this object. We achieve this in the next line of the program by calling the read() method of the response object.

read_content = content.read()

The above line of Python code gives us the complete HTML document as a bytes object. Nothing is filtered out at this stage: the result includes the <head> section with its <meta> tags as well as the human readable content of the page.

At this point in our program we hold the raw HTML of the entire page. It is now time to extract the individual data elements of the web page that we are interested in.
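The fetch step above can be wrapped in a small helper. The sketch below is one possible way to do it, not the only one: the function name fetch_html, the 10 second timeout and the User-Agent string are all assumptions made for illustration. It sends a descriptive User-Agent header because some servers reject requests that carry the default one, and it decodes the bytes returned by read() into text using the charset the server declares, falling back to UTF-8 when none is declared.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Fetch a URL and return the response body decoded to a str.

    The helper name, the timeout and the User-Agent value are
    illustrative assumptions, not part of the original tutorial.
    """
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "tutorial-scraper/0.1"},  # hypothetical value
    )
    # The with-block closes the connection once the body has been read.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()  # bytes: the full, unfiltered response body
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

Calling fetch_html('https://en.wikipedia.org/wiki/Comet') should return the same document as the two-line version above, but as text rather than bytes.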

How To Extract Data From Individual HTML Elements Of The Web Page

In order to extract individual HTML elements from our read_content variable, we need to make use of another Python library called Beautiful Soup. Beautiful Soup is a Python package that understands HTML syntax and elements. Using this library, we will be able to pick out the exact HTML element we are interested in.

We can install the Beautiful Soup package on our local development system using pip. The package is published under the name beautifulsoup4 and is imported in code as bs4:

pip install beautifulsoup4

Once the Beautiful Soup package is installed, we can start using it to extract HTML elements from our web content. Recall that we earlier stored our web content in the Python variable “read_content”. We are now going to pass this variable, along with the parser name ‘html.parser’, to BeautifulSoup so that it can parse the HTML, as shown below:

from bs4 import BeautifulSoup
soup = BeautifulSoup(read_content,'html.parser')

From this point onwards, our “soup” Python variable holds all the HTML elements of the webpage. So we can start accessing each of these HTML elements by using its find and find_all methods.
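The difference between find and find_all is easiest to see on a small page. The sketch below uses a hand-written HTML string (an assumption for illustration, not Wikipedia’s actual markup) so that it runs without a network connection.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, used here only for illustration.
sample_html = """
<html><body>
  <h2>Orbit</h2>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

first_p = soup.find("p")      # the first matching element, or None if absent
all_p = soup.find_all("p")    # a list-like collection of every match

print(first_p.text)           # First paragraph.
print(len(all_p))             # 2
print(soup.find("table"))     # None, because the page has no <table>
```

Since find returns None when nothing matches, it is worth checking the result before reading .text from it.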

How To Extract All The Paragraphs Of A Web Page

For example, if we want to extract the first paragraph of the Wikipedia comet article, we can start by collecting every paragraph using the code:

pAll = soup.find_all('p')

The above code extracts all the paragraphs present in the article and assigns them to the variable pAll. pAll now contains a list of all the paragraphs, so each individual paragraph can be accessed through indexing. To access the first paragraph, we issue the command:

pAll[0].text

The output we obtain is:

'\n'

So the first paragraph only contained a new line. What if we try the next index?

pAll[1].text
'\n'

We again get a newline! Now what about the third element, at index 2?

pAll[2].text
"A comet is an icy, small Solar System body that..."

And now we get the text of the first paragraph of the article! If we continue indexing, we can access each of the remaining HTML <p> elements of the article. In a similar way, we can extract other HTML elements too, as shown in the next section.
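Rather than probing indexes by hand until real text turns up, the whitespace-only paragraphs can be filtered out in one pass. This is a minimal sketch; the helper name non_empty_paragraphs and the sample markup are assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def non_empty_paragraphs(soup):
    """Return the stripped text of every <p> that contains visible text."""
    texts = []
    for p in soup.find_all("p"):
        text = p.get_text().strip()
        if text:  # skip paragraphs that hold only whitespace or newlines
            texts.append(text)
    return texts


# Hand-written markup that mimics two empty paragraphs before the real one.
sample = BeautifulSoup(
    "<p>\n</p><p>\n</p><p>A comet is an icy body.</p>", "html.parser"
)
print(non_empty_paragraphs(sample))  # ['A comet is an icy body.']
```

Applied to the soup built from the comet article, the first item of the returned list should be the opening paragraph of the article.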

How To Extract All The H2 Elements Of A Web Page

Extracting the H2 elements of a web page works in the same way as it did for the paragraphs earlier. By simply issuing the following command:

h2All = soup.find_all('h2')

we can filter and store all H2 elements into our h2All variable.

With this, we can now access each of the h2 elements by indexing the h2All variable:

>>> h2All[0].text
'Contents'
>>> h2All[2].text
'Physical characteristics[edit]'
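The “[edit]” suffix in the second result appears to come from the edit link that sits inside the section heading, so it ends up in the heading’s text. If only the heading itself is wanted, a small helper can trim it. The function name clean_heading and its default suffix are assumptions made for illustration.

```python
def clean_heading(text, suffix="[edit]"):
    """Strip surrounding whitespace and a trailing suffix such as '[edit]'."""
    text = text.strip()
    if text.endswith(suffix):
        text = text[: -len(suffix)].rstrip()
    return text


print(clean_heading("Physical characteristics[edit]"))  # Physical characteristics
print(clean_heading("Contents"))                        # Contents
```

It can be applied to every heading at once with a list comprehension such as [clean_heading(h.text) for h in h2All].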

Conclusion

So there you have it. This is how we extract data from a website using Python, by making use of two important libraries: urllib and Beautiful Soup.

We first pull the web page content from the web server using urllib, and then we run Beautiful Soup over the content. Beautiful Soup then provides us with many useful tools (find_all, text etc.) to extract individual HTML elements of the web page. By making use of these, we can address individual elements of the web page.

So far we have seen how we could extract paragraphs and h2 elements from our web page. But we do not stop there. We can extract any type of HTML element using a similar approach, be it images, links, tables etc. If you want to verify this, check out this other article where we have taken a similar approach to extract table elements from another Wikipedia article.

How to scrape HTML tables using Python
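As a quick illustration of the same approach applied to links and images, here is a minimal sketch. The sample markup is hand-written for this example and is not taken from Wikipedia.

```python
from bs4 import BeautifulSoup

# Hand-written markup used only for illustration.
sample_html = """
<p>See <a href="/wiki/Asteroid">asteroids</a> and <a>an anchor without href</a>.</p>
<img src="/img/comet.jpg" alt="A comet">
"""

soup = BeautifulSoup(sample_html, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]

# .get() returns None instead of raising an error if an attribute is missing.
images = [img.get("src") for img in soup.find_all("img")]

print(links)   # ['/wiki/Asteroid']
print(images)  # ['/img/comet.jpg']
```

Attributes of an element are read like dictionary entries, whereas .text gives the content between its opening and closing tags.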


Tutorial – Setting up an Ubuntu 16.04 VPS Instance on Vultr

In this article, we will learn how to create a simple Virtual Private Server (VPS) running the 64-bit Ubuntu 16.04 operating system.

In case you have not created an account on Vultr yet, you can do so by visiting the link in the next paragraph and get $50 worth of free credits that you can use to create and run your Vultr Ubuntu VPS instances.

Get Vultr VPS Worth $50 FOR FREE

If you too would like to use a Vultr VPS instance (which I strongly advise) while following this tutorial series, you can use the following link to create your Vultr account and get $50 in free credit, which is more than sufficient to use and learn all about Linux, web development and much more for free!

Once you have created your Vultr account using the link above, log in to your account by visiting the Vultr Web App. Once you have logged in, you should land on the Products tab, which lists all the Vultr instances you have created until now. Since your Vultr account is new, you will not have any instances listed there.

Vultr Dashboard displaying list of active Vultr instances created until now

But do not worry, this is about to change now.

Create a new Vultr VPS by clicking on the link that reads “deploy now” at the end of the Vultr dashboard web page. You will then be taken to a new web page as shown below:

Vultr Deployment screen where new VPS instance can be created and deployed

Do not get perplexed by such a long web page with numerous options. While they may look baffling at first, the page is actually pretty easy to use for creating your first Vultr VPS instance. We will go through each of these options step by step so that it is easier for you to follow and replicate.

Step 1 – Choose Server: Vultr does not only deploy VPS servers; it also offers other products, including Bare Metal machines and Dedicated Cloud. However, in our case, we are only interested in deploying a simple VPS server running 64-bit Ubuntu 16.04. Hence, we will simply choose the “Cloud Compute” option, which is what creates a VPS server.

Step 2 – Server Location: Next, we need to select where we want our VPS server to reside. Vultr has data centers spread across the globe, so we can choose from the cities listed in this option. Choose the one closest to you and to your web app’s visitors. The closer the server is, the shorter the turnaround time (the time taken for the server’s response to be received).

Step 3 – Server Type: In this option, we need to choose the Operating System (OS) we want to use. If, for example, we want to install 64-bit Ubuntu 16.04, we select it here.

Step 4 – Server Size: Next comes the size of the VPS server you want to deploy. This depends on a number of factors, such as the amount of data your app is going to store, the amount of traffic it gets, the speed of the CPU and the number of cores it has. For tutorials and experiments, I usually just use the default selection, a $10 per month VPS instance, which gives comfortable performance for my requirements.

Step 5 – Additional Features: These are some advanced options which are not selected by default. They include IPv6 network addresses, backups etc., which I usually leave in their default unchecked state (in other words, I do not use them).

Step 6 – Startup Scripts: This option is useful if you need to run any additional scripts at the startup of your VPS instance. I have never used it until now so I may not be the right person to comment much about it! Sorry!! 😛

Step 7 – SSH Keys: SSH keys are cryptographic key pairs used to set up a secure shell (SSH) connection between your laptop/computer and your Vultr VPS server. In this option, you can generate SSH keys (using this tutorial) on your laptop, take the public part of the key pair and paste it here on the Vultr dashboard. This way, you will not need to type in a login and password every time you want to connect to your Vultr VPS server from your laptop’s command prompt.

Step 8 – Server Hostname & Label: Finally, you can create a new Hostname and label for your VPS server. This will result in the Vultr dashboard displaying this instance of the VPS server using this Hostname & Label.
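For Step 2, one rough way to compare candidate locations is to time how long it takes to open a TCP connection to a server in each region. The sketch below is an illustration under stated assumptions: the function name tcp_connect_time and its defaults are made up for this example, and the host you pass must be a server you already know is located in the region you are considering (none is supplied here).

```python
import socket
import time


def tcp_connect_time(host, port=443, timeout=3.0):
    """Return the seconds taken to open a TCP connection to host:port.

    This is only a rough proxy for network turnaround time; it raises
    OSError if the host cannot be reached within the timeout.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # the connection is closed again as soon as it is established
    return time.perf_counter() - start
```

Running it a few times per host and comparing the lowest values gives a steadier picture than a single measurement.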

Once you have filled in all the above details in your Vultr dashboard, you can click on the “Deploy Now” button to create and deploy your Vultr VPS server. It may take a few minutes after you click the button, after which your Vultr VPS server should be ready for use!

Hope this article gave you an insight into how to create and deploy a new Vultr VPS instance using Vultr dashboard. If you have any queries or any feedback on this article, do let me know in the comments below. Until next time, happy coding! 🙂


Best Visual Studio Code Extensions For HTML (VS Code Extensions)

While looking for the best Visual Studio Code extensions for my web development activities, I came across a plethora of VS Code extensions made available by various developers, not just for web development but for various other types of programming activities as well.

In this article, I will list a few of these Visual Studio Code extensions suitable for HTML coding activities.

List of Visual Studio Code extensions for HTML

IntelliSense (Built-in, no extension required)

I agree that this blog post set out to showcase a list of HTML Visual Studio Code extensions, but I would be doing a disservice to the developers of VS Code if I did not mention the excellent support that is built into Visual Studio Code itself via IntelliSense. VS Code IntelliSense provides suggestions and auto-completion for basic HTML tags.

Visual Studio Code’s IntelliSense auto-completion support for HTML

Emmet Feature In VS Code (Again, built in, no extension required)

Emmet is my next go-to feature. It is now built into VS Code, and I highly recommend it to everyone who writes HTML in Visual Studio Code.

One of the main functions of Emmet in VS Code is to provide short abbreviations for most HTML code.

Say, for example, you are about to create a new HTML page that you want to be mobile friendly and that includes the basic structure of an HTML page, such as the UTF charset metadata, the viewport setting, the language attribute etc. You can do so by simply typing “html:5” at the beginning of the document and pressing the TAB key. This triggers Emmet’s abbreviation feature, which autocompletes the basic structure of a web page as shown in the GIF below:

Emmet’s Abbreviation feature in action for HTML 5

HTML5-Boilerplate VS Code HTML Extension

The next Visual Studio Code extension for HTML deals specifically with HTML 5 and is called the HTML5-Boilerplate VS Code extension.

The HTML5-Boilerplate VS Code extension is very similar to the Emmet feature we discussed earlier, but differs in that it deals specifically with generating boilerplate code for HTML 5. Below is a GIF showing the HTML5-Boilerplate Visual Studio Code (VS Code) extension in action:

Visual Studio Code (VS Code) HTML5-Boilerplate extension in action
Installation Code: ext install sidthesloth.html5-boilerplate

HTML Live Preview VS Code HTML Extension

HTML Live Preview is another Visual Studio Code extension that, as the name suggests, lets its users see a live preview of their HTML web page during development. What is interesting is that the HTML Live Preview VS Code extension does this in real time, as shown in the GIF below:

Visual Studio HTML Live Preview extension in action
Installation Code: ext install hdg.live-html-previewer

These are some of the HTML extensions for the Visual Studio Code editor that I have come across so far. I am pretty sure I have missed many more useful HTML extensions for VS Code. I will continue to add them to my toolkit as I discover them and update this article accordingly. For now, these extensions are bound to make my life easier while writing HTML for my web development activities.

If you are aware of any more Visual Studio Code extensions for HTML that you found useful and think I should try and recommend to others, do let me know in the comments below and I will definitely look into them.

Till then, happy coding!