Chinese attackers have recently been using HTML anchor tags to perform DDoS attacks across the world. This is a case where a seemingly benign feature added to the HTML specification has inadvertently opened a Pandora’s box of misuse by hackers and attackers.
As mentioned in my introductory article on HTML anchor tags, anchor tags are used to link documents on the World Wide Web so that users of a web page can navigate to a new web page seamlessly from their web browsers.
An example of HTML code using anchor tags looks something like this:
While the above code is the standard way of using HTML anchor tags, there are additional attributes one can use to change how an anchor tag behaves. In our previous article we looked at the noopener attribute, which ensures that when a link is opened in a new window, the new page gets no reference back to the parent web page (its window.opener is null). This ensures that the child web page cannot redirect or otherwise manipulate the parent page, an attack usually known as reverse tabnabbing.
Just like the noopener attribute, there is another anchor tag attribute that some hackers are misusing to perform DDoS attacks on other websites. This attribute is the “ping” attribute of the anchor tag!
What is HTML Ping Attribute?
Ping is an anchor tag attribute that was introduced in the HTML5 specification. The ping attribute holds a space-separated list of one or more URLs that the browser notifies whenever a user follows the hyperlink of that anchor tag.
The idea behind the ping attribute was to let web administrators track clicks on a hyperlink. An example of this attribute is shown below:
<a href="https://google.com" ping="https://muddoo.com/tracker">Go to Google</a>
So in the above example, whenever a user clicks on the “Go to Google” hyperlink, they are taken to the Google home page. At the same time, the browser sends a POST request (the “ping”) to https://muddoo.com/tracker so that muddoo.com can keep track of the number of users going to Google through that hyperlink.
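To make the tracking side concrete, here is a minimal Python sketch of how a tracker endpoint might recognise and count such pings. It assumes the browser follows the HTML specification and sends the ping as a POST request with a Content-Type of text/ping, a body of PING, and (where the browser chooses to send them) Ping-From and Ping-To headers. The names describe_ping, record_click and clicks are hypothetical helpers for illustration, not part of any library.

```python
from collections import Counter

PING_CONTENT_TYPE = "text/ping"

# Hypothetical in-memory click counter, keyed by the link destination.
clicks = Counter()


def describe_ping(method, headers, body):
    """Return (ping_from, ping_to) if the request looks like a ping, else None."""
    # HTTP header names are case-insensitive, so normalise them first.
    normalised = {name.lower(): value for name, value in headers.items()}
    content_type = normalised.get("content-type", "").split(";")[0].strip().lower()
    if method.upper() != "POST" or content_type != PING_CONTENT_TYPE:
        return None
    if body.strip() != b"PING":
        return None
    return normalised.get("ping-from"), normalised.get("ping-to")


def record_click(method, headers, body):
    """Count the click if the request is a ping; return True when counted."""
    ping = describe_ping(method, headers, body)
    if ping is None:
        return False
    clicks[ping[1]] += 1
    return True
```

A real tracker would call record_click from whatever web framework serves the /tracker URL; the sketch keeps that part out so the logic stays easy to follow.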
But the problem arose when some Chinese hackers started using this innocuous feature to perform DDoS attacks on many websites. They simply created web pages with links to well-known websites such as Alibaba or Taobao, while pointing the ping URLs at their target websites. They specifically targeted people using QQBrowser (from the Chinese giant Tencent) to reach those well-known websites through their web pages. This resulted in millions of ping requests going to the targeted websites, which amounted to a DDoS attack on them.
How to prevent Anchor Tag Ping attacks from your web pages?
Now that you understand how the attack is performed, you must be wondering how to prevent such DDoS attacks from originating on your website, or how to avoid being the target of one. Unfortunately, there is no clear solution in place: support for ping requests is part of the HTML5 specification, so most browsers support it (well, more or less). Your best option is to keep monitoring your web server for such activity and take appropriate action at the right moment.
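As one example of such an “appropriate action”, a site that never expects ping requests could simply refuse them at the server. The sketch below is a minimal WSGI middleware in Python that does this. It assumes pings arrive with a Content-Type of text/ping or with Ping-From / Ping-To headers, as the HTML specification describes; block_pings and demo_app are hypothetical names used only for illustration, and this is not a complete DDoS defence.

```python
def block_pings(app):
    """Wrap a WSGI app so that hyperlink-auditing pings receive a 403 response."""
    def middleware(environ, start_response):
        # WSGI exposes Content-Type as CONTENT_TYPE and other headers as HTTP_*.
        content_type = environ.get("CONTENT_TYPE", "").split(";")[0].strip().lower()
        has_ping_headers = "HTTP_PING_FROM" in environ or "HTTP_PING_TO" in environ
        if content_type == "text/ping" or has_ping_headers:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"ping requests are not accepted\n"]
        # Anything else is passed through to the wrapped application untouched.
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Stand-in for the real website (assumption for this example)."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


application = block_pings(demo_app)
```

Rejecting the request early keeps the cost per ping low, but the requests still reach your server, so rate limiting or filtering further upstream may still be needed.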
I hope this article gave a good introduction to the DDoS attacks made possible by the ping attribute of HTML anchor tags. This article is part of a series that I have been writing about HTML tags, and it is the third article on HTML anchor tags.
If you would like to take a look at the other two articles, you can follow these links:
HTML anchor tags are used to link to different web pages available on the internet. We also frequently use the “target” attribute with anchor tags so that the linked web page is opened in a separate new window. This is achieved by using the anchor tag like this:
<a href="https://muddoo.com" target="_blank">Muddoo</a>
Note that in the above code we set the “target” value to _blank, which causes the linked web page (https://muddoo.com in this case) to be opened in a new window.
However, it has been found that this can leave a vulnerability wherein the remotely linked web page can take control of your web page.
Why does this vulnerability happen?
The remotely linked web page can take over your web page (the one containing the anchor tag) for the following reasons:
In the normal scenario, whenever you open a web page in a new browser window yourself, the web page runs in its own separate context and knows nothing about your other windows.
When we follow a link in that web page, however, the linked web page is opened in a new window because of the anchor tag’s “target” attribute. In this scenario the newly opened page is not fully separated from its parent: it receives a reference to the page that opened it (window.opener), and in some browsers it even runs in its parent’s process instead of its own.
As a result, the newly opened external web page has some control over its parent. For example, it can navigate the parent to a different URL, thereby creating a vulnerable situation!
How to overcome anchor tag’s “target” vulnerability?
We can overcome this “target” vulnerability simply by adding the rel=”noopener” attribute to your anchor tags.
Thus, the fixed anchor tag would look something like this:
<a href="https://muddoo.com" target="_blank" rel="noopener">Muddoo</a>
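If you have many existing pages, it can help to check them for links that still need this fix. The following is a minimal sketch using Python’s standard html.parser module. It reports the href of every anchor tag that sets target to _blank without rel containing noopener (or noreferrer, which also implies noopener). The names BlankTargetAuditor and find_unsafe_links are hypothetical, chosen for this example.

```python
from html.parser import HTMLParser


class BlankTargetAuditor(HTMLParser):
    """Collect links that open a new window without rel="noopener"."""

    def __init__(self):
        super().__init__()
        self.unsafe_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, e.g. "noopener nofollow".
        rel_values = (attributes.get("rel") or "").lower().split()
        opens_new_window = (attributes.get("target") or "").lower() == "_blank"
        is_protected = "noopener" in rel_values or "noreferrer" in rel_values
        if opens_new_window and not is_protected:
            self.unsafe_links.append(attributes.get("href"))


def find_unsafe_links(html):
    """Return the href of every unprotected target="_blank" link in html."""
    auditor = BlankTargetAuditor()
    auditor.feed(html)
    return auditor.unsafe_links
```

Calling find_unsafe_links on the text of a page returns a list of link addresses to review; an empty list means no unprotected target="_blank" links were found.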
Anchor tag is an HTML tag that is used to mark the beginning and end of a hyperlink text in the HTML document.
A website is made up of one or more HTML documents that contain all of the website’s information. But the World Wide Web as a whole works mainly because these HTML documents can link (or refer) to each other. This inter-linking of web pages is achieved by using HTML anchor tags.
A typical structure of an anchor tag looks like this:
From the above, we note that an anchor tag starts with <a> and ends with </a>. In other words, HTML anchor tags have both opening and closing tags. The text between these opening and closing tags is called the anchor text, and clicking on it takes the user to a new document. In the above example, “Muddoo” is the anchor text.
But where does the user go on clicking the anchor text? This is determined by the href attribute of the HTML anchor tag. The URL in the href attribute is the address of the destination web page to which the user will be taken.
In addition to the href attribute, the HTML anchor tag also has an attribute called “title”. The title attribute holds a piece of text that the user sees when hovering over the link with the mouse. Some screen readers may also read it out, although this behaviour is not consistent enough to rely on as an accessibility feature.
Finally, there are a few other attributes, such as the “target” attribute, which determines whether the destination web page is opened in the same window or a new window. These additional attributes are described in the official W3C HTML specification document.
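To see these parts of an anchor tag separately, here is a small sketch that uses Python’s standard html.parser module to pull the href, title, target and anchor text out of a piece of HTML. The class name AnchorCollector and the sample title text are assumptions made for this example.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Record the attributes and anchor text of every <a> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._current = None  # the anchor we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attributes = dict(attrs)
            self._current = {
                "href": attributes.get("href"),
                "title": attributes.get("title"),
                "target": attributes.get("target"),
                "text": "",
            }

    def handle_data(self, data):
        # Text between <a> and </a> is the anchor text.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.anchors.append(self._current)
            self._current = None


collector = AnchorCollector()
collector.feed('<p>Visit <a href="https://muddoo.com" title="Muddoo home page">Muddoo</a> today.</p>')
print(collector.anchors)
```

Attributes that are not present on the tag, such as target in this sample, come back as None.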
All in all, anchor tags are the fundamental elements of the World Wide Web: they weave the inter-connected paths between web documents that let users navigate between websites and documents without any hassle.
I hope this gave a brief introduction to HTML anchor tags. Anchor tags are used regularly when creating an HTML web page, so a clear understanding of their structure and how they work is essential. Along the same lines, I will continue to document other HTML tags that are essential for web development.
Git is a version control tool that is used to maintain a continuous set of copies of file(s), with each of these copies having different content built on top of previous content.
In the above diagram, we see a file whose content keeps changing as time progresses, thereby creating different versions of the file from 1 to 4.
Why do we use Git?
If we were working on such a file on our computer and creating its content for the first time, chances are we would write some content, think for a bit, decide to go back and edit some earlier content, then continue, and so on.
As a result, what we normally tend to do is keep saving the file at different points in time under different names, thereby ending up with a set of files that looks like this:
Do you see a problem here?!
Git or any version control for that matter is used to avoid exactly this problem!
Getting started with Git
Git is a version control tool created by Linus Torvalds, the creator of the open source operating system Linux. He initially created this tool to do version control specifically for the Linux kernel source files, but as the tool grew in popularity, mainly for its simplicity and distributed nature, it was soon adopted across software engineering at large.
What do you mean by Git being a distributed system?
Yes, Git is a distributed version control system. What we mean by this is that the complete set of versions of the files and directories is not stored only on a single central server, but is also available to every one of its users as a local copy saved on their own laptop or computer. This way, even if one system stops working due to a technical issue, you will not lose the entire Git repo (as could happen with a central server), because every other user still has a copy of it!
Alright, enough discussion on the theory of Git. Let us go for some hands-on exercises to better understand how Git works and how you could use it in your everyday coding activities.
Hands On With Git – Just Tell Me What To Do!
The first thing to do to get started with Git is to install the tool itself. Depending on the OS you are running on your system, you need to install Git using the appropriate installer from the official Git website (git-scm.com).
Once you have installed Git, you will be able to run Git commands. To ensure it is installed correctly, issue the following command, which checks the installed Git version:
git --version
If it responds with a string like the one shown below, you are all good:
git version 2.7.4
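If you ever need to make this check from a script, the version string is easy to parse. Below is a minimal Python sketch; parse_git_version and installed_git_version are hypothetical helper names, and the second function assumes that the git executable is available on your PATH.

```python
import re
import subprocess


def parse_git_version(output):
    """Turn a string such as 'git version 2.7.4' into the tuple (2, 7, 4)."""
    match = re.search(r"git version (\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError("not a git version string: %r" % output)
    # A missing patch number is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def installed_git_version():
    """Ask the local git executable for its version (requires Git on PATH)."""
    result = subprocess.run(
        ["git", "--version"], capture_output=True, text=True, check=True
    )
    return parse_git_version(result.stdout)
```

Returning a tuple makes comparisons straightforward, for example parse_git_version(text) >= (2, 0, 0).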
We can now start using our Git tool and start working on a real Git repository. For this tutorial, we will make use of a Git Repo that I have created in Github.com. Github.com is a popular public Git repo hosting website where users can create unlimited number of Git repositories.
First step is to clone the Git repository I have on Github called Hands-On-Git.
What is Git Cloning?
Git cloning is the process of pulling a copy of a Git repository from a hosted Git server onto your local computer. This is done using the git clone command.
In our example, issue the following Git clone command:
Wait a second, how did we get that url? Well it so happens that for every repo stored on Github, Github website provides the url of the Git repo that is to be used to clone the repo. This is demonstrated in the GIF below:
After issuing the above git clone command, you should see an output resembling this:
With this, you now have an exact clone of Hands-On-Git repository as available on my Github repo.
What is Git branch?
Every change that you commit to a Git repo is recorded on a specific line of development called a Git branch. By default, all changes are stored on a branch called the “master” branch. However, if you want to work on a separate feature of your code base and are not sure whether it will break anything that currently works, you can develop that feature on a separate branch.
Only after completing the development of that feature and ensuring all tests pass do you “merge” the feature back into the master branch.
This way, all others who are working with the same Git repository are not impacted by your code changes until you have finished developing and validating it. And after that, you can send the Git maintainer a “Pull Request” to pull your changes to the “master” branch.
Let us demonstrate each of these steps now. First, check and ensure you are currently in the master branch. You can do so by issuing the following command:
You should see Git replying with the name of the branch, “* master” in this case, as shown below:
The * before “master” indicates the branch you are currently on. As there is only one branch, you will see only * master. But if there is more than one branch, as will be demonstrated further on, Git lists all available branches and puts the * in front of the branch you are currently on.
How to create a new Git branch?
In order to create a new git branch, all you have to do is to issue the following command:
git branch MyCoolBranch
This should now have created a new branch called “MyCoolBranch”. But how do you verify it? You once again issue the “git branch” command:
git branch
This should now list all available branches, which in this case are two: the master branch and MyCoolBranch.
As you can see from the above GIF, we have successfully created a new Git branch “MyCoolBranch”. However, did you notice that we are still in master branch (asterisk * is still pointing to master)? That is because we have just created a new branch but not “branched” or “checked out” to that branch.
How to switch to another Git branch?
In order to switch to a new Git branch, we need to issue another command called “git checkout <branchName>”. So in our case, in order to check out “MyCoolBranch”, we need to issue the following Git command:
git checkout MyCoolBranch
This should now switch us over to the newly created MyCoolBranch. We can verify it, again, using the command:
git branch
This time, we can see that the asterisk * has moved to our newly created branch MyCoolBranch, confirming that we are now in our new branch.
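The asterisk convention is simple enough to read from a script too. The sketch below takes the text printed by “git branch” and returns the name of the branch marked with *. The function name current_branch and the sample output are assumptions made for this example.

```python
def current_branch(branch_output):
    """Return the branch marked with '*' in the output of `git branch`."""
    for line in branch_output.splitlines():
        if line.startswith("*"):
            # Drop the asterisk and the space that follows it.
            return line[1:].strip()
    return None


# Sample output as it might look after checking out MyCoolBranch.
sample_output = "* MyCoolBranch\n  master\n"
print(current_branch(sample_output))
```

To use it on a real repository, you could feed it the output of subprocess.run(["git", "branch"], capture_output=True, text=True).stdout.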
Now that we are in our newly created branch, we can do whatever changes we want to do without affecting anyone else’s code.
Let us now edit the file present in our git repo – the README file. I will just add my name to this file using my favorite text editor and save it.
Once we have edited the file, our Git repo is no longer clean, as it has changes that have not yet been recorded by Git. So we need to add and commit the changes made to this file for the Git repo to go back to a clean state. In other words, we need to update our Git repo to include the changes we have made.
What is Git Staging and Git Commit?
Adding changes to a Git repo happens in two stages, which Git calls staging and committing.
In the staging step, we add the file to the Git repo’s staging area. We then commit all the staged files together with an appropriate commit message explaining what these changes do to the code base.
Git staging is achieved using the command:
git add <Filename>
Note that a file name (or path) is required: running “git add” with nothing after it stages nothing. To stage every changed file in one go, pass the current directory instead, as in “git add .”
In the below GIF, we can see how we did the first step of the two steps process – Git staging:
Now that the file has been Git staged, it is time to commit the file to our Git repository. This is achieved using the command:
git commit -m "<Message explaining the committed changes>"
The following GIF shows the Git commit process. It also introduces a new command:
git log
Git log is a command that Git users run to read the commit messages carried by each commit. This way, we know who made which changes to the Git repository!
What is Git Push command?
So now we have our changes committed to the Git repository. However, we still have one more thing to do. All our changes and commits were made locally in the Git repo on our laptop, but we want these changes to be available to everyone. To do that, we need to push our changes back to Github’s repository. We can do so using another git command:
git push
However, since we have now created a new branch “MyCoolBranch” where we had created all our changes, we cannot simply use the “git push” command. That is because we do not have this branch in the Original Github repository. So, we need to add some additional parameters to the above git push command:
git push origin MyCoolBranch
Here we notice two new parameters: origin and MyCoolBranch. While we understand that MyCoolBranch is the name of the branch from which we are pushing our changes, what does origin stand for?
Well, origin is the default name that Git gives to the remote repository we cloned from, which in our case is the repository on Github! So with this, we can now push the changes to our Github remote repository as shown in the GIF below:
Now that we have successfully pushed our local changes to Github remote repository, we should see the push reflected on our Github repository page as well:
What is a Git Pull Request?
Now that we have pushed our changes to Github remote repository, it appears in Github page as shown in the GIF above.
We can now create a Git Pull Request to the maintainer of the Git repository.
A Git Pull Request is a request made to the maintainer of a Git repository to merge our changes into the original repository’s master branch.
After such a request is raised, a notification is automatically sent to the maintainer of the Git repo asking them to “merge” the Pull Request.
Now, the maintainer of the Git repository can merge the Pull Request if they are satisfied with its changes!
That is it! You have successfully performed an entire flow of Git operations: Git clone, Git branch, editing a file, Git add (to stage the changes), Git commit, Git push and, finally, raising a Pull Request (PR) for the Github maintainer to merge.
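As a recap, the command-line part of this flow can be written down as a short Python sketch. The function workflow_commands only builds the list of Git commands used in this article; run_in_repo is an optional runner that assumes Git is installed and that the repository has already been cloned. Both function names, the placeholder repository URL and the commit message are assumptions made for this example, and raising the Pull Request itself still happens on the Github website.

```python
import subprocess


def workflow_commands(repo_url, branch, filename, message):
    """Return the Git commands from this walkthrough as argument lists."""
    return [
        ["git", "clone", repo_url],         # copy the remote repo locally
        ["git", "branch", branch],          # create the new branch
        ["git", "checkout", branch],        # switch to the new branch
        # ... edit the file with your text editor at this point ...
        ["git", "add", filename],           # stage the edited file
        ["git", "commit", "-m", message],   # commit the staged change
        ["git", "push", "origin", branch],  # publish the branch to the remote
    ]


def run_in_repo(commands, repo_dir):
    """Run the commands inside an already cloned repo (skips the clone step)."""
    for argv in commands:
        if argv[1] == "clone":
            continue
        subprocess.run(argv, cwd=repo_dir, check=True)


commands = workflow_commands(
    "https://example.com/placeholder/Hands-On-Git.git",  # placeholder URL
    "MyCoolBranch",
    "README",
    "Add my name to the README",
)
for argv in commands:
    print(" ".join(argv))
```

Keeping the commands as data makes the sequence easy to read and to print before anything is actually run.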
Hope this article was useful to you in understanding how Git works.
You can replicate these steps and submit your PRs to the same example Repo (Hands-On-Git) above and I will be happy to merge your requests. 🙂
If you have any queries regarding this article or get stuck anywhere, do not hesitate to ask me about it in the comment section below.
When we sit down to write an HTML document, one of the first lines of code we write is <!DOCTYPE html>. But what does this line do? What is the significance of this line to a web browser? What happens if we forget to include the DOCTYPE tag? Is it even a tag in the first place? We will answer these questions in this article.
What is DOCTYPE in HTML?
DOCTYPE is a directive that tells the web browser what type of document it is dealing with. A web browser has to deal with HTML documents written against different versions of the HTML standard, as well as non-HTML documents such as XML files, so stating the “type” of a particular document helps the browser decide how to render it appropriately.
What happens if the DOCTYPE is not mentioned in an HTML document?
Specifying the DOCTYPE of a document helps a web browser make appropriate decisions when rendering that file for its user. If a web document does not have an appropriate DOCTYPE, the browser makes a best guess and typically falls back to a backwards-compatible “quirks mode”. As a result, the page might not be rendered in the most optimal way, and some documents might not be rendered properly.
Is DOCTYPE even an HTML tag?
Actually, No! 😮
According to the HTML specification, DOCTYPE is not an HTML tag but a declaration for the browser to make use of.
Does DOCTYPE have an end tag?
The answer is NO. DOCTYPE is not an HTML tag and does not have an end tag.
In the earlier days, web documents used DOCTYPE declaration effectively to let the browser know what type of HTML standard specification the document was following. As a result, the first line of a web page that declared the DOCTYPE had a very lengthy string to it, something like this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
But these days, with most websites following the HTML 5 specification, this long form is no longer needed, and we can get away with just declaring the DOCTYPE as:
<!DOCTYPE html>
and the browser will still be able to handle it perfectly.
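If you want to check which DOCTYPE, if any, a page declares, Python’s standard html.parser module reports it through its handle_decl hook. The sketch below is a minimal example; DoctypeFinder and find_doctype are hypothetical names chosen for it.

```python
from html.parser import HTMLParser


class DoctypeFinder(HTMLParser):
    """Remember the first DOCTYPE declaration seen in a document."""

    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # decl is the text inside <!...>, for example "DOCTYPE html".
        if self.doctype is None and decl.lower().startswith("doctype"):
            self.doctype = decl


def find_doctype(html):
    """Return the DOCTYPE declaration of html, or None if there is none."""
    finder = DoctypeFinder()
    finder.feed(html)
    return finder.doctype


print(find_doctype("<!DOCTYPE html><html><body>Hello</body></html>"))
print(find_doctype("<html><body>No declaration here</body></html>"))
```

A result of None means the page has no DOCTYPE declaration at all.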
I hope this cleared up some of the doubts you had around the DOCTYPE declaration in HTML. If you have any inputs or queries regarding it, don’t hesitate to ask in the comment section below!
Looking for best Visual Studio Code extensions for my web development activities, I came across a plethora of VS Code extensions made available by various developers not just for web development, but for various other types of programming activities as well.
In this article, I will list a few of these Visual Studio Code extensions that are suitable for HTML coding activities.
List of Visual Studio Code extensions for HTML
Intellisense (Built-in, no extension required)
I agree that this blog post set out to showcase a list of HTML extensions for Visual Studio Code, but I would be doing a disservice to the developers of VS Code if I did not mention the excellent support that is built into Visual Studio Code itself via IntelliSense. VS Code IntelliSense provides suggestions and auto-completion for basic HTML tags.
Emmet Feature In VS Code (Again, built in, no extension required)
Emmet is my next go-to feature. It is now built into VS Code, and I highly recommend it to everyone who works on HTML coding or development using Visual Studio Code.
One of the main functions of Emmet in VS Code is to provide short abbreviations for most HTML code.
Say, for example, you are about to create a new HTML page that should be mobile friendly and should describe the basic structure of an HTML page, such as the UTF charset meta data, viewport and language. You can do so by simply typing “html:5” at the beginning of the document and pressing the TAB key. This triggers Emmet’s abbreviation feature, which autocompletes the basic structure of a web page as shown in the GIF below:
HTML5-Boilerplate VS Code HTML Extension
The next Visual Studio Code extension for HTML deals specifically with HTML 5 and is called HTML5-Boilerplate VS Code extension.
The HTML5-Boilerplate VS Code extension is very similar to that of Emmet we had discussed earlier, but differs in the fact that it specifically deals with generating boilerplate code for HTML 5. Below is a GIF showing HTML5-Boilerplate Visual Studio Code (VS Code) extension in action:
HTML Live Preview is another Visual Studio Code extension that, as the name suggests, helps its users do a live preview of their HTML web page during development. What is interesting is that the HTML Live Preview VS Code extension does this in real time, as shown in the GIF below:
These are some of the HTML extensions for Visual Studio Code editor that I have come across up until now. I am pretty sure I might have missed out a lot more useful HTML extensions for VS Code, which I would continue to add to my toolkit upon discovery and update this article accordingly. For now, these extensions are bound to make my life easy while developing HTML code for my web development activities.
If you are aware of any more Visual Studio Code extensions for HTML that you found useful and you think I should try and recommend it to others, do let me know in the comments below and I will definitely look into it.
WordPress is a software tool that one can use to create a website with ease. It is often called a Content Management System, or CMS for short, because it provides a set of tools and a user-friendly interface for managing a website and its content: creating and deleting posts, adding and removing users, changing the style of the website and so on.
The ease with which one can create and manage a website, without needing any technical knowledge or programming skills, makes WordPress one of the most popular choices for website development and content management.
Just how easy is it to use WordPress for website development?
In order to create a new website using WordPress, the creator is not required to have any programming background. There are several web hosting providers such as Namecheap, GoDaddy etc who provide web hosting services with One-Click WordPress installers built into their user’s Control Panel.
So, by just clicking a button, a user can create a simple WordPress website. Of course, a little configuration is needed to connect the domain name that the user wants to use with the website. However, this is pretty straightforward, and web hosting providers supply sufficient tutorials and documentation on how to do it, so it should not be a bottleneck for a non-technical, first-time creator of a WordPress website.
How much does the WordPress software tool cost?
WordPress is an open source website development and CMS tool that is made available for free to its users. So, there is no cost one has to pay for using WordPress to create a website.
Having said that, a website needs to be stored on a web server and made available to its users across the world. This web server then serves your WordPress web pages on the internet to users throughout the day. This process is called “Web Hosting”.
Theoretically, you can use your own computer to run (aka host) your website, but then you need to ensure that your computer is always switched on, connected to the internet and never slowing down. This means you will not be able to use your computer for any other task that could slow it down.
To avoid that, you can rent a web server from one of the various WordPress web hosting service providers. The responsibility for your website’s uptime, i.e. keeping your WordPress website’s web server up and running 24/7 throughout the year, is then taken care of by the web hosting service provider.
Even though it may cost you a little to host your WordPress site with a third-party web hosting service provider, it is still considered a very good business decision to outsource such tasks, as you can then focus on growing your business while the provider looks after your website’s uptime.
Now that we have had a brief introduction to what WordPress is, we will briefly discuss some of its technical details. While a non-technical person does not need to know the programming aspects of WordPress, a little knowledge of what WordPress is made up of can be useful when they need technical help from someone in the future.
What is the Programming language used in WordPress?
WordPress is a Content Management System (CMS) written in the PHP programming language. It is one of the most popular CMSs in the world and powers more than 30% of the websites on the internet. As a result of this popularity, it is also one of the platforms most often targeted by hackers looking for vulnerabilities to exploit. So you have to ensure that your WordPress website is always kept up to date with the security patches released by the WordPress community.
What is the software stack used in WordPress?
While in the previous section we described WordPress as a CMS written in the PHP programming language, it also makes use of other technologies. These technologies, taken together, are often called a software stack.
The WordPress Content Management System (CMS) is primarily built on the LAMP stack. LAMP stands for Linux, Apache, MySQL and PHP, where each component of the stack serves a specific purpose.
While we will discuss each of these components of the WordPress stack in greater detail in future articles, here is a brief description of what each part of the LAMP stack stands for:
Linux – Linux is the Operating System that the webserver runs on.
Apache – Apache is the Web Server on which WordPress typically runs.
MySQL – MySQL is the name of the database which is typically used by WordPress to store any website data as well as its content itself.
PHP – PHP is the programming language used to write WordPress software.
If you have read up to this point, you should now have a decent understanding of what WordPress is, along with an introduction to some new terminology such as CMS, Apache, MySQL and PHP.
In the future articles, we will start taking a deeper look into each of these components that make up a WordPress website, what their primary roles are in the functioning of WordPress, how their performance matters for the performance of your WordPress website as a whole and much more.
If you have any doubts after going through this article, or would like me to cover any specific point in more detail regarding WordPress, do leave a comment on this post below and I will make sure to discuss with you further on those topics.
Python is a versatile programming language that can be used to write programs for a wide variety of applications. The number of libraries available for Python makes it one of the most useful programming languages for performing numerous tasks. Be it a simple Python script to automate basic shell operations in an operating system, or a program to perform data analysis or machine learning, Python excels at them all, thanks to the available Python library packages.
In this article, we will learn how to use the Python programming language to perform one of the most common tasks in the world of the web: HTML scraping, or web scraping, using Python.
HTML – HTML is a simple Markup language used to create various HTML elements that make up a web page. The elements including Headings, Paragraphs, Lists, Images, tables, headers and footers, links etc that we see in a web page are all different HTML elements. So in other words, HTML Markup language is used to create these HTML elements that we see as part of a web page. HTML here stands for Hyper Text Markup Language.
CSS – CSS is a style sheet language that is mainly responsible for the look and feel of the above-mentioned HTML web page elements. You might have seen the same table contents displayed in two different styles on two different websites. This is because, even though both use the same HTML table element to create the content, each website styles the table differently. This is achieved using CSS. CSS here stands for Cascading Style Sheets.
When a web page is rendered in a browser on the user’s computer, the webpage includes all these HTML elements with all the texts and image content of the web page all embedded within themselves. So, we can actually retrieve these text and image contents from a web page using a programming language such as Python. Such a process is actually called “Web Scraping” in the web development world.
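To make the idea concrete before any third-party library is involved, here is a minimal sketch that scrapes the text out of a small HTML table using only Python’s standard html.parser module. The class name TableTextExtractor and the tiny sample table are assumptions made for this example; a real web page would be much larger and messier.

```python
from html.parser import HTMLParser


class TableTextExtractor(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


sample_html = """
<table>
  <tr><th>Mountain</th><th>Metres</th></tr>
  <tr><td>Mount Everest</td><td>8848</td></tr>
  <tr><td>K2</td><td>8611</td></tr>
</table>
"""

extractor = TableTextExtractor()
extractor.feed(sample_html)
print(extractor.rows)
```

Each row comes back as a list of strings, so the header row and the data rows can be handled separately by whatever code uses them.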
Scraping A Web Page Using Python
In order to learn how to scrape a web page using Python, we will try to scrape a table that lists mountains across the world ordered by their elevation, as seen on the official Wikipedia website:
In this Wikipedia web page, we notice the presence of several tables. The first table lists mountains with an elevation of 8,000 metres or above. It is this table that we would like to scrape using Python.
Introduction to BeautifulSoup library in Python
As mentioned at the beginning of this article, Python has a myriad of useful libraries whose APIs let us perform complex tasks with ease. One such library is called “BeautifulSoup”, and it is one of the most interesting libraries one can use in Python to perform web scraping.
BeautifulSoup Python library’s functionalities
One of the most important features of Python’s BeautifulSoup library is its ability to parse and interpret HTML tags. All HTML elements are represented using what are called HTML tags. Some examples of such tags are <h1> for the main heading, <p> for paragraphs and <table> for tables. BeautifulSoup understands these tags and can extract the information that a web page holds within them. The library exposes this functionality through its APIs, which we will use in the Python web scraper program that we are about to write.
The library is published on the Python Package Index as ‘beautifulsoup4’ and is imported in Python under the name ‘bs4’. The ‘bs4’ package name used below is a thin wrapper that pulls in the same library, so it can be installed on your system using the command:
pip install bs4
BeautifulSoup library example
In order to understand how the BeautifulSoup library works, let us download a Wikipedia web page to our local system. For this example, let us download the Wikipedia page on mountains by elevation:
https://en.wikipedia.org/wiki/List_of_mountains_by_elevation
Let us save the web page from above link as mountains.html in our local home directory (~/).
We can then read the content of this web page using Python’s BeautifulSoup library using the following commands:
from bs4 import BeautifulSoup
import os
page = open(os.path.expanduser('~/mountains.html'), 'r', encoding='utf-8')
soup = BeautifulSoup(page.read(), 'html.parser')
tables = soup.find_all('table')
print(tables)
Well, that’s a mouthful of code you just read there. Let us go through it step by step to understand what we are doing here. The first two lines:
from bs4 import BeautifulSoup
import os
import the BeautifulSoup class from the bs4 package we just installed, along with Python’s standard os module. The next line:
page = open(os.path.expanduser('~/mountains.html'), 'r', encoding='utf-8')
uses Python’s built-in open( ) function to open the previously downloaded mountains.html web page. Python’s open( ) does not expand the ~ shortcut on its own, so we pass the path through os.path.expanduser( ) first. In the next line:
soup = BeautifulSoup(page.read(), 'html.parser')
we call BeautifulSoup and pass it, as the first argument, the content of our mountains.html web page, which we obtain using the file object’s read( ) method. The second argument is ‘html.parser’. This tells BeautifulSoup to treat the input as HTML data and to use Python’s built-in HTML parser to parse it. The resulting parsed HTML data is assigned to the variable ‘soup’ for later use. In the next line we do this:
tables = soup.find_all('table')
Here we search the ‘soup’ variable for all the available HTML tables and assign the result to a new variable, tables. So by now all the HTML tables present in the mountains.html file are held in the list-like Python variable ‘tables’. In the last line:
print(tables)
we print the content of this tables variable, which prints all the tables found in our mountains.html web page!
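The same calls can be tried without downloading anything, by parsing a small HTML string instead of a file. The following is only a minimal sketch: the HTML snippet and the variable names are made up for illustration and are not taken from the Wikipedia page.

```python
from bs4 import BeautifulSoup

# A tiny, hand-written HTML document (hypothetical, for illustration only).
html_doc = """
<html>
  <body>
    <h1>Highest mountains</h1>
    <p>A short list of peaks.</p>
    <table>
      <tr><th>Mountain</th><th>Metres</th></tr>
      <tr><td>Mount Everest</td><td>8848</td></tr>
    </table>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

heading = soup.find("h1").get_text()       # text inside the first <h1>
paragraph = soup.find("p").get_text()      # text inside the first <p>
table_count = len(soup.find_all("table"))  # number of <table> elements found

print(heading)
print(paragraph)
print(table_count)
```

Here find( ) returns the first matching element, while find_all( ) returns every match, which is why we take its length to count the tables.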
While this is good and all, we had to download the Wikipedia web page manually, save it as mountains.html, and only then use Python’s BeautifulSoup library to process it. Wouldn’t it be great if we could eliminate this manual step and do it programmatically too? As a next step, we will do exactly this using another Python library, urllib, introduced next.
Introduction to Python Urllib library
Another important Python library that we are going to use to create our web scraper program is called the urllib library. Let us see what functionalities Python’s urllib library brings to us.
Python’s Urllib library is used to fetch the contents of a web page from its URL. It provides us with APIs such as urlopen() and read() to open a web page and read its contents back. URL here stands for Uniform Resource Locator. A URL is the web address that one can use to locate a web page and fetch its contents.
How to install Python Urllib library?
In Python 3, urllib is part of the standard library. It is already available once Python itself is installed, so there is nothing to install with pip.
Python Urllib Example
Here is a simple example of the urllib library being used to fetch the content of a Wikipedia web page.
First we import the request module of the urllib library into our Python program using Python’s import statement:
import urllib.request
The Urllib library exposes several useful APIs for other programs to make use of. One of them is the request module, which one can use to open a web page and read its content. Its urlopen( ) function opens a URL and returns a response object, and calling read( ) on that object returns the content of the page as bytes. The complete scraper in the next section uses exactly these two calls to read the contents of a Wikipedia web page.
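A minimal sketch of these two calls is shown here. The helper name fetch_page is hypothetical and only serves to wrap the calls; the example call is left as a comment because it needs network access.

```python
import urllib.request


def fetch_page(url):
    """Return the raw bytes served at the given URL."""
    # urlopen() returns a response object; read() returns its body as bytes.
    with urllib.request.urlopen(url) as response:
        return response.read()


# Example call (needs network access):
# source = fetch_page('https://en.wikipedia.org/wiki/List_of_mountains_by_elevation')
# print(source[:200])
```

Using the response in a with block closes the connection as soon as the content has been read.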
Python Web Scraper using Urllib and BeautifulSoup libraries
Finally, combining the APIs provided by both the BeautifulSoup and Urllib libraries, we can write our web scraper program that reads a Wikipedia page’s contents, extracts its tables, and prints the content of a particular table as shown below:
from bs4 import BeautifulSoup
import urllib.request

source = urllib.request.urlopen('https://en.wikipedia.org/wiki/List_of_mountains_by_elevation').read()
soup = BeautifulSoup(source, 'html.parser')
tables = soup.find_all('table')
# find_all() returns a list of tables, so pick the first one before searching for its rows
table_rows = tables[0].find_all('tr')
for tr in table_rows:
    print(tr)
The above program is our intended Python web scraper. It fetches a Wikipedia page using the urllib library, parses it with BeautifulSoup, and gives us a way to access each of the HTML elements of the page.
Here we are simply printing the rows of the first “table” element of the Wikipedia page; however, BeautifulSoup can be used to perform many more complex scraping operations than what has been shown here.
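As one small step beyond printing raw rows, the sketch below turns a parsed table into plain Python lists of cell text. The helper name table_to_rows and the sample HTML are assumptions made for this illustration; the real Wikipedia table has more columns and extra markup.

```python
from bs4 import BeautifulSoup


def table_to_rows(table):
    """Convert a parsed <table> element into a list of rows of cell text."""
    rows = []
    for tr in table.find_all("tr"):
        # Header cells use <th> and data cells use <td>; collect both.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


# Hand-written sample table (hypothetical stand-in for the Wikipedia table).
sample_html = """
<table>
  <tr><th>Mountain</th><th>Metres</th></tr>
  <tr><td>Mount Everest</td><td>8848</td></tr>
  <tr><td>K2</td><td>8611</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
first_table = soup.find_all("table")[0]
for row in table_to_rows(first_table):
    print(row)
```

Each row comes back as a list of strings, which is easier to sort, filter or write to a CSV file than the raw <tr> tags.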
I will explain more such operations one can perform using BeautifulSoup Python library in future articles, but this should serve as an entry point for someone who is just getting started with Python programming language for web scraping.
Pelican is a static website generator written in Python. Using Pelican, one can create static websites that can later be deployed to a simple file web server in the cloud. Some of the cloud service providers include Amazon Web Services, Digital Ocean, Vultr etc. One can also host these static websites on a static hosting provider such as Netlify. But first, let’s understand more about static websites and how to use Python’s Pelican to create your static website.
What is a static website and why should you use one?
The internet today is made up of both dynamic and static websites. A dynamic website is one that usually has a database, and whose server creates HTML web pages on the fly, usually specific to the user who requested them. In contrast, a static website is made up of content that is just that, static, and is served unchanged to every user who requests it.
So with the above introduction, it’s time to move on to the technical aspects of Pelican. First let us discuss how to install Pelican.
How To Install Pelican
Jump to the end of this article if you just want a one click installer to install and try Pelican
In order to install Pelican, you need to have both pip and Python installed on your system. If you don’t have them installed, you can do so using the following commands:
For Debian or Ubuntu Linux
sudo apt-get install python3 python3-pip
For Fedora Linux
sudo yum install python3 python3-pip
In this installation process, we are using Python 3. Older Pelican releases also ran on Python 2.7, but recent releases require Python 3, so Python 3 is the safer choice.
Once Python and pip are installed, we can proceed with installing Pelican on our computer. To do so, we issue the following command:
pip3 install pelican markdown
Note that we are installing two Python packages with pip: one is the Pelican static site generator and the other is a Markdown package. If you are unfamiliar with Markdown, it is a lightweight markup language in which content is written as plain text with simple formatting marks that can later be converted to formatted HTML. You can read more about Markdown on Wikipedia.
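To get a feel for what the Markdown package does, here is a minimal sketch that converts a short piece of Markdown text to HTML. The sample text is made up for illustration, and the snippet assumes the markdown package installed above.

```python
import markdown

# A short piece of Markdown: a heading followed by a paragraph with emphasis.
text = "# My first post\n\nThis is *static* content."

# markdown.markdown() converts Markdown text into an HTML string.
rendered = markdown.markdown(text)
print(rendered)
```

The heading line becomes an <h1> element and the text between asterisks is wrapped in <em> tags.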
Once both packages are installed, we can create a new directory from the command line to store our project files. In this case, we create a directory called Pelican_Demo and then move into it:
mkdir Pelican_Demo
cd Pelican_Demo
Once inside the newly created directory, we start creating our Pelican website. To do so, we call a Python executable script called pelican-quickstart, which pip typically places in a directory on your PATH such as /usr/local/bin. So we can run this script simply by calling it as follows:
pelican-quickstart
This kick-starts the Pelican static website generator, which then asks a series of questions that you need to answer to finally create your static website.
What this set of questions actually does to your Pelican static website will be a topic for another post. But for now, you should be good to go with your website.
If you just want to get your hands dirty and get Pelican up and running without digging deeper into how it works, then you can use the following script to get going.
This script will install all the required packages and answer all the questions of pelican-quickstart automatically for you, so that you can simply run it and jump straight to viewing the newly created Pelican static website. Follow the instructions given in that Python3-Pelican-Installer GitHub project to get it up and running in no time and get a taste of what a Pelican static website looks and feels like.
What is a static website?
Most of the websites we use these days are dynamic websites. These dynamic websites have databases from which the content of a web page is generated dynamically on the server and then sent to the user’s browser. The advantage of this is that each user gets customized content that is different from what is delivered to other users. An example of this is a user’s Facebook homepage, which shows posts from that user’s friends and network. A Google search result page is another example of a dynamic page, as it can vary from person to person for the same query based on their browsing history.
Contrary to this, a static website is usually made up of static content (mostly using only HTML & CSS) that is already stored as complete files on the server. Thus, every user who requests a particular web page from this server will always receive the same content. Usually, these web pages are pre-built and stored on a file server, and the file as a whole is simply sent back to the user’s browser when requested.
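To make the idea of "just sending back a stored file" concrete, here is a minimal sketch of a static file server built from Python's standard library. The function name make_static_server and the directory name "output" in the usage comment are assumptions for illustration; this kind of server is suitable for local preview only, not for production hosting.

```python
import functools
import http.server


def make_static_server(directory, port=8000):
    """Create an HTTP server that returns the files in `directory` unchanged."""
    # SimpleHTTPRequestHandler maps each request path to a file on disk.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)


# Example use (blocks until interrupted with Ctrl+C):
# make_static_server("output").serve_forever()
```

Notice that there is no database and no per-user logic anywhere: every visitor who asks for the same path receives the same bytes.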
Advantages Of A Static Website
Fast: As these websites serve prebuilt HTML webpages, they are extremely fast.
Secure: As these websites do not have a database and are just a set of files served from a simple web based file server, they avoid the security threats that come with using a database.
Cheap: The cost of hosting a static website is in pennies compared to a dynamic website as it just needs a simple web enabled file server.
Disadvantages Of A Static Website
As the contents are static and created in advance, no dynamic contents can be added to the web page.
User interactivity is limited due to the static nature of the website.
Usually static websites lack components such as comments, user login, recommendation engine, real time notifications etc. However, these can still be added through some 3rd party external services.
Programming knowledge is required to work with static websites. As we need to use static website generator tools that are quite technical in nature, users who wish to use static websites should be technically capable.
Content Management Systems (CMS) are usually missing in static websites. However, there are some 3rd party CMS services such as Contentful that can overcome this issue.
Each time a new article is added, the static website generator rebuilds the entire website, which then has to be redeployed to the web server. This can be time consuming and can also be prone to unforeseen technical errors.
Not suitable for a large website with thousands of articles as updating such static website can be extremely slow.
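The rebuild cost mentioned in the list above can be seen in a toy generator. This is not how Pelican or any real generator is implemented; it is a simplified sketch with hypothetical names (PAGE, build_site) that renders every article on every build and does no HTML escaping.

```python
import tempfile
from pathlib import Path
from string import Template

# One shared page layout; $title and $body are filled in per article.
PAGE = Template(
    "<html><head><title>$title</title></head>"
    "<body><h1>$title</h1><p>$body</p></body></html>"
)


def build_site(articles, output_dir):
    """Render every article to its own HTML file and return the paths written."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for slug, (title, body) in articles.items():
        path = out / f"{slug}.html"
        path.write_text(PAGE.substitute(title=title, body=body), encoding="utf-8")
        written.append(path)
    return written


# Demo: build a two-article site into a temporary directory.
articles = {
    "hello": ("Hello", "My first post."),
    "pelican": ("Pelican", "Notes on static sites."),
}
with tempfile.TemporaryDirectory() as tmp:
    built = [path.name for path in build_site(articles, tmp)]
    print(built)
```

Adding a third article and calling build_site( ) again rewrites all three files, which is why build time grows with the number of articles.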
Static and dynamic websites each bring their own set of advantages and challenges. So the decision as to which one is better for you boils down to how comfortable you are with the programming needed to work with static website generators, the type of content on your website and its requirements.
If you are just looking for a simple blog type of website that is cheap to operate, you can definitely opt for a static website. On the other hand, if you are looking to create a website with thousands of web pages, or with content that is to be customized for each user, then a dynamic website is the way to go!