Categories
DATA MINING DATA SCIENCE DATA VISUALIZATION PROGRAMMING PYTHON TUTORIALS

Using Matplotlib To Draw Line Between Points

In this tutorial, let us learn how to use Matplotlib to draw a line between two points. Matplotlib is a Python library that we can use to draw lines, charts and other plots. It takes datasets as its input and converts them into plots and graphs. Therefore, it can help us visualize and interpret our datasets in a much better way.

To use Matplotlib to draw a line between two points, first ensure that Matplotlib is installed on your computer. Once that is confirmed, let us create a set of data points that we want to plot.

Creating dataset for Matplotlib to draw line between points

One of the simplest ways for us to create our dataset is by calling Python’s built-in range function.

Check this tutorial to learn more about Python’s built-in range function.

So, let us now start writing our Python plotting program:

import matplotlib.pyplot as plt

In this line of code, we are simply importing the pyplot submodule of the Matplotlib library under the name plt. From now onwards, we can access it through the plt name.

Next, let us generate our desired dataset using Python’s range function.

x = range(5)

As can be seen here, we are asking the range function to provide us with a sequence of integers from 0 to 4. That is because the stop value of 5 is exclusive: the sequence starts at the default value of 0 and ends just before 5. The step size also defaults to 1. As a result, our dataset will now look like this:

x = [0, 1, 2, 3, 4]
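Strictly speaking, in Python 3 range(5) is a lazy range object and not a list. The short sketch below, which assumes Python 3, shows how the values listed above can be materialised to check them:

```python
# range(5) is a lazy sequence; list() materialises its values.
x = range(5)
values = list(x)

print(values)          # [0, 1, 2, 3, 4]
print(len(x), x[-1])   # 5 4
```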

So now that we have our dataset, it’s time to plot these values using Matplotlib.

Using Matplotlib to draw line between points

Since we have already imported Matplotlib’s Pyplot submodule, we can right away start using it to plot our line. Pyplot provides us with a very handy helper function called plot to plot our line.

The general syntax of our plot function looks like this:

plot([x], y, [fmt], *, data=None, **kwargs)

As can be seen above, plot takes optional x-axis values, while the y-axis values are required for the plot function to work. The plot function also takes additional parameters such as an optional format string [fmt], data and so on. You can refer to the official documentation for this function to learn more about how to use it.

However, for our case, we will simply use our dataset as the y-axis values. Since the x-axis values are optional, we can leave them out. Matplotlib will then fill them in automatically, starting at 0 and incrementing by 1 for each data point. Hence, our code for plotting will simply look like this:

plt.plot([xi for xi in x])

What we are doing here is simply passing each of the values of our dataset x as the plot function’s y-axis values.
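To make the optional x values and the [fmt] argument more concrete, here is a minimal sketch. The x values 10 to 50 and the format string "ro--" (red circles joined by a dashed line) are chosen purely for illustration, and the Agg backend is selected only so that the sketch runs without opening a window:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

y = list(range(5))

# x omitted: Matplotlib numbers the points 0, 1, 2, ...
(line_default,) = plt.plot(y)

# Explicit x values plus a format string (illustrative values only).
(line_explicit,) = plt.plot([10, 20, 30, 40, 50], y, "ro--")

default_x = list(line_default.get_xdata())
explicit_x = list(line_explicit.get_xdata())
print(default_x)
print(explicit_x)
```

plot returns a list of line objects, which is why each result is unpacked with a one-element tuple.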

However, we are still not done here. The code written up until now draws our line connecting the data points. However, in order to display it to the end user, we need to call another function called “show”. So, we still need to add this final line of code to our program:

plt.show()

With this, we should be able to see a plot drawn by Matplotlib that draws a line between our data points. It looks something like this:

Final result image of using Matplotlib to draw line between points
Line between points drawn using Matplotlib

Conclusion

Combining all the above pieces of code in a single place gives us our final code, which looks like this:

import matplotlib.pyplot as plt
x = range(5)
plt.plot([xi for xi in x])
plt.show()

So this is it! With just these four lines of code, we are able to make use of Matplotlib to draw a line between points. I hope this tutorial was pretty straightforward. If you have any more queries or simply want to say hi to me, please leave a comment below! Until next time, ciao!


Difference between range vs arange in Python

In this article, we will take a look at range vs arange in Python. Learning the difference between these functions will help us understand when to use either of them in our programs. Both the range and arange functions have their own set of advantages and disadvantages. This article will help us learn about them in detail.

To better understand the difference between the range and arange functions in Python, let us first understand what each of these functions does.

range vs arange in Python

range vs arange in Python: Understanding range function

The range function in Python lets us generate a sequence of integer values lying within a certain range. The function also lets us generate these values with a specific step value. Its signature is:

range([start], stop[, step])

So, in the above representation of the range function, we get a sequence of numbers running from the optional start value up to, but not including, the stop value. Each value is larger than the previous one by the optional step value.

range function example 1

So that was the theory behind the range function. Now, let us understand it better by practicing with it. Fire up your Python IDLE interpreter and enter this code:

l = range(1, 10, 2)

When you hit enter on your keyboard, you don’t see anything, right? That is because the resulting sequence of values is stored in the variable “l”.

To be able to see the values stored in it, let us print the individual values. So, by indexing each of the items, we get the following values printed out.

>>> l[0]
1
>>> l[1]
3
>>> l[2]
5
>>> l[3]
7
>>> l[4]
9

So we see a list of numbers, cool! But when you observe closely, you will realize that this function has generated the numbers in a particular pattern. You can see that the first number it has generated takes our optional start parameter into consideration. We had set its value to 1. Next, it is also honoring the stop value by ending at 9, which is below our defined stop value of 10. If you try to index any further beyond this point, you will only get an error:

>>> l[5]
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
  l[5]
IndexError: range object index out of range

So, this confirms that the last value we get will always be less than the stop value.

But the most important observation we need to make here is the step size between each of these values. We can see that each value is exactly 2 more than the previous one. This is the same step size we had defined in our call to the range function (the last parameter)!
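To see how start, stop and step interact, a few more calls can be tried. This is a small sketch assuming Python 3; the second and third calls use values that do not appear above and are there only for illustration:

```python
# start=1, stop=10, step=2: same call as in the example above.
odds = list(range(1, 10, 2))

# A negative step counts downwards; stop is still exclusive.
countdown = list(range(10, 0, -3))

# If start is already past stop, the sequence is simply empty.
empty = list(range(5, 1))

print(odds)       # [1, 3, 5, 7, 9]
print(countdown)  # [10, 7, 4, 1]
print(empty)      # []
```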

range function example 2

Does this mean that every time we want to call the range function, we need to define these 3 parameters? Not really. If we take a look at the signature of the range function again:

range([start], stop[, step])

The parameters start and step (mentioned within the square brackets) are optional. This means that we can call the range function without these values as well like this:

>>> nl = range(4)
>>> nl
range(0, 4)
>>> nl[0]
0
>>> nl[1]
1
>>> nl[2]
2
>>> nl[3]
3
>>> nl[4]

In this case, we have only passed the stop value of 4. As a result, we get our sequence of integers starting with the default value of 0. The step value also defaults to 1. So we get the integers from 0 to 3, with a step value of 1. Indexing past the last element, as in nl[4], raises the same IndexError we saw earlier.
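The defaults can also be checked directly on the range object. The sketch below assumes Python 3 and uses only the built-in attributes and operations of range objects:

```python
nl = range(4)

# The defaults that were filled in for us.
print(nl.start, nl.stop, nl.step)  # 0 4 1

# range objects support len() and membership tests.
print(len(nl))           # 4
print(3 in nl, 4 in nl)  # True False

# Indexing past the end raises IndexError, as seen earlier.
try:
    nl[4]
except IndexError as exc:
    error_message = str(exc)
    print("IndexError:", error_message)
```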

Alright then, hope everything is clear to you up to this point. If this is the case with Python’s range function, what does the arange function in Python do?

range vs arange in Python: Understanding arange function

Unlike the range function, the arange function is not a built-in function. Instead, it is a function we can find in the Numpy module. So, in order to use the arange function, you will need to install the Numpy package first!

The signature of the Python Numpy’s arange function is as shown below:

numpy.arange([start, ]stop, [step, ]dtype=None)

Wait a second! Doesn’t this signature look exactly like our range function? Yes, you are right! Numpy’s arange function works very much like the range function. It also has optional start and step parameters along with the stop parameter.

But then what is the difference between the two then?

range vs arange in Python – What is the difference?

Where the arange function differs from Python’s range function is in the type of values it can generate.

The built-in range function can generate only integer values, which can be accessed by indexing just like list elements. The arange function, on the other hand, generates values that are stored in a Numpy array, and it also accepts non-integer values such as a fractional step. We can observe the array output in the following code:

>>> import numpy as np
>>> a = np.arange(4)
>>> a
array([0, 1, 2, 3])
>>> a[0]
0

We are clearly seeing here that the resulting values are stored in a Numpy array. Each of the individual values can hence also be accessed by simply indexing the array!
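Two practical consequences of getting a Numpy array back are sketched below, assuming Numpy is installed. The fractional step of 0.5 is an illustrative value that does not appear above:

```python
import numpy as np

a = np.arange(4)

# Arithmetic applies to every element at once, with no explicit loop.
doubled = (a * 2).tolist()

# arange accepts a fractional step ...
halves = np.arange(0, 2, 0.5).tolist()

# ... whereas the built-in range rejects non-integer arguments.
try:
    range(0, 2, 0.5)
    range_accepts_float = True
except TypeError:
    range_accepts_float = False

print(doubled)              # [0, 2, 4, 6]
print(halves)               # [0.0, 0.5, 1.0, 1.5]
print(range_accepts_float)  # False
```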

range vs arange in Python – Advantages & Disadvantages

This brings us to the next question. When should we use Python’s built-in range function vs Numpy’s arange function? To understand this, let’s list out the advantages and disadvantages of each of these functions:

Advantages of range function in Python

  • range function returns a list of integers in Python 2. In Python 3, it returns a special “range object” that produces its values on demand, much like a generator, while still supporting indexing and len().

Disadvantages of range function in Python

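  • range function is considerably slower for numerical work, because every value is a separate Python integer and any arithmetic on the values needs a Python-level loop.
  • Once its values are stored in a list (as Python 2 does, or as list(range(n)) does in Python 3), it also occupies more memory than a Numpy array when handling large sized data objects.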

Advantages of arange function in Python

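  • Numpy’s arange function returns a Numpy array
  • Its performance is way better than the built-in range function when doing arithmetic on all the values at once
  • When dealing with large datasets, an array created by arange needs much less memory than a Python list holding the same integers. (A Python 3 range object by itself stays small, since it does not store its values.)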
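The memory points above can be checked with a rough measurement. This is only a sketch, assuming Python 3 and Numpy: sys.getsizeof reports the size of the container alone (for a list, this excludes the integer objects it points to), and the exact numbers vary by platform, so no figures are quoted here:

```python
import sys
import numpy as np

n = 10_000
lazy = range(n)           # Python 3 range object: values computed on demand
as_list = list(lazy)      # every value stored as a Python int
as_array = np.arange(n)   # values stored in one contiguous buffer

range_bytes = sys.getsizeof(lazy)
list_bytes = sys.getsizeof(as_list)  # container only, not the int objects
array_bytes = as_array.nbytes        # size of the array's data buffer

print(range_bytes, list_bytes, array_bytes)

# A vectorised sum over the array versus a Python-level sum over the range.
array_total = int(as_array.sum())
range_total = sum(lazy)
print(array_total == range_total)
```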

So this is the fundamental difference between range vs arange in Python. We can understand them even better by using them more in our everyday programming.

I hope this gave you some amount of clarity on the subject. If you still have any more doubts, do let me know about it in the comments below. I will be more than happy to help you out.

Having said that, take a look at this article. It gives you a simple explanation of the “Difference between expressions and statements in Python”. I have spent a considerable amount of time trying to understand these topics. Since there are not many articles available that explain them clearly, I started this blog to capture these topics. Hope you found them useful! If yes, do share them with your friends so that it can help them as well.

With this, I will conclude this article right here. See you again in my next article, until then, ciao!


Introduction To Matplotlib – Data Visualization In Python

This article will give you an introduction to Matplotlib data visualization in Python. Matplotlib is a data visualization library written specifically for use with Python. It is usually the preferred Python package for visualizing data while working on Machine Learning & Data Science. It helps us visualize data by representing it with the help of plots and charts.

Brief Introduction To Matplotlib – Data Visualization In Python

We humans respond far more readily to images than to text. Images help us visualize and understand a situation better than interpreting raw data does. So we have always wanted a way to represent data through images. If you look at our history, we have tried to accomplish this in many ways. While I can’t go back in time to explain each and every approach, I can quickly give you a historical introduction to Matplotlib. This should help you understand how this package came to be and why it matters a lot in Python data visualization.

Historical Introduction To Matplotlib – Data Visualization

In the early days of computer data analysis, data scientists often relied on tools like gnuplot and MATLAB to visualize data. However, the problem here was that they had to do it in two stages. First use programming languages like C or Python scripts to process the data. Then plot the resulting data output using gnuplot or MATLAB.

It was a very cumbersome process to say the least. It also resulted in erroneous calculations at times due to the lengthy process. As a result, scientists were in dire need of a simpler solution. This is when Matplotlib, the data visualization package for Python, was born.

Matplotlib Official Logo

This helped scientists to both process the data using Python scripts and visualize the resulting output using the Matplotlib package. Since Matplotlib was developed along the lines of MATLAB, it offers a plotting interface that closely resembles MATLAB’s plotting commands. Because of this, data scientists embraced it pretty quickly.

Now you may be wondering why you should be using Matplotlib over MATLAB. For you to understand and appreciate the Matplotlib package, you need to understand some of the benefits it brings over a tool like MATLAB. Discussing the advantages and disadvantages of this library in an article that gives an introduction to Matplotlib is appropriate, I believe. This will help you make an appropriate decision while choosing this Python library package.

Advantages Of Matplotlib

  • Matplotlib Is Open Source – One of the primary advantages of Matplotlib is that it is an open source package. Because of this, you can use it in whatever way you want. You don’t need to pay any money for this tool to anyone. You can also use it for both academic and commercial purposes.
  • Written In Python – Yet another benefit of Matplotlib is that it is written largely in Python. As you use the Python programming language to do data processing, plotting its result in Python again makes it so much easier.
  • Customizable & Extensible – As it’s written in Python, you will also be able to customize the package (if required) to suit your requirements. In addition to this, you can always extend its functionality and contribute it back to the open source community. Since Python also has other useful packages, you can also make use of those packages’ functionality to extend Matplotlib.
  • Portable & Cross Platform – Since it’s written in Python, it is easily portable to any system which can run Python. It also works smoothly on Windows, Linux & Mac OS.
  • Easy to learn – Because the Python language is much easier to learn, any package written using this language becomes so much easier to learn as well. You will not find Matplotlib any different when it comes to this.

Matplotlib Output formats

When we plot any data using Matplotlib, we can get the resulting output plots in two different ways.

The first method is to get the resulting output plots in a new window. This is useful if you want an interactive data output. In this case, your output will continue to display for as long as your program is running.

The second method is to save the output plots permanently on your computer. In this method, your resulting output will be saved in a file in a standard format such as PNG, JPG or PDF. This will be very useful to you when you want to share your results with others. You can also make use of this method when you want to generate a lot of output charts or plots programmatically. But in this case, you have one disadvantage. You will not be able to interact with the resulting output the way you could in the first method. But as I mentioned earlier, if you just want to analyze the results in bulk at a later time, this is still one of the best methods to make use of.

So you might be wondering now, what other different formats can you use to save the resulting output. Let me help you right there! So this is the list of file formats supported by Matplotlib that you can use to save your resulting plots:

EPS, JPG, PDF, PNG, PS, SVG
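As a rough illustration of saving a plot in more than one of these formats, the sketch below writes the same figure as PNG and PDF. It writes into in-memory buffers only so that it leaves no files behind; in practice you would normally pass a file name such as "plot.png" (a hypothetical name) to savefig instead. The Agg backend is selected so that no window is needed:

```python
import io

import matplotlib

matplotlib.use("Agg")  # file-only backend, no window needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(range(5))

# Save the same figure in two formats; the buffers stand in for files.
saved = {}
for fmt in ("png", "pdf"):
    buffer = io.BytesIO()
    fig.savefig(buffer, format=fmt)
    saved[fmt] = buffer.getvalue()

plt.close(fig)
print({fmt: len(data) for fmt, data in saved.items()})
```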

So now you understand the different formats that we can store our outputs in. Next, let me introduce you to another important feature of Matplotlib. It is called Backends and this terminology is something that you should be aware of when working with Matplotlib.

Introduction To Matplotlib: What Are Backends?

Backends in Matplotlib are the feature that lets you either visualize the output live or store it so that you can analyze it in the future.

As I mentioned in the previous section, we can store the resulting plots in either files or view it live in a new interactive window. Backends simply represent this factor. So from the previous paragraphs we can already realize that there are two types of backends in Matplotlib:

  • Hardcopy backend – This is the type of backend where we save the images in a file
  • User Interface backend – In this type of backend, we display the resulting output in an interactive output window.
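The backend in use can be queried and selected from code. The sketch below is a minimal example that picks Agg, a hardcopy backend, before pyplot is imported; which user interface backends are available depends on the GUI libraries installed on your machine, so none is selected here:

```python
import matplotlib

# Select a hardcopy (file-only) backend before importing pyplot.
matplotlib.use("Agg")
import matplotlib.pyplot as plt

backend_name = matplotlib.get_backend()
print(backend_name)
```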

In order to provide us with these two types of backends, Matplotlib makes use of two submodules called the renderer and the canvas. Let us try to learn more about them to get a better understanding of Matplotlib backends.

What is a renderer in Matplotlib?

A renderer is a module used by Matplotlib to draw its output plot or graph. So it is this module that does the actual drawing of our Matplotlib output. The standard renderer used by Matplotlib to render its output is called the Anti-Grain Geometry (AGG) renderer.

Now are you wondering what this renderer does? Want to know what is so special about it? The AGG renderer is a high performance renderer. It helps us generate publication quality output. It also gives us output with subpixel accuracy and antialiasing. Is this all sounding too alien for you? Then simply know that the AGG renderer helps in producing high quality graphs and plots that we can admire!

Now that we understand the renderer used by Matplotlib, let us turn our focus towards the second module, the canvas.

What Is A Canvas In Matplotlib?

After getting an understanding of the renderer, it’s time for us to learn about the other module, the canvas of Matplotlib.

We mentioned that the renderer is responsible for the drawing. But do you know where exactly this drawing is being done? That is where the canvas module comes into the picture. The canvas is the area where the renderer will draw our plots and graphs. Ok, so who provides this canvas then? Great question!

The canvas in Matplotlib is usually provided by a GUI library. This can be GTK if you are using a Linux machine, or WX if you are on Mac OS. If you are using a Windows machine, it could come from the Windows GUI libraries. But that is not all. It could also come from a platform agnostic toolkit like Qt if you are developing your visualization tools in Qt.

So basically we get the canvas from the GUI library we intend to use in our Visualization app.
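The canvas and renderer behind a figure can be inspected from code. The sketch below assumes the Agg backend, which provides its own file-only canvas rather than one from a GUI library; with a user interface backend the canvas class would come from the corresponding GUI toolkit instead:

```python
import matplotlib

matplotlib.use("Agg")  # Agg backend: canvas and renderer without a GUI
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(range(5))

canvas = fig.canvas               # the area the figure is drawn on
canvas.draw()                     # ask the renderer to draw the figure
renderer = canvas.get_renderer()  # the object that does the drawing

canvas_name = type(canvas).__name__
renderer_name = type(renderer).__name__
print(canvas_name, renderer_name)

plt.close(fig)
```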

But do you really need to worry about it? Do you really need to know about canvases and renderers when using Matplotlib? Well, in most cases, not really. You can simply use Matplotlib’s functions to generate your visualizations. You don’t really need to know the underlying aspects of how Matplotlib works to use it. But if you want to be a good visual data scientist, knowing how Matplotlib works under the hood will help you in mastering it. It will also help you in truly appreciating its versatility.

Conclusion

So that is it. This should give you a good introduction to Matplotlib – data visualization and why you should use it. In the next set of articles, we will learn how to install the library package. We will learn how to make use of it to draw some interesting plots and graphs. So see you there! Until then, have a great learning experience!

If you are interested in using Matplotlib to add text to any image, here is a quick tutorial link describing how you can do this.

Categories
DATA MINING DATA SCIENCE HTML JAVASCRIPT PROGRAMMING PYTHON STATIC WEBSITES TUTORIALS WEB DEVELOPMENT WEB SCRAPING

How To Extract Data From A Website Using Python

In this article, we are going to learn how to extract data from a website using Python. The term used for extracting data from a website is called “Web scraping” or “Data scraping”. We can write programs using languages such as Python to perform web scraping automatically.

In order to understand how to write a web scraper using Python, we first need to understand the basic structure of a website. We have already written an article about it here on our website. Take a quick look at it once before proceeding here to get a sense of it.

The way to scrape a webpage is to find specific HTML elements and extract their contents. So, to write a website scraper, you need to have a good understanding of HTML elements and their syntax.

Assuming you have a good understanding of these prerequisites, we will now proceed to learn how to extract data from a website using Python.

Python logo on extracting data from a web page using Python
Python Web Scraper Development

How To Fetch A Web Page Using Python

The first step in writing a web scraper using Python is to fetch the web page from the web server to our local computer. One can achieve this by making use of a readily available Python package called urllib.

The urllib package is part of the Python 3 standard library, so it is already available with every Python installation and does not need to be installed with pip.

With urllib available, we can start using it to fetch the web page to scrape its data.

For the sake of this tutorial, we are going to extract data from the Wikipedia web page on comets found here:

https://en.wikipedia.org/wiki/Comet

This Wikipedia article contains a variety of HTML elements such as text, images, tables, headings etc. We can extract each of these elements separately using Python.

How To Fetch A Web Page Using Urllib Python Package

Let us now fetch this web page using the Python library urllib with the following code:

import urllib.request
content = urllib.request.urlopen('https://en.wikipedia.org/wiki/Comet')

read_content = content.read()

The first line:

import urllib.request

will import the request module of the urllib package into our Python program. We will make use of this module’s urlopen function to send an HTTP GET request to the Wikipedia server for the web page. The URL of this web page is passed as the parameter to this function.

content = urllib.request.urlopen('https://en.wikipedia.org/wiki/Comet')

As a result of this, the Wikipedia server will respond with the HTML content of this web page. The response object returned by urlopen is what gets stored in the Python program’s “content” variable.

The content variable holds the server’s response: the HTTP status and headers, along with the body of the page, which has not been read yet. As a web scraper we are mostly interested in the HTML body itself. Hence, in the next line of the program we call the read() method of the response object to read it.

read_content = content.read()

The above line of Python code gives us the complete HTML of the page as a bytes object. This still includes everything in the document, such as <meta> tags in the head as well as the human readable content.

At this point in our program we have downloaded all the HTML that we would be interested in. It is now time to extract individual data elements of the web page.
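The fetch step can be wrapped into a small helper. The sketch below is one possible way to do it, not the only one: the function names build_request and fetch_html and the User-Agent string "tutorial-scraper/0.1" are made up for this example, and the fallback to UTF-8 is an assumption for pages that do not declare a character set. The helper is only defined here, not called, so running the sketch does not touch the network:

```python
import urllib.request

def build_request(url):
    """Create a GET request that identifies our (hypothetical) scraper."""
    headers = {"User-Agent": "tutorial-scraper/0.1"}  # illustrative value
    return urllib.request.Request(url, headers=headers)

def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a str."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        # read() returns bytes; decode using the declared charset if present.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)

request = build_request("https://en.wikipedia.org/wiki/Comet")
print(request.full_url)
print(request.get_method())  # GET
```

Calling fetch_html('https://en.wikipedia.org/wiki/Comet') would then return the page as text, which can be passed on to an HTML parser.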

How To Extract Data From Individual HTML Elements Of The Web Page

In order to extract individual HTML elements from our read_content variable, we need to make use of another Python library called Beautifulsoup. Beautifulsoup is a Python package that can understand HTML syntax and elements. Using this library, we will be able to extract out the exact HTML element we are interested in.

We can install Python Beautifulsoup package into our local development system by issuing the command:

pip install bs4

Once the Beautifulsoup Python package is installed, we can start using it to extract HTML elements from our web content. Hope you remember that we had earlier stored our web content in the Python variable “read_content”. We are now going to pass this variable along with the parser name ‘html.parser’ to Beautifulsoup to extract HTML elements as shown below:

from bs4 import BeautifulSoup
soup = BeautifulSoup(read_content,'html.parser')

From this point onwards, our “soup” Python variable holds the parsed HTML elements of the webpage. So we can start accessing each of these HTML elements by using its find and find_all methods.

How To Extract All The Paragraphs Of A Web Page

For example, if we want to extract the first paragraph of the wikipedia comet article, we can do so using the code:

pAll = soup.find_all('p')

The above code will extract all the paragraphs present in the article and assign them to the variable pAll. Now pAll contains a list of all paragraphs, so each individual paragraph can be accessed through indexing. So in order to access the first paragraph, we issue the command:

pAll[0].text

The output we obtain is:

'\n'

So the first paragraph only contained a new line. What if we try the next index?

pAll[1].text
'\n'

We again get a newline! Now what about the third index?

pAll[2].text
"A comet is an icy, small Solar System body that..."

And now we get the text of the first paragraph of the article! If we continue further with indexing, we can see that we continue to get access to every other HTML <p> element of the article. In a similar way, we can extract other HTML elements too as shown in the next section.
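Since some of the paragraphs contain only a newline, it can help to filter those out before indexing. The sketch below shows one way to do this, assuming the bs4 package is installed. It runs on a small made-up HTML snippet instead of the live Wikipedia page so that it can be tried offline; the snippet and its text are invented for illustration:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a downloaded page (made up for this example).
html = """
<html><body>
<p>
</p>
<p>A comet is an icy body.</p>
<h2>Contents</h2>
<p>Comets have <a href="/wiki/Orbit">orbits</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Take the text of every <p>, then drop the ones that are only whitespace.
paragraphs = [p.get_text().strip() for p in soup.find_all("p")]
non_empty = [text for text in paragraphs if text]

print(len(paragraphs))  # 3
print(non_empty)        # ['A comet is an icy body.', 'Comets have orbits.']
```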

How To Extract All The H2 Elements Of A Web Page

Extracting the H2 elements of a web page can be achieved in a similar way to how we did it for the paragraphs earlier. By simply issuing the following command:

h2All = soup.find_all('h2')

we can filter and store all H2 elements into our h2All variable.

So with this we can now access each of the h2 elements by indexing the h2All variable:

>>> h2All[0].text
'Contents'
>>> h2All[2].text
'Physical characteristics[edit]'
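The “[edit]” suffix in the output above comes from the edit link that sits inside each heading. A simple way to clean it up, and to pull out link targets in the same manner, is sketched below. It assumes the bs4 package is installed and again uses a made-up HTML snippet rather than the live page:

```python
from bs4 import BeautifulSoup

# Made-up snippet that mimics headings with an "[edit]" marker and a link.
html = (
    "<h2>Contents</h2>"
    "<h2>Physical characteristics<span>[edit]</span></h2>"
    '<p>See <a href="/wiki/Orbit">orbit</a> for details.</p>'
)

soup = BeautifulSoup(html, "html.parser")

# Heading text with the "[edit]" marker stripped out.
headings = [h2.get_text().replace("[edit]", "") for h2 in soup.find_all("h2")]

# The target of every link that has an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]

print(headings)  # ['Contents', 'Physical characteristics']
print(links)     # ['/wiki/Orbit']
```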

Conclusion

So there you have it. This is how we extract data from a website using Python, by making use of two important libraries: urllib and Beautifulsoup.

We first pull the web page content from the web server using urllib and then we use Beautifulsoup over the content. Beautifulsoup then provides us with many useful functions (find_all, text etc) to extract individual HTML elements of the web page. By making use of these functions, we can address individual elements of the web page.

So far we have seen how we could extract paragraphs and h2 elements from our web page. But we do not stop there. We can extract any type of HTML element using a similar approach, be it images, links, tables etc. If you want to verify this, check out this other article where we have taken a similar approach to extract table elements from another Wikipedia article.

How to scrape HTML tables using Python