In this article, I am going to explain how to use Pandas in Python. Pandas is one of the most popular Python
modules for data manipulation and analysis. It provides an easy interface to
interact with data and apply transformations to it on the go. The module is released under the BSD license
and is free to use. You can download it from the project website or install it through the Python package manager.
Pandas provides a range of data analysis options, from reading data from files and databases, to applying
various transformations within data frames, slicing and dicing the data, and then writing the data back to a
database or preparing it for a visualization tool. Pandas can also visualize data within the Python
environment with the help of another module, matplotlib. However, for the
scope of this article, we will stick to learning Pandas only. As per the definition provided by Wikipedia,
“The name Pandas is derived from the term ‘panel data’, an econometrics term for data sets that include
observations over multiple time periods for the same individuals”. Over the last few years, this module has
been gaining popularity, which can be seen in the search trends from Stack Overflow.
Figure 1 – Pandas popularity from Stack Overflow
As the graph above shows, the use of Pandas has grown sharply in recent years,
and it is now one of the most common modules in the data science community.
What can be done with Pandas in python?
You can consider it the bread and butter of your data applications. Whenever you think about playing with
data in Python, the first thing to consider is using Pandas to wrangle the data into your
playground. You can start by cleaning the data to remove unwanted information, then transform it by
applying business logic, and finally prepare it for visualization.
For example, suppose you want to read data from a CSV file that is either on your machine or on a shared
network location. With the help of Pandas, you can easily read the CSV
file into a data frame within the Python environment. Once the data is within the Python environment, you can
apply many operations on it, some of which are listed below.
- Calculate the basic statistics of your dataset and answer common questions, such as what the mean, the
median, the minimum and the maximum values are
- Find the correlation between two or more columns in the dataset
- Perform data cleaning by removing missing or blank values and filtering records based on a criterion
- Visualize the data by using other modules like seaborn, matplotlib, etc.
- Save the cleaned data frame into a CSV file or a database of your choice
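The operations listed above can be sketched roughly as follows. This is only a minimal illustration: the CSV content, the column names and the variable names are all made up for the example, and an in-memory string stands in for a real file so that the snippet runs on its own.

```python
import io

import pandas

# Hypothetical CSV content; in practice this would be a file path
# on your machine or on a shared network location.
csv_text = """employee,department,age,salary
Alice,HR,30,50000
Bob,IT,45,65000
Carol,IT,,70000
Dave,Finance,50,80000
"""

# Read the CSV into a data frame (StringIO stands in for a real file).
df = pandas.read_csv(io.StringIO(csv_text))

# Basic statistics for a numeric column.
print(df["salary"].mean(), df["salary"].median())
print(df["salary"].min(), df["salary"].max())

# Correlation between two numeric columns (missing values are ignored).
print(df["age"].corr(df["salary"]))

# Data cleaning: drop rows with missing values, then filter on a criterion.
cleaned = df.dropna()
it_staff = cleaned[cleaned["department"] == "IT"]

# Save the cleaned data frame as CSV text; pass a file path to to_csv
# to write it to disk instead.
csv_out = cleaned.to_csv(index=False)
```

Here `read_csv`, `dropna`, `corr` and `to_csv` are standard Pandas calls; writing to a database or plotting with seaborn or matplotlib needs additional modules and is left out of this sketch.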
How does it fit into the data world?
If you are working as a Data Engineer or a Data Scientist, you might already have come across Pandas while
developing applications. For a beginner, however, I would suggest having a basic understanding of how
Python works, including its data structures such as lists, dictionaries and tuples, and how to iterate over them.
The Pandas module has been developed on top of another popular module, NumPy. This means
that many data structures in the two modules are similar. The data in Pandas can be passed to
other packages, such as SciPy for scientific analysis or Matplotlib for
making visualizations. It can also be used as a source for machine learning modules.
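As a small illustration of how closely the two modules work together, the sketch below converts a Pandas column to a NumPy array and passes a Pandas column straight to a NumPy function. The column names and values are hypothetical.

```python
import numpy
import pandas

# Hypothetical numeric data, used only for illustration.
df = pandas.DataFrame(
    {"height_cm": [150.0, 160.0, 170.0], "weight_kg": [50.0, 60.0, 70.0]}
)

# A column can be converted to a NumPy array...
heights = df["height_cm"].to_numpy()
print(type(heights))

# ...and NumPy functions accept Pandas columns directly.
mean_weight = numpy.mean(df["weight_kg"])
print(mean_weight)
```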
Installing and setting up Pandas
So far, we have learned what the Pandas library is and what it is used for. Let us now go
ahead and see how to install it on our machine and start using it. Open the command prompt on your
machine and type the following command.
pip install pandas
As soon as you hit Enter, the library starts downloading and will be installed on your machine
shortly. The download is around 9 MB and should be installed within a minute or so.
Figure 2 – Installing Pandas in Python
If you are using Anaconda, then you can install Pandas by running the following command.
conda install pandas
Now that we have installed Pandas on our machine, let us go ahead and print the version information of the module.
In your command prompt window, type “python” and hit Enter. This will start the Python shell within the
command prompt window.
Figure 3 – Starting the python execution in command prompt
Once the Python shell is up and running, we need to import the Pandas module into our Python environment. This can
be done by running the following command and hitting Enter.
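A minimal form of the import, assuming no alias is used (the data frame example later in this article calls `pandas.DataFrame` directly):

```python
# Import the Pandas module under its own name.
import pandas
```

Many tutorials use `import pandas as pd` instead; with that form, every later call would be written as `pd.` rather than `pandas.`.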
This will import the Pandas module, and we can now start using it in our code. Once the module is imported, write
the command that prints the version of Pandas that we have just installed.
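One way to do this, assuming the module was imported as `pandas`, is to read its `__version__` attribute:

```python
import pandas

# __version__ holds the installed Pandas version as a string.
print(pandas.__version__)
```

In the interactive shell, typing `pandas.__version__` on its own also displays the value.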
Once you run the above command, the Pandas version will be printed on the screen as follows.
Figure 4 – Printing Pandas version information
Creating Data Frames using Pandas in Python
The basic structure of the Pandas library is the data frame. A data frame is basically a representation of a 2-D
array with labeled rows and columns. You can also consider the data frame as an in-memory table on which you can perform all the operations
discussed earlier. Whenever we work with the Pandas module, we should try to fit the data into a data frame so that
we can apply all the built-in methods directly.
There are a number of ways in which a data frame can be created. For the sake of this article, let us create
one from a dictionary. For example, let us consider that we have a list of employees and their
corresponding departments. We can create a simple dictionary with two lists in it that will contain this
information. You can use the code below to create the dictionary.
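A minimal sketch of such a dictionary is shown here. The keys, employee names and departments are hypothetical values chosen for illustration; only the variable name `data` is taken from the data frame call used later in this article.

```python
# Hypothetical employee names and departments, used only for illustration.
# Each key becomes a column name and each list holds that column's values.
data = {
    "Employee": ["Alice", "Bob", "Carol"],
    "Department": ["HR", "IT", "Finance"],
}
print(data)
```

Both lists need to have the same length so that every employee lines up with a department.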
Figure 5 – Creating the Dictionary object
Once the dictionary object has been created, let us pass it to Pandas to create a data frame.
empDf = pandas.DataFrame(data)
Figure 6 – Converting the dictionary to a Pandas Data Frame
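For a version that runs on its own, the two steps can be combined as in the sketch below. The dictionary values are hypothetical, and the `shape` and `columns` checks are an addition for illustration.

```python
import pandas

# Hypothetical employee data, used only for illustration.
data = {
    "Employee": ["Alice", "Bob", "Carol"],
    "Department": ["HR", "IT", "Finance"],
}

# Dictionary keys become column names; list items become rows.
empDf = pandas.DataFrame(data)
print(empDf)

# Inspect the result: (rows, columns) and the column names.
print(empDf.shape)          # (3, 2)
print(list(empDf.columns))  # ['Employee', 'Department']
```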
As you can see in the figure above, the dictionary object has been transformed into a Pandas data frame. This data
frame can now be used to perform data analysis and other operations. In my next article, I will show how
we can read data from a CSV file and apply transformations using the data frame.
In this article, we have seen what Pandas is and how we can install it on our machine. We have also
learned about some of the important operations that can be performed with the help of the Pandas library. In day-to-day
analysis, the Pandas module plays a very important role in transforming the raw dataset and applying operations on
it as required. You can sort the data, filter it, add new columns to the dataset based on existing
values, etc. This makes it a very popular module that is heavily used in data science and machine learning.
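As a quick sketch of the sorting, filtering and derived-column operations mentioned above, using a small made-up dataset (all names and values here are hypothetical):

```python
import pandas

# Hypothetical sales data, used only for illustration.
df = pandas.DataFrame(
    {
        "item": ["pen", "book", "bag"],
        "price": [2.0, 12.0, 30.0],
        "quantity": [10, 3, 1],
    }
)

# Sort by price, highest first.
by_price = df.sort_values("price", ascending=False)

# Filter: keep only the rows where the price is above 5.
expensive = df[df["price"] > 5]

# Add a new column based on existing values.
df["total"] = df["price"] * df["quantity"]
print(df)
```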
To learn more about the Pandas library, you can follow the official documentation from the Pandas website.
There is also a very good book on Pandas in Python that you can purchase from Amazon. It describes the methods in more detail and is quite helpful for beginners. If you are planning to learn Python and Pandas by watching video tutorials, Python for Everybody on Coursera is a good place to start.