Last Updated on April 21, 2022
By Nick Cotes
When we talk about managing data, seeing it presented in tables is almost inevitable. With column headers, and sometimes with names for the rows, a table makes data easier to understand. Tables also often mix data of different types: for example, a table of ingredients for a recipe has quantities as numbers and names as strings. In Python, we have the pandas library to help us handle tabular data.
After finishing this tutorial, you will learn:
What the pandas library provides
What a DataFrame and a Series are in pandas
How to manipulate DataFrames and Series beyond trivial array operations
Let’s get started!
Massaging Data Using Pandas. Photo by Mark de Jong. Some rights reserved.
Overview
This tutorial is divided into five parts:
DataFrame and Series
Essential functions in DataFrame
Manipulating DataFrames and Series
Aggregation in DataFrames
Handling time series data in pandas
DataFrame and Series
To begin, let’s start with an example dataset. We will import pandas and read the U.S. air pollutant emission data into a DataFrame:
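(The exact code is not reproduced here; below is a minimal sketch. The URL, sheet_name, and header values are placeholders rather than the actual ones used in this tutorial.)
import pandas as pd

# Hypothetical placeholder: substitute the actual EPA spreadsheet URL or a local path
URL = "https://www.epa.gov/path/to/emissions.xlsx"

# sheet_name and header are illustrative: point them at the worksheet holding
# the data and the row where the column headers start
df = pd.read_excel(URL, sheet_name="State_Trends", header=1)
print(df)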
This is a table of pollutant emissions, with information on the kind of pollutant in each row and the amount emitted in each year.
Here we demonstrated one useful feature of pandas: you can read a CSV file using read_csv() or read an Excel file using read_excel(), as above. The filename can be a local file on your machine or a URL from which the file can be downloaded. We learned about this URL from the U.S. Environmental Protection Agency’s website. We know which worksheet contains the data and from which row the data starts, hence the extra arguments to the read_excel() function.
The pandas object created above is a DataFrame, presented as a table. Similar to NumPy, data in pandas is organized in arrays. But pandas assigns a data type to each column rather than to the entire array. This allows data of different types to be included in the same data structure. We can check the data types either by calling the info() function on the DataFrame:
...
df.info()     # print info to screen
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5319 entries, 0 to 5318
Data columns (total 32 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   State FIPS          5319 non-null   int64
 1   State               5319 non-null   object
 2   Tier 1 Code         5319 non-null   int64
 3   Tier 1 Description  5319 non-null   object
 4   Pollutant           5319 non-null   object
 5   emissions90         3926 non-null   float64
 6   emissions96         4163 non-null   float64
 7   emissions97         4163 non-null   float64
...
 29  emissions19         5052 non-null   float64
 30  emissions20         5052 non-null   float64
 31  emissions21         5052 non-null   float64
dtypes: float64(27), int64(2), object(3)
memory usage: 1.3+ MB
or by getting the data types as a pandas Series:
...
coltypes = df.dtypes
print(coltypes)
State FIPS              int64
State                  object
Tier 1 Code             int64
Tier 1 Description     object
Pollutant              object
emissions90           float64
emissions96           float64
emissions97           float64
...
emissions19           float64
emissions20           float64
emissions21           float64
dtype: object
In pandas, a DataFrame is a table, while a Series is a column of the table. This distinction is important because the data behind a DataFrame is a 2D array, while the data behind a Series is a 1D array.
Similar to the fancy indexing in NumPy, we can extract columns from one DataFrame to create another:
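(The selection code is not shown here; judging from the output that follows, it would look roughly like this, with the variable name being an assumption.)
cols = ["State", "Pollutant", "emissions19", "emissions20", "emissions21"]
last3years = df[cols]
print(last3years)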
State Pollutant emissions19 emissions20 emissions21
0 AL CO 8.243679 8.243679 8.243679
1 AL NH3 0.417551 0.417551 0.417551
2 AL NOX 19.592480 13.752790 11.162100
3 AL PM10-PRI 2.868642 2.868642 2.868642
4 AL PM25-PRI 2.659792 2.659792 2.659792
... ... ... ... ... ...
5314 WY NOX 0.374873 0.374873 0.374873
5315 WY PM10-PRI 2.857886 2.857886 2.857886
5316 WY PM25-PRI 2.421937 2.421937 2.421937
5317 WY SO2 0.208817 0.208817 0.208817
5318 WY VOC 6.645249 6.645249 6.645249
[5319 rows x 5 columns]
Or, if we pass in a column name as a string rather than a list of column names, we extract a column from a DataFrame as a Series:
...
data2021 = df["emissions21"]
print(data2021)
0 8.243679
1 0.417551
2 11.162100
3 2.868642
4 2.659792
...
5314 0.374873
5315 2.857886
5316 2.421937
5317 0.208817
5318 6.645249
Name: emissions21, Length: 5319, dtype: float64
Essential Functions in DataFrame
Pandas is feature-rich. Many essential operations on a table or a column are provided as functions defined on the DataFrame or Series. For example, we can see a list of pollutants covered in the table above by using:
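(The snippet is not reproduced here; one way to do this is with the unique() function on the Series.)
...
print(df["Pollutant"].unique())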
And we can find the mean (mean()), standard deviation (std()), minimum (min()), and maximum (max()) of a series similarly:
...
print(df["emissions21"].mean())
But in fact, we are more likely to use the describe() function to explore a new DataFrame. Since the DataFrame in this example has too many columns, it is better to transpose the resulting DataFrame from describe():
...
print(df.describe().T)
             count       mean        std      min        25%        50%        75%        max
State FIPS  5319.0  29.039481  15.667352  1.00000  16.000000  29.000000  42.000000  56.000000
...
Indeed, the DataFrame produced by describe() can help us get a sense of the data. From there, we can tell how much missing data there is (by looking at the count), how the data are distributed, whether there are outliers, and so on.
Manipulating DataFrame and Series
Similar to the Boolean indexing in NumPy, we can extract a subset of rows from a DataFrame. For example, this is how we can select the data for carbon monoxide emissions only:
...
df_CO = df[df["Pollutant"] == "CO"]
print(df_CO)
State FIPS State Tier 1 Code Tier 1 Description ... emissions18 emissions19 emissions20 emissions21
As you may expect, the == operator compares each element of the Series df["Pollutant"], resulting in a Series of Boolean values. If the lengths match, the DataFrame knows to select rows based on those Boolean values. In fact, we can combine Booleans using bitwise operators. For example, this is how we select the rows of carbon monoxide emissions due to highway vehicles:
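(A minimal sketch of such a selection, assuming the highway source appears as "HIGHWAY VEHICLES" in the "Tier 1 Description" column; the exact label may differ in the actual data.)
...
df_CO_HW = df[(df["Pollutant"] == "CO") & (df["Tier 1 Description"] == "HIGHWAY VEHICLES")]
print(df_CO_HW)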
If you prefer to select rows like a Python list, you may do so via the iloc interface. This is how we can select rows 5 to 10 (zero-indexed) or columns 1 to 6 and rows 5 to 10:
...
df_r5 = df.iloc[5:11]            # rows 5 to 10
df_c1_r5 = df.iloc[5:11, 1:7]    # rows 5 to 10, columns 1 to 6
If you’re familiar with Excel, you probably know one of its exciting features called a “pivot table.” Pandas allows you to do the same. Let’s consider the pollution of carbon monoxide from all states in 2021 from this dataset:
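(The original pivot code is not shown; below is a sketch of what it could look like, assuming the states become the rows, the "Tier 1 Description" sources become the columns, and the 2021 emissions are the values.)
...
df_pivot = df[df["Pollutant"] == "CO"].pivot_table(
    index="State",
    columns="Tier 1 Description",
    values="emissions21",
)
print(df_pivot)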
The pivot_table() function above does not require the values to be unique to the index and columns. In other words, should there be two “wildfire” rows in a state in the original DataFrame, this function will aggregate the two (the default is to take the mean). To reverse the pivot operation, we have the melt() function:
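(A sketch of undoing the pivot with melt(); the argument names below are illustrative.)
...
df_melt = df_pivot.reset_index().melt(
    id_vars="State",
    var_name="Tier 1 Description",
    value_name="emissions21",
)
print(df_melt)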
There is way more we can do with a DataFrame. For example, we can sort the rows (using the sort_values() function), rename columns (using the rename() function), remove redundant rows (drop_duplicates() function), and so on.
In a machine learning project, we often need to do some clean-up before we can use the data. It is handy to use pandas for this purpose. The df_pivot DataFrame we just created has some values marked as NaN for no data available. We can replace all those with zero with any of the following:
df_pivot.fillna(0)
df_pivot.where(df_pivot.notna(), 0)
df_pivot.mask(df_pivot.isna(), 0)
Aggregation in DataFrames
In fact, pandas provides table manipulations that otherwise could only be done easily with SQL statements in a database. Reusing the example dataset above, each pollutant in the table is broken down by emission source. If we want to know the aggregated pollutant emissions, we can simply sum up all the sources. Similar to SQL, this is a “group by” operation. We can do so with the following:
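(A minimal sketch of such a group-by, assuming we group by both state and pollutant and keep only the numeric emission columns.)
...
df_sum = df.groupby(["State", "Pollutant"]).sum(numeric_only=True)
print(df_sum)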
The result of the groupby() function will use the grouping column as the row index. It works by putting rows that have the same value for that grouping column into a group. Then as a group, some aggregate function is applied to reduce the many rows into one. In the above example, we are taking the sum across each column. Pandas comes with many other aggregate functions, such as taking the mean or just counting the number of rows. Since we are doing sum(), the non-numeric columns are dropped from the output as they do not apply to the operation.
This allows us to do some interesting tasks. Let’s say, using the data in the DataFrame above, we want to create a table of the total emissions of carbon monoxide (CO) and sulfur dioxide (SO2) in 2021 in each state. The reasoning for how to do that would be as follows (a code sketch comes after the list):
Group by “State” and “Pollutant,” then sum up each group. This is how we get the total emission of each pollutant in each state.
Select only the column for 2021
Run pivot table to make states the rows and the pollutants the columns with the total emission as the values
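(A sketch of these three steps chained together; the variable name and exact arguments are assumptions based on the description below.)
...
df_2021 = (
    df.groupby(["State", "Pollutant"])
      .sum(numeric_only=True)    # total emission of each pollutant in each state
      ["emissions21"]            # keep only the 2021 column
      .reset_index()             # turn the index back into State/Pollutant columns
      .pivot_table(index="State", columns="Pollutant", values="emissions21")
      .filter(["CO", "SO2"])     # keep only the CO and SO2 columns
)
print(df_2021)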
In the above code, each step after the groupby() function creates a new DataFrame. Since the functions we need are defined on the DataFrame itself, we can chain the invocations as shown above.
The sum() function creates a DataFrame from the GroupBy object, with the grouping columns “State” and “Pollutant” as the index. Therefore, after we diced the DataFrame down to one column, we used reset_index() to turn the index back into columns (i.e., there will be three columns: State, Pollutant, and emissions21). Since the pivoted table contains more pollutants than we need, we use filter() to select only the CO and SO2 columns from the resulting DataFrame. This is similar to using fancy indexing to select columns.
Indeed, we can do the same differently:
Select only the rows for CO and compute the total emission; select only the data for 2021
Do the same for SO2
Combine the resulting DataFrame in the previous two steps
In pandas, there is a join() function in DataFrame that helps us combine the columns with another DataFrame by matching the index. In code, the above steps are as follows:
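(A minimal sketch of these steps; the variable names and the rename() calls are assumptions.)
...
df_co = df[df["Pollutant"] == "CO"].groupby("State").sum(numeric_only=True)["emissions21"].rename("CO")
df_so2 = df[df["Pollutant"] == "SO2"].groupby("State").sum(numeric_only=True)["emissions21"].rename("SO2")
# join() matches on the index, which is "State" in both
df_joined = df_co.to_frame().join(df_so2)
print(df_joined)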
The join() function is limited to index matching. If you’re familiar with SQL, the JOIN clause’s equivalent in pandas is the merge() function. If the two DataFrames we created for CO and SO2 have the states as a separate column, we can do the same as follows:
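(A sketch of the merge() version, assuming reset_index() is used to keep the state as a regular column.)
...
df_co = (df[df["Pollutant"] == "CO"].groupby("State").sum(numeric_only=True)["emissions21"]
         .reset_index().rename(columns={"emissions21": "CO"}))
df_so2 = (df[df["Pollutant"] == "SO2"].groupby("State").sum(numeric_only=True)["emissions21"]
          .reset_index().rename(columns={"emissions21": "SO2"}))
df_merged = df_co.merge(df_so2, on="State", how="outer")
print(df_merged)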
The merge() function in pandas can do all types of SQL joins. We can match different columns from a different DataFrame, and we can do left join, right join, inner join, and outer join. This will be very useful when wrangling the data for your project.
The groupby() function in a DataFrame is powerful as it allows us to manipulate the DataFrame flexibly and opens the door to many sophisticated transformations. There may be a case that no built-in function can help after groupby(), but we can always provide our own. For example, this is how we can create a function to operate on a sub-DataFrame (on all columns except the group-by column) and apply it to find the years of minimum and maximum emissions:
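(The original function is not shown; below is a sketch of how such a custom aggregation might look. The function name and the use of the yearly column labels as “years” are assumptions.)
...
def minmaxyear(subdf):
    # sum each yearly emission column within the group, then report which
    # columns hold the smallest and largest totals
    emissions = subdf.filter(like="emissions").sum()
    return pd.Series({"min year": emissions.idxmin(), "max year": emissions.idxmax()})

df_minmax = df.groupby("State").apply(minmaxyear)
print(df_minmax)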
The apply() function is the last resort to provide us the maximum flexibility. Besides GroupBy objects, there are also apply() interfaces in DataFrames and Series.
The following is the complete code to demonstrate all operations we introduced above:
import pandas as pd
# Pollutants data from Environmental Protection Agency
Handling Time Series Data in Pandas
You will find another powerful feature in pandas if you are dealing with time series data. To begin, let’s consider some daily pollution data. We can select and download some from the EPA’s website:
For illustration purposes, we downloaded the PM2.5 data of Texas in 2021. We can import the downloaded CSV file, ad_viz_plotval_data.csv, as follows:
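(A minimal sketch of the import, using parse_dates to convert the first column, as explained below.)
import pandas as pd

# Read the CSV and parse the first (zero-indexed) column as dates
df = pd.read_csv("ad_viz_plotval_data.csv", parse_dates=[0])
print(df)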
The read_csv() function from pandas allows us to specify certain columns as dates and parse them into datetime objects rather than strings. This is essential for further processing time series data. Since the first column (zero-indexed) is the date column, we provide the argument parse_dates=[0] above.
For manipulating time series data, it is important to use time as an index in your DataFrame. We can make one of the columns an index by the set_index() function:
...
df_pm25 = df.set_index("Date")
print(df_pm25)
Source  Site ID  POC  Daily Mean PM2.5 Concentration  ...  COUNTY_CODE  COUNTY  SITE_LATITUDE  SITE_LONGITUDE
The index now has the type datetime64, which is the timestamp type in pandas.
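(As an illustrative check, not part of the original code, we can print the index to confirm.)
...
print(df_pm25.index)    # DatetimeIndex([...], dtype='datetime64[ns]', name='Date', ...)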
From the index above, we can see that each date is not unique. This is because the PM2.5 concentration is observed at different sites, each of which contributes a row to the DataFrame. We can filter the DataFrame down to one site to make the index unique. Alternatively, we can use pivot_table() to transform the DataFrame, where the pivot operation guarantees the resulting DataFrame has a unique index:
df_2021 = (df[["Date", "Daily Mean PM2.5 Concentration", "Site Name"]]
           .pivot_table(index="Date",
                        columns="Site Name",
                        values="Daily Mean PM2.5 Concentration")
          )
print(df_2021)
Site Name Amarillo A&M Ascarate Park SE Austin North Hills Drive ... Von Ormy Highway 16 Waco Mazanec World Trade Bridge
Date ...
2021-01-01 1.7 11.9 3.0 ... 4.6 2.7 4.4
2021-01-02 2.2 7.8 6.1 ... 3.7 2.2 6.1
2021-01-03 2.5 4.2 4.3 ... 6.3 4.2 8.5
2021-01-04 3.7 8.1 3.7 ... 6.4 4.2 5.7
2021-01-05 4.5 10.0 5.2 ... 7.7 6.1 7.1
... ... ... ... ... ... ... ...
2021-12-27 1.9 5.8 11.0 ... 13.8 10.5 15.7
2021-12-28 1.8 6.6 14.1 ... 17.7 9.7 17.6
2021-12-29 NaN 8.1 21.8 ... 28.6 12.5 14.1
2021-12-30 4.0 9.5 13.1 ... 20.4 13.4 18.5
2021-12-31 3.6 3.7 16.3 ... 18.3 11.8 21.5
[365 rows x 53 columns]
We can check the uniqueness with:
df_2021.index.is_unique
Now, every column in this DataFrame is a time series. While pandas does not provide any forecasting function on the time series, it comes with tools to help you clean and transform the data. Setting a DatetimeIndex on a DataFrame is handy for time series analysis projects because we can easily extract data for a time interval, e.g., the train-test split of the time series. Below is how we can extract a 3-month subset from the above DataFrame:
df_3month = df_2021["2021-04-01":"2021-07-01"]
One commonly used operation on a time series is resampling. Considering the daily data in this DataFrame, we can transform it into weekly observations instead. We can specify the resulting data to be indexed on every Sunday. But we still have to tell pandas how to aggregate the data within each period. If it were sales data, we would probably sum over the entire week to get the weekly revenue. In this case, we can take the average over each week to smooth out the fluctuations. Another alternative is to take the first observation of each period, like below:
...
df_resample = df_2021.resample("W-SUN").first()
print(df_resample)
Site Name Amarillo A&M Ascarate Park SE Austin North Hills Drive ... Von Ormy Highway 16 Waco Mazanec World Trade Bridge
Date ...
2021-01-03 1.7 11.9 3.0 ... 4.6 2.7 4.4
2021-01-10 3.7 8.1 3.7 ... 6.4 4.2 5.7
2021-01-17 5.8 5.3 7.0 ... 5.4 6.9 4.8
...
2021-12-19 3.6 13.0 6.3 ... 6.9 5.9 5.5
2021-12-26 5.3 10.4 5.7 ... 5.5 5.4 3.9
2022-01-02 1.9 5.8 11.0 ... 13.8 10.5 15.7
[53 rows x 53 columns]
The string “W-SUN” tells resample() to use weekly periods ending on Sundays. It is called an “offset alias.” You can find the list of all offset aliases in the pandas documentation.
Resampling is particularly useful in financial market data. Imagine if we have the price data from the market, where the raw data does not come in regular intervals. We can still use resampling to convert the data into regular intervals. Because it is so commonly used, pandas even provides you the open-high-low-close (known as OHLC, i.e., first, maximum, minimum, and last observations over a period) from the resampling. We demonstrate below how to get the OHLC over a week on one of the observation sites:
df_ohlc = df_2021["San Antonio Interstate 35"].resample("W-SUN").ohlc()
print(df_ohlc)
open high low close
Date
2021-01-03 4.2 12.6 4.2 12.6
2021-01-10 9.7 9.7 3.0 5.7
2021-01-17 5.4 13.8 3.0 13.8
2021-01-24 9.5 11.5 5.7 9.0
...
2021-12-12 5.7 20.0 5.7 20.0
2021-12-19 9.7 9.7 3.9 3.9
2021-12-26 6.1 14.7 6.0 14.7
2022-01-02 10.9 23.7 10.9 16.3
In particular, if we resample a time series from a coarser frequency into a finer frequency, it is called upsampling. Pandas usually inserts NaN values during upsampling as the original time series does not have data during the in-between time instances. One way to avoid these NaN values during upsampling is to ask pandas to forward-fill (carry over values from an earlier time) or back-fill (using values from a later time) the data. For example, the following is to forward-fill the daily PM2.5 observations from one site into hourly:
...
series_ffill = df_2021["San Antonio Interstate 35"].resample("H").ffill()
print(series_ffill)
Date
2021-01-01 00:00:00 4.2
2021-01-01 01:00:00 4.2
2021-01-01 02:00:00 4.2
2021-01-01 03:00:00 4.2
2021-01-01 04:00:00 4.2
...
2021-12-30 20:00:00 18.2
2021-12-30 21:00:00 18.2
2021-12-30 22:00:00 18.2
2021-12-30 23:00:00 18.2
2021-12-31 00:00:00 16.3
Freq: H, Name: San Antonio Interstate 35, Length: 8737, dtype: float64
Besides resampling, we can also transform the data using a sliding window. For example, below is how we can make a 10-day moving average from the time series. It is not a resampling because the resulting data is still daily. But for each data point, it is the mean of the past 10 days. Similarly, we can find the 10-day standard deviation or 10-day maximum by applying a different function to the rolling object.
...
df_mean = df_2021["San Antonio Interstate 35"].rolling(10).mean()
print(df_mean)
Date
2021-01-01 NaN
2021-01-02 NaN
2021-01-03 NaN
2021-01-04 NaN
2021-01-05 NaN
...
2021-12-27 8.30
2021-12-28 9.59
2021-12-29 11.57
2021-12-30 12.78
2021-12-31 13.24
Name: San Antonio Interstate 35, Length: 365, dtype: float64
To show how the original and the rolling average time series differ, the plot below displays both. We added the argument min_periods=5 to the rolling() function because the original data has missing values on some days. This produces gaps in the daily data, but we ask that the mean still be computed as long as there are at least 5 data points within the window of the past 10 days.
...
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(12, 6))
plt.plot(df_2021["San Antonio Interstate 35"], label="daily")
plt.plot(df_2021["San Antonio Interstate 35"].rolling(10, min_periods=5).mean(), label="10-day MA")
plt.legend()
plt.ylabel("PM 2.5")
plt.show()
The following is the complete code to demonstrate the time series operations we introduced above: