December 1, 2020

Programmingempire

In this post, A Brief Introduction of Pandas Library in Python, I introduce the pandas package. It is one of the most popular libraries in Python, and its name is derived from "panel data". We use it frequently for data analysis whenever we work with datasets. Unlike the NumPy package, which mainly works with arrays, the pandas package works with tabular data. It also offers data visualization features along with numerical computation capabilities.

In this Brief Introduction of Pandas Library, I will cover some of the most frequently used features of this package. To illustrate these features, the examples use a CSV file named temperatures.csv. If you don’t already have pandas, you need to install it first. Make sure that pip is installed, then run the following command.

pip3 install --upgrade pandas

Once you have installed this library, you can create pandas data structures such as Series and data frames. Before proceeding with examples, let us first understand how data representation differs between NumPy and pandas.

NumPy vs. Pandas

Although NumPy and Pandas have many similarities, there are significant differences between the two. Unlike NumPy, Pandas represents data in tabular form, with an index label associated with each row. First, let us look at the Series data structure.
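
The difference in data representation can be sketched with a few lines of code. This is only a minimal illustration; the variable names and the row labels are chosen here for the example.

```python
import numpy as np
import pandas as pd

values = [2.5, 3.9, 100]

# NumPy: elements are addressed by integer position only
arr = np.array(values)
print(arr[1])

# pandas: the same values, each tied to a label in the index
ser = pd.Series(values, index=['row 1', 'row 2', 'row 3'])
print(ser['row 2'])     # lookup by label
print(ser.iloc[1])      # lookup by position is still available
print(ser.to_numpy())   # the values as a plain NumPy array
```

Both structures hold the same numbers; the Series simply carries the index alongside them.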

Working with Series

As you know, the NumPy package of Python works with arrays. Likewise, Pandas has the Series, which is essentially an array with labels; these labels are called the index. We use a Series when we have single-column data. However, if our CSV file has multiple columns, then we use DataFrames, which are discussed next. Meanwhile, consider the following demonstration.

Examples of Series in Pandas

In the following example, the index labels and the data values are both kept in lists. The example also builds a NumPy array from the values; pd.Series() accepts such an array in the same way as a list. First, the example creates a Series without passing the index argument, so the default integer index (0, 1, 2) is used. Next, when we pass the index argument, those labels are printed instead. Finally, the example shows how to add the values of two Series.

import numpy as np
import pandas as pd

labels=['row 1', 'row 2', 'row 3']
mylist=[2.5,3.9, 100]
arr=np.array(mylist)

print('Printing Series without using index attribute...')
d1=pd.Series(data=mylist)
print(d1)

print('Printing Series using index attribute...')
d1=pd.Series(data=mylist, index=labels)
print(d1)

print("Second Series...")
mylist=[1,2,3]
d2=pd.Series(data=mylist, index=labels)
print(d2)

#Adding two Series
print("Summing two series...")
print(d1+d2)

Output

Printing Series without using index attribute…
0 2.5
1 3.9
2 100.0
dtype: float64
Printing Series using index attribute…
row 1 2.5
row 2 3.9
row 3 100.0
dtype: float64
Second Series…
row 1 1
row 2 2
row 3 3
dtype: int64
Summing two series…
row 1 3.5
row 2 5.9
row 3 103.0
dtype: float64
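
The two Series above share the same labels, so the sum is computed row by row. When the labels differ, pandas matches values by label rather than by position. The following is a small sketch of that behaviour, using made-up labels and values that are not part of the example above.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['row 1', 'row 2', 'row 3'])
s2 = pd.Series([10, 20, 30], index=['row 3', 'row 2', 'row 4'])

# Values are paired by label; a label present in only one Series gives NaN
total = s1 + s2
print(total)

# add() with fill_value treats a missing label as 0 instead
filled = s1.add(s2, fill_value=0)
print(filled)
```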

Creating Data Frames

Basically, a data frame is a tabular representation of a dataset with labeled rows and columns. It can be seen as a collection of Series that share the same index. In other words, a data frame is a 2-D data structure that contains rows and columns. Unlike a Series, a data frame can have multiple columns, and different columns may have different data types. Now, let us create some data frames.
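
As a quick sketch of these points, the snippet below builds a small data frame from hypothetical values (the column names and numbers are invented for illustration) and inspects its column types.

```python
import pandas as pd

df = pd.DataFrame({'city': ['Delhi', 'Mumbai', 'Pune'],
                   'temperature': [19.5, 27.0, 24.1],
                   'humidity': [65, 80, 70]},
                  index=['r1', 'r2', 'r3'])

# Each column keeps its own data type
print(df.dtypes)

# A single column is a Series that shares the data frame's index
col = df['temperature']
print(type(col))
print(col.index.equals(df.index))

# Two dimensions: (rows, columns)
print(df.shape)
```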

Examples of Creating Data Frames

import pandas as pd
df=pd.DataFrame({"c1": [1,2,3],
                "c2": [4,5,6],
                "c3": [7,8,9],
                "c4": [10,11,12]},
                index=[1,2,3])
print("First Data Frame: ")
print(df)
print("Shape: "+str(df.shape))

df1=pd.DataFrame([[1,2,3],
                [4,5,6],
                [7,8,9],
                [10,11,12]],
                index=[1,2,3,4],
                columns=['a1', 'a2', 'a3'])
print("Second Data Frame: ")
print(df1)
print("Shape: "+str(df1.shape))

# Creating Data Frame from CSV
df2=pd.read_csv('temperatures.csv')
print("Third Data Frame: ")
print(df2.head())
print("Shape: "+str(df2.shape))

Output

First Data Frame:
c1 c2 c3 c4
1 1 4 7 10
2 2 5 8 11
3 3 6 9 12
Shape: (3, 4)
Second Data Frame:
a1 a2 a3
1 1 2 3
2 4 5 6
3 7 8 9
4 10 11 12
Shape: (4, 3)
Third Data Frame:
created_at entry_id Temperature Humidity Unnamed: 4
0 2019-01-12 04:01:03 UTC 116.0 19.0 65.0 NaN
1 2019-01-12 04:02:14 UTC 117.0 19.0 65.0 NaN
2 2019-01-12 15:31:34 UTC 118.0 17.0 70.0 NaN
3 2019-01-12 15:31:59 UTC 119.0 17.0 70.0 NaN
4 2019-01-12 15:32:23 UTC 120.0 18.0 71.0 NaN
Shape: (103, 5)
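
The third data frame contains a column named "Unnamed: 4" that holds only NaN, and its created_at column is read as plain text. The following sketch shows one possible way to tidy such a frame. Since temperatures.csv itself is not reproduced here, the code uses a few in-memory lines as a stand-in, and it assumes that the empty column comes from a trailing comma on each line.

```python
import io
import pandas as pd

# Hypothetical stand-in for temperatures.csv (assumption: every line
# ends with a comma, which yields an empty fifth column)
csv_text = """created_at,entry_id,Temperature,Humidity,
2019-01-12 04:01:03 UTC,116,19,65,
2019-01-12 04:02:14 UTC,117,19,65,
2019-01-12 15:31:34 UTC,118,17,70,
"""

raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# Drop columns in which every value is missing
clean = raw.dropna(axis=1, how='all')

# Convert the timestamp text into timezone-aware datetime values
clean = clean.assign(created_at=pd.to_datetime(clean['created_at'], utc=True))
print(clean.dtypes)
print(clean.shape)
```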

Handling Missing Data

Many forecasting applications rely on data collected from the environment. However, data values may be missing for several reasons, for instance, when the data-collecting device suddenly stops working, and most forecasting functions can’t operate on missing data. Even if such failures happen rarely, the gaps they leave must be dealt with before forecasting. Fortunately, pandas data frames are equipped to handle missing data.

In this situation, if a forecasting method is unable to work with the missing value (NaN), then we can either drop the whole row in which the missing value appears or fill the missing value.
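
Before choosing between the two options, it often helps to count the missing values. The sketch below uses invented readings; filling with the column mean is just one common choice and not the only possible one.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with gaps
df = pd.DataFrame({'Temperature': [19.0, np.nan, 17.0, 18.0],
                   'Humidity': [65.0, 65.0, np.nan, 71.0]})

# Number of missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Option 1: drop every row that contains a missing value
dropped = df.dropna()
print(dropped)

# Option 2: fill each gap with the mean of its column
filled = df.fillna(df.mean())
print(filled)
```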

Examples of Handling Missing Values

As shown below, there are two functions available for handling missing values: dropna() and fillna(). The first one drops every row (or, with axis=1, every column) that contains a missing value, whereas the second one replaces the missing data with a specified value. The thresh parameter of dropna() sets the minimum number of non-missing values that a row or column must contain in order to be kept.

import pandas as pd
import numpy as np

df=pd.DataFrame({'a':[12, 9, np.nan], 'b':[8, np.nan, np.nan], 'c':[9,1,67]})

print(df)

print(df.dropna())

print(df.dropna(axis=1))

print(df.dropna(thresh=2))

print(df.dropna(axis=1, thresh=2))

print(df.fillna(value=3.5))

Output

a b c
0 12.0 8.0 9
1 9.0 NaN 1
2 NaN NaN 67
a b c
0 12.0 8.0 9
c
0 9
1 1
2 67
a b c
0 12.0 8.0 9
1 9.0 NaN 1
a c
0 12.0 9
1 9.0 1
2 NaN 67
a b c
0 12.0 8.0 9
1 9.0 3.5 1
2 3.5 3.5 67

Data Grouping

Basically, we group data when there are several categories and each category contains several data values. For the purpose of grouping, first, we split the data on the basis of categories. After splitting, we get several groups of data for specific categories. Next, we apply some kind of aggregate function such as average, sum, min, max, or the count. Finally, we combine the result of aggregate operations. In fact, the groupby() function of the pandas library performs all three operations as shown in the following example.
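
The three steps can be written out by hand to see what groupby() does. This is only a rough sketch with a small invented data set; in practice the single groupby() call at the end is all you need.

```python
import pandas as pd

df = pd.DataFrame({'Course': ['BCA', 'BCA', 'MCA', 'MCA', 'MBA'],
                   'Marks': [80, 90, 78, 36, 85]})

# 1. Split: one sub-frame per category
parts = {course: df[df['Course'] == course]
         for course in df['Course'].unique()}

# 2. Apply: compute an aggregate for each sub-frame
means = {course: part['Marks'].mean() for course, part in parts.items()}

# 3. Combine: collect the results into one Series
manual = pd.Series(means).sort_index()
print(manual)

# groupby() performs all three steps in one call
direct = df.groupby('Course')['Marks'].mean()
print(direct)
```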

Example of Grouping

In the following example, we create a data frame with three columns: the course, the student name, and the marks. We group the data on the course field and apply aggregate functions such as min(), max(), sum(), mean(), count(), and std() to the grouped data.

import pandas as pd

data={'Course':['BCA', 'BCA', 'BCA', 'MCA', 'MCA', 'MBA'],
      'Student':['Annu', 'Aman', 'Anuj', 'Binoy', 'Bondita', 'Anirudh'],
      'Marks':[80, 90, 28, 78, 36, 85]}

df=pd.DataFrame(data)
print(df)

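# numeric_only=True restricts the aggregate to the numeric column 'Marks';
# without it, recent pandas versions raise an error for mean() and std()
# because 'Student' contains text
d1=df.groupby('Course').mean(numeric_only=True)
print(d1)

d1=df.groupby('Course').sum(numeric_only=True)
print(d1)

d1=df.groupby('Course').min()
print(d1)

d1=df.groupby('Course').max()
print(d1)

d1=df.groupby('Course').count()
print(d1)

d1=df.groupby('Course').std(numeric_only=True)
print(d1)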

d1=df.groupby('Course').describe()
print(d1)

d1=df.groupby('Course').describe().transpose()
print(d1)
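
Several aggregates can also be computed in one call. The sketch below reuses the data from the example and selects the Marks column first, so that the numeric aggregates are not applied to the text column Student; the name summary is chosen here for illustration.

```python
import pandas as pd

data = {'Course': ['BCA', 'BCA', 'BCA', 'MCA', 'MCA', 'MBA'],
        'Student': ['Annu', 'Aman', 'Anuj', 'Binoy', 'Bondita', 'Anirudh'],
        'Marks': [80, 90, 28, 78, 36, 85]}
df = pd.DataFrame(data)

# One row per course, one column per aggregate function
summary = df.groupby('Course')['Marks'].agg(['min', 'max', 'mean', 'count'])
print(summary)
```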

Summary

To conclude this brief introduction of Pandas Library, we can say that it is one of the most powerful and most frequently used Python libraries. In fact, it has applications in nearly every field, including finance, insurance, and medical records. It is a high-performance library that makes many complex data tasks easy.
