Programmingempire
In this brief introduction to the pandas library, I will introduce the pandas package for Python. It is one of the most popular Python libraries, and its name derives from "panel data". It is widely used for data analysis, and we reach for it frequently when working with datasets. Unlike the NumPy package, which mainly works with arrays, pandas works with labeled, tabular data. It also offers basic plotting support alongside its numerical computation capabilities.
In this introduction, I will cover some of the most frequently used features of the package. The examples use a CSV file named temperatures.csv. If you don't already have pandas, you need to install it first. Make sure pip is available, then run the following command.
pip3 install --upgrade pandas
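After the command finishes, a quick check like the one below can confirm that pandas is importable. This is only a minimal sketch; the version string it prints depends on your environment.

```python
import pandas as pd

# Read the installed pandas version; the value depends on your environment.
installed_version = pd.__version__
print(installed_version)
```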
Once you have installed the library, you can create pandas data structures such as Series and data frames. Before proceeding with the examples, let us first look at how data representation differs between NumPy and pandas.
NumPy vs. Pandas
Although NumPy and pandas have much in common, there are significant differences between the two. Unlike NumPy, pandas represents data in tabular form, with an index label associated with each row. First, let us look at the Series data structure.
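The difference can be sketched with a small comparison. The values and the labels 'mon', 'tue', and 'wed' are made up for this illustration and are not part of the original examples.

```python
import numpy as np
import pandas as pd

# A plain NumPy array: values are addressed by integer position only.
temps_array = np.array([19.0, 17.0, 18.0])
first_by_position = temps_array[0]

# The same values as a pandas Series: each value also carries a label.
# The labels below are hypothetical, chosen only for this sketch.
temps_series = pd.Series(temps_array, index=['mon', 'tue', 'wed'])
first_by_label = temps_series['mon']

print(first_by_position, first_by_label)
```

Both lookups return the same value; the Series simply lets us ask for it by name rather than by position.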
Working with Series
As you know, the NumPy package works with arrays. Likewise, pandas has the Series, which is a one-dimensional array with labels; together these labels are called the index. We use a Series for single-column data. If a CSV file has multiple columns, we use a data frame instead, which is discussed next. Meanwhile, consider the following demonstration.
Examples of Series in Pandas
In the following example, the index labels are created as a list. The data values also come from a list; a NumPy array arr is built from the same values and could be passed as data instead. First, the example creates a Series without the index argument, so the default integer index (0, 1, 2) is used. Next, it passes the index argument, and those labels are printed instead. Finally, the example adds two Series, which sums the values that share a label.
import numpy as np
import pandas as pd
labels=['row 1', 'row 2', 'row 3']
mylist=[2.5,3.9, 100]
arr=np.array(mylist)
print('Printing Series without using index attribute...')
d1=pd.Series(data=mylist)
print(d1)
print('Printing Series using index attribute...')
d1=pd.Series(data=mylist, index=labels)
print(d1)
print("Second Series...")
mylist=[1,2,3]
d2=pd.Series(data=mylist, index=labels)
print(d2)
#Adding two Series
print("Summing two series...")
print(d1+d2)
Output
Printing Series without using index attribute...
0 2.5
1 3.9
2 100.0
dtype: float64
Printing Series using index attribute...
row 1 2.5
row 2 3.9
row 3 100.0
dtype: float64
Second Series...
row 1 1
row 2 2
row 3 3
dtype: int64
Summing two series...
row 1 3.5
row 2 5.9
row 3 103.0
dtype: float64
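The sum above works label by label, not position by position. The following sketch, which assumes two Series whose labels only partly overlap (the label 'row 4' is invented here), hints at what happens when the indexes differ.

```python
import pandas as pd

# Two Series whose labels only partly overlap (labels are illustrative).
s1 = pd.Series([2.5, 3.9, 100], index=['row 1', 'row 2', 'row 3'])
s2 = pd.Series([1, 2, 3], index=['row 2', 'row 3', 'row 4'])

# Addition matches values by label; labels found in only one Series give NaN.
total = s1 + s2
print(total)

# Series.add with fill_value treats a missing label as 0 instead.
total_filled = s1.add(s2, fill_value=0)
print(total_filled)
```

Whether NaN or a filled value is preferable depends on the dataset, so neither result should be read as the single correct one.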
Creating Data Frames
A data frame is a tabular representation of a dataset: a two-dimensional data structure with labeled rows and columns. It can be viewed as a collection of Series that share the same index. Unlike a Series, a data frame can have multiple columns, and different columns may have different data types. Now, let us create some data frames.
Examples of Creating Data Frames
import pandas as pd
df=pd.DataFrame({"c1": [1,2,3],
"c2": [4,5,6],
"c3": [7,8,9],
"c4": [10,11,12]},
index=[1,2,3])
print("First Data Frame: ")
print(df)
print("Shape: "+str(df.shape))
df1=pd.DataFrame([[1,2,3],
[4,5,6],
[7,8,9],
[10,11,12]],
index=[1,2,3,4],
columns=['a1', 'a2', 'a3'])
print("Second Data Frame: ")
print(df1)
print("Shape: "+str(df1.shape))
# Creating Data Frame from CSV
df2=pd.read_csv('temperatures.csv')
print("Third Data Frame: ")
print(df2.head())
print("Shape: "+str(df2.shape))
Output
First Data Frame:
c1 c2 c3 c4
1 1 4 7 10
2 2 5 8 11
3 3 6 9 12
Shape: (3, 4)
Second Data Frame:
a1 a2 a3
1 1 2 3
2 4 5 6
3 7 8 9
4 10 11 12
Shape: (4, 3)
Third Data Frame:
created_at entry_id Temperature Humidity Unnamed: 4
0 2019-01-12 04:01:03 UTC 116.0 19.0 65.0 NaN
1 2019-01-12 04:02:14 UTC 117.0 19.0 65.0 NaN
2 2019-01-12 15:31:34 UTC 118.0 17.0 70.0 NaN
3 2019-01-12 15:31:59 UTC 119.0 17.0 70.0 NaN
4 2019-01-12 15:32:23 UTC 120.0 18.0 71.0 NaN
Shape: (103, 5)
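To illustrate the earlier points that columns may have different data types and that a data frame is a set of Series sharing one index, here is a minimal sketch. The frame and all of its values are invented for the illustration and are not taken from temperatures.csv.

```python
import pandas as pd

# A small frame with a different data type in each column; all values are
# invented for this sketch.
readings = pd.DataFrame({
    'sensor': ['a', 'b', 'c'],          # text
    'temperature': [19.0, 17.5, 18.0],  # floating point
    'entry_id': [116, 117, 118],        # integer
})

# Each column keeps its own dtype.
print(readings.dtypes)

# Selecting one column returns a Series that shares the frame's index.
temperature = readings['temperature']
print(type(temperature).__name__)
print(temperature.index.equals(readings.index))
```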
Handling Missing Data
Many forecasting applications rely on data collected from the environment. However, values may be missing for several reasons, for instance when the data-collecting device suddenly stops working, and most forecasting functions can't operate on missing data. Although such failures happen rarely, the gaps they leave still have to be dealt with before the data is used. Fortunately, pandas data frames are equipped to handle missing data.
In this situation, if a forecasting method cannot work with a missing value (NaN), we can either drop the whole row in which the missing value appears or fill in the missing value.
Examples of Handling Missing Values
As shown below, two methods are available for handling missing values: dropna() and fillna(). The first drops the rows (or, with axis=1, the columns) that contain missing values, whereas the second replaces missing values with a specified value. The thresh parameter of dropna() sets the minimum number of non-missing values that a row or column must have in order to be kept.
import pandas as pd
import numpy as np
# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
df=pd.DataFrame({'a':[12, 9, np.nan], 'b':[8, np.nan, np.nan], 'c':[9,1,67]})
print(df)
print(df.dropna())
print(df.dropna(axis=1))
print(df.dropna(thresh=2))
print(df.dropna(axis=1, thresh=2))
print(df.fillna(value=3.5))
Output
a b c
0 12.0 8.0 9
1 9.0 NaN 1
2 NaN NaN 67
a b c
0 12.0 8.0 9
c
0 9
1 1
2 67
a b c
0 12.0 8.0 9
1 9.0 NaN 1
a c
0 12.0 9
1 9.0 1
2 NaN 67
a b c
0 12.0 8.0 9
1 9.0 3.5 1
2 3.5 3.5 67
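Before choosing between dropping and filling, it often helps to see how many values are missing. The sketch below reuses the same frame as above; filling with the column mean is just one possible strategy, chosen here as an assumption for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [12, 9, np.nan],
                   'b': [8, np.nan, np.nan],
                   'c': [9, 1, 67]})

# Count the missing values in each column before choosing a strategy.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Fill each gap with the mean of its own column (one possible strategy).
filled = df.fillna(df.mean())
print(filled)
```

Passing df.mean() to fillna() fills each column with its own mean, unlike the single constant 3.5 used in the example above.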
Data Grouping
We group data when it falls into several categories and each category contains several values. Grouping has three steps. First, we split the data by category, which gives one group of rows per category. Next, we apply an aggregate function such as average, sum, min, max, or count to each group. Finally, we combine the results of the aggregate operations into a single result. The groupby() function of the pandas library supports all three operations, as shown in the following example.
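The three steps described above can be spelled out by hand and compared with groupby(). This is a sketch under assumed data: the small frame below is invented for the illustration.

```python
import pandas as pd

# Illustrative data; the values are assumptions made for this sketch.
df = pd.DataFrame({'Course': ['BCA', 'BCA', 'MCA', 'MCA', 'MBA'],
                   'Marks': [80, 90, 78, 36, 85]})

# Split: one sub-frame per category. Apply: compute the mean of each.
means = {}
for course in df['Course'].unique():
    group = df[df['Course'] == course]
    means[course] = group['Marks'].mean()

# Combine: collect the per-group results into one Series.
manual = pd.Series(means).sort_index()

# groupby() performs the same three steps in a single call.
grouped = df.groupby('Course')['Marks'].mean()

print(manual)
print(grouped)
```

The two printed results hold the same per-course means, which is why the single groupby() call is usually preferred over the explicit loop.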
Example of Grouping
In the following example, we create a data frame with three columns for the course, student name, and marks. We group the data on the Course column and apply aggregate functions such as min(), max(), sum(), mean(), count(), and std() to the grouped data.
import pandas as pd
data={'Course':['BCA', 'BCA', 'BCA', 'MCA', 'MCA', 'MBA'],
'Student':['Annu', 'Aman', 'Anuj', 'Binoy', 'Bondita', 'Anirudh'],
'Marks':[80, 90, 28, 78, 36, 85]}
df=pd.DataFrame(data)
print(df)
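# Select the numeric Marks column for mean(), sum(), and std(): in pandas 2.0
# and later, mean() and std() raise a TypeError on the text column Student.
d1=df.groupby('Course')[['Marks']].mean()
print(d1)
d1=df.groupby('Course')[['Marks']].sum()
print(d1)
# min(), max(), and count() also work on text columns
d1=df.groupby('Course').min()
print(d1)
d1=df.groupby('Course').max()
print(d1)
d1=df.groupby('Course').count()
print(d1)
d1=df.groupby('Course')[['Marks']].std()
print(d1)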
d1=df.groupby('Course').describe()
print(d1)
d1=df.groupby('Course').describe().transpose()
print(d1)
Output
Summary
To conclude this brief introduction to the pandas library, we can say that it is one of the most powerful and most frequently used Python libraries. It has applications in many fields, including finance, insurance, and medical records. It is a high-performance library that makes many complex data tasks easy.