
How to Perform Dataset Preprocessing in Python?

This article demonstrates how to perform dataset preprocessing in Python.

Dataset preprocessing is a crucial step before training a machine learning model. It typically involves handling missing values, encoding categorical data, and splitting the dataset into training and test sets. The following program demonstrates how to compute descriptive statistics, handle missing data, encode categorical data, and partition a dataset into training and test sets, using the Titanic dataset as an example. To run this code, you need Pandas, NumPy, Scikit-Learn, and Seaborn installed.

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

# Load the Titanic dataset from Seaborn
titanic = sns.load_dataset('titanic')

# Display the first few rows of the dataset
print("First few rows of the Titanic dataset:")
print(titanic.head())

# Descriptive statistics
print("\nDescriptive statistics of the dataset:")
print(titanic.describe())

# Handling missing data
print("\nHandling missing data:")
# Check for missing values
print("Number of missing values in each column:")
print(titanic.isnull().sum())

# Fill missing values in the 'age' column with the mean age
imputer = SimpleImputer(strategy='mean')
titanic['age'] = imputer.fit_transform(titanic[['age']])

# Drop rows with missing values in the 'embarked' column
titanic.dropna(subset=['embarked'], inplace=True)

# Handling categorical data
print("\nHandling categorical data:")
# Encode categorical columns 'sex' and 'embarked' using Label Encoding
label_encoder = LabelEncoder()
titanic['sex'] = label_encoder.fit_transform(titanic['sex'])
titanic['embarked'] = label_encoder.fit_transform(titanic['embarked'])

# Display the modified dataset
print(titanic.head())

# Partition the dataset into training and test datasets
print("\nPartitioning the dataset into training and test datasets:")
X = titanic.drop('survived', axis=1)  # Features
y = titanic['survived']  # Target variable

# Split the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting datasets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
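To see what Label Encoding actually does to a column like 'sex', note that LabelEncoder sorts the unique values alphabetically before assigning integer codes. A minimal sketch (the sample values here are just for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# LabelEncoder sorts unique values alphabetically, so in the Titanic 'sex'
# column 'female' is encoded as 0 and 'male' as 1.
le = LabelEncoder()
codes = le.fit_transform(['male', 'female', 'female', 'male'])
print(list(le.classes_))  # ['female', 'male']
print(list(codes))        # [1, 0, 0, 1]
```

This also explains why the same LabelEncoder instance can be reused for 'embarked' in the program above: each call to fit_transform refits the encoder on the new column.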

Output

Descriptive Statistics

Handling Missing Values and Categorical Data

Partitioning the Dataset

This program loads the Titanic dataset, computes descriptive statistics, handles missing data by imputing values and dropping rows, encodes categorical columns, and finally partitions the data into training and test sets for machine learning tasks. You can apply the same concepts to other datasets as well.
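To show how the same steps transfer to another dataset, here is a short sketch on a made-up DataFrame (the column names and values are invented for illustration). It follows the same recipe as above, except that it uses pandas get_dummies for one-hot encoding instead of LabelEncoder, which is a common alternative for categorical features:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# A small made-up dataset standing in for "any other dataset"
df = pd.DataFrame({
    'age':    [25.0, np.nan, 47.0, 31.0, np.nan, 52.0],
    'city':   ['Delhi', 'Mumbai', 'Delhi', np.nan, 'Mumbai', 'Delhi'],
    'target': [0, 1, 0, 1, 1, 0],
})

# Impute the numeric column with its mean, as done for 'age' above
df['age'] = SimpleImputer(strategy='mean').fit_transform(df[['age']])

# Drop rows missing the categorical column, then one-hot encode it
df = df.dropna(subset=['city'])
df = pd.get_dummies(df, columns=['city'])

# Partition into training and test sets, 80/20
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (4, 3) (1, 3)
```

After dropping the one row with a missing 'city', five rows remain, so a 20% test split yields four training rows and one test row; one-hot encoding turns 'city' into two indicator columns alongside 'age'.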


Further Reading

Spring Framework Practice Problems and Their Solutions

Java Practice Exercise

Why Rust? Exploring the Advantages of Rust Programming

How to Get Started With Rust?

Getting Started with Data Analysis in Python

The Benefits of Using Data Science in the Mortgage Industry: Better Outcomes for Borrowers and Lenders

Wake Up to Better Performance with Hibernate

Data Science in Insurance: Better Decisions, Better Outcomes

Most Popular Trading Software

Breaking the Mold: Innovative Ways for College Students to Improve Software Development Skills
