This article demonstrates how to perform dataset preprocessing in Python.
Dataset preprocessing is a crucial step before training a machine learning model. It typically involves handling missing values, encoding categorical data, and splitting the dataset into training and test sets. The following program demonstrates how to compute descriptive statistics, handle missing data, handle categorical data, and partition a dataset into training and test sets, using the Titanic dataset as an example. To run this code, you'll need Pandas, NumPy, Scikit-Learn, and Seaborn installed. Seaborn downloads the dataset on first use, so an internet connection is required.
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

# Load the Titanic dataset from Seaborn
titanic = sns.load_dataset('titanic')

# Display the first few rows of the dataset
print("First few rows of the Titanic dataset:")
print(titanic.head())

# Descriptive statistics (numeric columns only by default)
print("\nDescriptive statistics of the dataset:")
print(titanic.describe())

# Handling missing data
print("\nHandling missing data:")

# Check for missing values
print("Number of missing values in each column:")
print(titanic.isnull().sum())

# Fill missing values in the 'age' column with the mean age
# (fit_transform returns a 2D array, so flatten it before assigning)
imputer = SimpleImputer(strategy='mean')
titanic['age'] = imputer.fit_transform(titanic[['age']]).ravel()

# Drop rows with missing values in the 'embarked' column
titanic.dropna(subset=['embarked'], inplace=True)

# Handling categorical data
print("\nHandling categorical data:")

# Encode categorical columns 'sex' and 'embarked' using Label Encoding
# (one encoder per column, so each fitted mapping stays available)
sex_encoder = LabelEncoder()
embarked_encoder = LabelEncoder()
titanic['sex'] = sex_encoder.fit_transform(titanic['sex'])
titanic['embarked'] = embarked_encoder.fit_transform(titanic['embarked'])

# Display the modified dataset
print(titanic.head())

# Partition the dataset into training and test datasets
print("\nPartitioning the dataset into training and test datasets:")
# Features: drop the target and 'alive', which duplicates the target as text
X = titanic.drop(['survived', 'alive'], axis=1)
y = titanic['survived']  # Target variable

# Split the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting datasets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
This program loads the Titanic dataset, computes descriptive statistics, handles missing data by imputing the mean age and dropping the rows that have no 'embarked' value, label-encodes the categorical columns 'sex' and 'embarked', and finally partitions the dataset into training and test sets for machine learning tasks. You can apply the same steps to other datasets.
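As a rough illustration of carrying these steps over to another dataset, the sketch below uses a small hand-made DataFrame instead of Titanic. The data and the column names `age`, `port`, and `target` are hypothetical. It also makes two variations on the program above, neither of which is the article's own method: it splits before imputing, so the mean is learned from the training rows only, and it uses one-hot encoding via `pd.get_dummies` instead of label encoding, which avoids implying an order among categories.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for "another dataset"
df = pd.DataFrame({
    "age": [22.0, None, 35.0, 58.0, None, 41.0, 19.0, 30.0, 27.0, 64.0],
    "port": ["S", "C", "S", "Q", "S", "C", "S", "Q", "C", "S"],
    "target": [0, 1, 1, 0, 0, 1, 0, 1, 1, 0],
})

X = df.drop(columns="target")
y = df["target"]

# Split first so that test rows never influence the imputation statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train = X_train.copy()
X_test = X_test.copy()

# Learn the mean from the training rows only, then reuse it on the test rows
imputer = SimpleImputer(strategy="mean")
X_train["age"] = imputer.fit_transform(X_train[["age"]]).ravel()
X_test["age"] = imputer.transform(X_test[["age"]]).ravel()

# One-hot encode the nominal column, then give the test set exactly the
# same columns as the training set (absent categories are filled with 0)
X_train = pd.get_dummies(X_train, columns=["port"])
X_test = pd.get_dummies(X_test, columns=["port"]).reindex(
    columns=X_train.columns, fill_value=0
)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
```

Fitting the imputer after the split keeps information from the test rows out of the training features, which generally gives a more honest estimate of how a model will perform on unseen data.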