Data Science, Python

Getting Started with Data Analysis in Python

In this article on Getting Started with Data Analysis in Python, we will cover common data manipulation tasks. Particularly, we will cover common tasks such as loading CSV files and extracting specific columns, filtering, sorting, and merging datasets

Getting Started with Data Analysis in Python: Loading and Displaying CSV Files

Welcome to our comprehensive guide on data manipulation and analysis using Python and the powerful pandas library. In today’s data-driven world, the ability to load, transform, and analyze data is a fundamental skill for professionals across various domains, from data scientists to business analysts. In this blog, we will explore practical solutions to common data manipulation tasks. From loading CSV files and extracting specific columns to filtering, sorting, and merging datasets, we’ll walk you through each step with clear explanations and hands-on examples. So, let’s dive in and unlock the full potential of Python and pandas for your data analysis needs!

Exercises

a. Load a CSV file called “data.csv” into a DataFrame and display its contents

The following exercise uses this dataset: data.csv

import pandas as pd

# Load the CSV file into a DataFrame
data = pd.read_csv("data.csv")

# Display the contents of the DataFrame
print(data)

Output

Load and Display a Dataset
Load and Display a Dataset

b. Extract only the “Age” column from a DataFrame with columns “Name,” “Age,” and “Gender”

import pandas as pd
# Load the CSV file into a DataFrame
data = pd.read_csv("data.csv")

age_column = data["Age"]
print(age_column)

Output

Extract a Specific Column
Extract a Specific Column

c. Filter and display rows from a dataset of customer orders where the order total is greater than $100

The following exercise uses a dataset customers.csv

import pandas as pd

# Load the data from the 'customers.csv' file into a DataFrame
orders = pd.read_csv("customers.csv")

# Filter and display rows where the order total is greater than $100
filtered_orders = orders[orders['Order_Total'] > 100]
print("Customer orders with order total > $100:")
print(filtered_orders)

Output

Filter Data
Filter Data

d. Sort a dataset of student scores in descending order based on their exam scores

The following exercise uses the dataset students.csv

import pandas as pd

# Load the data from the 'students.csv' file into a DataFrame
student_scores = pd.read_csv("students.csv")

# Sort the DataFrame by 'Exam_Score' in descending order
sorted_scores = student_scores.sort_values(by='Exam_Score', ascending=False)

# Display the sorted DataFrame
print("Student scores sorted by exam score (descending order):")
print(sorted_scores)

Output

Sorting Data
Sorting Data

e. Group a dataset of sales transactions by product category and calculate the total sales amount for each category

The following exercise uses the dataset sales.csv

import pandas as pd

# Load the data from the 'sales.csv' file into a DataFrame
sales = pd.read_csv("sales.csv")

# Group the sales transactions by product category and calculate total sales amount
grouped_sales = sales.groupby('Product_Category')['Sales_Amount'].sum().reset_index()

# Print the grouped sales data
print("Total Sales Amount by Product Category:")
print(grouped_sales)

Output

Grouping Data
Grouping Data

f. Merge two DataFrames containing customer information and purchase history based on a common customer ID:

import pandas as pd

# Sample data for customer_info DataFrame
customer_info_data = {
    'Customer_ID': [101, 102, 103, 104],
    'Customer_Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com', 'diana@example.com']
}

# Sample data for purchase_history DataFrame
purchase_history_data = {
    'Customer_ID': [101, 102, 103, 104, 101],
    'Product_Name': ['Product1', 'Product2', 'Product3', 'Product4', 'Product5'],
    'Quantity': [3, 2, 5, 1, 2],
    'Total_Price': [150, 100, 250, 50, 120]
}

# Create DataFrames from the sample data
customer_info = pd.DataFrame(customer_info_data)
purchase_history = pd.DataFrame(purchase_history_data)

# Merge the DataFrames based on the 'Customer_ID' column
merged_data = pd.merge(customer_info, purchase_history, on='Customer_ID', how='inner')

# Print the merged data
print("Merged Data:")
print(merged_data)

Output

Merging Data Frames
Merging Data Frames

g. Clean a dataset containing student records by handling missing values in the “Grade” column and removing duplicate entries

The following exercise uses a dataset grades.csv

import pandas as pd

# Load the data from the 'grades.csv' file into a DataFrame
data = pd.read_csv("grades.csv")

# Handling missing values in the 'Grade' column by filling them with 'NA'
data['Grade'].fillna('NA', inplace=True)

# Removing duplicate entries
cleaned_data = data.drop_duplicates()

# Print the cleaned data
print("Cleaned Data:")
print(cleaned_data)

Output

Data Cleaning for Missing and Duplicate Values
Data Cleaning for Missing and Duplicate Values

h. Create a new DataFrame containing only employees who joined the company in the last year from a dataset of employee information

The following exercise uses a dataset employees.csv

import pandas as pd
from datetime import datetime

# Load the data from the 'employees.csv' file into a DataFrame
employee_info = pd.read_csv("employees.csv")

# Calculate the current year
current_year = datetime.now().year

# Filter employees who joined in the last year
last_year_joiners = employee_info[employee_info['Joining_Date'].str.startswith(f"{current_year - 1}")]
print(last_year_joiners)

Output

Extracting Data from a Dataset Conditionally
Extracting Data from a Dataset Conditionally

i. Export a cleaned DataFrame as a new CSV file named “cleaned_data.csv” for further analysis

The following exercise makes use of a dataset called grades.csv mentioned above.

import pandas as pd

# Load the dataset from grades.csv
grades_df = pd.read_csv("grades.csv")

# Handle missing values in the 'Grades' column by filling them with 'NA'
grades_df['Grade'].fillna('NA', inplace=True)

# Remove duplicate entries based on the 'Student' column
grades_df.drop_duplicates(subset='Student_Name', keep='first', inplace=True)

# Export the cleaned data to cleaned_data.csv
grades_df.to_csv("cleaned_data.csv", index=False)
print("The cleaned data has been saved to 'cleaned_data.csv'.")

Output

Exporting Data
Exporting Data

Further Reading

Spring Framework Practice Problems and Their Solutions

Java Practice Exercise

Logistic Regression from Scratch

The Benefits of Using Data Science in the Mortgage Industry: Better Outcomes for Borrowers and Lenders

Wake Up to Better Performance with Hibernate

Data Science in Insurance: Better Decisions, Better Outcomes

Most Popular Trading Software

Breaking the Mold: Innovative Ways for College Students to Improve Software Development Skills

programmingempire

Princites

You may also like...

Leave a Reply

Your email address will not be published. Required fields are marked *