Data cleaning is essential for effective data analysis and decision-making. This tutorial provides a step-by-step approach to cleaning messy datasets using Python and pandas. Topics covered include handling missing values, standardizing formats, and removing duplicates. Ideal for data science students and professionals looking to enhance their data preparation skills. Learn practical techniques to ensure data integrity and improve analysis outcomes.

Key Points

  • Covers essential data cleaning techniques using Python and pandas.
  • Explains how to handle missing values and standardize data formats.
  • Includes practical examples for removing duplicates in datasets.
  • Ideal for data science students and professionals seeking to improve data quality.
Ekemini Tom
3 pages
Language:English
Type:Tutorial
Ekemini Tom
3 pages
Language:English
Type:Tutorial
347
/ 3
 Data Cleaning and Preparation
Welcome to this hands-on tutorial where we clean and wrangle a messy dataset using Python and pandas.
Objectives:
Load messy raw data
Clean column names and entries
Handle missing data
Correct data types
Standardize formats
Remove duplicates
Save cleaned dataset
In [ ]:
import os
os.chdir(r"C:\Users\Praise\Desktop\Israel Umana\UNIUYO\2023-2024\Semester 2\CP
E221 - Data Science and Analytics")
In [ ]:
import pandas as pd
import numpy as np
In [ ]:
df = pd.read_csv('raw_data.csv')
print('Original Data:')
df.head()
In [ ]:
#view full data
df
Step 1: Clean Column Names
In [ ]:
df.columns = df.columns.str.strip().str.replace(' ', '_').str.lower()
df.columns
In [ ]:
df
Step 2: Handle Missing Values
In [ ]:
print(df.isnull().sum())
# Convert 'age' to numeric
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df['age'].fillna(df['age'].median(), inplace=True)
In [ ]:
df
In [ ]:
df['email'].replace({'.': np.nan}, inplace=True)
In [ ]:
df
In [ ]:
df.dropna(subset=['email'], inplace=True)
In [ ]:
df
Step 3: Fix Date Formats
In [ ]:
df['joindate'] = pd.to_datetime(df['joindate'], errors='coerce')
In [ ]:
df
In [ ]:
Step 4: Clean Salary Column
In [ ]:
df['salary_($)'] = pd.to_numeric(df['salary_($)'], errors='coerce')
df['salary_($)'].fillna(df['salary_($)'].median(), inplace=True)
Step 5: Standardize Text Columns
In [ ]:
df['name'] = df['name'].str.title().str.strip()
df['gender'] = df['gender'].str.title().str.strip()
df['department'] = df['department'].str.upper().str.strip()
df['employee_status'] = df['employee_status'].str.upper().str.strip()
Step 6: Remove Duplicates
In [ ]:
#identify duplicates
df.duplicated()
In [ ]:
#Find duplicates in specific columns
df.duplicated(subset=['name', 'age'])
In [ ]:
#by default the first copy of the duplicate is retained while subsequent ones
are dropped
df.drop_duplicates(subset=['name', 'email'])
In [ ]:
#to retain the last copy of the duplicate, set the keep argument to 'last'
df.drop_duplicates(subset=['name', 'email'], keep='last')
In [ ]:
df
Step 7: Handle Missing Department
In [ ]:
df['department'].fillna('UNKNOWN', inplace=True)
In [ ]:
df['salary_($)'].fillna(method='ffill')
In [ ]:
df['salary_($)'].fillna(df['salary_($)'].mean(),)
In [ ]:
df
Step 8: Dropping/Adjusting Unused Columns
In [ ]:
df.drop(columns=['gender'], inplace=True)
In [ ]:
df
Final Cleaned Data
In [ ]:
df
In [ ]:
df.to_csv('cleaned_data.csv', index=False)
print('Cleaned data saved as cleaned_data.csv')
/ 3
End of Document
347

FAQs

What are the main objectives of the data cleaning tutorial?
The main objectives of the data cleaning tutorial include loading messy raw data, cleaning column names and entries, handling missing data, correcting data types, standardizing formats, removing duplicates, and saving the cleaned dataset. Each of these steps is crucial for ensuring that the dataset is ready for analysis and free from errors that could skew results.
How do you handle missing values in the dataset?
To handle missing values, the tutorial suggests several methods. For instance, it converts the 'age' column to numeric and fills missing values with the median age. Additionally, it replaces invalid email entries with NaN and drops rows with missing emails. For the salary column, missing values are filled with the median salary, ensuring that the dataset remains robust and usable.
What steps are involved in standardizing text columns?
Standardizing text columns involves several specific actions. The tutorial demonstrates how to title-case names, strip whitespace, and convert gender entries to title case. Furthermore, it emphasizes converting the department and employee status columns to uppercase, ensuring uniformity across the dataset. These steps help maintain consistency and improve data quality.
How do you remove duplicates from the dataset?
The tutorial outlines a method for identifying and removing duplicates from the dataset. It first checks for duplicates using the `duplicated()` function and then allows for the removal of duplicates based on specific columns, such as 'name' and 'email'. Users can choose to retain either the first or last occurrence of a duplicate entry, providing flexibility in how duplicates are handled.
What is the significance of fixing date formats in data cleaning?
Fixing date formats is significant because it ensures that date entries are consistent and interpretable for analysis. The tutorial demonstrates converting the 'joindate' column to a datetime format, which allows for accurate sorting and filtering of data based on dates. This step is essential for analyses that rely on time-based data.
How does the tutorial suggest dealing with unused columns?
The tutorial suggests dropping or adjusting unused columns to streamline the dataset. Specifically, it demonstrates how to remove the 'gender' column, which may not be necessary for the analysis at hand. This practice helps in reducing clutter and focusing on relevant data, ultimately enhancing the efficiency of data processing.