Data Cleaning Tutorial: A Comprehensive Guide PDF Download

 Data Cleaning and Preparation

Welcome to this hands-on tutorial where we clean and wrangle a messy dataset using Python and pandas.

Objectives:

Load messy raw data

Clean column names and entries

Handle missing data

Correct data types

Standardize formats

Remove duplicates

Save cleaned dataset

In [ ]:

import os

os.chdir(r"C:\Users\Praise\Desktop\Israel Umana\UNIUYO\2023-2024\Semester 2\CP

E221 - Data Science and Analytics")

In [ ]:

import pandas as pd

import numpy as np

In [ ]:

df = pd.read_csv('raw_data.csv')

print('Original Data:')

df.head()

In [ ]:

#view full data

Step 1: Clean Column Names

In [ ]:

df.columns = df.columns.str.strip().str.replace(' ', '_').str.lower()

df.columns

In [ ]:

Step 2: Handle Missing Values

In [ ]:

print(df.isnull().sum())

# Convert 'age' to numeric

df['age'] = pd.to_numeric(df['age'], errors='coerce')

df['age'].fillna(df['age'].median(), inplace=True)

In [ ]:

df['email'].replace({'.': np.nan}, inplace=True)

In [ ]:

df.dropna(subset=['email'], inplace=True)

In [ ]:

Step 3: Fix Date Formats

In [ ]:

df['joindate'] = pd.to_datetime(df['joindate'], errors='coerce')

In [ ]:

Step 4: Clean Salary Column

In [ ]:

df['salary_($)'] = pd.to_numeric(df['salary_($)'], errors='coerce')

df['salary_($)'].fillna(df['salary_($)'].median(), inplace=True)

Step 5: Standardize Text Columns

In [ ]:

df['name'] = df['name'].str.title().str.strip()

df['gender'] = df['gender'].str.title().str.strip()

df['department'] = df['department'].str.upper().str.strip()

df['employee_status'] = df['employee_status'].str.upper().str.strip()

Step 6: Remove Duplicates

In [ ]:

#identify duplicates

df.duplicated()

In [ ]:

#Find duplicates in specific columns

df.duplicated(subset=['name', 'age'])

In [ ]:

#by default the first copy of the duplicate is retained while subsequent ones

are dropped

df.drop_duplicates(subset=['name', 'email'])

In [ ]:

#to retain the last copy of the duplicate, set the keep argument to 'last'

df.drop_duplicates(subset=['name', 'email'], keep='last')

In [ ]:

Step 7: Handle Missing Department

In [ ]:

df['department'].fillna('UNKNOWN', inplace=True)

In [ ]:

df['salary_($)'].fillna(method='ffill')

In [ ]:

df['salary_($)'].fillna(df['salary_($)'].mean(),)

In [ ]:

Step 8: Dropping/Adjusting Unused Columns

In [ ]:

df.drop(columns=['gender'], inplace=True)

In [ ]:

✅

Final Cleaned Data

In [ ]:

df.to_csv('cleaned_data.csv', index=False)

print('Cleaned data saved as cleaned_data.csv')

Overview

Data Cleaning Tutorial: A Comprehensive Guide

Data cleaning is essential for effective data analysis and decision-making. This tutorial provides a step-by-step approach to cleaning messy datasets using Python and pandas. Topics covered include handling missing values, standardizing formats, and removing duplicates. Ideal for data science students and professionals looking to enhance their data preparation skills. Learn practical techniques to ensure data integrity and improve analysis outcomes. Key Points Covers essential data cleaning techniques using Python and pandas. Explains how to handle missing values and standardize data formats. Includes practical examples for removing duplicates in datasets. Ideal for data science students and professionals seeking to improve data quality.

/ 3

347

FAQs

What are the main objectives of the data cleaning tutorial?

The main objectives of the data cleaning tutorial include loading messy raw data, cleaning column names and entries, handling missing data, correcting data types, standardizing formats, removing duplicates, and saving the cleaned dataset. Each of these steps is crucial for ensuring that the dataset is ready for analysis and free from errors that could skew results.

How do you handle missing values in the dataset?

To handle missing values, the tutorial suggests several methods. For instance, it converts the 'age' column to numeric and fills missing values with the median age. Additionally, it replaces invalid email entries with NaN and drops rows with missing emails. For the salary column, missing values are filled with the median salary, ensuring that the dataset remains robust and usable.

What steps are involved in standardizing text columns?

Standardizing text columns involves several specific actions. The tutorial demonstrates how to title-case names, strip whitespace, and convert gender entries to title case. Furthermore, it emphasizes converting the department and employee status columns to uppercase, ensuring uniformity across the dataset. These steps help maintain consistency and improve data quality.

How do you remove duplicates from the dataset?

The tutorial outlines a method for identifying and removing duplicates from the dataset. It first checks for duplicates using the `duplicated()` function and then allows for the removal of duplicates based on specific columns, such as 'name' and 'email'. Users can choose to retain either the first or last occurrence of a duplicate entry, providing flexibility in how duplicates are handled.

What is the significance of fixing date formats in data cleaning?

Fixing date formats is significant because it ensures that date entries are consistent and interpretable for analysis. The tutorial demonstrates converting the 'joindate' column to a datetime format, which allows for accurate sorting and filtering of data based on dates. This step is essential for analyses that rely on time-based data.

How does the tutorial suggest dealing with unused columns?

The tutorial suggests dropping or adjusting unused columns to streamline the dataset. Specifically, it demonstrates how to remove the 'gender' column, which may not be necessary for the analysis at hand. This practice helps in reducing clutter and focusing on relevant data, ultimately enhancing the efficiency of data processing.

You May Also Like

FAQs

What are the main objectives of the data cleaning tutorial?

How do you handle missing values in the dataset?

What steps are involved in standardizing text columns?

How do you remove duplicates from the dataset?

What is the significance of fixing date formats in data cleaning?

How does the tutorial suggest dealing with unused columns?