A Step-by-Step Guide to Clean Data for Analysis
By the end of this presentation, you will understand:
The Reality
Common Issues
# ========================================
# Import Required Libraries
# ========================================
# pandas (pd): The primary library for working with tabular data (like spreadsheets)
import pandas as pd
# numpy (np): Library for numerical computations and array operations
import numpy as np
# matplotlib.pyplot (plt): The main plotting library for creating visualizations
import matplotlib.pyplot as plt

Libraries are extra pieces of code that extend Python's built-in capabilities.
From a File?
Always start by examining your data!
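If your data lives in a file instead, the first step is the same: load it, then look at it. A minimal sketch, assuming a CSV file (the filename sales_data.csv is just a placeholder):
# Load a CSV file into a DataFrame (filename is hypothetical)
df = pd.read_csv('sales_data.csv')
# First look: dimensions, column types, and the first few rows
print(df.shape)
df.info()
df.head()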
From Random Data?
# datetime modules: For working with dates and times
from datetime import datetime, timedelta
# ========================================
# Configure Settings
# ========================================
# Set random seed to 42 so everyone gets the same "random" data
# This makes our code reproducible - you'll get the same results every time
np.random.seed(42)
# Configure pandas display options for better readability
pd.set_option('display.max_columns', None) # Show all columns (don't truncate)
pd.set_option('display.width', None) # Use full screen width
print("โ Libraries imported successfully!")
print("โ Settings configured!")# ========================================
# Create Sample Dataset with Intentional Data Quality Issues
# ========================================
# We're creating a realistic dataset that has common problems you'll encounter in real data
# Set the number of records we want to generate
n_records = 500
# ----------------------------------------
# Generate Random Dates
# ----------------------------------------
# Create a starting date (January 1, 2024)
start_date = datetime(2024, 1, 1)
# Generate 500 random dates throughout 2024
# This simulates transaction dates spread across the year
dates = [start_date + timedelta(days=np.random.randint(0, 365)) for _ in range(n_records)]
# ----------------------------------------
# Build the Dataset Dictionary
# ----------------------------------------
# Create a dictionary where each key is a column name and value is a list of data
data = {
# Transaction date for each sale
'date': dates,
# Product names - randomly chosen from 6 different products
'product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard', 'Mouse'], n_records),
# Region names - NOTE: Intentionally inconsistent capitalization ('north' vs 'North' vs 'SOUTH')
# This is a common data quality issue we'll need to fix!
'region': np.random.choice(['North', 'South', 'East', 'West', 'north', 'SOUTH'], n_records),
# Sales amount in dollars - random values between $100 and $5,000
'sales': np.random.randint(100, 5000, n_records),
# Quantity of items sold - random values between 1 and 50
'quantity': np.random.randint(1, 50, n_records),
# Customer age - random values between 18 and 75
'customer_age': np.random.randint(18, 75, n_records),
# Customer satisfaction score (1-5 scale, where 5 is best)
'satisfaction': np.random.choice([1, 2, 3, 4, 5], n_records)
}
# Convert the dictionary into a pandas DataFrame (like a spreadsheet table)
df = pd.DataFrame(data)
# ----------------------------------------
# Introduce Missing Values (10% of data)
# ----------------------------------------
# Randomly select 10% of rows to have missing data
missing_indices = np.random.choice(df.index, size=int(n_records * 0.10), replace=False)
# Make half of those rows have missing sales values
df.loc[missing_indices[:len(missing_indices)//2], 'sales'] = np.nan
# Make the other half have missing satisfaction scores
df.loc[missing_indices[len(missing_indices)//2:], 'satisfaction'] = np.nan
# ----------------------------------------
# Introduce Duplicate Rows
# ----------------------------------------
# Randomly select 20 rows and duplicate them
duplicate_rows = df.sample(20)
# Add these duplicate rows to the dataframe (this creates duplicates)
df = pd.concat([df, duplicate_rows], ignore_index=True)
# ----------------------------------------
# Introduce Outliers
# ----------------------------------------
# Select 10 random rows and give them unrealistically high sales values
outlier_indices = np.random.choice(df.index, size=10, replace=False)
# Set their sales to be between $50,000 and $100,000 (much higher than normal)
df.loc[outlier_indices, 'sales'] = np.random.randint(50000, 100000, len(outlier_indices))
# ----------------------------------------
# Display Results
# ----------------------------------------
print(f"โ Dataset created with {len(df)} records")
print(f"โ Includes: missing values, duplicates, outliers, and inconsistent data")
print(f"\nFirst 10 rows of the dataset:")
df.head(10)

Again! Always start by examining your data!
# Get basic information
df.info()
# Check dimensions
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
# View column names and types
df.dtypes

Tip
Use .info() to quickly see column types and missing values
Three Key Checks:
# Count missing values
missing = df.isnull().sum()
# Calculate percentage missing
missing_pct = (df.isnull().sum() / len(df)) * 100
# Display summary
pd.DataFrame({
'Missing Count': missing,
'Percentage': missing_pct
})
# Count duplicate rows
print(f"Duplicates: {df.duplicated().sum()}")
# View duplicate rows
df[df.duplicated(keep=False)]

Warning
Duplicates can skew your analysis results!
Example Issues:
# Create a copy (preserve original!)
df_clean = df.copy()
# Remove duplicates
df_clean = df_clean.drop_duplicates()
print(f"Removed {len(df) - len(df_clean)} duplicates")Tip
Best Practice: Always work on a copy of your data
# Fix inconsistent capitalization
df_clean['region'] = df_clean['region'].str.title()
# Before: 'north', 'NORTH', 'North'
# After: 'North', 'North', 'North'

Other Standardization Examples (sketched below):
- .str.strip()
- .replace()
- .str.lower() or .str.upper()
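A minimal sketch of those methods in action (the specific replacement values below are illustrative, not drawn from the sample dataset):
# Remove stray leading/trailing whitespace
df_clean['region'] = df_clean['region'].str.strip()
# Replace specific known-bad values (example mapping only)
df_clean['region'] = df_clean['region'].replace({'N.': 'North', 'Sth': 'South'})
# Or force a single consistent case across a text column
df_clean['product'] = df_clean['product'].str.lower()  # .str.upper() also works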
Three Main Strategies:
Understanding Quartiles
Are outliers important to your work?
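One common way to answer that is the 1.5 × IQR rule built on the quartiles above. A sketch applied to the sales column:
# Quartiles and interquartile range for 'sales'
q1 = df_clean['sales'].quantile(0.25)
q3 = df_clean['sales'].quantile(0.75)
iqr = q3 - q1
# Standard 1.5 * IQR fences
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
# Flag rows that fall outside the fences
outliers = df_clean[(df_clean['sales'] < lower_bound) | (df_clean['sales'] > upper_bound)]
print(f"Potential outliers: {len(outliers)}")
# Depending on your domain, you might drop these rows or cap the values, e.g.:
# df_clean['sales'] = df_clean['sales'].clip(lower_bound, upper_bound)
Whether to remove, cap, or keep outliers is a judgment call that depends on what the analysis needs.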
Create New Features
Create new columns for better analysis:
# Calculate price per unit
df_clean['price_per_unit'] = df_clean['sales'] / df_clean['quantity']
# Extract time features
df_clean['month'] = pd.to_datetime(df_clean['date']).dt.month
df_clean['quarter'] = pd.to_datetime(df_clean['date']).dt.quarter
# Create age groups
df_clean['age_group'] = pd.cut(
df_clean['customer_age'],
bins=[0, 25, 35, 50, 100],
labels=['18-25', '26-35', '36-50', '50+']
)
# AND condition: both must be true
premium = df_clean[
(df_clean['satisfaction'] >= 4) &
(df_clean['sales'] > 2000)
]
# OR condition: either can be true
tech_products = df_clean[
(df_clean['product'] == 'Laptop') |
(df_clean['product'] == 'Phone')
]

Important
Use & for AND, | for OR, and always use parentheses!
Quality Assurance
Before Cleaning
After Cleaning
Result: Clean, analysis-ready dataset!
Always Verify Your Work
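Verification is mostly re-running the same checks on the cleaned copy. A quick sketch:
# Confirm the issues we set out to fix are actually gone
print("Missing values per column:")
print(df_clean.isnull().sum())
print(f"\nRemaining duplicates: {df_clean.duplicated().sum()}")
print(f"\nRegion values: {sorted(df_clean['region'].unique())}")
# Summary statistics to confirm outliers are under control
print(df_clean['sales'].describe())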
Choose the Right Strategy
| Data Type | Missing Value Strategy | Example |
|---|---|---|
| Numerical (continuous) | Mean or Median | Sales, age, price |
| Numerical (discrete) | Mode or Median | Ratings, counts |
| Categorical | Mode | Categories, regions |
| Time series | Forward/backward fill | Dates, sequences |
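As a sketch, those strategies map onto pandas roughly as follows (each line is an independent illustration using the sample dataset's columns, not a single pipeline):
# Numerical (continuous): fill with the median, which is robust to outliers
df_clean['sales'] = df_clean['sales'].fillna(df_clean['sales'].median())
# Numerical (discrete): fill with the mode (most frequent value)
df_clean['satisfaction'] = df_clean['satisfaction'].fillna(df_clean['satisfaction'].mode()[0])
# Categorical: fill with the most common category
df_clean['region'] = df_clean['region'].fillna(df_clean['region'].mode()[0])
# Time series: forward-fill in date order
df_clean = df_clean.sort_values('date')
df_clean['satisfaction'] = df_clean['satisfaction'].ffill()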
Domain Knowledge is Key
Watch Out For These Mistakes!
Missing Values Heatmap
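A simple version of such a heatmap can be built directly from the boolean isnull() matrix using the matplotlib import from earlier; a sketch, not a polished figure:
# Boolean matrix: True where a value is missing
missing_matrix = df.isnull().values
plt.figure(figsize=(10, 4))
plt.imshow(missing_matrix, aspect='auto', cmap='gray_r', interpolation='nearest')
plt.xticks(range(len(df.columns)), df.columns, rotation=45)
plt.xlabel('Column')
plt.ylabel('Row index')
plt.title('Missing Values (dark = missing)')
plt.tight_layout()
plt.show()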
Compare Distributions
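For example, plotting the sales distribution before and after cleaning side by side makes the effect of removing outliers visible. A sketch:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Raw data, including outliers (drop NaNs just for plotting)
axes[0].hist(df['sales'].dropna(), bins=30, edgecolor='black')
axes[0].set_title('Sales Before Cleaning')
axes[0].set_xlabel('Sales ($)')
axes[0].set_ylabel('Count')
# Cleaned data
axes[1].hist(df_clean['sales'].dropna(), bins=30, edgecolor='black')
axes[1].set_title('Sales After Cleaning')
axes[1].set_xlabel('Sales ($)')
axes[1].set_ylabel('Count')
plt.tight_layout()
plt.show()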
Complete Workflow
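Putting the steps from these slides together, one possible end-to-end arrangement looks like this (the function name and fill choices are just one reasonable sketch, not the only way to do it):
def clean_sales_data(raw_df):
    """Apply the cleaning steps covered in this presentation."""
    cleaned = raw_df.copy()                            # work on a copy
    cleaned = cleaned.drop_duplicates()                # remove duplicate rows
    cleaned['region'] = cleaned['region'].str.title()  # standardize text
    # Fill missing values with column-appropriate statistics
    cleaned['sales'] = cleaned['sales'].fillna(cleaned['sales'].median())
    cleaned['satisfaction'] = cleaned['satisfaction'].fillna(cleaned['satisfaction'].mode()[0])
    # Cap extreme sales values using the 1.5 * IQR rule
    q1, q3 = cleaned['sales'].quantile([0.25, 0.75])
    iqr = q3 - q1
    cleaned['sales'] = cleaned['sales'].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    # Add derived features
    cleaned['price_per_unit'] = cleaned['sales'] / cleaned['quantity']
    cleaned['month'] = pd.to_datetime(cleaned['date']).dt.month
    return cleaned

df_clean = clean_sales_data(df)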
Open the accompanying Jupyter notebook: data-analytics-00-data-cleaning-visualization.ipynb
Work through the examples and complete the challenges!
Tip
The best way to learn is by doing!
Data Cleaning & Visualization