A Step-by-Step Guide to Clean Data for Analysis
By the end of this presentation, you will understand:
The Reality
Common Issues
# ========================================
# Import Required Libraries
# ========================================
# pandas (pd): The primary library for working with tabular data (like spreadsheets)
import pandas as pd
# numpy (np): Library for numerical computations and array operations
import numpy as np
# matplotlib.pyplot (plt): The main plotting library for creating visualizations
import matplotlib.pyplot as plt

Libraries are extra pieces of code that extend Python's built-in capabilities.
From a File?
Always start by examining your data!
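If your data lives in a file instead, the first step is the same: load it, then look at it. A minimal sketch, assuming a CSV file (the filename sales_data.csv is just a placeholder):
# Load a CSV file into a DataFrame (filename is hypothetical)
df = pd.read_csv('sales_data.csv')
# First look: dimensions, column types, and the first few rows
print(df.shape)
df.info()
df.head()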
From Random Data?
# datetime modules: For working with dates and times
from datetime import datetime, timedelta
# ========================================
# Configure Settings
# ========================================
# Set random seed to 42 so everyone gets the same "random" data
# This makes our code reproducible - you'll get the same results every time
np.random.seed(42)
# Configure pandas display options for better readability
pd.set_option('display.max_columns', None) # Show all columns (don't truncate)
pd.set_option('display.width', None) # Use full screen width
print("โ Libraries imported successfully!")
print("โ Settings configured!")# ========================================
# Create Sample Dataset with Intentional Data Quality Issues
# ========================================
# We're creating a realistic dataset that has common problems you'll encounter in real data
# Set the number of records we want to generate
n_records = 500
# ----------------------------------------
# Generate Random Dates
# ----------------------------------------
# Create a starting date (January 1, 2024)
start_date = datetime(2024, 1, 1)
# Generate 500 random dates throughout 2024
# This simulates transaction dates spread across the year
dates = [start_date + timedelta(days=np.random.randint(0, 365)) for _ in range(n_records)]
# ----------------------------------------
# Build the Dataset Dictionary
# ----------------------------------------
# Create a dictionary where each key is a column name and value is a list of data
data = {
# Transaction date for each sale
'date': dates,
# Product names - randomly chosen from 6 different products
'product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard', 'Mouse'], n_records),
# Region names - NOTE: Intentionally inconsistent capitalization ('north' vs 'North' vs 'SOUTH')
# This is a common data quality issue we'll need to fix!
'region': np.random.choice(['North', 'South', 'East', 'West', 'north', 'SOUTH'], n_records),
# Sales amount in dollars - random values between $100 and $5,000
'sales': np.random.randint(100, 5000, n_records),
# Quantity of items sold - random values between 1 and 50
'quantity': np.random.randint(1, 50, n_records),
# Customer age - random values between 18 and 75
'customer_age': np.random.randint(18, 75, n_records),
# Customer satisfaction score (1-5 scale, where 5 is best)
'satisfaction': np.random.choice([1, 2, 3, 4, 5], n_records)
}
# Convert the dictionary into a pandas DataFrame (like a spreadsheet table)
df = pd.DataFrame(data)
# ----------------------------------------
# Introduce Missing Values (10% of data)
# ----------------------------------------
# Randomly select 10% of rows to have missing data
missing_indices = np.random.choice(df.index, size=int(n_records * 0.10), replace=False)
# Make half of those rows have missing sales values
df.loc[missing_indices[:len(missing_indices)//2], 'sales'] = np.nan
# Make the other half have missing satisfaction scores
df.loc[missing_indices[len(missing_indices)//2:], 'satisfaction'] = np.nan
# ----------------------------------------
# Introduce Duplicate Rows
# ----------------------------------------
# Randomly select 20 rows and duplicate them
duplicate_rows = df.sample(20)
# Add these duplicate rows to the dataframe (this creates duplicates)
df = pd.concat([df, duplicate_rows], ignore_index=True)
# ----------------------------------------
# Introduce Outliers
# ----------------------------------------
# Select 10 random rows and give them unrealistically high sales values
outlier_indices = np.random.choice(df.index, size=10, replace=False)
# Set their sales to be between $50,000 and $100,000 (much higher than normal)
df.loc[outlier_indices, 'sales'] = np.random.randint(50000, 100000, len(outlier_indices))
# ----------------------------------------
# Display Results
# ----------------------------------------
print(f"โ Dataset created with {len(df)} records")
print(f"โ Includes: missing values, duplicates, outliers, and inconsistent data")
print(f"\nFirst 10 rows of the dataset:")
df.head(10)

Again! Always start by examining your data!
# Get basic information
df.info()
# Check dimensions
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
# View column names and types
df.dtypes

Tip
Use .info() to quickly see column types and missing values
Three Key Checks:
# Count missing values
missing = df.isnull().sum()
# Calculate percentage missing
missing_pct = (df.isnull().sum() / len(df)) * 100
# Display summary
pd.DataFrame({
'Missing Count': missing,
'Percentage': missing_pct
})
# Count duplicate rows
print(f"Duplicates: {df.duplicated().sum()}")
# View duplicate rows
df[df.duplicated(keep=False)]

Warning
Duplicates can skew your analysis results!
Example Issues:
# Create a copy (preserve original!)
df_clean = df.copy()
# Remove duplicates
df_clean = df_clean.drop_duplicates()
print(f"Removed {len(df) - len(df_clean)} duplicates")Tip
Best Practice: Always work on a copy of your data
# Fix inconsistent capitalization
df_clean['region'] = df_clean['region'].str.title()
# Before: 'north', 'NORTH', 'North'
# After: 'North', 'North', 'North'

Other Standardization Examples (sketched below):
- .str.strip()
- .replace()
- .str.lower() or .str.upper()
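A minimal sketch of those methods in action (the specific replacement values below are illustrative, not drawn from the sample dataset):
# Remove stray leading/trailing whitespace
df_clean['region'] = df_clean['region'].str.strip()
# Replace specific known-bad values (example mapping only)
df_clean['region'] = df_clean['region'].replace({'N.': 'North', 'Sth': 'South'})
# Or force a single consistent case across a text column
df_clean['product'] = df_clean['product'].str.lower()  # .str.upper() also works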
Three Main Strategies:
Understanding Quartiles
Are outliers important to your work?
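One common way to answer that is the 1.5 × IQR rule built on the quartiles above. A sketch applied to the sales column:
# Quartiles and interquartile range for 'sales'
q1 = df_clean['sales'].quantile(0.25)
q3 = df_clean['sales'].quantile(0.75)
iqr = q3 - q1
# Standard 1.5 * IQR fences
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
# Flag rows that fall outside the fences
outliers = df_clean[(df_clean['sales'] < lower_bound) | (df_clean['sales'] > upper_bound)]
print(f"Potential outliers: {len(outliers)}")
# Depending on your domain, you might drop these rows or cap the values, e.g.:
# df_clean['sales'] = df_clean['sales'].clip(lower_bound, upper_bound)
Whether to remove, cap, or keep outliers is a judgment call that depends on what the analysis needs.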
Create New Features
Create new columns for better analysis:
# Calculate price per unit
df_clean['price_per_unit'] = df_clean['sales'] / df_clean['quantity']
# Extract time features
df_clean['month'] = pd.to_datetime(df_clean['date']).dt.month
df_clean['quarter'] = pd.to_datetime(df_clean['date']).dt.quarter
# Create age groups
df_clean['age_group'] = pd.cut(
df_clean['customer_age'],
bins=[0, 25, 35, 50, 100],
labels=['18-25', '26-35', '36-50', '50+']
)
# AND condition: both must be true
premium = df_clean[
(df_clean['satisfaction'] >= 4) &
(df_clean['sales'] > 2000)
]
# OR condition: either can be true
tech_products = df_clean[
(df_clean['product'] == 'Laptop') |
(df_clean['product'] == 'Phone')
]

Important
Use & for AND, | for OR, and always use parentheses!
Quality Assurance
Before Cleaning
After Cleaning
Result: Clean, analysis-ready dataset!
Always Verify Your Work
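Verification is mostly re-running the same checks on the cleaned copy. A quick sketch:
# Confirm the issues we set out to fix are actually gone
print("Missing values per column:")
print(df_clean.isnull().sum())
print(f"\nRemaining duplicates: {df_clean.duplicated().sum()}")
print(f"\nRegion values: {sorted(df_clean['region'].unique())}")
# Summary statistics to confirm outliers are under control
print(df_clean['sales'].describe())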
Choose the Right Strategy
| Data Type | Missing Value Strategy | Example |
|---|---|---|
| Numerical (continuous) | Mean or Median | Sales, age, price |
| Numerical (discrete) | Mode or Median | Ratings, counts |
| Categorical | Mode | Categories, regions |
| Time series | Forward/backward fill | Dates, sequences |
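As a sketch, those strategies map onto pandas roughly as follows (each line is an independent illustration using the sample dataset's columns, not a single pipeline):
# Numerical (continuous): fill with the median, which is robust to outliers
df_clean['sales'] = df_clean['sales'].fillna(df_clean['sales'].median())
# Numerical (discrete): fill with the mode (most frequent value)
df_clean['satisfaction'] = df_clean['satisfaction'].fillna(df_clean['satisfaction'].mode()[0])
# Categorical: fill with the most common category
df_clean['region'] = df_clean['region'].fillna(df_clean['region'].mode()[0])
# Time series: forward-fill in date order
df_clean = df_clean.sort_values('date')
df_clean['satisfaction'] = df_clean['satisfaction'].ffill()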
Domain Knowledge is Key
Watch Out For These Mistakes!
Missing Values Heatmap
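A simple version of such a heatmap can be built directly from the boolean isnull() matrix using the matplotlib import from earlier; a sketch, not a polished figure:
# Boolean matrix: True where a value is missing
missing_matrix = df.isnull().values
plt.figure(figsize=(10, 4))
plt.imshow(missing_matrix, aspect='auto', cmap='gray_r', interpolation='nearest')
plt.xticks(range(len(df.columns)), df.columns, rotation=45)
plt.xlabel('Column')
plt.ylabel('Row index')
plt.title('Missing Values (dark = missing)')
plt.tight_layout()
plt.show()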
Compare Distributions
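For example, plotting the sales distribution before and after cleaning side by side makes the effect of removing outliers visible. A sketch:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Raw data, including outliers (drop NaNs just for plotting)
axes[0].hist(df['sales'].dropna(), bins=30, edgecolor='black')
axes[0].set_title('Sales Before Cleaning')
axes[0].set_xlabel('Sales ($)')
axes[0].set_ylabel('Count')
# Cleaned data
axes[1].hist(df_clean['sales'].dropna(), bins=30, edgecolor='black')
axes[1].set_title('Sales After Cleaning')
axes[1].set_xlabel('Sales ($)')
axes[1].set_ylabel('Count')
plt.tight_layout()
plt.show()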
Complete Workflow
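Putting the steps from these slides together, one possible end-to-end arrangement looks like this (the function name and fill choices are just one reasonable sketch, not the only way to do it):
def clean_sales_data(raw_df):
    """Apply the cleaning steps covered in this presentation."""
    cleaned = raw_df.copy()                            # work on a copy
    cleaned = cleaned.drop_duplicates()                # remove duplicate rows
    cleaned['region'] = cleaned['region'].str.title()  # standardize text
    # Fill missing values with column-appropriate statistics
    cleaned['sales'] = cleaned['sales'].fillna(cleaned['sales'].median())
    cleaned['satisfaction'] = cleaned['satisfaction'].fillna(cleaned['satisfaction'].mode()[0])
    # Cap extreme sales values using the 1.5 * IQR rule
    q1, q3 = cleaned['sales'].quantile([0.25, 0.75])
    iqr = q3 - q1
    cleaned['sales'] = cleaned['sales'].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    # Add derived features
    cleaned['price_per_unit'] = cleaned['sales'] / cleaned['quantity']
    cleaned['month'] = pd.to_datetime(cleaned['date']).dt.month
    return cleaned

df_clean = clean_sales_data(df)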
Open the accompanying Jupyter notebook: data-analytics-00-data-cleaning-visualization.ipynb
Work through the examples and complete the challenges!
Tip
The best way to learn is by doing!
Data Cleaning & Visualization