Data Cleaning and Visualization

A Step-by-Step Guide to Clean Data for Analysis

Data Analytics Course

Learning Objectives

By the end of this presentation, you will understand:

  • How to identify common data quality issues
  • Techniques for cleaning messy datasets
  • Methods to handle missing values, duplicates, and outliers
  • How to filter and transform data for analysis
  • Best practices for data preparation

Why Data Cleaning Matters

The Reality

  • Real-world data is messy
  • Analysts commonly spend around 80% of their time on data cleaning
  • Bad data = bad conclusions

Common Issues

  • Missing values
  • Duplicate records
  • Inconsistent formatting
  • Outliers
  • Invalid entries

Part 1: Understanding Your Data

Loading and Initial Inspection

Step 0: Load Python's Libraries

# ========================================
# Import Required Libraries
# ========================================
# pandas (pd): The primary library for working with tabular data (like spreadsheets)
import pandas as pd

# numpy (np): Library for numerical computations and array operations
import numpy as np

# matplotlib.pyplot (plt): The main plotting library for creating visualizations
import matplotlib.pyplot as plt

Libraries are extra pieces of code that extend what you can do with Python.


Step 1 (a): Load Your Data

From a File?

Loading from CSV


# Load your dataset
df = pd.read_csv('your_data.csv')

# First look at the data
df.head()

Always start by examining your data!


Step 1 (b): Prepare (Random) Data

From randomly generated data?


# datetime modules: For working with dates and times
from datetime import datetime, timedelta

# ========================================
# Configure Settings
# ========================================
# Set random seed to 42 so everyone gets the same "random" data
# This makes our code reproducible - you'll get the same results every time
np.random.seed(42)

# Configure pandas display options for better readability
pd.set_option('display.max_columns', None)  # Show all columns (don't truncate)
pd.set_option('display.width', None)        # Use full screen width

print("✓ Libraries imported successfully!")
print("✓ Settings configured!")

Step 1: Create The Dataset

# ========================================
# Create Sample Dataset with Intentional Data Quality Issues
# ========================================
# We're creating a realistic dataset that has common problems you'll encounter in real data

# Set the number of records we want to generate
n_records = 500

# ----------------------------------------
# Generate Random Dates
# ----------------------------------------
# Create a starting date (January 1, 2024)
start_date = datetime(2024, 1, 1)

# Generate 500 random dates throughout 2024
# This simulates transaction dates spread across the year
dates = [start_date + timedelta(days=np.random.randint(0, 365)) for _ in range(n_records)]

# ----------------------------------------
# Build the Dataset Dictionary
# ----------------------------------------
# Create a dictionary where each key is a column name and value is a list of data
data = {
    # Transaction date for each sale
    'date': dates,
    
    # Product names - randomly chosen from 6 different products
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard', 'Mouse'], n_records),
    
    # Region names - NOTE: Intentionally inconsistent capitalization ('north' vs 'North' vs 'SOUTH')
    # This is a common data quality issue we'll need to fix!
    'region': np.random.choice(['North', 'South', 'East', 'West', 'north', 'SOUTH'], n_records),
    
    # Sales amount in dollars - random values between $100 and $5,000
    'sales': np.random.randint(100, 5000, n_records),
    
    # Quantity of items sold - random values between 1 and 50
    'quantity': np.random.randint(1, 50, n_records),
    
    # Customer age - random values between 18 and 75
    'customer_age': np.random.randint(18, 75, n_records),
    
    # Customer satisfaction score (1-5 scale, where 5 is best)
    'satisfaction': np.random.choice([1, 2, 3, 4, 5], n_records)
}

# Convert the dictionary into a pandas DataFrame (like a spreadsheet table)
df = pd.DataFrame(data)

# ----------------------------------------
# Introduce Missing Values (10% of data)
# ----------------------------------------
# Randomly select 10% of rows to have missing data
missing_indices = np.random.choice(df.index, size=int(n_records * 0.10), replace=False)

# Make half of those rows have missing sales values
df.loc[missing_indices[:len(missing_indices)//2], 'sales'] = np.nan

# Make the other half have missing satisfaction scores
df.loc[missing_indices[len(missing_indices)//2:], 'satisfaction'] = np.nan

# ----------------------------------------
# Introduce Duplicate Rows
# ----------------------------------------
# Randomly select 20 rows and duplicate them
duplicate_rows = df.sample(20)
# Add these duplicate rows to the dataframe (this creates duplicates)
df = pd.concat([df, duplicate_rows], ignore_index=True)

# ----------------------------------------
# Introduce Outliers
# ----------------------------------------
# Select 10 random rows and give them unrealistically high sales values
outlier_indices = np.random.choice(df.index, size=10, replace=False)
# Set their sales to be between $50,000 and $100,000 (much higher than normal)
df.loc[outlier_indices, 'sales'] = np.random.randint(50000, 100000, len(outlier_indices))

# ----------------------------------------
# Display Results
# ----------------------------------------
print(f"✓ Dataset created with {len(df)} records")
print("✓ Includes: missing values, duplicates, outliers, and inconsistent data")
print("\nFirst 10 rows of the dataset:")
df.head(10)

Again! Always start by examining your data!


Step 2: Check Data Structure

# Get basic information
df.info()

# Check dimensions
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

# View column names and types
df.dtypes

Tip

Use .info() to quickly see column types and missing values

Step 3: Identify Data Quality Issues

Three Key Checks:

  1. Missing Values - Are there gaps in the data?
  2. Duplicates - Are there repeated rows?
  3. Consistency - Are values formatted uniformly?

Checking for Missing Values

# Count missing values
missing = df.isnull().sum()

# Calculate percentage missing
missing_pct = (df.isnull().sum() / len(df)) * 100

# Display summary
pd.DataFrame({
    'Missing Count': missing,
    'Percentage': missing_pct
})


Checking for Duplicates

# Count duplicate rows
print(f"Duplicates: {df.duplicated().sum()}")

# View duplicate rows
df[df.duplicated(keep=False)]

Warning

Duplicates can skew your analysis results!


Checking for Inconsistencies

# Look at unique values in categorical columns
df['region'].value_counts()

Example Issues:

  • 'North', 'north', 'NORTH' - should all be the same!
  • Spelling variations
  • Different date formats

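One of these issues, inconsistent date formats, is usually handled with pd.to_datetime(). A minimal sketch (the mixed_dates series below is a hypothetical example, not part of our dataset; format='mixed' requires pandas 2.0 or newer):

# Hypothetical example: the same date written three different ways
mixed_dates = pd.Series(['2024-03-01', '03/01/2024', 'March 1, 2024'])

# Parse each string into a proper datetime value, whatever its original format
parsed = pd.to_datetime(mixed_dates, format='mixed')
print(parsed)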

Part 2: Data Cleaning

Systematic Approach to Clean Data

Step 1: Remove Duplicates

# Create a copy (preserve original!)
df_clean = df.copy()

# Remove duplicates
df_clean = df_clean.drop_duplicates()

print(f"Removed {len(df) - len(df_clean)} duplicates")

Tip

Best Practice: Always work on a copy of your data


Step 2: Standardize Categorical Data

# Fix inconsistent capitalization
df_clean['region'] = df_clean['region'].str.title()

# Before: 'north', 'NORTH', 'North'
# After:  'North', 'North', 'North'

Other Standardization Examples (see the sketch after this list):

  • Trim whitespace: .str.strip()
  • Replace values: .replace()
  • Normalize case: .str.lower() or .str.upper()
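A minimal sketch combining these methods on the region column (the 'N' -> 'North' replacement is a hypothetical example of fixing an abbreviation, not an issue present in our dataset):

# Trim stray whitespace, then normalize capitalization
df_clean['region'] = df_clean['region'].str.strip().str.title()

# Replace known variants with a single canonical value
df_clean['region'] = df_clean['region'].replace({'N': 'North', 'S': 'South'})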

Step 3: Handle Missing Values

Three Main Strategies:

  1. Remove rows with missing values (see the sketch below)
  2. Fill with a statistic (mean, median, mode)
  3. Predict using other columns

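For strategy 1, a minimal sketch of dropping rows with missing values; it is done on a separate copy here because the next two examples fill values instead:

# Strategy 1 (Remove): drop every row that has at least one missing value
df_dropped = df_clean.dropna()
print(f"Rows before: {len(df_clean)}, after dropping missing values: {len(df_dropped)}")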

Filling Missing Values - Example 1

Strategy: Fill with group-specific values

# Fill missing sales with median by product
df_clean['sales'] = df_clean.groupby('product')['sales'].transform(
    lambda x: x.fillna(x.median())
)

Why? Different products have different typical prices!

Filling Missing Values - Example 2

Strategy: Fill with mode (most common value)

# Fill satisfaction scores with mode
mode_value = df_clean['satisfaction'].mode()[0]
df_clean['satisfaction'] = df_clean['satisfaction'].fillna(mode_value)

Use mode for categorical data (1-5 ratings, categories, etc.)

Step 4: Identify Outliers

IQR (Interquartile Range) Method

Q1 = df_clean['sales'].quantile(0.25)
Q3 = df_clean['sales'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Find outliers
outliers = df_clean[
    (df_clean['sales'] < lower_bound) | 
    (df_clean['sales'] > upper_bound)
]

Understanding the IQR Method

What is IQR?

  • Q1 = 25th percentile
  • Q3 = 75th percentile
  • IQR = Q3 - Q1
  • Middle 50% of data

Outlier Bounds

  • Lower: Q1 - 1.5 × IQR
  • Upper: Q3 + 1.5 × IQR
  • Values outside = outliers


The IQR Method

Understanding Quartiles

  • Q1 (First Quartile)
    • 25th percentile (25% of data is below this value)
    • Lower edge of the box
  • Q3 (Third Quartile)
    • 75th percentile (75% of data is below this value)
    • Upper edge of the box
  • IQR = Q3 - Q1
    • Contains middle 50% of data
    • Used to detect outliers
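Because the quartiles are described as the edges of a box, a boxplot is a quick way to see them along with any flagged outliers. A minimal sketch:

import matplotlib.pyplot as plt

# The box spans Q1 to Q3; the whiskers extend 1.5 * IQR beyond the box,
# and points outside the whiskers are drawn individually as outliers
plt.figure(figsize=(8, 3))
plt.boxplot(df_clean['sales'].dropna(), vert=False)
plt.title('Sales: Quartiles and Outliers')
plt.xlabel('Sales ($)')
plt.show()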

Step 5: Handle Outliers

Option 1: Remove Them

df_clean = df_clean[
    (df_clean['sales'] >= lower_bound) & 
    (df_clean['sales'] <= upper_bound)
]

Option 2: Cap Them

df_clean['sales'] = df_clean['sales'].clip(
    lower=lower_bound, 
    upper=upper_bound
)

Ask first: are outliers important to your work?


Step 6: Add Derived Columns

Create New Features

Create new columns for better analysis:

# Calculate price per unit
df_clean['price_per_unit'] = df_clean['sales'] / df_clean['quantity']

# Extract time features
df_clean['month'] = pd.to_datetime(df_clean['date']).dt.month
df_clean['quarter'] = pd.to_datetime(df_clean['date']).dt.quarter

# Create age groups
df_clean['age_group'] = pd.cut(
    df_clean['customer_age'], 
    bins=[0, 25, 35, 50, 100],
    labels=['18-25', '26-35', '36-50', '50+']
)

Part 3: Data Subsetting

Filtering for Focused Analysis

Filtering by Single Condition

Simple Filters

# Filter by product
laptops = df_clean[df_clean['product'] == 'Laptop']

# Filter by value range
high_sales = df_clean[df_clean['sales'] > 2000]


Filtering by Multiple Conditions

# AND condition: both must be true
premium = df_clean[
    (df_clean['satisfaction'] >= 4) & 
    (df_clean['sales'] > 2000)
]

# OR condition: either can be true
tech_products = df_clean[
    (df_clean['product'] == 'Laptop') | 
    (df_clean['product'] == 'Phone')
]

Important

Use & for AND, | for OR, and always use parentheses!

Using .isin() for Multiple Values

Multiple Value Filtering

# Filter for multiple categories
high_value = df_clean[
    df_clean['product'].isin(['Laptop', 'Monitor'])
]

# Filter by multiple regions and quarters
subset = df_clean[
    (df_clean['quarter'].isin([1, 2])) & 
    (df_clean['region'].isin(['North', 'East']))
]

Part 4: Data Cleaning Workflow

Putting It All Together

The Workflow (Part 1)

  • ✅ Load Data
  • ✅ Inspect Data
  • ✅ Identify Issues
  • ✅ Remove Duplicates
  • ✅ Standardize Values
  • ✅ Handle Missing Values

The Workflow (Part 2)

  • ✅ Handle Outliers
  • ✅ Add Derived Columns
  • ✅ Clean Data Ready!
  • ✅ Analysis & Visualization
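One way to put the workflow together is a single helper function. The sketch below simply chains the steps covered above (duplicate removal, standardization, filling, capping, derived columns); it is a template, not a one-size-fits-all recipe:

def clean_sales_data(df_raw):
    """Apply the cleaning steps from this workflow to a raw sales DataFrame."""
    df = df_raw.copy()                                    # preserve the original
    df = df.drop_duplicates()                             # remove duplicate rows
    df['region'] = df['region'].str.strip().str.title()   # standardize categories

    # Fill missing values: median by product for sales, mode for satisfaction
    df['sales'] = df.groupby('product')['sales'].transform(lambda x: x.fillna(x.median()))
    df['satisfaction'] = df['satisfaction'].fillna(df['satisfaction'].mode()[0])

    # Cap outliers using the 1.5 * IQR rule
    q1, q3 = df['sales'].quantile([0.25, 0.75])
    iqr = q3 - q1
    df['sales'] = df['sales'].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

    # Add derived columns
    df['month'] = pd.to_datetime(df['date']).dt.month
    df['price_per_unit'] = df['sales'] / df['quantity']
    return df

df_clean = clean_sales_data(df)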

Data Cleaning Checklist

✅ Quality Assurance

  • ✅ Removed duplicate rows
  • ✅ Standardized categorical values
  • ✅ Handled missing values appropriately
  • ✅ Identified and handled outliers
  • ✅ Created useful derived columns
  • ✅ Validated data quality
  • ✅ Documented cleaning decisions

Before and After Example

Before Cleaning ❌

  • 520 rows (with duplicates)
  • 50 missing values
  • 10 extreme outliers
  • Inconsistent region names
  • No derived features

After Cleaning ✅

  • 500 unique rows
  • 0 missing values
  • Outliers removed
  • Standardized categories
  • Added month, quarter, age groups

Result: Clean, analysis-ready dataset!

Best Practices

Tips for Effective Data Cleaning

Best Practice #1: Preserve Original Data

Always Work on a Copy!

# ✅ GOOD: Work on a copy
df_clean = df.copy()
df_clean.drop_duplicates(inplace=True)

# ❌ BAD: Modify original
df.drop_duplicates(inplace=True)

Why? You might need to revisit the raw data!

Best Practice #2: Document Your Decisions

๐Ÿ“ Keep a Cleaning Log

Keep track of what you did and why:

# Document your cleaning steps
cleaning_log = {
    'duplicates_removed': 20,
    'missing_sales_filled': 'median by product',
    'missing_satisfaction_filled': 'mode',
    'outliers_removed': 10,
    'reason': 'Sales outside the Q1/Q3 ± 1.5 × IQR bounds'
}

Best Practice #3: Validate Your Cleaning

๐Ÿ” Always Verify Your Work

# Check for remaining issues
assert df_clean.duplicated().sum() == 0
assert df_clean.isnull().sum().sum() == 0

# Verify data makes sense
print(f"Sales range: ${df_clean['sales'].min()} - ${df_clean['sales'].max()}")
print(f"Unique regions: {df_clean['region'].nunique()}")

Best Practice #4: Use Appropriate Methods

Choose the Right Strategy

Data Type                 Missing Value Strategy     Example
Numerical (continuous)    Mean or median             Sales, age, price
Numerical (discrete)      Mode or median             Ratings, counts
Categorical               Mode                       Categories, regions
Time series               Forward/backward fill      Dates, sequences
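A minimal sketch of each strategy in pandas, using columns from our example dataset where possible (the signup_date line is a hypothetical time-series column, shown commented out because our data has no missing dates):

# Numerical (continuous): fill with the median
df_clean['sales'] = df_clean['sales'].fillna(df_clean['sales'].median())

# Numerical (discrete) or categorical: fill with the mode (most common value)
df_clean['satisfaction'] = df_clean['satisfaction'].fillna(df_clean['satisfaction'].mode()[0])

# Time series: carry the last known value forward
# df_clean['signup_date'] = df_clean['signup_date'].ffill()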


Best Practice #5: Understand Your Domain

Domain Knowledge is Key

  • Know what's normal - Is a $100,000 sale realistic? (see the sketch after this list)
  • Understand relationships - Should quantity × price = total?
  • Consider context - Are missing values random or systematic?
  • Ask questions - When in doubt, consult domain experts!
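A small sketch of what such a domain check can look like; the $50,000 ceiling below is an assumed business rule used for illustration, not something derived from the data:

# Flag sales above what domain experts say is a realistic single transaction
suspicious = df_clean[df_clean['sales'] > 50_000]
print(f"Sales above the assumed $50,000 ceiling: {len(suspicious)}")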

Common Pitfalls to Avoid

โš ๏ธ Watch Out For These Mistakes!

  1. ❌ Deleting too much data - Missing values might be informative
  2. ❌ Using mean for skewed data - Use median instead (see the sketch below)
  3. ❌ Ignoring outliers blindly - Sometimes they're real and important
  4. ❌ Not checking data types - '123' vs 123
  5. ❌ Forgetting to validate - Always check your results! (Do they make sense?)
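Two of these pitfalls are easy to demonstrate. A minimal sketch (both small Series below are hypothetical examples, not columns from our dataset):

# Pitfall 2: one extreme value drags the mean far from the typical value,
# while the median stays put
skewed = pd.Series([100, 120, 110, 130, 95_000])
print(f"mean = {skewed.mean():.0f}, median = {skewed.median():.0f}")

# Pitfall 4: numbers stored as text look fine but behave like strings
price_text = pd.Series(['123', '456', '789'])
price_num = pd.to_numeric(price_text)            # convert '123' -> 123
print(price_text.dtype, '->', price_num.dtype)   # object -> int64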

Visualization

Quick Preview of Clean Data

Visualizing Data Quality

๐Ÿ” Missing Values Heatmap

import matplotlib.pyplot as plt

# Missing values heatmap
plt.figure(figsize=(10, 6))
plt.imshow(df.isnull(), cmap='viridis', aspect='auto')
plt.title('Missing Values Pattern')
plt.xlabel('Columns')
plt.ylabel('Rows')
plt.show()

Visual patterns can reveal systematic missing data!

Before/After Comparison

Compare Distributions

# Compare distributions
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Before
df['sales'].hist(ax=ax1, bins=30)
ax1.set_title('Sales Distribution - Before Cleaning')

# After
df_clean['sales'].hist(ax=ax2, bins=30)
ax2.set_title('Sales Distribution - After Cleaning')

plt.show()

Summary

Key Takeaways

The Data Cleaning Process

Complete Workflow

  1. Load and Inspect - Understand what you're working with
  2. Identify Issues - Missing values, duplicates, inconsistencies
  3. Remove Duplicates - Keep only unique records
  4. Standardize - Make categorical values consistent
  5. Handle Missing Values - Fill or remove strategically
  6. Handle Outliers - Identify and address extreme values
  7. Transform - Create derived features
  8. Validate - Check that cleaning worked
  9. Document - Record what you did and why

Hands-On Practice

Try it yourself!

Open the accompanying Jupyter notebook: data-analytics-00-data-cleaning-visualization.ipynb

Work through the examples and complete the challenges!

Tip

The best way to learn is by doing!