Name	Name	Last commit message	Last commit date
parent directory ..
README.md	README.md
example.py	example.py
exercises.py	exercises.py

Data Cleaning

Objective

Learn how to take messy, real-world data and whip it into shape using pandas. By the end of this lesson, you'll have a complete toolkit for detecting and fixing the most common data quality problems.

Concepts Covered

Why data cleaning matters
Detecting and handling missing values
Removing duplicate records
Converting data types
Cleaning strings (whitespace, case, formatting)
Renaming columns
Detecting and handling outliers
Applying custom transformations
Building a complete cleaning workflow

Prerequisites

DataFrames basics (creating, indexing, selecting)
Basic pandas operations
Python string methods
pip install pandas (the exercises require it)

Lesson

Why Data Cleaning Matters

Here's the truth about data in the real world: it's a mess. Surveys have typos. Databases have gaps. Spreadsheets have columns where someone typed "N/A" instead of leaving it blank, or "yes", "Yes", "YES", and "y" all mean the same thing.

Data scientists spend a huge chunk of their time — some estimates say 60-80% — just cleaning data before they can do anything useful with it. If your data is garbage, your analysis will be garbage too. No fancy model can save you from bad inputs.

The good news? Pandas makes data cleaning surprisingly painless once you know the patterns.

Common Data Quality Issues

Before you can fix problems, you need to know what to look for:

Problem	Example
Missing values	Empty cells, `NaN`, `None`, "N/A"
Duplicates	Same record entered twice
Wrong types	Numbers stored as strings (`"42"` instead of `42`)
Inconsistent formatting	`"New York"`, `"new york"`, `"NEW YORK"`, `" New York "`
Outliers	Age of 999, salary of -50000
Invalid data	Email without `@`, date of `2025-13-45`

Let's tackle each one.

Handling Missing Data

Missing data is the most common problem you'll encounter. Pandas represents missing values as NaN (Not a Number) for numeric data and None or NaN for other types.

Detecting Missing Values

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name": ["Alice", "Bob", None, "Diana"],
    "age": [25, np.nan, 30, 35],
    "city": ["NYC", "LA", "NYC", None]
})

# Which values are missing?
print(df.isna())         # True where values are missing
print(df.notna())        # True where values are present

# Count missing values per column
print(df.isna().sum())

# What percentage of each column is missing?
print(df.isna().mean() * 100)

Dropping Missing Values

Sometimes the simplest approach is to just remove rows with missing data:

# Drop any row that has at least one missing value
df.dropna()

# Drop rows only if ALL values are missing
df.dropna(how="all")

# Drop rows missing values in specific columns only
df.dropna(subset=["name", "age"])

# Require at least 2 non-null values per row
df.dropna(thresh=2)

Be careful — dropping rows means losing data. If 40% of your rows have missing values, you don't want to throw away nearly half your dataset.

Filling Missing Values

Often it's better to fill in the blanks with something reasonable:

# Fill with a specific value
df["age"].fillna(0)
df["city"].fillna("Unknown")

# Fill with the column's mean, median, or mode
df["age"].fillna(df["age"].mean())
df["age"].fillna(df["age"].median())
df["city"].fillna(df["city"].mode()[0])

# Forward fill — use the previous row's value
df["city"].ffill()

# Backward fill — use the next row's value
df["city"].bfill()

# Interpolate — estimate based on surrounding values (great for time series)
df["age"].interpolate()

The right strategy depends on your data. Mean/median works for numeric data, mode for categories, and forward/backward fill for time-ordered data.

Removing Duplicates

Duplicate records can skew your analysis. Maybe a form was submitted twice, or a merge went wrong.

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "email": ["a@x.com", "b@x.com", "a@x.com", "c@x.com", "b@y.com"]
})

# Which rows are duplicates? (first occurrence is NOT marked as duplicate)
print(df.duplicated())

# Check for duplicates in specific columns only
print(df.duplicated(subset=["name"]))

# Count total duplicates
print(df.duplicated().sum())

# Drop duplicates — keeps the first occurrence by default
df.drop_duplicates()

# Keep the last occurrence instead
df.drop_duplicates(keep="last")

# Check duplicates based on specific columns
df.drop_duplicates(subset=["name"])

Data Type Conversion

Data often arrives with the wrong types. A column of numbers might be stored as strings because one entry was "N/A". Dates might be plain text.

df = pd.DataFrame({
    "price": ["10.50", "20.00", "15.75"],
    "quantity": ["3", "5", "2"],
    "date": ["2025-01-15", "2025-02-20", "2025-03-10"]
})

print(df.dtypes)  # Everything is 'object' (string)

# Convert with .astype()
df["quantity"] = df["quantity"].astype(int)
df["price"] = df["price"].astype(float)

# pd.to_numeric() is safer — handles errors gracefully
df["price"] = pd.to_numeric(df["price"], errors="coerce")  # bad values become NaN

# pd.to_datetime() for date strings
df["date"] = pd.to_datetime(df["date"])

print(df.dtypes)  # Now they're the right types!

The errors="coerce" parameter is your friend. Instead of crashing on a bad value like "N/A", it quietly converts it to NaN so you can deal with it later.

String Cleaning

Messy strings are everywhere. Extra spaces, mixed capitalization, typos — it all needs to go.

Pandas gives you the .str accessor that lets you run string methods on entire columns at once:

df = pd.DataFrame({
    "name": ["  Alice  ", "BOB", "charlie", " Diana Jones "],
    "city": ["new york", "Los Angeles", "NEW YORK", "los angeles"],
    "phone": ["555-1234", "555.5678", "555 9012", "555-3456"]
})

# Strip leading/trailing whitespace
df["name"] = df["name"].str.strip()

# Standardize case
df["name"] = df["name"].str.title()       # "Alice", "Bob", "Charlie"
df["city"] = df["city"].str.lower()        # all lowercase
df["city"] = df["city"].str.title()        # "New York", "Los Angeles"

# Replace characters
df["phone"] = df["phone"].str.replace(".", "-", regex=False)
df["phone"] = df["phone"].str.replace(" ", "-")

# Check if a string contains a pattern
mask = df["city"].str.contains("york", case=False)

# Extract parts of strings with regex
df["area_code"] = df["phone"].str.extract(r"(\d{3})")

Renaming Columns

Column names from spreadsheets or databases are often ugly — spaces, mixed case, special characters. Clean them up:

df = pd.DataFrame({
    "First Name": [1], "Last Name": [2], "Email Address": [3]
})

# Rename specific columns
df = df.rename(columns={
    "First Name": "first_name",
    "Last Name": "last_name",
    "Email Address": "email"
})

# Or clean all column names at once
df.columns = df.columns.str.lower().str.replace(" ", "_")

# Remove special characters from column names
df.columns = df.columns.str.replace(r"[^a-z0-9_]", "", regex=True)

A good convention is snake_case — all lowercase with underscores. It plays nicely with Python attribute access (df.first_name instead of df["First Name"]).

Handling Outliers

Outliers are values that are way outside the normal range. They can be legitimate (a CEO's salary in an employee dataset) or errors (someone typed 999 for age). Either way, you need to decide what to do with them.

IQR Method (Interquartile Range)

The classic statistical approach — values beyond 1.5x the IQR from the quartiles are flagged as outliers:

Q1 = df["salary"].quantile(0.25)
Q3 = df["salary"].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Find outliers
outliers = df[(df["salary"] < lower_bound) | (df["salary"] > upper_bound)]

# Remove outliers
df_clean = df[(df["salary"] >= lower_bound) & (df["salary"] <= upper_bound)]

Z-Score Method

Flag values more than 2-3 standard deviations from the mean:

mean = df["salary"].mean()
std = df["salary"].std()

# Values with z-score > 3 are outliers
z_scores = (df["salary"] - mean) / std
df_clean = df[z_scores.abs() <= 3]

Clipping

Instead of removing outliers, you can cap them at a boundary:

# Cap values at the 5th and 95th percentiles
lower = df["salary"].quantile(0.05)
upper = df["salary"].quantile(0.95)
df["salary"] = df["salary"].clip(lower=lower, upper=upper)

Applying Custom Transformations

Sometimes you need cleaning logic that's specific to your data. That's where .apply() and .map() come in.

# .apply() runs a function on every value in a column (or every row)
df["name"] = df["name"].apply(lambda x: x.strip().title() if isinstance(x, str) else x)

# Use a regular function for more complex logic
def clean_phone(phone):
    if pd.isna(phone):
        return phone
    digits = "".join(c for c in str(phone) if c.isdigit())
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    return phone

df["phone"] = df["phone"].apply(clean_phone)

# .map() is great for replacing values using a dictionary
status_map = {"Y": "Active", "N": "Inactive", "y": "Active", "n": "Inactive"}
df["status"] = df["status"].map(status_map)

# .replace() works similarly but keeps values that don't match
df["status"] = df["status"].replace({"Y": "Active", "N": "Inactive"})

Validating Cleaned Data

After cleaning, always check your work. Trust, but verify:

# Check for remaining missing values
print(df.isna().sum())

# Check data types
print(df.dtypes)

# Check for duplicates
print(f"Duplicates remaining: {df.duplicated().sum()}")

# Look at basic stats — do the numbers make sense?
print(df.describe())

# Check unique values in categorical columns
print(df["status"].value_counts())

# Verify value ranges
assert df["age"].between(0, 120).all(), "Age values out of range!"

A Complete Cleaning Workflow

Here's how it all comes together in practice:

import pandas as pd

# Step 1: Load and inspect
df = pd.read_csv("messy_data.csv")
print(df.shape)
print(df.dtypes)
print(df.isna().sum())
print(df.describe())

# Step 2: Clean column names
df.columns = df.columns.str.lower().str.strip().str.replace(" ", "_")

# Step 3: Remove duplicates
df = df.drop_duplicates()

# Step 4: Fix data types
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Step 5: Handle missing values
df["name"] = df["name"].fillna("Unknown")
df["amount"] = df["amount"].fillna(df["amount"].median())
df = df.dropna(subset=["date"])  # drop rows where date is essential

# Step 6: Clean strings
df["name"] = df["name"].str.strip().str.title()
df["city"] = df["city"].str.strip().str.title()

# Step 7: Handle outliers
Q1 = df["amount"].quantile(0.25)
Q3 = df["amount"].quantile(0.75)
IQR = Q3 - Q1
df = df[df["amount"].between(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)]

# Step 8: Validate
print(f"Final shape: {df.shape}")
print(f"Missing values:\n{df.isna().sum()}")
print(f"Duplicates: {df.duplicated().sum()}")

Code Example

Check out example.py for a complete working example that creates a messy dataset and cleans it step by step.

Exercises

Try the practice problems in exercises.py to test your data cleaning skills.

Key Takeaways

Real-world data is always messy — cleaning it is a core skill, not a chore to skip
Use .isna() and .notna() to detect missing values, then decide whether to drop or fill them
.drop_duplicates() removes duplicate rows; use subset to check specific columns
pd.to_numeric() and pd.to_datetime() with errors="coerce" safely convert types
The .str accessor gives you vectorized string operations — .strip(), .lower(), .replace()
Handle outliers with IQR, z-scores, or .clip() depending on the situation
.apply() lets you run any custom function across a column
Always validate after cleaning — check types, missing values, ranges, and duplicates
Build a consistent workflow: inspect, clean columns, deduplicate, fix types, fill gaps, clean strings, handle outliers, validate

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

Data Cleaning

Objective

Concepts Covered

Prerequisites

Lesson

Why Data Cleaning Matters

Common Data Quality Issues

Handling Missing Data

Detecting Missing Values

Dropping Missing Values

Filling Missing Values

Removing Duplicates

Data Type Conversion

String Cleaning

Renaming Columns

Handling Outliers

IQR Method (Interquartile Range)

Z-Score Method

Clipping

Applying Custom Transformations

Validating Cleaned Data

A Complete Cleaning Workflow

Code Example

Exercises

Key Takeaways

FilesExpand file tree

03-data-cleaning

Directory actions

More options

Directory actions

More options

Latest commit

History

03-data-cleaning

Folders and files

parent directory

README.md

Data Cleaning

Objective

Concepts Covered

Prerequisites

Lesson

Why Data Cleaning Matters

Common Data Quality Issues

Handling Missing Data

Detecting Missing Values

Dropping Missing Values

Filling Missing Values

Removing Duplicates

Data Type Conversion

String Cleaning

Renaming Columns

Handling Outliers

IQR Method (Interquartile Range)

Z-Score Method

Clipping

Applying Custom Transformations

Validating Cleaned Data

A Complete Cleaning Workflow

Code Example

Exercises

Key Takeaways