Learn how to take messy, real-world data and whip it into shape using pandas. By the end of this lesson, you'll have a complete toolkit for detecting and fixing the most common data quality problems.
- Why data cleaning matters
- Detecting and handling missing values
- Removing duplicate records
- Converting data types
- Cleaning strings (whitespace, case, formatting)
- Renaming columns
- Detecting and handling outliers
- Applying custom transformations
- Building a complete cleaning workflow
- DataFrames basics (creating, indexing, selecting)
- Basic pandas operations
- Python string methods
Install pandas if you haven't already (the exercises require it):
pip install pandas
Here's the truth about data in the real world: it's a mess. Surveys have typos. Databases have gaps. Spreadsheets have columns where someone typed "N/A" instead of leaving a cell blank, or where "yes", "Yes", "YES", and "y" all mean the same thing.
Data scientists spend a huge chunk of their time — some estimates say 60-80% — just cleaning data before they can do anything useful with it. If your data is garbage, your analysis will be garbage too. No fancy model can save you from bad inputs.
The good news? Pandas makes data cleaning surprisingly painless once you know the patterns.
Before you can fix problems, you need to know what to look for:
| Problem | Example |
|---|---|
| Missing values | Empty cells, NaN, None, "N/A" |
| Duplicates | Same record entered twice |
| Wrong types | Numbers stored as strings ("42" instead of 42) |
| Inconsistent formatting | "New York", "new york", "NEW YORK", " New York " |
| Outliers | Age of 999, salary of -50000 |
| Invalid data | Email without @, date of 2025-13-45 |
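A quick first pass with pandas' built-in inspection methods will surface most of these problems. Here's a minimal sketch, assuming your data lives in a CSV (the messy_data.csv name matches the workflow at the end of this lesson):
import pandas as pd
# Load the raw file and take a quick first look
df = pd.read_csv("messy_data.csv")
print(df.head())        # eyeball a few rows for odd formatting
df.info()               # dtypes and non-null counts in one view
print(df.isna().sum())  # missing values per column
print(df.describe())    # numeric ranges; impossible values stand out here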
Let's tackle each one.
Missing data is the most common problem you'll encounter. Pandas represents missing values as NaN (Not a Number) in numeric columns; in object (string) columns you may see either None or NaN, and missing datetimes show up as NaT.
import pandas as pd
import numpy as np
df = pd.DataFrame({
"name": ["Alice", "Bob", None, "Diana"],
"age": [25, np.nan, 30, 35],
"city": ["NYC", "LA", "NYC", None]
})
# Which values are missing?
print(df.isna()) # True where values are missing
print(df.notna()) # True where values are present
# Count missing values per column
print(df.isna().sum())
# What percentage of each column is missing?
print(df.isna().mean() * 100)
Sometimes the simplest approach is to just remove rows with missing data:
# Drop any row that has at least one missing value
df.dropna()
# Drop rows only if ALL values are missing
df.dropna(how="all")
# Drop rows missing values in specific columns only
df.dropna(subset=["name", "age"])
# Require at least 2 non-null values per row
df.dropna(thresh=2)
Be careful — dropping rows means losing data. If 40% of your rows have missing values, you don't want to throw away nearly half your dataset.
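Before dropping anything, it's worth measuring how much data you would actually lose; a minimal sketch using the df from above:
# How many rows would dropna() remove?
rows_before = len(df)
rows_after = len(df.dropna())
print(f"dropna() would remove {rows_before - rows_after} of {rows_before} rows")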
Often it's better to fill in the blanks with something reasonable:
# Fill with a specific value
df["age"].fillna(0)
df["city"].fillna("Unknown")
# Fill with the column's mean, median, or mode
df["age"].fillna(df["age"].mean())
df["age"].fillna(df["age"].median())
df["city"].fillna(df["city"].mode()[0])
# Forward fill — use the previous row's value
df["city"].ffill()
# Backward fill — use the next row's value
df["city"].bfill()
# Interpolate — estimate based on surrounding values (great for time series)
df["age"].interpolate()The right strategy depends on your data. Mean/median works for numeric data, mode for categories, and forward/backward fill for time-ordered data.
Duplicate records can skew your analysis. Maybe a form was submitted twice, or a merge went wrong.
df = pd.DataFrame({
"name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
"email": ["a@x.com", "b@x.com", "a@x.com", "c@x.com", "b@y.com"]
})
# Which rows are duplicates? (first occurrence is NOT marked as duplicate)
print(df.duplicated())
# Check for duplicates in specific columns only
print(df.duplicated(subset=["name"]))
# Count total duplicates
print(df.duplicated().sum())
# Drop duplicates — keeps the first occurrence by default
df.drop_duplicates()
# Keep the last occurrence instead
df.drop_duplicates(keep="last")
# Check duplicates based on specific columns
df.drop_duplicates(subset=["name"])
Data often arrives with the wrong types. A column of numbers might be stored as strings because one entry was "N/A". Dates might be plain text.
df = pd.DataFrame({
"price": ["10.50", "20.00", "15.75"],
"quantity": ["3", "5", "2"],
"date": ["2025-01-15", "2025-02-20", "2025-03-10"]
})
print(df.dtypes) # Everything is 'object' (string)
# Convert with .astype()
df["quantity"] = df["quantity"].astype(int)
df["price"] = df["price"].astype(float)
# pd.to_numeric() is safer — handles errors gracefully
df["price"] = pd.to_numeric(df["price"], errors="coerce") # bad values become NaN
# pd.to_datetime() for date strings
df["date"] = pd.to_datetime(df["date"])
print(df.dtypes) # Now they're the right types!
The errors="coerce" parameter is your friend. Instead of crashing on a bad value like "N/A", it quietly converts it to NaN so you can deal with it later.
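To see coercion in action, here's a tiny sketch with a deliberately bad entry (the "N/A" value is made up for illustration):
raw = pd.Series(["10.50", "N/A", "15.75"])
nums = pd.to_numeric(raw, errors="coerce")
print(nums)               # "N/A" becomes NaN instead of raising an error
print(nums.isna().sum())  # and you can count or fill it like any other missing value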
Messy strings are everywhere. Extra spaces, mixed capitalization, typos — it all needs to go.
Pandas gives you the .str accessor that lets you run string methods on entire columns at once:
df = pd.DataFrame({
"name": [" Alice ", "BOB", "charlie", " Diana Jones "],
"city": ["new york", "Los Angeles", "NEW YORK", "los angeles"],
"phone": ["555-1234", "555.5678", "555 9012", "555-3456"]
})
# Strip leading/trailing whitespace
df["name"] = df["name"].str.strip()
# Standardize case
df["name"] = df["name"].str.title() # "Alice", "Bob", "Charlie"
df["city"] = df["city"].str.lower() # all lowercase
df["city"] = df["city"].str.title() # "New York", "Los Angeles"
# Replace characters
df["phone"] = df["phone"].str.replace(".", "-", regex=False)
df["phone"] = df["phone"].str.replace(" ", "-")
# Check if a string contains a pattern
mask = df["city"].str.contains("york", case=False)
# Extract parts of strings with regex
df["area_code"] = df["phone"].str.extract(r"(\d{3})")Column names from spreadsheets or databases are often ugly — spaces, mixed case, special characters. Clean them up:
df = pd.DataFrame({
"First Name": [1], "Last Name": [2], "Email Address": [3]
})
# Rename specific columns
df = df.rename(columns={
"First Name": "first_name",
"Last Name": "last_name",
"Email Address": "email"
})
# Or clean all column names at once
df.columns = df.columns.str.lower().str.replace(" ", "_")
# Remove special characters from column names
df.columns = df.columns.str.replace(r"[^a-z0-9_]", "", regex=True)
A good convention is snake_case — all lowercase with underscores. It plays nicely with Python attribute access (df.first_name instead of df["First Name"]).
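If you clean column names often, the steps can be wrapped in a small helper. This is only a sketch; clean_columns is a made-up name, not a pandas built-in:
def clean_columns(df):
    """Return a copy of df with snake_case column names."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()
        .str.lower()
        .str.replace(" ", "_")
        .str.replace(r"[^a-z0-9_]", "", regex=True)
    )
    return out

df = pd.DataFrame({"First Name": [1], "Last Name": [2], "Email Address": [3]})
print(clean_columns(df).columns.tolist())  # ['first_name', 'last_name', 'email_address']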
Outliers are values that are way outside the normal range. They can be legitimate (a CEO's salary in an employee dataset) or errors (someone typed 999 for age). Either way, you need to decide what to do with them.
The classic statistical approach — values beyond 1.5x the IQR from the quartiles are flagged as outliers:
Q1 = df["salary"].quantile(0.25)
Q3 = df["salary"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Find outliers
outliers = df[(df["salary"] < lower_bound) | (df["salary"] > upper_bound)]
# Remove outliers
df_clean = df[(df["salary"] >= lower_bound) & (df["salary"] <= upper_bound)]Flag values more than 2-3 standard deviations from the mean:
mean = df["salary"].mean()
std = df["salary"].std()
# Values with z-score > 3 are outliers
z_scores = (df["salary"] - mean) / std
df_clean = df[z_scores.abs() <= 3]
Instead of removing outliers, you can cap them at a boundary:
# Cap values at the 5th and 95th percentiles
lower = df["salary"].quantile(0.05)
upper = df["salary"].quantile(0.95)
df["salary"] = df["salary"].clip(lower=lower, upper=upper)Sometimes you need cleaning logic that's specific to your data. That's where .apply() and .map() come in.
# .apply() runs a function on every value in a column (or every row)
df["name"] = df["name"].apply(lambda x: x.strip().title() if isinstance(x, str) else x)
# Use a regular function for more complex logic
def clean_phone(phone):
if pd.isna(phone):
return phone
digits = "".join(c for c in str(phone) if c.isdigit())
if len(digits) == 10:
return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
return phone
df["phone"] = df["phone"].apply(clean_phone)
# .map() is great for replacing values using a dictionary
status_map = {"Y": "Active", "N": "Inactive", "y": "Active", "n": "Inactive"}
df["status"] = df["status"].map(status_map)
# .replace() works similarly but keeps values that don't match
df["status"] = df["status"].replace({"Y": "Active", "N": "Inactive"})After cleaning, always check your work. Trust, but verify:
# Check for remaining missing values
print(df.isna().sum())
# Check data types
print(df.dtypes)
# Check for duplicates
print(f"Duplicates remaining: {df.duplicated().sum()}")
# Look at basic stats — do the numbers make sense?
print(df.describe())
# Check unique values in categorical columns
print(df["status"].value_counts())
# Verify value ranges
assert df["age"].between(0, 120).all(), "Age values out of range!"Here's how it all comes together in practice:
import pandas as pd
# Step 1: Load and inspect
df = pd.read_csv("messy_data.csv")
print(df.shape)
print(df.dtypes)
print(df.isna().sum())
print(df.describe())
# Step 2: Clean column names
df.columns = df.columns.str.lower().str.strip().str.replace(" ", "_")
# Step 3: Remove duplicates
df = df.drop_duplicates()
# Step 4: Fix data types
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
# Step 5: Handle missing values
df["name"] = df["name"].fillna("Unknown")
df["amount"] = df["amount"].fillna(df["amount"].median())
df = df.dropna(subset=["date"]) # drop rows where date is essential
# Step 6: Clean strings
df["name"] = df["name"].str.strip().str.title()
df["city"] = df["city"].str.strip().str.title()
# Step 7: Handle outliers
Q1 = df["amount"].quantile(0.25)
Q3 = df["amount"].quantile(0.75)
IQR = Q3 - Q1
df = df[df["amount"].between(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)]
# Step 8: Validate
print(f"Final shape: {df.shape}")
print(f"Missing values:\n{df.isna().sum()}")
print(f"Duplicates: {df.duplicated().sum()}")Check out example.py for a complete working example that creates a messy dataset and cleans it step by step.
Try the practice problems in exercises.py to test your data cleaning skills.
- Real-world data is always messy — cleaning it is a core skill, not a chore to skip
- Use .isna() and .notna() to detect missing values, then decide whether to drop or fill them
- .drop_duplicates() removes duplicate rows; use subset to check specific columns
- pd.to_numeric() and pd.to_datetime() with errors="coerce" safely convert types
- The .str accessor gives you vectorized string operations — .strip(), .lower(), .replace()
- Handle outliers with IQR, z-scores, or .clip() depending on the situation
- .apply() lets you run any custom function across a column
- Always validate after cleaning — check types, missing values, ranges, and duplicates
- Build a consistent workflow: inspect, clean columns, deduplicate, fix types, fill gaps, clean strings, handle outliers, validate