Function decorators for Pandas and Polars DataFrame validation - columns, data types, and row-level validation with Pydantic

Daffy - DataFrame Validator


Description

Working with DataFrames often means passing them through multiple transformation functions, making it easy to lose track of their structure over time. Daffy adds runtime validation and documentation to your DataFrame operations through simple decorators. Declare the expected columns and types in your function definitions, and Daffy checks them every time the function runs:

@df_in(columns=["price", "bedrooms", "location"])
@df_out(columns=["price_per_room", "price_category"])
def analyze_housing(houses_df):
    # Transform raw housing data into price analysis
    return analyzed_df

Like type hints for DataFrames, Daffy helps you catch structural mismatches early and keeps your data pipeline documentation synchronized with the code. Column validation is lightweight and fast. For deeper validation, Daffy also supports row-level validation using Pydantic models. Compatible with both Pandas and Polars.

Key Features

Column Validation (lightweight, minimal overhead):

  • Validate DataFrame columns at function entry and exit points
  • Support regex patterns for matching column names (e.g., "r/column_\d+/")
  • Check data types of columns
  • Control strictness of validation (allow or disallow extra columns)

Row Validation (optional, requires Pydantic >= 2.4.0):

  • Validate row data using Pydantic models
  • Batch validation for optimal performance
  • Informative error messages showing which rows failed and why

General:

  • Works with both Pandas and Polars DataFrames
  • Project-wide configuration via pyproject.toml
  • Integrated logging for DataFrame structure inspection
  • Enhanced type annotations for improved IDE and type checker support
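Project-wide configuration might look like the fragment below. The `[tool.daffy]` table name and the `strict` key are assumptions based on the feature list above; check the documentation for the exact supported keys.

```toml
# pyproject.toml (sketch; key names are assumptions)
[tool.daffy]
strict = true  # reject extra columns in every @df_in/@df_out by default
```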

When to Use Daffy

Different tools serve different needs. Here's how Daffy compares:

| Use Case | Daffy | Pandera | Great Expectations |
| --- | --- | --- | --- |
| Function boundary guardrails | ✅ Primary focus | ⚠️ Possible via decorators | ❌ Not designed for this |
| Quick column/type checks | ✅ Lightweight | ⚠️ Requires schema definitions | ⚠️ Requires Data Context setup |
| Complex statistical checks | ⚠️ Limited | ✅ Many built-in | ✅ Extensive |
| Pipeline/warehouse-wide QA | ❌ Not designed for this | ⚠️ Some support | ✅ Primary focus |

Philosophy

  • Non-intrusive: Just add decorators - no refactoring, no custom DataFrame types, no schema files
  • Easy to adopt and remove: Add Daffy in 30 seconds, remove it just as fast if needed
  • Stay in-process: No external stores, orchestrators, or infrastructure
  • Minimal overhead: Column validation is essentially free; pay for row validation only when you need it

Documentation

Installation

Install with your favorite Python dependency manager:

pip install daffy

Daffy works with pandas, polars, or both - install whichever you need:

pip install pandas   # for pandas support
pip install polars   # for polars support

Python version support: 3.9 - 3.14

Quick Start

Column Validation

from daffy import df_in, df_out

@df_in(columns=["Brand", "Price"])  # Validate input DataFrame columns
@df_out(columns=["Brand", "Price", "Discount"])  # Validate output DataFrame columns
def apply_discount(cars_df):
    cars_df = cars_df.copy()
    cars_df["Discount"] = cars_df["Price"] * 0.1
    return cars_df

Row Validation

For validating actual data values (requires pip install 'pydantic>=2.4.0'):

from pydantic import BaseModel, Field
from daffy import df_in

class Product(BaseModel):
    name: str
    price: float = Field(gt=0)  # Price must be positive
    stock: int = Field(ge=0)    # Stock must be non-negative

@df_in(row_validator=Product)
def process_inventory(df):
    # Process inventory data with validated rows
    return df

Performance

Column validation is essentially free - it only checks column names and types, adding negligible overhead to your functions.

Row validation checks actual data values and is naturally more expensive, but its batched Pydantic validation keeps it fast:

  • Simple validation: ~770K rows/sec (100K rows in 130ms)
  • Complex validation: ~165K rows/sec (32 columns, missing values, cross-field validation)

Benchmarked on MacBook Pro M1 Pro. Performance depends on:

  • Model complexity: Number of fields, validators, and custom validation logic
  • Data characteristics: DataFrame size, missing values, data types
  • Hardware: CPU speed, available memory

For detailed benchmarks and optimization strategies, see scripts/README_BENCHMARKS.md.

License

MIT
