NYC Taxi Data Warehouse

A production-grade dbt data warehouse project showcasing modern data modeling, testing, and documentation practices.

Overview

This project transforms raw NYC taxi trip data into a clean, analytical data warehouse using dbt, BigQuery, and cloud-native best practices. It demonstrates:

✅ dbt Modeling — Staging, intermediate, and mart layers following Jaffle Shop conventions
✅ Data Quality — Automated tests for uniqueness, null checks, and referential integrity
✅ Documentation — Rich YAML documentation with column-level descriptions
✅ Performance — Partitioning and clustering for optimized BigQuery queries
✅ Reproducibility — Seeded sample data for consistent, portable development

Architecture

raw_trips.csv → stg_trips → int_trips_enriched → fact_trips
raw_locations.csv → stg_locations ↗

Models

Layer	Model	Purpose
Staging	`stg_trips`	Clean and standardize raw trip records
Staging	`stg_locations`	Reference table for NYC zones
Intermediate	`int_trips_enriched`	Enrich trips with location details and calculated metrics
Mart	`fact_trips`	Production fact table for analytics (partitioned by date, clustered by borough)

Key Features

Data Quality

Unique & not-null tests on fact table keys
Automated validation on every dbt run

Performance Optimization

Partitioned by pickup_date for efficient date range queries
Clustered by borough for geography-based analytics

Derived Metrics

trip_duration_minutes — calculated from pickup/dropoff times
speed_mph — average speed (distance ÷ duration)
pickup_hour — hour-of-day for time-series analysis

Getting Started

Prerequisites

dbt Cloud account (or local dbt CLI)
BigQuery project with credentials configured
Git

Installation

git clone https://github.com/valtrof/dbt-data-platform.git
cd dbt-data-platform
dbt deps
dbt seed
dbt run
dbt test

Generate Documentation

dbt docs generate
dbt docs serve

Opens interactive documentation with lineage graph and column metadata.

Project Structure

.
├── models/
│   ├── staging/          # Raw data cleaning
│   │   ├── stg_trips.sql
│   │   └── stg_locations.sql
│   ├── intermediate/     # Business logic & enrichment
│   │   └── int_trips_enriched.sql
│   └── marts/            # Analytics-ready tables
│       └── fact_trips.sql
├── seeds/                # Reference data (CSV)
├── tests/                # dbt tests & data contracts
├── dbt_project.yml       # dbt configuration
└── README.md             # This file

Sample Query

select
  pickup_borough,
  dropoff_borough,
  count(*) as trip_count,
  round(avg(trip_distance), 2) as avg_distance,
  round(avg(total_amount), 2) as avg_fare
from `voxerve.dbt_valtrof.fact_trips`
where pickup_date >= current_date() - 7
group by 1, 2
order by trip_count desc;

Cost Monitoring

scripts/bq_cost_monitor.py queries INFORMATION_SCHEMA.JOBS_BY_PROJECT to report the most expensive queries run against a project and flags any query that scanned more than 1 GB without an apparent partition filter.

pip install google-cloud-bigquery
python scripts/bq_cost_monitor.py --project my-gcp-project
python scripts/bq_cost_monitor.py --project my-gcp-project --days 14 --threshold-gb 0.5

Exit code 1 when violations are found — suitable for CI gates.

Technologies

dbt 1.6+ (data transformation)
BigQuery (data warehouse)
Git/GitHub (version control)

Author

Valeriy Trofimov — Data Platform Engineer

This is a portfolio project demonstrating dbt best practices for production data platforms.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
analyses		analyses
design		design
macros		macros
models		models
scripts		scripts
seeds		seeds
snapshots		snapshots
tests		tests
.gitignore		.gitignore
README.md		README.md
dbt_project.yml		dbt_project.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NYC Taxi Data Warehouse

Overview

Architecture

Models

Key Features

Data Quality

Performance Optimization

Derived Metrics

Getting Started

Prerequisites

Installation

Generate Documentation

Project Structure

Sample Query

Cost Monitoring

Technologies

Author

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

NYC Taxi Data Warehouse

Overview

Architecture

Models

Key Features

Data Quality

Performance Optimization

Derived Metrics

Getting Started

Prerequisites

Installation

Generate Documentation

Project Structure

Sample Query

Cost Monitoring

Technologies

Author

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages