A production-grade dbt data warehouse project showcasing modern data modeling, testing, and documentation practices.
This project transforms raw NYC taxi trip data into a clean, analytical data warehouse using dbt, BigQuery, and cloud-native best practices. It demonstrates:
- ✅ dbt Modeling — Staging, intermediate, and mart layers following Jaffle Shop conventions
- ✅ Data Quality — Automated tests for uniqueness, null checks, and referential integrity
- ✅ Documentation — Rich YAML documentation with column-level descriptions
- ✅ Performance — Partitioning and clustering for optimized BigQuery queries
- ✅ Reproducibility — Seeded sample data for consistent, portable development
raw_trips.csv → stg_trips → int_trips_enriched → fact_trips
raw_locations.csv → stg_locations ↗
| Layer | Model | Purpose |
|---|---|---|
| Staging | stg_trips |
Clean and standardize raw trip records |
| Staging | stg_locations |
Reference table for NYC zones |
| Intermediate | int_trips_enriched |
Enrich trips with location details and calculated metrics |
| Mart | fact_trips |
Production fact table for analytics (partitioned by date, clustered by borough) |
- Unique & not-null tests on fact table keys
- Automated validation on every
dbt run
- Partitioned by
pickup_datefor efficient date range queries - Clustered by borough for geography-based analytics
trip_duration_minutes— calculated from pickup/dropoff timesspeed_mph— average speed (distance ÷ duration)pickup_hour— hour-of-day for time-series analysis
- dbt Cloud account (or local dbt CLI)
- BigQuery project with credentials configured
- Git
git clone https://github.com/valtrof/dbt-data-platform.git
cd dbt-data-platform
dbt deps
dbt seed
dbt run
dbt testdbt docs generate
dbt docs serveOpens interactive documentation with lineage graph and column metadata.
.
├── models/
│ ├── staging/ # Raw data cleaning
│ │ ├── stg_trips.sql
│ │ └── stg_locations.sql
│ ├── intermediate/ # Business logic & enrichment
│ │ └── int_trips_enriched.sql
│ └── marts/ # Analytics-ready tables
│ └── fact_trips.sql
├── seeds/ # Reference data (CSV)
├── tests/ # dbt tests & data contracts
├── dbt_project.yml # dbt configuration
└── README.md # This file
select
pickup_borough,
dropoff_borough,
count(*) as trip_count,
round(avg(trip_distance), 2) as avg_distance,
round(avg(total_amount), 2) as avg_fare
from `voxerve.dbt_valtrof.fact_trips`
where pickup_date >= current_date() - 7
group by 1, 2
order by trip_count desc;scripts/bq_cost_monitor.py queries INFORMATION_SCHEMA.JOBS_BY_PROJECT to report the most expensive queries run against a project and flags any query that scanned more than 1 GB without an apparent partition filter.
pip install google-cloud-bigquery
python scripts/bq_cost_monitor.py --project my-gcp-project
python scripts/bq_cost_monitor.py --project my-gcp-project --days 14 --threshold-gb 0.5Exit code 1 when violations are found — suitable for CI gates.
- dbt 1.6+ (data transformation)
- BigQuery (data warehouse)
- Git/GitHub (version control)
Valeriy Trofimov — Data Platform Engineer
This is a portfolio project demonstrating dbt best practices for production data platforms.