Skip to content

valtrof/dbt-data-platform

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

NYC Taxi Data Warehouse

A production-grade dbt data warehouse project showcasing modern data modeling, testing, and documentation practices.

Overview

This project transforms raw NYC taxi trip data into a clean, analytical data warehouse using dbt, BigQuery, and cloud-native best practices. It demonstrates:

  • dbt Modeling — Staging, intermediate, and mart layers following Jaffle Shop conventions
  • Data Quality — Automated tests for uniqueness, null checks, and referential integrity
  • Documentation — Rich YAML documentation with column-level descriptions
  • Performance — Partitioning and clustering for optimized BigQuery queries
  • Reproducibility — Seeded sample data for consistent, portable development

Architecture

raw_trips.csv → stg_trips → int_trips_enriched → fact_trips
raw_locations.csv → stg_locations ↗

Models

Layer Model Purpose
Staging stg_trips Clean and standardize raw trip records
Staging stg_locations Reference table for NYC zones
Intermediate int_trips_enriched Enrich trips with location details and calculated metrics
Mart fact_trips Production fact table for analytics (partitioned by date, clustered by borough)

Key Features

Data Quality

  • Unique & not-null tests on fact table keys
  • Automated validation on every dbt run

Performance Optimization

  • Partitioned by pickup_date for efficient date range queries
  • Clustered by borough for geography-based analytics

Derived Metrics

  • trip_duration_minutes — calculated from pickup/dropoff times
  • speed_mph — average speed (distance ÷ duration)
  • pickup_hour — hour-of-day for time-series analysis

Getting Started

Prerequisites

  • dbt Cloud account (or local dbt CLI)
  • BigQuery project with credentials configured
  • Git

Installation

git clone https://github.com/valtrof/dbt-data-platform.git
cd dbt-data-platform
dbt deps
dbt seed
dbt run
dbt test

Generate Documentation

dbt docs generate
dbt docs serve

Opens interactive documentation with lineage graph and column metadata.

Project Structure

.
├── models/
│   ├── staging/          # Raw data cleaning
│   │   ├── stg_trips.sql
│   │   └── stg_locations.sql
│   ├── intermediate/     # Business logic & enrichment
│   │   └── int_trips_enriched.sql
│   └── marts/            # Analytics-ready tables
│       └── fact_trips.sql
├── seeds/                # Reference data (CSV)
├── tests/                # dbt tests & data contracts
├── dbt_project.yml       # dbt configuration
└── README.md             # This file

Sample Query

select
  pickup_borough,
  dropoff_borough,
  count(*) as trip_count,
  round(avg(trip_distance), 2) as avg_distance,
  round(avg(total_amount), 2) as avg_fare
from `voxerve.dbt_valtrof.fact_trips`
where pickup_date >= current_date() - 7
group by 1, 2
order by trip_count desc;

Cost Monitoring

scripts/bq_cost_monitor.py queries INFORMATION_SCHEMA.JOBS_BY_PROJECT to report the most expensive queries run against a project and flags any query that scanned more than 1 GB without an apparent partition filter.

pip install google-cloud-bigquery
python scripts/bq_cost_monitor.py --project my-gcp-project
python scripts/bq_cost_monitor.py --project my-gcp-project --days 14 --threshold-gb 0.5

Exit code 1 when violations are found — suitable for CI gates.

Technologies

  • dbt 1.6+ (data transformation)
  • BigQuery (data warehouse)
  • Git/GitHub (version control)

Author

Valeriy Trofimov — Data Platform Engineer


This is a portfolio project demonstrating dbt best practices for production data platforms.

About

Production-grade dbt data warehouse on BigQuery: 3-layer modeling (staging → intermediate → mart), fact table partitioned by date and clustered by borough, automated data quality tests, dbt docs with lineage graph.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages