Data-Engineering

Introduction

Learn the core principles of data engineering by constructing a complete data pipeline from the ground up. Develop practical skills using industry-standard tools and best practices.

This repository is based on the Data Engineering Zoomcamp by DataTalksClub.

Index

For more information on each part, please read the READ_ME.md file inside that part's folder.

1. Docker

  • Use Docker to build and run a single container image, and Docker Compose to define and run several containers from one file.

  • Run PostgreSQL with Docker and build a data pipeline that ingests NYC TLC trip data into the database (a minimal sketch of such a script follows this list).

  • Run the entire workflow in the cloud with Google Cloud Platform: first create a VM instance (Compute Engine), then install Anaconda, Docker, Docker Compose, PostgreSQL, and pgcli on it. Access the remote machine from VS Code over SSH.
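
The ingestion step above can be a small Python script run against the Dockerized PostgreSQL. A minimal sketch, assuming Postgres is reachable on localhost:5432 with illustrative credentials (root/root, database ny_taxi) and a placeholder download URL:

```python
# Minimal sketch: ingest one month of NYC TLC trip data into PostgreSQL.
# Assumes Postgres is already running in Docker; the credentials, database
# name, table name, and URL below are all illustrative placeholders.

import pandas as pd
from sqlalchemy import create_engine

URL = "https://example.com/yellow_tripdata_2021-01.parquet"  # placeholder
TABLE = "yellow_taxi_data"

engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

df = pd.read_parquet(URL)  # requires pyarrow (or fastparquet) installed

# Create the table from the DataFrame schema, then append the rows in
# chunks so a large month does not exhaust memory.
df.head(0).to_sql(TABLE, engine, if_exists="replace", index=False)
df.to_sql(TABLE, engine, if_exists="append", index=False, chunksize=100_000)

print(f"Ingested {len(df)} rows into {TABLE}")
```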

2. Kestra

  • Workflow orchestration with Kestra: use Kestra to automate the workflow above. A Schedule trigger can run the flow on the first day of every month to collect new data, and Kestra's backfill feature can re-run the flow for months that were missed before the current one (the underlying logic is sketched below).
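
The Schedule trigger and backfill boil down to iterating over months. A minimal Python sketch of that logic, assuming one file is ingested per month; the helper below is illustrative and not part of Kestra's API (Kestra flows themselves are defined declaratively):

```python
# Sketch of the backfill idea: enumerate every month from a start date
# up to (but not including) the current month and ingest each one.
# Kestra's Schedule trigger plus backfill automates exactly this loop.

from datetime import date

def months_to_backfill(start: date, today: date) -> list[str]:
    """Return 'YYYY-MM' strings from start up to the month before today."""
    months = []
    year, month = start.year, start.month
    while (year, month) < (today.year, today.month):
        months.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months

for ym in months_to_backfill(date(2021, 1, 1), date.today()):
    # In the real flow this step would download and ingest that month's
    # file, e.g. yellow_tripdata_{ym}.parquet from the NYC TLC site.
    print(f"would ingest month {ym}")
```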

3. Data warehouse

  • Use BigQuery as the data warehouse, ingesting Parquet files that were previously loaded into Google Cloud Storage (sketched below).
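
The ingestion itself is a single BigQuery load job pointed at the Parquet files in Cloud Storage. A minimal sketch with the google-cloud-bigquery client; the bucket, project, dataset, and table names are placeholders:

```python
# Minimal sketch: load Parquet files from Cloud Storage into BigQuery.
# The bucket, project, dataset, and table names are placeholders.

from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,  # schema is read from the files
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/trip-data/yellow_tripdata_*.parquet",  # wildcard over monthly files
    "my-project.trips_data_all.yellow_tripdata",
    job_config=job_config,
)
load_job.result()  # block until the load job finishes

table = client.get_table("my-project.trips_data_all.yellow_tripdata")
print(f"Loaded {table.num_rows} rows")
```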

About

This repository is a place where I learn about Data Engineering.
