Learn the core principles of data engineering by constructing a complete data pipeline from the ground up. Develop practical skills using industry-standard tools and best practices.
This repository is based on the Data Engineering Zoomcamp by DataTalksClub.
For more information on each part, please read the READ_ME.md file inside its folder.
1. Docker
- Docker and Docker Compose: build and run a single image, or multiple services defined together in one Compose file.
- Run PostgreSQL with Docker and build a data pipeline that ingests NYC_TLC trip data into the database (see the ingestion sketch after this list).
- Run the entire workflow in the cloud with Google Cloud Platform: first create a VM instance (Compute Engine), then install Anaconda, Docker, Docker Compose, PostgreSQL, and pgcli to run the workflow. Access the remote machine with VS Code and the Remote - SSH extension.
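
A minimal Python sketch of the ingestion step, assuming the PostgreSQL container from the Compose stack is published on `localhost:5432` with user/password `root` and database `ny_taxi` (adjust to your own setup); the yellow-taxi file shown is only an example month:

```python
import urllib.request

import pandas as pd
from sqlalchemy import create_engine

# Example NYC TLC monthly file (yellow taxi, January 2021).
URL = ("https://d37ci6vzurychx.cloudfront.net/trip-data/"
       "yellow_tripdata_2021-01.parquet")


def ingest(url: str = URL, table: str = "yellow_taxi_data") -> None:
    # Download the monthly Parquet file to a local temp path and read it.
    local_path, _ = urllib.request.urlretrieve(url)
    df = pd.read_parquet(local_path)

    # PostgreSQL running in the Docker Compose stack (assumed credentials).
    engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

    # Write in chunks so a large month does not exhaust memory.
    df.to_sql(table, engine, if_exists="replace", index=False, chunksize=100_000)


if __name__ == "__main__":
    ingest()
```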
2. Kestra
- Workflow orchestration with Kestra: use Kestra to automate the workflow above. A schedule trigger runs it on the first day of every month to collect the new data, and a backfill fills in any missing months before the current one (see the sketch after this list).
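
The flow itself is defined in Kestra's YAML in the repository; purely as an illustration of the idea behind the monthly trigger and the backfill, the Python sketch below builds the month's NYC_TLC URL and enumerates the months a backfill would iterate over. The URL template and date range are assumptions for illustration:

```python
from datetime import date

# Assumed NYC TLC URL pattern, parameterized by year and month.
URL_TEMPLATE = ("https://d37ci6vzurychx.cloudfront.net/trip-data/"
                "yellow_tripdata_{year}-{month:02d}.parquet")


def month_url(run_date: date) -> str:
    """Build the download URL for the month of the given execution date."""
    return URL_TEMPLATE.format(year=run_date.year, month=run_date.month)


def missing_months(start: date, end: date) -> list[date]:
    """First day of every month from start up to (not including) end's month,
    i.e. the dates a backfill would iterate over."""
    months, current = [], date(start.year, start.month, 1)
    while current < date(end.year, end.month, 1):
        months.append(current)
        # Advance to the first day of the next month.
        if current.month == 12:
            current = date(current.year + 1, 1, 1)
        else:
            current = date(current.year, current.month + 1, 1)
    return months


if __name__ == "__main__":
    # A backfill from January 2021 up to the current month would fetch:
    for m in missing_months(date(2021, 1, 1), date.today()):
        print(month_url(m))
```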
3. Data warehouse
- Use BigQuery as the data warehouse to ingest the Parquet files that were loaded into Cloud Storage (see the sketch below).
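
A small sketch of the load step using the `google-cloud-bigquery` client; the bucket, dataset, and table names are placeholders, and credentials are assumed to be configured (for example via `GOOGLE_APPLICATION_CREDENTIALS`):

```python
from google.cloud import bigquery


def load_parquet_to_bq(
    gcs_uri: str = "gs://your-bucket/trip-data/yellow_tripdata_2021-*.parquet",
    table_id: str = "your-project.trips_data_all.yellow_tripdata",
) -> None:
    client = bigquery.Client()

    # Load Parquet files from Cloud Storage, replacing the table contents.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )

    # Start the load job and block until it finishes.
    load_job = client.load_table_from_uri(gcs_uri, table_id, job_config=job_config)
    load_job.result()

    table = client.get_table(table_id)
    print(f"Loaded {table.num_rows} rows into {table_id}")


if __name__ == "__main__":
    load_parquet_to_bq()
```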