Master Supervised Learning

Context

This repository gathers all resources (screencast code, exercise starters and solutions) for the hands-on project that runs through the course "Master Supervised Learning". The folder structure is as follows:

exercices: Starter and solution notebooks for each chapter of the course
screencasts: Code presented during each screencast
screenshots: Code used to generate certain screenshots in the course

How to use this repository

Downloading the data

You can download all the datasets you will need via this zip file. It contains:

transactions_immobilieres.parquet: the main dataset used throughout Part 1
transactions_par_ville.parquet: a dataset with city-level aggregated data, used across all parts of the course for feature and target computation
transactions_post_feature_engineering.parquet: the dataset used in Parts 2 and 3, where feature engineering is considered complete
transactions_extra_infos.parquet: a dataset containing observation metadata that cannot be used as features (id_transaction, date_transaction) but is useful for exploratory analysis. Use it alongside transactions_post_feature_engineering.parquet
features_used.json: a JSON list of all features used in the modeling phase
categorical_features_used.json: a subset of the previous file, containing only categorical features

The datasets are stored either in Parquet format (which keeps column types, is compressed, and can be read by both Pandas and Polars, making it more convenient than CSV) or in JSON format for the feature lists.
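As an illustration of how the two feature lists relate, here is a minimal sketch that loads both JSON files and derives the non-categorical features. It assumes each file is a flat JSON array of column names, which is not confirmed by this README; the function name load_feature_lists and the data_dir argument are hypothetical.

```python
import json
from pathlib import Path


def load_feature_lists(data_dir):
    """Load the feature lists and split them into categorical and other features.

    Assumption: both JSON files contain a flat array of column-name strings.
    """
    data_dir = Path(data_dir)
    features = json.loads(
        (data_dir / "features_used.json").read_text(encoding="utf-8")
    )
    categorical = json.loads(
        (data_dir / "categorical_features_used.json").read_text(encoding="utf-8")
    )

    # The categorical list is described as a subset of the full feature list,
    # so anything outside of it points to mismatched files.
    unknown = set(categorical) - set(features)
    if unknown:
        raise ValueError(f"Categorical features not in features_used.json: {sorted(unknown)}")

    # Keep the original ordering of features_used.json.
    categorical_set = set(categorical)
    other_features = [name for name in features if name not in categorical_set]
    return features, categorical, other_features
```

The subset check is only a safeguard: it fails early if the two files come from different versions of the data archive.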

If you want to rebuild the transactions_immobilieres.parquet file from scratch, you can start from the raw data in this zip file and use the preprocessing.py script.

To keep the repository lightweight, the data files and the models saved via MLflow are not included here. The code in this repository lets you rebuild them.

Important notes about the data

Each city in the dataset can be uniquely identified by a combination of the columns: id_ville, ville, and departement.

The classification target column is computed as follows:

The average price per m² is computed at the department level for a given month
For each transaction, if its price per m² is more than 10% below that average, the target value is 1
Otherwise, the target value is 0
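
The rule above can be sketched with Pandas as follows. This is an illustration under stated assumptions, not the code used in the course: the column name prix_m2, the output column name target, and the function name are hypothetical, and the monthly average is assumed to include the transaction itself. Only departement and date_transaction are column names taken from this README.

```python
import pandas as pd


def add_below_market_target(
    df,
    price_col="prix_m2",          # hypothetical column name for the price per m²
    dept_col="departement",
    date_col="date_transaction",
    threshold=0.10,
):
    """Return a copy of df with a binary 'target' column.

    target = 1 when the transaction's price per m² is more than `threshold`
    (10% by default) below the average of its department for that month.
    """
    # Reduce each transaction date to its calendar month.
    month = pd.to_datetime(df[date_col]).dt.to_period("M")

    # Average price per m² for each (department, month) pair,
    # broadcast back to every transaction of that pair.
    monthly_avg = df.groupby([df[dept_col], month])[price_col].transform("mean")

    out = df.copy()
    out["target"] = (df[price_col] < (1 - threshold) * monthly_avg).astype(int)
    return out
```

With this definition, a transaction priced exactly 10% below the average gets the value 0, since the rule requires it to be more than 10% below.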

How to use the code

This repo uses Poetry for package management and virtual environment creation. The pyproject.toml file contains all package versions used in the course.

Due to dependency conflicts with MLflow, the BentoML package (used in Part 3, Chapter 2) is not listed among the installed dependencies. However, you will find the exact version used for the screencast and exercise code as a comment in the pyproject.toml file.

For the exercises, we recommend storing all datasets in the same location and pointing to that path via an environment variable (as done in the settings.py file). You can implement this by following this tutorial for example.
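
As a rough idea of what this can look like, here is a minimal sketch that resolves dataset paths from an environment variable. The variable name DATA_DIR, the fallback folder data, and both function names are assumptions made for illustration; the actual settings.py file may be organized differently.

```python
import os
from pathlib import Path


def get_data_dir(env_var="DATA_DIR", default="data"):
    """Return the folder holding the datasets.

    Reads the path from the environment variable `env_var` (hypothetical name)
    and falls back to a local `default` folder when it is not set.
    """
    return Path(os.environ.get(env_var, default)).expanduser()


def dataset_path(filename, env_var="DATA_DIR"):
    """Build the full path of a dataset file inside the data folder."""
    return get_data_dir(env_var) / filename
```

A notebook could then call dataset_path("transactions_immobilieres.parquet") and pass the result to its Parquet reader, so the same code works on any machine where the variable is set.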
