Creating DataLoaders Before Model Training
Before training any deep learning model with frameworks such as PyTorch or TensorFlow, it is important to understand how data is supplied to the model during training and evaluation. This is handled by DataLoaders.
What is a DataLoader?
A DataLoader acts as a bridge between raw data stored on disk and the training loop. Instead of loading the entire dataset into memory, it fetches data in small batches during runtime.
A DataLoader typically handles:
- Reading data samples from disk
- Converting samples into tensors
- Batching multiple samples together
- Shuffling data during training
- Loading data efficiently (often in parallel)
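As a minimal illustration (using a toy in-memory TensorDataset rather than real images, purely to show batching and shuffling), a PyTorch sketch might look like this:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy in-memory dataset: 100 "images" of shape 3x32x32 with integer class labels
images = torch.randn(100, 3, 32, 32)
labels = torch.randint(0, 10, (100,))
dataset = TensorDataset(images, labels)

# The DataLoader handles batching and shuffling; num_workers > 0 would enable parallel loading
loader = DataLoader(dataset, batch_size=16, shuffle=True)

# Fetch one batch to see what the model would receive
batch_images, batch_labels = next(iter(loader))
print(batch_images.shape)  # torch.Size([16, 3, 32, 32])
print(batch_labels.shape)  # torch.Size([16])
```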
Core Idea Behind Implementation
The data pipeline is usually split into two parts:
1. Dataset
- Defines how a single data sample is accessed
- Knows where the data is stored
- Handles loading and basic preprocessing (e.g., reading an image, resizing, normalization)
2. DataLoader
- Wraps the Dataset
- Manages batching, shuffling, and parallel loading
- Feeds batches of data to the model during training
This separation keeps data handling modular and independent of the model architecture.
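For illustration only, a hypothetical image Dataset and its DataLoader wrapper could be sketched as below; the class name, the `data/train` directory, and the `train_labels` list are placeholder assumptions, not part of this repository:

```python
import os
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class CustomImageDataset(Dataset):
    """Hypothetical Dataset: one image file per sample, loaded lazily from disk."""
    def __init__(self, image_dir, labels, transform=None):
        self.image_dir = image_dir
        self.files = sorted(os.listdir(image_dir))
        self.labels = labels            # labels aligned with self.files
        self.transform = transform

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        path = os.path.join(self.image_dir, self.files[idx])
        image = Image.open(path).convert("RGB")
        if self.transform:
            image = self.transform(image)
        return image, self.labels[idx]

# Preprocessing lives with the Dataset; batching/shuffling lives with the DataLoader
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
# dataset = CustomImageDataset("data/train", train_labels, transform=transform)  # hypothetical paths/labels
# loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)
```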
Why This Step Is Important
Proper use of DataLoaders:
- Ensures consistent preprocessing
- Prevents data leakage between splits
- Improves training performance
- Makes experiments easier to reproduce and scale
Understanding and designing the data loading pipeline is a fundamental step that should be completed before experimenting with model architectures or training strategies.
Task Description
In this issue, you are expected to:
- Design a custom image Dataset class implementing methods such as __getitem__, __len__, and other relevant functions
- Wrap the Dataset inside a DataLoader (PyTorch preferred; TensorFlow is also acceptable)
- Write a small loop demonstrating how the DataLoader works during iteration
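As a rough sketch of the demonstration loop (assuming a `loader` object built as in the examples above):

```python
# Inspect the first few batches produced by the DataLoader
for batch_idx, (images, labels) in enumerate(loader):
    print(f"batch {batch_idx}: images {tuple(images.shape)}, labels {tuple(labels.shape)}")
    if batch_idx == 2:  # stop after three batches; a real training loop would continue
        break
```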
Contribution Details
Implementation Notes
- This task must be completed inside the participants folder, under your enrolment number directory
- You may:
- Implement it in a separate notebook, or
- Add it to a previously used notebook
If Working on Kaggle
- Make the required changes directly in your existing Kaggle notebook
- Download the updated notebook after making changes
- Upload the updated version to the repository
- Follow the PR template as specified in previous issues