
LDCast implementation in MLCast (in progress) #8

Open
martinbo-meteo wants to merge 13 commits into mlcast-community:main from martinbo-meteo:main

Conversation

@martinbo-meteo

I worked on the LDCast code, trying to reimplement it to make it clearer and to make it compliant with the MLCast base classes.

So far, I have essentially tried to make the logic between the autoencoder, the conditioner and the denoiser clearer. I think the logic between the denoiser and the samplers could still be improved.

I also removed some parts of the code which were not used in the original code, again to make things clearer.

The README contains the current status of the code and how it is organized. I also try to track all changes in the commit messages.

Martin Bonte added 3 commits February 11, 2026 10:31
…sed,

and to make the three main components (the autoencoder, the conditioner and the diffuser) distinct entities. Thanks to this, the way the data flows through these three components is clearer (see LDCast.ipynb).
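
To make the data flow concrete, here is a rough, purely illustrative sketch in which autoencoder, conditioner, denoiser and sampler are assumed objects (not the actual classes in this branch): the conditioner builds a context from the past frames, a sampler runs the reverse diffusion in latent space by repeatedly calling the denoiser with that context, and the autoencoder decodes the resulting latent back to precipitation fields.

import torch

# Illustrative only: the argument objects stand in for the three components
# (plus a sampler) described above.
def nowcast(past_frames: torch.Tensor, autoencoder, conditioner, denoiser, sampler) -> torch.Tensor:
    context = conditioner(past_frames)          # condition on the observed past frames
    latent = sampler.sample(denoiser, context)  # reverse diffusion in latent space
    return autoencoder.decode(latent)           # decode back to precipitation fields
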
The files which were taken from the original code without change are:
ldcast/utils.py (from ldcast/models/utils.py)
ldcast/distributions.py (from ldcast/models/distributions.py)
autoenc/autoenc.py
autoenc/encoder.py
blocks/afno.py
blocks/attention.py
blocks/resnet.py
diffusion/ema.py
diffusion/utils.py
diffusion/unet.py (which was in the genforecast folder, even though unet
is used only in the denoiser)

The changes I made are essentially:
diffusion/diffusion.py: the LatentDiffusion class was given three parts (model = denoiser, autoencoder, and context_encoder = conditioner), and the interactions between these three were not clear. The main class is now DiffusionModel, which needs only the denoiser to be instantiated (the forward call still needs the context given by the conditioner); a rough sketch of this follows below. The interaction between DiffusionModel and the PLMSSampler could be improved (merge the two?). I removed the ema scope for now, but it should be taken care of.
diffusion/plms.py: changed only the way the model (denoiser) is called
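
As a rough sketch of the interface described above (signatures are illustrative, not the actual code in this branch): DiffusionModel owns only the denoiser, and the context produced by the conditioner is passed in at call time.

import torch.nn as nn

class DiffusionModel(nn.Module):
    def __init__(self, denoiser: nn.Module):
        super().__init__()
        self.denoiser = denoiser  # the only component needed at construction time

    def forward(self, noisy_latent, t, context):
        # predict the noise at diffusion step t, conditioned on the context
        # produced by the conditioner
        return self.denoiser(noisy_latent, t, context=context)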

I removed the genforecast folder of the original code: unet.py is now in diffusion, and analysis.py is now context/context.py. The context folder also contains nowcast.py (which comes from nowcast/nowcast.py), so that it contains everything needed to build the conditioner.

I reworked the context/nowcast.py file: I removed the Nowcaster, AFNONowcastNetBasic and AFNONowcastNet classes (which were not used), and simplified the code of the two remaining classes a little (some parts were not used either).
The AFNONowcastNetBase class was also taking the autoencoder as input to build the conditioner, which I find very weird (this is why the data seemed to be decoded but not encoded in forecast.Forecast.__call__...). Now, the conditioner is built without the autoencoder.

ldcast.py is a new file which will contain the classes subclassing the
base classes of mlcast.
 - some changes in /src/mlcast/models/base.py (attribute problem if the loss is a dict)
 - in /src/mlcast/models/ldcast/autoenc/autoenc.py: added the loss for the autoencoder; renamed the 'AutoencoderKL' class to 'AutoencoderKLNet' and set the encoder and decoder to default configurations (which were the ones used in the original code); see the sketch after this list. Removed all the training logic from that class (it will be handled by the trainer)
 - in src/mlcast/models/ldcast/context/context.py: the timesteps passed to AFNONetCascade.forward were always [-3, -2, -1, 0], so I included them in this function
 - in /src/mlcast/models/ldcast/diffusion/diffusion.py: I tried to separate as much as possible what concerns the samplers from the rest. I replaced the LatentDiffusion class with the LatentNowcaster one. I could not manage to make this a subclass of NowcastingLightningModule, but it would be nice to do so. I removed all the training logic which was contained in the LatentDiffusion class (it will be handled by the trainer)
 - /src/mlcast/models/ldcast/diffusion/plms.py: removed the score_corrector, corrector_kwargs and mask keywords, which were not used
 - /src/mlcast/models/ldcast/ldcast.py now contains the main LDCast class subclassing the NowcastingModelBase (only the predict method is implemented, partially)
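
As a rough illustration of the autoenc.py item above (all signatures and helper names are assumptions, not the actual code): AutoencoderKLNet only wraps the encoder and decoder, falls back to default configurations when none are given, and contains no training logic.

import torch.nn as nn

def default_encoder() -> nn.Module:
    # placeholder standing in for the default encoder configuration of the original code
    return nn.Identity()

def default_decoder() -> nn.Module:
    # placeholder standing in for the default decoder configuration of the original code
    return nn.Identity()

class AutoencoderKLNet(nn.Module):
    def __init__(self, encoder: nn.Module | None = None, decoder: nn.Module | None = None):
        super().__init__()
        self.encoder = encoder if encoder is not None else default_encoder()
        self.decoder = decoder if decoder is not None else default_decoder()

    def encode(self, x):
        return self.encoder(x)

    def decode(self, z):
        return self.decoder(z)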

I did not manage to make LatentNowcaster a subclass of NowcastingLightningModule because LatentNowcaster needs two nets (denoiser and conditioner) and because the training logic is not as straightforward as it currently is in NowcastingLightningModule. One should also take into account the fact that two different samplers are used for training and inference, so the forward method cannot just be self.net(x); see the sketch below.
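
Illustrative sketch only (names, attributes and signatures are assumptions): the module needs two nets, and the training and prediction paths go through different samplers, so forward cannot simply be self.net(x).

import pytorch_lightning as pl

class LatentNowcaster(pl.LightningModule):
    def __init__(self, denoiser, conditioner, train_sampler, inference_sampler):
        super().__init__()
        self.denoiser = denoiser
        self.conditioner = conditioner
        self.train_sampler = train_sampler          # sampler used during training
        self.inference_sampler = inference_sampler  # e.g. the PLMS sampler

    def training_step(self, batch, batch_idx):
        past, target_latent = batch
        context = self.conditioner(past)
        # the training path goes through the training sampler, not self.net(x)
        return self.train_sampler.loss(self.denoiser, target_latent, context)

    def predict_step(self, batch, batch_idx):
        past, _ = batch
        context = self.conditioner(past)
        return self.inference_sampler.sample(self.denoiser, context)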

It would be nice to have cleaner and more consistent APIs for the samplers. For the moment, the PLMSSampler and the SimpleSampler are not totally consistent in their APIs, because the SimpleSampler (is there a better/more common name for this one?) was only used during training, while the PLMSSampler was used during inference. The handling of each sampler's schedule with respect to the schedule saved in the denoiser could also be clearer; a sketch of a possible common interface is below.
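
Purely as a discussion aid, a consistent sampler API could be as small as a shared signature that both samplers implement, each deriving its own schedule from the one stored in the denoiser. This is a hypothetical interface, not existing code:

from typing import Protocol

import torch

class Sampler(Protocol):
    def set_schedule(self, denoiser) -> None:
        """Derive this sampler's schedule from the one stored in the denoiser."""

    def sample(self, denoiser, context: torch.Tensor, shape: tuple[int, ...],
               num_steps: int) -> torch.Tensor:
        """Run the reverse diffusion and return a latent sample."""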

During training, an EMA scope was used for the weights of the denoiser. I removed this for the moment, but it should be reincluded in some way.

The same variable name sometimes refers to the timesteps of the diffusion process (= 1000 during training) and sometimes to the nowcasting timesteps (where each timestep = 5 minutes). It would be better to have different names.

An AutoencoderKLNet instance can now be passed to the NowcastingLightningModule together with the autoenc_loss to handle the training.

In /src/mlcast/models/ldcast/diffusion/diffusion.py, one has to choose which sampler to use for testing
@martinbo-meteo
Author

I also included current questions/problems at the end of the README

Martin Bonte and others added 10 commits February 24, 2026 14:44
 - in base.py, I added a training_logic method to NowcastingLightningModule; this method can be overridden in case the training logic is not straightforward (as is the case for diffusion models); I also added a print_log_loss method to take care of printing and logging the loss
 - in autoenc.py, by default autoencoder.encode now returns only what is used as the latent encoding (which is the mean, see README.md)
 - I have understood that samplers are only used in inference! The training (and validation) step is always done by predicting the noise (or a quantity which is related to it by a simple formula); see the sketch after this list. The scheduler has some role to play before training, and I put the code of the scheduler in the diffusion/scheduler.py file (the SimpleSampler is not used anymore, because it was in fact the scheduler)
 - I added the LatentDiffusion and LatentDiffusionLightning classes in diffusion/diffusion.py (the latter replaces the LatentNowcaster class)
 - to train the latent diffusion part, I created a LatentDataset class in data.py which converts the data into latent space with the autoencoder (to be used once the autoencoder has been trained)
 - I updated a bit the LDCast class in ldcast.py (this is where the main part of the work remains to be done)
 - I updated the README accordingly, with examples on how to deal with these different parts; I also added some basic details I have understood about diffusion models and the variational autoencoder
 - I reincluded the ema weights and changed the implementation a little. In the original code, ema weights were used within a python scope. It seems more standard to have an object holding the ema weights and, using lightning hooks, to apply the ema weights before validation and test steps and before inference, and to restore the model weights after these. The original code was holding the ema weights as buffers to have them saved automatically, but it is simpler (and, it seems, more standard) to hold them in a dictionary. For that, I changed diffusion/ema.py and diffusion/diffusion.py
 - I removed the method that loads the weights of the denoiser and of the conditioner from the way they were originally saved, and put this code in a function in original_weights.py (I think it is cleaner not to have this method on the LatentDiffusion class). In original_weights.py, I added a function to check that the saved buffers are the same as the ones already in the ldm.
 - I added to the LDCast class (ldcast.py) methods to load and save weights from a folder where the weights of the autoencoder, of the conditioner and of the denoiser (and ema weights if any) are stored.
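
To make the noise-prediction training step mentioned above concrete, here is a minimal sketch; the scheduler attributes and methods shown (num_steps, add_noise) are assumed names, not necessarily the actual API of diffusion/scheduler.py.

import torch
import torch.nn.functional as F

def diffusion_training_loss(denoiser, scheduler, latent, context):
    # draw a random diffusion step for each sample in the batch
    t = torch.randint(0, scheduler.num_steps, (latent.shape[0],), device=latent.device)
    noise = torch.randn_like(latent)
    # corrupt the clean latent according to the noise schedule at step t
    noisy_latent = scheduler.add_noise(latent, noise, t)
    # the denoiser predicts the noise (or a quantity related to it by a simple formula)
    predicted_noise = denoiser(noisy_latent, t, context=context)
    return F.mse_loss(predicted_noise, noise)
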
…few typos generating errors:

 - the EMA class can save its weights, and the weights can be loaded through methods of the class
 - changed the absolute imports into relative imports
 - worked on the LDCast class (in ldcast.py): it can be loaded from a yaml file or from a dict containing the config, and I implemented a very minimal version of the fit method (see the sketch after this list)
 - I also changed the convert_original_weights function in original_weights.py so that it handles all weights (conditioner, denoiser, scheduling buffers and ema weights)
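
A possible usage of the config loading described above might look like the following; the constructor name from_config and the file name are illustrative assumptions, and the import path just follows the file layout mentioned earlier.

import yaml

from mlcast.models.ldcast.ldcast import LDCast  # module path following the layout described above

with open("config.yaml") as f:
    config = yaml.safe_load(f)

ldcast = LDCast.from_config(config)  # hypothetical constructor name
# ldcast.fit(dataset) would then run the (currently very minimal) fit method
# on a previously constructed sampled radar dataset
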
…few typos generating errors:

 - diffusion/diffusion.py: I took into account the fact that ema might not be used in lightning hooks (on_train_batch_end, etc.) and updated the way the EMA config is passed to the EMA class
 - diffusion/ema.py: I made sure that the gradient graphs are not kept when the weights are stored in self.backup and self.shadow, and added the possibility to store the weights which are not currently on the model on the CPU, through the store_device keyword (see the sketch after this list)
 - data.py: I added the code to construct a dataset to train the autoencoder (AutoencoderDataset), and to construct a dataset to train the ldm (LatentDataset) from a sampled radar dataset. I added a DataModule class
 - ldcast.py: I mainly wrote the fit_autoencoder and fit_ldm methods taking as argument a sampled radar dataset and using the AutoencoderDataset and LatentDataset classes
 - reorganized the documentation: everything was in the README.md, and I created a docs folder with markdown files to organize the documentation a little better. The README now contains only general information and things to do
 - reorganized the config.yaml file (which is no longer named original_config.yaml but config.yaml)
 - in the NowcastingLightningModule in base.py, I added the possibility to add a scheduler for the learning rate and removed the n_timesteps argument of the forward method (I think it is not very appropriate to have it there, since not every subclass will need this argument, e.g. the Autoencoder subclass)
 - in autoencoder/autoencoder.py: the loss is now named AutoencoderLoss; the subclass of NowcastingLightningModule is now named Autoencoder while AutoencoderKLNet is a subclass of torch.nn.Module. I added the possibility to do antialiasing before feeding samples to the autoencoder (done by default by an Antialiaser object, in transforms/antialiasing.py). I also added the possibility to create an instance of Autoencoder via a config dict, based on the original autoencoder architecture
 - in LatentDiffusion in diffusion/diffusion.py: I added the possibility to construct an instance from a config dict, based on the architecture of the corresponding part in the original code
 - in the code in general, ldm was an instance of LatentDiffusionNet and ldm_lightning was an instance of LatentDiffusion, and I changed this: ldm is now an instance of LatentDiffusion and net is the instance of LatentDiffusionNet, to be consistent with the .net attribute of NowcastingLightningModule
 - in ldcast/ldcast.py: I also added the possibility to build the LDCast class from a config dict
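
For the EMA items above, here is a rough sketch of the hook-based design; the class internals are assumptions based on the description, not the actual diffusion/ema.py.

import torch

class EMA:
    def __init__(self, model, decay=0.9999, store_device="cpu"):
        self.decay = decay
        self.store_device = torch.device(store_device)
        # detached copies, so no gradient graph is kept
        self.shadow = {name: p.detach().clone().to(self.store_device)
                       for name, p in model.named_parameters() if p.requires_grad}
        self.backup = {}

    @torch.no_grad()
    def update(self, model):
        for name, p in model.named_parameters():
            if name in self.shadow:
                s = self.shadow[name]
                s.mul_(self.decay).add_(p.detach().to(self.store_device), alpha=1.0 - self.decay)

    @torch.no_grad()
    def apply_to(self, model):
        # back up the current weights (off-model, on store_device) and swap in the ema weights
        for name, p in model.named_parameters():
            if name in self.shadow:
                self.backup[name] = p.detach().clone().to(self.store_device)
                p.copy_(self.shadow[name].to(p.device))

    @torch.no_grad()
    def restore(self, model):
        for name, p in model.named_parameters():
            if name in self.backup:
                p.copy_(self.backup[name].to(p.device))
        self.backup = {}

The Lightning module would then call update from on_train_batch_end, apply_to from the on_validation_start / on_test_start / on_predict_start hooks, and restore from the corresponding end hooks.
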
in parallel on multiple GPUs. The main requirement for this is that the
LatentDiffusion class has to have the autoencoder as an attribute, so
that Lightning creates one instance of the autoencoder and one instance
of the ldm on each GPU (with the DDP strategy).
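
A rough sketch of that point (constructor and attribute names are illustrative): holding the autoencoder as an attribute of the LatentDiffusion module means Lightning's DDP strategy replicates both the net and the autoencoder on every GPU.

import pytorch_lightning as pl

class LatentDiffusion(pl.LightningModule):
    def __init__(self, net, autoencoder):
        super().__init__()
        self.net = net                  # the LatentDiffusionNet
        self.autoencoder = autoencoder  # replicated on each process under DDP
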
@leifdenby
Member

leifdenby commented May 1, 2026

Hi @martinbo-meteo! Thanks for starting this work :) And also, sorry for not engaging with this earlier. I've been holding off on diving into LDCast, because it is really quite a beast 😄

But, as I wrote via email: after working through @franchg's great PR adding the ConvGRU architecture and fiddle-based configuration #10, I feel like the mlcast codebase will very soon be ready for its next architecture.

I've been trying to understand the structure of the code for LDCast on your fork. For this I created the diagram below using a script I wrote to do static code analysis and extract the inheritance and dependency tree between classes related to model architecture (I wrote this to help me in the process of refactoring neural-lam). The script is here if you are interested: https://github.com/leifdenby/mlcast/blob/feat/ldcast-martinbo/class_diagram_gen.py (it is all LLM written, so not perfect, but I think LLMs are very useful for this kind of tool creation).

[architecture diagram: mlcast_ldcast_architecture_diagram]

(there is a drawio version here too if you want to dig more deeply and pick things apart: mlcast_ldcast_architecture_diagram.drawio)

Walking through this has made me realise that there is a training setup defined (I wrote in my email that I couldn't find it), but it wasn't where I would expect it :) Rather than constructing models and datasets and then passing these to (one or more) trainers, you are currently instantiating the pl.Trainers (and datasets, datamodules) inside this NowcastingModelBase: https://github.com/martinbo-meteo/mlcast-ldcast/blob/main/src/mlcast/models/ldcast/ldcast.py#L29.

What I think we need here is a representation of what I would call a "training experiment" which actually consists of two steps:

  1. Construct required dataset and model, and train an autoencoder
  2. Reuse this trained autoencoder, construct dataset and model, and train a latent diffusion model.

These two steps could be contained in a LatentDiffusionTrainingExperiment, for example, or an LDCastTrainingExperiment, and then we keep the dataset and model classes to only handle the logic of producing samples from datasets and producing predictions from those samples, respectively. They would be created with a fiddle config (where we can also define that the encoder used both within the autoencoder and the latent diffusion architecture is one and the same thing, as in it will be the same object). And then the "experiment" would handle carrying out the two stages of training; a rough sketch of the idea follows below.
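
For illustration only (none of these classes or names exist in the codebase yet), such a two-stage experiment could look roughly like this:

from dataclasses import dataclass
from typing import Any

import pytorch_lightning as pl

@dataclass
class LDCastTrainingExperiment:
    autoencoder: Any                       # Autoencoder LightningModule
    latent_diffusion: Any                  # LatentDiffusion LightningModule
    autoencoder_data: pl.LightningDataModule
    ldm_data: pl.LightningDataModule
    autoencoder_trainer: pl.Trainer
    ldm_trainer: pl.Trainer

    def run(self) -> None:
        # stage 1: train the autoencoder
        self.autoencoder_trainer.fit(self.autoencoder, datamodule=self.autoencoder_data)
        # stage 2: train the latent diffusion model, reusing the now-trained encoder
        # (the fiddle config would make both point at the same encoder object)
        self.ldm_trainer.fit(self.latent_diffusion, datamodule=self.ldm_data)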

Here's what the Experiment for training ConvGRU currently looks like: https://github.com/franchg/mlcast/blob/feat/convgru-ensemble-training/src/mlcast/config/base.py#L39, and the construction of the components that go into that experiment (dataset, model, trainer): https://github.com/franchg/mlcast/blob/feat/convgru-ensemble-training/src/mlcast/config/base.py#L53. We would then have an experiment with multiple trainers, and the config would create multiple datasets and models.

I can have a go at refactoring this if you like, but we can also talk about it next week and you can have a go yourself if it makes sense to you.

I can also see that we don't have CI integration for pre-commit so your code needs a bit of linting. I will set that up and then maybe you could run:

uvx pre-commit install
uvx pre-commit run -a

that will clean up all the linting things :)

And finally, we need jaxtyping-based annotations for the tensors I think, but I can explain how that works :) A small example of what they look like is below.
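
For example, a jaxtyping annotation on a tensor argument looks like this (the dimension names are just illustrative):

from jaxtyping import Float
from torch import Tensor

def predict(past: Float[Tensor, "batch time height width"]) -> Float[Tensor, "batch lead_time height width"]:
    """The annotation documents (and, with a runtime checker, enforces) the tensor shapes."""
    ...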
