Switch to using CF compliant coord stacking (rather than xr.Dataset.reset_index()) by leifdenby · Pull Request #85 · mllam/mllam-data-prep

leifdenby · 2025-11-25T14:29:28Z

Describe your changes

This PR alters the way mllam-data-prep handles storing stacked coordinates in output zarr datasets. Stacked coordinates are used for example when creating a single grid_index coordinate from two spatial coordinates like x and y or lat and lon. The difficulty arises because calling xr.Dataset.stack(grid_index=('x', 'y')) creates grid_index as a new pandas.MultiIndex coordinate and these cannot be written directly to netCDF or zarr datasets.

The current implementation in mllam-data-prep calls .reset_index(stacked_dim_name) https://github.com/mllam/mllam-data-prep/blob/main/mllam_data_prep/ops/mapping.py#L103 (e.g. .reset_index("grid_index") which removes the relationship between the stacked coordinates and the coordinates that were stacked by making the stacked coordinates just a regular 1D coordinate (i.e. it is no longer a MultiIndex). This has the drawbacks that:

In downstream applications we must explicitly recreate the MultiIndex by calling xr.Dataset.set_index(...) see eg https://github.com/mllam/neural-lam/blob/main/neural_lam/datastore/base.py#L528.
When calling .set_index we must explicitly pass in the the coordinates that the multiindex was created from. Currently we extract this from the mllam-data-prep config, but it would be better not store this information in the dataset. I didn't think this was possible, but it turns out that there is a CF-compliant way of storing these stacked coordinate relationships.

The new implementation in this PR removes the need for calling .reset_index() in mllam-data-prep and instead uses the CF-compliant "gather compression" for coordinate stacking (implemented in cf_xarray, see https://cf-xarray.readthedocs.io/en/latest/coding.html for details). This allows us to fully roundtrip coordinate stacking from zarr datasets produced with mllam-data-prep.

NB: This is unfortunately a breaking change for neural-lam this requires a similar change in neural-lam to use the same approach for coordinate unstacking, and so once this PR is merged in datasets created with mllam-data-prep will not be able to used with versions of neural-lam without this same change. I will make a PR for neural-lam so that we can ensure both are merged in and released together.

For reference so people can see the change to the zarr datasets, using the following config yaml file:

schema_version: v0.6.0
dataset_version: v0.1.0

output:
  variables:
    static: [grid_index, static_feature]
  coord_ranges:
    time:
      start: 1990-09-03T00:00
      end: 1990-09-06T00:00
  chunking:
    time: 1

inputs:
  danra_height_levels:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/single_levels.zarr
    dims: [time, x, y]
    variables: [t2m]
    dim_mapping:
      static_feature:
        method: stack_variables_by_var_name
        name_format: "{var_name}"
      grid_index:
        method: stack
        dims: [x, y]

    target_output_variable: static

Currently results in the following zarr dataset:

In [3]: ds
Out[3]:
<xarray.Dataset> Size: 387MB
Dimensions:                        (static_feature: 1, time: 100,
                                    grid_index: 464721)
Coordinates:
  * static_feature                 (static_feature) <U3 12B 't2m'
  * time                           (time) datetime64[ns] 800B 1990-09-01 ... ...
    lat                            (grid_index) float64 4MB ...
    lon                            (grid_index) float64 4MB ...
    static_feature_long_name       (static_feature) <U11 44B ...
    static_feature_source_dataset  (static_feature) <U19 76B ...
    static_feature_units           (static_feature) <U1 4B ...
    x                              (grid_index) float64 4MB ...
    y                              (grid_index) float64 4MB ...
Dimensions without coordinates: grid_index
Data variables:
    static                         (static_feature, time, grid_index) float64 372MB ...
Attributes:
    created_on:       2025-11-25T14:55:23
    created_with:     mllam-data-prep (https://github.com/mllam/mllam-data-prep)
    creation_config:  dataset-version: v0.1.0\nextra: {}\ninputs:\n  danra_he...
    dataset_version:  v0.1.0
    mdp_version:      v0.6.1
    schema_version:   v0.6.0

In [4]: ds.grid_index
Out[4]:
<xarray.DataArray 'grid_index' (grid_index: 464721)> Size: 4MB
[464721 values with dtype=int64]
Coordinates:
    lat      (grid_index) float64 4MB ...
    lon      (grid_index) float64 4MB ...
    x        (grid_index) float64 4MB ...
    y        (grid_index) float64 4MB ...
Dimensions without coordinates: grid_index

With the changes in this PR the zarr dataset now look like this:

<xarray.Dataset> Size: 383MB
Dimensions:                        (static_feature: 1, time: 100,
                                    grid_index: 464721, x: 789, y: 589)
Coordinates:
  * static_feature                 (static_feature) <U3 12B 't2m'
  * time                           (time) datetime64[ns] 800B 1990-09-01 ... ...
  * grid_index                     (grid_index) int64 4MB 0 1 ... 464719 464720
  * x                              (x) float64 6kB -1.999e+06 ... -2.925e+04
  * y                              (y) float64 5kB -6.095e+05 ... 8.605e+05
    lat                            (grid_index) float64 4MB ...
    lon                            (grid_index) float64 4MB ...
    static_feature_long_name       (static_feature) <U11 44B ...
    static_feature_source_dataset  (static_feature) <U19 76B ...
    static_feature_units           (static_feature) <U1 4B ...
Data variables:
    static                         (static_feature, time, grid_index) float64 372MB ...
Attributes:
    created_on:       2025-11-25T14:52:53
    created_with:     mllam-data-prep (https://github.com/mllam/mllam-data-prep)
    creation_config:  dataset-version: v0.1.0\nextra: {}\ninputs:\n  danra_he...
    dataset_version:  v0.1.0
    mdp_version:      v0.6.1
    schema_version:   v0.6.0

In [4]: ds.grid_index
Out[4]:
<xarray.DataArray 'grid_index' (grid_index: 464721)> Size: 4MB
array([     0,      1,      2, ..., 464718, 464719, 464720], shape=(464721,))
Coordinates:
  * grid_index  (grid_index) int64 4MB 0 1 2 3 4 ... 464717 464718 464719 464720
    lat         (grid_index) float64 4MB ...
    lon         (grid_index) float64 4MB ...
Attributes:
    compress:  x y

Note how the x and y coordinates no longer have grid_index as their own coordinate, and instead grid_index coordinate has been given a compress attribute that names the coordinates that it was stacked from.

So why am I making this change if it breaks current versions of neural-lam?

The current implementation is not correct, as in, the creation of the zarr datasets is lossy and the datasets should instead contain enough information to invert the transformation processes that mllam-data-prep applies.
The current implementation uses our own hand-crafted way of handling stacked coordinates in output, rather than the standard CF-compliant way. I feel a lot more comfortable maintaining code where we're using standard ways of doing things rather than inventing our own.

Will future versions of neural-lam support datasets created with mllam-data-prep prior to this change?
Yes, this should be possible. We can simply try applying the CF-compliant unstacking first, if that fails the fall-back to the current handcrafted approach (while issuing a warning).

This PR also add cf_xarray as dependency. I have been wanting to add this package for a while because it also adds accessor methods for safely retrieving say lat/lon coordinate variables using standard names and other standards-compliant things like that. It is a small package that is well designed and maintained so I think it would be a very good addition.

Issue Link

I am came across the problem being addressed while working on the mllam-data-prep inversion operations in Add support for inverting transformation operations (mapping back to structure of source datasets) #78
This PR also offers a more general solution to the current assumption in neural-lam about what the spatial coordinate names are: Add support for cropping domain with padded lat/lon convex hull #45 (comment)

Type of change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
- will only break downstream applications in-so-far as the new approach to handling stacked coordinates hasn't been implemented yet in for example neural-lam
📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
I have performed a self-review of my code
For any new/modified functions/classes I have added docstrings that clearly describe its purpose, expected inputs and returned values
I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
I have updated the documentation to cover introduced code changes
I have added tests that prove my fix is effective or that my feature works
I have given the PR a name that clearly describes the change, written in imperative form (context).
I have requested a reviewer and an assignee (assignee is responsible for merging)

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

the code is readable
the code is well tested
the code is documented (including return types and parameters)
the code is easy to maintain

Author checklist after completed review

I have added a line to the CHANGELOG describing this change, in a section
reflecting type of change (add section where missing):
- added: when you have added new functionality
- changed: when default behaviour of the code has been changed
- fixes: when your contribution fixes a bug

Checklist for assignee

PR is up to date with the base branch
the tests pass
author has added an entry to the changelog (and designated the change as added, changed or fixed)
Once the PR is ready to be merged, squash commits and merge the PR.

This commit replaces the current stacking method where `xr.Dataset.reset_index()` is used to change the stacked coordinate (for example `grid_index` as a stacking of `x` and `y` coordinates) from a `pd.MultiIndex` to a regular index, and instead uses the CF-compliant "gather compression" of coordinate stacking (see https://cf-xarray.readthedocs.io/en/latest/coding.html). Using `reset_index()` is a lossy operation in that downstream applications will not know wwhich coordinates have been stacked, meaning we must instead rely on this information from elsewhere (e.g. the config) in downstream applications. The reason that stacked coordinates must be handled in a special way is that `pd.MultiIndex` coordinates cannot be directly written to netCDF/zarr datasets, but with the CF-compliant method they can be handled in such a manner that we can completely roundtrip in downstream applications.

leifdenby added 2 commits November 25, 2025 13:51

add test for cf-compat unstacking

cc2ec4f

leifdenby mentioned this pull request Nov 25, 2025

Add support for cropping domain with padded lat/lon convex hull #45

Merged

20 tasks

leifdenby changed the title ~~Feat/use cfcompliant coord stacking~~ Switch to using CF compliant coord stacking (rather than xr.Dataset.reset_index()) Nov 25, 2025

add cf-xarray dep

7945ce3

leifdenby mentioned this pull request Nov 25, 2025

Add support for inverting transformation operations (mapping back to structure of source datasets) #78

Open

20 tasks

leifdenby added this to the v0.8.0 (proposed) milestone Nov 25, 2025

joeloskarsson assigned leifdenby Feb 9, 2026

joeloskarsson modified the milestones: v0.8.0 (proposed), v0.8.0 Feb 9, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch to using CF compliant coord stacking (rather than xr.Dataset.reset_index())#85

Switch to using CF compliant coord stacking (rather than xr.Dataset.reset_index())#85
leifdenby wants to merge 3 commits into
mllam:mainfrom
leifdenby:feat/use-cfcompliant-coord-stacking

leifdenby commented Nov 25, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

leifdenby commented Nov 25, 2025

Describe your changes

Issue Link

Type of change

Checklist before requesting a review

Checklist for reviewers

Author checklist after completed review

Checklist for assignee

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants