Skip to content

Switch to using CF compliant coord stacking (rather than xr.Dataset.reset_index())#85

Open
leifdenby wants to merge 3 commits into
mllam:mainfrom
leifdenby:feat/use-cfcompliant-coord-stacking
Open

Switch to using CF compliant coord stacking (rather than xr.Dataset.reset_index())#85
leifdenby wants to merge 3 commits into
mllam:mainfrom
leifdenby:feat/use-cfcompliant-coord-stacking

Conversation

@leifdenby
Copy link
Copy Markdown
Member

Describe your changes

This PR alters the way mllam-data-prep handles storing stacked coordinates in output zarr datasets. Stacked coordinates are used for example when creating a single grid_index coordinate from two spatial coordinates like x and y or lat and lon. The difficulty arises because calling xr.Dataset.stack(grid_index=('x', 'y')) creates grid_index as a new pandas.MultiIndex coordinate and these cannot be written directly to netCDF or zarr datasets.

The current implementation in mllam-data-prep calls .reset_index(stacked_dim_name) https://github.com/mllam/mllam-data-prep/blob/main/mllam_data_prep/ops/mapping.py#L103 (e.g. .reset_index("grid_index") which removes the relationship between the stacked coordinates and the coordinates that were stacked by making the stacked coordinates just a regular 1D coordinate (i.e. it is no longer a MultiIndex). This has the drawbacks that:

  1. In downstream applications we must explicitly recreate the MultiIndex by calling xr.Dataset.set_index(...) see eg https://github.com/mllam/neural-lam/blob/main/neural_lam/datastore/base.py#L528.
  2. When calling .set_index we must explicitly pass in the the coordinates that the multiindex was created from. Currently we extract this from the mllam-data-prep config, but it would be better not store this information in the dataset. I didn't think this was possible, but it turns out that there is a CF-compliant way of storing these stacked coordinate relationships.

The new implementation in this PR removes the need for calling .reset_index() in mllam-data-prep and instead uses the CF-compliant "gather compression" for coordinate stacking (implemented in cf_xarray, see https://cf-xarray.readthedocs.io/en/latest/coding.html for details). This allows us to fully roundtrip coordinate stacking from zarr datasets produced with mllam-data-prep.

NB: This is unfortunately a breaking change for neural-lam this requires a similar change in neural-lam to use the same approach for coordinate unstacking, and so once this PR is merged in datasets created with mllam-data-prep will not be able to used with versions of neural-lam without this same change. I will make a PR for neural-lam so that we can ensure both are merged in and released together.

For reference so people can see the change to the zarr datasets, using the following config yaml file:

schema_version: v0.6.0
dataset_version: v0.1.0

output:
  variables:
    static: [grid_index, static_feature]
  coord_ranges:
    time:
      start: 1990-09-03T00:00
      end: 1990-09-06T00:00
  chunking:
    time: 1

inputs:
  danra_height_levels:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/single_levels.zarr
    dims: [time, x, y]
    variables: [t2m]
    dim_mapping:
      static_feature:
        method: stack_variables_by_var_name
        name_format: "{var_name}"
      grid_index:
        method: stack
        dims: [x, y]

    target_output_variable: static

Currently results in the following zarr dataset:

In [3]: ds
Out[3]:
<xarray.Dataset> Size: 387MB
Dimensions:                        (static_feature: 1, time: 100,
                                    grid_index: 464721)
Coordinates:
  * static_feature                 (static_feature) <U3 12B 't2m'
  * time                           (time) datetime64[ns] 800B 1990-09-01 ... ...
    lat                            (grid_index) float64 4MB ...
    lon                            (grid_index) float64 4MB ...
    static_feature_long_name       (static_feature) <U11 44B ...
    static_feature_source_dataset  (static_feature) <U19 76B ...
    static_feature_units           (static_feature) <U1 4B ...
    x                              (grid_index) float64 4MB ...
    y                              (grid_index) float64 4MB ...
Dimensions without coordinates: grid_index
Data variables:
    static                         (static_feature, time, grid_index) float64 372MB ...
Attributes:
    created_on:       2025-11-25T14:55:23
    created_with:     mllam-data-prep (https://github.com/mllam/mllam-data-prep)
    creation_config:  dataset-version: v0.1.0\nextra: {}\ninputs:\n  danra_he...
    dataset_version:  v0.1.0
    mdp_version:      v0.6.1
    schema_version:   v0.6.0

In [4]: ds.grid_index
Out[4]:
<xarray.DataArray 'grid_index' (grid_index: 464721)> Size: 4MB
[464721 values with dtype=int64]
Coordinates:
    lat      (grid_index) float64 4MB ...
    lon      (grid_index) float64 4MB ...
    x        (grid_index) float64 4MB ...
    y        (grid_index) float64 4MB ...
Dimensions without coordinates: grid_index

With the changes in this PR the zarr dataset now look like this:

<xarray.Dataset> Size: 383MB
Dimensions:                        (static_feature: 1, time: 100,
                                    grid_index: 464721, x: 789, y: 589)
Coordinates:
  * static_feature                 (static_feature) <U3 12B 't2m'
  * time                           (time) datetime64[ns] 800B 1990-09-01 ... ...
  * grid_index                     (grid_index) int64 4MB 0 1 ... 464719 464720
  * x                              (x) float64 6kB -1.999e+06 ... -2.925e+04
  * y                              (y) float64 5kB -6.095e+05 ... 8.605e+05
    lat                            (grid_index) float64 4MB ...
    lon                            (grid_index) float64 4MB ...
    static_feature_long_name       (static_feature) <U11 44B ...
    static_feature_source_dataset  (static_feature) <U19 76B ...
    static_feature_units           (static_feature) <U1 4B ...
Data variables:
    static                         (static_feature, time, grid_index) float64 372MB ...
Attributes:
    created_on:       2025-11-25T14:52:53
    created_with:     mllam-data-prep (https://github.com/mllam/mllam-data-prep)
    creation_config:  dataset-version: v0.1.0\nextra: {}\ninputs:\n  danra_he...
    dataset_version:  v0.1.0
    mdp_version:      v0.6.1
    schema_version:   v0.6.0

In [4]: ds.grid_index
Out[4]:
<xarray.DataArray 'grid_index' (grid_index: 464721)> Size: 4MB
array([     0,      1,      2, ..., 464718, 464719, 464720], shape=(464721,))
Coordinates:
  * grid_index  (grid_index) int64 4MB 0 1 2 3 4 ... 464717 464718 464719 464720
    lat         (grid_index) float64 4MB ...
    lon         (grid_index) float64 4MB ...
Attributes:
    compress:  x y

Note how the x and y coordinates no longer have grid_index as their own coordinate, and instead grid_index coordinate has been given a compress attribute that names the coordinates that it was stacked from.

So why am I making this change if it breaks current versions of neural-lam?

  1. The current implementation is not correct, as in, the creation of the zarr datasets is lossy and the datasets should instead contain enough information to invert the transformation processes that mllam-data-prep applies.
  2. The current implementation uses our own hand-crafted way of handling stacked coordinates in output, rather than the standard CF-compliant way. I feel a lot more comfortable maintaining code where we're using standard ways of doing things rather than inventing our own.

Will future versions of neural-lam support datasets created with mllam-data-prep prior to this change?
Yes, this should be possible. We can simply try applying the CF-compliant unstacking first, if that fails the fall-back to the current handcrafted approach (while issuing a warning).

This PR also add cf_xarray as dependency. I have been wanting to add this package for a while because it also adds accessor methods for safely retrieving say lat/lon coordinate variables using standard names and other standards-compliant things like that. It is a small package that is well designed and maintained so I think it would be a very good addition.

Issue Link

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • will only break downstream applications in-so-far as the new approach to handling stacked coordinates hasn't been implemented yet in for example neural-lam
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe its purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the documentation to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging)

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • author has added an entry to the changelog (and designated the change as added, changed or fixed)
  • Once the PR is ready to be merged, squash commits and merge the PR.

This commit replaces the current stacking method where
`xr.Dataset.reset_index()` is used to change the stacked coordinate (for
example `grid_index` as a stacking of `x` and `y` coordinates) from a
`pd.MultiIndex` to a regular index, and instead uses the CF-compliant "gather
compression" of coordinate stacking (see
https://cf-xarray.readthedocs.io/en/latest/coding.html). Using `reset_index()`
is a lossy operation in that downstream applications will not know wwhich
coordinates have been stacked, meaning we must instead rely on this information
from elsewhere (e.g. the config) in downstream applications.
The reason that stacked coordinates must be handled in a special way is that
`pd.MultiIndex` coordinates cannot be directly written to netCDF/zarr datasets,
but with the CF-compliant method they can be handled in such a manner that we
can completely roundtrip in downstream applications.
@leifdenby leifdenby changed the title Feat/use cfcompliant coord stacking Switch to using CF compliant coord stacking (rather than xr.Dataset.reset_index()) Nov 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants