Switch to using CF compliant coord stacking (rather than xr.Dataset.reset_index())#85
Open
leifdenby wants to merge 3 commits into
Open
Switch to using CF compliant coord stacking (rather than xr.Dataset.reset_index())#85leifdenby wants to merge 3 commits into
leifdenby wants to merge 3 commits into
Conversation
This commit replaces the current stacking method where `xr.Dataset.reset_index()` is used to change the stacked coordinate (for example `grid_index` as a stacking of `x` and `y` coordinates) from a `pd.MultiIndex` to a regular index, and instead uses the CF-compliant "gather compression" of coordinate stacking (see https://cf-xarray.readthedocs.io/en/latest/coding.html). Using `reset_index()` is a lossy operation in that downstream applications will not know wwhich coordinates have been stacked, meaning we must instead rely on this information from elsewhere (e.g. the config) in downstream applications. The reason that stacked coordinates must be handled in a special way is that `pd.MultiIndex` coordinates cannot be directly written to netCDF/zarr datasets, but with the CF-compliant method they can be handled in such a manner that we can completely roundtrip in downstream applications.
20 tasks
20 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Describe your changes
This PR alters the way
mllam-data-prephandles storing stacked coordinates in outputzarrdatasets. Stacked coordinates are used for example when creating a singlegrid_indexcoordinate from two spatial coordinates likexandyorlatandlon. The difficulty arises because callingxr.Dataset.stack(grid_index=('x', 'y'))createsgrid_indexas a newpandas.MultiIndexcoordinate and these cannot be written directly to netCDF or zarr datasets.The current implementation in
mllam-data-prepcalls.reset_index(stacked_dim_name)https://github.com/mllam/mllam-data-prep/blob/main/mllam_data_prep/ops/mapping.py#L103 (e.g..reset_index("grid_index")which removes the relationship between the stacked coordinates and the coordinates that were stacked by making the stacked coordinates just a regular 1D coordinate (i.e. it is no longer aMultiIndex). This has the drawbacks that:MultiIndexby callingxr.Dataset.set_index(...)see eg https://github.com/mllam/neural-lam/blob/main/neural_lam/datastore/base.py#L528..set_indexwe must explicitly pass in the the coordinates that the multiindex was created from. Currently we extract this from themllam-data-prepconfig, but it would be better not store this information in the dataset. I didn't think this was possible, but it turns out that there is a CF-compliant way of storing these stacked coordinate relationships.The new implementation in this PR removes the need for calling
.reset_index()inmllam-data-prepand instead uses the CF-compliant "gather compression" for coordinate stacking (implemented incf_xarray, see https://cf-xarray.readthedocs.io/en/latest/coding.html for details). This allows us to fully roundtrip coordinate stacking from zarr datasets produced withmllam-data-prep.NB: This is unfortunately a breaking change for
neural-lamthis requires a similar change inneural-lamto use the same approach for coordinate unstacking, and so once this PR is merged in datasets created withmllam-data-prepwill not be able to used with versions ofneural-lamwithout this same change. I will make a PR forneural-lamso that we can ensure both are merged in and released together.For reference so people can see the change to the zarr datasets, using the following config yaml file:
Currently results in the following zarr dataset:
With the changes in this PR the zarr dataset now look like this:
Note how the
xandycoordinates no longer havegrid_indexas their own coordinate, and insteadgrid_indexcoordinate has been given acompressattribute that names the coordinates that it was stacked from.So why am I making this change if it breaks current versions of
neural-lam?mllam-data-prepapplies.Will future versions of
neural-lamsupport datasets created withmllam-data-prepprior to this change?Yes, this should be possible. We can simply try applying the CF-compliant unstacking first, if that fails the fall-back to the current handcrafted approach (while issuing a warning).
This PR also add cf_xarray as dependency. I have been wanting to add this package for a while because it also adds accessor methods for safely retrieving say lat/lon coordinate variables using standard names and other standards-compliant things like that. It is a small package that is well designed and maintained so I think it would be a very good addition.
Issue Link
Type of change
neural-lamChecklist before requesting a review
pullwith--rebaseoption if possible).Checklist for reviewers
Each PR comes with its own improvements and flaws. The reviewer should check the following:
Author checklist after completed review
reflecting type of change (add section where missing):
Checklist for assignee