Allowing for dataset derived traits #5

@leifdenby

The current Python API looks like this:

# mymodule.myloader
import xarray as xr

TIME_PROFILE = "observation"
SPACE_PROFILE = "grid"
UNCERTAINTY_PROFILE = "deterministic"

def load_dataset(paths: list[str], **kwargs) -> xr.Dataset:
    ds = xr.open_mfdataset(paths, combine="by_coords", **kwargs)
    return ds

# mlwp_data_loaders.cli
from mlwp_data_loaders import load_dataset
from mlwp_data_specs import validate_dataset

# 1. Load the dataset and extract the trait profiles defined by the loader
ds, dataset_traits = load_dataset(
    [
        "/path/to/file1.nc",
        "/path/to/file2.nc",
    ],
    loader="mymodule.myloader",
    return_dataset_traits=True,
)

# 2. Get a detailed validation report by passing the extracted traits
report = validate_dataset(
    ds,
    time=dataset_traits.get("time_profile"),
    space=dataset_traits.get("space_profile"),
    uncertainty=dataset_traits.get("uncertainty_profile"),
)

In this design the load_dataset implementation in mlwp-data-loaders reads the statically defined trait properties from the loader module provided, and returns the loaded dataset and the traits separately.
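
To make the static lookup concrete, here is a minimal sketch of how the trait constants could be read from a loader module by name. `read_static_traits` and the `demo_loader` stand-in module are hypothetical names used only for illustration, and mapping the constants to lower-case keys is an assumption based on the `dataset_traits.get("time_profile")` calls above. The actual implementation in mlwp-data-loaders may differ.

```python
import importlib
import sys
import types

# Module-level constant names taken from the loader example above.
TRAIT_CONSTANTS = ("TIME_PROFILE", "SPACE_PROFILE", "UNCERTAINTY_PROFILE")


def read_static_traits(loader: str) -> dict:
    """Collect the statically defined trait constants of a loader module.

    Hypothetical helper: the real lookup in mlwp-data-loaders may differ.
    """
    module = importlib.import_module(loader)
    # Lower-cased keys are an assumption, chosen to match "time_profile" etc.
    return {
        name.lower(): getattr(module, name)
        for name in TRAIT_CONSTANTS
        if hasattr(module, name)
    }


# Stand-in for mymodule.myloader so the sketch runs without files on disk.
_demo = types.ModuleType("demo_loader")
_demo.TIME_PROFILE = "observation"
_demo.SPACE_PROFILE = "grid"
_demo.UNCERTAINTY_PROFILE = "deterministic"
sys.modules["demo_loader"] = _demo

static_traits = read_static_traits("demo_loader")
```

Because the values come from module attributes, they are fixed at import time and cannot depend on the files passed to the loader.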

The drawback is that the traits a loader imposes are hard-coded in the loader module. @mpvginde has pointed out that we might want to allow a loader to define the traits of a dataset at runtime, based on the contents of what it reads from disk. For an example, see the current "loader" implementation in mxalign: https://github.com/mlwp-tools/mxalign/blob/e2232d93275c7508897a7ddb0cce8b508665f24c/src/mxalign/loaders/base.py#L81-L107

If we instead want the loader to infer the dataset traits from what it reads from disk, then:

  1. the traits for a specific dataset need to be defined at runtime
  2. the traits need to be returned from the loader somehow

I will use this issue to outline a few approaches to achieving this.
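
As a starting point for discussion, the two requirements can be illustrated with a minimal sketch in which the loader derives the traits from what it finds and returns them alongside the dataset. Everything here is hypothetical: the `DatasetTraits` container, the `infer_traits` helper, the dimension names, and the trait values other than "observation", "grid" and "deterministic" are assumptions made for illustration, and a plain dict stands in for the xr.Dataset. It is one possible shape among several, not a proposal for the final API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetTraits:
    # Hypothetical container; field names mirror the validate_dataset keywords.
    time: str
    space: str
    uncertainty: str


def infer_traits(dim_names: set[str]) -> DatasetTraits:
    """Derive traits at runtime from the dimension names found on disk.

    The dimension names and the values "forecast", "point" and "ensemble"
    are illustrative assumptions, not values defined by mlwp-data-specs.
    """
    return DatasetTraits(
        time="forecast" if "lead_time" in dim_names else "observation",
        space="grid" if {"x", "y"} <= dim_names else "point",
        uncertainty="ensemble" if "member" in dim_names else "deterministic",
    )


def load_dataset(dims_on_disk: set[str]):
    """Stand-in loader returning (dataset, traits) so the traits travel with the data."""
    ds = {"dims": sorted(dims_on_disk)}  # placeholder for an xr.Dataset
    return ds, infer_traits(dims_on_disk)


ds, traits = load_dataset({"time", "x", "y", "member"})
```

Returning a tuple is only one way to satisfy the second requirement. The traits could equally be attached to the dataset itself or exposed through a separate function on the loader module.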
