LDeakin (Member) commented Oct 21, 2025

I'm still finalising an implementation, but here is a draft spec.

jbms (Contributor) commented Oct 21, 2025

Looks good, the only issue I see is the fill value representation not allowing null for the base data type fill value, e.g. for a base data type of json or nested optional. Instead you could specify the base data type fill value in a single-element array:

null -> missing
[1] -> value of 1
[null] -> value of null
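
A minimal sketch of how a reader might decode this representation for a single level of optional. The names `MISSING` and `decode_optional_fill_value` are hypothetical and not part of any Zarr library; the mapping itself is the one listed above.

```python
import json

# Hypothetical sentinel for "element is missing". It is kept distinct from
# Python's None, which is reserved here for a JSON null that is a real value
# of the base data type.
MISSING = object()


def decode_optional_fill_value(raw):
    """Decode a fill value for optional<T> under the one-element-array proposal."""
    if raw is None:
        return MISSING  # JSON null -> missing element
    if isinstance(raw, list) and len(raw) == 1:
        return raw[0]  # [x] -> present element whose base fill value is x
    raise ValueError(f"expected null or a one-element array, got {raw!r}")


for text in ("null", "[1]", "[null]"):
    value = decode_optional_fill_value(json.loads(text))
    print(text, "->", "missing" if value is MISSING else repr(value))
```

Because the wrapper is always exactly one element long, a base data type whose own fill value is an array (for example `[[1, 2]]`) would still decode unambiguously.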

mkitti (Contributor) commented Oct 24, 2025

While I see the simplicity here of toggling between null and a type, I am wondering if there is a further opportunity to introduce a full type tag to implement tagged unions in Zarr.

LDeakin (Member, Author) commented Oct 28, 2025

> not allowing null for the base data type fill value

I suppose it is a limitation, but I'd note that we don't have any data types that permit a null fill value. Can anyone think of any that might arise?

Also, for a multiply nested optional type like Option<Option<u8>>, you could potentially have the "null" at different levels. An implementation would have to track the "depth" of the fill value and change it each time it goes through the optional codec.

d-v-b (Contributor) commented Oct 28, 2025

> Can anyone think of any that might arise?

I think null will be a valid fill value for the JSON data type

LDeakin (Member, Author) commented Nov 30, 2025

I've implemented this in zarrs with codec/data type support for arbitrarily nested optional data. I'll add example data later. I have not addressed these two issues:

> the only issue I see is the fill value representation not allowing null for the base data type fill value

> for a multiply nested optional type like Option<Option<u8>>, you could potentially have the "null" at different levels

I am open to explicit suggestions that satisfy both, or only the first. The latter is a bit burdensome to support for something I suspect nobody would use. What I currently do:

  • null always means an empty element with the optional data type, irrespective of the inner type
  • A non-null fill value gets propagated down to the deepest non-optional inner type.
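
The two rules above can be sketched as follows. This is an illustrative model only, not the zarrs implementation: `NONE`, the `("some", inner)` tuples, and `interpret_fill_value` are hypothetical names chosen for this example.

```python
# Hypothetical model of an optional value: either NONE or ("some", inner).
NONE = "none"


def interpret_fill_value(raw, depth):
    """Interpret a JSON fill value for an optional type nested `depth` levels deep.

    Rule 1: null always means the outermost optional is empty.
    Rule 2: any non-null value belongs to the deepest non-optional inner type,
            so every optional level above it is treated as present.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1 for an optional type")
    if raw is None:
        return NONE
    value = raw
    for _ in range(depth):
        value = ("some", value)  # wrap once per optional level
    return value


print(interpret_fill_value(None, 2))
print(interpret_fill_value(5, 2))
```

Under this model there is no spelling for an `Option<Option<u8>>` fill value of `Some(None)`, and a base data type cannot use null as its own fill value. Those are the two open issues listed in this comment.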

jbms (Contributor) commented Nov 30, 2025

> I've implemented this in zarrs with codec/data type support for arbitrarily nested optional data. I'll add example data later. I have not addressed these two issues:
>
> > the only issue I see is the fill value representation not allowing null for the base data type fill value
>
> > for a multiply nested optional type like Option<Option<u8>>, you could potentially have the "null" at different levels
>
> I am open to explicit suggestions that satisfy both, or only the first. The latter is a bit burdensome to support for something I suspect nobody would use. What I currently do:
>
>   • null always means an empty element with the optional data type, irrespective of the inner type
>   • A non-null fill value gets propagated down to the deepest non-optional inner type.

I previously suggested wrapping any non-None fill value in a one-element array. That solves both issues and is syntactically pretty minimal.
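
A hedged sketch of how the one-element-array wrapping extends to nested optionals, with one wrapper consumed per optional level. `MISSING`, the `("some", inner)` tuples, and `decode_nested` are hypothetical names for illustration; no existing library API is implied.

```python
# Hypothetical marker for an optional that is empty at some nesting level.
MISSING = object()


def decode_nested(raw, depth):
    """Decode a fill value for `depth` optional levels wrapped around a base type.

    Each optional level consumes either a null (empty at that level) or a
    one-element array whose single item is the fill value for the next level in.
    """
    if depth == 0:
        return raw  # base data type: any JSON value is allowed, including null
    if raw is None:
        return MISSING
    if isinstance(raw, list) and len(raw) == 1:
        return ("some", decode_nested(raw[0], depth - 1))
    raise ValueError(f"expected null or a one-element array, got {raw!r}")


# For a doubly nested optional such as Option<Option<u8>>:
outer_empty = decode_nested(None, 2)     # None
inner_empty = decode_nested([None], 2)   # Some(None)
present = decode_nested([[1]], 2)        # Some(Some(1))
```

With a single optional level around a JSON-like base type, `[null]` decodes to a present element whose value is null, which is the first of the two issues.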

mkitti (Contributor) commented Dec 1, 2025

I'm still not quite following why this current implementation is not just a special case of a more general sum type, an enum type in Rust.

Currently, Optional has a bit that toggles between None and Some<T>. Why not generalize this to any two data types?

null is just a singleton instance of a null_type. The bit in the mask is then a toggle between null_type and T.

Why build this infrastructure for null rather than any two types? Why not generalize further to N types?

LDeakin (Member, Author) commented Dec 1, 2025

Interesting... masked data has come up in a few discussions I've had, but never a more general sum type. Will people use this? It seems like not many people are complaining about the lack of struct support in Zarr V3.

A more general enum type would probably need to encode each variant through separate codec chains. E.g.

enum EnumType {
  U8(u8),
  String(String),
}

{
    "data_type": {
        "name": "enum",
        "configuration": {
            "data_types": [
                {
                    "name": "uint8"
                },
                {
                    "name": "string"
                }
            ]
        }
    },
    "fill_value": "?",
    "codecs": [
        {
            "name": "enum",
            "configuration": {
                "discriminator_data_type": {
                    "name": "uint8"
                },
                "discriminator_codecs": [
                    {
                        "name": "bytes"
                    }
                ],
                "variant_codecs": [
                    [
                        {
                            "name": "bytes"
                        }
                    ],
                    [
                        {
                            "name": "vlen-utf8"
                        }
                    ]
                ]
            }
        }
    ]
}

Specialising the above for an optional would need a new null_type or similar (zero-sized), as you mentioned. Yet another thing to standardise... I think there is space in the Zarr ecosystem to handle enum, optional, and struct data types separately.
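
To make the "separate codec chains" idea concrete, here is a rough sketch of splitting a mixed uint8/string array into a discriminator array plus one value list per variant, each of which could then go through its own codec chain. The dense per-variant layout is an assumption of this example (the thread does not specify one), and `split_variants` and `merge_variants` are hypothetical helpers.

```python
def split_variants(elements):
    """Split mixed elements into discriminators and one dense list per variant.

    Variant 0 is uint8 and variant 1 is string, matching the order of
    "data_types" in the metadata above.
    """
    discriminators = bytearray()
    variants = ([], [])
    for element in elements:
        if isinstance(element, str):
            index = 1
        elif (
            isinstance(element, int)
            and not isinstance(element, bool)
            and 0 <= element <= 255
        ):
            index = 0
        else:
            raise TypeError(f"not a uint8 or string: {element!r}")
        discriminators.append(index)  # one uint8 discriminator per element
        variants[index].append(element)
    return bytes(discriminators), variants


def merge_variants(discriminators, variants):
    """Rebuild the original element order from discriminators and variant lists."""
    iterators = [iter(values) for values in variants]
    return [next(iterators[index]) for index in discriminators]


tags, (u8_values, str_values) = split_variants([7, "a", 255, "bc"])
restored = merge_variants(tags, (u8_values, str_values))
```

In this picture an optional is the two-variant case where one variant carries no data, which is where a zero-sized null_type would come in.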

mkitti (Contributor) commented Dec 1, 2025

The other two null-like types I would immediately like to use this for are:

  1. NA or missing to represent missing data in the statistical sense.
  2. NaN as a general analog to IEEE 754 floating point. In practice we would probably just use a floating point type here.

I do not think these are well represented by null. Each has their own semantics.

julia> NaN == true
false

julia> missing == true
missing

If we could generalize this to any two types and introduce a null_type, I think this could cover a much wider array of statistical and numerical applications.
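
For readers more familiar with Python, the Julia comparison above can be approximated as follows. Python has no built-in statistical `missing`, so the `Missing` class here is a hypothetical stand-in written for this example; only `math.nan` and `None` are real built-in values.

```python
import math


class Missing:
    """Hypothetical stand-in for a statistical 'missing' value."""

    _instance = None

    def __new__(cls):
        # Singleton, so that identity checks (`is missing`) are meaningful.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __eq__(self, other):
        return self  # comparing with missing yields missing, not a bool

    def __ne__(self, other):
        return self

    def __repr__(self):
        return "missing"


missing = Missing()

nan_result = math.nan == True      # IEEE 754: comparisons with NaN are false
missing_result = missing == True   # propagates: the answer is itself unknown
none_result = None == True         # None compares like an ordinary value
print(nan_result, missing_result, none_result)
```

The three values give three different answers to the same comparison, which is the sense in which each null-like type carries its own semantics.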

mkitti (Contributor) commented Dec 1, 2025

A recent application is the tracking standard GEFF, which introduced a missing mask.

https://liveimagetrackingtools.org/geff/latest/specification/#the-props-group-and-node-property-groups.

While they use Python's None to represent this, the meaning is that the value does not exist and can either be ignored or imputed, as opposed to the value being undefined or an error having occurred.
