diff --git a/codecs/optional/README.md b/codecs/optional/README.md
new file mode 100644
index 0000000..33619de
--- /dev/null
+++ b/codecs/optional/README.md
@@ -0,0 +1,109 @@
+# Optional codec
+
+Defines an `array -> bytes` codec that encodes optional (nullable) data by separating the validity mask and data encoding.
+
+This codec is designed for the `optional` data type, which represents nullable values of any underlying data type.
+
+## Codec name
+
+The value of the `name` member in the codec object MUST be `optional`.
+
+## Configuration parameters
+
+### `mask_codecs`
+An array of codec configurations that will be applied to encode the mask (boolean array indicating which elements are present).
+The mask codecs are applied in sequence as a codec chain.
+
+### `data_codecs`
+An array of codec configurations that will be applied to encode the data (flattened bytes of only the valid/present elements).
+The data codecs are applied in sequence as a codec chain.
+
+## Example
+
+For example, the array metadata below specifies the optional codec with `packbits` serialisation for the mask, and `gzip` compression for the data:
+
+```json
+{
+  "data_type": {
+    "name": "optional",
+    "configuration": {
+      "name": "uint8",
+      "configuration": {}
+    }
+  },
+  "fill_value": null,
+  "codecs": [
+    {
+      "name": "optional",
+      "configuration": {
+        "mask_codecs": [
+          {
+            "name": "packbits"
+          }
+        ],
+        "data_codecs": [
+          {
+            "name": "bytes"
+          },
+          {
+            "name": "gzip",
+            "configuration": {
+              "level": 5
+            }
+          }
+        ]
+      }
+    }
+  ]
+}
+```
+
+Example `optional` data can be found in the [examples](./examples/) subdirectory.
+
+## Format and algorithm
+
+This is an `array -> bytes` codec.
+
+This codec is only compatible with the [`optional`](../../data-types/optional/README.md) data type.
+
+The optional codec separates encoding of the mask (boolean array) and the data (flattened bytes, excluding missing elements).
+The mask and data are encoded through independent codec chains specified by the `mask_codecs` and `data_codecs` configuration parameters.
+This allows for efficient storage when many elements are missing, and enables independent compression strategies for mask and data.
+
+**Note**: The in-memory representation of optional data arrays is an implementation detail.
+Implementations MAY choose any suitable representation for handling nullable values in memory (e.g., separate mask and data arrays or nullable object wrappers), as long as the encoding and decoding follow the specified format.
+
+### Encoded format
+
+The encoded format consists of:
+1. 8 bytes: mask length (u64 little-endian) - the number of bytes in the encoded mask.
+2. 8 bytes: data length (u64 little-endian) - the number of bytes in the encoded data.
+3. N bytes: encoded mask - the result of applying the mask codec chain to the boolean mask.
+4. M bytes: encoded data - the result of applying the data codec chain to the flattened bytes of only valid elements.
+
+### Algorithm
+
+**Encoding:**
+1. Create a boolean mask array indicating which elements are present (not null).
+2. Extract only the valid (non-null) elements into a flattened data array.
+3. Apply the mask codec chain to the mask.
+4. Apply the data codec chain to the flattened data bytes.
+5. Write the lengths and encoded data in the specified format.
+
+**Decoding:**
+1. Read the mask length and data length from the first 16 bytes.
+2. Read and decode the mask using the mask codec chain.
+3. Read and decode the flattened data using the data codec chain.
+4. Reconstruct the masked array into a suitable in-memory format, with null values where the mask is false.
+
+## Compatible Implementations
+
+* zarrs (Rust implementation)
+
+## Change log
+
+No changes yet.
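
The encoded format and the encoding/decoding steps in the "Format and algorithm" section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the zarrs implementation: a codec chain is modelled as a plain `bytes -> bytes` function, the in-memory mask uses one byte per element, and the function names (`encode_optional`, `decode_optional`, `identity`) are hypothetical. The stand-in chains used here (identity and `zlib`) are not the `packbits` or `gzip` codecs, so the output bytes will not match the example chunk files.

```python
import struct
import zlib
from typing import Callable, List, Optional, Sequence

# Assumption of this sketch: a codec chain is a plain bytes -> bytes function.
Chain = Callable[[bytes], bytes]


def encode_optional(values: Sequence[Optional[int]],
                    encode_mask: Chain, encode_data: Chain) -> bytes:
    """Encode a flattened chunk of optional uint8 values."""
    # 1. Boolean mask, one byte per element: 1 = present, 0 = null.
    mask = bytes(0 if v is None else 1 for v in values)
    # 2. Only the valid elements, flattened to bytes (uint8: one byte each).
    data = bytes(v for v in values if v is not None)
    # 3-4. Each part goes through its own codec chain.
    encoded_mask = encode_mask(mask)
    encoded_data = encode_data(data)
    # 5. Two u64 little-endian lengths, then the encoded mask, then the data.
    header = struct.pack("<QQ", len(encoded_mask), len(encoded_data))
    return header + encoded_mask + encoded_data


def decode_optional(buf: bytes,
                    decode_mask: Chain, decode_data: Chain) -> List[Optional[int]]:
    """Invert encode_optional, returning None for missing elements."""
    # 1. The first 16 bytes hold the mask length and the data length.
    mask_len, data_len = struct.unpack_from("<QQ", buf, 0)
    # 2-3. Slice out and decode each part with its own chain.
    mask = decode_mask(buf[16:16 + mask_len])
    data = decode_data(buf[16 + mask_len:16 + mask_len + data_len])
    # 4. Re-insert nulls wherever the mask is false.
    present = iter(data)
    return [next(present) if bit else None for bit in mask]


def identity(b: bytes) -> bytes:
    """Stand-in chain that leaves the bytes unchanged."""
    return b


chunk = [0, None, None, 5]
encoded = encode_optional(chunk, identity, zlib.compress)
decoded = decode_optional(encoded, identity, zlib.decompress)
print(decoded)  # [0, None, None, 5]
```

Because the two lengths are stored up front, a decoder can locate the encoded mask and the encoded data without knowing anything about either chain, which is what lets the mask and data use unrelated compression strategies.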
+
+## Current maintainers
+
+* Lachlan Deakin ([@LDeakin](https://github.com/LDeakin))
diff --git a/codecs/optional/examples/README.md b/codecs/optional/examples/README.md
new file mode 100644
index 0000000..8d90630
--- /dev/null
+++ b/codecs/optional/examples/README.md
@@ -0,0 +1,32 @@
+# Optional codec example data
+
+This directory contains example Zarr arrays demonstrating the use of the `optional` codec and `optional` data type.
+
+### `array_optional.zarr`
+
+This Zarr array uses the `optional` codec to encode an array of optional `uint8` values.
+
+The fill value is set to `null`, representing missing elements.
+The array contains the elements below (`N` marks missing values):
+
+```text
+ 0 N 2 3
+ N 5 N 7
+ 8 9 N N
+12 N N N
+```
+
+### `array_optional_nested.zarr`
+
+This Zarr array demonstrates nesting of the `optional` codec/data type.
+It encodes an array of optional optional `uint8` values, requiring two layers of the `optional` codec and data type.
+
+The fill value is `[null]`, representing missing inner optional elements.
+The array contains the elements below (`N` marks missing values, `SN` marks missing inner optional values):
+
+```text
+ N SN  2  3
+ N  5  N  7
+SN SN  N  N
+SN SN  N  N
+```
\ No newline at end of file
diff --git a/codecs/optional/examples/array_optional.zarr/array/c/0/0 b/codecs/optional/examples/array_optional.zarr/array/c/0/0
new file mode 100644
index 0000000..9284870
Binary files /dev/null and b/codecs/optional/examples/array_optional.zarr/array/c/0/0 differ
diff --git a/codecs/optional/examples/array_optional.zarr/array/c/0/1 b/codecs/optional/examples/array_optional.zarr/array/c/0/1
new file mode 100644
index 0000000..132e464
Binary files /dev/null and b/codecs/optional/examples/array_optional.zarr/array/c/0/1 differ
diff --git a/codecs/optional/examples/array_optional.zarr/array/c/1/0 b/codecs/optional/examples/array_optional.zarr/array/c/1/0
new file mode 100644
index 0000000..f6ca314
Binary files /dev/null and b/codecs/optional/examples/array_optional.zarr/array/c/1/0 differ
diff --git a/codecs/optional/examples/array_optional.zarr/array/zarr.json b/codecs/optional/examples/array_optional.zarr/array/zarr.json
new file mode 100644
index 0000000..39a58c4
--- /dev/null
+++ b/codecs/optional/examples/array_optional.zarr/array/zarr.json
@@ -0,0 +1,58 @@
+{
+  "zarr_format": 3,
+  "node_type": "array",
+  "shape": [
+    4,
+    4
+  ],
+  "data_type": {
+    "name": "optional",
+    "configuration": {
+      "name": "uint8",
+      "configuration": {}
+    }
+  },
+  "chunk_grid": {
+    "name": "regular",
+    "configuration": {
+      "chunk_shape": [
+        2,
+        2
+      ]
+    }
+  },
+  "chunk_key_encoding": {
+    "name": "default",
+    "configuration": {
+      "separator": "/"
+    }
+  },
+  "fill_value": null,
+  "codecs": [
+    {
+      "name": "optional",
+      "configuration": {
+        "mask_codecs": [
+          {
+            "name": "packbits"
+          }
+        ],
+        "data_codecs": [
+          {
+            "name": "bytes",
+            "configuration": {
+              "endian": "little"
+            }
+          }
+        ]
+      }
+    }
+  ],
+  "attributes": {
+    "description": "A 4x4 array of optional uint8 values with some missing data.\nN marks missing (`None`=`null`) values:\n 0 N 2 3 \n N 5 N 7 \n 8 9 N N \n12 N N N"
+  },
+  "dimension_names": [
+    "y",
+    "x"
+  ]
+}
\ No newline at end of file
diff --git a/codecs/optional/examples/array_optional_nested.zarr/array/c/0/0 b/codecs/optional/examples/array_optional_nested.zarr/array/c/0/0
new file mode 100644
index 0000000..84124b6
Binary files /dev/null and b/codecs/optional/examples/array_optional_nested.zarr/array/c/0/0 differ
diff --git a/codecs/optional/examples/array_optional_nested.zarr/array/c/0/1 b/codecs/optional/examples/array_optional_nested.zarr/array/c/0/1
new file mode 100644
index 0000000..c762b01
Binary files /dev/null and b/codecs/optional/examples/array_optional_nested.zarr/array/c/0/1 differ
diff --git a/codecs/optional/examples/array_optional_nested.zarr/array/c/1/1 b/codecs/optional/examples/array_optional_nested.zarr/array/c/1/1
new file mode 100644
index 0000000..250c80a
Binary files /dev/null and b/codecs/optional/examples/array_optional_nested.zarr/array/c/1/1 differ
diff --git a/codecs/optional/examples/array_optional_nested.zarr/array/zarr.json b/codecs/optional/examples/array_optional_nested.zarr/array/zarr.json
new file mode 100644
index 0000000..771a714
--- /dev/null
+++ b/codecs/optional/examples/array_optional_nested.zarr/array/zarr.json
@@ -0,0 +1,75 @@
+{
+  "zarr_format": 3,
+  "node_type": "array",
+  "shape": [
+    4,
+    4
+  ],
+  "data_type": {
+    "name": "optional",
+    "configuration": {
+      "name": "optional",
+      "configuration": {
+        "name": "uint8",
+        "configuration": {}
+      }
+    }
+  },
+  "chunk_grid": {
+    "name": "regular",
+    "configuration": {
+      "chunk_shape": [
+        2,
+        2
+      ]
+    }
+  },
+  "chunk_key_encoding": {
+    "name": "default",
+    "configuration": {
+      "separator": "/"
+    }
+  },
+  "fill_value": [
+    null
+  ],
+  "codecs": [
+    {
+      "name": "optional",
+      "configuration": {
+        "mask_codecs": [
+          {
+            "name": "packbits"
+          }
+        ],
+        "data_codecs": [
+          {
+            "name": "optional",
+            "configuration": {
+              "mask_codecs": [
+                {
+                  "name": "packbits"
+                }
+              ],
+              "data_codecs": [
+                {
+                  "name": "bytes",
+                  "configuration": {
+                    "endian": "little"
+                  }
+                }
+              ]
+            }
+          }
+        ]
+      }
+    }
+  ],
+  "attributes": {
+    "description": "A 4x4 array of optional optional uint8 values with some missing data.\nThe fill value is null on the inner optional layer, i.e. Some(None).\nN marks missing (`None`=`null`) values. SN marks `Some(None)`=`[null]` values:\n N SN 2 3 \n N 5 N 7 \n SN SN N N \n SN SN N N"
+  },
+  "dimension_names": [
+    "y",
+    "x"
+  ]
+}
\ No newline at end of file
diff --git a/codecs/optional/schema.json b/codecs/optional/schema.json
new file mode 100644
index 0000000..9f4864f
--- /dev/null
+++ b/codecs/optional/schema.json
@@ -0,0 +1,52 @@
+{
+  "$schema": "https://json-schema.org/draft/2020-12/schema",
+  "type": "object",
+  "properties": {
+    "name": {
+      "const": "optional"
+    },
+    "configuration": {
+      "type": "object",
+      "properties": {
+        "mask_codecs": {
+          "type": "array",
+          "items": {
+            "type": "object",
+            "properties": {
+              "name": {
+                "type": "string"
+              },
+              "configuration": {
+                "type": "object"
+              }
+            },
+            "required": ["name"],
+            "additionalProperties": false
+          },
+          "minItems": 1
+        },
+        "data_codecs": {
+          "type": "array",
+          "items": {
+            "type": "object",
+            "properties": {
+              "name": {
+                "type": "string"
+              },
+              "configuration": {
+                "type": "object"
+              }
+            },
+            "required": ["name"],
+            "additionalProperties": false
+          },
+          "minItems": 1
+        }
+      },
+      "required": ["mask_codecs", "data_codecs"],
+      "additionalProperties": false
+    }
+  },
+  "required": ["name", "configuration"],
+  "additionalProperties": false
+}
diff --git a/data-types/optional/README.md b/data-types/optional/README.md
new file mode 100644
index 0000000..84b820c
--- /dev/null
+++ b/data-types/optional/README.md
@@ -0,0 +1,80 @@
+# Optional data type
+
+Defines a data type for optional (nullable) values, each of which either holds a value of a specified underlying data type or is missing/null.
+
+This data type is designed for the [`optional`](../../codecs/optional/README.md) codec, which separately encodes a validity mask and the data.
+
+While array-to-array codecs MAY support the `optional` data type, implementations SHOULD use the `optional` codec as the sole top-level codec.
+This approach is preferred because the codecs contained within the `optional` codec configuration do not need to explicitly handle optional data type semantics.
+
+## Configuration parameters
+
+### `name`
+The name of the underlying data type.
+This can be any valid Zarr data type name.
+
+### `configuration`
+The configuration object for the underlying data type.
+This should match the configuration requirements of the specified underlying data type.
+
+## Permitted fill values
+
+The value of the `fill_value` metadata key MUST be `null` or a single-element array containing any valid fill value of the underlying data type.
+
+- A `null` fill value represents the absence of a value (missing element).
+- A single-element array fill value represents a valid value of the underlying data type.
+
+For nested optional types, this representation is applied recursively.
+
+The table below demonstrates valid `data_type` and `fill_value` combinations with an `optional` and nested `optional` data type, along with their equivalent Rust [`Option`](https://doc.rust-lang.org/std/option/) values.
+
+<table>
+<tr>
+<th><code>"data_type"</code></th>
+<th><code>"fill_value"</code></th>
+<th>Rust value</th>
+</tr>
+<tr>
+<td rowspan="2">
+<pre>
+{
+  "name": "optional",
+  "configuration": {
+    "name": "uint8"
+  }
+}
+</pre>
+</td>
+<td><code>null</code></td>
+<td><code>None</code></td>
+</tr>
+<tr>
+<td><code>[42]</code></td>
+<td><code>Some(42)</code></td>
+</tr>
+<tr>
+<td rowspan="3">
+<pre>
+{
+  "name": "optional",
+  "configuration": {
+    "name": "optional",
+    "configuration": {
+      "name": "uint8"
+    }
+  }
+}
+</pre>
+</td>
+<td><code>null</code></td>
+<td><code>None</code></td>
+</tr>
+<tr>
+<td><code>[null]</code></td>
+<td><code>Some(None)</code></td>
+</tr>
+<tr>
+<td><code>[[42]]</code></td>
+<td><code>Some(Some(42))</code></td>
+</tr>
+</table>
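
The recursive rule behind the table can be sketched as a small Python function. This is only an illustration: the function name `rust_option_repr` is hypothetical, the output is a Rust-style string rather than a real `Option` value, and the sketch does not validate the innermost value against the underlying data type's own fill value rules.

```python
import json
from typing import Any


def rust_option_repr(data_type: dict, fill_value: Any) -> str:
    """Render a JSON fill value as the equivalent Rust-style Option expression."""
    if data_type["name"] != "optional":
        # Base case: a non-optional underlying type. The value is passed
        # through unchecked (an assumption of this sketch).
        return json.dumps(fill_value)
    if fill_value is None:
        # `null`: the element is missing at this layer.
        return "None"
    if isinstance(fill_value, list) and len(fill_value) == 1:
        # Single-element array: unwrap one layer, then recurse into the
        # underlying data type described by `configuration`.
        inner_type = data_type["configuration"]
        return f"Some({rust_option_repr(inner_type, fill_value[0])})"
    raise ValueError(f"invalid fill value for optional data type: {fill_value!r}")


optional_u8 = json.loads('{"name": "optional", "configuration": {"name": "uint8"}}')
nested = {"name": "optional", "configuration": optional_u8}

for dtype, fill in [(optional_u8, None), (optional_u8, [42]),
                    (nested, None), (nested, [None]), (nested, [[42]])]:
    print(json.dumps(fill), "->", rust_option_repr(dtype, fill))
```

Each layer of `optional` in the data type consumes exactly one layer of array wrapping in the fill value, which is why `[null]` is only meaningful for the nested type.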
+
+
+## Example
+
+See the [`optional` codec](../../codecs/optional/README.md) for an example of how to configure the optional data type within array metadata.
+
+## Compatible Implementations
+
+* zarrs (Rust implementation)
+
+## Change log
+
+No changes yet.
+
+## Current maintainers
+
+* Lachlan Deakin ([@LDeakin](https://github.com/LDeakin))
diff --git a/data-types/optional/schema.json b/data-types/optional/schema.json
new file mode 100644
index 0000000..a80ab8b
--- /dev/null
+++ b/data-types/optional/schema.json
@@ -0,0 +1,28 @@
+{
+  "$schema": "https://json-schema.org/draft/2020-12/schema",
+  "oneOf": [
+    {
+      "type": "object",
+      "properties": {
+        "name": {
+          "const": "optional"
+        },
+        "configuration": {
+          "type": "object",
+          "properties": {
+            "name": {
+              "type": "string"
+            },
+            "configuration": {
+              "type": "object"
+            }
+          },
+          "required": ["name"],
+          "additionalProperties": false
+        }
+      },
+      "required": ["name", "configuration"],
+      "additionalProperties": false
+    }
+  ]
+}
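
For readers who want to see what the data type schema above accepts, its constraints can be mirrored by hand in Python. This is a rough sketch, not a JSON Schema validator: the function name `is_optional_data_type` is hypothetical, and like the schema it only inspects the outermost `optional` layer without recursing into the underlying data type.

```python
from typing import Any


def is_optional_data_type(obj: Any) -> bool:
    """Hand-written mirror of the constraints in data-types/optional/schema.json."""
    # Top level: an object with exactly the keys "name" and "configuration".
    if not isinstance(obj, dict) or set(obj) != {"name", "configuration"}:
        return False
    # "name" must be the constant "optional".
    if obj["name"] != "optional":
        return False
    cfg = obj["configuration"]
    # "configuration": an object that requires "name" and allows no keys
    # other than "name" and "configuration".
    if not isinstance(cfg, dict) or "name" not in cfg:
        return False
    if not set(cfg) <= {"name", "configuration"}:
        return False
    if not isinstance(cfg["name"], str):
        return False
    # The inner "configuration" may be omitted, but must be an object if present.
    return isinstance(cfg.get("configuration", {}), dict)


optional_u8 = {"name": "optional", "configuration": {"name": "uint8"}}
nested = {"name": "optional", "configuration": optional_u8}
print(is_optional_data_type(optional_u8), is_optional_data_type(nested))
```

Under this reading of the schema, the inner `configuration` member is optional, so both spellings of the `uint8` case that appear in this change (with and without `"configuration": {}`) are accepted.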