109 changes: 109 additions & 0 deletions codecs/optional/README.md
@@ -0,0 +1,109 @@
# Optional codec

Defines an `array -> bytes` codec that encodes optional (nullable) data by encoding the validity mask and the data separately.

This codec is designed for the `optional` data type, which represents nullable values of any underlying data type.

## Codec name

The value of the `name` member in the codec object MUST be `optional`.

## Configuration parameters

### `mask_codecs`
An array of codec configurations that will be applied to encode the mask (boolean array indicating which elements are present).
The mask codecs are applied in sequence as a codec chain.

### `data_codecs`
An array of codec configurations that will be applied to encode the data (flattened bytes of only the valid/present elements).
The data codecs are applied in sequence as a codec chain.

## Example

For example, the array metadata below specifies the optional codec with `packbits` serialisation for the mask, and `gzip` compression for the data:

```json
{
  "data_type": {
    "name": "optional",
    "configuration": {
      "name": "uint8",
      "configuration": {}
    }
  },
  "fill_value": null,
  "codecs": [
    {
      "name": "optional",
      "configuration": {
        "mask_codecs": [
          {
            "name": "packbits"
          }
        ],
        "data_codecs": [
          {
            "name": "bytes"
          },
          {
            "name": "gzip",
            "configuration": {
              "level": 5
            }
          }
        ]
      }
    }
  ]
}
```

Example `optional` data can be found in the [examples](./examples/) subdirectory.

## Format and algorithm

This is an `array -> bytes` codec.

This codec is only compatible with the [`optional`](../../data-types/optional/README.md) data type.

The optional codec separates encoding of the mask (boolean array) and the data (flattened bytes, excluding missing elements).
The mask and data are encoded through independent codec chains specified by `mask_codecs` and `data_codecs` configuration parameters.
This allows for efficient storage when many elements are missing, and enables independent compression strategies for mask and data.

**Note**: The in-memory representation of optional data arrays is an implementation detail.
Implementations MAY choose any suitable representation for handling nullable values in memory (e.g., separate mask and data arrays, nullable object wrappers, etc.), as long as the encoding and decoding follow the specified format.

### Encoded format

The encoded format consists of:
1. 8 bytes: mask length (u64 little-endian) - the number of bytes in the encoded mask.
2. 8 bytes: data length (u64 little-endian) - the number of bytes in the encoded data.
3. N bytes: encoded mask - the result of applying the mask codec chain to the boolean mask.
4. M bytes: encoded data - the result of applying the data codec chain to the flattened bytes of only valid elements.
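
The layout above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names `frame` and `unframe` are hypothetical, and the truncation check is an assumption rather than a requirement of this document.

```python
import struct

# Two u64 little-endian lengths: encoded mask length, then encoded data length.
HEADER = struct.Struct("<QQ")


def frame(encoded_mask: bytes, encoded_data: bytes) -> bytes:
    """Prefix the two already-encoded payloads with their byte lengths."""
    header = HEADER.pack(len(encoded_mask), len(encoded_data))
    return header + encoded_mask + encoded_data


def unframe(chunk: bytes) -> tuple[bytes, bytes]:
    """Split an encoded chunk back into (encoded_mask, encoded_data)."""
    mask_len, data_len = HEADER.unpack_from(chunk, 0)
    mask_end = HEADER.size + mask_len
    data_end = mask_end + data_len
    # Assumption: a chunk shorter than the header claims is treated as an error.
    if len(chunk) < data_end:
        raise ValueError("chunk is shorter than the lengths in its header")
    return chunk[HEADER.size:mask_end], chunk[mask_end:data_end]
```

For a one-byte encoded mask and two bytes of encoded data, `frame` produces 16 + 1 + 2 = 19 bytes, and `unframe` recovers the two payloads unchanged.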

### Algorithm

**Encoding:**
1. Create a boolean mask array indicating which elements are present (not null).
2. Extract only the valid (non-null) elements into a flattened data array.
3. Apply the mask codec chain to the mask.
4. Apply the data codec chain to the flattened data bytes.
5. Write the lengths and encoded data in the specified format.
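
As a rough sketch of the encoding steps for optional `uint8` values, the function below takes the two codec chains as plain callables. Everything here is illustrative: `encode_optional_uint8`, `one_byte_per_flag`, and `identity` are hypothetical names, and `one_byte_per_flag` is a stand-in for a real mask codec chain, not the `packbits` format.

```python
import struct
from typing import Callable, Optional, Sequence


def one_byte_per_flag(mask: Sequence[bool]) -> bytes:
    # Stand-in mask codec chain (assumption): one byte per element, 1 = present.
    return bytes(1 if present else 0 for present in mask)


def identity(data: bytes) -> bytes:
    # Stand-in data codec chain (assumption): uint8 bytes passed through as-is.
    return data


def encode_optional_uint8(
    values: Sequence[Optional[int]],
    mask_chain: Callable[[Sequence[bool]], bytes],
    data_chain: Callable[[bytes], bytes],
) -> bytes:
    # Step 1: boolean mask, True where an element is present.
    mask = [v is not None for v in values]
    # Step 2: only the valid elements, flattened (one byte each for uint8).
    valid = bytes(v for v in values if v is not None)
    # Steps 3 and 4: apply the independent codec chains.
    encoded_mask = mask_chain(mask)
    encoded_data = data_chain(valid)
    # Step 5: two u64 little-endian lengths, then mask bytes, then data bytes.
    header = struct.pack("<QQ", len(encoded_mask), len(encoded_data))
    return header + encoded_mask + encoded_data


chunk = encode_optional_uint8([0, None, None, 5], one_byte_per_flag, identity)
```

With these stand-in chains, the four-element input yields a 4-byte mask and 2 bytes of data, so `chunk` is 22 bytes long. Note that the missing elements contribute nothing to the data section.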

**Decoding:**
1. Read the mask length and data length from the first 16 bytes.
2. Read and decode the mask using the mask codec chain.
3. Read and decode the flattened data using the data codec chain.
4. Reconstruct the masked array into a suitable in-memory format, with null values where the mask is false.
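
The decoding steps can be sketched the same way. This is an illustration under stated assumptions: `decode_optional_uint8` and `flags_from_bytes` are hypothetical names, the mask decoder is a one-byte-per-element stand-in rather than `packbits`, and a Python list with `None` is just one possible in-memory format.

```python
import struct
from typing import Callable, List, Optional


def flags_from_bytes(encoded_mask: bytes) -> List[bool]:
    # Stand-in mask chain decoder (assumption): one byte per element, non-zero = present.
    return [b != 0 for b in encoded_mask]


def decode_optional_uint8(
    chunk: bytes,
    mask_chain_decode: Callable[[bytes], List[bool]],
    data_chain_decode: Callable[[bytes], bytes],
) -> List[Optional[int]]:
    # Step 1: the first 16 bytes hold the mask and data lengths (u64 little-endian).
    mask_len, data_len = struct.unpack_from("<QQ", chunk, 0)
    mask_start = 16
    data_start = mask_start + mask_len
    # Step 2: decode the mask.
    mask = mask_chain_decode(chunk[mask_start:data_start])
    # Step 3: decode the flattened data of the valid elements.
    data = data_chain_decode(chunk[data_start:data_start + data_len])
    if sum(mask) != len(data):
        raise ValueError("number of valid elements does not match the data length")
    # Step 4: place each value where the mask is True, None elsewhere.
    remaining = iter(data)
    return [next(remaining) if present else None for present in mask]


encoded = struct.pack("<QQ", 4, 2) + b"\x01\x00\x00\x01" + b"\x00\x05"
decoded = decode_optional_uint8(encoded, flags_from_bytes, bytes)
# decoded == [0, None, None, 5]
```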

### Compatible Implementations

* zarrs (Rust implementation)

## Change log

No changes yet.

## Current maintainers

* Lachlan Deakin ([@LDeakin](https://github.com/LDeakin))
32 changes: 32 additions & 0 deletions codecs/optional/examples/README.md
@@ -0,0 +1,32 @@
# Optional codec example data

This directory contains example Zarr arrays demonstrating the use of the `optional` codec and `optional` data type.

### `array_optional.zarr`

This Zarr array uses the `optional` codec to encode an array of optional `uint8` values.

The fill value is set to `null`, representing missing elements.
The array contains the following elements (`N` marks missing values):

```text
0 N 2 3
N 5 N 7
8 9 N N
12 N N N
```

### `array_optional_nested.zarr`

This Zarr array demonstrates nesting of the `optional` codec/data type.
It encodes an array of nested optional `uint8` values (an optional of an optional `uint8`), requiring two layers of the `optional` codec and data type.

The fill value is `[null]`, representing missing inner optional elements.
The array contains the following elements (`N` marks missing values, `SN` marks missing inner optional values):

```text
N SN 2 3
N 5 N 7
SN SN N N
SN SN N N
```
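
To illustrate how the two layers relate, the sketch below splits a flat run of nested optional values into an outer mask, an inner mask, and the remaining data bytes. It rests on several assumptions: the in-memory model (`None` for `N`, `(None,)` for `SN`, `(v,)` for a present value) and the name `split_layers` are hypothetical, and the first row of the array is used as a flat list purely for illustration, whereas real encoding operates per chunk.

```python
from typing import List, Optional, Tuple

# Hypothetical in-memory model: None is `N`, (None,) is `SN`, (v,) is a present value.
Nested = Optional[Tuple[Optional[int]]]


def split_layers(values: List[Nested]) -> Tuple[List[bool], List[bool], bytes]:
    # Outer layer: True where the outer optional is present (anything but `N`).
    outer_mask = [v is not None for v in values]
    # Only outer-valid elements are handed to the inner optional layer.
    inner_values = [v[0] for v in values if v is not None]
    # Inner layer: True where the inner optional is present (anything but `SN`).
    inner_mask = [x is not None for x in inner_values]
    # Remaining uint8 values, one byte each.
    data = bytes(x for x in inner_values if x is not None)
    return outer_mask, inner_mask, data


N, SN = None, (None,)
outer_mask, inner_mask, data = split_layers([N, SN, (2,), (3,)])
# outer_mask == [False, True, True, True]
# inner_mask == [False, True, True]
# data == b"\x02\x03"
```

The inner mask is shorter than the outer mask because elements missing at the outer layer never reach the inner layer.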
Binary file not shown.
Binary file not shown.
Binary file not shown.
58 changes: 58 additions & 0 deletions codecs/optional/examples/array_optional.zarr/array/zarr.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
{
  "zarr_format": 3,
  "node_type": "array",
  "shape": [
    4,
    4
  ],
  "data_type": {
    "name": "optional",
    "configuration": {
      "name": "uint8",
      "configuration": {}
    }
  },
  "chunk_grid": {
    "name": "regular",
    "configuration": {
      "chunk_shape": [
        2,
        2
      ]
    }
  },
  "chunk_key_encoding": {
    "name": "default",
    "configuration": {
      "separator": "/"
    }
  },
  "fill_value": null,
  "codecs": [
    {
      "name": "optional",
      "configuration": {
        "mask_codecs": [
          {
            "name": "packbits"
          }
        ],
        "data_codecs": [
          {
            "name": "bytes",
            "configuration": {
              "endian": "little"
            }
          }
        ]
      }
    }
  ],
  "attributes": {
    "description": "A 4x4 array of optional uint8 values with some missing data.\nN marks missing (`None`=`null`) values:\n 0 N 2 3 \n N 5 N 7 \n 8 9 N N \n12 N N N"
  },
  "dimension_names": [
    "y",
    "x"
  ]
}
Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,75 @@
{
  "zarr_format": 3,
  "node_type": "array",
  "shape": [
    4,
    4
  ],
  "data_type": {
    "name": "optional",
    "configuration": {
      "name": "optional",
      "configuration": {
        "name": "uint8",
        "configuration": {}
      }
    }
  },
  "chunk_grid": {
    "name": "regular",
    "configuration": {
      "chunk_shape": [
        2,
        2
      ]
    }
  },
  "chunk_key_encoding": {
    "name": "default",
    "configuration": {
      "separator": "/"
    }
  },
  "fill_value": [
    null
  ],
  "codecs": [
    {
      "name": "optional",
      "configuration": {
        "mask_codecs": [
          {
            "name": "packbits"
          }
        ],
        "data_codecs": [
          {
            "name": "optional",
            "configuration": {
              "mask_codecs": [
                {
                  "name": "packbits"
                }
              ],
              "data_codecs": [
                {
                  "name": "bytes",
                  "configuration": {
                    "endian": "little"
                  }
                }
              ]
            }
          }
        ]
      }
    }
  ],
  "attributes": {
    "description": "A 4x4 array of optional optional uint8 values with some missing data.\nThe fill value is null on the inner optional layer, i.e. Some(None).\nN marks missing (`None`=`null`) values. SN marks `Some(None)`=`[null]` values:\n N SN 2 3 \n N 5 N 7 \n SN SN N N \n SN SN N N"
  },
  "dimension_names": [
    "y",
    "x"
  ]
}
52 changes: 52 additions & 0 deletions codecs/optional/schema.json
@@ -0,0 +1,52 @@
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "name": {
      "const": "optional"
    },
    "configuration": {
      "type": "object",
      "properties": {
        "mask_codecs": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": {
                "type": "string"
              },
              "configuration": {
                "type": "object"
              }
            },
            "required": ["name"],
            "additionalProperties": false
          },
          "minItems": 1
        },
        "data_codecs": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": {
                "type": "string"
              },
              "configuration": {
                "type": "object"
              }
            },
            "required": ["name"],
            "additionalProperties": false
          },
          "minItems": 1
        }
      },
      "required": ["mask_codecs", "data_codecs"],
      "additionalProperties": false
    }
  },
  "required": ["name", "configuration"],
  "additionalProperties": false
}