@mkitti
Contributor

@mkitti mkitti commented Oct 29, 2025

This pull request introduces the pad codec, evolved from an initial proposal for an "offset codec" as discussed in the comments. The pad codec is a bytes->bytes codec designed to add or remove fixed-size padding at either the start or end of a chunk.

Motivation and Functionality

The pad codec is configured with:

  • location: Specifies whether padding is at the "start" or "end" of the chunk.
  • nbytes: The number of bytes of padding.
  • padding: An optional base64-encoded string defining the literal bytes for padding. If not provided, zeros are used.

This provides a flexible mechanism for handling padding in various scenarios:

  • Skipping headers/footers: It can ignore leading or trailing bytes when reading from foreign data formats (e.g., N5 datasets with their specific chunk headers).
  • Prepending/Appending static content: It can add fixed headers (e.g., a TIFF header to make each chunk a valid TIFF file) or footers when writing.

To apply padding at both the start and end of a chunk, the pad codec can be applied twice in the codec pipeline.
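
A minimal sketch of the encode and decode behavior described above, not the PR's implementation. The function names `pad_encode` and `pad_decode` are hypothetical, and the sketch assumes that `padding`, when given, decodes to exactly `nbytes` bytes and that decoding simply discards the padding without validating its content.

```python
import base64
from typing import Optional


def _padding_bytes(nbytes: int, padding: Optional[str]) -> bytes:
    """Return the literal padding: decoded base64, or zeros if not provided."""
    if padding is None:
        return bytes(nbytes)
    raw = base64.b64decode(padding)
    if len(raw) != nbytes:
        # Assumption: the decoded padding must match nbytes exactly.
        raise ValueError(f"padding decodes to {len(raw)} bytes, expected {nbytes}")
    return raw


def pad_encode(chunk: bytes, location: str, nbytes: int,
               padding: Optional[str] = None) -> bytes:
    """Add padding at the start or end of a chunk when writing."""
    pad = _padding_bytes(nbytes, padding)
    if location == "start":
        return pad + chunk
    if location == "end":
        return chunk + pad
    raise ValueError(f"unknown location: {location!r}")


def pad_decode(encoded: bytes, location: str, nbytes: int) -> bytes:
    """Remove (skip) the padding when reading; its content is ignored."""
    if len(encoded) < nbytes:
        raise ValueError("encoded chunk is shorter than the padding")
    if location == "start":
        return encoded[nbytes:]
    if location == "end":
        # Slice by explicit length so that nbytes == 0 returns all bytes.
        return encoded[:len(encoded) - nbytes]
    raise ValueError(f"unknown location: {location!r}")
```

Because decoding only needs `location` and `nbytes`, a reader can skip a foreign header or footer without knowing what bytes it contains.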

Changes in this Pull Request

This PR converts the initially proposed offset codec into the more general pad codec, incorporating feedback and expanding its capabilities. Key changes include:

  • Renaming the codec, its directory, and associated files from offset to pad.
  • Updating schema.json and README.md to reflect the new pad codec configuration and functionality.
  • Updating the examples (TIFF, custom header, N5) to use the pad codec, while retaining the detailed explanations.
  • Modifying generate_tiff_header.py for consistency with padding terminology.

@jbms
Contributor

jbms commented Oct 29, 2025

Maybe call this skip_bytes or padding_bytes?

@jbms
Contributor

jbms commented Oct 29, 2025

This could support both prefix and suffix.

@d-v-b
Contributor

d-v-b commented Oct 29, 2025

another name idea, if prefix and suffix are added: inset

@mkitti
Contributor Author

mkitti commented Oct 29, 2025

I may need to add another parameter to control how to handle existing chunks. Should the codec open the chunk, seek to the offset, and use the existing header OR should it always write zeros or the defined prefix?

My current thought is that the lack of a defined prefix means that zeros should be prepended if a chunk does not exist, but that the header of an existing chunk should be preserved. If there is a specified prefix, then that prefix should always be written as the header, even if the chunk exists.

Some implementations may find the write behavior problematic if the underlying storage does not allow seeking before writing. In this case, a read may be necessary before writing.
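
The rule proposed above could be sketched roughly as follows. The names `choose_header` and `write_chunk` are hypothetical, and the fallback to zeros when an existing chunk is shorter than the header is an assumption, not something settled in this thread.

```python
from typing import Optional


def choose_header(nbytes: int, prefix: Optional[bytes],
                  existing_chunk: Optional[bytes]) -> bytes:
    """Pick the header to write, following the rule proposed in this comment."""
    if prefix is not None:
        # A configured prefix always wins, even if the chunk already exists.
        return prefix
    if existing_chunk is not None and len(existing_chunk) >= nbytes:
        # No prefix configured: preserve the header of the existing chunk.
        # On stores without seek support this implies a read before the write.
        return existing_chunk[:nbytes]
    # No prefix and no usable existing chunk: prepend zeros (assumption for
    # the case where the existing chunk is shorter than nbytes).
    return bytes(nbytes)


def write_chunk(payload: bytes, nbytes: int,
                prefix: Optional[bytes] = None,
                existing_chunk: Optional[bytes] = None) -> bytes:
    """Return the bytes that would be stored for this chunk."""
    return choose_header(nbytes, prefix, existing_chunk) + payload
```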

> Maybe call this skip_bytes or padding_bytes?

> another name idea, if prefix and suffix are added: inset

skip_bytes would probably make sense for the codec as written now. That name seems to come from the reading perspective rather than writing. When writing a new chunk, there are not really any bytes to "skip", yet. Thus, I would lean towards padding_bytes or inset.

I have proposed another extension called "suffix" that is a chunk key encoding. I also note the name collision between the conditional codec and optional data type. Perhaps we need some extension naming conventions to avoid such collisions or is the distinct nature of the extensions sufficient?

> This could support both prefix and suffix.

Appending a suffix could be another codec. The CRC32c checksum codec could be considered as a specific form of an appending codec where the last four bytes could be ignored as they are not needed to interpret the preceding bytes.
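
To illustrate the comparison, here is a rough sketch of a checksum treated as a four-byte suffix that a reader may drop without inspecting it. The pure-Python `crc32c` below uses the Castagnoli polynomial in reflected form; storing the checksum as little-endian is an assumption for this sketch, and `append_crc32c` and `strip_suffix` are hypothetical names.

```python
import struct


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def append_crc32c(payload: bytes) -> bytes:
    """Append the checksum as a 4-byte suffix (little-endian is assumed)."""
    return payload + struct.pack("<I", crc32c(payload))


def strip_suffix(encoded: bytes, nbytes: int = 4) -> bytes:
    """Ignore the trailing nbytes, as an 'end' padding decoder would."""
    if len(encoded) < nbytes:
        raise ValueError("encoded data is shorter than the suffix")
    return encoded[:len(encoded) - nbytes]
```

A reader that calls only `strip_suffix` recovers the payload without verifying the checksum, which is the sense in which the last four bytes can be ignored.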

Alternatively, we could make it so that a "negative" offset refers to bytes to skip at the end of a byte sequence. In this case, we may need to rename the "prefix" parameter to "padding". To add both a header and a footer to a file we would just use two offset codecs, one with a positive offset and one with a negative offset.
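
A small sketch of the signed-offset idea, under the assumption that a positive offset counts bytes from the start and a negative offset counts bytes from the end. Both function names are hypothetical.

```python
from typing import Optional


def add_signed_padding(data: bytes, offset: int,
                       padding: Optional[bytes] = None) -> bytes:
    """Write side: add |offset| bytes before (offset > 0) or after (offset < 0)."""
    pad = padding if padding is not None else bytes(abs(offset))
    if len(pad) != abs(offset):
        raise ValueError("padding length must equal abs(offset)")
    return pad + data if offset >= 0 else data + pad


def skip_signed_offset(data: bytes, offset: int) -> bytes:
    """Read side: drop |offset| bytes from the start or from the end."""
    if abs(offset) > len(data):
        raise ValueError("offset exceeds data length")
    if offset >= 0:
        return data[offset:]
    return data[:len(data) + offset]


# Two codecs, one with a positive and one with a negative offset.
framed = add_signed_padding(add_signed_padding(b"body", 3, b"HDR"), -3, b"FTR")
payload = skip_signed_offset(skip_signed_offset(framed, -3), 3)
```

One limitation of this encoding is that an offset of zero cannot say which end it refers to, which is one argument for an explicit location parameter.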

A suffix would be able to better support other foreign image file formats. In particular, I was thinking about PNG files where you would need a terminal IEND chunk. Another case might be ZIP shards.

This might overlap with the sharding codec in that we are effectively defining the byte offset and nbytes for a single chunk shard. An alternative here would be to consider extensions to the sharding codec where we define the index either in an external key (another file) or have a fixed index common to all shards defined in the codec configuration.

One difference from the sharding codec is that we define a byte offset from the beginning and an offset from the end rather than the size of the "inset". In composition with the sharding-indexed codec, this would also allow the shard index to not have to exist at the exact beginning or end of the file.

@mkitti
Contributor Author

mkitti commented Oct 29, 2025

Another name I just thought of for a combined prefix and suffix codec is "byte_range". We could then support syntax similar to the HTTP Range header:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Range

Range: <unit>=<range-start>-
Range: <unit>=<range-start>-<range-end>
Range: <unit>=<range-start>-<range-end>, …, <range-startN>-<range-endN>
Range: <unit>=-<suffix-length>
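
As a rough illustration of how a hypothetical "byte_range" codec might interpret such a specification, the sketch below resolves a single-range `bytes` spec into a half-open `(start, stop)` pair. The function name is an assumption, and multi-range specs and other units are deliberately not handled.

```python
import re
from typing import Tuple

_RANGE_RE = re.compile(r"^bytes=(\d*)-(\d*)$")


def resolve_byte_range(spec: str, length: int) -> Tuple[int, int]:
    """Resolve a single-range spec against `length` bytes of data."""
    match = _RANGE_RE.match(spec)
    if match is None:
        raise ValueError(f"unsupported range spec: {spec!r}")
    first, last = match.groups()
    if first == "" and last == "":
        raise ValueError("range needs a start or a suffix length")
    if first == "":
        # "bytes=-N" selects the final N bytes.
        return max(length - int(last), 0), length
    start = int(first)
    # HTTP range ends are inclusive; convert to an exclusive stop.
    stop = length if last == "" else min(int(last) + 1, length)
    if start >= stop:
        raise ValueError("range is not satisfiable")
    return start, stop
```

Under HTTP semantics, `bytes=120-` skips a 120-byte header, but `bytes=-96` selects the final 96 bytes rather than skipping them, so excluding a footer would need an explicit range end.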

@jbms
Contributor

jbms commented Oct 29, 2025

Re name collision: technically it is fine since they are different namespaces, and indeed we already have a collision between the bytes codec and the bytes data type, which are quite unrelated and in fact incompatible, but it would be better to avoid collisions to reduce confusion.

Re preserving existing prefix/suffix content: that is outside of the existing capabilities of a codec. That would likely require changes to codec APIs in existing implementations.

@mkitti
Contributor Author

mkitti commented Oct 30, 2025

I'm currently leaning towards supporting both prefix and suffix, but not at the same time within one instance of the codec. To add ignored bytes (padding) to both the beginning and the end, one would need to apply the codec twice, as follows:

"codecs": [
    { "name": "bytes" },
    {
        "name": "pad",
        "configuration": {
            "location": "start",
            "nbytes": 120,
            "padding": "<Base64-encoded string>"
        }
    },
    {
        "name": "pad",
        "configuration": {
            "location": "end",
            "nbytes": 96,
            "padding": "<Base64-encoded string>"
        }
    }
]
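
A rough sketch of how applying the codec twice could compose, assuming encoders run in listed order and decoders run in reverse. The helper names are hypothetical, the `location` values follow the `start`/`end` names from the PR description, and `padding` is omitted here so that zeros are used.

```python
import base64
from typing import Dict, List


def apply_pad(data: bytes, config: Dict) -> bytes:
    """Encode step for one pad codec instance."""
    nbytes = config["nbytes"]
    b64 = config.get("padding")
    pad = base64.b64decode(b64) if b64 is not None else bytes(nbytes)
    if len(pad) != nbytes:
        raise ValueError("padding length must equal nbytes")
    if config["location"] == "start":
        return pad + data
    if config["location"] == "end":
        return data + pad
    raise ValueError(f"unknown location: {config['location']!r}")


def strip_pad(data: bytes, config: Dict) -> bytes:
    """Decode step for one pad codec instance."""
    nbytes = config["nbytes"]
    if config["location"] == "start":
        return data[nbytes:]
    return data[:len(data) - nbytes]


def encode(data: bytes, pad_configs: List[Dict]) -> bytes:
    for config in pad_configs:
        data = apply_pad(data, config)
    return data


def decode(data: bytes, pad_configs: List[Dict]) -> bytes:
    # Undo the codecs in reverse order of application.
    for config in reversed(pad_configs):
        data = strip_pad(data, config)
    return data


PAD_CONFIGS = [
    {"location": "start", "nbytes": 120},
    {"location": "end", "nbytes": 96},
]
```

With these two instances, a 5-byte chunk would be stored as 221 bytes: 120 bytes of header, the chunk, then 96 bytes of footer.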

This commit converts the existing offset codec into a more general pad codec.

Changes include:
- Renamed the codecs/offset directory to codecs/pad.
- Updated schema.json to define location (start/end), nbytes, and padding parameters.
- Rewrote README.md to describe the pad codec's functionality and updated examples, including re-inserting detailed N5 explanations.
- Modified generate_tiff_header.py to use padding terminology.
@mkitti mkitti changed the title from "feat: add offset codec" to "feat: Convert offset codec to pad codec" Nov 24, 2025
@mkitti
Contributor Author

mkitti commented Nov 24, 2025

This has been converted to the "pad" codec.

3 participants