feat: Convert offset codec to pad codec #36
Maybe call this

This could support both prefix and suffix.

another name idea, if prefix and suffix are added:
I may need to add another parameter to control how to handle existing chunks. Should the codec open the chunk, seek to the offset, and use the existing header, or should it always write zeros or the defined prefix? My current thought is that the lack of a defined prefix means that zeros should be prepended if a chunk does not exist, but that the header of an existing chunk should be preserved. If there is a specified prefix, then that prefix should always be written as the header even if the chunk exists. Some implementations may find the write behavior problematic if the underlying storage does not allow seeking before writing. In this case, a read may be necessary before writing.
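To make the proposed write rule concrete, here is a minimal sketch. The function name `write_chunk_with_prefix` and its parameters are hypothetical and not part of any existing codec API. It also assumes that an existing chunk shorter than the offset is treated as if it had no header, which the comment above does not specify.

```python
from typing import Optional


def write_chunk_with_prefix(
    existing: Optional[bytes],
    payload: bytes,
    offset: int,
    prefix: Optional[bytes] = None,
) -> bytes:
    """Return the bytes to store for a chunk (hypothetical helper).

    existing: current stored bytes of the chunk, or None if it does not exist.
    payload:  newly encoded chunk bytes to place after the header.
    offset:   size of the header region in bytes.
    prefix:   optional literal header; if given, it is always written.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    if prefix is not None:
        if len(prefix) != offset:
            raise ValueError("prefix must be exactly `offset` bytes long")
        header = prefix  # a defined prefix always overwrites the header
    elif existing is not None and len(existing) >= offset:
        header = existing[:offset]  # no prefix defined: preserve existing header
    else:
        header = bytes(offset)  # no prefix and no usable chunk: write zeros
    return header + payload
```

Note that preserving the header requires `existing`, which is why a store that cannot seek before writing may need a read first.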
I have proposed another extension called "suffix" that is a chunk key encoding. I also note the name collision between the
Appending a suffix could be another codec. The CRC32c checksum codec could be considered as a specific form of an appending codec where the last four bytes could be ignored as they are not needed to interpret the preceding bytes. Alternatively, we could make it so that a "negative" offset refers to bytes to skip at the end of a byte sequence. In this case, we may need to rename the "prefix" parameter to "padding". To add both a header and a footer to a file we would just use two offset codecs, one with a positive offset and one with a negative offset.

A suffix would be able to better support other foreign image file formats. In particular, I was thinking about PNG files where you would need a terminal IEND chunk. Another case might be ZIP shards.

This might overlap with the sharding codec in that we are effectively defining the byte offset and nbytes for a single chunk shard. An alternative here would be to consider extensions to the sharding codec where we define the index either in an external key (another file) or have a fixed index common to all shards defined in the codec configuration. One difference from the sharding codec is that we define a byte offset from the beginning and an offset from the end rather than the size of the "inset". In composition with the sharding-indexed codec, this would also allow the shard index to not have to exist at the exact beginning or end of the file.
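The signed-offset idea can be sketched as follows. The helper names `add_offset` and `strip_offset` are hypothetical, and requiring the padding to be exactly `abs(offset)` bytes long is an assumption made for this illustration.

```python
from typing import Optional


def add_offset(payload: bytes, offset: int, padding: Optional[bytes] = None) -> bytes:
    """Encode: a positive offset prepends bytes, a negative offset appends them."""
    n = abs(offset)
    pad = bytes(n) if padding is None else padding  # default to zeros
    if len(pad) != n:
        raise ValueError("padding must be exactly abs(offset) bytes long")
    return pad + payload if offset >= 0 else payload + pad


def strip_offset(data: bytes, offset: int) -> bytes:
    """Decode: skip `offset` bytes at the start, or abs(offset) bytes at the end."""
    if abs(offset) > len(data):
        raise ValueError("data is shorter than the offset")
    return data[offset:] if offset >= 0 else data[:offset]


# Header and footer via two applications: one positive, one negative offset.
stored = add_offset(add_offset(b"payload", 4, b"HEAD"), -4, b"FOOT")
recovered = strip_offset(strip_offset(stored, -4), 4)
```

Decoding undoes the two steps in reverse order, so the footer is stripped before the header.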
Another name I just thought of for a combined prefix and suffix codec is "byte_range". We could then support syntax similar to the HTTP Range header:
Re name collision: technically it is fine since they are different namespaces, and indeed we already have a collision between the

Re preserving existing prefix/suffix content: that is outside of the existing capabilities of a codec. That would likely require changes to codec APIs in existing implementations.
I'm currently leaning towards supporting both prefix and suffix but not at the same time within one instance of the codec. In order to add ignored bytes (padding) to the beginning and end one would need to apply the codec twice as follows. |
This commit converts the existing offset codec into a more general pad codec. Changes include:
- Renamed the codecs/offset directory to codecs/pad.
- Updated schema.json to define location (start/end), nbytes, and padding parameters.
- Rewrote README.md to describe the pad codec's functionality and updated examples, including re-inserting detailed N5 explanations.
- Modified generate_tiff_header.py to use padding terminology.
This has been converted to the "pad" codec.
This pull request introduces the `pad` codec, evolved from an initial proposal for an "offset codec" as discussed in the comments. The `pad` codec is a `bytes -> bytes` codec designed to add or remove fixed-size padding at either the start or end of a chunk.

Motivation and Functionality

The `pad` codec is configured with:
- `location`: Specifies whether padding is at the `"start"` or `"end"` of the chunk.
- `nbytes`: The number of bytes of padding.
- `padding`: An optional base64-encoded string defining the literal bytes for padding. If not provided, zeros are used.

This provides a flexible mechanism for handling padding in various scenarios.

To apply padding at both the start and end of a chunk, the `pad` codec can be applied twice in the codec pipeline.

Changes in this Pull Request

This PR converts the initially proposed `offset` codec into the more general `pad` codec, incorporating feedback and expanding its capabilities. Key changes include:
- Renamed the codec from `offset` to `pad`.
- `schema.json` and `README.md` have been updated to reflect the new `pad` codec configuration and functionality.
- Examples have been updated for the `pad` codec, while retaining the detailed explanations.
- Updated `generate_tiff_header.py` for consistency with `padding` terminology.
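As an illustration of the configuration described in this PR, the following is a rough sketch of how a `pad` codec could behave. The class `PadCodec` and the helpers `encode_all` and `decode_all` are hypothetical names, not code from this PR, and the check that the decoded `padding` is exactly `nbytes` long is an assumption.

```python
import base64
from typing import List, Optional


class PadCodec:
    """Hypothetical bytes -> bytes codec that adds padding on encode and removes it on decode."""

    def __init__(self, location: str, nbytes: int, padding: Optional[str] = None):
        if location not in ("start", "end"):
            raise ValueError('location must be "start" or "end"')
        if nbytes < 0:
            raise ValueError("nbytes must be non-negative")
        self.location = location
        self.nbytes = nbytes
        # `padding` is base64-encoded; zeros are used when it is not provided.
        self.pad = bytes(nbytes) if padding is None else base64.b64decode(padding)
        if len(self.pad) != nbytes:  # assumption: padding length must equal nbytes
            raise ValueError("decoded padding must be exactly nbytes long")

    def encode(self, data: bytes) -> bytes:
        return self.pad + data if self.location == "start" else data + self.pad

    def decode(self, data: bytes) -> bytes:
        if len(data) < self.nbytes:
            raise ValueError("data is shorter than the padding")
        if self.location == "start":
            return data[self.nbytes:]
        return data[: len(data) - self.nbytes]


def encode_all(codecs: List[PadCodec], data: bytes) -> bytes:
    for codec in codecs:  # apply codecs in pipeline order
        data = codec.encode(data)
    return data


def decode_all(codecs: List[PadCodec], data: bytes) -> bytes:
    for codec in reversed(codecs):  # undo codecs in reverse order
        data = codec.decode(data)
    return data


# Padding at both ends: apply the codec twice in the pipeline.
tail = base64.b64encode(b"TAIL").decode("ascii")
pipeline = [PadCodec("start", 8), PadCodec("end", 4, padding=tail)]
stored = encode_all(pipeline, b"chunk")
restored = decode_all(pipeline, stored)
```

Here the first instance prepends eight zero bytes and the second appends the literal bytes `TAIL`; decoding removes them in the opposite order.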