Zarr version
3.1.3
Numcodecs version
0.16.3
Python Version
3.14.0
Operating System
Linux
Installation
Pip inside a conda environment
Description
The behavior I observe is: When I append items to an array that is stored in a ZipStore, I get UserWarning: Duplicate name: 'zarr.json' and the *.zip file on disk will contain duplicate entries for zarr.json.
This behavior is problematic for the following reasons:
- The *.zip viewer of my desktop environment (KDE's Ark), and possibly others, does not show me the duplicate entries for zarr.json, but only the earliest version. This has made debugging the issue quite a mystery adventure, because I did not even know that this was possible before. Only when I copied the *.zip file to a Windows machine did I even see that there are multiple entries.
- Storing all those duplicates in the *.zip file is not only epically confusing, but also a (small) waste of space.
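The duplication described above can be reproduced with Python's standard zipfile module alone, without zarr. The following is a minimal sketch, assuming an in-memory archive and made-up JSON payloads; it only illustrates that writing the same name twice appends a second entry and emits the warning, and it says nothing about how ZipStore writes metadata internally.

```python
import io
import warnings
import zipfile
from collections import Counter

buffer = io.BytesIO()  # in-memory archive, used here only for illustration

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    with zipfile.ZipFile(buffer, mode="w") as zf:
        zf.writestr("zarr.json", '{"shape": [7, 42, 42]}')
        # A second write under the same name appends a new entry;
        # the first entry stays in the archive.
        zf.writestr("zarr.json", '{"shape": [8, 42, 42]}')

duplicate_warnings = [
    str(w.message) for w in caught if issubclass(w.category, UserWarning)
]

# Count how often each name occurs in the archive's list of entries.
with zipfile.ZipFile(buffer) as zf:
    name_counts = Counter(info.filename for info in zf.infolist())

duplicates = {name: count for name, count in name_counts.items() if count > 1}
print(duplicate_warnings)
print(duplicates)
```

Counting names via infolist() is a viewer-independent way to check an archive for duplicate entries.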
The expected behavior would be that appending to an array stored in a ZipStore raises no such warning, and that the old version of zarr.json is properly deleted/overwritten, such that there really is only one single entry for that file name.
This expected behavior is better because it makes the aforementioned confusion impossible, saves space, and properly resolves the justified warning.
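Until something like this exists, one possible workaround is to compact the archive after the store has been closed. The following is a sketch under stated assumptions, not part of zarr's API: deduplicate_zip is a hypothetical helper, it copies the archive to a new file, and it assumes that the last entry stored under a name is the one to keep.

```python
import io
import warnings
import zipfile


def deduplicate_zip(src, dst):
    """Hypothetical helper: copy src to dst, keeping only the last entry per name."""
    with zipfile.ZipFile(src) as zin:
        latest = {}
        for info in zin.infolist():
            # Later entries replace earlier ones stored under the same name.
            latest[info.filename] = info
        with zipfile.ZipFile(dst, mode="w") as zout:
            for info in latest.values():
                # Opening by ZipInfo reads exactly this entry, not just any
                # entry that happens to share its name.
                with zin.open(info) as member:
                    zout.writestr(
                        info.filename,
                        member.read(),
                        compress_type=info.compress_type,
                    )


# Usage: build a small archive with a duplicated zarr.json, then compact it.
source = io.BytesIO()
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    with zipfile.ZipFile(source, mode="w") as zf:
        zf.writestr("zarr.json", "old")
        zf.writestr("c/0/0/0", b"\x00")
        zf.writestr("zarr.json", "new")

target = io.BytesIO()
deduplicate_zip(source, target)

with zipfile.ZipFile(target) as zf:
    names = zf.namelist()
    content = zf.read("zarr.json")
```

Paths on disk can be passed in place of the BytesIO objects. The copy loads each member fully into memory, which may not be suitable for very large chunks.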
This issue seems to be closely related, but not equivalent, to #129, Deltares/imod-python#1706 and Deltares/imod-python#1707. I apologize if the problem I describe here is just a subset of those, but from skimming those issues I could not reliably tell whether the people there have realized that zarr.json specifically is affected.
Steps to reproduce
# /// script
# requires-python = ">=3.14"
# dependencies = [
# "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
# "numpy"
# ]
# ///
#
# This script automatically imports the development branch of zarr to check for issues
import zipfile
import numpy
import zarr.storage
from zarr.storage import ZipStore
out_path = "test.zip"
n = 13
with ZipStore(out_path, mode='w', read_only=False, compression=zipfile.ZIP_STORED, allowZip64=False) as store:
    shape = (42, 42)
    array = zarr.create_array(store=store, shape=(7, *shape), dtype='uint8')
    # Now we want to append elements to the array one by one, which is a very very common usage pattern
    # if you simply cannot know the length of the input beforehand.
    for idx in range(n):
        # This is just some dummy data. In actual use cases, this might be image data coming from a webcam.
        data = numpy.zeros(shape, dtype=numpy.uint8)
        # Maybe we were able to pre-estimate the length of the input.
        # But estimates cannot be guaranteed to be correct.
        if idx < array.shape[0]:
            array[idx] = data
        else:
            # I think the following line is the one that raises warnings.
            array.append(data.reshape(1, *shape), axis=0)
    # This assertion always passes.
    assert array.shape == (n, *shape)
with zipfile.ZipFile(out_path, 'r') as zf:
    with zf.open('zarr.json') as metadata_file:
        # The following assertion checks if the expected array length is found in the metadata file.
        # The assertion may pass, if nondeterministically the latest of the many duplicate versions of
        # zarr.json ends up being read:
        assert f"{n}," in metadata_file.read().decode("utf-8")
    zarr_info_versions = [info for info in zf.infolist() if info.filename == 'zarr.json']
    # The following assertion is going to fail:
    assert len(zarr_info_versions) == 1
# If you manually inspect the Zip file, though, you will either see only one single entry for `zarr.json`,
# which is your archive viewer lying to you, or you will see multiple different versions of `zarr.json`.
# zarr.print_debug_info()
Additional output
No response