Resizing an array in a ZipFileStore will lead to many duplicate entries for zarr.json in the ZipFile. #3580

@gfhcs

Zarr version

3.1.3

Numcodecs version

0.16.3

Python Version

3.14.0

Operating System

Linux

Installation

Pip inside a conda environment

Description

Observed behavior: when I append items to an array that is stored in a ZipStore, I get UserWarning: Duplicate name: 'zarr.json', and the *.zip file on disk ends up containing duplicate entries for zarr.json.

This behavior is problematic for the following reasons:

  1. The *.zip viewer of my desktop environment (KDE's Ark), and possibly others, does not show the duplicate entries for zarr.json, only the earliest version. This made debugging the issue quite a mystery adventure, because I did not even know that duplicate entries in a zip file were possible.
  2. Only when I copied the *.zip file to a Windows machine did I see that there are multiple entries at all.
  3. Storing all those duplicates in the *.zip file is not only epically confusing, but also a (small) waste of space.
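Since archive viewers may hide the duplicates, a quick way to check for them is Python's own zipfile module. The following is a minimal sketch using only the standard library; it builds a tiny in-memory archive rather than using zarr, and the helper name count_entries is made up for illustration:

```python
import io
import warnings
import zipfile
from collections import Counter


def count_entries(zip_source):
    """Return a Counter mapping each member name to its number of entries."""
    with zipfile.ZipFile(zip_source, "r") as zf:
        # infolist() reports every entry in the central directory,
        # including entries that share the same name.
        return Counter(info.filename for info in zf.infolist())


# Build a small in-memory archive that writes the same name three times.
# zipfile emits "UserWarning: Duplicate name" here, which we silence.
buffer = io.BytesIO()
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    with zipfile.ZipFile(buffer, "w") as zf:
        for version in range(3):
            zf.writestr("zarr.json", f'{{"version": {version}}}')
        zf.writestr("c/0/0", b"chunk")

counts = count_entries(buffer)
duplicates = {name: k for name, k in counts.items() if k > 1}
print(duplicates)
```

Passing the path of the real *.zip file to count_entries instead of the in-memory buffer should reveal the duplicated zarr.json entries regardless of what the desktop archive viewer shows.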

The expected behavior would be that appending to the array does not trigger this warning, and that the old version of zarr.json is properly deleted or overwritten, so that the archive contains exactly one entry for that file name.

This expected behavior is better because it makes the aforementioned confusion impossible, saves space, and properly resolves the justified warning.
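Until the store handles this itself, one possible workaround is to rewrite the archive after the store has been closed, keeping only the last entry per name. This is a sketch under assumptions, not part of zarr's API: compact_zip is a hypothetical helper, it uses only the standard library, and it does not preserve per-entry timestamps.

```python
import io
import warnings
import zipfile


def compact_zip(src, dst):
    """Copy src to dst, keeping only the last entry stored under each name."""
    with zipfile.ZipFile(src, "r") as zin, zipfile.ZipFile(dst, "w") as zout:
        # Later entries overwrite earlier ones in this dict, so the last wins.
        latest = {info.filename: info for info in zin.infolist()}
        for name, info in latest.items():
            # Passing the ZipInfo object reads exactly that entry.
            data = zin.read(info)
            zout.writestr(name, data, compress_type=info.compress_type)


# Demo archive with two versions of zarr.json (duplicate-name warning silenced).
original = io.BytesIO()
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    with zipfile.ZipFile(original, "w") as zf:
        zf.writestr("zarr.json", '{"shape": [7, 42, 42]}')
        zf.writestr("zarr.json", '{"shape": [13, 42, 42]}')

compacted = io.BytesIO()
compact_zip(original, compacted)

with zipfile.ZipFile(compacted, "r") as zf:
    names = zf.namelist()
    metadata = zf.read("zarr.json").decode("utf-8")
```

With file paths instead of in-memory buffers, src and dst must be different files, since the source is read while the destination is being written.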

This issue seems to be closely related to, but not equivalent to, #129, Deltares/imod-python#1706 and Deltares/imod-python#1707. I apologize if the problem I describe here is just a subset of those, but from skimming those issues I could not reliably tell whether the people there have realized that zarr.json specifically is affected.

Steps to reproduce

# /// script
# requires-python = ">=3.14"
# dependencies = [
#   "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
#   "numpy"
# ]
# ///
#
# This script automatically imports the development branch of zarr to check for issues

import zipfile

import numpy
import zarr.storage
from zarr.storage import ZipStore

out_path = "test.zip"
n = 13
with ZipStore(out_path, mode='w', read_only=False, compression=zipfile.ZIP_STORED, allowZip64=False) as store:

    shape = (42, 42)

    array = zarr.create_array(store=store, shape=(7, *shape), dtype='uint8')

    # Now we want to append elements to the array one by one, which is a very very common usage pattern
    # if you simply cannot know the length of the input beforehand.
    for idx in range(n):

        # This is just some dummy data. In actual use cases, this might be image data coming from a webcam.
        data = numpy.zeros(shape, dtype=numpy.uint8)

        # Maybe we were able to pre-estimate the length of the input.
        # But estimates cannot be guaranteed to be correct.
        if idx < array.shape[0]:
            array[idx] = data
        else:
            # I think the following line is the one that raises warnings.
            array.append(data.reshape(1, *shape), axis=0)

    # This assertion always passes.
    assert array.shape == (n, *shape)

with zipfile.ZipFile(out_path, 'r') as zf:
    with zf.open('zarr.json') as metadata_file:
        # The following assertion checks whether the expected array length is found in the metadata file.
        # It passes when the latest of the many duplicate versions of zarr.json is the one read;
        # Python's zipfile resolves a name to the last entry stored under that name:
        assert f"{n}," in metadata_file.read().decode("utf-8")

    zarr_info_versions = [info for info in zf.infolist() if info.filename == 'zarr.json']

    # The following assertion is going to fail:
    assert len(zarr_info_versions) == 1

# If you manually inspect the Zip file, though, you will either see only one single entry for `zarr.json`,
# which is your archive viewer lying to you, or you will see multiple different versions of `zarr.json`.
# zarr.print_debug_info()

Additional output

No response

Labels: bug
