Building indices removes user defined metadata #489

Open
pavankumar-jamanjyothi-by wants to merge 2 commits into master from building-indices-removes-user-defined-metadata

@pavankumar-jamanjyothi-by

Description:

When building indices for an existing dataset via build_dataset_indices methods, user-defined metadata (the "metadata" key in the deserialized by-metadata.json file) is removed. The reason is that build_dataset_indices functions pass load_dataset_metadata=False to the DatasetFactory, e.g. here:
https://github.com/JDASoftwareGroup/kartothek/blob/master/kartothek/io/eager.py#L817-L822
This has the effect of actively removing user-defined metadata:
https://github.com/JDASoftwareGroup/kartothek/blob/master/kartothek/core/factory.py#L99-L100
so no metadata is written when the by-metadata.json file is written in the end.

The fix is to pass load_dataset_metadata=True to the DatasetFactory, so that the user-defined metadata is loaded and written back unchanged.
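The effect described above can be illustrated with a minimal, self-contained sketch. It does not use kartothek itself: `STORED`, `load`, and `build_indices` are hypothetical stand-ins that only mimic the described behavior (a loader that discards the "metadata" key unless load_dataset_metadata is True, followed by a write-back of whatever was loaded). The index file naming is likewise an assumption made for illustration.

```python
import json

# Hypothetical stand-in for the stored dataset metadata file (not kartothek code).
STORED = {
    "dataset_uuid": "my_dataset",
    "metadata": {"owner": "team-a"},  # user-defined metadata
    "indices": {},
}


def load(stored, load_dataset_metadata):
    """Mimic a loader that drops user-defined metadata unless asked to keep it."""
    loaded = json.loads(json.dumps(stored))  # deep copy via a JSON round trip
    if not load_dataset_metadata:
        loaded["metadata"] = {}
    return loaded


def build_indices(stored, columns, load_dataset_metadata):
    """Load the dataset, register index entries, and return what would be written back."""
    dataset = load(stored, load_dataset_metadata)
    for column in columns:
        # Hypothetical index file name, for illustration only.
        dataset["indices"][column] = f"{column}.index.parquet"
    return dataset


# Before the fix: metadata is discarded on load, so it is missing on write-back.
before_fix = build_indices(STORED, ["p"], load_dataset_metadata=False)

# After the fix: metadata is loaded and therefore survives the write-back.
after_fix = build_indices(STORED, ["p"], load_dataset_metadata=True)
```

In this sketch, `before_fix["metadata"]` is empty while `after_fix["metadata"]` still carries the user-defined entry; both contain the new index.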

Contributor

@steffen-schroeder-by steffen-schroeder-by left a comment

Thanks, all in all this looks good to me. There are two things I'd like to see in addition:

  1. This is worth an entry in the changelog.
  2. We should understand why load_dataset_metadata was always set to False, and what implications setting it to True by default has (functionality, performance, ...). Maybe @fjetter has an idea.

dataset_uuid=dataset_uuid,
store=store_factory,
factory=factory,
load_dataset_metadata=False,
Author

@jochen-ott-by As this is gc, I think we should set it to False here.

I think it does not really matter and we can use load_dataset_metadata=True everywhere.

dataset_uuid=dataset_uuid,
store=store_factory,
factory=factory,
load_dataset_metadata=False,

I think it does not really matter and we can use load_dataset_metadata=True everywhere.

Comment thread kartothek/io/testing/index.py Outdated
{"label": "cluster_1", "data": [("core", pd.DataFrame({"p": [1, 2]}))]},
{"label": "cluster_2", "data": [("core", pd.DataFrame({"p": [2, 3]}))]},
]
with freeze_time(TIME_TO_FREEZE_ISO):

IIRC, we had some issues with the freeze_time approach in the past, which is why almost no test nowadays uses it. I think this test can be re-written without using freeze_time, simply by not checking a value for metadata["creation_time"]. This would not only drop the dependency on freezegun here, but also make the test clearer.
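One way to write such a test without freeze_time is sketched below. It is only an illustration: the helper `without_volatile_keys` and the sample dictionaries are hypothetical and not taken from the kartothek test suite. The idea is to check that the time-dependent key exists, then compare everything else.

```python
def without_volatile_keys(metadata, volatile=("creation_time",)):
    """Return a copy of the metadata without time-dependent keys."""
    return {key: value for key, value in metadata.items() if key not in volatile}


# Hypothetical expected user-defined metadata.
expected = {"dataset": "uuid", "owner": "team-a"}

# Hypothetical metadata as read back after building indices; the timestamp
# differs on every run, so its value is deliberately not compared.
actual = {
    "dataset": "uuid",
    "owner": "team-a",
    "creation_time": "2021-09-01T09:37:00",
}

# Only assert that the key is present, not what value it holds.
assert "creation_time" in actual
assert without_volatile_keys(actual) == expected
```

This keeps the assertion focused on the user-defined metadata that the change is meant to preserve.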

Author

Makes sense. Removed freeze_time and pushed the changes.

@pavankumar-jamanjyothi-by pavankumar-jamanjyothi-by force-pushed the building-indices-removes-user-defined-metadata branch from 1c8ed8b to 0618558 Compare September 1, 2021 09:37
@pavankumar-jamanjyothi-by pavankumar-jamanjyothi-by requested review from a user and steffen-schroeder-by and removed request for florian-jetter-by September 1, 2021 09:43
@johan-olsson-by johan-olsson-by removed their request for review April 20, 2022 11:47