Building indices removes user defined metadata#489
Building indices removes user defined metadata#489pavankumar-jamanjyothi-by wants to merge 2 commits into
Conversation
steffen-schroeder-by
left a comment
There was a problem hiding this comment.
Thanks, all in all, looks good to me. There are 2 things, I'd like to see in addition:
- this is worth an entry in the changelog
- we should understand why
load_dataset_metadatawas always set toFalseand what implication it has to set it toTrueas default now. (Functionality/Performance/...). Maybe @fjetter has an idea.
| dataset_uuid=dataset_uuid, | ||
| store=store_factory, | ||
| factory=factory, | ||
| load_dataset_metadata=False, |
There was a problem hiding this comment.
@jochen-ott-by As this is gc, I think we should set it to False here.
There was a problem hiding this comment.
I think it does not really matter and we can use load_dataset_meadata=True everywhere.
| dataset_uuid=dataset_uuid, | ||
| store=store_factory, | ||
| factory=factory, | ||
| load_dataset_metadata=False, |
There was a problem hiding this comment.
I think it does not really matter and we can use load_dataset_meadata=True everywhere.
| {"label": "cluster_1", "data": [("core", pd.DataFrame({"p": [1, 2]}))]}, | ||
| {"label": "cluster_2", "data": [("core", pd.DataFrame({"p": [2, 3]}))]}, | ||
| ] | ||
| with freeze_time(TIME_TO_FREEZE_ISO): |
There was a problem hiding this comment.
IIRC, we had some issues with the freeze_time approach in the past, which is why almost no test nowadays uses it. I think this test can be re-written without using freeze_time, simply by not checking a value for metadata["creation_time"]. This would not only drop the dependency on freezegun here, but also make the test clearer.
There was a problem hiding this comment.
Makes sense. Removed freeze_time and pushed the changes.
1c8ed8b to
0618558
Compare
Description:
When building indices for an existing dataset via build_dataset_indices methods, user-defined metadata (the "metadata" key in the deserialized by-metadata.json file) is removed. The reason is that build_dataset_indices functions pass
load_dataset_metadata=Falseto the DatasetFactory, e.g. here:https://github.com/JDASoftwareGroup/kartothek/blob/master/kartothek/io/eager.py#L817-L822
This has the effect of actively removing user-defined metadata:
https://github.com/JDASoftwareGroup/kartothek/blob/master/kartothek/core/factory.py#L99-L100
so no metadata is written when the by-metadata.json file is written in the end.
Fix is to pass
load_dataset_metadata=Trueto the DatasetFactory.