-
Notifications
You must be signed in to change notification settings - Fork 72
Add support for dataframe and arrow table in add_entities #907
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #907 +/- ##
==========================================
+ Coverage 87.61% 90.78% +3.17%
==========================================
Files 169 177 +8
Lines 11931 12421 +490
==========================================
+ Hits 10453 11277 +824
+ Misses 1478 1144 -334 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
mdekstrand
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work! There are some subtle edge cases for different Pandas index layouts we need to handle, and documentation to add, but otherwise it looks sound.
| assert ds.entities("item").attribute("title").is_scalar | ||
| assert ds.entities("item").attribute("genres").is_list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should test that a few item IDs have the correct titles, too. It's possible for the code to set up the structures in the right format, but not align them correctly, and the tests should check for that.
| assert ds.entities("item").attribute("title").is_scalar | ||
| assert ds.entities("item").attribute("genres").is_list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
| if isinstance(source, pa.Table): # pragma: nocover | ||
| raise NotImplementedError() | ||
| if isinstance(source, pd.DataFrame): | ||
| source = pa.Table.from_pandas(source) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an interesting and challenging edge case here, that we need to clearly document and/or design for.
Right now, this works because your test case names the index item_id, which then turns into a column when we do from_pandas.
However, if the client provides code that has no item_id column, and has an index with a different name, we need to figure out what to do. Do we want to use the index? Do we want to throw an error?
I think we probably want to use the Pandas index, with the following logic:
- If the data frame has a column named
{cls}_id, use that column as the entity IDs, and ignore the index. - Otherwise, assume the index has entity IDs.
Implementing this logic will require this line to be a little more aware of the Pandas data frames, and also require tests for each of the different conditions. Importantly, for case (1), this line here will create a new attribute called index (or whatever the index name is), and we don't want to include that.
The input cases we will need to test for correct behavior with:
- Input has an index named
{cls}_id(the current test) - Input has an index named something else, and no column named
{cls}_id - Input has a column named
{cls}_id
This isn't a problem for PyArrow input, because Arrow tables do not have indices.
| self.add_entities(cls, entity_col) | ||
|
|
||
| for col_name in source.column_names: | ||
| if not col_name.endswith("_id"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should only exclude the {cls}_id column — we want any other _id columns to result in an error, not be silently ignored.
| .. note:: | ||
| Data frame support will be added in a future version. | ||
| duplicates: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should update the docstring to document the kinds of attributes supported, limitations, etc., along with the index logic.
This should be in the main body of the docstring (before Args:), not in the argument documentation, for readability.
Allows client to call add_entities with either dataframe or arrow table without having to call individual add_attributes for all of the data columns.