Skip to content

Conversation

@FroggoLight
Copy link
Contributor

Allows client to call add_entities with either dataframe or arrow table without having to call individual add_attributes for all of the data columns.

@codecov
Copy link

codecov bot commented Oct 26, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.78%. Comparing base (d0c50a7) to head (596d1f1).
⚠️ Report is 169 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #907      +/-   ##
==========================================
+ Coverage   87.61%   90.78%   +3.17%     
==========================================
  Files         169      177       +8     
  Lines       11931    12421     +490     
==========================================
+ Hits        10453    11277     +824     
+ Misses       1478     1144     -334     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Member

@mdekstrand mdekstrand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work! There are some subtle edge cases for different Pandas index layouts we need to handle, and documentation to add, but otherwise it looks sound.

Comment on lines +163 to +164
assert ds.entities("item").attribute("title").is_scalar
assert ds.entities("item").attribute("genres").is_list
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should test that a few item IDs have the correct titles, too. It's possible for the code to set up the structures in the right format, but not align them correctly, and the tests should check for that.

Comment on lines +181 to +182
assert ds.entities("item").attribute("title").is_scalar
assert ds.entities("item").attribute("genres").is_list
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

if isinstance(source, pa.Table): # pragma: nocover
raise NotImplementedError()
if isinstance(source, pd.DataFrame):
source = pa.Table.from_pandas(source)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an interesting and challenging edge case here, that we need to clearly document and/or design for.

Right now, this works because your test case names the index item_id, which then turns into a column when we do from_pandas.

However, if the client provides code that has no item_id column, and has an index with a different name, we need to figure out what to do. Do we want to use the index? Do we want to throw an error?

I think we probably want to use the Pandas index, with the following logic:

  1. If the data frame has a column named {cls}_id, use that column as the entity IDs, and ignore the index.
  2. Otherwise, assume the index has entity IDs.

Implementing this logic will require this line to be a little more aware of the Pandas data frames, and also require tests for each of the different conditions. Importantly, for case (1), this line here will create a new attribute called index (or whatever the index name is), and we don't want to include that.

The input cases we will need to test for correct behavior with:

  1. Input has an index named {cls}_id (the current test)
  2. Input has an index named something else, and no column named {cls}_id
  3. Input has a column named {cls}_id

This isn't a problem for PyArrow input, because Arrow tables do not have indices.

self.add_entities(cls, entity_col)

for col_name in source.column_names:
if not col_name.endswith("_id"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should only exclude the {cls}_id column — we want any other _id columns to result in an error, not be silently ignored.

.. note::
Data frame support will be added in a future version.
duplicates:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should update the docstring to document the kinds of attributes supported, limitations, etc., along with the index logic.

This should be in the main body of the docstring (before Args:), not in the argument documentation, for readability.

@mdekstrand mdekstrand linked an issue Oct 28, 2025 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add support for attributes to add_entities

2 participants