Add support for dataframe and arrow table in add_entities #907

FroggoLight · 2025-10-26T05:12:13Z

Allows clients to call add_entities with either a data frame or an Arrow table, instead of calling add_attributes separately for each data column.

codecov · 2025-10-26T05:39:28Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.78%. Comparing base (d0c50a7) to head (596d1f1).
⚠️ Report is 169 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #907      +/-   ##
==========================================
+ Coverage   87.61%   90.78%   +3.17%     
==========================================
  Files         169      177       +8     
  Lines       11931    12421     +490     
==========================================
+ Hits        10453    11277     +824     
+ Misses       1478     1144     -334

mdekstrand

Good work! There are some subtle edge cases for different Pandas index layouts we need to handle, and documentation to add, but otherwise it looks sound.

mdekstrand · 2025-10-27T14:45:57Z

tests/data/test_builder_entities.py

+    assert ds.entities("item").attribute("title").is_scalar
+    assert ds.entities("item").attribute("genres").is_list


We should test that a few item IDs have the correct titles, too. It's possible for the code to set up the structures in the right format, but not align them correctly, and the tests should check for that.
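A minimal sketch of the kind of spot check meant here, written against plain Pandas so it does not depend on any particular dataset accessor. The helper `misaligned_ids` and the `good` / `bad` series are hypothetical stand-ins for whatever the built dataset returns for the `title` attribute, indexed by entity ID.

```python
import pandas as pd

# Source data as a client might pass it in.
source = pd.DataFrame(
    {
        "item_id": [10, 20, 30],
        "title": ["Alien", "Brazil", "Casablanca"],
    }
)


def misaligned_ids(source, stored, id_col, attr, sample_ids):
    """Return the sampled IDs whose stored value differs from the source."""
    expected = source.set_index(id_col)[attr]
    return [i for i in sample_ids if stored.loc[i] != expected.loc[i]]


# Hypothetical stand-ins for the stored attribute, indexed by entity ID.
good = pd.Series(["Alien", "Brazil", "Casablanca"], index=[10, 20, 30])
# Same values and dtype, but attached to the wrong IDs.
bad = pd.Series(["Casablanca", "Alien", "Brazil"], index=[10, 20, 30])

# A purely structural check cannot tell the two apart...
assert good.dtype == bad.dtype and len(good) == len(bad)

# ...but spot-checking a few IDs can.
print(misaligned_ids(source, good, "item_id", "title", [10, 30]))  # []
print(misaligned_ids(source, bad, "item_id", "title", [10, 30]))  # [10, 30]
```

In the real test, the same comparison would be made against the values read back from the dataset for two or three known item IDs.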

mdekstrand · 2025-10-27T14:46:05Z

tests/data/test_builder_entities.py

+    assert ds.entities("item").attribute("title").is_scalar
+    assert ds.entities("item").attribute("genres").is_list


Same as above.

mdekstrand · 2025-10-27T14:51:00Z

src/lenskit/data/builder.py

-        if isinstance(source, pa.Table):  # pragma: nocover
-            raise NotImplementedError()
+        if isinstance(source, pd.DataFrame):
+            source = pa.Table.from_pandas(source)


There is an interesting and challenging edge case here, that we need to clearly document and/or design for.

Right now, this works because your test case names the index item_id, which then turns into a column when we do from_pandas.

However, if the client provides a data frame that has no item_id column, and has an index with a different name, we need to figure out what to do. Do we want to use the index? Do we want to throw an error?

I think we probably want to use the Pandas index, with the following logic:

1. If the data frame has a column named {cls}_id, use that column as the entity IDs, and ignore the index.

2. Otherwise, assume the index has entity IDs.

Implementing this logic will require this line to be a little more aware of the Pandas data frames, and also require tests for each of the different conditions. Importantly, for case (1), this line here will create a new attribute called index (or whatever the index name is), and we don't want to include that.

The input cases we will need to test for correct behavior with:

- Input has an index named {cls}_id (the current test)

- Input has an index named something else, and no column named {cls}_id

- Input has a column named {cls}_id

This isn't a problem for PyArrow input, because Arrow tables do not have indices.

mdekstrand · 2025-10-27T14:52:05Z

src/lenskit/data/builder.py

+            self.add_entities(cls, entity_col)
+
+            for col_name in source.column_names:
+                if not col_name.endswith("_id"):


We should only exclude the {cls}_id column — we want any other _id columns to result in an error, not be silently ignored.
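A small sketch of the stricter filter, in plain Python. The helper name `attribute_columns` is hypothetical, and raising eagerly here is an assumption: the error could equally be left to whatever the attribute-adding call does with an unexpected `_id` column.

```python
def attribute_columns(cls: str, column_names: list[str]) -> list[str]:
    """Return the columns to add as attributes for entity class `cls`.

    Only `{cls}_id` is skipped. Any other `*_id` column is reported
    instead of being silently dropped (assumed behavior for this sketch).
    """
    id_col = f"{cls}_id"
    attrs = [name for name in column_names if name != id_col]
    stray = [name for name in attrs if name.endswith("_id")]
    if stray:
        raise ValueError(f"unexpected ID columns for {cls!r}: {stray}")
    return attrs


print(attribute_columns("item", ["item_id", "title", "genres"]))

try:
    attribute_columns("item", ["item_id", "user_id", "title"])
except ValueError as err:
    print("rejected:", err)
```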

mdekstrand · 2025-10-27T14:53:39Z

src/lenskit/data/builder.py


                .. note::
                    Data frame support will be added in a future version.
            duplicates:


We should update the docstring to document the kinds of attributes supported, limitations, etc., along with the index logic.

This should be in the main body of the docstring (before Args:), not in the argument documentation, for readability.
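For illustration, the layout could look roughly like the stub below. The function name `add_entities_sketch` and all of the wording are placeholders, not the library's documentation; the index rule is the one proposed earlier in this review, and the parameter descriptions are assumptions.

```python
import inspect


def add_entities_sketch(cls, source, duplicates):
    """Add entities of class ``cls``, optionally with attributes.

    Placeholder wording only; not confirmed library behavior.

    ``source`` may be a Pandas data frame or a PyArrow table. Every
    column other than ``{cls}_id`` is added as an attribute.

    For data frames, entity IDs are located as follows:

    1. If there is a column named ``{cls}_id``, it supplies the IDs
       and the index is ignored.
    2. Otherwise, the index is assumed to hold the IDs.

    Arrow tables have no index, so this rule does not apply to them.

    Args:
        cls: The entity class name, such as ``"item"``.
        source: The IDs, data frame, or table to add.
        duplicates: (existing description unchanged)
    """
    raise NotImplementedError("documentation layout sketch only")


# The behavioral description sits in the body, ahead of the Args section.
print(inspect.getdoc(add_entities_sketch))
```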

Add support for dataframe and table in add_entities

596d1f1

mdekstrand requested changes Oct 27, 2025

mdekstrand linked an issue Oct 28, 2025 that may be closed by this pull request

Add support for attributes to add_entities #843

Open
