DRAFT OF A DATAFRAME API#716

Draft
CarloMariaProietti wants to merge 6 commits intoapache:mainfrom
CarloMariaProietti:df_api_draft

Conversation


@CarloMariaProietti CarloMariaProietti commented Mar 11, 2026

Hi everyone.

The aim of this draft is not to provide a ready-to-use toy DF API; rather, the goal is to share my ideas about the API through code and comments that (hopefully) help communicate the core ideas.

THE PROJECT IN BRIEF, RATIONALE AND IMPACT
Currently, Wayang users leverage the DataQuanta abstraction to build execution plans in an object-oriented style: users write strongly-typed lambda functions that operate on typed objects. While this approach guarantees compile-time type safety, it reflects a paradigm that has largely been superseded in most use cases. The current state of the art is the DataFrame paradigm: instead of writing opaque functions that operate on typed objects, users employ SQL-like declarative expressions that operate on tables with a schema. This gives the executor engine enough information to perform advanced optimizations, such as filtering and selecting data at the source.

Wayang will provide the DataFrame abstraction by leveraging the existing DataQuanta class. Specifically, a Wayang DataFrame will be a wrapper around a DataQuanta[Row], where the Row object provides a schema-aware container for data. The API will expose standard data science operations such as projection, filtering, and aggregation.

Deliverables: a DataFrame API within Apache Wayang. This new API will let data scientists and engineers write execution plans in a DataFrame style, with each part of their data pipeline executed by the engine best suited for the job.
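As a rough, self-contained sketch of the wrapper idea (all names here are illustrative; in the actual draft the wrapper would delegate to a DataQuanta[Row] rather than a plain list):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: a schema-aware Row and a DataFrame wrapper.
// A plain List<Row> stands in for DataQuanta<Row> so the sketch runs alone.
record Schema(List<String> columns, Map<String, Class<?>> types) {
    int indexOf(String column) { return columns.indexOf(column); }
}

record Row(List<Object> values) {
    Object get(Schema schema, String column) {
        return values().get(schema.indexOf(column));
    }
}

class DataFrame {
    final Schema schema;
    final List<Row> rows; // stand-in for DataQuanta<Row>

    DataFrame(Schema schema, List<Row> rows) {
        this.schema = schema;
        this.rows = rows;
    }

    // Projection: keep only the named columns, building a narrowed schema.
    DataFrame select(String... columns) {
        List<String> kept = List.of(columns);
        Map<String, Class<?>> keptTypes = new LinkedHashMap<>();
        for (String c : kept) keptTypes.put(c, schema.types().get(c));
        List<Row> out = new ArrayList<>();
        for (Row r : rows) {
            List<Object> vals = new ArrayList<>();
            for (String c : kept) vals.add(r.get(schema, c));
            out.add(new Row(vals));
        }
        return new DataFrame(new Schema(kept, keptTypes), out);
    }

    // Filtering with a plain predicate; schema is unchanged.
    DataFrame filter(Predicate<Row> predicate) {
        List<Row> out = new ArrayList<>();
        for (Row r : rows) if (predicate.test(r)) out.add(r);
        return new DataFrame(schema, out);
    }
}
```

The point of the sketch is only the shape of the API: operations consume and produce schema-carrying DataFrames, so the engine can reason about columns instead of opaque objects.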

I hope to get some feedback to improve my proposal.

@CarloMariaProietti CarloMariaProietti marked this pull request as draft March 11, 2026 10:13
@CarloMariaProietti CarloMariaProietti changed the title DRAFT FOR A DATAFRAME API DRAFT OF A DATAFRAME API Mar 11, 2026
@AjayBoddeda4

Hi Carlo, this is a great draft! I am Ajay Boddeda, a GSoC 2026 applicant interested in the DataFrames API project. I have hands-on experience with PySpark DataFrames and I am very excited about this direction. I especially like the Row + Schema abstraction idea. Would love to contribute to this and build on your work!

@AjayBoddeda4

Hi Carlo, I have been studying your draft code carefully. The Row class with a list of generic objects and the Schema mapping column names to types is a clean design. I have one question — for the filter() operation, are you planning to use expression-based filtering like Spark's Column expressions, or a simpler predicate approach first? I ask because in PySpark I use df.filter(df.age > 21) daily in production and I'm thinking about how to map that cleanly to Wayang's execution plan.

@AjayBoddeda4

Hi Carlo, I noticed you pushed a new refining commit after our discussion — exciting to see the draft evolving! I cloned the Wayang repository locally and have been studying the wayang-api-scala-java structure to understand where the DataFrame API would best fit. Looking forward to seeing the updated design!

@AjayBoddeda4

Hi Carlo, I studied the new commits carefully — this is excellent progress! I noticed you used Java Records for both Row and Schema which is exactly the direction I suggested on issue #514. The SparkSelectOperator using Dataset[Row] with functions::col is a clean implementation.
Looking at SparkSelectOperator, I see getSupportedInputChannels and getSupportedOutputChannels return empty lists — would DatasetChannel descriptors be the right choice here to keep execution within the Dataset world and avoid RDD conversions? This connects to issue #362 about DataFrameChannel that I was studying.

@CarloMariaProietti
Author

Hi Carlo, I have been studying your draft code carefully. The Row class with a list of generic objects and the Schema mapping column names to types is a clean design. I have one question — for the filter() operation, are you planning to use expression-based filtering like Spark's Column expressions, or a simpler predicate approach first? I ask because in PySpark I use df.filter(df.age > 21) daily in production and I'm thinking about how to map that cleanly to Wayang's execution plan.

Hi Ajay, I am glad you find the design clean. To address your question, I suggest you look at SelectOperator's comment; there you can find an exhaustive explanation. If you still have doubts, please let me know.
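To illustrate the trade-off behind the question (this is a hypothetical sketch, not the actual SelectOperator code; the names Gt, eval, and toSql are invented for illustration): an opaque `Predicate<Row>` hides its logic from the optimizer, while a small expression tree like `df.age > 21` represented as data can be both evaluated and inspected, e.g. pushed down to the source.

```java
import java.util.Map;

class ExprSketch {
    // "age > 21" as data the optimizer can inspect, instead of an
    // opaque lambda. Gt is a hypothetical node, not real Wayang API.
    record Gt(String column, int literal) {}

    // The engine can evaluate the expression against a row...
    static boolean eval(Gt e, Map<String, Object> row) {
        return ((Integer) row.get(e.column())) > e.literal();
    }

    // ...and it can also translate the same expression, e.g. into a
    // source-side SQL-like predicate, which a lambda cannot offer.
    static String toSql(Gt e) {
        return e.column() + " > " + e.literal();
    }
}
```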


CarloMariaProietti commented Mar 20, 2026

Hi Carlo, I studied the new commits carefully — this is excellent progress! I noticed you used Java Records for both Row and Schema which is exactly the direction I suggested on issue #514. The SparkSelectOperator using Dataset[Row] with functions::col is a clean implementation. Looking at SparkSelectOperator, I see getSupportedInputChannels and getSupportedOutputChannels return empty lists — would DatasetChannel descriptors be the right choice here to keep execution within the Dataset world and avoid RDD conversions? This connects to issue #362 about DataFrameChannel that I was studying.

Hi, I am glad that you also think that using record classes might be a good choice.
Imo you are right that the execution 'should be kept in the Dataset world': Dataset<Row> is exactly the Spark implementation of the DataFrame abstraction, which is what the new Wayang API should provide. In this logic, DatasetChannel descriptors seem to be the right choice for both input and output; however, it may also make sense to allow RDD conversions in order to have more flexibility (see SparkParquetOperator).

@AjayBoddeda4

Hi Carlo, thank you for the detailed responses! I read SelectOperator's comment carefully — the explanation about untyped expressions vs UDFs is very clear and makes perfect sense for the DataFrame abstraction.
Regarding DatasetChannel — I agree that allowing RDD conversions as a fallback gives flexibility while keeping the happy path within Dataset world. I will study SparkParquetOperator to understand how they handle that balance.
Looking forward to contributing to this project through GSoC 2026!


CarloMariaProietti commented Mar 21, 2026

Looking forward to contributing to this project through GSoC 2026!

Hi Ajay, reading the last part of your message, I fear there is a misunderstanding. I am not a mentor for the project; instead, I am also a GSoC applicant interested in the project :)

@AjayBoddeda4

Hi Carlo, thank you for clarifying! That actually makes our discussion even more interesting — it's great to connect with another applicant who is equally passionate about this project. Your draft has been really helpful in understanding the design direction. Looking forward to seeing how this project evolves. Best of luck with your proposal!

