Conversation
Hi Carlo, this is a great draft! I am Ajay Boddeda, a GSoC 2026 applicant interested in the DataFrames API project. I have hands-on experience with PySpark DataFrames and I am very excited about this direction. I especially like the Row + Schema abstraction idea. Would love to contribute to this and build on your work!
Hi Carlo, I have been studying your draft code carefully. The Row class with a list of generic objects and the Schema mapping column names to types is a clean design. I have one question — for the filter() operation, are you planning to use expression-based filtering like Spark's Column expressions, or a simpler predicate approach first? I ask because in PySpark I use df.filter(df.age > 21) daily in production and I'm thinking about how to map that cleanly to Wayang's execution plan.
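To make the two filter designs in this question concrete, here is a minimal, self-contained sketch (my own illustration, not code from the draft). The names FilterSketch, GreaterThan, and the Row/Schema shapes are assumptions for the example; a plain List<Row> stands in for Wayang's execution plan.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch: two ways a DataFrame filter() could accept a condition.
public class FilterSketch {

    // A row is an ordered list of generic values; the schema maps column names to positions.
    record Row(List<Object> values) {}
    record Schema(Map<String, Integer> columnIndex) {}

    // Option 1: predicate-based. Trivial to implement, but the condition is an
    // opaque function, so the optimizer cannot inspect or push it down.
    static List<Row> filterWithPredicate(List<Row> rows, Predicate<Row> p) {
        return rows.stream().filter(p).collect(Collectors.toList());
    }

    // Option 2: expression-based. The condition is data (column, op, literal),
    // so the engine can inspect it and, e.g., push the filter to the source.
    record GreaterThan(String column, int literal) {}

    static List<Row> filterWithExpression(List<Row> rows, Schema schema, GreaterThan expr) {
        int idx = schema.columnIndex().get(expr.column());
        return rows.stream()
                .filter(r -> (Integer) r.values().get(idx) > expr.literal())
                .collect(Collectors.toList());
    }
}
```

Here `filterWithExpression(rows, schema, new GreaterThan("age", 21))` would mirror PySpark's `df.filter(df.age > 21)`, while the predicate variant mirrors the current DataQuanta lambda style.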
Hi Carlo, I noticed you pushed a new refining commit after our discussion — exciting to see the draft evolving! I cloned the Wayang repository locally and have been studying the wayang-api-scala-java structure to understand where the DataFrame API would best fit. Looking forward to seeing the updated design!
Hi Carlo, I studied the new commits carefully — this is excellent progress! I noticed you used Java Records for both Row and Schema which is exactly the direction I suggested on issue #514. The SparkSelectOperator using Dataset[Row] with functions::col is a clean implementation. |
Hi Ajay, I am glad you find the design clean. To address your question, I suggest you look at SelectOperator's comment, where you can find an exhaustive explanation; if you still have doubts, please let me know.
Hi, I am glad that you also think that using record classes might be a good choice.
Hi Carlo, thank you for the detailed responses! I read SelectOperator's comment carefully — the explanation about untyped expressions vs UDFs is very clear and makes perfect sense for the DataFrame abstraction. |
Hi Ajay, reading the last part of your message, I fear there is a misunderstanding. I am not a mentor for the project; instead, I am also a GSoC applicant interested in the project :) |
Hi Carlo, thank you for clarifying! That actually makes our discussion even more interesting — it's great to connect with another applicant who is equally passionate about this project. Your draft has been really helpful in understanding the design direction. Looking forward to seeing how this project evolves. Best of luck with your proposal!
Hi everyone.
The aim of this draft is not to provide a ready-to-use toy DF API; rather, the goal is to share my ideas about the API through code and comments that (hopefully) help communicate the core design.
THE PROJECT IN BRIEF, RATIONALE AND IMPACT
Currently, Wayang users leverage the DataQuanta abstraction to build execution plans in an object-oriented style: they write strongly-typed lambda functions that operate on typed objects. While this approach guarantees compile-time type safety, it reflects a paradigm that has largely been superseded for most use cases. The current state of the art is the DataFrame paradigm. Instead of writing opaque functions that operate on typed objects, users employ SQL-like declarative expressions that operate on tables with a schema; this gives the executor engine enough information to perform advanced optimizations such as filtering and selecting data at the source.

Wayang will provide the DataFrame abstraction by leveraging the existing DataQuanta class. Specifically, a Wayang DataFrame will be a wrapper around a DataQuanta[Row], where the Row object provides a schema-aware container for data. The API will expose standard data science operations such as projection, filtering, and aggregation.

Deliverables: a DataFrame API within Apache Wayang. Data scientists and engineers will be able to write execution plans in a DataFrame style while having each part of their data pipeline executed by the engine that is best suited for the job.
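To make the Row + Schema + wrapper idea concrete, here is a minimal, self-contained sketch (my own illustration, not code from the draft). The names DataFrameSketch and indexOf are assumptions, and a plain List<Row> stands in for DataQuanta<Row> so the example runs on its own.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical, self-contained sketch of the proposed abstraction.
// In Wayang the DataFrame would wrap a DataQuanta<Row>; here a plain
// List<Row> plays that role so the example is runnable in isolation.
public class DataFrameSketch {

    // A schema-aware container: ordered values plus a schema describing them.
    record Row(List<Object> values) {}

    // Schema maps each column name to its position and declared type.
    record Schema(List<String> names, List<Class<?>> types) {
        int indexOf(String name) { return names.indexOf(name); }
    }

    record DataFrame(Schema schema, List<Row> rows) {

        // Projection: keep only the named columns (SQL SELECT).
        DataFrame select(String... columns) {
            List<Integer> idx = java.util.Arrays.stream(columns)
                    .map(schema::indexOf)
                    .collect(Collectors.toList());
            Schema newSchema = new Schema(
                    List.of(columns),
                    idx.stream().map(i -> schema.types().get(i)).collect(Collectors.toList()));
            List<Row> newRows = rows.stream()
                    .map(r -> new Row(idx.stream().map(i -> r.values().get(i)).collect(Collectors.toList())))
                    .collect(Collectors.toList());
            return new DataFrame(newSchema, newRows);
        }
    }
}
```

Because select() receives column names rather than an opaque lambda, a real implementation could translate it into a schema-aware Wayang operator (and, on the Spark platform, into something like Dataset column pruning), which is exactly the optimization opportunity the DataFrame paradigm enables.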
I hope to get some feedback to improve my proposal.