@Imbruced
Member

@Imbruced Imbruced commented Dec 15, 2025

Thanks to SedonaDB's great interoperability with Arrow, we could modify the Apache Spark vectorized UDFs to use SedonaDB for spatial attributes. We can start with scalar functions and later add more complex features such as table functions. One obstacle is converting the Apache Sedona internal serialization format to WKB, which SedonaDB uses. This ticket aims to add a scalar function that casts the Sedona internal serde format to WKB, following the diagram below:

image

@Imbruced Imbruced force-pushed the add-sedona-deserializer branch from 493e070 to d36bf02 Compare December 16, 2025 16:24
@jiayuasu
Member

@Imbruced Hey Pawel, do you mind adding a bit of explanation on what this PR is about?

@Imbruced
Member Author

@jiayuasu, yeah, it's a WIP MR in draft mode, so it's not ready yet. I am verifying which steps are tested for an MR in SedonaDB.

Basically, the idea is to run SedonaDB inside the Sedona vectorized UDFs:

image

So I need to transform the Sedona-serialized geometries to WKB, which is the input format for SedonaDB.

It would be better to keep it in SedonaDB rather than in Sedona (at least that's my thinking), but I am open to any suggestions. Another way would be to call the WKB functions on the Sedona side, but I am not sure if that's doable. I would like to have an internal function that is not exposed to users and is only used by the vectorized UDF worker.

@Imbruced Imbruced force-pushed the add-sedona-deserializer branch from 409bf54 to 82fa010 Compare December 16, 2025 21:22
@Imbruced Imbruced changed the title add sedona deserialization code [SEDONA-750] add sedona deserialization code Dec 16, 2025
@paleolimbot
Member

Very cool!

I'll let you finish the proof-of-concept on the Sedona side and we can workshop the best way to connect all the dots here. I think working with the Sedona Spark serialization on the SedonaDB side is going to be the right way to go...we could potentially integrate it to the point where we don't need to convert to WKB (i.e., we can work with the Sedona Spark serialization in place for some functions).

@jiayuasu
Member

@Kontinuation SedonaSpark internal serialization is very similar to WKB. Is there a way that we can avoid unnecessary SerDe in the vectorized UDF?

@Imbruced
Member Author

> @Kontinuation SedonaSpark internal serialization is very similar to WKB. Is there a way that we can avoid unnecessary SerDe in the vectorized UDF?

Yeah, it's similar, but it differs in a few places, and that's what this MR does: it shuffles bytes to turn the Sedona SerDe format into WKB. For instance, I avoid parsing coordinates into numbers and instead push their bytes into a new array. If there is a simpler way, that would be great. The differences I've seen are in multipolygon, polygon, and multilinestring, where the metadata bytes (number of geometries, rings, etc.) are at the end, and in WKB multilinestring, where each element carries the complete WKB header of a linestring, such as byte order and WKB type.
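
To illustrate the byte-shuffling idea, here is a minimal Python sketch. It uses a made-up toy layout (coordinates first, ring sizes and ring count at the end), which is an assumption for illustration only and NOT the actual Sedona serialization format. The function name `toy_polygon_to_wkb` is hypothetical as well. It only shows how trailing metadata can be moved to the front while coordinate bytes are copied without being decoded.

```python
import struct

WKB_POLYGON = 3      # WKB geometry type code for Polygon
POINT_SIZE = 16      # one (x, y) pair as two little-endian float64 values


def toy_polygon_to_wkb(buf: bytes) -> bytes:
    """Convert a TOY trailing-metadata polygon layout to little-endian WKB.

    Toy layout (hypothetical, NOT the real Sedona format):
        [all coordinate bytes][uint32 point count per ring ...][uint32 ring count]
    """
    # The ring count is the last 4 bytes; the per-ring point counts sit before it.
    (num_rings,) = struct.unpack_from("<I", buf, len(buf) - 4)
    counts_start = len(buf) - 4 - 4 * num_rings
    ring_sizes = struct.unpack_from(f"<{num_rings}I", buf, counts_start)

    out = bytearray()
    # WKB header: byte order (1 = little endian), geometry type, ring count.
    out += struct.pack("<BII", 1, WKB_POLYGON, num_rings)
    offset = 0
    for n_points in ring_sizes:
        out += struct.pack("<I", n_points)
        n_bytes = n_points * POINT_SIZE
        # Coordinates are copied as raw bytes; they are never decoded to floats.
        out += buf[offset:offset + n_bytes]
        offset += n_bytes
    return bytes(out)


# Build a toy buffer for a single closed ring and convert it.
ring = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 0.0)]
coords = b"".join(struct.pack("<2d", x, y) for x, y in ring)
toy = coords + struct.pack("<I", len(ring)) + struct.pack("<I", 1)
wkb = toy_polygon_to_wkb(toy)
```

Only the counts are parsed as integers, so the cost is dominated by byte copies rather than by float decoding.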

@Kontinuation
Member

There is a way to make SedonaDB work directly with custom serialization formats without converting to WKB first. Below is a high-level, non-exhaustive roadmap of the required changes.

  1. Implement geo-traits and geo-traits-ext for the serialization format, similar to how Wkb implements geo-traits and geo-traits-ext. This allows generic geo algorithms to work directly on the format without deserializing the buffer into intermediate formats.
  2. Extend SedonaType to natively support that custom format.
  3. Support something like Into<geos::Geometry> and From<geos::Geometry> for decoding directly to, and encoding directly from, geometry objects defined by third-party libraries. This reduces the performance overhead of GEOS-based ST functions.
  4. Some existing code assumes that the data format is WKB and works with it directly. We should refactor some of it into generic code that works with geo-traits values.
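
The first and fourth points can be sketched roughly as follows. This is a Python analogy, not the Rust geo-traits API: `LineStringLike`, `WkbLineStringView`, `ToyTrailingCountView`, and `length` are hypothetical names, and the trailing-count layout is a toy format assumed for illustration, not the real Sedona serialization. The point is that one generic algorithm can read coordinates lazily from two different buffer layouts without converting either of them first.

```python
import math
import struct
from typing import Iterator, Protocol, Tuple


class LineStringLike(Protocol):
    """Minimal 'trait': anything that can yield (x, y) pairs."""

    def coords(self) -> Iterator[Tuple[float, float]]: ...


class WkbLineStringView:
    """Reads coordinates lazily from a little-endian WKB LineString buffer."""

    def __init__(self, buf: bytes) -> None:
        byte_order, geom_type, self._n = struct.unpack_from("<BII", buf, 0)
        if byte_order != 1 or geom_type != 2:
            raise ValueError("expected a little-endian WKB LineString")
        self._buf = memoryview(buf)

    def coords(self) -> Iterator[Tuple[float, float]]:
        for i in range(self._n):
            # 9-byte header, then 16 bytes per (x, y) pair.
            yield struct.unpack_from("<2d", self._buf, 9 + 16 * i)


class ToyTrailingCountView:
    """View over a TOY layout [coords][uint32 count] (hypothetical format)."""

    def __init__(self, buf: bytes) -> None:
        (self._n,) = struct.unpack_from("<I", buf, len(buf) - 4)
        self._buf = memoryview(buf)

    def coords(self) -> Iterator[Tuple[float, float]]:
        for i in range(self._n):
            yield struct.unpack_from("<2d", self._buf, 16 * i)


def length(line: LineStringLike) -> float:
    """Generic algorithm: works on any view that implements coords()."""
    total = 0.0
    prev = None
    for x, y in line.coords():
        if prev is not None:
            total += math.hypot(x - prev[0], y - prev[1])
        prev = (x, y)
    return total


points = [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0)]
raw = b"".join(struct.pack("<2d", x, y) for x, y in points)
wkb_line = struct.pack("<BII", 1, 2, len(points)) + raw
toy_line = raw + struct.pack("<I", len(points))

wkb_length = length(WkbLineStringView(wkb_line))
toy_length = length(ToyTrailingCountView(toy_line))
```

Both calls go through the same `length` function; only the view class knows where the point count and coordinates live in the buffer.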

@Imbruced
Member Author

@Kontinuation This sounds good. What do you think about what @paleolimbot proposed? As a PoC, we go with the Sedona serde to WKB transformation, and then we add the native serialization support afterwards? I can handle all of this, but I'll need some time as I am not super familiar with SedonaDB and Rust.

@Kontinuation
Member

> @Kontinuation This sounds good. What do you think about what @paleolimbot proposed? As a PoC, we go with the Sedona serde to WKB transformation, and then we add the native serialization support afterwards? I can handle all of this, but I'll need some time as I am not super familiar with SedonaDB and Rust.

Yes. The PoC plan sounds good to me. It allows us to build something useful without a giant refactoring.

@Imbruced Imbruced force-pushed the add-sedona-deserializer branch from d00a875 to 7179aa7 Compare December 17, 2025 18:06
@Imbruced Imbruced force-pushed the add-sedona-deserializer branch from c46b6bf to 026413c Compare December 19, 2025 21:54
@Imbruced
Member Author

Hmm, not sure why the step is failing :/ I included the serialization, using my messy code in Apache Sedona Spark.

I can run the following UDF using Apache Sedona, which runs SedonaDB:

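import pyarrow as pa
import shapely
import geoarrow.pyarrow as ga
from sedonadb import udf

# Arrow UDF: returns WKB geometry; takes a geometry and a numeric argument
@udf.arrow_udf(ga.wkb(), [udf.GEOMETRY, udf.NUMERIC])
def shapely_udf(geom, distance):
    # Materialize both inputs as pyarrow arrays
    geom_wkb = pa.array(geom.storage.to_array())
    distance = pa.array(distance.to_array())
    # Decode the WKB into Shapely geometries and buffer them
    geom = shapely.from_wkb(geom_wkb)
    result_shapely = shapely.buffer(geom, distance)

    # Encode the buffered geometries back to WKB
    return pa.array(shapely.to_wkb(result_shapely))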

which gives the following result on my test data:

+--------------------+
|                geom|
+--------------------+
|POLYGON ((14.3093...|
|POLYGON ((14.3177...|
|POLYGON ((14.3891...|
|POLYGON ((14.2185...|
|POLYGON ((14.3595...|
|POLYGON ((14.3855...|
|POLYGON ((14.2739...|
|POLYGON ((14.4047...|
|POLYGON ((14.3120...|
|POLYGON ((14.3630...|
+--------------------+

I think the current MR is ready, and I'll continue working on the worker and the Apache Sedona / SedonaDB strategy.

Let me know what you think.

@Imbruced Imbruced marked this pull request as ready for review December 19, 2025 23:57
@paleolimbot
Member

Cool! When you have a PR on the Sedona Spark side feel free to link it and I will have a look.

> Hmm not sure why the step is failing

R on macOS CI is failing pretty much everywhere on GitHub Actions at the moment (not related to this PR).

@Imbruced
Member Author

@paleolimbot definitely! Most likely closer to the end of the year, as now it's Christmas week :D
