[SEDONA-750] add sedona deserialization code #460
|
@Imbruced Hey Pawel, do you mind adding a bit of explanation on what this PR is about? |
|
@jiayuasu, yeah, it's a WIP MR in draft mode, so it's not ready yet. I am verifying which steps are tested for an MR in SedonaDB. Basically, the idea is to run SedonaDB in Sedona vectorized UDFs.
So I need to transform the Sedona serialized geometries to WKB, which is the input for SedonaDB. It would be better to keep this in SedonaDB rather than Sedona (at least based on my thinking), but I am open to any suggestions. Another way would be to call the WKB functions on Sedona, but I am not sure if that's doable. I would like to have an internal function that is not exposed to users and is only used by the vectorized UDF worker. |
|
Very cool! I'll let you finish the proof-of-concept on the Sedona side and we can workshop the best way to connect all the dots here. I think working with the Sedona Spark serialization on the SedonaDB side is going to be the right way to go...we could potentially integrate it to the point where we don't need to convert to WKB (i.e., we can work with the Sedona Spark serialization in place for some functions). |
|
@Kontinuation SedonaSpark internal serialization is very similar to WKB. Is there a way that we can avoid unnecessary SerDe in the vectorized UDF? |
Yeah, it's similar, but in a few places it's different, and that's what this MR does: it shuffles bytes to change the Sedona SerDe format to WKB. So, for instance, I avoid parsing coordinates into numbers and instead push the bytes to a new array. If there is a simpler way, that would be great. The differences I've seen are in multipolygon, polygon, and multilinestring, where the metadata bytes (number of geometries, rings, etc.) are at the end in the Sedona format, and in WKB, where each linestring inside a multilinestring carries complete WKB header information such as byte order and WKB type.
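To make the WKB side of that comparison concrete, here is a minimal sketch that hand-encodes a 2D MultiLineString as little-endian WKB using only the standard library. It illustrates the standard WKB layout only; it does not reproduce the Sedona serialization or the code in this PR, and the helper names `linestring_wkb` and `multilinestring_wkb` are hypothetical.

```python
import struct

# Standard WKB geometry type codes (2D).
WKB_LINESTRING = 2
WKB_MULTILINESTRING = 5
LITTLE_ENDIAN = 1  # WKB byte-order flag for little-endian


def linestring_wkb(coords):
    """Encode a 2D linestring: byte order, type, point count, then x/y doubles."""
    out = struct.pack("<BII", LITTLE_ENDIAN, WKB_LINESTRING, len(coords))
    for x, y in coords:
        out += struct.pack("<dd", x, y)
    return out


def multilinestring_wkb(lines):
    """Encode a 2D multilinestring.

    The collection header comes first (byte order, type, number of members),
    and every member linestring repeats its own byte order and type header.
    """
    out = struct.pack("<BII", LITTLE_ENDIAN, WKB_MULTILINESTRING, len(lines))
    for line in lines:
        out += linestring_wkb(line)
    return out


wkb = multilinestring_wkb([[(0.0, 0.0), (1.0, 1.0)], [(2.0, 2.0), (3.0, 3.0)]])
# Bytes 0..8 are the collection header; the first member's own header starts at byte 9.
member_byte_order = wkb[9]
member_type = struct.unpack_from("<I", wkb, 10)[0]
```

Because the counts sit in front and each member has a 5-byte prefix (byte order plus type), a byte-shuffling converter has to emit these headers before copying the coordinate bytes through unchanged.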
|
There is a way to make SedonaDB work directly with custom serialization formats without converting to WKB first. Below is a high-level non-exhaustive roadmap of the required changes.
|
|
@Kontinuation This sounds good. What do you think about what @paleolimbot proposed? As a PoC, we go with the Sedona serde to WKB transformation, and then we add the additional native serialization method? I can handle all of this, but I'll need some time as I am not super familiar with SedonaDB and Rust. |
Yes. The PoC plan sounds good to me. This allows us to build something useful without performing a giant refactoring. |
|
Hmm, not sure why the step is failing :/ I included the serialization, using my messy code in Apache Sedona Spark. I can run the following UDF using Apache Sedona, which runs SedonaDB:

```python
import pyarrow as pa
import shapely
import geoarrow.pyarrow as ga
from sedonadb import udf


@udf.arrow_udf(ga.wkb(), [udf.GEOMETRY, udf.NUMERIC])
def shapely_udf(geom, distance):
    # Unwrap the inputs into plain Arrow arrays
    geom_wkb = pa.array(geom.storage.to_array())
    distance = pa.array(distance.to_array())
    # Decode WKB, buffer each geometry, and re-encode the result as WKB
    geom = shapely.from_wkb(geom_wkb)
    result_shapely = shapely.buffer(geom, distance)
    return pa.array(shapely.to_wkb(result_shapely))
```

This gives me the result for my testing data. I think the current MR is ready, and I'll continue working on the worker and the Apache Sedona / SedonaDB strategy. Let me know what you think. |
|
Cool! When you have a PR on the Sedona Spark side feel free to link it and I will have a look.
R on MacOS CI is failing pretty much everywhere on GitHub actions at the moment (not related to this PR). |
|
@paleolimbot definitely! Most likely closer to the end of the year, as now it's Christmas week :D |

Thanks to SedonaDB's great interoperability with Arrow, we could modify the Apache Spark vectorized UDFs to use SedonaDB for spatial attributes. We can start with scalar functions, and later add more complex features like table functions. One obstacle is converting the Apache Sedona internal serialization format to WKB, which SedonaDB uses. This ticket aims to add a Scala function to cast the Sedona internal serde format to WKB, following the diagram below.