What happens?
Hi.
Problem
As the result of `spark.sql("<non-select>")`, where `<non-select>` is any SQL statement that is not a SELECT (e.g., USE, INSERT, DROP, CREATE, ...), the `sql()` function correctly returns an empty DataFrame, which matches the behavior of the PySpark API.
However, that object crashes when any of its APIs are used, because the internal `relation` object is None. The same applies when trying to create an empty DataFrame without columns.
Fix
I think the best fix would be to change the underlying C++ `Relation` object in the DuckDB C++ library to support an empty relation without columns. There are also a couple of other fixes, such as allowing the underlying `duckdb.struct_type()` to have no fields. That would make the low-level API more robust and require less patching in the Python layer.
Then the `DuckDBPyConnection::RunQuery` function needs to return an empty relation for non-select statements instead of `nullptr`. All these fixes felt a bit overwhelming, so I won't submit a patch.
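For illustration only, here is roughly the kind of Python-layer patching the C++ fix would make unnecessary: guarding every accessor against a missing relation. `DataFrame` below is a simplified stand-in, not the actual duckdb Spark-API class.

```python
# Hypothetical sketch of the Python-layer workaround the C++ fix would avoid.
# "DataFrame" is a simplified stand-in for the real duckdb Spark-API class.
class DataFrame:
    def __init__(self, relation):
        # Today, relation is None when the statement produced no result set
        # (USE, INSERT, DROP, CREATE, ...).
        self.relation = relation

    @property
    def columns(self):
        # Guard each accessor: a missing relation means zero columns...
        return [] if self.relation is None else list(self.relation.columns)

    def collect(self):
        # ...and zero rows, matching Spark's empty-DataFrame behavior.
        return [] if self.relation is None else self.relation.fetchall()


# With the guards, an empty DataFrame no longer crashes:
empty = DataFrame(None)
assert empty.columns == []
assert empty.collect() == []
```

Repeating this guard in every wrapper method is exactly the patching burden that an empty-capable `Relation` on the C++ side would remove.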
To Reproduce
Testcase; all of this works with Spark.
```python
@pytest.mark.parametrize("mode", ["pandas", "list", "non-select"])
def test_empty_sdf(spark_session_g, mode):
    from pyspark.sql import functions as f
    from pyspark.sql import types as t
    import pandas as pd

    spark = spark_session_g
    if mode == "pandas":
        sdf = spark.createDataFrame(pd.DataFrame(), t.StructType([]))
    elif mode == "list":
        sdf = spark.createDataFrame([], t.StructType([]))
    else:
        curr_db = spark.catalog.currentDatabase()
        sdf = spark.sql(f"USE {curr_db}")  # non-result-set query

    assert sdf.schema == t.StructType([])
    assert sdf.columns == []
    assert sdf.collect() == []
    assert sdf.toPandas().empty
    assert sdf.toArrow().shape == (0, 0)

    sdf.createOrReplaceTempView("my_vv1")
    assert spark.sql("SELECT * from my_vv1").toArrow().shape == (0, 0)

    sdf.show()  # no-op, no crash
    assert sdf.withColumn("col1", f.lit(1)).columns == ["col1"]
    assert sdf.withColumns({"col1": f.lit(1)}).columns == ["col1"]
    assert sdf.drop("noop").columns == []
```
OS:
Any
DuckDB Package Version:
Main branch
Python Version:
3.12
Full Name:
João Eiras
Affiliation:
private
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a source build
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
Did you include all relevant configuration to reproduce the issue?