@Flyangz (Contributor) commented Jan 16, 2026

Which issue does this PR close?

Closes #1902

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

How was this patch tested?

Copilot AI left a comment
Pull request overview

This PR attempts to introduce function caching for external functions using Rust's OnceLock to avoid repeatedly wrapping function pointers in Arc. A cache! macro is introduced to replace direct Arc::new() calls for all Spark extension functions (except the "Placeholder" function).

Changes:

  • Added OnceLock import and created a cache! macro to cache function implementations
  • Replaced 30+ direct Arc::new(function) calls with cache!(function) macro invocations


Comment on lines +42 to +47
macro_rules! cache {
    ($func:path) => {{
        static CELL: OnceLock<ScalarFunctionImplementation> = OnceLock::new();
        CELL.get_or_init(|| Arc::new($func)).clone()
    }};
}
Copilot AI commented Jan 21, 2026
The cache macro has a critical bug: all invocations share the same static CELL variable. This means that once any function is cached in the first call, all subsequent function lookups will return that same cached function, regardless of which function was requested.

For example, if "Spark_NullIf" is called first, it will cache spark_null_if::spark_null_if. Then when "Spark_NullIfZero" is called, it will return the same spark_null_if::spark_null_if function instead of spark_null_if::spark_null_if_zero.

To fix this, each invocation needs its own unique static variable. This can be achieved by making the static variable name unique per function path, or by using a different caching approach such as a global HashMap with function names as keys.

@Flyangz (Contributor, Author) replied:
In Rust, a static declared inside a function or block is not a global singleton shared by name; each macro expansion generates its own static item, local to that expansion site. Here, every match arm that invokes cache! gets its own CELL.
Running this test in AuronQuerySuite:

test("my cache test") {
    withTable("my_cache_table") {
      sql("""
            |create table my_cache_table using parquet as
            |select col1, col2 from values ('a,A', '{"a":"1", "b":"2"}'), ('b,B', '{"a":"3", "b":"4"}'), ('c,C', '{"a":"5", "b":"6"}')
            |""".stripMargin)
      sql("""
            |select split(col1, ',')[0],
            |       split(col1, ',')[1],
            |       get_json_object(col2, '$.a'),
            |       get_json_object(col2, '$.b')
            |from my_cache_table
            |""".stripMargin).show()
    }
  }

we see the correct output:

+---------------------+---------------------+--------------------------+--------------------------+
|split(col1, ,, -1)[0]|split(col1, ,, -1)[1]|get_json_object(col2, $.a)|get_json_object(col2, $.b)|
+---------------------+---------------------+--------------------------+--------------------------+
|                    a|                    A|                         1|                         2|
|                    b|                    B|                         3|                         4|
|                    c|                    C|                         5|                         6|
+---------------------+---------------------+--------------------------+--------------------------+

The cache correctly dispatches three different ext functions — StringSplit, GetParsedJsonObject, and ParseJson — as the physical plan shows:

ProjectExec [
  (spark_ext_function_Spark_StringSplit(#2@0, ,)).[1] AS #16,
  (spark_ext_function_Spark_StringSplit(#2@0, ,)).[2] AS #17,
  spark_ext_function_Spark_GetParsedJsonObject(spark_ext_function_Spark_ParseJson(#3@1), $.a) AS #18,
  spark_ext_function_Spark_GetParsedJsonObject(spark_ext_function_Spark_ParseJson(#3@1), $.b) AS #19
], schema=[#16:Utf8;N, #17:Utf8;N, #18:Utf8;N, #19:Utf8;N]

Successfully merging this pull request may close these issues.

Fix ext function caching

1 participant