[DO NOT MERGE] Support Spark Connect by chenliu0831 · Pull Request #651 · awslabs/deequ

chenliu0831 · 2026-01-13T18:13:50Z

Issue #, if available:

Description of changes:

Initial effort to evolve PyDeequ to use Spark Connect instead of the current, fragile Py4J-based bridge.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

SemyonSinchenko · 2026-01-13T20:58:36Z

+
+  // The transform method receives protobuf Any from Spark Connect
+  // Scala compiler sees com.google.protobuf.Any in the interface signature
+  override def transform(


Feel free to ignore

In Spark 4.x the signature was changed from relation: protobuf.Any to relation: Array[Byte]. To avoid pain during the migration, I would strongly recommend keeping transform as small as possible, ideally in a separate class. In GraphFrames we separated the plugin implementation from the plugin logic so that we can have two versions for different Spark releases. You can see an example here: spark3 and spark4

Otherwise you may need to duplicate the whole logic on the day you start working on Spark 4.x support.
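
One possible shape for that split is sketched below. Everything in it is a stand-in so that it runs without Spark on the classpath: `FakeAny`, `Plan`, `DeequLogic`, `Spark3Adapter` and `Spark4Adapter` are hypothetical names, and the text encoding used by `FakeAny.parse` is invented for the example. A real adapter would extend Spark's relation plugin interface and return a logical plan; only the layering is the point here.

```scala
// Stand-in for com.google.protobuf.Any (hypothetical): a type URL plus a payload.
final case class FakeAny(typeUrl: String, payload: String) {
  def toBytes: Array[Byte] = s"$typeUrl|$payload".getBytes("UTF-8")
}

object FakeAny {
  // Invented wire format for this sketch only: "typeUrl|payload".
  def parse(bytes: Array[Byte]): FakeAny = {
    val Array(url, payload) = new String(bytes, "UTF-8").split("\\|", 2)
    FakeAny(url, payload)
  }
}

// Stand-in for a Spark logical plan (hypothetical).
final case class Plan(description: String)

// Version-neutral logic: the only place that interprets the relation.
object DeequLogic {
  val VerificationTypeUrl = "deequ.Verification" // assumed identifier

  def handle(relation: FakeAny): Option[Plan] =
    if (relation.typeUrl == VerificationTypeUrl) Some(Plan(s"verify(${relation.payload})"))
    else None // not ours: let other plugins try
}

// Thin adapter for a Spark 3.x style signature (relation arrives as Any).
object Spark3Adapter {
  def transform(relation: FakeAny): Option[Plan] = DeequLogic.handle(relation)
}

// Thin adapter for a Spark 4.x style signature (relation arrives as bytes).
object Spark4Adapter {
  def transform(relation: Array[Byte]): Option[Plan] =
    DeequLogic.handle(FakeAny.parse(relation))
}
```

With this layout, supporting a new Spark version means adding one small adapter, while `DeequLogic` stays shared.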

Great call, thanks. I haven't thought much about the Spark 3.x to 4.x breaking changes yet (they seem more annoying than I expected). Let me revisit this in a new revision.

github-actions

Generated by AI (model: us.anthropic.claude-opus-4-6-v1, prompt: 926c07a3) — may not be fully accurate. Reply if this doesn't help.

github-actions · 2026-04-27T22:53:12Z

@@ -0,0 +1,504 @@
+<?xml version="1.0" encoding="UTF-8"?>


This file is auto-generated by maven-shade-plugin and should not be committed to the repository. Add dependency-reduced-pom.xml to .gitignore.

github-actions · 2026-04-27T22:53:12Z

+    // Debug: Log what we're receiving
+    println(s"[DeequPlugin] Received relation with type_url: ${relation.getTypeUrl}")
+    println(s"[DeequPlugin] Expected type_url for verification: ${DeequVerificationRelation.getDescriptor.getFullName}")
+    println(s"[DeequPlugin] is(DeequVerificationRelation): ${relation.is(classOf[DeequVerificationRelation])}")


Remove debug println statements. These will pollute driver logs in production. Use a proper logger (e.g., org.slf4j.LoggerFactory) or remove them entirely.
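
A minimal sketch of the logger-based alternative, assuming SLF4J is on the classpath (Spark normally provides it). The object name `DeequPluginLogging` and both method names are hypothetical; only `LoggerFactory.getLogger`, `isDebugEnabled` and `debug` are SLF4J calls.

```scala
import org.slf4j.{Logger, LoggerFactory}

object DeequPluginLogging {
  private val log: Logger = LoggerFactory.getLogger("DeequPlugin")

  // Pure helper so the message format can be checked without a logging backend.
  def describe(receivedTypeUrl: String, expectedFullName: String): String =
    s"Received relation with type_url=$receivedTypeUrl, expected message=$expectedFullName"

  // Emits at DEBUG level, so production drivers stay quiet unless
  // the log level for "DeequPlugin" is lowered explicitly.
  def logReceived(receivedTypeUrl: String, expectedFullName: String): Unit =
    if (log.isDebugEnabled) log.debug(describe(receivedTypeUrl, expectedFullName))
}
```

The `isDebugEnabled` guard skips the string interpolation entirely when debug logging is off.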

github-actions · 2026-04-27T22:53:12Z

+    println(s"[DeequPlugin] Received relation with type_url: ${relation.getTypeUrl}")
+    println(s"[DeequPlugin] Expected type_url for verification: ${DeequVerificationRelation.getDescriptor.getFullName}")
+    println(s"[DeequPlugin] is(DeequVerificationRelation): ${relation.is(classOf[DeequVerificationRelation])}")
+


Same: remove println debug logging throughout this method (lines 57-58, 62, 68, 74, 80, 85).

github-actions · 2026-04-27T22:53:12Z

+  }
+
+  /**
+   * Deserialize the input relation bytes to a DataFrame.


new DataFrame(spark, logicalPlan, ExpressionEncoder(qe.analyzed.schema)) — the 3-arg DataFrame constructor is an internal API that may not exist in all Spark 3.5 builds. Consider using spark.sessionState.executePlan(logicalPlan) and Dataset.ofRows(spark, logicalPlan) instead, which is the standard internal pattern.

github-actions · 2026-04-27T22:53:12Z

+
+    // Build Check objects from protobuf messages
+    val checks = req.getChecksList.asScala.map(CheckBuilder.build).toSeq
+


suite is a VerificationRunBuilder held in a var that is reassigned with the return value of addCheck and addRequiredAnalyzer. This works because both methods return this, but the pattern is fragile: a forgotten reassignment would go unnoticed if the builder ever stopped mutating in place. Prefer chaining: checks.foldLeft(VerificationSuite().onData(inputDf))((s, c) => s.addCheck(c)).
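
The fold can be illustrated without Deequ on the classpath. In the sketch below, `RunBuilder` is a hypothetical immutable stand-in for the real builder, and checks and analyzers are plain strings; it is deliberately immutable to show that the fold stays correct even if the builder does not return `this`.

```scala
// Hypothetical stand-in for the verification run builder.
final case class RunBuilder(
    checks: Vector[String] = Vector.empty,
    analyzers: Vector[String] = Vector.empty) {
  def addCheck(c: String): RunBuilder = copy(checks = checks :+ c)
  def addRequiredAnalyzer(a: String): RunBuilder = copy(analyzers = analyzers :+ a)
}

object BuildSuite {
  // No var: each step threads the builder returned by the previous one.
  def build(checks: Seq[String], analyzers: Seq[String]): RunBuilder = {
    val withChecks = checks.foldLeft(RunBuilder())((b, c) => b.addCheck(c))
    analyzers.foldLeft(withChecks)((b, a) => b.addRequiredAnalyzer(a))
  }
}
```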

github-actions · 2026-04-27T22:53:12Z

+        val relativeError = if (msg.getRelativeError == 0.0) 0.01 else msg.getRelativeError
+        ApproxQuantile(msg.getColumn, quantile, relativeError)
+
+      case "ApproxQuantiles" =>


ApproxQuantile — defaulting quantile to 0.5 when the proto value is 0.0 means a client cannot explicitly request the 0th quantile. Use msg.hasQuantile() or a wrapper message to distinguish "not set" from "set to 0.0". Same issue with relativeError defaulting when 0.0.
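
The difference between "not set" and "set to 0.0" can be modelled as follows. `ApproxQuantileMsg` is a hypothetical stand-in for the generated message: it assumes the proto fields were declared `optional`, in which case the generated Java class would expose `hasQuantile()` and `hasRelativeError()`, represented here as `Option`. The defaults 0.5 and 0.01 are the ones discussed in this review.

```scala
// Hypothetical stand-in for the generated protobuf message.
final case class ApproxQuantileMsg(
    column: String,
    quantile: Option[Double],      // None plays the role of !hasQuantile()
    relativeError: Option[Double]) // None plays the role of !hasRelativeError()

object ApproxQuantileDefaults {
  val DefaultQuantile = 0.5
  val DefaultRelativeError = 0.01

  // Defaults apply only when the client left the field unset;
  // an explicit 0.0 is passed through unchanged.
  def resolve(msg: ApproxQuantileMsg): (Double, Double) =
    (msg.quantile.getOrElse(DefaultQuantile),
     msg.relativeError.getOrElse(DefaultRelativeError))
}
```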

github-actions · 2026-04-27T22:53:12Z

+        } else {
+          msg.getColumnsList.asScala.map(_.toDouble).toSeq
+        }
+        val relativeError = if (msg.getRelativeError == 0.0) 0.01 else msg.getRelativeError


ApproxQuantiles reuses the columns repeated field to pass quantile values (doubles encoded as strings). This is a semantic mismatch — columns is documented as column names in the proto. Use a dedicated repeated double field in the proto instead.
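
A small sketch of why the string encoding is brittle. `QuantileParsing` and `ApproxQuantilesMsg` are hypothetical names; the second models what a dedicated `repeated double` field would give the server directly.

```scala
import scala.util.Try

// Hypothetical message shape with a dedicated, typed field.
final case class ApproxQuantilesMsg(column: String, quantiles: Seq[Double])

object QuantileParsing {
  // What the current encoding forces on the server: every entry of
  // `columns` must parse as a double, and a real column name fails at runtime.
  def fromStrings(columns: Seq[String]): Try[Seq[Double]] =
    Try(columns.map(_.toDouble))
}
```

With a typed field, the malformed case cannot be constructed at all, so the failure branch disappears from the server code.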

github-actions · 2026-04-27T22:53:12Z

+    val spark = planner.sessionHolder.session
+    val inputDf = deserializeInputRelation(req.getInputRelation, planner)
+
+    // Build suggestion runner


Rules.STRING, Rules.NUMERICAL, Rules.COMMON, Rules.EXTENDED — verify these constants exist in the Rules object. The Deequ Rules object only defines DEFAULT. Unknown rule names silently fall back to DEFAULT, which hides configuration errors.
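
One way to surface configuration errors instead of hiding them is to make the lookup return an explicit error for unknown names. The registry below is hypothetical: the keys mirror the names mentioned above, and the values are placeholder strings rather than Deequ rule objects.

```scala
import java.util.Locale

object RuleSetLookup {
  // Hypothetical registry; placeholder values stand in for real rule sets.
  private val known: Map[String, Seq[String]] = Map(
    "DEFAULT" -> Seq("rule-a", "rule-b"),
    "STRING" -> Seq("rule-c"),
    "NUMERICAL" -> Seq("rule-d")
  )

  // Left carries a message naming the bad input and the accepted values,
  // so the client sees the typo instead of silently getting DEFAULT.
  def resolve(name: String): Either[String, Seq[String]] =
    known
      .get(name.toUpperCase(Locale.ROOT))
      .toRight(s"Unknown rule set '$name'; expected one of ${known.keys.toSeq.sorted.mkString(", ")}")
}
```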

github-actions · 2026-04-27T22:53:12Z

+  bytes input_relation = 1;
+
+  // Checks to run
+  repeated CheckMessage checks = 2;


bytes input_relation = 1 — using raw bytes to serialize a nested Spark Connect Relation is fragile. If the proto schema changes, deserialization will silently break. Consider using google.protobuf.Any or importing the Spark Connect proto and using the Relation message type directly.

github-actions · 2026-04-27T22:53:12Z

+        }
+
+      // Approx count distinct
+      case "hasApproxCountDistinct" =>


buildDoubleAssertion receives a PredicateMessage but protobuf never returns null for message fields — it returns a default instance. The if (pred == null) check will never be true. You need to check c.hasAssertion() at the call site instead.
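
The call-site check could look roughly like this. `PredicateMsg`, `ConstraintMsg` and `Assertions` are hypothetical stand-ins that mimic the protobuf-java behaviour described above (the getter returns a default instance, never null); the operator strings are also assumptions.

```scala
// Hypothetical stand-ins mirroring generated protobuf classes.
final case class PredicateMsg(operator: String = "", value: Double = 0.0)

final case class ConstraintMsg(assertion: Option[PredicateMsg]) {
  def hasAssertion: Boolean = assertion.isDefined
  // Like protobuf-java: an unset message field yields a default instance.
  def getAssertion: PredicateMsg = assertion.getOrElse(PredicateMsg())
}

object Assertions {
  // Presence is decided by hasAssertion, not by a null check on the getter.
  def build(c: ConstraintMsg): Option[Double => Boolean] =
    if (!c.hasAssertion) None
    else {
      val p = c.getAssertion
      p.operator match {
        case ">=" => Some((x: Double) => x >= p.value)
        case "<=" => Some((x: Double) => x <= p.value)
        case "==" => Some((x: Double) => x == p.value)
        case other => throw new IllegalArgumentException(s"Unsupported operator: '$other'")
      }
    }
}
```

Returning `None` leaves the choice of default assertion to the caller, which keeps that decision visible at the call site.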

chenliu0831 added 2 commits January 13, 2026 13:11

Support Spark Connect

81ffee6

Add Column Profiler and Constraint Suggestions support

5bdb9d0

SemyonSinchenko reviewed Jan 13, 2026

fix bug in analyzer context

1c5daef

github-actions Bot requested changes Apr 27, 2026
