[Coral-Schema] Preserve precision in RelDataTypeToAvroType for TIMESTAMP, TIME, and fixed BINARY#609
Conversation
…AMP, TIME, and fixed BINARY RelDataTypeToAvroType converts the Calcite RelDataType of a view's derived/computed expressions to Avro. It hardcoded TIMESTAMP to timestamp-millis, had no TIME case (it threw UnsupportedOperationException), and always emitted bytes for BINARY. This lost precision for Iceberg-derived inputs, where MergeCoralSchemaWithAvro already emits the precise types for base-table columns, so a passed-through column and a computed column of the same logical type could disagree within a single view. Derive the Avro logical type from the Calcite precision instead: - TIMESTAMP precision > 3 (e.g. Iceberg TIMESTAMP(6)) -> timestamp-micros; unspecified precision and precision <= 3 -> timestamp-millis - TIME -> time-micros (previously unsupported) - fixed-length BINARY(n) -> Avro fixed; unbounded BINARY -> bytes Hive TIMESTAMP, TIME, and BINARY resolve to PRECISION_NOT_SPECIFIED (HiveTypeSystem), so Hive output is unchanged; only precision-specified types, which originate from the Iceberg/CoralCatalog path, are upgraded. Adds RelDataTypeToAvroTypeTests.testPrecisionSensitiveTypes with a golden fixture; the full coral-schema suite passes.
| ? "timestamp-micros" : "timestamp-millis"); | ||
| return schema; | ||
| case TIME: | ||
| // Iceberg TIME is microsecond-resolution; Avro represents it as a long with time-micros. |
There was a problem hiding this comment.
Consider citing the Iceberg spec for the microsecond-precision claim so the invariant is traceable:
time— Time of day, microsecond precision, without date, timezone
https://github.com/apache/iceberg/blob/8f28a86914d6beaf2615cdf5797da0acb72ea4d9/format/spec.md?plain=1#L273
There was a problem hiding this comment.
Pull request overview
This PR updates Coral Schema’s RelDataTypeToAvroType conversion so computed/aggregate Avro schema generation preserves precision-sensitive Iceberg-derived types more accurately.
Changes:
- Maps
TIMESTAMPprecision > 3 totimestamp-microswhile preserving existing unspecified/≤3 behavior astimestamp-millis. - Adds
TIMEconversion to Avrotime-micros. - Emits Avro
fixed(n)for precision-specifiedBINARY(n)and adds coverage with a new expected schema fixture.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
coral-schema/src/main/java/com/linkedin/coral/schema/avro/RelDataTypeToAvroType.java |
Adds precision-aware Avro mappings for timestamp, time, and fixed binary types. |
coral-schema/src/test/java/com/linkedin/coral/schema/avro/RelDataTypeToAvroTypeTests.java |
Adds a direct RelDataType test covering precision-sensitive conversions. |
coral-schema/src/test/resources/rel2avro-testPrecisionTypes-expected.avsc |
Provides the golden Avro schema for the new precision-sensitive type test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| schema.addProp("logicalType", "timestamp-millis"); | ||
| schema.addProp("logicalType", | ||
| relDataType.getPrecision() != RelDataType.PRECISION_NOT_SPECIFIED && relDataType.getPrecision() > 3 | ||
| ? "timestamp-micros" : "timestamp-millis"); |
There was a problem hiding this comment.
Not blocking, but worth noting: this defaults unspecified-precision TIMESTAMP to timestamp-millis, while MergeCoralSchemaWithAvro.timestampToAvro defaults it to timestamp-micros. They can't collide today (Iceberg is always TIMESTAMP(6), and Hive tables route to MergeHiveSchemaWithAvro), but if we ever move Hive onto MergeCoralSchemaWithAvro the defaults would disagree. Maybe a short comment on each so a future consolidation reconciles them on purpose.
| schema.addProp("logicalType", "timestamp-millis"); | ||
| schema.addProp("logicalType", | ||
| relDataType.getPrecision() != RelDataType.PRECISION_NOT_SPECIFIED && relDataType.getPrecision() > 3 | ||
| ? "timestamp-micros" : "timestamp-millis"); |
There was a problem hiding this comment.
nit: could we wrap the two comparisons in parens — ((getPrecision() != PRECISION_NOT_SPECIFIED) && (getPrecision() > 3))? Reads clearer than leaning on != / && precedence. Same for the BINARY case above.
There was a problem hiding this comment.
We could also just have getPrecision() > 3 since PRECISION_NOT_SPECIFIED is -1.
simbadzina
left a comment
There was a problem hiding this comment.
Generally LGTM. Left a few comments for code cleanup.
Waiting for author to respond to review comments.
… unspecified-precision default - Simplify the TIMESTAMP/BINARY guards to getPrecision() > 3 and > 0. Since PRECISION_NOT_SPECIFIED is -1, the explicit != PRECISION_NOT_SPECIFIED check was redundant. - Cite the Iceberg spec for the TIME microsecond-precision claim. - Note the intentional unspecified-precision default divergence between RelDataTypeToAvroType (timestamp-millis, to preserve Hive-view output) and MergeCoralSchemaWithAvro.timestampToAvro (timestamp-micros) at both sites, so a future Hive-on-Coral consolidation reconciles them deliberately.
|
Thanks @simbadzina! All addressed in the latest commit:
|
Motivation
When
RelToAvroSchemaConvertergenerates a view's Avro schema, pass-through columns reuse the base-table schema produced byMergeCoralSchemaWithAvro/MergeHiveSchemaWithAvro, while computed and aggregate expressions (e.g.MAX(col),COALESCE(col, …),CASE … END) have their Avro type derived from the CalciteRelDataTypebyRelDataTypeToAvroType(viaSchemaUtilities.appendField).That derivation was lossy for Iceberg-sourced types:
TIMESTAMPwas hardcoded totimestamp-millis, ignoring precision.TIMEhad no case at all — it threwUnsupportedOperationException.BINARYwas always emitted asbytes.Iceberg carries these precisely (
TIMESTAMP(6),TIME, fixedBINARY(n)), and Calcite's aggregate /COALESCE/CASEreturn-type inference preserves the operand type. So a view that computes or aggregates over an Iceberg column hit the gap (pass-through of the same column is already correct via the merge engine):MAX(time_col)/COALESCE(time_col, …)UnsupportedOperationExceptiontime-microsMAX(ts6_col)/CASE … ts6_col … ENDtimestamp-millis(wrong)timestamp-microsMAX(fixed_col)bytes(lossy)fixed(n)This also aligns the computed/aggregate path with the pass-through path, which
MergeCoralSchemaWithAvroalready emits astime-micros/timestamp-micros/fixed.Summary
basicSqlTypeToAvroTypenow derives the Avro logical type from the Calcite precision:TIMESTAMPprecision > 3 (e.g. Iceberg'sTIMESTAMP(6)) →timestamp-micros; unspecified precision and precision ≤ 3 →timestamp-millis.TIME→time-micros(previously unsupported).BINARY(n)→ Avrofixed; unboundedBINARY→bytes.Hive behavior is unchanged
Hive
TIMESTAMP,TIME, andBINARYresolve toPRECISION_NOT_SPECIFIED(HiveTypeSystem), so they keep their exact current output (timestamp-millis/bytes). Only precision-specified types — which originate from the Iceberg/CoralCatalog path — change. The existingtestTimestampTypeField(a Hive view) still emitstimestamp-millis.Testing
RelDataTypeToAvroTypeTests.testPrecisionSensitiveTypescoveringTIMESTAMP(6), unspecifiedTIMESTAMP,TIME, fixedBINARY(16), and unboundedBINARY, with goldenrel2avro-testPrecisionTypes-expected.avsc.coral-schemasuite green (143 tests).