Skip to content

[Coral-Schema] Preserve precision in RelDataTypeToAvroType for TIMESTAMP, TIME, and fixed BINARY#609

Open
yyy1000 wants to merge 2 commits into
linkedin:masterfrom
yyy1000:iceberg-first-avro/pr7-reldatatype-precision
Open

[Coral-Schema] Preserve precision in RelDataTypeToAvroType for TIMESTAMP, TIME, and fixed BINARY#609
yyy1000 wants to merge 2 commits into
linkedin:masterfrom
yyy1000:iceberg-first-avro/pr7-reldatatype-precision

Conversation

@yyy1000
Copy link
Copy Markdown
Contributor

@yyy1000 yyy1000 commented May 28, 2026

Motivation

When RelToAvroSchemaConverter generates a view's Avro schema, pass-through columns reuse the base-table schema produced by MergeCoralSchemaWithAvro / MergeHiveSchemaWithAvro, while computed and aggregate expressions (e.g. MAX(col), COALESCE(col, …), CASE … END) have their Avro type derived from the Calcite RelDataType by RelDataTypeToAvroType (via SchemaUtilities.appendField).

That derivation was lossy for Iceberg-sourced types:

  • TIMESTAMP was hardcoded to timestamp-millis, ignoring precision.
  • TIME had no case at all — it threw UnsupportedOperationException.
  • BINARY was always emitted as bytes.

Iceberg carries these precisely (TIMESTAMP(6), TIME, fixed BINARY(n)), and Calcite's aggregate / COALESCE / CASE return-type inference preserves the operand type. So a view that computes or aggregates over an Iceberg column hit the gap (pass-through of the same column is already correct via the merge engine):

Expression over an Iceberg base-table column Before After
MAX(time_col) / COALESCE(time_col, …) throws UnsupportedOperationException time-micros
MAX(ts6_col) / CASE … ts6_col … END timestamp-millis (wrong) timestamp-micros
MAX(fixed_col) bytes (lossy) fixed(n)

This also aligns the computed/aggregate path with the pass-through path, which MergeCoralSchemaWithAvro already emits as time-micros / timestamp-micros / fixed.

On TIME specifically: Hive has no TIME type, so a Calcite TIME originates only from the Iceberg/CoralCatalog path. That is why the previous default-case UnsupportedOperationException was never reached on the Hive path but is reachable for Iceberg-backed views.

Summary

basicSqlTypeToAvroType now derives the Avro logical type from the Calcite precision:

  • TIMESTAMP precision > 3 (e.g. Iceberg's TIMESTAMP(6)) → timestamp-micros; unspecified precision and precision ≤ 3 → timestamp-millis.
  • TIMEtime-micros (previously unsupported).
  • fixed-length BINARY(n) → Avro fixed; unbounded BINARYbytes.

Hive behavior is unchanged

Hive TIMESTAMP, TIME, and BINARY resolve to PRECISION_NOT_SPECIFIED (HiveTypeSystem), so they keep their exact current output (timestamp-millis / bytes). Only precision-specified types — which originate from the Iceberg/CoralCatalog path — change. The existing testTimestampTypeField (a Hive view) still emits timestamp-millis.

Testing

  • New RelDataTypeToAvroTypeTests.testPrecisionSensitiveTypes covering TIMESTAMP(6), unspecified TIMESTAMP, TIME, fixed BINARY(16), and unbounded BINARY, with golden rel2avro-testPrecisionTypes-expected.avsc.
  • Full coral-schema suite green (143 tests).

…AMP, TIME, and fixed BINARY

RelDataTypeToAvroType converts the Calcite RelDataType of a view's derived/computed
expressions to Avro. It hardcoded TIMESTAMP to timestamp-millis, had no TIME case (it
threw UnsupportedOperationException), and always emitted bytes for BINARY. This lost
precision for Iceberg-derived inputs, where MergeCoralSchemaWithAvro already emits the
precise types for base-table columns, so a passed-through column and a computed column
of the same logical type could disagree within a single view.

Derive the Avro logical type from the Calcite precision instead:

- TIMESTAMP precision > 3 (e.g. Iceberg TIMESTAMP(6)) -> timestamp-micros; unspecified
  precision and precision <= 3 -> timestamp-millis
- TIME -> time-micros (previously unsupported)
- fixed-length BINARY(n) -> Avro fixed; unbounded BINARY -> bytes

Hive TIMESTAMP, TIME, and BINARY resolve to PRECISION_NOT_SPECIFIED (HiveTypeSystem), so
Hive output is unchanged; only precision-specified types, which originate from the
Iceberg/CoralCatalog path, are upgraded.

Adds RelDataTypeToAvroTypeTests.testPrecisionSensitiveTypes with a golden fixture; the
full coral-schema suite passes.
? "timestamp-micros" : "timestamp-millis");
return schema;
case TIME:
// Iceberg TIME is microsecond-resolution; Avro represents it as a long with time-micros.
Copy link
Copy Markdown
Collaborator

@simbadzina simbadzina May 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider citing the Iceberg spec for the microsecond-precision claim so the invariant is traceable:

time — Time of day, microsecond precision, without date, timezone
https://github.com/apache/iceberg/blob/8f28a86914d6beaf2615cdf5797da0acb72ea4d9/format/spec.md?plain=1#L273

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates Coral Schema’s RelDataTypeToAvroType conversion so computed/aggregate Avro schema generation preserves precision-sensitive Iceberg-derived types more accurately.

Changes:

  • Maps TIMESTAMP precision > 3 to timestamp-micros while preserving existing unspecified/≤3 behavior as timestamp-millis.
  • Adds TIME conversion to Avro time-micros.
  • Emits Avro fixed(n) for precision-specified BINARY(n) and adds coverage with a new expected schema fixture.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
coral-schema/src/main/java/com/linkedin/coral/schema/avro/RelDataTypeToAvroType.java Adds precision-aware Avro mappings for timestamp, time, and fixed binary types.
coral-schema/src/test/java/com/linkedin/coral/schema/avro/RelDataTypeToAvroTypeTests.java Adds a direct RelDataType test covering precision-sensitive conversions.
coral-schema/src/test/resources/rel2avro-testPrecisionTypes-expected.avsc Provides the golden Avro schema for the new precision-sensitive type test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

schema.addProp("logicalType", "timestamp-millis");
schema.addProp("logicalType",
relDataType.getPrecision() != RelDataType.PRECISION_NOT_SPECIFIED && relDataType.getPrecision() > 3
? "timestamp-micros" : "timestamp-millis");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not blocking, but worth noting: this defaults unspecified-precision TIMESTAMP to timestamp-millis, while MergeCoralSchemaWithAvro.timestampToAvro defaults it to timestamp-micros. They can't collide today (Iceberg is always TIMESTAMP(6), and Hive tables route to MergeHiveSchemaWithAvro), but if we ever move Hive onto MergeCoralSchemaWithAvro the defaults would disagree. Maybe a short comment on each so a future consolidation reconciles them on purpose.

schema.addProp("logicalType", "timestamp-millis");
schema.addProp("logicalType",
relDataType.getPrecision() != RelDataType.PRECISION_NOT_SPECIFIED && relDataType.getPrecision() > 3
? "timestamp-micros" : "timestamp-millis");
Copy link
Copy Markdown
Collaborator

@simbadzina simbadzina May 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could we wrap the two comparisons in parens — ((getPrecision() != PRECISION_NOT_SPECIFIED) && (getPrecision() > 3))? Reads clearer than leaning on != / && precedence. Same for the BINARY case above.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also just have getPrecision() > 3 since PRECISION_NOT_SPECIFIED is -1.

simbadzina
simbadzina previously approved these changes May 29, 2026
Copy link
Copy Markdown
Collaborator

@simbadzina simbadzina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM. Left a few comments for code cleanup.

@simbadzina simbadzina dismissed their stale review May 29, 2026 22:06

Waiting for author to respond to review comments.

… unspecified-precision default

- Simplify the TIMESTAMP/BINARY guards to getPrecision() > 3 and > 0. Since
  PRECISION_NOT_SPECIFIED is -1, the explicit != PRECISION_NOT_SPECIFIED check was redundant.
- Cite the Iceberg spec for the TIME microsecond-precision claim.
- Note the intentional unspecified-precision default divergence between RelDataTypeToAvroType
  (timestamp-millis, to preserve Hive-view output) and MergeCoralSchemaWithAvro.timestampToAvro
  (timestamp-micros) at both sites, so a future Hive-on-Coral consolidation reconciles them
  deliberately.
@yyy1000
Copy link
Copy Markdown
Contributor Author

yyy1000 commented May 30, 2026

Thanks @simbadzina! All addressed in the latest commit:

  • Simplified both guards to getPrecision() > 3 (timestamp) and > 0 (binary) — since PRECISION_NOT_SPECIFIED == -1, the explicit != was redundant.
  • Cited the Iceberg spec inline on the TIME case (microsecond precision).
  • Added a note on the unspecified-precision default divergence on both RelDataTypeToAvroType and MergeCoralSchemaWithAvro.timestampToAvro, so a future Hive-on-Coral consolidation reconciles them deliberately.

Copy link
Copy Markdown
Collaborator

@simbadzina simbadzina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants