fix: array expression audit follow-ups (#4503) by andygrove · Pull Request #4713 · apache/datafusion-comet

andygrove · 2026-06-23T18:45:30Z

Which issue does this PR close?

Closes #4503.

This PR addresses the actionable follow-up items from the array expression audit (#4503). Item 5 (array_union ordering) is tracked separately in #4681 and is not included here.

Rationale for this change

The audit recorded several support-level / serde-consistency gaps where a restriction was enforced in convert (so it never surfaced in EXPLAIN or the auto-generated compatibility doc), a support level disagreed with runtime behaviour, or a serde registration was dead after Spark 4.0 rewrites. Lifting these decisions into getSupportLevel lets EXPLAIN, the compatibility guide, and the codegen dispatcher all see the same truth.

While working through the high-priority items, one turned out to be obsolete:

Item 2 asked to flip array_contains / array_distinct / array_union / array_max / array_min to Incompatible for float/double elements because of NaN / signed-zero canonicalization ([Bug] array_distinct / array_union / array_except do not canonicalize NaN like Spark #4481, [Bug] array_max and array_min disagree with Spark on NaN ordering #4482). The existing SQL test files (array_distinct.sql, array_max.sql, array_min.sql, array_union.sql, array_contains.sql) already exercise array(NaN, NaN) de-duplication, array(0.0, -0.0) signed-zero, and Infinity cases in default query mode (native execution + exact Spark match), and they pass on the current serdes. Native DataFusion already canonicalizes NaN and signed zero to match Spark, so reporting Incompatible would force a needless fallback to the JVM codegen dispatcher with no correctness benefit and a misleading EXPLAIN label. Item 2 is therefore omitted as obsolete.

What changes are included in this PR?

ArrayReverse (item 1): getSupportLevel now reports Incompatible for element types the native array_reverse cannot handle (binary, struct, map) so they route through the JVM codegen dispatcher (via CometReverse, which mixes in CodegenDispatchFallback) and stay native. Previously StructType reported Compatible while convert declined it, so those arrays silently fell back to Spark.
SortArray (item 4): Accept any foldable boolean ascendingOrder, not just a boolean Literal. Spark 4.0+ widens ascendingOrder to any foldable boolean; convert now evaluates the foldable expression (a null result unboxes to false, matching Spark's right.eval().asInstanceOf[Boolean]). Spark 3.x still only ever passes a Literal, so its behaviour is unchanged.
ArrayJoin (item 7): Fall back for non-default string collations, mirroring CometArrayIntersect, so the limitation is visible in EXPLAIN. The native array_to_string produces UTF8_BINARY semantics and does not propagate collations ([Spark 4.0] Add string collation support #2190).
ArrayExcept (items 1 & 6): Surface the native element-type restriction (binary / struct) in getSupportLevel. It stays Incompatible (not Unsupported) so the codegen dispatcher still evaluates those types natively under the default config; the convert-time guard remains as a defensive net for the allowIncompatible=true path.
Dead registrations (item 3): Remove the ArrayCompact serde registration. ArrayCompact is RuntimeReplaceable in all supported Spark versions (rewritten to ArrayFilter(arr, IsNotNull(...))), so it never reaches serde directly; dispatch flows through CometArrayFilter -> CometArrayCompact. ArrayAppend is documented as reachable only on Spark 3.x (Spark 4.0+ rewrites it to ArrayInsert).

How are these changes tested?

New SQL file tests under spark/src/test/resources/sql-tests/expressions/array/:

array_reverse.sql covers int / string / struct / nested-struct arrays. The struct cases assert the dispatcher keeps execution native (runs on all supported Spark versions).
sort_array_foldable.sql (MinSparkVersion: 4.0) exercises foldable non-literal ascendingOrder (cast(1 as boolean) / cast(0 as boolean), which survive the suite's disabled ConstantFolding as Cast nodes).
array_join_collation.sql (MinSparkVersion: 4.0) asserts collated inputs fall back with the expected reason.

Verified on Spark 3.5 and 4.0:

New SQL files pass on both profiles.
Existing CometArrayExpressionSuite (43 tests) and the existing array_join, sort_array, array_except, array_except_dispatch, array_compact SQL tests pass with no regressions.
scalafix + spotless clean on both profiles.

Address the actionable array expression audit follow-ups from apache#4503: - ArrayReverse: report Incompatible for struct/map element types so they route through the JVM codegen dispatcher and stay native, instead of reporting Compatible while convert silently declined and fell back. - SortArray: accept any foldable boolean ascendingOrder (Spark 4.0+ widens it beyond a boolean Literal) and evaluate it at convert time. - ArrayJoin: fall back for non-default string collations, mirroring ArrayIntersect, so the limitation surfaces in EXPLAIN. - ArrayExcept: surface the native element-type restriction in getSupportLevel rather than only in convert. - Remove the dead ArrayCompact serde registration (RuntimeReplaceable in all supported versions, dispatched via ArrayFilter) and document ArrayAppend as reachable only on Spark 3.x. Item 2 (NaN/signed-zero gating for array_contains/distinct/union/max/min) is intentionally omitted: existing SQL tests show native DataFusion already canonicalizes NaN and signed zero to match Spark, so reporting Incompatible would force a needless fallback to the codegen dispatcher with no correctness benefit. Item 5 (array_union ordering) is tracked in apache#4681.

Collapse the dual-Incompatible match in getSupportLevel into a single Incompatible construction, and tighten the convert-time type guard to the same find idiom (dropping a needless toSet). Behavior preserving.

andygrove added 2 commits June 23, 2026 12:44

refactor: simplify CometArrayExcept support-level and convert guards

0571632

Collapse the dual-Incompatible match in getSupportLevel into a single Incompatible construction, and tighten the convert-time type guard to the same find idiom (dropping a needless toSet). Behavior preserving.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: array expression audit follow-ups (#4503)#4713

fix: array expression audit follow-ups (#4503)#4713
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:fix/array-audit-followups-4503

andygrove commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

andygrove commented Jun 23, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant