Skip to content

fix: array expression audit follow-ups (#4503)#4713

Draft
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:fix/array-audit-followups-4503
Draft

fix: array expression audit follow-ups (#4503)#4713
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:fix/array-audit-followups-4503

Conversation

@andygrove

Copy link
Copy Markdown
Member

Which issue does this PR close?

Closes #4503.

This PR addresses the actionable follow-up items from the array expression audit (#4503). Item 5 (array_union ordering) is tracked separately in #4681 and is not included here.

Rationale for this change

The audit recorded several support-level / serde-consistency gaps where a restriction was enforced in convert (so it never surfaced in EXPLAIN or the auto-generated compatibility doc), a support level disagreed with runtime behaviour, or a serde registration was dead after Spark 4.0 rewrites. Lifting these decisions into getSupportLevel lets EXPLAIN, the compatibility guide, and the codegen dispatcher all see the same truth.

While working through the high-priority items, one turned out to be obsolete:

  • Item 2 asked to flip array_contains / array_distinct / array_union / array_max / array_min to Incompatible for float/double elements because of NaN / signed-zero canonicalization ([Bug] array_distinct / array_union / array_except do not canonicalize NaN like Spark #4481, [Bug] array_max and array_min disagree with Spark on NaN ordering #4482). The existing SQL test files (array_distinct.sql, array_max.sql, array_min.sql, array_union.sql, array_contains.sql) already exercise array(NaN, NaN) de-duplication, array(0.0, -0.0) signed-zero, and Infinity cases in default query mode (native execution + exact Spark match), and they pass on the current serdes. Native DataFusion already canonicalizes NaN and signed zero to match Spark, so reporting Incompatible would force a needless fallback to the JVM codegen dispatcher with no correctness benefit and a misleading EXPLAIN label. Item 2 is therefore omitted as obsolete.

What changes are included in this PR?

  • ArrayReverse (item 1): getSupportLevel now reports Incompatible for element types the native array_reverse cannot handle (binary, struct, map) so they route through the JVM codegen dispatcher (via CometReverse, which mixes in CodegenDispatchFallback) and stay native. Previously StructType reported Compatible while convert declined it, so those arrays silently fell back to Spark.
  • SortArray (item 4): Accept any foldable boolean ascendingOrder, not just a boolean Literal. Spark 4.0+ widens ascendingOrder to any foldable boolean; convert now evaluates the foldable expression (a null result unboxes to false, matching Spark's right.eval().asInstanceOf[Boolean]). Spark 3.x still only ever passes a Literal, so its behaviour is unchanged.
  • ArrayJoin (item 7): Fall back for non-default string collations, mirroring CometArrayIntersect, so the limitation is visible in EXPLAIN. The native array_to_string produces UTF8_BINARY semantics and does not propagate collations ([Spark 4.0] Add string collation support #2190).
  • ArrayExcept (items 1 & 6): Surface the native element-type restriction (binary / struct) in getSupportLevel. It stays Incompatible (not Unsupported) so the codegen dispatcher still evaluates those types natively under the default config; the convert-time guard remains as a defensive net for the allowIncompatible=true path.
  • Dead registrations (item 3): Remove the ArrayCompact serde registration. ArrayCompact is RuntimeReplaceable in all supported Spark versions (rewritten to ArrayFilter(arr, IsNotNull(...))), so it never reaches serde directly; dispatch flows through CometArrayFilter -> CometArrayCompact. ArrayAppend is documented as reachable only on Spark 3.x (Spark 4.0+ rewrites it to ArrayInsert).

How are these changes tested?

New SQL file tests under spark/src/test/resources/sql-tests/expressions/array/:

  • array_reverse.sql covers int / string / struct / nested-struct arrays. The struct cases assert the dispatcher keeps execution native (runs on all supported Spark versions).
  • sort_array_foldable.sql (MinSparkVersion: 4.0) exercises foldable non-literal ascendingOrder (cast(1 as boolean) / cast(0 as boolean), which survive the suite's disabled ConstantFolding as Cast nodes).
  • array_join_collation.sql (MinSparkVersion: 4.0) asserts collated inputs fall back with the expected reason.

Verified on Spark 3.5 and 4.0:

  • New SQL files pass on both profiles.
  • Existing CometArrayExpressionSuite (43 tests) and the existing array_join, sort_array, array_except, array_except_dispatch, array_compact SQL tests pass with no regressions.
  • scalafix + spotless clean on both profiles.

Address the actionable array expression audit follow-ups from apache#4503:

- ArrayReverse: report Incompatible for struct/map element types so they
  route through the JVM codegen dispatcher and stay native, instead of
  reporting Compatible while convert silently declined and fell back.
- SortArray: accept any foldable boolean ascendingOrder (Spark 4.0+ widens
  it beyond a boolean Literal) and evaluate it at convert time.
- ArrayJoin: fall back for non-default string collations, mirroring
  ArrayIntersect, so the limitation surfaces in EXPLAIN.
- ArrayExcept: surface the native element-type restriction in
  getSupportLevel rather than only in convert.
- Remove the dead ArrayCompact serde registration (RuntimeReplaceable in
  all supported versions, dispatched via ArrayFilter) and document
  ArrayAppend as reachable only on Spark 3.x.

Item 2 (NaN/signed-zero gating for array_contains/distinct/union/max/min)
is intentionally omitted: existing SQL tests show native DataFusion already
canonicalizes NaN and signed zero to match Spark, so reporting Incompatible
would force a needless fallback to the codegen dispatcher with no
correctness benefit. Item 5 (array_union ordering) is tracked in apache#4681.
Collapse the dual-Incompatible match in getSupportLevel into a single
Incompatible construction, and tighten the convert-time type guard to the
same find idiom (dropping a needless toSet). Behavior preserving.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

array expression audit follow-ups (from #4483)

1 participant