fix: array expression audit follow-ups (#4503)#4713
Draft
andygrove wants to merge 2 commits into
Draft
Conversation
Address the actionable array expression audit follow-ups from apache#4503: - ArrayReverse: report Incompatible for struct/map element types so they route through the JVM codegen dispatcher and stay native, instead of reporting Compatible while convert silently declined and fell back. - SortArray: accept any foldable boolean ascendingOrder (Spark 4.0+ widens it beyond a boolean Literal) and evaluate it at convert time. - ArrayJoin: fall back for non-default string collations, mirroring ArrayIntersect, so the limitation surfaces in EXPLAIN. - ArrayExcept: surface the native element-type restriction in getSupportLevel rather than only in convert. - Remove the dead ArrayCompact serde registration (RuntimeReplaceable in all supported versions, dispatched via ArrayFilter) and document ArrayAppend as reachable only on Spark 3.x. Item 2 (NaN/signed-zero gating for array_contains/distinct/union/max/min) is intentionally omitted: existing SQL tests show native DataFusion already canonicalizes NaN and signed zero to match Spark, so reporting Incompatible would force a needless fallback to the codegen dispatcher with no correctness benefit. Item 5 (array_union ordering) is tracked in apache#4681.
Collapse the dual-Incompatible match in getSupportLevel into a single Incompatible construction, and tighten the convert-time type guard to the same find idiom (dropping a needless toSet). Behavior preserving.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Closes #4503.
This PR addresses the actionable follow-up items from the array expression audit (#4503). Item 5 (
array_unionordering) is tracked separately in #4681 and is not included here.Rationale for this change
The audit recorded several support-level / serde-consistency gaps where a restriction was enforced in
convert(so it never surfaced in EXPLAIN or the auto-generated compatibility doc), a support level disagreed with runtime behaviour, or a serde registration was dead after Spark 4.0 rewrites. Lifting these decisions intogetSupportLevellets EXPLAIN, the compatibility guide, and the codegen dispatcher all see the same truth.While working through the high-priority items, one turned out to be obsolete:
array_contains/array_distinct/array_union/array_max/array_mintoIncompatiblefor float/double elements because of NaN / signed-zero canonicalization ([Bug] array_distinct / array_union / array_except do not canonicalize NaN like Spark #4481, [Bug] array_max and array_min disagree with Spark on NaN ordering #4482). The existing SQL test files (array_distinct.sql,array_max.sql,array_min.sql,array_union.sql,array_contains.sql) already exercisearray(NaN, NaN)de-duplication,array(0.0, -0.0)signed-zero, andInfinitycases in defaultquerymode (native execution + exact Spark match), and they pass on the current serdes. Native DataFusion already canonicalizes NaN and signed zero to match Spark, so reportingIncompatiblewould force a needless fallback to the JVM codegen dispatcher with no correctness benefit and a misleading EXPLAIN label. Item 2 is therefore omitted as obsolete.What changes are included in this PR?
getSupportLevelnow reportsIncompatiblefor element types the nativearray_reversecannot handle (binary, struct, map) so they route through the JVM codegen dispatcher (viaCometReverse, which mixes inCodegenDispatchFallback) and stay native. PreviouslyStructTypereportedCompatiblewhileconvertdeclined it, so those arrays silently fell back to Spark.ascendingOrder, not just a booleanLiteral. Spark 4.0+ widensascendingOrderto any foldable boolean;convertnow evaluates the foldable expression (a null result unboxes tofalse, matching Spark'sright.eval().asInstanceOf[Boolean]). Spark 3.x still only ever passes aLiteral, so its behaviour is unchanged.CometArrayIntersect, so the limitation is visible in EXPLAIN. The nativearray_to_stringproduces UTF8_BINARY semantics and does not propagate collations ([Spark 4.0] Add string collation support #2190).getSupportLevel. It staysIncompatible(notUnsupported) so the codegen dispatcher still evaluates those types natively under the default config; theconvert-time guard remains as a defensive net for theallowIncompatible=truepath.ArrayCompactserde registration.ArrayCompactisRuntimeReplaceablein all supported Spark versions (rewritten toArrayFilter(arr, IsNotNull(...))), so it never reaches serde directly; dispatch flows throughCometArrayFilter -> CometArrayCompact.ArrayAppendis documented as reachable only on Spark 3.x (Spark 4.0+ rewrites it toArrayInsert).How are these changes tested?
New SQL file tests under
spark/src/test/resources/sql-tests/expressions/array/:array_reverse.sqlcovers int / string / struct / nested-struct arrays. The struct cases assert the dispatcher keeps execution native (runs on all supported Spark versions).sort_array_foldable.sql(MinSparkVersion: 4.0) exercises foldable non-literalascendingOrder(cast(1 as boolean)/cast(0 as boolean), which survive the suite's disabledConstantFoldingasCastnodes).array_join_collation.sql(MinSparkVersion: 4.0) asserts collated inputs fall back with the expected reason.Verified on Spark 3.5 and 4.0:
CometArrayExpressionSuite(43 tests) and the existingarray_join,sort_array,array_except,array_except_dispatch,array_compactSQL tests pass with no regressions.scalafix+spotlessclean on both profiles.