Is your feature request related to a problem or challenge?
When a GROUP BY includes any column with a nested type (List, LargeList, FixedSizeList, Struct, Map), multi_group_by::supported_type returns false for that column and the whole grouping falls back to the byte-encoded GroupValuesRows path via arrow_row::Rows, even if every other column would be supported by GroupValuesColumn. Today's supported_type (source):
fn supported_type(data_type: &DataType) -> bool {
matches!(*data_type,
DataType::Int8 | Int16 | ... | Float64
| Decimal128(_, _)
| Utf8 | LargeUtf8 | Utf8View
| Binary | LargeBinary | BinaryView
| Date32 | Date64 | Time32(_) | Timestamp(_, _)
| Boolean
)
}
Impact
A wide multi-column GROUP BY where most columns are individually supported by GroupValuesColumn (Utf8 / Date / Boolean / Float) but one column is a nested type (e.g. LargeList<Struct<...>>) is forced onto the slow arrow_row::Rows path. Group key memory grows materially compared to the equivalent column-format storage, and per-group comparisons become byte-by-byte rather than vectorized.
Describe the solution you'd like
Add GroupColumn implementations for the missing nested types and extend supported_type accordingly. Equality is well-defined by Arrow's make_equal-style comparators; hashing of nested values has existing arrow utilities.
Suggested incremental order, easiest first:
FixedSizeList<T> where T is a GroupColumn-supported primitive (fixed-width per row, simplest)
List<T> and LargeList<T> with primitive child (variable-length, single offset buffer)
Struct<...> with all fields being GroupColumn-supported (composition of existing columns under a single null mask)
Map<K, V> (most complex due to entry ordering semantics)
Each can be its own PR. The existing arrow_row::Rows path remains the fallback for any nested type not yet supported, so this is purely an optimization.
Describe alternatives you've considered
- arrow_row fallback (current state): correct, but significantly more memory and CPU than
GroupValuesColumn when the grouping mostly consists of cheap columns and the nested column is one of many.
- Avoid grouping by the nested column at the SQL level: works case-by-case (e.g. wrap in
ANY_VALUE) but does not help when the nested column is the legitimate group key.
Additional context
Issue #22526 is a separate, related pathology on the emit side. This issue is about the input/intern side (which GroupValues impl is selected). The two are independent and can land in either order.
Is your feature request related to a problem or challenge?
When a
GROUP BYincludes any column with a nested type (List,LargeList,FixedSizeList,Struct,Map),multi_group_by::supported_typereturns false for that column and the whole grouping falls back to the byte-encodedGroupValuesRowspath viaarrow_row::Rows, even if every other column would be supported byGroupValuesColumn. Today'ssupported_type(source):Impact
A wide multi-column
GROUP BYwhere most columns are individually supported byGroupValuesColumn(Utf8 / Date / Boolean / Float) but one column is a nested type (e.g.LargeList<Struct<...>>) is forced onto the slowarrow_row::Rowspath. Group key memory grows materially compared to the equivalent column-format storage, and per-group comparisons become byte-by-byte rather than vectorized.Describe the solution you'd like
Add
GroupColumnimplementations for the missing nested types and extendsupported_typeaccordingly. Equality is well-defined by Arrow'smake_equal-style comparators; hashing of nested values has existing arrow utilities.Suggested incremental order, easiest first:
FixedSizeList<T>whereTis aGroupColumn-supported primitive (fixed-width per row, simplest)List<T>andLargeList<T>with primitive child (variable-length, single offset buffer)Struct<...>with all fields beingGroupColumn-supported (composition of existing columns under a single null mask)Map<K, V>(most complex due to entry ordering semantics)Each can be its own PR. The existing
arrow_row::Rowspath remains the fallback for any nested type not yet supported, so this is purely an optimization.Describe alternatives you've considered
GroupValuesColumnwhen the grouping mostly consists of cheap columns and the nested column is one of many.ANY_VALUE) but does not help when the nested column is the legitimate group key.Additional context
Issue #22526 is a separate, related pathology on the emit side. This issue is about the input/intern side (which
GroupValuesimpl is selected). The two are independent and can land in either order.