Skip to content

feat: add GroupColumn support for List / Struct / Map / FixedSizeList in multi-column GROUP BY #22682

@zhuqi-lucas

Description

@zhuqi-lucas

Is your feature request related to a problem or challenge?

When a GROUP BY includes any column with a nested type (List, LargeList, FixedSizeList, Struct, Map), multi_group_by::supported_type returns false for that column and the whole grouping falls back to the byte-encoded GroupValuesRows path via arrow_row::Rows, even if every other column would be supported by GroupValuesColumn. Today's supported_type (source):

fn supported_type(data_type: &DataType) -> bool {
    matches!(*data_type,
        DataType::Int8 | Int16 | ... | Float64
        | Decimal128(_, _)
        | Utf8 | LargeUtf8 | Utf8View
        | Binary | LargeBinary | BinaryView
        | Date32 | Date64 | Time32(_) | Timestamp(_, _)
        | Boolean
    )
}

Impact

A wide multi-column GROUP BY where most columns are individually supported by GroupValuesColumn (Utf8 / Date / Boolean / Float) but one column is a nested type (e.g. LargeList<Struct<...>>) is forced onto the slow arrow_row::Rows path. Group key memory grows materially compared to the equivalent column-format storage, and per-group comparisons become byte-by-byte rather than vectorized.

Describe the solution you'd like

Add GroupColumn implementations for the missing nested types and extend supported_type accordingly. Equality is well-defined by Arrow's make_equal-style comparators; hashing of nested values has existing arrow utilities.

Suggested incremental order, easiest first:

  1. FixedSizeList<T> where T is a GroupColumn-supported primitive (fixed-width per row, simplest)
  2. List<T> and LargeList<T> with primitive child (variable-length, single offset buffer)
  3. Struct<...> with all fields being GroupColumn-supported (composition of existing columns under a single null mask)
  4. Map<K, V> (most complex due to entry ordering semantics)

Each can be its own PR. The existing arrow_row::Rows path remains the fallback for any nested type not yet supported, so this is purely an optimization.

Describe alternatives you've considered

  • arrow_row fallback (current state): correct, but significantly more memory and CPU than GroupValuesColumn when the grouping mostly consists of cheap columns and the nested column is one of many.
  • Avoid grouping by the nested column at the SQL level: works case-by-case (e.g. wrap in ANY_VALUE) but does not help when the nested column is the legitimate group key.

Additional context

Issue #22526 is a separate, related pathology on the emit side. This issue is about the input/intern side (which GroupValues impl is selected). The two are independent and can land in either order.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions