
Add missing scalar functions #1470

Open
timsaucer wants to merge 4 commits into apache:main from timsaucer:feat/add-missing-scalar-fns

Conversation

@timsaucer
Member

Which issue does this PR close?

Closes #1453

Rationale for this change

These functions exist upstream but were not exposed to Python.

What changes are included in this PR?

Expose functions to Python
Add unit tests

Are there any user-facing changes?

New addition only.

Contributor

Copilot AI left a comment

Pull request overview

This PR aims to close #1453 by exposing several DataFusion scalar functions that exist upstream but were not previously available in the Python API, along with adding Python unit tests for the new bindings.

Changes:

  • Added Python wrappers and exports for arrow_metadata, get_field, union_extract, union_tag, version, plus a Python-level row alias for struct.
  • Added unit tests covering the newly exposed functions (notably union functions and version).
  • Updated codespell skip paths formatting in pyproject.toml.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| python/datafusion/functions.py | Adds new Python-level function wrappers/exports (arrow_metadata, get_field, union_*, version, row). |
| crates/core/src/functions.rs | Exposes new functions from the Rust extension module to Python via pyo3 (arrow_metadata, get_field, union_extract, union_tag, version). |
| python/tests/test_functions.py | Adds tests for the newly exposed functions. |
| pyproject.toml | Normalizes codespell skip path entries (removes ./ prefixes). |

@timsaucer timsaucer force-pushed the feat/add-missing-scalar-fns branch from 192593f to 2771621 Compare April 3, 2026 19:38
timsaucer and others added 4 commits April 3, 2026 15:51
…row_metadata, version, row

Expose upstream DataFusion scalar functions that were not yet available
in the Python API. Closes apache#1453.

- get_field: extracts a field from a struct or map by name
- union_extract: extracts a value from a union type by field name
- union_tag: returns the active field name of a union type
- arrow_metadata: returns Arrow field metadata (all or by key)
- version: returns the DataFusion version string
- row: alias for the struct constructor

Note: arrow_try_cast was listed in the issue but does not exist in
DataFusion 53, so it is not included.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests for get_field, arrow_metadata, version, row, union_tag, and
union_extract.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Allow arrow_cast, get_field, and union_extract to accept plain str
arguments instead of requiring Expr wrappers. Also improve
arrow_metadata test coverage and fix parameter shadowing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@timsaucer timsaucer force-pushed the feat/add-missing-scalar-fns branch from 4384c1f to df1ead1 Compare April 3, 2026 19:52
@timsaucer timsaucer requested a review from Copilot April 3, 2026 20:01
@timsaucer timsaucer marked this pull request as ready for review April 3, 2026 20:01
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.


Comment on lines +2605 to 2609
def arrow_cast(expr: Expr, data_type: Expr | str) -> Expr:
    """Casts an expression to a specified data type.

    Examples:
        >>> ctx = dfn.SessionContext()

Copilot AI Apr 3, 2026

This PR declares Closes #1453, but the issue also lists arrow_try_cast as a missing scalar function. I verified there is no arrow_try_cast wrapper anywhere in the repo (no Python wrapper in python/datafusion/functions.py and no Rust binding in crates/core/src/functions.rs). Either add arrow_try_cast (and a corresponding unit test) or adjust the PR description/linked issue closure so we’re not closing the issue prematurely.

Contributor

@ntjohnson1 ntjohnson1 left a comment

Claude didn't do as good a job maintaining existing structure as the last one. Not sure how pedantic we want to be about some of the formatting stuff since there isn't a ruff rule around it. A Copilot setting or custom lint rule could help enforce it if desired.
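
A custom lint rule of the kind mentioned above might look something like this sketch, which uses only the standard-library `ast` module. The convention it checks (public functions have an `Examples:` section and no `Returns:` section) is an assumption drawn from the review comments in this thread, and `docstring_issues` is a hypothetical name, not an existing tool.

```python
import ast


def docstring_issues(source: str) -> list[str]:
    """Return one message per violation of the assumed docstring convention."""
    issues = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if node.name.startswith("_"):
            continue  # assumption: private helpers are exempt
        doc = ast.get_docstring(node) or ""
        if "Examples:" not in doc:
            issues.append(f"{node.name}: missing 'Examples:' section")
        if "Returns:" in doc:
            issues.append(f"{node.name}: unexpected 'Returns:' section")
    return issues
```

Running it over the text of `python/datafusion/functions.py` in CI would surface the formatting drift flagged in the comments below without relying on manual review.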

        >>> result.collect_column("c")[0].as_py()
        1.0
    """
    if isinstance(data_type, str):
Contributor

Nice! I don't know if anyone has run into it yet, but I wonder if a helper around strings might be nice. Hopefully the most common case is people just passing Python strings, but I could imagine someone passing a numpy string or pyarrow string extracted from some other operation. Definitely follow-on work/issue.
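
One possible shape for such a helper, as a rough sketch: `as_python_str` is a hypothetical name, and the duck-typed `as_py()` check is an assumption about how pyarrow scalars would be handled, so the block needs neither numpy nor pyarrow to run.

```python
def as_python_str(value: object) -> str:
    """Hypothetical helper: coerce common string-like values to a built-in str."""
    if isinstance(value, str):
        # Covers built-in str and subclasses such as numpy.str_.
        return str(value)
    as_py = getattr(value, "as_py", None)
    if callable(as_py):
        # pyarrow scalars expose as_py(); accept it only if it yields a str.
        converted = as_py()
        if isinstance(converted, str):
            return converted
    raise TypeError(f"expected a string-like value, got {type(value).__name__}")
```

A wrapper like `arrow_cast` could call this before building the string literal, so the `isinstance(data_type, str)` branch would not need to know about every string-like type.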

        expr: An expression whose metadata to retrieve.
        key: Optional metadata key to look up. Can be a string or an Expr.

    Returns:
Contributor

No example and I think nothing else in the file uses the Returns category. I'm not sure how consistent that is across the code base.

        expr: A struct or map expression.
        name: The field name to extract.

    Returns:
Contributor

Echo example/returns note above

        union_expr: A union-typed expression.
        field_name: The name of the field to extract.

    Returns:
Contributor

example/returns

    Args:
        union_expr: A union-typed expression.

    Returns:
Contributor

example/returns

def version() -> Expr:
    """Returns the DataFusion version string.

    Returns:
Contributor

example/returns

In this case the returns is definitely redundant with the definition

def row(*args: Expr) -> Expr:
    """Returns a struct with the given arguments.

    This is an alias for :py:func:`struct`.
Contributor

Doesn't use the See Also block

    Args:
        args: The expressions to include in the struct.

    Returns:
Contributor

example/returns

import numpy as np
import pyarrow as pa
import pytest
from datafusion import SessionContext, column, literal, string_literal
Contributor

Love that this is no longer needed

Development

Successfully merging this pull request may close these issues.

Add missing scalar functions (union, arrow metadata, get_field, version, row)

3 participants