feat: generate expression reference doc from code [WIP] by andygrove · Pull Request #4585 · apache/datafusion-comet

andygrove · 2026-06-03T19:20:48Z

Which issue does this PR close?

N/A. Follow-on to #4583. This reduces drift and maintenance friction in the expression reference doc by generating it from code.

Note: this PR is stacked on #4583. Until that merges, this diff also contains its 2 prettier-formatting commits; they will drop out once #4583 lands.

Rationale for this change

docs/source/user-guide/latest/expressions.md was hand-maintained: every PR that added or changed an expression edited the tables by hand. That let the doc drift from reality (a function supported in code but still listed as planned, or a new Spark built-in never added) and made large aligned tables conflict-prone.

The Compatibility Guide is already generated by GenerateDocs from each serde's getCompatibleNotes / getIncompatibleReasons / getUnsupportedReasons. This PR extends the same generator to also produce the expression reference, so the overview is derived from the code that actually decides support, and stays complete and current.

What changes are included in this PR?

New pure helper org.apache.comet.ExpressionReference: status model, row resolution, table rendering, and Spark FunctionRegistry enumeration (unit-tested in isolation).
GenerateDocs extended to: enumerate every Spark built-in (with its group), derive Supported status and a Compatibility Guide link from the serde maps, and fall back to a curated status list for planned / not-planned functions. The curated list lives in GenerateDocs.scala on purpose: that file is excluded from the heavy CI path filters in dev/ci/compute-changes.py, so editing the list (for example when an issue is filed) does not trigger the Spark SQL and Iceberg jobs.
expressions.md per-group tables are now generated between markers; the prose was updated to drop the "Incorrect by default" status.
Doc generation pinned to the Spark 4.1 profile (newest FunctionRegistry) in dev/generate-release-docs.sh and docs/build.sh.
The reference is a concise overview: it carries a short summary plus a link into the Compatibility Guide for detail, with no duplicated note text.
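
As a rough illustration of the row resolution described above, here is a minimal sketch. It is written in Python for brevity and is not the PR's Scala code: the names `Status`, `Row`, `resolve_row`, `render_table`, the table columns, and the sample function names are all hypothetical. It assumes a serde-backed function is always Supported (with a Compatibility Guide link when a page exists), that the curated list is consulted only as a fallback, and that anything left over is reported as unclassified.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class Status(Enum):
    SUPPORTED = "Supported"
    PLANNED = "Planned"
    NOT_PLANNED = "Not planned"
    UNCLASSIFIED = "Unclassified"


@dataclass(frozen=True)
class Row:
    name: str
    group: str
    status: Status
    link: Optional[str] = None  # compatibility page or tracking issue, if any


def resolve_row(
    name: str,
    group: str,
    serde_pages: Dict[str, Optional[str]],
    curated: Dict[str, Tuple[Status, Optional[str]]],
) -> Row:
    """Resolve one built-in function to a reference row.

    serde_pages maps a function name to its compatibility page, or to None
    when the serde exists but has no page. curated holds the hand-maintained
    fallback entries for functions without a serde.
    """
    if name in serde_pages:  # code decides support, so it wins
        return Row(name, group, Status.SUPPORTED, serde_pages[name])
    if name in curated:  # planned / not-planned fallback
        status, issue = curated[name]
        return Row(name, group, status, issue)
    return Row(name, group, Status.UNCLASSIFIED)  # should be surfaced as a warning


def render_table(rows) -> str:
    """Render rows as an unpadded markdown table, sorted by function name."""
    lines = ["| Function | Status | Link |", "| --- | --- | --- |"]
    for r in sorted(rows, key=lambda r: r.name):
        lines.append(f"| {r.name} | {r.status.value} | {r.link or ''} |")
    return "\n".join(lines)
```

Keeping resolution and rendering as pure functions over plain maps is what makes each branch (serde + link, serde without page, planned + issue, not-planned, unclassified) testable without a Spark session.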

Known follow-ups (not in this PR): populate per-expression summary notes via a new getExpressionSummary (currently None, so serde-backed rows have sparse notes); add a CI check that fails when the generated doc is stale; rename the curated PlannedExpr type now that it also holds Supported entries.

How are these changes tested?

ExpressionReferenceSuite covers the status model, every branch of row resolution (serde + link, serde without page, planned + issue, not-planned, unclassified), and rendering.
FunctionRegistryEnumerationSuite verifies enumeration against real Spark built-ins.
Regeneration is idempotent (re-running the generator produces no diff), the generated doc has zero unclassified rows, and all tracking-issue links were verified to match the prior hand-written doc exactly.
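
The idempotency property can be illustrated with a small sketch of marker-based splicing. This is Python for brevity, not the PR's Scala generator, and the marker strings `BEGIN` and `END` as well as the function `splice` are hypothetical placeholders: the actual markers used in expressions.md are not shown in this description.

```python
# Hypothetical marker comments; the real ones in expressions.md may differ.
BEGIN = "<!-- BEGIN:GENERATED -->"
END = "<!-- END:GENERATED -->"


def splice(doc: str, generated: str) -> str:
    """Replace whatever sits between the markers with freshly generated text.

    Everything outside the markers (the hand-written prose) is preserved.
    Raises ValueError if either marker is missing.
    """
    start = doc.index(BEGIN) + len(BEGIN)
    end = doc.index(END, start)
    return doc[:start] + "\n" + generated.strip("\n") + "\n" + doc[end:]
```

Because the function discards the old content between the markers and normalizes the surrounding newlines, running it twice with the same generated text yields the same document, which is the "no diff on re-run" behavior the test plan relies on.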

prettier re-aligns markdown table columns to the widest cell, so adding a single expression row rewrites every row in the table. That produces noisy diffs and frequent merge conflicts between PRs that each add new expressions. Exempt the file from prettier so future additions stay as one-line diffs.

With prettier no longer aligning the tables, collapse the existing column padding so that adding an expression row never shifts the other rows. Combined with the prettier exemption, every future addition is a true one-line diff that cannot collide on re-alignment.
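
To make the diff-noise argument concrete, here is a small sketch that compares padded and unpadded table rendering. It does not invoke prettier: `aligned` only imitates column padding to the widest cell, and the helper names and sample rows are hypothetical.

```python
def aligned(rows):
    """Pad every cell to the widest cell in its column (imitates re-alignment)."""
    widths = [max(len(r[i]) for r in rows) for i in range(len(rows[0]))]
    return ["| " + " | ".join(c.ljust(w) for c, w in zip(r, widths)) + " |" for r in rows]


def compact(rows):
    """Render cells with no padding at all."""
    return ["| " + " | ".join(r) + " |" for r in rows]


def rewritten_rows(before, after):
    """Count pre-existing lines that no longer appear verbatim after the change."""
    return sum(1 for line in before if line not in after)


old = [("fn_a", "Supported"), ("fn_b", "Supported")]
new = old + [("a_much_longer_function_name", "Planned")]

print(rewritten_rows(aligned(old), aligned(new)))  # 2: both existing rows get re-padded
print(rewritten_rows(compact(old), compact(new)))  # 0: only the new row is added
```

With padding, one wide new cell changes every existing row, so two PRs that each add a row touch the same lines. Without padding, each addition is a single inserted line.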

The per-group tables are generated by GenerateDocs at site-publish time and frozen into release branches, matching how configs.md and the compatibility guide are handled. The main branch keeps only the markers and prose so the generated content never goes stale in source. [skip ci]

andygrove added 17 commits June 3, 2026 10:58

feat: add getExpressionSummary to serde traits for expressions.md

e898406

[skip ci]

feat: add expression reference status model

a571338

[skip ci]

refactor: enforce PlannedExpr status invariant and document status sy…

9738817

…mbols [skip ci]

feat: resolve expression reference rows from serde/planned data

2afc377

[skip ci]

fix: keep descriptive unclassified warning and assert via substring

e584cd4

[skip ci]

test: cover bare compat link and group in unclassified warning

85566de

[skip ci]

feat: render expression reference rows and tables

942796f

[skip ci]

feat: enumerate spark builtin functions for expression reference

43d7967

[skip ci]

test: make builtin enumeration test resilient across spark versions

848a264

[skip ci]

feat: wire serde lookup and planned list into GenerateDocs

09be5fc

[skip ci]

docs: clarify cast erasure, scaffolding, and category sync in Generat…

b572af0

…eDocs [skip ci]

feat: generate expressions.md per-group tables from code

113efbb

[skip ci]

feat: bootstrap curated expression list from prior reference doc

32655c6

[skip ci]

feat: classify remaining not-planned families and restore variant funcs

c90cb4a

[skip ci]

docs: drop incorrect-by-default status and pin doc-gen to spark 4.1

22a5159

[skip ci]

andygrove marked this pull request as draft June 3, 2026 19:22

andygrove changed the title ~~feat: generate expression reference doc from code~~ feat: generate expression reference doc from code [WIP] Jun 3, 2026

andygrove mentioned this pull request Jun 3, 2026

docs: stop prettier table re-alignment churn in expressions.md #4583

Merged

andygrove wants to merge 18 commits into apache:main from andygrove:generate-expressions-doc
