
Make it easier for agents to generate datafusion-python code #1394

@timsaucer

Description


Problem

More and more users reach for LLMs to generate DataFusion Python code.
Today, agents are excellent at writing SQL but struggle to produce
idiomatic DataFrame API code: they either transliterate SQL literally
or invent patterns that don't match the library's grain. Nothing the
project currently ships reliably surfaces idiomatic-usage guidance to
the agent at the moment it is writing code.

Goals

  1. Establish a single, authoritative guide for writing idiomatic
    DataFusion Python code, usable by both humans and agents.
  2. Make that guide discoverable through every channel agents actually
    use — not just the channels we wish they used.
  3. Validate the guide against a reference corpus (TPC-H) so it stays
    honest as the API evolves.
  4. Extend the same pattern across the wider DataFusion family
    (Ballista, Comet, Ray, etc.) via an upstream llms.txt hub.

Where idiomatic code is defined

Single source of truth: `python/datafusion/AGENTS.md`.

This one file — kept inside the repo, shipped inside the wheel, and
included verbatim on the docs site — is the canonical guide. It
contains:

  • Core abstractions (`SessionContext` / `DataFrame` / `Expr` /
    `functions`) and import conventions.
  • A quick-start example that works end-to-end.
  • A SQL-to-DataFrame reference table (for users who think in SQL first).
  • Migration sections for users coming from Spark, Pandas, and
    Polars, in the same shape as the SQL table, column-mapping each
    API's idioms to DataFusion's.
  • Common pitfalls caught in real agent sessions: `&`/`|`/`~` vs
    Python `and`/`or`/`not`, `lit()` wrapping, decimal/float literal
    interactions, `F.substring` vs `F.substr` arity, join-key
    disambiguation, absence of `how="cross"`, etc.
  • Idiomatic patterns: fluent chaining, window functions in place of
    correlated subqueries, semi/anti joins in place of
    EXISTS/NOT EXISTS, `aggregate().filter()` for HAVING, and variable
    assignment in place of CTEs.

The TPC-H example suite (`examples/tpch/`) is the reference corpus:
every query is written as idiomatic DataFrame code and validated by
answer-file comparison, and where the optimized logical plan differs
from the SQL version, the difference is documented in a comment. This
gives the `AGENTS.md` guidance a continuously verified ground truth.

How agents discover it

Discovery is layered. Each layer catches agents the prior ones
missed, so no single channel is load-bearing.

| Layer | Mechanism | Target audience |
| --- | --- | --- |
| 1 | `datafusion-init` writes a short pointer block into the user's project-root `AGENTS.md` / `CLAUDE.md` / `.cursorrules` | Any agent working in the user's repo; project-root files are loaded into context automatically |
| 2 | https://datafusion.apache.org/python/llms.txt published on the docs site (llmstxt.org convention) | Agents that auto-fetch `/llms.txt` from documentation sites |
| 3 | `AGENTS.md` inside the installed wheel, plus a pointer in `datafusion.__doc__` | Agents that introspect the installed package |
| 4 | Docs site page that `{include}`s `AGENTS.md` | Humans and WebSearch-capable agents browsing the docs |
| 5 | https://datafusion.apache.org/llms.txt upstream hub (separate PR to apache/datafusion) pointing at each subproject's `llms.txt` | Agents that land anywhere in the DataFusion ecosystem |
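For layer 2, an llms.txt following the llmstxt.org shape (an H1 title, a one-line blockquote summary, then link lists under H2 headings) might look like the fragment below; the summary wording and link target are placeholders, not the published file:

```markdown
# DataFusion Python

> Python bindings for Apache DataFusion. Prefer the idiomatic DataFrame
> API patterns in the linked guide over literal SQL translation.

## Docs

- [Idiomatic usage guide (AGENTS.md)](https://datafusion.apache.org/python/):
  canonical guide for humans and agents (placeholder path)
```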

Layer 1 is the highest-leverage item: in an empirical test, an agent
with `AGENTS.md` present in five different in-package locations still
missed all of them, because nothing pointed the agent at them from the
project root. The ~200-byte pointer solves that without embedding the
full guide in user repos.
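The pointer block itself could be as small as the following (wording is illustrative; the real block would be defined by `datafusion-init`):

```markdown
<!-- added by datafusion-init -->
## DataFusion Python

Before writing DataFusion Python code, read the idiomatic-usage guide
shipped in the package at `python/datafusion/AGENTS.md`, or fetch
https://datafusion.apache.org/python/llms.txt.
```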

Task list

  • PR 1a — AGENTS.md + package entry point
  • PR 1b — Module docstrings + doctest examples
  • PR 1c — datafusion-init project-root pointer
  • PR 2 — TPC-H reference SQL + plan comparison diagnostic
  • PR 3 — Rewrite TPC-H non-idiomatic queries
  • PR 4 — Docs site ({include} + llms.txt) + AI skills + CLAUDE.md
  • PR 5 — Upstream sync process documentation
  • PR 6 — apache/datafusion llms.txt hub (separate repo)

Detailed plan to follow as a comment.
