Problem
More and more users reach for LLMs to generate DataFusion Python code.
Today, agents are excellent at writing SQL but struggle to produce
idiomatic DataFrame API code — they either transliterate SQL literally
or invent patterns that don't match the library's grain. Nothing the
project currently ships reliably surfaces idiomatic guidance to the
agent at the moment it's writing code.
Goals
- Establish a single, authoritative guide for writing idiomatic
DataFusion Python code, usable by both humans and agents.
- Make that guide discoverable through every channel agents actually
use — not just the channels we wish they used.
- Validate the guide against a reference corpus (TPC-H) so it stays
honest as the API evolves.
- Extend the same pattern across the wider DataFusion family
(Ballista, Comet, Ray, etc.) via an upstream `llms.txt` hub.
Where idiomatic code is defined
Single source of truth: `python/datafusion/AGENTS.md`.
This one file — kept inside the repo, shipped inside the wheel, and
included verbatim on the docs site — is the canonical guide. It
contains:
- Core abstractions (`SessionContext` / `DataFrame` / `Expr` /
  `functions`) and import conventions.
- A quick-start example that works end-to-end.
- SQL-to-DataFrame reference table (for users who think in SQL first).
- Migration sections for users coming from Spark, Pandas, and
Polars — same shape as the SQL table, column-mapping each API's
idioms to DataFusion's.
- Common pitfalls caught in real agent sessions: `&`/`|`/`~` vs
  Python `and`/`or`/`not`, `lit()` wrapping, decimal/float literal
  interactions, `F.substring` vs `F.substr` arity, join-key
  disambiguation, absence of `how="cross"`, etc.
- Idiomatic patterns: fluent chaining, window functions in place of
  correlated subqueries, semi/anti joins in place of
  `EXISTS`/`NOT EXISTS`, `aggregate().filter()` for `HAVING`, and
  variable assignment for CTEs.
The TPC-H example suite (`examples/tpch/`) is the reference
corpus: every query is written as idiomatic DataFrame code,
validated by answer-file comparison, and where the optimized logical
plan differs from the SQL version, the difference is documented in a
comment. This gives the `AGENTS.md` guidance a continuously verified
ground truth.
How agents discover it
Discovery is layered. Each layer catches agents the prior ones
missed, so no single channel is load-bearing.
| Layer | Mechanism | Target audience |
|-------|-----------|-----------------|
| 1 | `datafusion-init` writes a short pointer block into the user's project-root `AGENTS.md` / `CLAUDE.md` / `.cursorrules` | Any agent working in the user's repo — project-root files are loaded into context automatically |
| 2 | https://datafusion.apache.org/python/llms.txt published on the docs site (llmstxt.org convention) | Agents that auto-fetch `/llms.txt` from documentation sites |
| 3 | `AGENTS.md` inside the installed wheel + pointer in `datafusion.__doc__` | Agents that introspect the installed package |
| 4 | Docs site page that `{include}`s `AGENTS.md` | Humans and WebSearch-capable agents browsing the docs |
| 5 | https://datafusion.apache.org/llms.txt upstream hub (separate PR to `apache/datafusion`) pointing at each subproject's `llms.txt` | Agents that land anywhere in the DataFusion ecosystem |
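For layer 2, the llmstxt.org convention prescribes a markdown file with an H1 title, a blockquote summary, and H2 sections of links. The following is an invented illustration of what the published file could look like (the summary wording and the linked page URL are assumptions, not the shipped content):

```markdown
# DataFusion Python

> Python bindings for Apache DataFusion, an extensible SQL query
> engine. Before generating DataFrame API code, read the idiomatic
> usage guide linked below.

## Docs

- [Idiomatic usage guide (AGENTS.md)](https://datafusion.apache.org/python/):
  core abstractions, SQL-to-DataFrame mapping, common pitfalls
```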
Layer 1 is the highest-leverage item — in an empirical test, an agent
with `AGENTS.md` present in five different in-package locations still
missed all of them because nothing pointed the agent at them from the
project root. The ~200-byte pointer solves that without embedding the
full guide in user repos.
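The pointer write itself is only a few lines. A sketch of what `datafusion-init` might do, assuming hypothetical wording for the block and a hypothetical helper name `write_pointer` (neither is the shipped implementation):

```python
from pathlib import Path

# Hypothetical pointer text; the real wording ships with datafusion-init.
POINTER_BLOCK = """\
<!-- added by datafusion-init -->
## DataFusion Python
Before writing DataFusion Python code, read the idiomatic-usage guide:
`AGENTS.md` inside the installed `datafusion` package, or
https://datafusion.apache.org/python/llms.txt
"""


def write_pointer(project_root: Path) -> bool:
    """Append the pointer block to the project-root AGENTS.md.

    Returns True if the file was modified, False if the block was
    already present (so re-running the tool is idempotent).
    """
    target = project_root / "AGENTS.md"
    existing = target.read_text() if target.exists() else ""
    if POINTER_BLOCK in existing:
        return False
    sep = "\n" if existing and not existing.endswith("\n") else ""
    target.write_text(existing + sep + POINTER_BLOCK)
    return True
```

Idempotence matters here: the tool may be run in repos that already carry the pointer, and it must not stack duplicate blocks.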
Task list
- `AGENTS.md` + package entry point
- `datafusion-init` project-root pointer
- Docs site (`{include}` + `llms.txt`) + AI skills + CLAUDE.md
- `apache/datafusion` `llms.txt` hub (separate repo)

Detailed plan to follow as a comment.