
Make it easier for agents to generate datafusion-python code #1394

@timsaucer

Description


Problem

More and more users reach for LLMs to generate DataFusion Python code.
Today, agents are excellent at writing SQL but struggle to produce
idiomatic DataFrame API code: they either transliterate SQL literally
or invent patterns that don't match the library's grain. Nothing the
project currently ships reliably surfaces idiomatic-usage guidance to
the agent at the moment it is writing code.

Goals

  1. Establish a single, authoritative guide for writing idiomatic
    DataFusion Python code, usable by both humans and agents.
  2. Make that guide discoverable through every channel agents actually
    use — not just the channels we wish they used.
  3. Validate the guide against a reference corpus (TPC-H) so it stays
    honest as the API evolves.
  4. Extend the same pattern across the wider DataFusion family
    (Ballista, Comet, Ray, etc.) via an upstream llms.txt hub.

Where idiomatic code is defined

Single source of truth: `python/datafusion/AGENTS.md`.

This one file — kept inside the repo, shipped inside the wheel, and
included verbatim on the docs site — is the canonical guide. It
contains:

  • Core abstractions (`SessionContext` / `DataFrame` / `Expr` /
    `functions`) and import conventions.
  • A quick-start example that works end-to-end.
  • A SQL-to-DataFrame reference table (for users who think in SQL first).
  • Migration sections for users coming from Spark, Pandas, and
    Polars, in the same shape as the SQL table, column-mapping each
    API's idioms to DataFusion's.
  • Common pitfalls caught in real agent sessions: `&`/`|`/`~` vs
    Python `and`/`or`/`not`, `lit()` wrapping, decimal/float literal
    interactions, `F.substring` vs `F.substr` arity, join-key
    disambiguation, absence of `how="cross"`, etc.
  • Idiomatic patterns: fluent chaining, window functions in place of
    correlated subqueries, semi/anti joins in place of
    EXISTS/NOT EXISTS, `aggregate().filter()` for HAVING, and variable
    assignment in place of CTEs.

The TPC-H example suite (`examples/tpch/`) is the reference corpus:
every query is written as idiomatic DataFrame code and validated by
answer-file comparison, and where the optimized logical plan differs
from the SQL version, the difference is documented in a comment. This
gives the `AGENTS.md` guidance a continuously verified ground truth.

How agents discover it

Discovery is layered. Each layer catches agents the prior ones
missed, so no single channel is load-bearing.

| Layer | Mechanism | Target audience |
| --- | --- | --- |
| 1 | `datafusion-init` writes a short pointer block into the user's project-root `AGENTS.md` / `CLAUDE.md` / `.cursorrules` | Any agent working in the user's repo; project-root files are loaded into context automatically |
| 2 | https://datafusion.apache.org/python/llms.txt published on the docs site (llmstxt.org convention) | Agents that auto-fetch `/llms.txt` from documentation sites |
| 3 | `AGENTS.md` inside the installed wheel, plus a pointer in `datafusion.__doc__` | Agents that introspect the installed package |
| 4 | Docs site page that `{include}`s `AGENTS.md` | Humans and WebSearch-capable agents browsing the docs |
| 5 | https://datafusion.apache.org/llms.txt upstream hub (separate PR to apache/datafusion) pointing at each subproject's `llms.txt` | Agents that land anywhere in the DataFusion ecosystem |
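For layer 2, an llms.txt following the llmstxt.org shape (an H1 title, a one-line blockquote summary, then link lists under H2 headings) might look like the fragment below; the summary wording and link target are placeholders, not the published file:

```markdown
# DataFusion Python

> Python bindings for Apache DataFusion. Prefer the idiomatic DataFrame
> API patterns in the linked guide over literal SQL translation.

## Docs

- [Idiomatic usage guide (AGENTS.md)](https://datafusion.apache.org/python/):
  canonical guide for humans and agents (placeholder path)
```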

Layer 1 is the highest-leverage item: in an empirical test, an agent
with `AGENTS.md` present in five different in-package locations still
missed all of them, because nothing pointed the agent at them from the
project root. The ~200-byte pointer solves that without embedding the
full guide in user repos.
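The pointer block itself could be as small as the following (wording is illustrative; the real block would be defined by `datafusion-init`):

```markdown
<!-- added by datafusion-init -->
## DataFusion Python

Before writing DataFusion Python code, read the idiomatic-usage guide
shipped in the package at `python/datafusion/AGENTS.md`, or fetch
https://datafusion.apache.org/python/llms.txt.
```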

Task list

  • PR 1a — AGENTS.md + package entry point
  • PR 1b — Module docstrings + doctest examples
  • PR 1c — datafusion-init project-root pointer
  • PR 2 — TPC-H reference SQL + plan comparison diagnostic
  • PR 3 — Rewrite TPC-H non-idiomatic queries
  • PR 4 — Docs site ({include} + llms.txt) + AI skills + CLAUDE.md
  • PR 5 — Upstream sync process documentation
  • PR 6 — apache/datafusion llms.txt hub (separate repo)

Detailed plan to follow as a comment.
