perf(upsert): prune destination scan via df partition-column ranges a… by paultmathew · Pull Request #3387 · apache/iceberg-python

paultmathew · 2026-05-20T19:43:50Z

Rationale for this change

Transaction.upsert builds its scan row_filter from join_cols
alone via create_match_filter. When the partition spec sources from
columns NOT in join_cols — a common pattern for append-only event
logs partitioned by time but keyed by composite IDs — two amplifying
problems fall out at scan plan time:

1. inclusive_projection collapses the predicate to AlwaysTrue
   against the partition spec, so partition pruning never fires and
   every file in the table is listed (related: Upsertion memory usage grows exponentially as table size grows #2138, Upserting large table extremely slow #2159, Upsert with 1M rows extremely slow due to create_match_filter and txn.delete() performance #3129).
2. Per-file metrics evaluation of the Or(And(EqualTo, EqualTo), …)
   predicate on UUID-shaped key columns can't prune either:
   per-file lower_bound / upper_bound stats span essentially the
   full UUID space, so the metrics evaluator passes every file.

The result is a full-table scan at every upsert. For tables with 10k+
partitions this is multi-minute / multi-gigabyte work per call.
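
To make the two problems concrete, here is a toy model of scan planning in plain Python. It is not pyiceberg code: `DataFile`, `plan_files`, and the two-character hex keys are hypothetical stand-ins, assuming each file carries one hourly partition value plus lower/upper bounds on the key column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    hour: int      # partition value, standing in for hours(created_at)
    key_min: str   # per-file lower bound of the key column
    key_max: str   # per-file upper bound of the key column


def plan_files(files, keys, hour_range=None):
    """Return the files a scan has to open for the given keys and optional partition range."""
    selected = []
    for f in files:
        # Partition pruning: only possible when the filter says something about the partition.
        if hour_range is not None and not (hour_range[0] <= f.hour <= hour_range[1]):
            continue
        # Metrics pruning: a file survives if any key falls inside its [min, max] bounds.
        if not any(f.key_min <= k <= f.key_max for k in keys):
            continue
        selected.append(f)
    return selected


# 1000 hourly files; random-looking keys make every file's bounds span almost everything.
files = [DataFile(hour=h, key_min="00", key_max="ff") for h in range(1000)]
keys = ["3a", "c4"]

key_only = plan_files(files, keys)                          # filter built from join_cols alone
with_range = plan_files(files, keys, hour_range=(990, 999)) # same filter plus a partition range

print(len(key_only), len(with_range))
```

In this model the key-only filter selects every file, while the same filter with a ten-hour partition range selects ten.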

What this PR does

Two complementary optimisations to Transaction.upsert:

1. Partition-range augmentation. New helper
   upsert_util.augment_filter_with_partition_ranges derives
   [min, max] predicates from df for every partition source column
   present in the frame and ANDs them into the row filter.
   inclusive_projection then projects each range through its
   partition transform (hours, days, months, years, identity,
   truncate) at scan-plan time, enabling manifest- and file-level
   pruning.
2. Column-projection for the insert-only path. When
   when_matched_update_all=False the consumer loop only reads
   join_cols off each destination batch (to build the per-batch
   source-side match filter). Passing selected_fields=tuple(join_cols)
   to DataScan lets the parquet reader prune wide non-key columns.
   The existing _projected_field_ids.union(extract_field_ids(...))
   keeps the partition-range predicate's columns readable.
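
The column-projection choice in item 2 can be sketched as follows. This is a plain-Python illustration, not the code in the diff: `choose_selected_fields` and `project` are hypothetical names, and `project` only imitates what a columnar reader does when it is handed a narrower field list.

```python
def choose_selected_fields(join_cols, when_matched_update_all):
    """Pick the scan projection: every column for updates, key columns only for insert-only."""
    if when_matched_update_all:
        return ("*",)
    return tuple(join_cols)


def project(row, selected_fields):
    """Toy stand-in for reader-side column pruning applied to a single row."""
    if selected_fields == ("*",):
        return dict(row)
    return {name: row[name] for name in selected_fields}


join_cols = ["conversation_id", "id"]
row = {"conversation_id": "c-1", "id": "e-1", "event": "msg", "log": '{"payload": "..."}'}

narrow = project(row, choose_selected_fields(join_cols, when_matched_update_all=False))
wide = project(row, choose_selected_fields(join_cols, when_matched_update_all=True))

print(sorted(narrow))  # ['conversation_id', 'id']
print(sorted(wide))    # ['conversation_id', 'event', 'id', 'log']
```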

Correctness guards

The augmentation skips per-column in three cases:

- The partition source column isn't present in df (no bound to derive).
- The column is entirely null in df (no min/max).
- The column has any null in df: a non-null GreaterThanOrEqual
  predicate would exclude NULL-partition destination rows whose
  (join_cols) may collide with null-partition source rows. Skipping
  pruning is preferred over emitting an unsafe predicate.

When min == max, an EqualTo is emitted instead of the range pair.
Multiple partition fields sourcing from the same column emit one
source-column range; inclusive_projection projects through each
partition field independently at scan time. Bucket and other
non-monotonic transforms return None from their project method on
inequalities — the projection contributes AlwaysTrue for them, no
harm.
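
A minimal sketch of the per-column guards and the min == max collapse, written against plain lists of dicts rather than Arrow tables. `derive_range_predicates` and the tuple encoding of predicates are illustrative assumptions, not the signature of augment_filter_with_partition_ranges.

```python
def derive_range_predicates(rows, partition_source_cols):
    """Return (op, column, value) tuples for each partition source column that is safe to bound."""
    predicates = []
    if not rows:
        return predicates
    # dict.fromkeys de-duplicates while keeping order: several partition fields
    # sourcing from the same column still yield a single source-column range.
    for col in dict.fromkeys(partition_source_cols):
        if col not in rows[0]:
            continue  # guard 1: column absent from the frame, no bound to derive
        values = [row[col] for row in rows]
        if any(v is None for v in values):
            continue  # guards 2 and 3: all-null or partially null column
        lo, hi = min(values), max(values)
        if lo == hi:
            predicates.append(("==", col, lo))
        else:
            predicates.append((">=", col, lo))
            predicates.append(("<=", col, hi))
    return predicates


rows = [
    {"created_at": 5, "region": None, "day": 3},
    {"created_at": 9, "region": "eu", "day": 3},
]
preds = derive_range_predicates(rows, ["created_at", "created_at", "region", "missing", "day"])
print(preds)
# [('>=', 'created_at', 5), ('<=', 'created_at', 9), ('==', 'day', 3)]
```

The `region` column is skipped because it contains a null, and `missing` because it is not in the frame.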

Are these changes tested?

Yes:

- 13 new unit tests in tests/table/test_upsert.py:
  - Pure-function tests for augment_filter_with_partition_ranges
    (unpartitioned, missing column, all-null, partial-null,
    single-value, range, multi-field-sharing-source).
  - End-to-end upsert semantics with partition spec NOT in
    join_cols, IN join_cols, and unpartitioned.
  - DataScan.plan_files() count assertion against a deterministically
    seeded table that defeats per-file metrics pruning; confirms the
    augmentation prunes vs the original predicate.
  - selected_fields projection assertions for both
    when_matched_update_all=True (legacy ('*',)) and =False
    (narrow join_cols-only).
  - End-to-end upsert with when_matched_update_all=True against a
    wide table to confirm column projection doesn't break the update
    path.
- 23 existing upsert tests still pass.

Smoke test — real Iceberg-on-S3 + Glue table

Run against a real Iceberg table representative of the workload this
optimisation targets.

Stack

- pyiceberg.catalog.glue.GlueCatalog
- AWS S3 warehouse, parquet data files
- Iceberg format v2

Target table

- Write mode: append-only event log
- unique_keys: [conversation_id, id] (composite UUID/string key)
- partition_spec: hours(created_at)
- Size at the test snapshot: ~10,450 data files, ~3.2 GiB total
- Hourly partitions over ~15 months of history
- Avg file size: ~0.32 MiB (post hourly OPTIMIZE compaction)
- Schema (6 columns):
  - conversation_id (string, UUID v4): key
  - id (string, UUID v4): key
  - event (string, short tag, ~10 B/row)
  - log (string, JSON payload, ~400 B/row median)
  - created_at (timestamp[us, UTC]): partition source
  - version (int32)

Source synthesis (two modes for the comparison):

- synthetic: random UUIDs; conversation_id drawn from a pool sized
  rows/30 so leading-key cardinality matches realistic parent-child
  distribution; created_at uniformly in [now − hours, now]. Keys
  don't overlap destination → metrics evaluator rejects every file at
  scan-plan time, so both projections return 0 files. Isolates the
  planning cost.
- from-destination: reads N recent rows from the destination via a
  created_at range scan, used as the source. Keys DO overlap →
  exercises the file-count reduction and column-read effect.

Results (read-only via DataScan.plan_files() and
to_arrow_batch_reader()):

| Scenario | Original plan | Augmented plan | Plan-time win | Wide read | Narrow read |
|---|---|---|---|---|---|
| 1 000 synthetic rows / 24 h | 0 / 10 452 (454.04 s) | 0 / 10 452 (3.11 s) | 146× | - | - |
| 1 000 dest-sampled rows / 24 h | (skipped, 7-min baseline) | 7 / 10 452 (99.93%) | - | 493 KiB / 1 000 rows | 68 KiB / 1 000 rows (86% smaller) |
| 10 000 dest-sampled rows / 168 h (catch-up flush) | (skipped) | 58 / 10 492 (99.4%) | - | - | - |

The 146× plan-time win is on a 1 000-disjunct predicate against a
~10k-file table; the original cost scales linearly with table file
count and predicate disjunct count, the augmented cost scales with the
source's created_at span instead.

The 86% byte reduction is dominated by skipping the log (JSON
payload) column at the parquet reader — that one column carries ~80%
of the row width on this table.
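
For reference, the headline ratios follow directly from the raw figures in the results table. The variable names below are only for illustration.

```python
# Plan-time speedup: original plan time divided by augmented plan time (seconds).
plan_speedup = 454.04 / 3.11

# Byte reduction of the narrow read relative to the wide read (KiB per 1 000 rows).
byte_reduction = 1 - 68 / 493

# Share of data files pruned: 1 - files planned / total files.
pruned_24h = 1 - 7 / 10_452
pruned_168h = 1 - 58 / 10_492

print(f"{plan_speedup:.0f}x")          # 146x
print(f"{byte_reduction * 100:.0f}%")  # 86%
print(f"{pruned_24h * 100:.2f}%")      # 99.93%
print(f"{pruned_168h * 100:.1f}%")     # 99.4%
```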

For a representative larger flush — 1.16M source rows spanning ~24 h
— extrapolating both wins reduces the destination-scan working set
from multiple GiB (which is OOM-territory on 8 GiB worker pods) down
to tens of MiB.

Are there any user-facing changes?

No API change. The optimisation is purely internal to
Transaction.upsert:

- The new helper is exported from pyiceberg.table.upsert_util for
  testability but isn't part of the public API.
- selected_fields=tuple(join_cols) is passed conditionally inside the
  method, with no signature change to Table.upsert or
  Transaction.upsert.

Why a range augmentation rather than reusing `_build_partition_predicate`?

Transaction._build_partition_predicate plus _determine_partitions
together could express the same intent — apply each partition
transform to df, take the distinct partition tuples, and emit
Or(And(EqualTo, …), …). It would prune marginally harder for
gappy sources (where [min, max] over-fetches the gap). I picked
the range approach over that combination for three reasons:

1. Predicate size. Exact partition match emits one disjunct per
   distinct partition value present in df. For a daily-partitioned
   table with a multi-month source the Or reaches hundreds of nodes,
   exactly the predicate-bloat shape that motivated Optimize upsert performance for large datasets #2943. The range
   approach is 2 nodes per partition column regardless of cardinality
   and downstream metrics-evaluator cost scales with that.
2. Reuse boundary. _determine_partitions is bound to the write
   path: it filters + combine_chunks per partition for the writer
   to consume. Reusing it for read-side planning either wastes that
   work or requires lifting the partition-key extraction into a shared
   helper, which is a separable, slightly larger refactor.
3. Idiomatic projection. GreaterThanOrEqual(source_col, min)
   feeds inclusive_projection source-side bounds and lets the
   existing transform.project(...) machinery rewrite them into
   partition-column predicates at scan time. That's the same
   projection path the rest of pyiceberg uses.

For temporally-dense sources (the dominant upsert pattern: a recent
batch of activity, no internal gaps) the two approaches prune the same
files. For gappy sources the exact-match approach prunes strictly
harder at the cost of a larger predicate. Happy to switch to the
_determine_partitions + _build_partition_predicate combo if
reviewers prefer that direction — or to factor a thin
partition_records_from_arrow_table helper out of
_determine_partitions so both sides can share it.
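
The dense-versus-gappy trade-off can be illustrated with a small stand-alone sketch for a daily-partitioned table. The helper names are hypothetical and the sets of dates stand in for the partitions each predicate would select.

```python
from datetime import date, timedelta


def exact_match_days(source_dates):
    """Partitions selected by one EqualTo disjunct per distinct source day."""
    return set(source_dates)


def range_days(source_dates):
    """Partitions selected by a single [min, max] range over the source days."""
    lo, hi = min(source_dates), max(source_dates)
    return {lo + timedelta(days=i) for i in range((hi - lo).days + 1)}


# Dense source: every day from 2026-01-01 to 2026-03-31 (90 days).
dense = [date(2026, 1, 1) + timedelta(days=i) for i in range(90)]
# Gappy source: only the first and last day of the same span.
gappy = [date(2026, 1, 1), date(2026, 3, 31)]

# Dense: both approaches touch the same 90 partitions, but the exact match
# needs 90 disjuncts where the range needs 2 bounds.
print(len(exact_match_days(dense)), len(range_days(dense)))   # 90 90
# Gappy: the exact match touches 2 partitions, the range over-fetches all 90.
print(len(exact_match_days(gappy)), len(range_days(gappy)))   # 2 90
```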

Related

Relates to Upsertion memory usage grows exponentially as table size grows #2138 (partition-pruning suggestion by @koenvo), Upserting large table extremely slow #2159
(umbrella slow-upsert tracker), Upsert with 1M rows extremely slow due to create_match_filter and txn.delete() performance #3129 (recent: create_match_filter - no-partition-prune).
Complementary to (closed-stale) Optimize upsert performance for large datasets #2943's "coarse match filter"
approach — that PR shrinks the row predicate itself; this one adds
partition pruning the row predicate can't trigger on its own. Both
can coexist.

Was generative AI tooling used to co-author this PR?

Yes — Claude (Cursor agent)

…nd project join_cols only

Two complementary optimizations to ``Transaction.upsert`` for tables whose partition spec sources from columns NOT in ``join_cols`` (a common pattern for append-only event logs partitioned by time but keyed by composite IDs):

1. Partition-range augmentation: ``upsert_util.augment_filter_with_partition_ranges`` derives ``[min, max]`` predicates from ``df`` for every partition source column present in the frame and ANDs them into the row filter built by ``create_match_filter``. ``inclusive_projection`` then projects each range through the partition transform at scan plan time, enabling manifest- and file-level pruning that the key-only filter can't trigger.
2. Column-projection for the insert-only path: when ``when_matched_update_all=False`` the consumer loop only reads ``join_cols`` off each destination batch. Passing ``selected_fields=tuple(join_cols)`` to ``DataScan`` lets the parquet reader prune wide non-key columns. The existing ``_projected_field_ids`` auto-union with row-filter columns keeps the partition-range predicate's data accessible.

Correctness guards skip the augmentation per-column when the source column is absent from df, entirely null, or partially null (a non-null range predicate would exclude NULL-partition destination rows whose keys may collide with the null-partition source rows).

Related to apache#2138, apache#2159, apache#3129. Complementary to (closed-stale) apache#2943's "coarse match filter" approach: that PR shrinks the row predicate itself; this one adds partition pruning the row predicate can't trigger on its own.

Co-authored-by: Cursor <cursoragent@cursor.com>

abnobdoss

Hey Paul, this looks very interesting! I've faced similar challenges around Iceberg upsert performance in Python, so I'm very keen to get some traction on clearing out some of the bottlenecks!

That said, I'm worried this approach treats the partition column as an implicit join key, even when it isn't part of join_cols - and in the current version that doesn't seem to be safe. As the scenario on test_upsert.py:1158 shows: if a target row has the same order_id as a source row but a different order_date (status correction, late edit, versioned overwrite), the augmentation prunes the target's file out of the scan and the upsert emits a duplicate insert instead of an update. The augmentation effectively requires the partition column to be stable for a given join key - a contract the upsert API doesn't surface anywhere today.

If this does land, I think the API contract change should be opt-in (a new flag) rather than a default behavior change, to avoid silently breaking existing users. Let me know if I'm missing something.

abnobdoss · 2026-05-25T23:12:42Z

+            [
+                {"order_id": 1, "order_date": datetime.date(2026, 1, 1), "order_type": "A"},
+                {"order_id": 2, "order_date": datetime.date(2026, 1, 1), "order_type": "A"},
+                {"order_id": 3, "order_date": datetime.date(2026, 1, 2), "order_type": "A"},


Does this still pass if this target row keeps order_id=3 but changes order_date to datetime.date(2026, 5, 1)?

The quirk with this change is that it seems to assume the partition column is part of the row identity. I’m not sure when that’s valid outside cases where the partition is derived from the join key, and in those cases I’d expect the join-key filter to already be sufficient for pruning.
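
The failure mode can be reproduced with a toy in-memory upsert. This is a sketch, not the test from the diff: the `upsert` function and the source row are assumptions, and the third target row uses the changed order_date from the question above.

```python
from datetime import date

target = [
    {"order_id": 1, "order_date": date(2026, 1, 1), "order_type": "A"},
    {"order_id": 2, "order_date": date(2026, 1, 1), "order_type": "A"},
    {"order_id": 3, "order_date": date(2026, 5, 1), "order_type": "A"},  # lives in a later partition
]
# Hypothetical source row: same key, different partition value.
source = [{"order_id": 3, "order_date": date(2026, 1, 2), "order_type": "B"}]


def upsert(target, source, join_col, prune_col=None):
    """Toy upsert: replace target rows whose key is found in the scanned subset, insert the rest."""
    visible = target
    if prune_col is not None:
        # Range augmentation: only scan target rows inside the source's [min, max].
        lo = min(r[prune_col] for r in source)
        hi = max(r[prune_col] for r in source)
        visible = [r for r in target if lo <= r[prune_col] <= hi]
    matched = {r[join_col] for r in visible} & {s[join_col] for s in source}
    kept = [r for r in target if r[join_col] not in matched]
    return kept + list(source)


full_scan = upsert(target, source, "order_id")
pruned = upsert(target, source, "order_id", prune_col="order_date")

print(sum(r["order_id"] == 3 for r in full_scan))  # 1: the row was updated
print(sum(r["order_id"] == 3 for r in pruned))     # 2: the match was pruned away, duplicate insert
```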

paultmathew · 2026-05-26T02:23:21Z

@abnobdoss Thanks for the review. I agree that this would break things for existing users. I'll update the PR to gate the augmentation on a structural check: augment only when every partition source column is in join_cols (directly or via a deterministic transform on a join column). When that holds, the partition value is a function of the join key, so the augmentation can never exclude a destination file holding the matching key — correctness is preserved by construction. When it doesn't, the row filter is left unchanged and tx.upsert falls back to today's full-table-scan behaviour. The optimisation still covers the workload that motivated the PR — high-cardinality state tables with bucket(N, key) + unique_keys=[key], where partition source ⊆ join_cols by construction. I'll swap the smoke-test scenario to that shape so the benchmarks reflect the case the augmentation is actually safe for.
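
A rough sketch of the structural check described above, assuming the partition spec has already been reduced to the names of its source columns. `can_augment` is a hypothetical helper name, not something in the current diff.

```python
def can_augment(partition_source_cols, join_cols):
    """True only when the table is partitioned and every partition source column is a join column."""
    sources = set(partition_source_cols)
    return bool(sources) and sources <= set(join_cols)


# bucket(N, key) with unique_keys=[key]: partition value is a function of the join key.
print(can_augment(["key"], ["key"]))                           # True
# hours(created_at) with unique_keys=[conversation_id, id]: fall back to the full scan.
print(can_augment(["created_at"], ["conversation_id", "id"]))  # False
```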

The underlying limitation — "find the row with key=X regardless of which partition holds it" — fundamentally requires either a full-table scan (today's tx.upsert behaviour without augmentation) or a delete-by-value primitive. Equality deletes by primary key are the only Iceberg-native mechanism I see that would let pyiceberg do partition-migration-correct upserts without scanning every file. I know the discussion at #3270 leans toward "don't write equality deletes, write deletion vectors instead", but deletion vectors are still position-keyed — they reduce the rewrite footprint once you've located a row, but they don't help with the lookup itself. So for the partition-migration class of upsert bug specifically, deletion vectors aren't the right tool.

@Fokko @kevinjqliu I take the maintainer concerns about equality-delete writes (compaction obligation, read-side merge cost, the rewrite-equality-to-position story) — but I think for the upsert workload the trade-off goes the other way. Happy to take a stab at an equality-delete write path in a separate PR, scoped initially to the cases tx.upsert could opt into, if there's appetite for revisiting that direction.

abnobdoss · 2026-05-26T02:42:11Z

Hey @paultmathew, thanks for the clarification. I’m curious whether your use case still has performance problems when the partition source column is included in join_cols, as I think you’re proposing here.

My assumption is that in that case create_match_filter already includes exact predicates on the partition source column, so partition pruning should have enough information to fire. If it still doesn’t, is the issue that the exact Or(And(EqualTo, ...), ...) predicate is too expensive/large for planning rather than that the partition signal is missing?

paultmathew · 2026-05-26T03:24:40Z

@abnobdoss You're right. My goal was to replace the Or(And(EqualTo, ...)) predicate with an O(1) range so the projection visitor would do less work, but I didn't account for the fact that inclusive_projection walks the full N-disjunct tree regardless of what I AND onto it — so the augmentation adds work rather than substituting for it, and create_match_filter's own per-disjunct projection is already producing the prune target.
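
A quick way to see why the AND adds work rather than replacing it: model the filter as nested tuples and count the nodes a visitor has to walk. This is a plain-Python model with made-up column values, not pyiceberg's expression classes.

```python
def or_chain(disjuncts):
    """Fold a list of expressions into a left-deep Or tree."""
    expr = disjuncts[0]
    for d in disjuncts[1:]:
        expr = ("or", expr, d)
    return expr


def count_nodes(expr):
    """Count every node a full tree walk visits (iterative, so deep trees are fine)."""
    stack, n = [expr], 0
    while stack:
        node = stack.pop()
        n += 1
        if node[0] in ("and", "or"):
            stack.extend(node[1:])
    return n


# Key filter: one And(EqualTo, EqualTo) disjunct per source row.
disjuncts = [("and", ("eq", "conversation_id", i), ("eq", "id", i)) for i in range(1000)]
key_filter = or_chain(disjuncts)

# Augmented filter: the range pair is ANDed on top, the Or tree is still there.
range_pair = ("and", (">=", "created_at", 0), ("<=", "created_at", 24))
augmented = ("and", range_pair, key_filter)

print(count_nodes(key_filter), count_nodes(augmented))  # 3999 4003
```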

I'll drop the partition augmentation and keep just the selected_fields=tuple(join_cols) change. The column-projection win is independent of partition spec and stands on its own.

The underlying gap that motivated the augmentation — large flushes against partition-source-not-in-join-cols tables hitting full-table scans — is a separate problem that needs a different primitive. Equality deletes by primary key are the cleanest Iceberg-native fit for it, so I'll follow up on that thread.

Will trim the diff and update the PR description shortly.

paultmathew marked this pull request as ready for review May 20, 2026 20:30

ndrluis mentioned this pull request May 24, 2026

Add upsert regression test for non-join partition changes #3409

Open

Fokko self-requested a review May 25, 2026 20:11

abnobdoss reviewed May 26, 2026

paultmathew wants to merge 1 commit into apache:main from paultmathew:perf/upsert-partition-pruning
