
[core] Support btree global index with embedded file metadata#7563

Open
lilei1128 wants to merge 2 commits into apache:master from lilei1128:global_index_with_manifest

Conversation

@lilei1128
Contributor

Purpose

Add with-file-meta option for btree global index to embed ManifestEntry
data directly in index files, enabling manifest-skip query planning.

Key changes:

  • New index type "btree_file_meta": SST file mapping fileName -> ManifestEntry bytes
  • BTreeWithFileMetaBuilder: builds btree key-index + btree_file_meta atomically
  • BTreeWithFileMetaReader: reads both indexes and returns FilePathGlobalIndexResult
  • DataEvolutionBatchScan: fast path using file-meta to build DataSplits without
    manifest reads, with staleness detection via fileIO.exists()
  • BTreeIndexOptions.BTREE_WITH_FILE_META: config to enable the feature

When enabled, query planning reads only:

  • btree key-index SST (for matching row IDs)
  • btree_file_meta SST (for ManifestEntry data)

See: https://cwiki.apache.org/confluence/display/PAIMON/PIP-41%3A+Introduce+FilePath+Global+Index+And+Optimizations+For+Lookup+In+Append+Table
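To illustrate the fast path described above, here is a stdlib-only sketch (the class and method names are hypothetical stand-ins, not the actual DataEvolutionBatchScan API, which works with Paimon's ManifestEntry/DataSplit types rather than strings):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hedged sketch: build splits straight from the embedded ManifestEntry bytes,
// skipping manifest reads entirely, and fall back when the index is stale.
public class ManifestSkipSketch {

    // Returns split payloads, or null to signal
    // "index is stale, fall back to manifest-based planning".
    static List<String> plan(Path warehouse, Map<String, byte[]> fileMeta) {
        List<String> splits = new ArrayList<>();
        for (Map.Entry<String, byte[]> e : fileMeta.entrySet()) {
            // Staleness detection: the index may still reference data files
            // that compaction has since removed (the PR uses fileIO.exists()).
            if (!Files.exists(warehouse.resolve(e.getKey()))) {
                return null;
            }
            splits.add(new String(e.getValue()));
        }
        return splits;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("warehouse");
        Files.createFile(dir.resolve("data-1.parquet"));

        Map<String, byte[]> fileMeta = new LinkedHashMap<>();
        fileMeta.put("data-1.parquet", "entry-1".getBytes());
        System.out.println(plan(dir, fileMeta)); // [entry-1]

        fileMeta.put("data-2.parquet", "entry-2".getBytes()); // file never written
        System.out.println(plan(dir, fileMeta)); // null -> fall back to manifests
    }
}
```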

Tests

CI

Contributor

@steFaiz left a comment


Hi, thanks for this PR!
I'm just wondering whether it's necessary to reuse the current BTree codebase. For example:

  1. In the btree index build topology, the current implementation decides the partition number by the record count per range and splits ranges by partition, which may not be suitable for your case.
  2. Also, it seems that the BTREE_WITH_FILE_META option will create a totally different index type compared to BTree.

@lilei1128
Contributor Author

lilei1128 commented Mar 31, 2026

> Hi, thanks for this PR! I'm just wondering whether it's necessary to reuse the current BTree codebase. For example:
>
>   1. In the btree index build topology, the current implementation decides the partition number by the record count per range and splits ranges by partition, which may not be suitable for your case.
>   2. Also, it seems that the BTREE_WITH_FILE_META option will create a totally different index type compared to BTree.

The "with-file-meta" option is NOT a completely different index type. It consists of:

  • btree key-index (key → rowId bitmap) - the same as the regular btree
  • btree_file_meta (fileName → ManifestEntry) - an additional metadata file

On the first question, you're right that the parallelism logic is designed for the key-index.
Currently, when parallelism > 1, each subtask writes a complete file-meta SST.
I've handled this with read-time deduplication using LinkedHashMap.putIfAbsent()
in BTreeWithFileMetaReader - duplicate entries are filtered out during the query. As a long-term solution, we could:

  1. Write the file-meta only in subtask 0 (write-time serialization)
  2. Or deduplicate at commit time (though this leaves orphan files)
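The read-time deduplication can be sketched as follows (stdlib-only; the class and method names are illustrative, not BTreeWithFileMetaReader's actual API):

```java
import java.util.AbstractMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// When parallelism > 1, every subtask wrote a complete file-meta SST, so the
// reader merges the scanned (fileName, ManifestEntry bytes) pairs and keeps
// only the first occurrence of each fileName via putIfAbsent.
public class FileMetaDedup {

    static Map<String, byte[]> merge(List<Map.Entry<String, byte[]>> scanned) {
        Map<String, byte[]> merged = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> e : scanned) {
            merged.putIfAbsent(e.getKey(), e.getValue()); // duplicates dropped
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, byte[]>> scanned = List.of(
                new AbstractMap.SimpleEntry<>("data-1.parquet", "entry-1".getBytes()),
                new AbstractMap.SimpleEntry<>("data-1.parquet", "entry-1".getBytes()),
                new AbstractMap.SimpleEntry<>("data-2.parquet", "entry-2".getBytes()));
        System.out.println(merge(scanned).keySet()); // [data-1.parquet, data-2.parquet]
    }
}
```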

To skip manifest reads, we need two capabilities:

  1. Predicate evaluation (key → matching rowIds)
  2. RowId → file info (to build DataSplit without manifest)

If we don't reuse BTree:
For capability 1, we would need to either:

  • Reimplement predicate evaluation logic (essentially duplicating BTree)
  • Use a Bloom filter, which can only filter at file level and cannot evaluate
    predicates precisely to get matching rowIds

For capability 2, alternatives like manifest caching still require manifest
reads - they just reduce the cost, not eliminate it.

That's why reusing BTree makes sense:

  • Capability 1: Inherited from existing BTree implementation
  • Capability 2: Added by embedding ManifestEntry in file-meta SST
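The two capabilities combine roughly like this sketch (illustrative stdlib types only: a BitSet stands in for the rowId bitmap from capability 1, and a TreeMap of first-rowId → fileName stands in for the row-layout information Paimon derives from the embedded metadata):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.TreeMap;

// Hedged sketch: map matching row IDs back to the data files that contain
// them, so splits can be built without touching any manifest.
public class TwoCapabilitySketch {

    static List<String> filesForRowIds(BitSet matchingRowIds,
                                       TreeMap<Long, String> firstRowIdToFile) {
        List<String> files = new ArrayList<>();
        for (int id = matchingRowIds.nextSetBit(0); id >= 0;
             id = matchingRowIds.nextSetBit(id + 1)) {
            // floorEntry finds the file whose row-ID range covers this id.
            String f = firstRowIdToFile.floorEntry((long) id).getValue();
            if (!files.contains(f)) {
                files.add(f);
            }
        }
        return files;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> layout = new TreeMap<>();
        layout.put(0L, "data-1.parquet");   // rows 0..99
        layout.put(100L, "data-2.parquet"); // rows 100..199

        BitSet hits = new BitSet(); // capability 1: rowIds matching the predicate
        hits.set(5);
        hits.set(150);
        System.out.println(filesForRowIds(hits, layout)); // [data-1.parquet, data-2.parquet]
    }
}
```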

@steFaiz
Contributor

steFaiz commented Mar 31, 2026

@lilei1128
Thank you for the explanation! I misunderstood it earlier. I thought it was a general-purpose metadata index that could accelerate any metadata access.

int partitionNum = Math.max((int) (range.count() / recordsPerRange), 1);
partitionNum = Math.min(partitionNum, maxParallelism);

// Pre-serialize ManifestEntries for file-meta index (if withFileMeta is enabled)
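The partition-count formula in the excerpt above can be checked with illustrative numbers (recordsPerRange and maxParallelism are configuration values; the figures below are made up for the example):

```java
// Worked example of: partitionNum = min(max(count / recordsPerRange, 1), maxParallelism)
public class PartitionNumExample {

    static int partitionNum(long rangeCount, long recordsPerRange, int maxParallelism) {
        int n = Math.max((int) (rangeCount / recordsPerRange), 1);
        return Math.min(n, maxParallelism);
    }

    public static void main(String[] args) {
        // 1.5M records / 100k per range = 15, capped at maxParallelism 8.
        System.out.println(partitionNum(1_500_000L, 100_000L, 8)); // 8
        // 50k records / 100k per range = 0, floored to at least 1 partition.
        System.out.println(partitionNum(50_000L, 100_000L, 8)); // 1
    }
}
```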
Contributor

@steFaiz steFaiz Mar 31, 2026


It seems that the key (i.e. the fileName) is just used for deduplication? Can I imagine this index as actually a Range to Collection<ManifestEntry> index?

Contributor Author


Yes, currently fileName is mainly for deduplication; the runtime path does not do fileName point lookups yet.

This is a follow-up optimization point.

@JingsongLi
Contributor

Hi @lilei1128, thanks for the contribution!

Do you have any benchmarks for this PR? I am curious about the performance comparison between the file-meta-based and rowId-based approaches in a big-data scenario.

@lilei1128
Contributor Author

lilei1128 commented Apr 1, 2026

> Hi @lilei1128, thanks for the contribution!
>
> Do you have any benchmarks for this PR? I am curious about the performance comparison between the file-meta-based and rowId-based approaches in a big-data scenario.

Hi, here are my test results on a Mac:
[screenshot: benchmark results]

Range queries perform better than point queries, and the improvement would be larger if the data were on OSS/S3.

@JingsongLi
Contributor

@lilei1128 Your conclusion is hard to follow. Please show only the testing method and results.

@lilei1128
Contributor Author

> @lilei1128 Your conclusion is hard to follow. Please show only the testing method and results.

Test methodology summary

Tables:

  • perf_test_no_index: no index
  • perf_test_btree: B-tree (rowId index)
  • perf_test_with_meta: B-tree (rowId index + file metadata)

Data load:

  • For each table, perform 30 separate writes (to simulate ~30 manifest files)
  • Each write inserts 50,000 records (1.5M rows per table in total)

Queries:

  • Warm-up: 10 queries using a fixed ID (12345) on all tables
  • Main test: 50 point lookups with random keys (WHERE id = ?)
  • Additional test: 3 range scans with different ranges (WHERE id BETWEEN ? AND ?)

Environment: MacBook

Test results:

  1. Random-key point lookups:
     No index (perf_test_no_index): 118.44ms
     Btree index (perf_test_btree): 112.06ms
     Btree + file meta (perf_test_with_meta): 115.94ms

  2. Range scans:
     No index (perf_test_no_index): 1246ms
     Btree index (perf_test_btree): 313ms
     Btree + file meta (perf_test_with_meta): 144ms

@JingsongLi
Contributor

Thanks @lilei1128 for the benchmark, but I think more convincing numbers are needed. If the improvement is only this much, it feels a bit uneconomical.

@lilei1128
Contributor Author

> Thanks @lilei1128 for the benchmark, but I think more convincing numbers are needed. If the improvement is only this much, it feels a bit uneconomical.

Actually, if tested on OSS with a sufficiently large data volume, the effect should be much more noticeable. My test was only run on a MacBook with a relatively small amount of data.
