
[core] Support btree global index with embedded file metadata#7563

Open
lilei1128 wants to merge 2 commits into apache:master from lilei1128:global_index_with_manifest

Conversation

@lilei1128
Contributor

Purpose

Add with-file-meta option for btree global index to embed ManifestEntry
data directly in index files, enabling manifest-skip query planning.

Key changes:

  • New index type "btree_file_meta": SST file mapping fileName -> ManifestEntry bytes
  • BTreeWithFileMetaBuilder: builds btree key-index + btree_file_meta atomically
  • BTreeWithFileMetaReader: reads both indexes and returns FilePathGlobalIndexResult
  • DataEvolutionBatchScan: fast path using file-meta to build DataSplits without
    manifest reads, with staleness detection via fileIO.exists()
  • BTreeIndexOptions.BTREE_WITH_FILE_META: config to enable the feature

When enabled, query planning reads only:

  • btree key-index SST (for matching row IDs)
  • btree_file_meta SST (for ManifestEntry data)

See: https://cwiki.apache.org/confluence/display/PAIMON/PIP-41%3A+Introduce+FilePath+Global+Index+And+Optimizations+For+Lookup+In+Append+Table
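To illustrate the fast path described above, here is a stdlib-only sketch (the class and method names are hypothetical stand-ins, not the actual DataEvolutionBatchScan API, which works with Paimon's ManifestEntry/DataSplit types rather than strings):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hedged sketch: build splits straight from the embedded ManifestEntry bytes,
// skipping manifest reads entirely, and fall back when the index is stale.
public class ManifestSkipSketch {

    // Returns split payloads, or null to signal
    // "index is stale, fall back to manifest-based planning".
    static List<String> plan(Path warehouse, Map<String, byte[]> fileMeta) {
        List<String> splits = new ArrayList<>();
        for (Map.Entry<String, byte[]> e : fileMeta.entrySet()) {
            // Staleness detection: the index may still reference data files
            // that compaction has since removed (the PR uses fileIO.exists()).
            if (!Files.exists(warehouse.resolve(e.getKey()))) {
                return null;
            }
            splits.add(new String(e.getValue()));
        }
        return splits;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("warehouse");
        Files.createFile(dir.resolve("data-1.parquet"));

        Map<String, byte[]> fileMeta = new LinkedHashMap<>();
        fileMeta.put("data-1.parquet", "entry-1".getBytes());
        System.out.println(plan(dir, fileMeta)); // [entry-1]

        fileMeta.put("data-2.parquet", "entry-2".getBytes()); // file never written
        System.out.println(plan(dir, fileMeta)); // null -> fall back to manifests
    }
}
```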

Tests

CI

Contributor

@steFaiz left a comment


Hi, thanks for this PR!
I'm just wondering whether it's necessary to reuse the current BTree codebase. For example:

  1. In the btree index build topology, the current implementation decides the partition number by the record count per range and splits ranges by partition, which may not be suitable for your case.
  2. Also, it seems that the BTREE_WITH_FILE_META option will create a totally different index type compared to BTree.

@lilei1128
Contributor Author

lilei1128 commented Mar 31, 2026

> Hi, thanks for this PR! I'm just wondering whether it's necessary to reuse the current BTree codebase. For example:
>
>   1. In the btree index build topology, the current implementation decides the partition number by the record count per range and splits ranges by partition, which may not be suitable for your case.
>   2. Also, it seems that the BTREE_WITH_FILE_META option will create a totally different index type compared to BTree.

The "with-file-meta" option is NOT a completely different index type. It consists of:

  • btree key-index (key → rowId bitmap) - the same as the regular btree
  • btree_file_meta (fileName → ManifestEntry) - an additional metadata file

On the first question, you're right that the parallelism logic is designed for the key-index.
Currently, when parallelism > 1, each subtask writes a complete file-meta SST.
I've handled this with read-time deduplication using LinkedHashMap.putIfAbsent()
in BTreeWithFileMetaReader - duplicate entries are filtered out during the query. As a long-term solution, we could:

  1. Write the file-meta only in subtask 0 (write-time serialization)
  2. Or deduplicate at commit time (though this leaves orphan files)
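The read-time deduplication can be sketched as follows (stdlib-only; the class and method names are illustrative, not BTreeWithFileMetaReader's actual API):

```java
import java.util.AbstractMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// When parallelism > 1, every subtask wrote a complete file-meta SST, so the
// reader merges the scanned (fileName, ManifestEntry bytes) pairs and keeps
// only the first occurrence of each fileName via putIfAbsent.
public class FileMetaDedup {

    static Map<String, byte[]> merge(List<Map.Entry<String, byte[]>> scanned) {
        Map<String, byte[]> merged = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> e : scanned) {
            merged.putIfAbsent(e.getKey(), e.getValue()); // duplicates dropped
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, byte[]>> scanned = List.of(
                new AbstractMap.SimpleEntry<>("data-1.parquet", "entry-1".getBytes()),
                new AbstractMap.SimpleEntry<>("data-1.parquet", "entry-1".getBytes()),
                new AbstractMap.SimpleEntry<>("data-2.parquet", "entry-2".getBytes()));
        System.out.println(merge(scanned).keySet()); // [data-1.parquet, data-2.parquet]
    }
}
```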

To skip manifest reads, we need two capabilities:

  1. Predicate evaluation (key → matching rowIds)
  2. RowId → file info (to build DataSplit without manifest)

If we don't reuse BTree:
For capability 1, we would need to either:

  • Reimplement predicate evaluation logic (essentially duplicating BTree)
  • Use a Bloom filter, which can only filter at file level and cannot evaluate
    predicates precisely to get matching rowIds

For capability 2, alternatives like manifest caching still require manifest
reads - they just reduce the cost, not eliminate it.

That's why reusing BTree makes sense:

  • Capability 1: Inherited from existing BTree implementation
  • Capability 2: Added by embedding ManifestEntry in file-meta SST
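The two capabilities combine roughly like this sketch (illustrative stdlib types only: a BitSet stands in for the rowId bitmap from capability 1, and a TreeMap of first-rowId → fileName stands in for the row-layout information Paimon derives from the embedded metadata):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.TreeMap;

// Hedged sketch: map matching row IDs back to the data files that contain
// them, so splits can be built without touching any manifest.
public class TwoCapabilitySketch {

    static List<String> filesForRowIds(BitSet matchingRowIds,
                                       TreeMap<Long, String> firstRowIdToFile) {
        List<String> files = new ArrayList<>();
        for (int id = matchingRowIds.nextSetBit(0); id >= 0;
             id = matchingRowIds.nextSetBit(id + 1)) {
            // floorEntry finds the file whose row-ID range covers this id.
            String f = firstRowIdToFile.floorEntry((long) id).getValue();
            if (!files.contains(f)) {
                files.add(f);
            }
        }
        return files;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> layout = new TreeMap<>();
        layout.put(0L, "data-1.parquet");   // rows 0..99
        layout.put(100L, "data-2.parquet"); // rows 100..199

        BitSet hits = new BitSet(); // capability 1: rowIds matching the predicate
        hits.set(5);
        hits.set(150);
        System.out.println(filesForRowIds(hits, layout)); // [data-1.parquet, data-2.parquet]
    }
}
```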

@steFaiz
Contributor

steFaiz commented Mar 31, 2026

@lilei1128
Thank you for the explanation! I misunderstood it earlier. I thought it was a general-purpose metadata index that could accelerate any metadata access.

int partitionNum = Math.max((int) (range.count() / recordsPerRange), 1);
partitionNum = Math.min(partitionNum, maxParallelism);

// Pre-serialize ManifestEntries for file-meta index (if withFileMeta is enabled)
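The partition-count formula in the excerpt above can be checked with illustrative numbers (recordsPerRange and maxParallelism are configuration values; the figures below are made up for the example):

```java
// Worked example of: partitionNum = min(max(count / recordsPerRange, 1), maxParallelism)
public class PartitionNumExample {

    static int partitionNum(long rangeCount, long recordsPerRange, int maxParallelism) {
        int n = Math.max((int) (rangeCount / recordsPerRange), 1);
        return Math.min(n, maxParallelism);
    }

    public static void main(String[] args) {
        // 1.5M records / 100k per range = 15, capped at maxParallelism 8.
        System.out.println(partitionNum(1_500_000L, 100_000L, 8)); // 8
        // 50k records / 100k per range = 0, floored to at least 1 partition.
        System.out.println(partitionNum(50_000L, 100_000L, 8)); // 1
    }
}
```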
Contributor

@steFaiz steFaiz Mar 31, 2026


It seems that the key (i.e. the fileName) is just used for deduplication? Can I imagine this index as actually a Range to Collection<ManifestEntry> index?

Contributor Author


Yes, currently fileName is mainly for deduplication; the runtime path does not do fileName point lookups yet.

This is a follow-up optimization point.

@JingsongLi
Contributor

Hi @lilei1128, thanks for the contribution!

Do you have any benchmarks for this PR? I am curious about the performance comparison between the file-meta-based and rowId-based approaches in a big-data scenario.

@lilei1128
Contributor Author

lilei1128 commented Apr 1, 2026

> Hi @lilei1128, thanks for the contribution!
>
> Do you have any benchmarks for this PR? I am curious about the performance comparison between the file-meta-based and rowId-based approaches in a big-data scenario.

Hi, here are my test results on a Mac:
[screenshot: benchmark results]

Range queries perform better than point queries, and the improvement would be larger if the data were on OSS/S3.

@JingsongLi
Contributor

@lilei1128 Your conclusion is hard to follow. Please show only the testing method and results.

@lilei1128
Contributor Author

> @lilei1128 Your conclusion is hard to follow. Please show only the testing method and results.

Test methodology summary

Tables:

  • perf_test_no_index: no index
  • perf_test_btree: B-tree (rowId index)
  • perf_test_with_meta: B-tree (rowId index + file metadata)

Data load:

  • For each table, perform 30 separate writes (to simulate ~30 manifest files)
  • Each write inserts 50,000 records (1.5M rows per table in total)

Queries:

  • Warm-up: 10 queries using a fixed ID (12345) on all tables
  • Main test: 50 point lookups with random keys (WHERE id = ?)
  • Additional test: 3 range scans with different ranges (WHERE id BETWEEN ? AND ?)

Environment: MacBook

Test results:

  1. Random-key point lookups:
     No index (perf_test_no_index): 118.44ms
     Btree index (perf_test_btree): 112.06ms
     Btree + file meta (perf_test_with_meta): 115.94ms

  2. Range scans:
     No index (perf_test_no_index): 1246ms
     Btree index (perf_test_btree): 313ms
     Btree + file meta (perf_test_with_meta): 144ms

@JingsongLi
Contributor

Thanks @lilei1128 for the benchmark, but I think more convincing numbers are needed. If the improvement is only this much, it feels a bit uneconomical.

@lilei1128
Contributor Author

> Thanks @lilei1128 for the benchmark, but I think more convincing numbers are needed. If the improvement is only this much, it feels a bit uneconomical.

Actually, if tested on OSS with a sufficiently large data volume, the effect should be much more noticeable. My test was only run on a MacBook with a relatively small amount of data.
