[python][DISCUSS] supports chunk shuffle in file meta layer. by steFaiz · Pull Request #8010 · apache/paimon

steFaiz · 2026-05-28T07:54:05Z

Purpose

Background

This PR originates from our inner cases: when using paimon table as dataloaders, engine training always needs deterministically shuffled data rather than sequential data.
It's highly expensive to shuffle the entire dataset for each training, so a common way is:

When loading data into paimon, perform a global row-level shuffle
During training, adapt chunk shuffle rather than row-level shuffle.
More sophisticated shuffle, e.g. read several chunks and do row-level shuffle among them.

We can provide 1 & 2. This PR introduces a chunk shuffle for pypaimon. The mechanism can be illustrated as below:

we logically divide data files into chunks. The most simple case:
```
Chunk 1 {
    file: file1,
   range: [0, 100]
}
```
This means the chunk is in file1, covering [0, 100] rows
deterministically shuffle chunks
wraps each chunk to a single SliceSplit

Usage

The usage is simple:

readBuilder.new_scan().with_chunk_shuffle(chunk_size, seed).with_shard(idx, total_worker);

More

A random-access optimized file format. I think the ROW format introduced in java module is nice. I've tested the random access of several file formats （On local Disk）:

Chunk Size row (ms) parquet (ms) lance (ms)

100 42 440 41

500 94 689 29

1000 145 944 36

Note that lance can only be stored on ObjectStore. DFS is not supported now.
During training, callers can pre-fetch next several chunks to read. So the read latency actually only influence the first batch.

Tests

UnitTests

steFaiz · 2026-05-28T08:12:52Z

@JingsongLi @XiaoHongbo-Hope This PR is more of a discussion. I'd really appreciate any feedback or suggestions you might have. Thanks!

JingsongLi · 2026-05-28T09:06:27Z

I am okay with this feature. Is there any other system with a similar design for API hierarchy?

steFaiz · 2026-05-28T09:49:19Z

Is there any other system with a similar design for API hierarchy?

Yes. The closest precedent I found is Petastorm, an ML data reader for Parquet datasets. Its reader API exposes shuffle and distributed sharding at the same reader-construction level: seed, shuffle_row_groups, shuffle_rows, cur_shard, and shard_count. The main difference is that Petastorm shuffles existing Parquet row groups, while this PR derives logical row-count chunks from Paimon manifest entries/files and then converts them back to Splits.

Similar API hierarchies also exist in ML input systems such as Hugging Face IterableDataset, NVIDIA DALI readers, WebDataset, Mosaic Streaming, Ray Data, TensorFlow tf.data, and PyTorch/TorchData. They commonly expose deterministic shuffle options and distributed shard/rank options in the dataset/reader/input-pipeline layer rather than rewriting the physical dataset.

I think the main advantage of paimon is:

In our current implementation, each chunk is a single file, including blobs and structured cols(metadatas). Considering a 500,000,000 image dataset, chunk size is 100, there would be 5 million files. In paimon, we set the target blob size as 1G, the total file num is less than 30000
The table arch makes it much easier to manage.
We could easily deal with multiple datasets by importing each dataset to a single partition.

JingsongLi · 2026-05-28T10:09:44Z

I found Hugging Face IterableDataset.shuffle：

  def shuffle(
      self,
      seed: Optional[int] = None,
      generator: Optional[np.random.Generator] = None,
      buffer_size: int = 1000
  ) -> "IterableDataset"

You can continue, I think this is a good direction.

[python] supports chunk shuffle in file meta layer.

f5bf967

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[python][DISCUSS] supports chunk shuffle in file meta layer.#8010

[python][DISCUSS] supports chunk shuffle in file meta layer.#8010
steFaiz wants to merge 1 commit into
apache:masterfrom
steFaiz:chunk_shuffle

steFaiz commented May 28, 2026 •

edited

Loading

Uh oh!

steFaiz commented May 28, 2026

Uh oh!

JingsongLi commented May 28, 2026

Uh oh!

steFaiz commented May 28, 2026

Uh oh!

JingsongLi commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Chunk Size	row (ms)	parquet (ms)	lance (ms)
100	42	440	41
500	94	689	29
1000	145	944	36

Conversation

steFaiz commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Background

Usage

More

Tests

Uh oh!

steFaiz commented May 28, 2026

Uh oh!

JingsongLi commented May 28, 2026

Uh oh!

steFaiz commented May 28, 2026

Uh oh!

JingsongLi commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

steFaiz commented May 28, 2026 •

edited

Loading