
[python][DISCUSS] supports chunk shuffle in file meta layer. #8010

Open
steFaiz wants to merge 1 commit into apache:master from steFaiz:chunk_shuffle


Contributor

steFaiz commented May 28, 2026

Purpose

Background

This PR originates from our internal use cases: when a Paimon table backs a dataloader, training needs deterministically shuffled data rather than sequential data.
Shuffling the entire dataset for every training run is very expensive, so a common approach is:

  1. When loading data into Paimon, perform a global row-level shuffle.
  2. During training, adopt chunk shuffle rather than row-level shuffle.
  3. Optionally apply a more sophisticated shuffle, e.g. read several chunks and do a row-level shuffle among them.
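Step 3 is not part of this PR, but the idea can be sketched as follows. This is a minimal illustration, assuming chunks arrive as in-memory sequences of rows; the function name `shuffle_rows_across_chunks` and the `window` parameter are hypothetical and do not exist in pypaimon.

```python
import random
from typing import Iterable, Iterator, List, Sequence


def shuffle_rows_across_chunks(
    chunks: Iterable[Sequence[int]], window: int, seed: int
) -> Iterator[int]:
    """Shuffle rows within groups of `window` consecutive chunks (hypothetical helper)."""
    rng = random.Random(seed)  # seeded, so the row order is reproducible
    group: List[int] = []
    chunks_in_group = 0
    for chunk in chunks:
        group.extend(chunk)
        chunks_in_group += 1
        if chunks_in_group == window:
            rng.shuffle(group)
            yield from group
            group, chunks_in_group = [], 0
    if group:  # trailing group with fewer than `window` chunks
        rng.shuffle(group)
        yield from group


# Five chunks of four rows each; rows are mixed only within pairs of chunks.
chunks = [list(range(i * 4, i * 4 + 4)) for i in range(5)]
rows = list(shuffle_rows_across_chunks(chunks, window=2, seed=7))
```

Memory use is bounded by `window` chunks, which is the usual trade-off against a full row-level shuffle.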

We can provide 1 & 2. This PR introduces chunk shuffle for pypaimon. The mechanism is illustrated below:

[image: chunk shuffle mechanism]
  1. We logically divide data files into chunks. The simplest case:
    Chunk 1 {
        file: file1,
        range: [0, 100]
    }
    This means the chunk is in file1 and covers rows [0, 100].
  2. Deterministically shuffle the chunks.
  3. Wrap each chunk in a single SliceSplit.
Usage

The usage is simple:

readBuilder.new_scan().with_chunk_shuffle(chunk_size, seed).with_shard(idx, total_worker)
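How the shuffled chunks are split across workers is not spelled out here. One plausible reading, shown purely as an assumption, is that every worker derives the same permutation from the shared seed and then takes a strided slice of it. The helper `chunks_for_worker` below is hypothetical and works on chunk indices only.

```python
import random
from typing import List


def chunks_for_worker(num_chunks: int, seed: int, idx: int, total_worker: int) -> List[int]:
    """Chunk indices read by worker `idx` (assumed strided assignment)."""
    order = list(range(num_chunks))
    random.Random(seed).shuffle(order)  # same seed gives the same order on every worker
    return order[idx::total_worker]     # every total_worker-th chunk, starting at idx


shards = [chunks_for_worker(10, seed=42, idx=i, total_worker=3) for i in range(3)]
```

With a shared seed the shards are disjoint and together cover every chunk, without any coordination between workers.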

More

  1. A random-access optimized file format. I think the ROW format introduced in the Java module is nice. I tested random access with several file formats (on local disk):

    Chunk Size   row (ms)   parquet (ms)   lance (ms)
    100          42         440            41
    500          94         689            29
    1000         145        944            36

    Note that lance can currently only be stored on ObjectStore; DFS is not supported yet.

  2. During training, callers can prefetch the next several chunks, so read latency only affects the first batch.
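The prefetching in point 2 could look roughly like this on the caller side. It is a sketch using only the standard library; `prefetching_reader`, `read_chunk`, and `depth` are hypothetical names, and `read_chunk` stands in for whatever call actually reads one chunk.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, TypeVar

C = TypeVar("C")
B = TypeVar("B")


def prefetching_reader(
    chunks: Iterable[C], read_chunk: Callable[[C], B], depth: int = 2
) -> Iterator[B]:
    """Yield read results in chunk order while later reads run in the background."""
    with ThreadPoolExecutor(max_workers=depth) as pool:
        pending = deque()
        for chunk in chunks:
            pending.append(pool.submit(read_chunk, chunk))
            if len(pending) > depth:
                # The oldest read has had time to finish while newer ones were queued.
                yield pending.popleft().result()
        while pending:  # drain the reads still in flight
            yield pending.popleft().result()


# Stand-in reader: pretend chunk c holds three rows.
batches = list(prefetching_reader(range(5), lambda c: [c * 10 + r for r in range(3)]))
```

While the training loop consumes one batch, up to `depth` later reads are already in progress, so only the first batch waits for the full read latency.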

Tests

Unit tests

Contributor Author

steFaiz commented May 28, 2026

@JingsongLi @XiaoHongbo-Hope This PR is more of a discussion. I'd really appreciate any feedback or suggestions you might have. Thanks!

@JingsongLi
Contributor

I am okay with this feature. Is there any other system with a similar design for API hierarchy?

Contributor Author

steFaiz commented May 28, 2026

Is there any other system with a similar design for API hierarchy?

Yes. The closest precedent I found is Petastorm, an ML data reader for Parquet datasets. Its reader API exposes shuffle and distributed sharding at the same reader-construction level: seed, shuffle_row_groups, shuffle_rows, cur_shard, and shard_count. The main difference is that Petastorm shuffles existing Parquet row groups, while this PR derives logical row-count chunks from Paimon manifest entries/files and then converts them back to Splits.

Similar API hierarchies also exist in ML input systems such as Hugging Face IterableDataset, NVIDIA DALI readers, WebDataset, Mosaic Streaming, Ray Data, TensorFlow tf.data, and PyTorch/TorchData. They commonly expose deterministic shuffle options and distributed shard/rank options in the dataset/reader/input-pipeline layer rather than rewriting the physical dataset.

I think the main advantages of Paimon are:

  1. In our current implementation, each chunk is a single file that includes blobs and structured columns (metadata). For a dataset of 500,000,000 images with a chunk size of 100, that would be 5 million files. In Paimon, with the target blob size set to 1G, the total file count is less than 30,000.
  2. The table architecture makes the data much easier to manage.
  3. We could easily handle multiple datasets by importing each dataset into its own partition.
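The file counts in point 1 can be checked with a few lines of arithmetic. The one-file-per-chunk figure follows directly from the numbers above. The Paimon figure depends on the average image size, which is not stated here, so the 50 KB used below is a purely hypothetical value chosen to show how the count would be derived, and only blob files are counted.

```python
rows = 500_000_000
chunk_size = 100

# One file per chunk: one file for every 100 rows.
files_one_per_chunk = rows // chunk_size

# HYPOTHETICAL average image size; not a number from this discussion.
avg_row_bytes = 50 * 1000
target_file_bytes = 1024 ** 3  # 1G target blob size, read as 1 GiB

total_bytes = rows * avg_row_bytes
# Ceiling division: a partially filled last file still counts as a file.
files_paimon = (total_bytes + target_file_bytes - 1) // target_file_bytes
```

Under that assumed image size the result stays below the 30,000 files mentioned above; larger images would raise the count proportionally.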

@JingsongLi
Contributor

I found Hugging Face IterableDataset.shuffle:

  def shuffle(
      self,
      seed: Optional[int] = None,
      generator: Optional[np.random.Generator] = None,
      buffer_size: int = 1000
  ) -> "IterableDataset"

You can continue, I think this is a good direction.
