[python][DISCUSS] supports chunk shuffle in file meta layer. #8010
steFaiz wants to merge 1 commit into
Conversation
@JingsongLi @XiaoHongbo-Hope This PR is more of a discussion. I'd really appreciate any feedback or suggestions you might have. Thanks!
I am okay with this feature. Is there any other system with a similar design for API hierarchy?
Yes. The closest precedent I found is Petastorm, an ML data reader for Parquet datasets. Its reader API exposes shuffle and distributed sharding at the same reader-construction level: seed, shuffle_row_groups, shuffle_rows, cur_shard, and shard_count. The main difference is that Petastorm shuffles existing Parquet row groups, while this PR derives logical row-count chunks from Paimon manifest entries/files and then converts them back to Splits. Similar API hierarchies also exist in ML input systems such as Hugging Face IterableDataset, NVIDIA DALI readers, WebDataset, Mosaic Streaming, Ray Data, TensorFlow tf.data, and PyTorch/TorchData. They commonly expose deterministic shuffle options and distributed shard/rank options in the dataset/reader/input-pipeline layer rather than rewriting the physical dataset. I think the main advantage of Paimon is:
I found Hugging Face IterableDataset.shuffle. You can continue; I think this is a good direction.
Purpose
Background
This PR originates from our internal use cases: when Paimon tables feed dataloaders, training needs deterministically shuffled data rather than sequential data.
Shuffling the entire dataset for each training run is very expensive, so a common approach is:
We can provide 1 & 2. This PR introduces a chunk shuffle for pypaimon. The mechanism is illustrated below:
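As a rough sketch of the idea (not the code in this PR), chunk shuffle can be thought of as three steps: cut each data file into logical row-count chunks, shuffle the chunk list with a fixed seed, and give each training rank a disjoint slice of the shuffled list. The names `Chunk`, `plan_chunks`, and `shuffle_and_shard` are hypothetical and exist only for this illustration; the real implementation works on Paimon manifest entries and converts chunks back to Splits.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Chunk:
    """Hypothetical logical slice of rows inside one data file."""
    file_name: str
    row_start: int  # inclusive
    row_end: int    # exclusive


def plan_chunks(file_row_counts: Dict[str, int], chunk_rows: int) -> List[Chunk]:
    """Cut every file into chunks of at most `chunk_rows` rows, using metadata only."""
    chunks = []
    for name, count in file_row_counts.items():
        for start in range(0, count, chunk_rows):
            chunks.append(Chunk(name, start, min(start + chunk_rows, count)))
    return chunks


def shuffle_and_shard(chunks: List[Chunk], seed: int, rank: int, world_size: int) -> List[Chunk]:
    """Shuffle deterministically with `seed`, then keep every `world_size`-th chunk."""
    order = list(chunks)
    random.Random(seed).shuffle(order)  # same seed -> same order on every rank
    return order[rank::world_size]


# Assumed row counts; in practice these would come from file metadata.
files = {"a.parquet": 250, "b.parquet": 100}
all_chunks = plan_chunks(files, chunk_rows=100)
shard0 = shuffle_and_shard(all_chunks, seed=42, rank=0, world_size=2)
shard1 = shuffle_and_shard(all_chunks, seed=42, rank=1, world_size=2)
```

Because every rank shuffles the same list with the same seed, the shards are disjoint and together cover every chunk, and no data is rewritten on disk.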
[Figure: SliceSplit]
Usage
The usage is simple:
More
A random-access-optimized file format: I think the ROW format introduced in the Java module is nice. I've tested random access on several file formats (on local disk):
Note that Lance can currently only be stored on an object store; DFS is not supported yet.
During training, callers can prefetch the next several chunks, so read latency only affects the first batch.
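The prefetching idea can be sketched as below, assuming the caller supplies some `read_chunk` function that loads one chunk. Both `prefetching_reader` and `read_chunk` are hypothetical names for this illustration and are not part of pypaimon. The generator keeps up to `depth` reads in flight ahead of the chunk being consumed and yields results in the original chunk order.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def prefetching_reader(
    chunks: Iterable[T],
    read_chunk: Callable[[T], R],
    depth: int = 2,
) -> Iterator[R]:
    """Yield read_chunk(c) for each chunk in order, reading up to `depth` chunks ahead.

    `depth` must be at least 1.
    """
    with ThreadPoolExecutor(max_workers=depth) as pool:
        pending = deque()
        for chunk in chunks:
            pending.append(pool.submit(read_chunk, chunk))
            # Once more than `depth` reads are queued, hand the oldest to the caller.
            if len(pending) > depth:
                yield pending.popleft().result()
        # Drain the reads that are still queued after the input is exhausted.
        while pending:
            yield pending.popleft().result()


# Stand-in reader for the example; a real one would read rows from storage.
results = list(prefetching_reader(range(5), lambda i: i * 2, depth=2))
```

With this shape, only the first chunk is waited on in full; later chunks are typically already loading while the training step for the previous one runs.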
Tests
Unit tests