[python][daft] Make Daft Paimon read source serializable by kerwin-zk · Pull Request #8029 · apache/paimon

kerwin-zk · 2026-05-29T02:43:08Z

Purpose

Make the Daft Paimon read source serializable when running with Ray.

Previously, PaimonDataSource and fallback read tasks could retain live
FileStoreTable, FileIO, StorageConfig, or TableRead objects. With
remote filesystems such as OSS/Jindo, Ray failed to serialize the execution
plan because those objects may contain non-picklable PyArrow filesystem state.

RuntimeError: Failed to serialize: OtherString("TypeError: no default __reduce__ due to non-trivial __cinit__")
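
The general pattern behind this fix can be sketched as follows. This is a minimal illustration, not the PR's actual code: `LiveHandle`, `SerializableSource`, and their fields are hypothetical names, and a `threading.Lock` stands in for the non-picklable PyArrow filesystem state. The source keeps only plain metadata and rebuilds the live object lazily after deserialization.

```python
import pickle
import threading


class LiveHandle:
    """Hypothetical stand-in for a live table or FileIO object."""

    def __init__(self, options: dict):
        self.options = options
        self._lock = threading.Lock()  # lock objects cannot be pickled


class SerializableSource:
    """Keeps only rebuildable metadata; the live handle is rebuilt on demand."""

    def __init__(self, options: dict, table_path: str):
        self.options = dict(options)
        self.table_path = table_path
        self._handle = None  # cached live object, never serialized

    def handle(self) -> LiveHandle:
        # Rebuild the live object lazily, e.g. on a remote worker.
        if self._handle is None:
            self._handle = LiveHandle(self.options)
        return self._handle

    def __getstate__(self):
        # Drop the live handle so only plain metadata is pickled.
        state = self.__dict__.copy()
        state["_handle"] = None
        return state


def is_picklable(obj) -> bool:
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False
```

With this shape, a source that has already opened its handle still pickles cleanly, and the copy on the receiving side reopens the handle from the stored options on first use.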

Tests

CI

YannByron · 2026-05-29T07:42:54Z

+    file_io = getattr(table, "file_io", None)
+    properties = getattr(file_io, "properties", None)
+    if properties is None:
+        properties = getattr(file_io, "catalog_options", None)


We can obtain the properties information through the properties attribute in every case, even for RESTTokenFileIO. However, to make the code more robust, we need to check whether the FileIO is a CachingFileIO, in which case the properties must be obtained from _delegate.
I further recommend unifying the abstract properties API for FileIO on the pypaimon side, because we currently have too many getattr calls.

Done. Added a properties property on CachingFileIO that delegates to its _delegate, so every FileIO implementation now exposes .properties uniformly. _extract_catalog_options reads table.file_io.properties directly, with no per-implementation getattr.
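
A minimal sketch of the delegation described above, under stated assumptions: `PlainFileIO` and `CachingFileIOSketch` are hypothetical stand-ins rather than the real pypaimon classes, and a plain dict stands in for whatever options type pypaimon actually uses.

```python
class PlainFileIO:
    """Hypothetical FileIO that stores its own properties."""

    def __init__(self, properties: dict):
        self.properties = properties


class CachingFileIOSketch:
    """Hypothetical wrapper that forwards properties to the wrapped FileIO."""

    def __init__(self, delegate):
        self._delegate = delegate

    @property
    def properties(self) -> dict:
        # Expose the delegate's properties so callers need no special case.
        return self._delegate.properties


def extract_catalog_options(file_io) -> dict:
    # Every FileIO exposes .properties, so no getattr fallback is needed.
    return dict(file_io.properties)
```

The caller treats wrapped and unwrapped FileIO objects the same way, which is what removes the per-implementation getattr branches.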

YannByron · 2026-05-29T07:52:52Z

+    if identifier is None:
+        return None
+
+    get_database_name = getattr(identifier, "get_database_name", None)


Maybe it's fine to call identifier.get_database_name and identifier.get_table_name directly, not through getattr.

Done, calling identifier.get_database_name() / get_table_name() / get_branch_name() directly now. This also fixes a latent issue: the old getattr fallback used identifier.object, which is the encoded object name and would round-trip incorrectly for branch tables.

YannByron · 2026-05-29T07:53:51Z

+
+
+def _extract_table_options(table: FileStoreTable) -> dict[str, Any]:
+    table_schema = getattr(table, "table_schema", None)


Maybe let's define a schema method in FileStoreTable.

Done. Added FileStoreTable.schema() returning the TableSchema, and _extract_table_options now uses table.schema().options.

YannByron · 2026-05-29T08:08:47Z

            )

-            if can_use_native_reader:
+            use_paimon_reader_task = (


Please provide detailed notes here regarding the scenarios in which the Daft Native Reader can be used, and those in which the Paimon Reader is necessary.

Added detailed comments above the reader-selection logic in get_tasks, describing when the Daft native Parquet reader is used and when the pypaimon reader task is required (non-Parquet, blob columns, LSM merge, or deletion vectors).
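
The four conditions listed in that reply can be sketched as a single predicate. This is an illustration only: `SplitInfo`, its field names, and `needs_paimon_reader` are hypothetical, and the real logic in get_tasks may derive these flags differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitInfo:
    """Hypothetical summary of what a split needs in order to be read."""

    is_parquet: bool
    has_blob_columns: bool
    needs_lsm_merge: bool
    has_deletion_vectors: bool


def needs_paimon_reader(split: SplitInfo) -> bool:
    # The pypaimon reader task is required whenever the native Parquet
    # reader cannot produce correct rows on its own.
    return (
        not split.is_parquet
        or split.has_blob_columns
        or split.needs_lsm_merge
        or split.has_deletion_vectors
    )
```

A plain append-only Parquet split with none of these flags set falls through to the native reader; any single flag is enough to require the pypaimon reader task.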

YannByron · 2026-05-29T08:17:37Z

        return not self._is_parquet or self._has_blob_columns or self._table.is_primary_key_table

+    def _requires_serializable_paimon_reader_task(self) -> bool:
+        if self._warehouse_scheme in ("", "file"):


In a Ray environment, when scanning a normal append-only Paimon table located on Aliyun OSS, we expect the Daft native reader to be used. But in this case _warehouse_scheme is oss, so this method returns True and use_paimon_reader_task becomes True as well. This does not meet our expectations. Please correct me if I'm wrong.

You're right, thanks. I removed the gate: a normal append-only Parquet table on OSS now goes through Daft's native reader under the Ray runner. This is safe because both the source and the pypaimon fallback task serialize only rebuildable metadata (catalog options, identifier, table path), and the native task carries Daft's own picklable StorageConfig. The pypaimon reader task is now used only for splits that genuinely need it (PK/LSM merge, non-Parquet, blob columns, deletion vectors). I verified end-to-end that community Daft reading an append-only Parquet table on OSS under the Ray runner uses the native reader and returns correct results.

YannByron · 2026-05-29T08:20:03Z

@kerwin-zk Thank you for working on this. I left some comments.

[python][daft] Make Daft Paimon read source serializable

4b7fece

YannByron reviewed May 29, 2026


kerwin-zk added 2 commits May 29, 2026 16:27

[python][daft] Simplify Paimon read source metadata extraction

dcc6409

[python][daft] Let the native reader handle remote splits under Ray

23575af

kerwin-zk wants to merge 3 commits into apache:master from kerwin-zk:fix-daft-read-source-serializable

2 participants


