[refactor](be) Simplify new parquet complex reader assembly#64078
Merged
suxiaogang223 merged 20 commits intoJun 4, 2026
Merged
Conversation
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: New parquet reader kept dead projected schema helpers and TableReader stored file block layout as full SchemaField trees. This made the file-reader schema boundary harder to reason about and duplicated complex projected type rebuild logic.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Local: git diff --check
- Local: verified no remaining _fill_projected_schema_field/_get_projected_schema_field/block_schema/FILE_NAME = 2 references under be/src/format and be/test/format
- Behavior changed: No
- Does this need documentation: Yes
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Clarify table reader comments after replacing block_schema with file_block_layout.
### Release note
None
### Check List (For Author)
- Test: Manual test
- git diff --check
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Remove the ShapeOnlyColumnReader wrapper from the new Parquet reader tree. Unprojected STRUCT children now keep their original reader and advance through child_output_indices=-1 skip semantics, while row position reader is split into its own component. Update the DuckDB comparison/refactor docs and internal layering status accordingly.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran build-support/clang-format.sh on changed BE files.
- Ran git diff --check and git diff --cached --check.
- Verified no ShapeOnlyColumnReader/shape_only_column_reader references remain in BE source/tests.
- Fedora DEBUG BE build from the previous branch revision completed successfully as cache warm-up; this commit itself has not yet been compiled on Fedora.
- Behavior changed: No
- Does this need documentation: Yes
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Keep Doris new Parquet MAP materialization centered on MapColumnReader and ColumnMap instead of blindly mapping it to LIST<STRUCT>. Update the refactor docs accordingly and start Phase 2 by extracting MapColumnReader child resolution and output finalization helpers without changing MAP read semantics.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran build-support/clang-format.sh on map_column_reader.cpp.
- Ran git diff --check.
- Behavior changed: No
- Does this need documentation: Yes
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Continue the Doris-aware MAP reader refactor by sharing the scalar and struct MAP value aligned read/skip logic through local helpers, while keeping MapColumnReader and ColumnMap materialization unchanged. Mark the first Phase 2 helper step complete in the refactor document.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran build-support/clang-format.sh on map_column_reader.cpp.
- Ran git diff --check.
- Behavior changed: No
- Does this need documentation: Yes
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Refactor the new parquet MAP reader so Map<K, List<V>> uses a dedicated list-value entry helper while keeping Doris ColumnMap materialization and value-list driven nested shape handling.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran git diff --check
- Behavior changed: No
- Does this need documentation: Yes
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Update the complex column refactor feasibility document to reflect the completed MAP entry/value stream refactor phase.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran git diff --check
- Behavior changed: No
- Does this need documentation: Yes
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Inline LIST Dremel level assembly in the new parquet ListColumnReader read and skip paths, removing its dependency on the generic repeated sink templates while preserving overflow behavior.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran git diff --check
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Remove the generic LIST repeated sink templates that became unused after inlining LIST level assembly in the new parquet ListColumnReader.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran git diff --check
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Remove values_written from NestedScalarBatch and keep the value count local to Arrow leaf batch construction and overflow compaction.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran git diff --check
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Localize MAP-only repeated level assembly and value stream helpers in the new parquet MapColumnReader after LIST Dremel assembly was inlined, reducing the public nested reader helper surface. Update the complex column refactor document with the current Phase 3 and Phase 4 status.
### Release note
None
### Check List (For Author)
- Test: Manual test
- git diff --check
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Avoid materializing NestedScalarBatch value_indices for dense nested leaf batches without repetition filtering, and keep overflow tail compaction in the same dense representation. Update the complex column refactor document with the Phase 4 status.
### Release note
None
### Check List (For Author)
- Test: Manual test
- git diff --check
- Behavior changed: No
- Does this need documentation: No
Contributor
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
…ng tasks Remove completed analysis documents (SchemaField merge, DuckDB comparison, projection/filtering analysis). Consolidate remaining work into a single document: NestedScalarBatch value_indices removal plan, test gaps, and known limitations. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Keep LIST Dremel traversal local to ListColumnReader while extracting the repeated stream mechanics into a LIST-local helper, add comments explaining LIST/MAP local traversal boundaries, and update the follow-up value cursor refactor plan.
### Release note
None
### Check List (For Author)
- Test: Manual test
- git diff --check
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve? Issue Number: close #xxx Related PR: apache#64078 Problem Summary: Update the new parquet reader layering document to match the current LIST/MAP local helper structure and remaining work. ### Release note None ### Check List (For Author) - Test: Manual test - git diff --check -- docs/doris-new-parquet-reader-internal-layering.md - Behavior changed: No - Does this need documentation: No
### What problem does this PR solve? Issue Number: close #xxx Related PR: apache#64078 Problem Summary: Record the planned Arrow page-level skip path for the new parquet reader layering document. ### Release note None ### Check List (For Author) - Test: Manual test - git diff --check -- docs/doris-new-parquet-reader-internal-layering.md - Behavior changed: No - Does this need documentation: No
Add four independent task documents: - page-level-skip-plan.md: Arrow PageReader::set_data_page_filter() approach - observability-profile-plan.md: Scheduler/reader/nested/adapter metrics - ut-coverage-plan.md: Nullable boundaries, filter interaction, edge cases - complex-column-predicate-and-stats-filtering.md: Nested field filter localization and statistics pruning for complex types Remove doris-new-parquet-reader-internal-layering.md — P3 (schema change) already handled by TableReader/ColumnMapper. Remaining tasks consolidated into the new documents. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
### What problem does this PR solve? Issue Number: close #xxx Related PR: apache#64078 Problem Summary: Replace NestedScalarBatch value_indices with per-stream NestedScalarValueCursor for LIST, MAP, and STRUCT nested scalar value consumption. ### Release note None ### Check List (For Author) - Test: Manual test - build-support/clang-format.sh be/src/format/new_parquet/reader/nested_column_reader.h be/src/format/new_parquet/reader/nested_column_reader.cpp be/src/format/new_parquet/reader/arrow_leaf_reader_adapter.cpp be/src/format/new_parquet/reader/list_column_reader.cpp be/src/format/new_parquet/reader/map_column_reader.cpp be/src/format/new_parquet/reader/struct_column_reader.cpp - rg -n 'value_indices|nested_scalar_value_index' be/src/format/new_parquet/reader - git diff --check - Behavior changed: No - Does this need documentation: No
### What problem does this PR solve? Issue Number: close #xxx Related PR: apache#64078 Problem Summary: Fix the MutableColumnPtr null check used by NestedScalarValueCursor so the new parquet reader compiles. ### Release note None ### Check List (For Author) - Test: Manual test - git diff --check - rg -n 'values_column != nullptr|value_indices|nested_scalar_value_index' be/src/format/new_parquet/reader - Behavior changed: No - Does this need documentation: No
### What problem does this PR solve? Issue Number: close #xxx Related PR: apache#64078 Problem Summary: Remove the completed complex column cursor refactor feasibility document after the planned value_indices cleanup is implemented. ### Release note None ### Check List (For Author) - Test: Manual test - git diff --check - rg -n 'value_indices|nested_scalar_value_index' be/src/format/new_parquet/reader docs - Behavior changed: No - Does this need documentation: No
d6487e8
into
apache:refact_reader_branch
18 of 20 checks passed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changed
Why
This keeps Doris new parquet complex column handling closer to the intended reader layering: LIST owns ColumnArray assembly, MAP owns ColumnMap assembly, and shared nested helpers only keep the state that multiple readers actually need.
Validation
Notes