Skip to content

GH-46374: [Python][Docs] Adds type checks for source in read_table#46330

Open
Mottl wants to merge 7 commits into
apache:mainfrom
Mottl:read_table_docstring_fix
Open

GH-46374: [Python][Docs] Adds type checks for source in read_table#46330
Mottl wants to merge 7 commits into
apache:mainfrom
Mottl:read_table_docstring_fix

Conversation

@Mottl
Copy link
Copy Markdown

@Mottl Mottl commented May 6, 2025

Rationale for this change

pyarrow.parquet.read_table actually supports source parameter as a list of strings

Are these changes tested?

No, an issue to test those has been opened as this code path is currently not being tested. See:

Are there any user-facing changes?

no, only docstring

@Mottl Mottl requested review from AlenkaF, raulcd and rok as code owners May 6, 2025 07:44
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 6, 2025

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@Mottl Mottl changed the title Fix docstring for pyarrow.parquet.read_table [Docs][Python] Fix docstring for read_table May 6, 2025
@Mottl Mottl changed the title [Docs][Python] Fix docstring for read_table MINOR: [Docs][Python] Fix docstring for read_table May 6, 2025
Copy link
Copy Markdown
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is tricky because if the dataset module is enabled, yes we can use a list of strings but if we've built pyarrow without dataset a list of strings won't work because instead of using ParquetDataset->source we will be using ParquetFile->source.
I don't think we want to mislead users thinking a list of strings can be used when it might not be the case. cc @AlenkaF

@rok
Copy link
Copy Markdown
Member

rok commented May 6, 2025

@raulcd I suppose we want to document this conditional behavior?

@raulcd
Copy link
Copy Markdown
Member

raulcd commented May 6, 2025

I suppose we want to document this conditional behavior?

That makes sense, then something in the lines:
If the dataset module is enabled a list of strings can also be used which correspond to a list of files.
or something in the lines? @rok ?

@AlenkaF
Copy link
Copy Markdown
Member

AlenkaF commented May 6, 2025

I agree that the conditional behavior is worth documenting.

If the dataset module is enabled a list of strings can also be used which correspond to a list of files.

This is great. I would just suggest changing "can also be used" to "can also be passed" to align more closely with the existing phrasing.

One more thing to note: this will require a test that would use the dataset mark. Because of that, the PR no longer fits into the MINOR category and will need a corresponding issue to track the change.

@rok
Copy link
Copy Markdown
Member

rok commented May 6, 2025

Agreed with the proposed text (as amended by Alenka).

For the dataset marked test - might as well be a separate PR?

@Mottl
Copy link
Copy Markdown
Author

Mottl commented May 6, 2025

Can we just concat_tables from multiple ParquetFiles in case ParquetDataset is unavailable? Reading many files at once is a really handy feature.

@AlenkaF
Copy link
Copy Markdown
Member

AlenkaF commented May 6, 2025

I don't think we want to go in that direction—concatenation would introduce complexity that's already handled in the Dataset C++ layer (schema mismatches is a good example). Instead, we could consider adding a warning and suggest the use of the dataset module when passing files as a list (or a dict) here:

# TODO test that source is not a directory or a list

@Mottl
Copy link
Copy Markdown
Author

Mottl commented May 6, 2025

Okay, close this PR as soon as you finish with the discussion. Thanks all!

@raulcd
Copy link
Copy Markdown
Member

raulcd commented May 6, 2025

Hi @Mottl thanks for your PR and contribution, I think the discussion is closed. Would you like to update this PR with docstring clarification based on the feedback and add a test as @AlenkaF commented? If you can't or don't have time we can add an issue to the issue tracker

@Mottl Mottl changed the title MINOR: [Docs][Python] Fix docstring for read_table WIP: [Docs][Python] Fix read_table May 6, 2025
@Mottl Mottl changed the title WIP: [Docs][Python] Fix read_table [Python][Docs] Adds type checks for source in read_table May 6, 2025
@Mottl
Copy link
Copy Markdown
Author

Mottl commented May 6, 2025

Have a look please

@Mottl
Copy link
Copy Markdown
Author

Mottl commented May 9, 2025

@raulcd

Comment thread python/pyarrow/parquet/core.py Outdated
Comment thread python/pyarrow/parquet/core.py
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
@github-actions github-actions Bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting review Awaiting review awaiting changes Awaiting changes labels May 9, 2025
@Mottl Mottl changed the title GH-46374: [Python][Docs] Adds type checks for source in read_table GH-46374: [Python][Docs] Adds type checks for source in read_table [WIP] May 26, 2026
@github-actions
Copy link
Copy Markdown

⚠️ GitHub issue #46374 has been automatically assigned in GitHub to PR creator.

@Mottl Mottl force-pushed the read_table_docstring_fix branch from be263e0 to a781890 Compare May 26, 2026 14:03
@Mottl Mottl changed the title GH-46374: [Python][Docs] Adds type checks for source in read_table [WIP] GH-46374: [Python][Docs] Adds type checks for source in read_table May 26, 2026
@github-actions
Copy link
Copy Markdown

⚠️ GitHub issue #46374 has been automatically assigned in GitHub to PR creator.

@Mottl
Copy link
Copy Markdown
Author

Mottl commented May 26, 2026

@Mottl I think this PR would be nice to have. Do you think you would have time to address the comment from @rok? Checking the strings also has it's shortcomings so I am interested in what do you think?

Sure! Sorry for being unresponsive. My reply

@rok
Copy link
Copy Markdown
Member

rok commented May 26, 2026

Thanks for the update @Mottl ! Could you also rebase or at least resolve the merge conflict in python/pyarrow/parquet/core.py?

raise ValueError(
"source should be a file name, a pyarrow.NativeFile or a file-like "
"object when the pyarrow.dataset module is not available"
)
Copy link
Copy Markdown
Member

@rok rok May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears we had merged a PR that covered this issue in part.

filesystem, path = _resolve_filesystem_and_path(source, filesystem)
if filesystem is not None:
if not filesystem.get_file_info(path).is_file:
raise ValueError(
"the 'source' argument should be "
"an existing parquet file and not a directory "
"when the pyarrow.dataset module is not available"
)

Can you see if your check can be consolidated with the existing one?

@github-actions github-actions Bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels May 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

awaiting changes Awaiting changes Component: Python Status: stale-warning Issues and PRs flagged as stale which are due to be closed if no indication otherwise

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants